Documentation for an unknown future

I found a great meme on Reddit that made me think of software documentation (probably because I think a lot about good documentation and communication).

It can be so hard to know the context that the audience will have when they read something, and I think it’s especially hard to know what can be “too obvious” to write (but might not be obvious to the reader). I think this can be especially challenging for super-strong engineers with deep domain expertise.

I always love a good cooking metaphor, and there was a comment that mentioned how we all do this today, maybe without realizing. When we look at a recipe for chocolate chip cookies, and it says to add an egg, we all know it means a chicken egg. We also know it doesn’t mean a duck egg, or an ostrich egg, or salmon roe.

This idea is core to communication, and I think it shows up a lot when writing technical documentation. It can be challenging to ensure that documentation is useful as a software feature matures. In as little as a few years, there can be subtle “aging” of the documentation, or the product, or the team, which makes the communication harder. This idea is especially highlighted by a project to communicate a message over 10,000 years like “we buried nuclear waste here, don’t dig it up”.

Why’d they do that?

TLDR: Documentation should be readily available. Plans should be written down, linked to tickets, and tickets linked to code via commit messages. This applies at all scales: global infrastructure, a single application, or a single package within a codebase.

“Why”. It’s the eternal question.

(Meme images: 80s Breakfast Club edition, 90s Friends edition, 2000s Eric Andre Show edition.)

Take a minute to consider the old story about the three bricklayers. The third bricklayer is the focus of the story, the one with the most well-developed sense of meaning. I think this story is popular because deep down, we all feel a need to search for meaning. We obviously search for facts and information…the “what” and “how”, and we must answer these before we can ask/understand the “why”.

Also consider the Five Whys. Again, the focus is on the “why”…there’s no such thing as “five whats” or “five whos”.

My point with all this is that question “why” is a special one. Having the answer is incredibly valuable, but it can also be an exceptionally hard answer to get. Chesterton’s Fence feels like a corollary to this idea.

This matters for coding, because engineers’ jobs mostly come down to “changing code”. We want the code to keep doing all the things it was doing, but we want one thing to change. Before an engineer can make any change to a piece of code, they need to understand two things:

  1. what is this code doing?
  2. why was it written this way?

Usually it’s easy to get an answer to #1, but it can be very difficult to get an answer to #2. It’s not always obvious why something was done a certain way. There should be documentation, but there often isn’t any, or it’s not easy to find.

I remember working on an older codebase where someone migrated from one Git repo to another, but instead of doing it in a way that preserved the commit history, they just copied the files into a new directory, and ran:

git add *; git commit -m "first commit"

Never do this. It completely removed the entire git history, and removed most of our ability to understand why things were done a certain way. We ended up moving slowly and breaking things. We also hated the code, and you know what they say about being considerate of the mental state of the people who maintain your code.

The Solution

…it is possible to look at a line of code and 60 seconds later, have access to the full history of that code, all the way to the business strategy document explaining why that line of code is valuable to the company.

If you follow a good process, it will be easy for your engineers to understand the system quickly, and they’ll get more work done, with higher quality. Here’s how it works for them:

  1. Read a line of code. Look at surrounding comments.
  2. Use “git blame” to view the commit message. From here the engineer can see notes from the person who wrote the code. Usually not a ton of information, but you can understand their thought process.
  3. The commit message is linked to a PR. This shows the engineer all of the other code changed at the same time, plus the notes from the review process, the PR description, and a link to the ticket.
  4. The ticket explains more information, and is linked to other tickets with more information. The ticket also establishes timeframes, and you can search for other tickets worked on around the same time.
  5. One of the tickets (usually some kind of parent/feature ticket) will have a link to a planning document with even more information.

If your team follows the right process, it is possible to look at a line of code and 60 seconds later, have access to the full history of that code, all the way to the business strategy document explaining why that line of code is valuable to the company.

In my experience, the hardest part of this process is cultivating the habit. The actual effort of linking things takes only a few seconds. Some people like to use linters as a safeguard, and that’s useful in a large-scale organization, but I don’t think there’s a substitute for understanding the reason for this process. Put another way, it’s important to understand the “why” of the process. šŸ˜‰

(Side note: Tim Berners-Lee realized the power of the hyperlink, and this is what created the modern web, Google’s PageRank, and so much more. Linking relevant information is a game-changer.)

Special consideration: infrastructure

Unfortunately not everything is code, and not everything is committed to Git. I mostly see this with configuration-instead-of-code systems, often from third-party infrastructure vendors, but sometimes internal tools as well.

Sometimes you will deal with a configuration file where you can’t include comments for context, or you have to deal with a file that can’t reasonably be version controlled so you don’t get a link to a ticket. Even worse, you might have a UI-only interface. It’s possible to automate these components, but that doesn’t always happen. Eventually you end up with 500 entries and no clue why they exist, or if they’re safe to edit/remove.

The solution I’ve found is that you usually get some type of text field, such as the name field on a firewall rule. Name things like “inbound-for-database-TIX1234”, where TIX1234 is the ID of the ticket for the work, so someone can find more information.

Special consideration: hacky solutions

A long time ago, a very good engineer told me “it’s ok if there’s a mess, just document the mess”. I’ve never forgotten this advice. It’s acceptable to do hacky things under certain circumstances. I always get a laugh that a CPU is the ultimate hack: it is basically a rock (silicon) getting electrocuted.

If you have to do something unusual or hacky in the code, you should shine a spotlight on the hack. Put a big comment explaining why you’re doing something weird. This shows you know that you’re doing something unusual, and provides the context for why that unusual behavior might need to be preserved.

Also, if possible, keep the weird hack abstracted. If it’s a piece of code, put it in a library so it will be in a single place. It’s easier to remember and clean up one mess, it’s much harder if there are 100 messes.
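
Here’s a minimal sketch of what that can look like in practice (the scenario, function name, and ticket IDs are hypothetical): the hack gets a loud comment explaining the “why”, and it lives in exactly one helper.

// HACK: the upstream payments API intermittently returns amounts as strings
// instead of numbers. Vendor ticket VEND-482 (hypothetical) tracks their fix.
// Keep this workaround in one place so there is exactly one mess to clean up,
// and delete it once the vendor confirms the fix is deployed.
export function normalizeAmount(raw: string | number): number {
  return typeof raw === "string" ? Number.parseFloat(raw) : raw;
}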

How to design a GraphQL API

This post talks about different aspects of creating a GraphQL API:

  • Schema design
  • Technical elements
  • SDLC

Schema design

GraphQL represents an abstraction which maps user queries onto code in the backend application, and the code pulls data from a source of truth. This mental model is important for designing a schema, because the abstraction should focus on the user and their use cases.

In contrast to a user-oriented schema, the GraphQL schema should not be autogenerated to mirror your database schema. This would create a deep coupling of your customer’s code to your data store. This coupling constrains future migrations/refactors, and increases the burden on the customer to understand how to correctly mutate a data model to accomplish a task.

A rough process for designing a schema:

  1. Write down the use cases
  2. List the objects and fields required to meet the use cases
  3. List the queries for reads, and include details of relationships between objects, both the edges of the graph, and any additional metadata belonging to the edge
  4. List the mutations needed for writes. Think of mutations as user behaviors and actions, not simply changes to data fields
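
As a sketch of what the output of this process might look like, here is a tiny schema for a hypothetical “users post comments” use case (all type, field, and mutation names here are illustrative). Note that the mutation is named after a user action, not a raw change to a data field.

const typeDefs = /* GraphQL */ `
  type User {
    id: ID!
    displayName: String!
  }

  type Comment {
    id: ID!
    body: String!
    author: User!
  }

  type Query {
    # Use case: "show a user's profile"
    user(id: ID!): User
  }

  type Mutation {
    # Modeled as a user behavior, not "update a row in the comments table"
    postComment(authorId: ID!, body: String!): Comment!
  }
`;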

There are a few technical concerns that are aligned with the data model design:

  • Don’t explicitly version your API. Clients can choose the data they receive, and new objects/queries/mutations can be created to provide seamless migrations. Explicit versioning should not be needed.
  • Use UUIDs. This can help with caching, and is generally a good practice instead of integer IDs
  • Use nullable fields thoughtfully. Prefer non-null fields unless a field is unavailable for historical data reasons, or if a field comes from a separate system that could be unavailable

Design your schema in a way that makes it more evolvable:

  • Group data with nested types instead of using prefixes or simple arrays
  • Be specific/verbose with naming to avoid future conflicts
  • Use GraphQL built-in types correctly, and prefer built-in types over custom types
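
As a quick illustration of the first point (field names hypothetical), grouping related data into a nested type leaves more room to evolve than a set of prefixed scalar fields like billingAddressStreet and billingAddressCity:

const typeDefs = /* GraphQL */ `
  type Address {
    street: String!
    city: String!
  }

  type Customer {
    id: ID!
    # Nested type instead of billingAddressStreet / billingAddressCity prefixes
    billingAddress: Address
  }
`;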

Technical elements

This section covers technical aspects of the API itself.

Pagination

Pagination is important for both technical and user experience reasons. Think of a use case to “show all the comments from a user” where a user has 10,000 comments.

Pagination allows you to read “pages” of records, borrowing from the concept of memory pages in computer architecture. You can set your page size to 100 records, and load additional pages as a client consumes more user comments.

Without paginating, you will have to load literally all of the user’s comments, even if you only need the most recent ones. This results in a slow UI, increased bandwidth costs, and increased load on application and database servers.

Pagination can be implemented in many ways, but the GraphQL community generally accepts the Relay Connections spec as the method of choice.

An added benefit of pagination is that it moves closer to the “thinking in graphs” model of GraphQL. GraphQL types represent the nodes of a graph, and nodes are connected by edges. Pagination works by storing metadata on the edge — treating the edge as a first class citizen. You can add your own metadata on the edges about the relationship between the two nodes. For example, if a user is connected to another user, the edge could contain a timestamp for when the connection was created. There are several good resources for understanding how to think about connections [1, 2].
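
To make the edge-as-first-class-citizen idea concrete, here is a minimal Relay-style connection sketch for the “user connected to user” example, with the connection timestamp stored on the edge (the type and field names beyond the Relay conventions are hypothetical):

const typeDefs = /* GraphQL */ `
  type PageInfo {
    hasNextPage: Boolean!
    hasPreviousPage: Boolean!
    startCursor: String
    endCursor: String
  }

  type FriendEdge {
    node: User!
    cursor: String!
    # Metadata about the relationship lives on the edge itself
    connectedAt: String!
  }

  type FriendConnection {
    edges: [FriendEdge!]!
    pageInfo: PageInfo!
  }

  type User {
    id: ID!
    # Relay-style pagination arguments
    friends(first: Int, after: String): FriendConnection!
  }
`;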

Caching

There are two types of caching in GraphQL: client-side and server-side. Client-side caching is typically handled by the GraphQL client library’s normalized cache. On the server side, DataLoader performs per-request batching and caching of data fetches, and resolvers can also use any normal backend caching strategy because resolvers are just normal application code.
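
As a sketch of the server-side piece, here is how per-request batching and caching with the DataLoader library might look (the db.users.findByIds helper is a hypothetical stand-in for your own data access code):

import DataLoader from "dataloader";

// Hypothetical data access helper.
declare const db: {
  users: { findByIds(ids: string[]): Promise<{ id: string; name: string }[]> };
};

// Create one loader per request so cached results don't leak between callers.
export function createUserLoader() {
  return new DataLoader<string, { id: string; name: string } | null>(async (ids) => {
    // One batched query instead of N individual queries.
    const users = await db.users.findByIds([...ids]);
    // DataLoader requires results in the same order as the requested keys.
    return ids.map((id) => users.find((u) => u.id === id) ?? null);
  });
}

// In a resolver: repeated loads of the same id within one request hit the cache.
// const author = await context.userLoader.load(comment.authorId);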

There is one consideration if you’re coming from REST to GraphQL. REST generally gives each resource its own endpoint: /users, /users/comments, /users/posts. GraphQL uses a single endpoint, /graphql, and queries are sent in the POST body (or as query-string parameters for GET). This means you generally can’t rely on URL-based caching, like a CDN caching responses per endpoint.

If you need to deal with very large responses from your GraphQL API, there are some other strategies.

Security

GraphQL Armor is worth noting. I haven’t looked into this as deeply as I’d like, but security and rate limiting are overlapping concerns.

AuthN/Z

GraphQL does not care how you do client authentication. You should have a generally well-architected service where requests pass through an authN component first, so GraphQL resolvers “know” who the caller is by the time they run.

Authorization logic should live in the resolvers to control access to specific records based on the caller. If a caller is a regular user, they might only see records tied to their user, but an admin user might be able to list all records.

AuthZ logic can be repetitive, and should be written in a well-factored way and not duplicated across resolvers. This is a generally good practice for maintainability reasons, but it is also valuable to have all your authZ rules in one place. Any company that manages access to sensitive data must consider security to be part of their “brand promise” to customers, and engineering can contribute by keeping all authZ logic in one place to provide effective controls and auditability.
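
A minimal sketch of keeping that logic in one well-factored place (the helper, context shape, and loadOrder function are hypothetical, not from any specific library):

// Hypothetical shared authZ rule: one place to express, test, and audit it.
interface Caller { id: string; role: "user" | "admin" }

function canViewOrder(caller: Caller, order: { ownerId: string }): boolean {
  return caller.role === "admin" || caller.id === order.ownerId;
}

declare function loadOrder(id: string): Promise<{ id: string; ownerId: string }>;

const resolvers = {
  Query: {
    async order(_parent: unknown, args: { id: string }, context: { caller: Caller }) {
      const order = await loadOrder(args.id);
      // Every resolver calls the shared rule instead of re-implementing it.
      if (!canViewOrder(context.caller, order)) {
        throw new Error("Not authorized");
      }
      return order;
    },
  },
};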

Transport layer

The GraphQL community strongly recommends HTTP as the standard transport layer. Subscriptions are the exception to this rule, but are also rarely used.

Be sure to handle the usual HTTP and TCP concerns. Make sure your server has reasonable timeouts for long-running queries. Make sure your client handles retries, rate limits, HTTP error codes, HTTP headers, TCP connection errors, DNS, etc.

As I write this in 2023, the official GraphQL docs for serving over HTTP are generally the most authoritative resource, but there is a working group that has a draft spec.

Subscriptions

Subscriptions are GraphQL’s solution for servers to push updates to clients, usually via Websockets. The GraphQL community generally recommends avoiding subscriptions unless there’s a use case for incremental updates to large objects or for low-latency updates; a chat client is the canonical real-time example. Polling/pulling should be used for most cases.

Resolver design

In GraphQL, Resolvers represent the place where the query gets coupled to the underlying data store. The resolvers you need are a result of the schema you choose, but there are some things worth considering.

It’s important to remember a resolver is just a function containing normal application code that you write. GraphQL helps take a query and decide which resolver to call, but once the resolver is running, it’s just regular code. The resolver function takes the arguments from the query per the GraphQL schema, and returns an object matching the data requested.

You should generally think of a resolver as a public API, the same way that you might write a public method in object oriented code: defensive programming, observability, etc. A suggested pattern is:

  1. Validate and parse arguments
  2. Check user authZ
  3. Preflight checks, usually just checking a server-side cache
  4. Data access: API call, DB call, load from disk, etc
  5. Transform the data to match the output schema
  6. Postflight logic: metrics, logging, storing in a cache, etc
  7. Return the data
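
Here is the same pattern compressed into a single resolver sketch; every helper (cache, authZ check, data fetch, metrics) is a hypothetical stand-in for whatever your application actually uses:

interface Caller { id: string; role: string }
interface DbCommentRow { id: string; body_text: string; created_at: Date }
interface Comment { id: string; body: string; createdAt: string }

// Hypothetical application utilities.
declare const cache: {
  get(key: string): Promise<Comment[] | null>;
  set(key: string, value: Comment[]): Promise<void>;
};
declare function assertCanListComments(caller: Caller, userId: string): void;
declare function fetchCommentsFromDb(userId: string, limit: number): Promise<DbCommentRow[]>;
declare function recordMetric(name: string, value: number): void;

async function commentsResolver(
  _parent: unknown,
  args: { userId: string; limit?: number },
  context: { caller: Caller },
): Promise<Comment[]> {
  // 1. Validate and parse arguments
  if (!args.userId) throw new Error("userId is required");
  const limit = Math.min(args.limit ?? 50, 100);

  // 2. Check user authZ
  assertCanListComments(context.caller, args.userId);

  // 3. Preflight: check a server-side cache
  const cacheKey = `comments:${args.userId}:${limit}`;
  const cached = await cache.get(cacheKey);
  if (cached) return cached;

  // 4. Data access
  const rows = await fetchCommentsFromDb(args.userId, limit);

  // 5. Transform the data to match the output schema
  const comments = rows.map((r) => ({
    id: r.id,
    body: r.body_text,
    createdAt: r.created_at.toISOString(),
  }));

  // 6. Postflight: metrics, logging, cache write
  recordMetric("comments_resolver.count", comments.length);
  await cache.set(cacheKey, comments);

  // 7. Return the data
  return comments;
}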

Since you’re probably accessing a database, you should think about all your normal database best practices, including avoiding N+1 queries.

Defining GraphQL operations

GraphQL supports three different types of operation: query, mutation, and subscription. Use the query/mutation semantics and only read during a query and write during a mutation. Technically speaking, nothing stops you from modifying data in a query resolver, but you should never do this.

You should always name your operations, to help with observability.
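
For example (operation and field names hypothetical), a named operation shows up with its name in logs, traces, and error reports, where an anonymous one does not:

// Anonymous: appears in tooling as an unnamed operation.
// { user(id: "123") { displayName } }

// Named: "GetUserProfile" travels with logs, traces, and error reports.
const GET_USER_PROFILE = /* GraphQL */ `
  query GetUserProfile($id: ID!) {
    user(id: $id) {
      displayName
    }
  }
`;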

SDLC

There are some additional items that should be considered as part of general good technical hygiene.

Observability is especially important for a GraphQL API because your customers are being given access to a very flexible API. This means customers can write complex queries with varying resource usage. You should understand the nature of your users’ requests and the resources they consume. Observability also helps you understand every piece of data being used, which can make it easier to deprecate/remove functionality.

Testing is another important aspect. You should write unit tests for any data processing/transformation logic in your resolvers. You should also write high-level tests that assert your core use cases work as expected. Integrate all tests to run during continuous integration.
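
As a minimal sketch of the unit-test part (jest-style; the transformation function is a hypothetical example of logic pulled out of a resolver so it’s easy to test):

// Hypothetical transformation extracted from a resolver.
function toComment(row: { id: string; body_text: string; created_at: Date }) {
  return { id: row.id, body: row.body_text, createdAt: row.created_at.toISOString() };
}

test("maps a database row to the schema shape", () => {
  const row = { id: "c1", body_text: "hello", created_at: new Date("2023-01-01T00:00:00Z") };
  expect(toComment(row)).toEqual({
    id: "c1",
    body: "hello",
    createdAt: "2023-01-01T00:00:00.000Z",
  });
});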

Next steps

The GraphQL community has consolidated around https://principledgraphql.com/, which is worth a deep read.

Interfaces, abstractions, and developer productivity

I point out in the intro to Technical Debt that development has changed over the years. A core theme of the change is that computers used to be expensive and humans were cheap, and then this relationship reversed where now humans are expensive, and computers are cheap. (I’ll lovingly ignore the easy jokes about sky-high cloud vendor bills)

There are many reasons for change, but I’m going to zoom in on one specific aspect: abstractions. Abstractions (should) hide the details of complex interfaces, and take advantage of cheaper hardware to help engineers produce more business value.

The effects of this highlight a general principle: being able to do more with less code is good, because less code = less time and less code = fewer bugs. These equivalences aren’t universally true, but they’re mostly true.

Abstractions also help manage complexity, because instead of having to think about a million different machine instructions, you can think of a high-level capability like “get a user’s name from a database”.

You can visualize these abstractions as a hierarchy with machine code as the lowest level, followed by assembly language, followed by languages like C, and then higher-level languages like Java, Ruby, Python, and then followed by frameworks, libraries, and other DSL (domain-specific languages) equivalents.

As a side note, I think the existence and benefits of abstraction are universal. You could consider “specialization of labor” as a form of abstraction. Companies use specialization of labor when they hire employees, or hire vendors. There’s an equivalent “specialization” in biology, where high-level organisms like humans rely on microbes for life. The concept of composing complex things out of simple things is universal, but we’ve seen especially rapid change in software because it is easy to abstract ideas, and programming is (essentially) about managing ideas/information, and ideas can change faster than the physical world.

Abstractable Infrastructure

We’ve seen and demonstrated the practicality and value of abstraction with software, because creating abstractions for ideas is easy to do. As software eats the world, we see the same patterns when it comes to cloud infrastructure — infrastructure as code (IaC). It used to be mandatory that you dealt with low-level details for managing infrastructure like racking servers and installing software. Now we have interfaces and abstractions that help us accomplish the same goals with a few minutes of typing commands.

This is why I’m massively excited about Crossplane. I’ll ignore the smart way that it’s implemented on top of Kubernetes (so we get the benefits of gitops, standard resource definitions, reconciliation loops, and scalability), and just focus on it as a component of the developer platform.

As long as we’ve been able to write code to configure infrastructure, we’ve been writing abstractions on top of infrastructure, usually in the form of semi-structured bash scripts. So, the developer platform layer of abstraction on top of IaC has existed for a while.

The classic example of a capability of a developer platform is installing a database. Normally you’d have to think about the low-level steps: provision a server, install the software, and apply the configuration for your use case. It is better to have a “create database” capability that sets things up the way your team needs it done.

The developer platform layer is important. Even if we ignore the time savings, there is a complexity aspect: it’s not practical to expect a product engineer to know the details of infrastructure management. It’s far better to reduce complexity by giving a product engineer an abstraction for a database instead of making them understand all the underlying details. By removing that burden from product teams, they can deliver business value faster, with higher quality.

But Crossplane isn’t just an abstraction. Crossplane is a tool to create new abstractions. This gives us a standard to easily create, understand, and maintain the developer platform abstractions, in contrast to the “semi-structured” nature of bash or similar automation where a new engineer needs to understand a unique implementation every time.

Crossplane brings ease of use and manageability to the developer platform layer.

Final thoughts

Nothing that I write here should imply that I don’t believe in the idea of full stack engineering, because I do. It’s important for engineers to understand the full stack, much as any domain expert should understand related domains. But there’s a difference between being able to understand a component and being able to produce that component. Producing something requires a deep understanding of the nuances of that domain, and it is impractical to expect people to stay current on the details of more than a certain number of domains.

This does highlight one of the subtler value propositions of Crossplane. Because it is a standard way to implement abstractions, it becomes easier for engineers at other levels of the stack to read the code and understand how their developer platform works.

More broadly, as I write this in 2023, everyone is speculating on how generative AI is going to change software development. Although some core dynamics are going to change in our industry, even as AI gets better at producing code, AI will need human oversight for the foreseeable future. Good interfaces and abstractions will continue to be required so that humans can partner with the AI systems and review their code.

On Writing…an RFC

There’s a book by Stephen King, called On Writing. I have a lot of respect for Stephen King as a creator. This makes me love the book because it is a great explanation of the process of writing, and his process has obviously produced results for him. It is also a book about how to produce more books…and I love recursion. So in that spirit, I’m going to write about writing…a Request For Change (RFC).

While I use the term RFC, I don’t mean IETF RFCs; I use it to describe a planning doc for any kind of decision that needs to be made among a group of reasonable and logical people. RFCs are useful for communication, documentation, persuasion, and alignment.

The point of an RFC is to explain why a change is needed, what change is being proposed, and why the proposed change is the right one.

An RFC is meant to explain a change you want to make. That change could be essentially anything: a new coding standard, replacing a database, or refactoring a component. Usually RFCs are for changes that will impact multiple components or multiple teams.

The point of doing this is to get enough information in writing so that some third party (executive, architect, newly hired engineer) could read the document and understand why a change is needed, what change is being proposed, and why the proposed change is the right one.

The RFC Lifecycle

In general, once you have an idea, you should write down a plan, and then implement it. Writing an RFC is just one part of the planning phase, which can also include prototyping and writing tickets.

Be mindful that this lifecycle isn’t always totally linear. Sometimes planning and implementing can blend together. Avoid this when possible, but if it’s impractical, you should at least write RFC drafts and document your work during the implementation.

In general, your process should feel like it makes sense:

  1. Write a first draft
  2. Share the draft with 1-2 close stakeholders to make sure you’re “directionally correct”
  3. Share the draft with other key stakeholders
  4. Share the RFC publicly and get it ratified
  5. Use the RFC to write tickets or any followup actions

The structure of an RFC

You should usually include all of these headings. If (for example) there are no “Key Dates”, it’s generally good to be explicit and write “There are no key dates” instead of removing a heading.

  • Title, RFC number, version number, authors, date
  • Goal
  • Stakeholders
  • Glossary
  • Background
  • Key Dates
  • Requirements/Assumptions
  • Recommendation
  • Alternatives
  • Appendix

Some general comments:

  • Be terse and write in a plain, fact-based style. You want people to read your RFC, and you should value your audience’s time.
  • Use diagrams
  • Cite your sources. Add links to dashboards, lines of code, tickets, etc.
  • Use consistent fonts, headings, etc. Having a polished document matters.
  • Assuming your RFC is on a wiki, encourage people to ask their questions in writing as comments.
  • RFCs don’t always have tickets associated with them, but they usually do. The RFC is not meant to be a place to store the full implementation plans, but you should have reciprocal links between the RFC and the parent ticket.

Notes on individual headings

I have additional notes on each of the headings below. But above all, keep in mind the whole point is just to write things down in a way that makes sense. If you have an idea that makes your document clearer for your audience, go ahead and make that change.

Title, RFC number, version number, authors, date

This is a general heading section for metadata about the RFC.

My one comment on this section is that it is important to have uniquely numbered RFCs. Go with something simple like RFC-1234 or CRFC-1234 (“CRFC” meaning “Company RFC” to distinguish yourself from IETF RFCs). Having a terse identifier is a subtlety that pays off over time, because you can write it down in many places.

Normally code is linked to a commit, which is linked to a ticket, which is linked to a parent ticket, which is linked to an RFC. When a team is consistently creating these links, it becomes possible to read any line of code and be able to get full context within 5 clicks. But sometimes you have non-code resources that don’t fit this model.

Let’s take the example of a firewall rule that isn’t managed via code. These types of resources usually get manually created because of an email, then forgotten. If there is a comment field, or even just a name field, you can add a string like “rfc1234”, and this serves as a pointer or breadcrumb for someone to find more information. Over time, this tiny habit will increase velocity and decrease errors because these non-code resources are usually difficult to manage.

Goal

This can be a tricky section, but writing it well is a great skill to develop. If you’re a senior engineer looking to go to staff or management, you need this skill.

Since the Goal section is one of the first sections in the RFC, you know it’s going to be read a bit more than others, so you should make sure it is well-written. In this context, “well-written” means two things:

  • First, it must be very clear. Think of an “elevator pitch”. If you explain the goal to someone and they don’t immediately understand the goal, you might need to make an edit. You’ll know you hit the sweet spot when a close stakeholder (an engineer working directly on the project) reads the goal and responds “of course, that’s obvious”, and when a more distant stakeholder (engineer on another team; product manager; business exec) reads it and says “this makes sense”. I mention this because you, as the author, will almost always feel like the goal is “too obvious”, because you have so much context around the project, but you need to explain it for people who have less context than you.
  • The other key element of “well-written” is that it should be terse. The goal is often just a few sentences. Resist the urge to put too much detail. As the author, you’ll have a huge amount of context, but you must avoid overwhelming people who are just looking to understand the big picture before getting into details.

My suggestion is to try and use the following format: “We need to A, because B, so that we can C, which is in line with D” where D is some kind of obvious company goal like “provide sufficient uptime to our customers”. As an example, “we need to ensure our order API pods have enough resources to stay healthy. The impact of this will be a reduction in downtime, so that our customers can keep placing orders.”

This section often benefits from having a high-level diagram, but if the diagram is more than ~5 boxes, it might be too much detail. Again, if it’s “too obvious” to you as the author, it’s probably the right level of detail for someone who is less familiar with the context.

One optional element in this section is explaining what business problems the RFC is not intending to solve. This element may or may not be required, but during early draft reviews, you should pay attention to any comments like “oh, you’re doing X? is that going to fix Y?” where Y is some “logically adjacent” business problem. The bigger the scope of your project, the more room there is for misunderstanding. It’s a bit of a self-inflicted mistake if someone thinks that you’re solving a problem that you’re not. Either you’ll look bad, or someone else might reprioritize their work based on a misunderstanding.

Stakeholders

Almost any meaningful project has multiple people who are involved with it. Make a bulleted list of their names. It helps share credit and ensures that you have a handy contact list for the project. This also makes it easy to be sure the right people approve the change.

Background

Explain why this change is needed. Your goal is usually to answer the implicit question “why now?”. If you can articulate why this RFC is relevant now, and what has led up to it becoming relevant, your readers will understand the technical context and the urgency.

Requirements/Assumptions

This is a section where you can start going into detail. For many RFCs, you can just have a bulleted list of sentences, but for a large RFC, you should have a table of IDs, summaries, and detailed descriptions.

I strongly suggest that if you have a large number of requirements, you should manually number them like R01, R02, etc. Don’t rely on rows in a spreadsheet or an auto-incrementing bulleted list where the IDs can change, because over time you’ll inevitably need to add/delete requirements, and you don’t want “Requirement #3” to point to different things at different times.

Recommendation

Most of the RFC should be written in a fact-based style, but the Recommendation section is where your opinions should be (supported by the facts).

Alternatives

I consider this one of the most important sections. If this is well-written, I know the rest of the document can probably be trusted.

In general the purpose of this section is to show you did your homework and didn’t just pick the first idea that came to mind. Sometimes there’s truly only one option and the RFC is required to document it. If that’s the case, write “There are no alternatives”. Usually there are some other options, and you should include enough detail to explain what the option is and why it is not a fit. This can sometimes be accomplished in a few sentences.

If you have to make a tough choice between multiple good options, this section may need to be longer to explain the nuance of why you’re making a certain recommendation. You should support this documentation with in-person conversations, prototypes, and other approaches.

Although I consider this an important section to show your thoroughness, you don’t need to go full ad absurdum. If there’s a possible solution that you expect a reader to think of, but it has an obvious shortcoming, you can write a sentence or two to address it. But as a contrasting example, if you’re considering a new database, you shouldn’t feel the need to do an analysis of every database on the market. Even for a very complex project, you’d probably have 3-5 options that you deeply consider, and as long as you document how you filtered down to those 3-5 options, that is sufficient for most situations.


If you want to go deeper on RFCs, check out the Pragmatic Programmer’s writeup on the subject which includes other templates that you can use for inspiration.

Draw a picture

I have a core memory of a meeting I was once in. We were thinking about the design of a system, and I happened to have made a simple diagram before the meeting for another reason. It was a really simple diagram — literally three boxes in a row.

A few minutes into the meeting, it seemed like there was some misunderstanding about the system. I realized that the diagram might be useful, and I put it on screen. By the end of the meeting every person had ended up commenting about what a helpful diagram it was. Except, it wasn’t a compliment like “what an amazing diagram!”. It was more like we were laughing to ourselves: “I can’t believe I’m referring back to this simple three box diagram again, but it’s really helping”.

I’m not telling this story because I’m super cool for having made a diagram. The story stuck with me because I was surprised that such a simple diagram ended up being legitimately useful to a group of smart people. It felt like putting a penny in a machine and getting $100 out.

The lesson is that visual communication is important. Half the human brain is involved in processing visual images. A very simple diagram can communicate a lot of information because our brains are so good at processing images. This is why we have the common expression “a picture is worth a thousand words” (although as an engineer I appreciate the position that the correct ratio is “one picture, 1.77 million words”, assuming 256 values per pixel and so on…).

More evidence that visual communication is important and special is the example of cave paintings. Researchers consider cave paintings to be a form of educational technology that goes back at least 40,000 years. Drawing a diagram is one of the oldest communication tools that we still use today. Sometimes when I’m using a whiteboard to explain or brainstorm, I get a laugh that we’re still here, drawing on walls.

When it comes to software engineering, diagrams seem even more essential. We aren’t working on a physical machine where we can point at the parts. It’s possible to read code and develop a mental model of the system, but that takes days and weeks. If we create a representation of the code with a diagram, it can take minutes to develop that mental model. The added information bandwidth is like comparing dialup modem vs gigabit.

A good culture of diagrams is good for business. It makes meetings more efficient, avoids misunderstanding and conflict, insures against team members leaving, makes onboarding new team members easier, and helps scale communication for large (100-1,000 person) teams.

One of the other reasons I think it’s important to feel comfortable producing diagrams is that doing so is a powerful learning tool. The act of creating the diagram causes you to process the information more deeply, and makes it easier to remember, because making diagrams engages the visual and spatial parts of your brain. This effect is so strong that it is the basis for memory techniques used to memorize large chunks of information, such as sequences of over a thousand unrelated numbers.

Do you have any tips for making good diagrams?

Yes.

If you aren’t routinely drawing diagrams when planning, just start doing it. That’s my #1 tip. Making diagrams is easy, but developing the habit might be harder. If you don’t feel fluent with diagramming software, just draw on a whiteboard or piece of paper and take a picture of it. The tool itself doesn’t matter very much, and it’s frankly ok to have low-fidelity diagrams. Remember our industry was built by drawing on bar napkins. Something is better than nothing, and you’ll constantly get better.

If you are making a diagram that’s going to be reused heavily, presented to people outside your immediate team, or has more than ~20 boxes, you should add extra formatting and other standard advice of “how to make a diagram” that you’ll find on Google. Use icons, colors, and headings which all improve clarity, and they also make your diagram look polished. In contrast, if it is a throwaway diagram for a single meeting, it’s ok to skip these things.

Most importantly, I’ve noticed that the number one blocker is “what do I put in the diagram” because the answer is “it depends”. If you’re talking about software components, should you include a firewall in the diagram? Do you need to mention the data center location or not? Do you need to specify a generic “storage” component, or do you need to be specific about the storage provider and configurations? It’s really easy to get stuck at this point…

The best answer I’ve found comes from Simon Brown who created the C4 Model. The central idea is that it’s impractical to create one diagram that can fully represent everything, in the same way that it’s not practical to make a single map that can represent a place. He compares it to Google Maps, where more detail starts to appear as you zoom farther in.

Simon’s logical conclusion is to create more than one diagram. Start with a high-level diagram that’s almost too obvious, and then pick the component you’re talking about and make another diagram that is “zoomed in” to that component, and then continue zooming in as many times as you need.

For example, I can tell you we need to talk about “updateComputedAttributes”, but that may or may not be meaningful. In contrast, a zoomable diagram makes it trivial for anyone to understand useful context at each level:

You might get a laugh that this is not a diagram representing a real system, but instead it is a diagram meant to represent a diagram of a real system.

You can already see that updateComputedAttributes is probably some kind of function that occurs during a create or update action while persisting to the DB.

The extra context (both vertical and horizontal) is an improvement in usability/readability for technical and non-technical stakeholders. Especially when you have to work across multiple teams who don’t know each other’s domains closely, a diagram like this can cut hours off complex meetings and prevent costly misunderstandings.

I think there is also a subtler process benefit. The C4 model represents a straightforward and logical process for producing diagrams. It’s easy to feel stuck if you have to make a complex diagram from scratch, but the idea of a zoomable diagram that starts at a high level makes it feel easy to start. Easy processes get used and followed more often than hard ones, which means having a simple process to make diagrams will result in better communication.

Is it working?

There’s an old story about Steve Jobs holding a meeting for a product that was filled with bugs and wasn’t working for the users. He asked “what is this product supposed to do?” and when someone responded by telling him about the features, he said “then why isn’t it doing that?”

I’ve always appreciated this story because it highlights a simple and fundamental question — “is it working?”

Do we care if our software is working?

It almost seems like a trick question to ask if we care if our software is working, because “yes it is working” is so obviously a goal. But, I’m going to suggest that we don’t always act on this.

The image below is a sample image that I made (to avoid stealing someone else’s), and the content is a representation of a very standard SDLC.


There are a few issues I have with these classic SDLCs.

  1. The first is that it looks like a line, so it’s easy to interpret as a straight-through process where you start at the beginning and finish at the end. Some people try and work around this by making the SDLC look like a circle to emphasize the “cycle” part of “software development lifecycle”, which is an improvement, but can still be misleading if someone hasn’t deeply internalized iterative development.
  2. “Maintenance” is the same size arrow as the rest. This is helpful to make things look nice for a diagram, but is misleading. A quick search shows a variety of claims about the exact percentages, but when considering the overall project cost and timelines, Maintenance is routinely estimated at over 50% of the total cost. Also, with long-lived SaaS products, the longer you expect to use software, the bigger the Maintenance category gets. If you really believe in iterative development, you might almost remove Maintenance entirely, and just point the arrow back at Planning.
  3. “Development” and “Testing” are shown as separate phases. Proper testing is fundamentally inseparable from Development. I believe that at the micro level, things like unit testing are deeply linked to the lines of code being written…unit testable code is a sign of well-written code. At the macro level, you almost certainly will need to develop some components to support your test plans, and structure other components to make it possible to test.

Separating “Development” from “Testing” is especially interesting…in theory you can try and isolate the phases, but you’d really have to plan ahead and it doesn’t always happen even though Testing is a phase clearly shown in the SDLC.

What happens when there’s a phase that’s not even shown at all?

I asked “do we care if our software is working?”, and this is why. These standard SDLCs don’t even list “Operating” even though it’s possibly the most important phase of all. Much like the airline industry’s saying that “airplanes only make money when they’re in the air”…software only makes money when it is operating.

You might suggest that it is under “Deployment” or “Maintenance” but there is an impedance mismatch there. Deployment is usually interpreted as “get the software into prod” and Maintenance is interpreted as “fix bugs and upgrade insecure libraries”. But if “Operating” is where we make our money, why is it missing?

A long time ago, someone responsible for a product at a major consumer internet brand told me “we don’t test, we just deploy changes and wait to see if users have problems”. I still cringe when I remember that comment. Under very special circumstances it might be possible to succeed with this philosophy, but it would be irresponsible to build critical and important software with this philosophy.

How do we know if our software is working?

Since we (hopefully) agree that we care if our software is working, the next question is “how do we do this?” Common answers are:

  • we write unit tests
  • we have automated integration tests
  • we have a QA team

These are all fine practices, but I’m going to sidestep the costs/benefits of each. Instead I’m going to point out that they don’t technically answer the question “is the software working?” Instead, they answer a question that’s related but different: “is the software likely to work in production?”

The way to figure out if the software is actually working is by directly watching it. This usually means using words like “monitoring” or “observability” or “metrics” or “telemetry”, but ultimately it means that you should look at data produced by the software which could only be produced by successful operation.

“you should look at data produced by the software which could only be produced by successful operation”

Two side notes:

  1. This is why simple “bolt on” monitoring doesn’t work well. A simple “does the service return HTTP 200 or 500” is not very helpful, because a 200 just means “the webpage loaded” not “the user achieved the thing they want”
  2. I’ve always been a fan of Missouri being the “show me” state. Ideally, if you want to prove a claim, you should show a person direct proof.

What should we look at?

There are some classic ways of doing this which I like because they are usually straightforward to understand and to monitor.

LETS USE RED

In addition to being an excuse to add some color to this blog post, LETS USE RED is actually three helpful acronyms, each of which provides a perspective on basic monitoring.

  • LETS: Latency, Errors, Traffic, Saturation
  • USE: Utilization, Saturation, Errors
  • RED: Rate, Error, Duration

I do like these metrics as a starting point. These pieces of data can be good proxies for the user experience, and you should supplement these with metrics based directly on user experience. Also, these metrics are easy to capture from many services without much effort. Of course, just because something is easy to measure doesn’t mean that it’s the right or best thing to measure.
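
As a sketch of how little effort this takes to start (using the prom-client library for Node; metric and label names are illustrative), RED-style metrics are just a counter and a histogram:

import client from "prom-client";

// Rate and Errors come from a counter labeled by status; Duration from a histogram.
const requestCount = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "status"],
});

const requestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route"],
});

// Call this from your request handler / middleware.
export function recordRequest(route: string, status: number, seconds: number) {
  requestCount.inc({ route, status: String(status) });
  requestDuration.observe({ route }, seconds);
}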

What happens if you have a consumer product that only gets traffic during certain times of the day when people are awake? What happens if you run software that’s only needed during specific business hours? What happens when you want to do a deployment during low traffic hours, but you still want to know if the software is working?

Show me

The best way of proving that something is capable of doing something is to watch it do the thing. This philosophy should be applied to operations. Direct proof that a system is working helps in many situations: upgrades, new feature releases, debugging during nearly any issue, and so much more.

Using the SLA/SLO/SLI framework, you should think about your indicators (SLIs) using this “show me” philosophy. If you want to be sure a use case works, write some code to test that use case and emit a “success” metric. You should already have metrics generated by the activity, but you should also write custom expectations against the results to prove the success. You can apply this technique to any piece of software, whether you wrote it in-house or if it’s a third party black box.
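
A minimal sketch of what that could look like: a synthetic check that exercises a hypothetical “place order” use case end to end and emits an explicit success/failure metric (the endpoint, mutation, and metric names are all illustrative):

import client from "prom-client";

const syntheticCheck = new client.Counter({
  name: "synthetic_place_order_total",
  help: "Synthetic 'place order' checks, labeled by outcome",
  labelNames: ["outcome"],
});

// Run on a schedule (cron, CronJob, etc.) against the live system.
export async function syntheticPlaceOrderCheck(): Promise<void> {
  try {
    const res = await fetch("https://api.example.com/graphql", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        query: `mutation { placeTestOrder(sku: "SYNTHETIC-SKU") { id status } }`,
      }),
    });
    const body = await res.json();
    // Assert on the behavior's result, not just "the request returned 200".
    const ok = res.ok && body?.data?.placeTestOrder?.status === "CONFIRMED";
    syntheticCheck.inc({ outcome: ok ? "success" : "failure" });
  } catch {
    syntheticCheck.inc({ outcome: "failure" });
  }
}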

I know I wrote earlier that using real user data is a gold mine to use as an indicator, but I think that there’s a lot of value in generating synthetic load on your application, and monitoring based on that. As an industry we already test in production at scale. Why not use smoke tests as part of routine observability?

This technique isn’t appropriate for every use case or SLO, and there can be some challenges. But when you absolutely, positively gotta know…it does work.

How to review code written by an engineer better than you

Doing code reviews is mostly about spending thoughtful time reading code. Sometimes people focus on what they should look for during a review, and there are plenty of suggested checklists out there, or you can look at other code reviews among your team and make a checklist by paying attention to the patterns. That one habit (spending thoughtful time reading the code) will get you through most normal code reviews.

I think it’s easy to review code of someone who is new to a language. They may not write idiomatic code, or they may misuse certain language features as they learn a new language. If you know more than them about the language, the review is easy.

It’s also mostly easy to review the structure of the code for logical arrangement — for example, are objects, data models, and abstractions used in reasonable ways? Are there reasonable interfaces? Is there reasonable defensive programming and error handling? If you are thoughtful about each of these areas, the review is easy.

Whether or not the code changes as a result of the comment is not what makes a comment valuable.

It’s even easy to review code for functionality…check it out and attempt to run it.

Are some engineers better at each of these things than others? Yes. Some people are experts in languages. Some people are experts in modeling. Some people are experts at writing automated tests. Different people are good at different things. Some people are good at lots of things.


Every engineer is composed of different strengths, and for ease of discussion, let’s assume there are 100 dimensions to being a good engineer. Most good engineers are probably strong in about 70 of the traits, and part of the value of a balanced team is that everyone has a different 70, and once you get a certain number of people on a team, the team has strengths in all 100 dimensions.

But what happens when you’re a normal, even above average engineer, and you’ve got to review the code of one engineer who singlehandedly has at least 95 traits out of the 100? What can you possibly tell someone who is just objectively a better engineer? Why even bother? They’re probably right anyways…


Any really good engineer knows they’re good. So do the people around them. This isn’t about ego, there’s usually just a gentle acknowledgement. A really good engineer is obvious to everyone. The unexpected side effect is that sometimes this results in the really good engineer not getting feedback on their work, because everyone has the same impression — they read the code and go “hey this passes my checklist, but of course it would…approved!”.

This is unfortunate, because I think virtually everyone who is good at something enjoys it when someone takes time to engage with their work and provide commentary, good, bad, or neutral. It’s not as much fun to get rubber stamp approvals. No one is accidentally a good engineer — if they consistently produce quality, it’s because they’re putting thought into their work.

As a result, I try to leave some type of comment every time I read a PR. Sometimes I’m reviewing code where the author has more domain experience than I do, but I’ve found some techniques that help.

One of my best techniques is that as I read through the code, I leave comments like “So in this method, you’re basically….” or “Is this because….” Essentially, I just write down observations and questions. There’s no attempt to “give feedback” per se, it’s more like I’m validating that I was able to understand their code.

This is a crucial step.

There are lots of things you can have an opinion on, and it’s better to post something than nothing. Even a “simple” observation is something that even the most junior member of the team can post.

Whether or not the code changes as a result of the comment is not what makes a comment valuable. The goal should not be to only post comments about suggested changes. The goal is to have a discussion about the code. Posting a simple observation, almost “restating” their implementation in a sentence or two, can be a valuable thing.

When someone gets comments on their work, it makes the author feel good that someone’s paying attention, and it also adds an important validation step. By explaining your interpretation of the code, you’re validating that it can be maintained in the future. You’re also contributing to documentation. Years in the future when someone is trying to understand something, and they jump back to the PR, they see the discussion and can validate their own understanding. This “verbal confirmation” can be deeply valuable to you and to other people on your team.

Put another way, code written by a senior engineer shouldn’t only make sense to other senior engineers. Part of what makes a great senior engineer is that they produce solutions that are maintainable, which means that even a more junior engineer should be able to understand it.

I think this is really the core spirit of “how to give feedback to an engineer who you know is better than you”. Even if you don’t have a “critique” of something, you can add value by doing nothing more than adding comments that explain what the code is doing, or that explain your interpretation of it.


Before you start the review, think about the functionality and come up with a 60 second guess as to how you’d write the code. This will give you a starting point. You’ll have a perspective (even if it might be incomplete) and then you’ll find it easier to add comments like “why did/didn’t you do it this other way?”. This comment pattern is a gold mine for fostering a good discussion.


It’s worth your time to do reviews. At minimum, you become a better engineer by reading code. If you know a piece of code was written by someone whose skill you admire, why not take the opportunity to study it? And if you study it, why not write down what you take away from it? This is what starts a discussion, and this is where learning happens. Plus, it helps the author. It’s just good all around.


If nothing else, add comments of things you thought the author did well. If this engineer is as good as you say, compliment what you think they did well. Maybe they’ll even respond with some more “yeah I thought this would be a good idea because X” and they’ll even have some other reason you didn’t even think about, and then you learn even more. Again, the goal is to have a discussion about the code.


My overall point is that there’s basically always something you can write on a PR, no matter what kind of skill difference there is between the author and reviewer. The point of code reviews is not just finding bugs, or fixing problems. There’s always some kind of discussion that you can have to break down silos, improve understanding, and improve your individual skills as well.

Politics and people

I’ve noticed I seem to have a different take on politics and people than most people do. I think there are a few books that I’ve read which have shaped my thinking. I wasn’t necessarily trying to learn about politics when I read them, I didn’t read them in any specific order, and they’re not really political books either. The ideas in them just seem to come up a lot.

The first book is Society of the Spectacle, which is a mind-blowing book. It was written in 1967. I don’t read French, so I read the English translation. It’s the sort of book where you make it through a sentence and you genuinely have to stop and think about what you read, not just because the ideas themselves are interesting, but you constantly have to stop and think about how it was written in 1967 and you think about the past 50+ years. There’s a digested version if you want.

Society of the Spectacle practically predicts social media, Instagram, cable news, “fake news”, and more. I think it goes beyond that in predicting things we can’t see like the phenomenon of clickbait headlines, information/filter bubbles, and addictive technology like that described in Hooked (and also predicting the followup, Indistractable). I think it’s worth really considering how these things affect politics and how we talk with each other as a nation.

On the “fake news” subject, Amusing Ourselves to Death nails it. I read this in 2003, and I know this because I liked the book so much that I looked up the author, Neil Postman, only to find out that he had died the week before. It’s never cool to find out about a new band right before they break up.

If you’ve spent any amount of time thinking about fake news and the problems with entertainment-as-news, you should read Amusing Ourselves to Death. It’s similar to Society of the Spectacle, but is much easier to read.

This next one isn’t exactly a book, but more of a subject area: Semiotics. This is some of the most mind-bending stuff of all. The official explanation is that semiotics is the study of signs and symbols, but this doesn’t do it justice. The best way I can explain it is that it’s about how Sherlock Holmes sees the world. It’s almost like a book that’s not about the meaning of things, but how things could mean anything in the first place.

I think semiotics is a foundation for Society of the Spectacle and Amusing Ourselves to Death. Knowing how we create meaning for ourselves seems useful in thinking about how the media affects us. I personally bought an introductory textbook, but there could be better books out there. For me, a big part of the point is about how people interpret the same thing differently, which happens all the time in politics. It seems to require a very careful separation of opinions and facts, and semiotics seems like a way to unwind these.

Onto a different subject with Thinking, Fast and Slow. It’s a book written by a psychologist who won a Nobel Prize in Economics. That’s like winning in two sports at the same time — not easy. I think it even dips into being a sociology book in a way with discussions about what the author calls WYSIATI — our brains do a bad job at remembering, and I think this is worth keeping in mind when thinking inside our own heads, as well as when dealing with other people as individuals and as groups.

Another sort-of-a-book-but-really-a-subject recommendation is systems thinking. I personally read a book called Thinking in Systems, which covered the subject from an environmental angle, but is really useful when thinking about social programs and how we should think about trying to change things.

I’ll end this with one of my favorite things Obama ever said. This is solid, level-headed advice.

“Once you’ve highlighted an issue and brought it to people’s attention and shined a spotlight, and elected officials or people who are in a position to start bringing about change are ready to sit down with you, then you can’t just keep on yelling at them,” Mr. Obama said.

“And you can’t refuse to meet because that might compromise the purity of your position,” he continued. “The value of social movements and activism is to get you at the table, get you in the room, and then to start trying to figure out how is this problem going to be solved.”

“You then have a responsibility to prepare an agenda that is achievable, that can institutionalize the changes you seek, and to engage the other side, and occasionally to take half a loaf that will advance the gains that you seek, understanding that there’s going to be more work to do, but this is what is achievable at this moment,” he said.

https://www.nytimes.com/2016/04/24/us/obama-says-movements-like-black-lives-matter-cant-just-keep-on-yelling.html

When to split a data model

I had a discussion at work today where we were adding some fields to a model, and we were talking about whether it should be split into a separate data model. This made me wonder what type of guidance there was out there in the universe.

Turns out, there’s not much. I searched around for any posts about it, and couldn’t find any. There’s lots of info about how to model data upfront, but not a lot of advice about the ongoing maintenance of a data model. So I figured I’d write my own.

I’m using “data model” to describe a single class that gets data from a matching database table, and provides some small and common amounts of processing/filtering/transformation logic for that data. You could also refactor your usage of models to separate the concerns, but many projects don’t.

There are three things to consider when deciding if you should take one data model and split it into two, but I think the ideas can be applied to other designs.

  1. Data
  2. Logic
  3. Lifecycle

Data

You want your data model to be simple and easy to understand. One model should be equivalent to one concept.

The issue is when there’s another concept that’s similar, but not the same. For example, your Store table requires a mailing address, but what about an online store? Do you add a type column, and then validate that the address or URL is present depending on the type? Or does it need to be an OnlineStore vs a PhysicalStore?

You end up with a table where you have 20 columns, only some of which are required under certain circumstances, but not others.

I think that validations with lots of conditionals are a warning sign that the table might be modeling more than one thing.
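
A tiny illustrative sketch of that smell (the class, fields, and validation are hypothetical): conditional validations keyed off a type column usually mean two concepts are sharing one model.

// One "Store" model trying to be two concepts.
interface Store {
  id: string;
  type: "online" | "physical";
  url?: string;            // required only when type === "online"
  mailingAddress?: string; // required only when type === "physical"
}

// Validation full of per-type conditionals is the warning sign.
function validateStore(store: Store): string[] {
  const errors: string[] = [];
  if (store.type === "online" && !store.url) {
    errors.push("url is required for online stores");
  }
  if (store.type === "physical" && !store.mailingAddress) {
    errors.push("mailingAddress is required for physical stores");
  }
  return errors;
}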

Logic

Many model classes contain some amount of presentation logic. By “presentation logic,” I mean that it filters the data so that only certain attributes are returned, or that it does some type of transformation to make the data ready to use.

If you notice that you end up with substantial amounts of this logic for presenting data, you should consider if the data model can be improved. Is there some reason that you might be devoting a lot of code to filtering data out of a single table? Would it be better if it was split into a separate table?

Lifecycle

I think the lifecycle of an object matters. An extreme example, for explanation purposes, would be a table that stores data for both a TemporaryMessage and a LongLivedMessage. These two types of data are managed differently, their access is probably controlled differently, and they are purged out of the system according to different business logic.

I think this is especially nefarious because if one data model covers objects with different lifecycles, every time an engineer works with those models in new code, they need to remember that there are different types, and that each type needs special treatment. This can be avoided if you have different classes for different things.
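
As a small hypothetical sketch of the alternative, splitting the models makes the lifecycle rules explicit instead of something every engineer has to remember:

// Hypothetical split: each class carries its own lifecycle rules.
class TemporaryMessage {
  constructor(public id: string, public body: string, public expiresAt: Date) {}
}

class LongLivedMessage {
  constructor(public id: string, public body: string, public retentionPolicy: string) {}
}

// A purge job can now delete expired TemporaryMessage rows without any
// per-row "which kind of message is this?" checks.

With separate classes, the special treatment lives in the type itself rather than in the memory of whoever touches the code next.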