This post covers several aspects of creating a GraphQL API:
- Schema design
- Technical elements
- SDLC
Schema design
GraphQL is an abstraction that maps user queries onto code in the backend application, and that code in turn pulls data from a source of truth. This mental model matters for schema design, because the abstraction should be built around the user and their use cases.
In contrast to a user-oriented schema, a schema autogenerated to mirror your database deeply couples your customers' code to your data store. This coupling constrains future migrations and refactors, and increases the burden on customers to understand how to correctly mutate a data model to accomplish a task.
A rough process for designing a schema (a small schema sketch follows the list):
- Write down the use cases
- List the objects and fields required to meet the use cases
- List the queries for reads, and include details of relationships between objects, both the edges of the graph, and any additional metadata belonging to the edge
- List the mutations needed for writes. Think of mutations as user behaviors and actions, not simply changes to data fields
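To make this concrete, here is a minimal sketch of what this process might produce for a hypothetical pair of use cases, "show a user's recent comments" and "post a comment". Every type, field, and mutation name below is invented for illustration, written as SDL inside a TypeScript string as you might pass to graphql-js or Apollo Server.

```typescript
// Minimal sketch of a use-case-driven schema. All names are hypothetical.
const typeDefs = /* GraphQL */ `
  type User {
    id: ID!
    displayName: String!
    comments(first: Int): [Comment!]!
  }

  type Comment {
    id: ID!
    body: String!
    author: User!
    createdAt: String!
  }

  type Query {
    # Use case: "show a user's recent comments"
    user(id: ID!): User
  }

  input PostCommentInput {
    postId: ID!
    body: String!
  }

  type PostCommentPayload {
    comment: Comment!
  }

  type Mutation {
    # Use case: "post a comment", modeled as a user action,
    # not as a raw field update
    postComment(input: PostCommentInput!): PostCommentPayload!
  }
`;
```

Note that the mutation is named after the user's action (postComment) rather than the underlying data change, which keeps the schema oriented around use cases.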
A few technical concerns go hand in hand with the data model design:
- Don't explicitly version your API. Clients choose the data they receive, and new objects/queries/mutations can be added to provide seamless migrations, so explicit versioning should not be needed.
- Use UUIDs. This can help with caching, and is generally a good practice instead of integer IDs
- Use nullable fields thoughtfully. Prefer non-null fields unless a field is unavailable for historical data reasons, or comes from a separate system that could be unavailable (see the sketch after this list)
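As a small hypothetical illustration of that last point, most fields stay non-null and nullability is reserved for data that can genuinely be missing:

```typescript
// Hypothetical example: prefer non-null fields, and reserve nullability for
// data that can genuinely be absent.
const orderTypeDefs = /* GraphQL */ `
  type Order {
    id: ID!                  # UUID, always present
    createdAt: String!       # always known
    total: Float!            # always computed
    legacyImportNote: String # nullable: only populated for historical records
    shippingEta: String      # nullable: comes from an external carrier system
  }
`;
```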
Design your schema in a way that makes it more evolvable:
- Group data with nested types instead of using prefixes or simple arrays (sketched below)
- Be specific/verbose with naming to avoid future conflicts
- Use GraphQL built-in types correctly, and prefer built-in types over custom types
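For instance, a nested type keeps related fields together and leaves room to grow; the names here are hypothetical:

```typescript
// Hypothetical example: nesting related fields in their own type is easier
// to evolve than flat, prefixed fields on the parent.
const profileTypeDefs = /* GraphQL */ `
  # Harder to evolve:
  # type User {
  #   shippingAddressStreet: String!
  #   shippingAddressCity: String!
  # }

  # Easier to evolve: Address can gain fields without touching User
  type Address {
    street: String!
    city: String!
  }

  type User {
    id: ID!
    shippingAddress: Address!
  }
`;
```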
Technical elements
This section covers technical aspects of the API itself.
Pagination
Pagination is important for both technical and user experience reasons. Think of a use case to “show all the comments from a user” where a user has 10,000 comments.
Pagination allows you to read “pages” of records, borrowing from the concept of memory pages in computer architecture. You can set your page size to 100 records, and load additional pages as a client consumes more user comments.
Without pagination, you have to load all of the user's comments, even if you only need the most recent ones. The result is a slow UI, higher bandwidth costs, and extra load on application and database servers.
Pagination can be implemented in many ways, but the GraphQL community generally accepts the Relay Connections spec as the method of choice.
An added benefit of pagination is that it moves closer to the “thinking in graphs” model of GraphQL. GraphQL types represent the nodes of a graph, and nodes are connected by edges. Pagination works by storing metadata on the edge — treating the edge as a first class citizen. You can add your own metadata on the edges about the relationship between the two nodes. For example, if a user is connected to another user, the edge could contain a timestamp for when the connection was created. There are several good resources for understanding how to think about connections [1, 2].
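As a rough sketch of what that could look like with a Relay-style connection (type names are illustrative; the Relay spec defines the full shape), the edge carries its own metadata alongside the node it points to:

```typescript
// Sketch of a Relay-style connection where the edge carries metadata about
// the relationship (here, when two users became connected).
const connectionTypeDefs = /* GraphQL */ `
  type User {
    id: ID!
    friends(first: Int, after: String): FriendConnection!
  }

  type FriendConnection {
    edges: [FriendEdge!]!
    pageInfo: PageInfo!
  }

  type FriendEdge {
    node: User!
    cursor: String!
    connectedAt: String!  # metadata that belongs to the edge, not to either node
  }

  type PageInfo {
    hasNextPage: Boolean!
    endCursor: String
  }
`;
```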
Caching
There are two places to cache in GraphQL: the client and the server. Client-side caching is typically handled by your GraphQL client library, which keeps a cache of query results. On the server, DataLoader provides per-request caching and batching of data-source lookups, and resolvers can also use any normal backend caching strategy because resolvers are just normal application code.
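Here is a minimal sketch of per-request batching and caching with DataLoader on the server; the in-memory fakeDb and the usersById function are hypothetical stand-ins for a real data source.

```typescript
import DataLoader from "dataloader";

interface User {
  id: string;
  displayName: string;
}

// Stand-in for a real data source; in practice this would be a single query
// such as `SELECT ... WHERE id = ANY($1)`.
const fakeDb = new Map<string, User>([
  ["u1", { id: "u1", displayName: "Ada" }],
  ["u2", { id: "u2", displayName: "Grace" }],
]);

async function usersById(ids: readonly string[]): Promise<(User | null)[]> {
  // Results must come back in the same order as the requested ids.
  return ids.map((id) => fakeDb.get(id) ?? null);
}

// One loader per request: all .load() calls issued in the same tick are
// batched into a single usersById() call, and results are cached for the
// remainder of the request.
export function createUserLoader() {
  return new DataLoader<string, User | null>((ids) => usersById(ids));
}

// Usage inside a resolver (the loader typically lives on the per-request context):
// const author = await context.userLoader.load(comment.authorId);
```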
There is one consideration if you're coming from REST to GraphQL. REST generally gives each resource its own endpoint: /users, /users/comments, /users/posts. GraphQL uses only a single endpoint, /graphql, and queries are passed as GET or POST parameters. This means that you cannot use URL-based caching, like a CDN.
If you need to deal with very large responses from your GraphQL API, there are some other strategies.
Security
GraphQL Armor is worth noting. I haven’t looked into this as deeply as I’d like, but security and rate limiting are overlapping concerns.
AuthN/Z
GraphQL does not care how you authenticate clients. In a generally well-architected service, requests pass through an authN component first, so GraphQL resolvers will "know" who the caller is by the time they run.
Authorization logic should live in the resolvers to control access to specific records based on the caller. If a caller is a regular user, they might only see records tied to their user, but an admin user might be able to list all records.
AuthZ logic can be repetitive, and should be written in a well-factored way and not duplicated across resolvers. This is a generally good practice for maintainability reasons, but it is also valuable to have all your authZ rules in one place. Any company that manages access to sensitive data must consider security to be part of their “brand promise” to customers, and engineering can contribute by keeping all authZ logic in one place to provide effective controls and auditability.
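One way to do this is a small, shared authorization module that every resolver calls. This sketch is hypothetical; the important part is that the rule lives in exactly one, auditable place.

```typescript
// Hypothetical centralized authZ helper shared by all resolvers.
interface Caller {
  userId: string;
  roles: string[];
}

interface OwnedRecord {
  ownerId: string;
}

// Single place that encodes the "admins see everything, users see only
// their own records" rule, so it is easy to audit and change.
export function canReadRecord(caller: Caller, record: OwnedRecord): boolean {
  if (caller.roles.includes("admin")) return true;
  return record.ownerId === caller.userId;
}

// In a resolver:
// if (!canReadRecord(context.caller, record)) {
//   throw new Error("Not authorized");
// }
```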
Transport layer
The GraphQL community strongly recommends HTTP as the standard transport layer. Subscriptions are the exception to this rule, but are also rarely used.
Be sure to handle the usual HTTP and TCP concerns. Make sure your server has reasonable timeouts for long-running queries. Make sure your client handles retries, rate limits, HTTP error codes, HTTP headers, TCP connection errors, DNS, etc.
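As a rough sketch of the client side, assuming a runtime with a global fetch and AbortSignal.timeout (Node 18+ or a modern browser), and an endpoint and retry policy chosen purely for illustration:

```typescript
// Illustrative client-side handling of timeouts, retries, and rate limits.
// Assumes a runtime with a global fetch and AbortSignal.timeout (Node 18+).
export async function graphqlRequest(
  query: string,
  variables: Record<string, unknown>
): Promise<unknown> {
  const maxAttempts = 3;
  for (let attempt = 1; ; attempt++) {
    let retryable = false;
    try {
      const res = await fetch("https://example.com/graphql", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query, variables }),
        signal: AbortSignal.timeout(10_000), // bound long-running queries
      });
      if (res.ok) return await res.json();
      // Rate limits and server errors are worth retrying; other 4xx are not.
      retryable = res.status === 429 || res.status >= 500;
      throw new Error(`HTTP ${res.status}`);
    } catch (err) {
      // Network-level failures (DNS, TCP resets, timeouts) are also retryable.
      if (err instanceof TypeError || (err as Error).name === "TimeoutError") {
        retryable = true;
      }
      if (!retryable || attempt >= maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, 250 * attempt)); // simple backoff
    }
  }
}
```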
As I write this in 2023, the official GraphQL docs for serving over HTTP are generally the most authoritative resource, but there is a working group that has a draft spec.
Subscriptions
Subscriptions are GraphQL's solution for servers to push updates to clients, usually via WebSockets. The GraphQL community generally recommends avoiding subscriptions unless you need incremental updates to large objects or low-latency updates; a chat client is the canonical real-time example. Polling should cover most other cases.
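If you do have such a use case, a subscription is declared in the schema like any other operation type; here is a minimal chat-style sketch with hypothetical names:

```typescript
// Minimal sketch of a chat-style subscription definition (names hypothetical).
const chatTypeDefs = /* GraphQL */ `
  type Message {
    id: ID!
    body: String!
    sentAt: String!
  }

  type Subscription {
    # The server pushes each new message in a channel, typically over WebSockets
    messageSent(channelId: ID!): Message!
  }
`;
```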
Resolver design
In GraphQL, resolvers are where a query gets coupled to the underlying data store. The resolvers you need follow from the schema you choose, but there are some things worth considering.
It’s important to remember a resolver is just a function containing normal application code that you write. GraphQL helps take a query and decide which resolver to call, but once the resolver is running, it’s just regular code. The resolver function takes the arguments from the query per the GraphQL schema, and returns an object matching the data requested.
You should generally treat a resolver as a public API, the same way you might treat a public method in object-oriented code: defensive programming, observability, etc. A suggested pattern (sketched in code after the list):
- Validate and parse arguments
- Check user authZ
- Preflight checks, usually just checking a server-side cache
- Data access: API call, DB call, load from disk, etc
- Transform the data to match the output schema
- Postflight logic: metrics, logging, storing in a cache, etc
- Return the data
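Here is a sketch of a resolver that follows those steps; the authZ check, cache, and database helper are hypothetical stand-ins for your own code.

```typescript
// Sketch of a "comments for a user" resolver following the steps above.
interface Caller {
  userId: string;
  roles: string[];
}

interface Context {
  caller: Caller;
  cache: Map<string, CommentDto[]>;
}

interface CommentDto {
  id: string;
  body: string;
  createdAt: string;
}

// Hypothetical helpers standing in for your own authZ and data-access code.
function canReadComments(caller: Caller, userId: string): boolean {
  return caller.roles.includes("admin") || caller.userId === userId;
}

async function fetchCommentsFromDb(userId: string, limit: number) {
  // Stand-in for a real, parameterized database query.
  return [] as { id: string; body: string; created_at: string }[];
}

export async function commentsResolver(
  _parent: unknown,
  args: { userId?: string; first?: number },
  context: Context
): Promise<CommentDto[]> {
  // 1. Validate and parse arguments
  if (!args.userId) throw new Error("userId is required");
  const first = Math.min(args.first ?? 50, 100);

  // 2. Check user authZ
  if (!canReadComments(context.caller, args.userId)) {
    throw new Error("Not authorized");
  }

  // 3. Preflight: check a server-side cache
  const cacheKey = `comments:${args.userId}:${first}`;
  const cached = context.cache.get(cacheKey);
  if (cached) return cached;

  // 4. Data access
  const rows = await fetchCommentsFromDb(args.userId, first);

  // 5. Transform the rows to match the output schema
  const comments = rows.map((row) => ({
    id: row.id,
    body: row.body,
    createdAt: new Date(row.created_at).toISOString(),
  }));

  // 6. Postflight: record metrics/logs and populate the cache
  context.cache.set(cacheKey, comments);

  // 7. Return the data
  return comments;
}
```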
Since you’re probably accessing a database, you should think about all your normal database best practices, including avoiding N+1 queries.
Defining GraphQL operations
GraphQL supports three different types of operation: query, mutation, and subscription. Use the query/mutation semantics and only read during a query and write during a mutation. Technically speaking, nothing stops you from modifying data in a query resolver, but you should never do this.
You should always name your operations, to help with observability.
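For example, a named operation (the name RecentComments below is arbitrary) shows up in server logs and traces, unlike an anonymous query:

```typescript
// A named operation: "RecentComments" will appear in server logs, traces,
// and error reports, unlike an anonymous `query { ... }`.
const RECENT_COMMENTS_QUERY = /* GraphQL */ `
  query RecentComments($userId: ID!, $first: Int) {
    user(id: $userId) {
      comments(first: $first) {
        id
        body
      }
    }
  }
`;
```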
SDLC
There are some additional items that should be considered as part of general good technical hygiene.
Observability is especially important for a GraphQL API because your customers are being given access to a very flexible API. This means customers can write complex queries with widely varying resource usage. You should understand the nature of your users' requests and the resources they consume. Observability also helps you understand every piece of data being used, which can make it easier to deprecate or remove functionality.
Testing is another important aspect. Write unit tests for any data processing/transformation logic in your resolvers, and write high-level tests that assert your core use cases work as expected. Run all of these tests during continuous integration.
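As a small sketch using Node's built-in test runner, transformation logic extracted from a resolver can be unit tested without a database or a running GraphQL server; the helper and the row shape are hypothetical.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical transformation logic extracted from a resolver so it can be
// tested in isolation.
function toCommentDto(row: { id: string; body: string; created_at: string }) {
  return {
    id: row.id,
    body: row.body.trim(),
    createdAt: new Date(row.created_at).toISOString(),
  };
}

test("maps a database row to the schema shape", () => {
  const dto = toCommentDto({
    id: "c1",
    body: "  hello  ",
    created_at: "2023-01-02T03:04:05.000Z",
  });
  assert.deepEqual(dto, {
    id: "c1",
    body: "hello",
    createdAt: "2023-01-02T03:04:05.000Z",
  });
});
```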
Next steps
The GraphQL community has consolidated around https://principledgraphql.com/, which is worth a deep read.