When to split a data model

I had a discussion at work today where we were adding some fields to a model, and we were talking about whether it should be split into a separate data model. This made me wonder what type of guidance there was out there in the universe.

Turns out, there’s not much. I searched around for any posts about it, and couldn’t find any. There’s lots of info about how to model data upfront, but not a lot of advice about the ongoing maintenance of a data model. So I figured I’d write my own.

I’m using “data model” to describe a single class that gets data from a matching database table, and provides some small and common amounts of processing/filtering/transformation logic for that data. You could also refactor your usage of models to separate the concerns, but many projects don’t.

There are three things to consider when deciding whether to split one data model into two. I think the same ideas can be applied to other designs, too.

  1. Data
  2. Logic
  3. Lifecycle


Data

You want your data model to be simple and easy to understand. One model should be equivalent to one concept.

The issue is when there’s another concept that’s similar, but not the same. For example, your Store table requires a mailing address, but what about an online store? Do you add a type column, and then validate that either the address or the URL is present, depending on the type? Or do you need an OnlineStore and a PhysicalStore?

You can end up with a table that has 20 columns, each of which is required only under certain circumstances.

I think that validations with lots of conditionals are a warning sign that the table might be modeling more than one thing.
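
As a minimal sketch of that warning sign (plain Ruby with no ORM; the class names and fields are hypothetical, loosely following the store example above), compare one model whose validations branch on a type flag with two small models where every field is always required:

```ruby
# Hypothetical sketch: plain Ruby structs stand in for database-backed models.

# One model covering two concepts: the validations branch on `type`.
CombinedStore = Struct.new(:name, :type, :mailing_address, :url, keyword_init: true) do
  def errors
    errs = []
    errs << "name is required" if name.to_s.empty?
    errs << "mailing_address is required" if type == "physical" && mailing_address.to_s.empty?
    errs << "url is required" if type == "online" && url.to_s.empty?
    errs
  end
end

# Two models, one concept each: every field is always required.
PhysicalStore = Struct.new(:name, :mailing_address, keyword_init: true) do
  def errors
    members.select { |field| self[field].to_s.empty? }.map { |field| "#{field} is required" }
  end
end

OnlineStore = Struct.new(:name, :url, keyword_init: true) do
  def errors
    members.select { |field| self[field].to_s.empty? }.map { |field| "#{field} is required" }
  end
end

combined = CombinedStore.new(name: "Corner Shop", type: "online")
split    = OnlineStore.new(name: "Corner Shop")

puts combined.errors.inspect
puts split.errors.inspect
```

Both report the missing URL, but only the combined model needs to know about types to do it.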


Logic

Many model classes contain some amount of presentation logic. By “presentation logic,” I mean that it filters the data so that only certain attributes are returned, or that it does some type of transformation to make the data ready to use.

If you notice that you end up with substantial amounts of this logic for presenting data, you should consider if the data model can be improved. Is there some reason that you might be devoting a lot of code to filtering data out of a single table? Would it be better if it was split into a separate table?
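
Here is a rough sketch of what that can look like. It is plain Ruby, and the users and profiles tables and their fields are hypothetical. The first model has to filter on every read; the split model needs no filtering:

```ruby
# Hypothetical row from a single `users` table that holds two kinds of data.
user_row = {
  id: 1, email: "ann@example.com", password_digest: "secret-hash",
  display_name: "Ann", bio: "Likes dry rubs"
}

# Before: the model carries presentation logic to hide the private columns.
class User
  PUBLIC_FIELDS = %i[display_name bio].freeze

  def initialize(row)
    @row = row
  end

  def public_profile
    @row.slice(*PUBLIC_FIELDS)
  end
end

# After: the public fields live in their own (hypothetical) `profiles` table,
# so the model can return its row as-is.
class Profile
  def initialize(row)
    @row = row
  end

  def to_h
    @row.dup
  end
end

profile_row = { display_name: "Ann", bio: "Likes dry rubs" }

# Both approaches expose the same data; only the first needs filtering code.
puts User.new(user_row).public_profile == Profile.new(profile_row).to_h
```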


Lifecycle

I think the lifecycle of an object matters. An extreme example, for explanation purposes, would be a table that stores data for both a TemporaryMessage and a LongLivedMessage. These two types of data are managed differently, their access is probably controlled differently, and they are purged out of the system according to different business logic.

I think this is especially insidious: if one data model covers different lifecycles, then every time an engineer works with that model in new code, they need to remember that there are different types, each needing special treatment. This can be avoided if you have different classes for different things.
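
A small sketch of the separate-classes version, in plain Ruby. The purge rules here (one day for temporary messages, archived-only for long-lived ones) are assumptions made up for illustration:

```ruby
# Hypothetical sketch: each message class owns its own purge rule.
class TemporaryMessage
  MAX_AGE_SECONDS = 24 * 60 * 60 # assumed: temporary messages live one day

  attr_reader :created_at

  def initialize(created_at)
    @created_at = created_at
  end

  def purgeable?(now)
    now - created_at > MAX_AGE_SECONDS
  end
end

class LongLivedMessage
  attr_reader :created_at, :archived

  def initialize(created_at, archived: false)
    @created_at = created_at
    @archived = archived
  end

  # Assumed rule: long-lived messages are purged only once archived,
  # no matter how old they are.
  def purgeable?(_now)
    archived
  end
end

now = Time.at(1_000_000)
two_days_ago = now - (2 * 24 * 60 * 60)

messages = [
  TemporaryMessage.new(two_days_ago),
  LongLivedMessage.new(two_days_ago),
  LongLivedMessage.new(two_days_ago, archived: true)
]

# Calling code never branches on a type column.
purged = messages.select { |message| message.purgeable?(now) }
puts purged.map { |message| message.class.name }.inspect
```

The code that purges messages asks each object whether it can go, so nobody has to remember which type needs which treatment.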

Dry Rubs

These are translations of various recipes into ratios. I haven’t made all of these recipes; I just wanted to get a bunch of dry rub ratios all in one place.

I like to have premade batches of dry rubs, and it’s harder to make batches when the recipe uses a small measure like a tablespoon. I also find it easier to compare different recipes and flavors when they’re in ratios like this.

In general, amazingribs.com seems to be a good resource. I especially appreciate the mention that if you taste the uncooked rub, that’s not what the final flavor is like. The only way to know is to cook something with it.

As a general rule, dry rubs are mostly paprika and sugar.

My personal preference is to make rubs with less or no salt. Whether you pre-salt or brine the meat, and how large the pieces are, all change how much salt you need. Being able to measure out flavor and salt separately gives you more control over the flavors.

Also 1 cup = 16 tablespoons. Useful for when you make batches.
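
For the batch math, here is a small sketch in Ruby. scale_rub is a hypothetical helper, and the ratio is made up for illustration rather than taken from any of the recipes. It turns parts into tablespoons for a batch of a given size in cups:

```ruby
TABLESPOONS_PER_CUP = 16

# Hypothetical helper: convert a rub given in parts into tablespoons
# for a batch that totals `batch_cups` cups.
def scale_rub(parts, batch_cups)
  total_parts = parts.values.sum
  tablespoons_per_part = batch_cups * TABLESPOONS_PER_CUP / total_parts.to_f
  parts.transform_values { |count| (count * tablespoons_per_part).round(2) }
end

# Made-up ratio for illustration: 8 parts in total.
rub = { paprika: 4, brown_sugar: 3, black_pepper: 1 }

# A 1-cup batch is 16 tablespoons, so each part is 2 tablespoons.
batch = scale_rub(rub, 1)
batch.each { |ingredient, tablespoons| puts "#{ingredient}: #{tablespoons} tbsp" }
```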

Easy Spice Rubs For Fish

  • 6 parts paprika
  • 6 parts light brown sugar
  • 4 parts dried oregano
  • 3 parts garlic powder
  • 2 parts cumin
  • 1 part cayenne pepper
  • 4 parts salt

beef & pork

  • 12 tablespoons firmly packed dark brown sugar
  • 12 tablespoons white sugar
  • 6 tablespoons American paprika
  • 3 tablespoons garlic powder
  • 2 tablespoons ground black pepper
  • 2 tablespoons ground ginger powder
  • 2 tablespoons onion powder
  • 2/3 tablespoon rosemary powder

beef & pork
(granulated garlic and onion are not the same as powder, and I’m not sure what a reasonable conversion is)

  • 16 parts brown sugar
  • 4 parts paprika
  • 3 parts granulated garlic
  • 2 parts granulated onion
  • 2 parts kosher salt
  • 2 parts black pepper
  • 2 parts cumin
  • 1 part ancho or chipotle
  • 2 parts mustard powder
  • 1 part cayenne pepper


  • 8 parts paprika (about 20% less if you use smoked paprika)
  • 4 parts kosher salt
  • 4 parts freshly ground black pepper
  • 4 parts brown sugar
  • 4 parts chile powder
  • 3 parts ground cumin
  • 2 parts ground coriander
  • 1 part cayenne pepper, or to taste


  • 2 parts paprika
  • 1 part salt
  • 1 part white sugar
  • 1 part brown sugar
  • 1 part ground cumin (cumin powder)
  • 1 part chili powder
  • 1 part freshly ground black pepper


  • 6 parts brown sugar
  • 6 parts paprika
  • 4.5 parts salt
  • 3 parts black pepper
  • 1.5 parts cayenne
  • 1 part dry mustard


  • 4 parts paprika
  • 2 parts salt
  • 2 parts onion powder
  • 2 parts fresh ground black pepper
  • 1 part cayenne


  • 4 parts paprika
  • 2 parts kosher salt, finely ground
  • 2 parts sugar
  • 1 part mustard powder
  • 2 parts chili powder
  • 2 parts ground cumin
  • 1 part ground black pepper
  • 2 parts granulated garlic
  • 1 part cayenne

Testing Wisdom

I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence (I suspect this level of confidence is high compared to industry standards, but that could just be hubris). If I don’t typically make a kind of mistake (like setting the wrong variables in a constructor), I don’t test for it. I do tend to make sense of test errors, so I’m extra careful when I have logic with complicated conditionals. When coding on a team, I modify my strategy to carefully test code that we, collectively, tend to get wrong.

Different people will have different testing strategies based on this philosophy, but that seems reasonable to me given the immature state of understanding of how tests can best fit into the inner loop of coding. Ten or twenty years from now we’ll likely have a more universal theory of which tests to write, which tests not to write, and how to tell the difference. In the meantime, experimentation seems in order.

Kent Beck on unit test coverage, via Stack Overflow

I like this. Very straightforward, no complexity. “I write unit tests for things that are complicated or might break”, which is probably a sane strategy for all levels of testing — look for the level at which you’ve got enough complexity below, where it’s easy for something to break.

Biden on judgement

When the subject of Trump came up aboard Air Force Two, Biden referred to a well-worn story about how, as a freshman senator, he saw Jesse Helms, the archconservative North Carolina Republican, ripping into a piece of disabilities legislation. Biden was furious about it and began attacking Helms to Mike Mansfield, the Democratic Senate majority leader. Puffing on his pipe, Mansfield asked Biden if he knew that Helms and his wife had adopted a disabled 9-year-old boy no one else would take. “Question a man’s judgment, not his motives,” Mansfield instructed.

I wish to hell I’d just kept saying the exact same thing – Joe Biden

Dependency Inversion Principle

Dependency Inversion (sometimes referred to as DIP) is one of the five SOLID OO principles that have become so popular in recent years. My opinion is that it’s a highly valuable concept, and is not well-named. “Dependency inversion” doesn’t mean a great deal on its own, and a lot of attempts to explain it tend to get very heady.

(Speaking of unfortunate naming, “hexagonal architecture” is another concept that I find poorly named. Ironic that the concept is about using abstraction to avoid unnecessary coupling, yet the name itself couples the idea to the number 6, which is totally unrelated to the idea. Fortunately, it’s slowly being renamed as “ports and adapters” in most discussions, which is a much better description.)

My personal preference is towards explanations that are more intuitive, and I’d like to put my two cents in for Dependency Inversion.

To me, a way of describing Dependency Inversion is to use classes to separate the features of the application (the parts that a user might use) from the technology that makes the feature work.

For example, let’s say you have a feature where a user can sign themselves up, and once the user’s information is saved to the database, you have some more steps that need to take place. You could make an interactor called CompleteUserSetup, and it will handle the actions necessary (instead of using a callback).

Let’s say one of the things it needs to do is send an email welcoming the user. Let’s also pretend that you use SendGrid to manage emails. That would mean we could end up with something like

class CompleteUserSetup
  def self.perform
    SendGridClient.email "subjectline", "body of email"
  end
end


Will this run? Absolutely. Is it a good example of code that meets the criteria for Dependency Inversion? No.

What’s happening here is that the feature (user setup) is directly mentioning the technology that implements it (SendGrid). It literally has the name of the tool in the code that is defining the feature.


Since there are probably lots of places in the code that send email, if you ever need to switch away from SendGrid, you need to change all those places. Usually it’s not as simple as a find-and-replace on every instance of “SendGrid” in the code, so a better way is to abstract all of the email interactions into your own class.

class CompleteUserSetup
  def self.perform
    OurEmailClient.send_welcome_email
  end
end

class OurEmailClient
  def self.send_welcome_email
    SendGridClient.email "subjectline", "body of email"
  end
end


Now the feature doesn’t mention the implementation technology. If you kept this pattern going, “SendGrid” would only ever appear in your email client class, and this means that the technology would be decoupled from the features. This is an Adapter pattern, which is basically the first half of Dependency Inversion. (IMHO this is the most common implementation of DI, and this makes testing really easy)

To close the loop on “Dependency Inversion”, you actually pass the interactor an instance of an email client, so that it can work with more than one.

class CompleteUserSetup
  def self.perform(client)
    client.send_welcome_email
  end
end

class OurEmailClient
  def send_welcome_email
    SendGridClient.email 'subjectline', 'body of email'
  end
end

class OurOtherEmailClient
  def send_welcome_email
    MailChimpClient.email 'subject', 'body of email'
  end
end

email_client = OurEmailClient.new

# or

email_client = OurOtherEmailClient.new

CompleteUserSetup.perform(email_client)

Now we’re actually passing the dependency into the location where it’s needed. This means the CompleteUserSetup interactor is totally decoupled from which messaging system it will use.

The reason this matters is that now we can choose any type of email provider we currently support, and we can also add new types of email providers that we didn’t previously use.

To be fair, in Ruby, this isn’t quite the same as in Java. Ruby will allow duck typing, which means that we don’t have to write an Interface. Also, Ruby doesn’t support Interfaces at all, so that aspect of this is missing. Still — the principle behind decoupling does make it very easy to write readable code, and testable code, and I always love that.
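
The testing benefit can be sketched like this. FakeEmailClient is a hypothetical test double, not part of any library. Thanks to duck typing, it only needs to respond to send_welcome_email:

```ruby
# The interactor from the example above, repeated so this sketch runs on its own.
class CompleteUserSetup
  def self.perform(client)
    client.send_welcome_email
  end
end

# Hypothetical test double: counts calls instead of talking to an email provider.
class FakeEmailClient
  attr_reader :welcome_emails_sent

  def initialize
    @welcome_emails_sent = 0
  end

  def send_welcome_email
    @welcome_emails_sent += 1
  end
end

fake = FakeEmailClient.new
CompleteUserSetup.perform(fake)

# The test can check the interactor's behavior without any real email being sent.
puts fake.welcome_emails_sent
```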

Highs and Lows

I was talking with an engineering manager the other day who told me one key technique he uses for his teams. I thought it was one of those simple-yet-powerful techniques that I love so much.

The simplicity is this — when he does his weekly checkin with his reports, he asks them “in the past week, not just at work but in your life as a whole, what was your high and what was your low?”

I think this is a great approach to management. It’s highly touchy-feely, but I think that’s an important part of managing others. If someone is having a good time in their personal life, it can and will have an effect on their work performance. Likewise, if someone is having a bad time in their personal life, it can and will affect their work.

I think it’s worth being mindful that the effect might not be what we intuitively expect. Someone who is having bad things happen in their personal life might react by being distracted at work, or they might use work as a distraction from the unpleasantness and focus on it more intensely than usual. Or, perhaps there might be another reaction. There’s no way to know without discussing it and paying attention.

(To be clear, I’m not suggesting that people should use a dark pattern to drive their employees by offering work as a distraction from things that might be difficult at home, only saying that the connection between personal and work does exist, and is unique for each person.)

By understanding the person as a unique individual, and as a whole, a great manager can work with the natural rhythms of the lives that their team members live. I think this maximizes happiness and results, and minimizes mistakes and turnover, which are all desirable.

How to specify the schema using Spark’s Java client

I’ve been working with Spark recently. It’s awesome. Unfortunately, the Spark community leans towards Scala, but the Java client was the best choice for our team. This means that sometimes the documentation and examples out there aren’t great. One thing that took me longer than I liked was figuring out how to specify the schema for the data.

Spark has a nice default behavior where it will infer the schema of your data when you load it. The trouble with this is that if you change the data over time (adding new attributes for example), you can run into issues with your code only working with certain versions of the data and not others.

Fortunately, you can specify the schema yourself, so that fields missing from a given version of the data still exist, as nulls. I found lots of examples for how to do this in Scala, but it was hard to find examples in Java. So, here’s how:

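Let’s pretend the following is a record in `whatever.json` (the company_name value is a made-up placeholder, and the record is shown pretty-printed; Spark expects each record on a single line):

{
  "company_name": "Whatever Inc",
  "address":{"street":"Spear St","city":"San Francisco"},
  "rank": 100
}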

This would correspond to the following code:

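import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkConf conf = new SparkConf().setAppName("MyFunApp");
JavaSparkContext sparkCtx = new JavaSparkContext(conf);
HiveContext context = new HiveContext(sparkCtx.sc());
String sourceUrl = "whatever.json";

// Nested JSON objects become nested StructTypes.
StructType schema = DataTypes.createStructType(Arrays.asList(
  DataTypes.createStructField("company_name", DataTypes.StringType, false), // a name is text
  DataTypes.createStructField("address", DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("street", DataTypes.StringType, true),
    DataTypes.createStructField("city", DataTypes.StringType, true)
  )), true),
  DataTypes.createStructField("rank", DataTypes.LongType, true) // rank is an integer in the JSON
));

DataFrame changesRaw = context.read().schema(schema).json(sourceUrl).cache();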

The third createStructField param is “can the value be null?”