When to split a data model

I had a discussion at work today where we were adding some fields to a model, and we were talking about whether it should be split into a separate data model. This made me wonder what type of guidance there was out there in the universe.

Turns out, there’s not much. I searched around for any posts about it, and couldn’t find any. There’s lots of info about how to model data upfront, but not a lot of advice about the ongoing maintenance of a data model. So I figured I’d write my own.

I’m using “data model” to describe a single class that gets data from a matching database table, and provides a small amount of common processing/filtering/transformation logic for that data. You could also refactor your usage of models to separate those concerns, but many projects don’t.

There are three things to consider when deciding if you should take one data model and split it into two, but I think the ideas can be applied to other designs.

  1. Data
  2. Logic
  3. Lifecycle


Data

You want your data model to be simple and easy to understand. One model should be equivalent to one concept.

The issue arises when there’s another concept that’s similar, but not the same. For example, your Store table requires a mailing address, but what about an online store? Do you add a type column, and then validate that either the address or the URL is present, depending on the type? Or does it need to be an OnlineStore vs. a PhysicalStore?

You end up with a table with 20 columns, only some of which are required, depending on the circumstances.

I think that validations with lots of conditionals are a warning sign that the table might be modeling more than one thing.
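A minimal sketch of what that warning sign looks like (plain Ruby, with a hypothetical Store class; no framework assumed):

```ruby
# Hypothetical sketch: one Store class whose validity rules branch on a
# type column -- the conditional validation warning sign described above.
class Store
  attr_reader :type, :mailing_address, :url

  def initialize(type:, mailing_address: nil, url: nil)
    @type = type
    @mailing_address = mailing_address
    @url = url
  end

  def valid?
    case type
    when :physical then !mailing_address.nil? # physical stores need an address
    when :online   then !url.nil?             # online stores need a URL
    else false
    end
  end
end
```

The moment `valid?` has to ask what kind of store it is, the table is arguably modeling two things.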


Logic

Many model classes contain some amount of presentation logic. By “presentation logic,” I mean code that filters the data so that only certain attributes are returned, or that does some type of transformation to make the data ready to use.

If you notice that you end up with substantial amounts of this logic for presenting data, you should consider if the data model can be improved. Is there some reason that you might be devoting a lot of code to filtering data out of a single table? Would it be better if it was split into a separate table?
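For instance, a hypothetical sketch of a model whose class methods mostly exist to carve one table into disjoint subsets:

```ruby
# Hypothetical sketch: when most of a model's methods filter the same table
# down to disjoint subsets, those subsets may deserve their own tables.
class Store
  ROWS = [
    { name: "Acme Downtown", type: :physical },
    { name: "Acme Online",   type: :online }
  ].freeze

  # Each method below exists only to strip out rows of the "other" kind.
  def self.physical
    ROWS.select { |row| row[:type] == :physical }
  end

  def self.online
    ROWS.select { |row| row[:type] == :online }
  end
end
```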


Lifecycle

I think the lifecycle of an object matters. An extreme example, for explanation purposes, would be a table that stores data for both a TemporaryMessage and a LongLivedMessage. These two types of data are managed differently, their access is probably controlled differently, and they are purged from the system according to different business logic.

I think this is especially nefarious because if one data model has multiple lifecycles, then every time an engineer works with those models in new code, they need to remember that there are different types, each needing special treatment. This can be avoided by having different classes for different things.
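A tiny sketch of that burden (a hypothetical Message class, plain Ruby):

```ruby
# Hypothetical sketch: one Message class forces every caller to branch on
# lifecycle, where two classes would make the difference explicit.
require "date"

class Message
  attr_reader :kind, :created_at

  def initialize(kind, created_at)
    @kind = kind
    @created_at = created_at
  end

  # Every new piece of code touching messages must remember this branch.
  def purgeable?(today)
    case kind
    when :temporary  then (today - created_at) > 7 # purged after a week
    when :long_lived then false                    # retained indefinitely
    end
  end
end
```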

Testing Wisdom

I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence (I suspect this level of confidence is high compared to industry standards, but that could just be hubris). If I don’t typically make a kind of mistake (like setting the wrong variables in a constructor), I don’t test for it. I do tend to make sense of test errors, so I’m extra careful when I have logic with complicated conditionals. When coding on a team, I modify my strategy to carefully test code that we, collectively, tend to get wrong.

Different people will have different testing strategies based on this philosophy, but that seems reasonable to me given the immature state of understanding of how tests can best fit into the inner loop of coding. Ten or twenty years from now we’ll likely have a more universal theory of which tests to write, which tests not to write, and how to tell the difference. In the meantime, experimentation seems in order.

Kent Beck on unit test coverage, via Stack Overflow

I like this. Very straightforward, no complexity. “I write unit tests for things that are complicated or might break” is probably a sane strategy for all levels of testing: look for the level at which there’s enough complexity below it that it’s easy for something to break.
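As a sketch of that philosophy, here’s a hypothetical function (the function and its rates are invented for illustration) where the complicated conditional is exactly where the tests get concentrated:

```ruby
# Hypothetical shipping-rate function: trivial accessors get no tests,
# but the branchy pricing logic is where mistakes tend to happen.
def shipping_rate(weight_kg, express: false)
  return 0 if weight_kg <= 0                      # nothing to ship
  base = weight_kg <= 2 ? 5 : 5 + (weight_kg - 2) * 2
  express ? base * 2 : base                       # express doubles the rate
end

# Tests aim at the boundaries of the conditional, not at everything.
raise "boundary" unless shipping_rate(2) == 5
raise "over"     unless shipping_rate(3) == 7
raise "express"  unless shipping_rate(2, express: true) == 10
```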

Dependency Inversion Principle

Dependency Inversion (sometimes referred to as DIP) is one of the five SOLID OO principles that have become so popular in recent years. My opinion is that it’s a highly valuable concept that is not well-named. “Dependency inversion” doesn’t mean a great deal on its own, and a lot of attempts to explain it tend to get very heady.

(Speaking of unfortunate naming, “hexagonal architecture” is another concept that I find poorly named. Ironic that the concept is about using abstraction to avoid unnecessary coupling, yet the name itself couples the idea to the number 6, which is totally unrelated to the idea. Fortunately, it’s slowly being renamed as “ports and adapters” in most discussions, which is a much better description.)

My personal preference is towards explanations that are more intuitive, and I’d like to put my two cents in for Dependency Inversion.

To me, a good way of describing Dependency Inversion is: use classes to separate the features of the application (the parts that a user might use) from the technology that makes those features work.

For example, let’s say you have a feature where a user can sign themselves up, and once the user’s information is saved to the database, some more steps need to take place. You could make an interactor called CompleteUserSetup, and it will handle the actions necessary (instead of using a callback).

Let’s say one of the things it needs to do is send an email welcoming the user. Let’s also pretend that you use SendGrid to manage emails. That would mean we could end up with something like

class CompleteUserSetup
  def self.perform
    SendGridClient.email "subjectline", "body of email"
  end
end


Will this run? Absolutely. Is it a good example of code that meets the criteria for Dependency Inversion? No.

What’s happening here is that the feature (user setup) is directly mentioning the technology that implements it (SendGrid). It literally has the name of the tool in the code that is defining the feature.

Since there are probably lots of places in the code that send email, if you ever need to switch away from SendGrid, you’ll need to change all of those places. It’s usually not as simple as find-and-replacing every instance of “SendGrid” in the code, so a better way is to abstract all of the email interactions into your own class.

class CompleteUserSetup
  def self.perform
    OurEmailClient.send_welcome_email
  end
end

class OurEmailClient
  def self.send_welcome_email
    SendGridClient.email "subjectline", "body of email"
  end
end


Now the feature doesn’t mention the implementation technology. If you kept this pattern going, “SendGrid” would only ever appear in your email client class, which means the technology is decoupled from the features. This is the Adapter pattern, which is basically the first half of Dependency Inversion. (IMHO this is the most common implementation of it, and it makes testing really easy.)

To close the loop on “Dependency Inversion,” we actually pass the interactor an instance of an email client, so it can work with more than one.

class CompleteUserSetup
  def self.perform(client)
    client.send_welcome_email
  end
end

class OurEmailClient
  def send_welcome_email
    SendGridClient.email 'subjectline', 'body of email'
  end
end

class OurOtherEmailClient
  def send_welcome_email
    MailChimpClient.email 'subject', 'body of email'
  end
end

email_client = OurEmailClient.new

# or

email_client = OurOtherEmailClient.new

CompleteUserSetup.perform(email_client)

Now we’re actually passing the dependency into the location where it’s needed. This means the CompleteUserSetup interactor is totally decoupled from which messaging system it will use.

The reason this matters is that now we can choose any type of email provider we currently support, and we can also add new types of email providers that we didn’t previously use.

To be fair, this isn’t quite the same in Ruby as in Java. Ruby allows duck typing, which means we don’t have to write an Interface (in fact, Ruby doesn’t support Interfaces at all, so that aspect is missing). Still, the principle behind the decoupling makes it very easy to write readable, testable code, and I always love that.
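As a sketch of that testability claim, a hypothetical FakeEmailClient can stand in for the SendGrid-backed one, since the interactor only cares that its argument responds to send_welcome_email:

```ruby
# Runnable sketch: the interactor depends only on the send_welcome_email
# "duck type", so a fake (hypothetical FakeEmailClient) works in tests.
class CompleteUserSetup
  def self.perform(client)
    client.send_welcome_email
  end
end

class FakeEmailClient
  attr_reader :sent

  def initialize
    @sent = []
  end

  def send_welcome_email
    @sent << :welcome # record the call instead of emailing anyone
  end
end

fake = FakeEmailClient.new
CompleteUserSetup.perform(fake)
fake.sent # => [:welcome]
```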

How to specify the schema using Spark’s Java client

I’ve been working with Spark recently. It’s awesome. Unfortunately, the Spark community leans towards Scala, but the Java client was the best choice for our team. This means that sometimes the documentation and examples out there aren’t great. One thing that took me longer than I liked was figuring out how to specify the schema for the data.

Spark has a nice default behavior where it will infer the schema of your data when you load it. The trouble with this is that if you change the data over time (adding new attributes for example), you can run into issues with your code only working with certain versions of the data and not others.

Fortunately, you can specify the schema yourself, so that missing fields will simply come through as nulls. I found lots of examples of how to do this in Scala, but it was hard to find examples in Java. So, here’s how.

Let’s pretend the following is the contents of `whatever.json` (the `company_name` value here is a made-up placeholder):

{
  "company_name": 12345,
  "address": {"street": "Spear St", "city": "San Francisco"},
  "rank": 100
}

This would correspond to the following code:

SparkConf conf = new SparkConf().setAppName("MyFunApp");
JavaSparkContext sparkCtx = new JavaSparkContext(conf);
HiveContext context = new HiveContext(sparkCtx.sc());
String sourceUrl = "whatever.json";

StructType schema = DataTypes.createStructType(Arrays.asList(
  DataTypes.createStructField("company_name", DataTypes.LongType, false),
  DataTypes.createStructField("address", DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("street", DataTypes.StringType, true),
    DataTypes.createStructField("city", DataTypes.StringType, true)
  )), true),
  DataTypes.createStructField("rank", DataTypes.LongType, true)
));

DataFrame changesRaw = context.read().schema(schema).json(sourceUrl).cache();

The third `createStructField` parameter means “can the value be null?”

How to hire engineers

From IRC:


There’s a saying that you are the average of the 5 people you surround yourself with. If we apply that logic to hiring, it means that hiring is really important (but we knew that already).

My friend’s question was an interesting one. As an engineer, I know about writing code and judging engineering fit. If I put my bizdev/marketing hat on, I start to think about the process of interviewing, and how to best allocate my resources (time) in order to qualify a target and convert them (hiring).

There’s one concept I think is important to share with my engineering friends: the idea of a conversion funnel. The gist is that in a lot of business situations, you have a group of people, an action you want some percentage of those people to take, and a series of steps they go through (the stages of the funnel).

A (hilariously simple) conversion funnel for Apple might be

  1. Find out about new iPhone (begin funnel)
  2. Read about new iPhone on internet
  3. Try new iPhone in store
  4. Purchase iPhone (end of funnel)

Hiring engineers (or any role) can be thought of in this way. For example:

  1. Create awareness of your company/the job. This could be through blogging, buying ads, or posting on Craigslist.
  2. Create enough desire to get the person to submit an application. This could be talking up perks, describing the work environment and technologies used, or having a really great application process. Anything that makes your company attractive to work at goes in here.
  3. Review applications/resumes. This is early in the funnel, so you want to spend very little time on it. Are they worth any time at all, or are they totally not a fit? I recommend having a non-interviewer conceal each person’s name from the reviewers. There is strong evidence that something as simple as a name can affect the judgment of people with good intentions.
  4. Send them a screener problem, and review answers to it. Ideally you should be able to decide to move to the next step (or reply with “no thanks”) with about 15-20 minutes of effort. Again, do this with names concealed, the goal is to focus totally on the code.
  5. 1-2 hour interview. Be sure to talk about their screener problem and understand their engineering sense. You should also start getting a sense of what the person might be like to work with, but try to stay open-minded about this until the next step. I like to start the discussion by asking the person how they feel about interviewing; if someone is nervous about being on the spot, spending a minute or two talking about that nervousness can help them let go of it and focus, which means you’ll get a better picture of what they’re really like. Also, I typically do this interview via Skype, and ask them to screenshare with me and write some code using whatever tools or resources they normally would. You can quickly learn a lot about an engineer by spending 15 minutes watching them code in their own comfortable environment.
  6. 4-8 hour pairing session. This should be in-person, unless you’re hiring a remote engineer. The best way to find out what someone is like to work with is to work with them under the most realistic circumstances possible. If possible, ensure that you’ll encounter specific scenarios so that you can gauge skills consistently from one candidate to another. The more objective you can be here, the better your results will be.

Notice that the intent is to minimize effort at the beginning of the process, and do the more intense quality screening at the end.

Above all — customize your funnel in a way that makes sense for you and your situation. There’s not necessarily a right or wrong answer. Some people prefer take-home interviews, and I can see the merit of that as well. If you document your process for this, you can experiment with it over time and end up with a formula that gets engineers who are a great fit for what you need.

The difference that focus makes

Over the past 48 hours, we had some things happen at Ship.io with respect to email delivery. I found some of the takeaways interesting and felt like writing about them, and how they connect to larger business and strategy ideas. With respect to Ship and the companies involved, I’m going to stay light on the details and look more at the concepts I see behind the issue.

Mandrill and SendGrid are two very big players in the email delivery space.

Mandrill is owned by Rocket Science Group. Rocket Science also controls MailChimp, TinyLetter, and Gather, all of which are very marketing-focused products.

SendGrid is a company that focuses on developers as customers. They aggressively brand themselves as developer-focused, and show up at every hackathon they can (one of the Ship developers has three different SendGrid shirts that he’s been given at hackathons).

A marketing focused company is going to attract marketing people, who think of the universe through a marketing lens. A developer company is going to attract engineer-minded people, who think of the universe through an engineering lens.

As Porter taught us, one way of analyzing a company’s strategy (and their strengths and weaknesses) is to look at what their team has done in the past. Past experiences will inevitably shape future decisions.

If you were choosing an email service provider, which one would you choose? Which company do you think understands your view of the world? Which one do you think will create features that lend themselves to your use case?

The more software you have, the more software you need.

Jevons Paradox says that as a resource comes to be used more efficiently, more of it ends up being used overall. That is, the better the deal you’re getting, the more you end up buying.

This example of real-world compounding reminded me of another example: software. There’s an interesting quirk about software — the more you have of it, the more you need.

Let’s trace through one recent path of the software industry. We start with a website.

A website is a great idea. Websites (code) make it easy to distribute information. There’s so much information that people create dynamic websites (more code).

Dynamic websites are useful, but once you’ve got dynamic data, you want the data available via API (even more code) so that outside developers can build applications (lots more code).

To help with this, you build an API management layer (tons more code), which produces information about all these other applications.

You want that data combined in a dashboard (still even more code) alongside data from all of the other software that relates to your business (which is even more code than all of the other code so far).


Anything that cannot go on forever must stop. But there doesn’t seem to be any reasonable end in sight. This is one small piece of the software industry that points to a bigger trend, which seems to contradict common sense. People intuitively understand supply and demand. But here, supply creates more demand.

It’s a good time to be a programmer.

Voxeo, Tropo, & ORUG

I went out to ORUG tonight. Voxeo was presenting a thing they’re working on, Tropo. Disclosure: they bought us dinner. Full disclosure: I think this thing is really tight.

I used to help set up phone systems in high school, and phone trees have always seemed like kind of a mystery. Tropo lets you build whole phone apps, and it’s ridiculously easy. It’s basically a phone-system DSL. They handle text-to-speech, speech-to-text, and playing recorded sound files; there are lots of convenience features for capturing different types of input, handling error cases, recording calls, transferring calls, etc. They give you local phone numbers in different area codes, they’ve also got Skype integration, and a few other ways to connect to the system. The very cool part is that it’s all free to play around with; once you start using it for commercial reasons, you have to pay.

Ever hear of Google’s Grand Central? With this, you could easily make your own. I’ve been playing around with a few things using Tropo’s Ruby setup, and I’ve put the demo code on GitHub. Very cool stuff.

You can write apps in Ruby, PHP, Python, JavaScript, and Groovy (“Java++”). There’s a bunch of example code on their site, and development is really easy. For example:


digits = $currentCall.callerID.to_s.split('')

area_code = digits[0..2]
city_code = digits[3..5]
subscriber_number = digits[6..9]

# single dashes get spoken as 'dash', use doubles for a pause.
# Double commas don't work, neither do extra spaces
say "-- -- -- S-up. Your phone number is -- #{area_code.join(',')}--#{city_code.join(',')}--#{subscriber_number.join(',')}"
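
Outside the Tropo runtime, the digit-grouping logic above can be exercised in plain Ruby (the phone number below is made up; no `$currentCall` needed):

```ruby
# Plain-Ruby sketch of the caller-ID grouping above; "4155551234" is a
# made-up example number standing in for $currentCall.callerID.
digits = "4155551234".split('')

area_code = digits[0..2]
city_code = digits[3..5]
subscriber_number = digits[6..9]

spoken = "#{area_code.join(',')}--#{city_code.join(',')}--#{subscriber_number.join(',')}"
# => "4,1,5--5,5,5--1,2,3,4"
```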


There is a debugger that you can print messages to. Right now there’s a *ton* of output to it, but you’ll find your messages in there.

One thing: I was getting a message that the caller was “not accepting calls at this time.” I realized this was caused by a parse/compile error in my script. So, if you can’t get something to load, check for that. The debugger doesn’t seem very helpful here; I got a generic-seeming Java Exception for a variable-name typo. They use Java under the hood for tons of stuff, so even though I’m writing Ruby code, it gets interpreted in Java.

I did learn a cool fact about these phone trees. You know how a lot of phone trees suck when you try to talk to them? Well, speech-to-text conversion can only hit around an 80% success rate. The reason is that phone audio carries only around 64kbps of data. There’s too much loss for the algorithms to work well. That’s why apps that run on the local computer/phone can do better: they embed part of the recognition algorithm in the client.

And, on a final note: skateboarding through downtown is awesome.