How to specify the schema using Spark’s Java client

I’ve been working with Spark recently. It’s awesome. Unfortunately, the Spark community leans towards Scala, but the Java client was the best choice for our team. This means that sometimes the documentation and examples out there aren’t great. One thing that took me longer than I liked was figuring out how to specify the schema for the data.

Spark has a nice default behavior where it will infer the schema of your data when you load it. The trouble with this is that if you change the data over time (adding new attributes for example), you can run into issues with your code only working with certain versions of the data and not others.

Fortunately, you can specify the schema, so that the fields will exist as nulls. I found lots of examples for how to do this in Scala, but it was hard to find examples in Java. So, here’s how:

Let’s pretend the following is a sample record from `whatever.json`:

{
  "address":{"street":"Spear St","city":"San Francisco"},
  "rank": 100
}

This would correspond to the following code:

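import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkConf conf = new SparkConf().setAppName("MyFunApp");
JavaSparkContext sparkCtx = new JavaSparkContext(conf);
HiveContext context = new HiveContext(sparkCtx.sc());
String sourceUrl = "whatever.json";

StructType schema = DataTypes.createStructType(Arrays.asList(
  DataTypes.createStructField("company_name", DataTypes.StringType, false),
  DataTypes.createStructField("address", DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("street", DataTypes.StringType, true),
    DataTypes.createStructField("city", DataTypes.StringType, true)
  )), true),
  DataTypes.createStructField("rank", DataTypes.LongType, true)
));

// Load the JSON with the explicit schema instead of letting Spark infer it
DataFrame changesRaw = context.read().schema(schema).json(sourceUrl);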

The third createStructField param is “can the value be null?”
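
To see why a fixed schema helps when the data changes over time, here is a minimal plain-Java sketch of the idea. It does not use Spark at all. The class name `FixedSchemaSketch`, the helper `applySchema`, and the sample record are hypothetical names made up for illustration. The point is only that projecting every record onto the same list of fields means a missing field shows up as a null rather than not existing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FixedSchemaSketch {
    /** Project a record onto a fixed list of field names; missing fields become null. */
    public static Map<String, Object> applySchema(List<String> fields, Map<String, Object> record) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String field : fields) {
            // Map.get returns null when the key is absent, so the field still exists in the row
            row.put(field, record.get(field));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("company_name", "address", "rank");

        // Hypothetical older record, written before "rank" was added to the data
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("company_name", "Example Co");

        Map<String, Object> row = applySchema(schema, oldRecord);
        System.out.println(row.containsKey("rank")); // the field exists...
        System.out.println(row.get("rank"));         // ...but its value is null
    }
}
```

Code that reads `rank` can then handle a null in one place instead of breaking on older versions of the data.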

Keeping standards high when hiring engineers

My last post on hiring engineers got a lot of attention, and I wanted to do a followup with a technique that you can use to keep hiring standards high.

Typically engineering hires happen in a situation where there’s a serious need of another set of hands to do some work. I say this in contrast to the idea that sometimes gets mentioned in the tech industry of “if you find someone really good, just hire them and then figure out what to do with them”.

This idea matters because when there is a serious need, the people closest to the situation are the ones who know what skills are needed, and these are the people who are brought in to handle the interview. This is entirely reasonable, but it also means that the people tasked with the hiring decision all have an incentive for a hire to occur — they want the position to be filled so someone can take over the work. When incentives are strong enough, it sometimes leads people to make decisions that they otherwise might not — hiring a candidate who isn’t as good as they should be to avoid some short term pain.

If a team is really committed to hiring very good candidates, there is a straightforward way of minimizing this type of mistake. When the interview team is put together, you should include one person who isn’t directly impacted by the hire. This person won’t have the same time pressure as the people who are directly impacted. Their role is to be a detached observer and help keep standards high in the face of the (relatively) short term pressure to make any hiring decision.


How to hire engineers

From IRC:


There’s a saying that you are the average of the 5 people you surround yourself with. If we apply that logic to hiring, it means that hiring is really important (but we knew that already).

My friend’s question was an interesting one. As an engineer, I know about writing code, and engineering fit. If I put my bizdev/marketing hat on, I start to think about the process of interviewing, and how to best allocate my resources (time) in order to qualify a target and convert them (hiring).

One concept that I think is important to share with my engineering friends is the idea of a conversion funnel. The gist is that in a lot of business situations, you have a group of people, you have an action you want some percentage of those people to do, and you have a series of steps that they go through (the stages of the funnel).

A (hilariously simple) conversion funnel for Apple might be

  1. Find out about new iPhone (begin funnel)
  2. Read about new iPhone on internet
  3. Try new iPhone in store
  4. Purchase iPhone (end of funnel)

Hiring engineers (or any role) can be thought of in this way. For example:

  1. Create awareness of your company/the job. This could be through blogging, buying ads, or posting on Craigslist.
  2. Create enough desire to get the person to submit an application. This could be talking up perks, describing the work environment and technologies used, or having a really great application process. Anything that makes your company attractive to work at goes in here.
  3. Review applications/resumes. This is early in the funnel, so you want to spend very little time on it. Are they worth any time at all or are they totally not a fit? I recommend having a non-interviewer conceal the person’s name from the reviewers. There is strong evidence that people with good intentions can have something as simple as a name affect their judgement.
  4. Send them a screener problem, and review answers to it. Ideally you should be able to decide to move to the next step (or reply with “no thanks”) with about 15-20 minutes of effort. Again, do this with names concealed; the goal is to focus totally on the code.
  5. 1-2 hour interview. Be sure to talk about their screener problem and understand their engineering sense. Also, you should start getting a sense of what the person might be like to work with, but try and stay open minded about this one until the next step. I like to start the discussion by asking the person how they feel about interviewing — if someone is nervous about being on-the-spot, spending a minute or two to talk about that feeling of nervousness can help them get rid of the feeling so they can focus, and this means you will get a better picture of what they are really like. Also, I typically do this interview via Skype, and ask them to screenshare with me and write a blog using whatever tools or resources they normally would. You can quickly learn a lot about an engineer by spending 15 minutes watching them code in their own comfortable environment.
  6. 4-8 hour pairing session. This should be in-person, unless you’re hiring a remote engineer. The best way to find out what someone is like to work with is to work with them under the most realistic circumstances possible. If possible, ensure that you’ll encounter specific scenarios so that you can gauge skills consistently from one candidate to another. The more objective you can be here, the better your results will be.

Notice that the intent is to minimize effort at the beginning of the process, and do the more intense quality screening at the end.
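
To put rough numbers on that, here is a small sketch that adds up reviewer time across the stages. The per-candidate minutes loosely follow the steps above (20 minutes for the screener, 90 for the interview, 360 for pairing). The 2 minutes for resume review, the pass rates, and the 100 applicants are assumptions made up for illustration, as are the names `HiringFunnelSketch`, `Stage`, and `totalMinutes`.

```java
import java.util.Arrays;
import java.util.List;

class HiringFunnelSketch {
    /** One funnel stage: reviewer minutes per candidate and the fraction who advance. */
    static final class Stage {
        final String name;
        final double minutesPerCandidate;
        final double passRate;

        Stage(String name, double minutesPerCandidate, double passRate) {
            this.name = name;
            this.minutesPerCandidate = minutesPerCandidate;
            this.passRate = passRate;
        }
    }

    /** Total reviewer minutes when `applicants` candidates enter the first stage. */
    public static double totalMinutes(List<Stage> stages, double applicants) {
        double remaining = applicants;
        double total = 0.0;
        for (Stage stage : stages) {
            total += remaining * stage.minutesPerCandidate; // everyone still in the funnel costs time
            remaining *= stage.passRate;                    // only some advance to the next stage
        }
        return total;
    }

    /** Hypothetical funnel: the minutes and pass rates are illustrative assumptions. */
    public static List<Stage> exampleFunnel() {
        return Arrays.asList(
            new Stage("resume review", 2, 0.25),
            new Stage("screener problem", 20, 0.40),
            new Stage("interview", 90, 0.50),
            new Stage("pairing session", 360, 1.00));
    }

    public static void main(String[] args) {
        double funnel = totalMinutes(exampleFunnel(), 100);
        // Baseline for comparison: give every applicant the pairing session straight away
        double pairingForAll = 100 * 360.0;
        System.out.println("Funnel minutes: " + funnel);
        System.out.println("Pairing-for-everyone minutes: " + pairingForAll);
    }
}
```

With these made-up numbers, the funnel costs about 3,400 reviewer minutes, against 36,000 if every applicant got a pairing session. Swap in your own stages and rates to see where your time actually goes.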

Above all — customize your funnel in a way that makes sense for you and your situation. There’s not necessarily a right or wrong answer. Some people prefer take-home interviews, and I can see the merit of that as well. If you document your process for this, you can experiment with it over time and end up with a formula that gets engineers who are a great fit for what you need.

Hard Work

I think this is a great representation of the hard work that goes into producing great results, and the lengths that it takes. This guy is working on a commercial for cereal, and to make sure that it looks good for the camera, he’s sorting pieces of cereal by hand.

I think that more often than we realize, great results are a matter of working hard at the most obvious stuff. Do you want a bowl of cereal to look good? Then sort through a few boxes until you have enough great-looking pieces of cereal to make a bowl. There’s no genius-moment-of-inspiration here, there’s no clever trick…just hard work.

Sorting cereal for commercial shoot

Time management for fathers

This is for the dads out there…

I read Getting Things Done years ago, and it had an immediate impact on my life. One of the tips saved me so much time each day that I decided to work out the savings over the course of my life. I found that with this one tip, I would save over two years of my life, a number that surprised me.

If a doctor walked into a room and told you they had found a drug that would, with no side effects, give you an extra two years of quality life, they would be hailed as a hero the world around. As a result, I’m always keeping an eye out for ideas that might help me get a happier, healthier, and more productive life.

I recently had my first kid, and while being a father is awesome in many ways, a child does require a lot of time, especially in the first 5-6 years. I quickly found that there were some ways in which I wasn’t making the most of my time.

I want to clarify — I don’t mean “making the most of my time” in an unpleasant sense of “grinding maximum productivity out of every moment”. Sometimes making the most of time can include drumming, watching TV, going for walks, or playing games (have you seen the new Arkham Knight?). These things are considered by some to be wastes of time, although there is a material benefit to them if they bring you joy. Even Steve Jobs went home and sat in front of the TV.

Given that being a parent has many unique constraints, a new book caught my eye that seemed like it might be relevant to me. The thing that surprised me a bit was that it was a book written for women: I Know How She Does It.

To me, there are many similarities when it comes to women and men who are trying to be successful in business and in their personal lives, so I decided to read it. I’ve got to say, it’s worth a read, even for men.

Any sociologist will tell you there’s a difference between how people self-assess their own behavior when compared to a more objective measurement. This is what made me so interested in this book. The author, Laura Vanderkam, gathered time logs from a large group of women who were both mothers and successful in their careers.

In these time logs, her subjects recorded every 30 minutes what they were doing during that time. There’s no right or wrong here, just simply recording what it was. This is far more objective than anecdotal stories of how people spent their time, and in some cases, the women keeping the logs were surprised to find out how they spent their time.

Laura then takes this data and presents her analysis of it, along with some great case studies that both support her conclusions and give the reader an opportunity to mirror the techniques.

I don’t want to attempt to reproduce the contents of her book here, only to share how fresh and valuable I think her approach is, in taking hard data and applying it to the question of how to live a happy and productive life as a parent, for mothers and fathers.

Uber drivers are employees, doesn’t matter


Uber drivers are employees, not contractors -Calif. Labor Commission

There’s a difference between an Uber driver who drives 10-15 hours a week and an Uber driver who drives 40-50 hours a week.

Uber will make that case on appeal, then push hard for the limit to be set around 25-30.

6 months from now, Uber’s data scientists will find a statistical relationship that shows drivers who work fewer hours provide better customer service.

Uber will say that in the interests of providing the best experience, its engineers have changed the algorithm that selects drivers to prefer drivers who are under the legally agreed “cap” that divides part-time from full-time drivers.

5 years from now, Uber will launch its fleet of self-driving cars, and the whole discussion will be irrelevant.

When a jack of all trades wins

Everyone knows the expression “jack of all trades, master of none”. I remember a talk by Adam Savage of Mythbusters, where he brings that phrase up, and says that the real phrase is “jack of all trades, master of none, though often better than a master of one”.

During the 90s, the term “T-shaped individual” became popular, and the tech industry fell in love with the concept. The idea there is that while a person might have a wide breadth of skills in many areas (the horizontal part of the T), there is one area that they have deep knowledge of (the vertical part of the T).

Technology and business are areas where I think being aware of (and respecting) other areas of expertise is important, because it’s possible to go very, very deep. It’s impossible for one person to be really deeply aware of all areas. To me, the solution is to cultivate a respect for other domains. A sign of someone who deeply respects other domains is that they try to build relationships with experts in those other areas.

This came to mind the other day, when two different articles popped up on my radar. One was about integrating salespeople into the rest of the business, and the other was about how designers need to understand the full depth of a business, and not just make nice looking pictures.

It’s all too easy to shoehorn a business function into “just do your task and don’t worry about the rest”. Unless you’re exceptionally world-class at one skill (and even then!), it’s worth being mindful of the others.

The difference that focus makes

Over the past 48 hours, we had some things happen at Ship with respect to email delivery. I found some of the takeaways interesting, and felt like writing about them and how they connect to larger business and strategy ideas. With respect to Ship and the companies involved, I’m going to stay light on the details, and look more at the concepts I see behind the issue.

Mandrill and SendGrid are two very big players in the email delivery space.

Mandrill is owned by Rocket Science Group. Rocket Science also controls MailChimp, TinyLetter, and Gather, which are all very marketing-focused products.

SendGrid is a company that focuses on developers as customers. They aggressively brand themselves as developer-focused, and show up at every hackathon they can (one of the Ship developers has three different SendGrid shirts that he’s been given at hackathons).

A marketing-focused company is going to attract marketing people, who think of the universe through a marketing lens. A developer-focused company is going to attract engineer-minded people, who think of the universe through an engineering lens.

As Porter taught us, one way of analyzing a company’s strategy (and their strengths and weaknesses) is to look at what their team has done in the past. Past experiences will inevitably shape future decisions.

If you were choosing an email service provider, which one would you choose? Which company do you think understands your view of the world? Which one do you think will create features that lend themselves to your use case?

An innate understanding of ROI

I think humans have an innate grasp of many business concepts. For example, “return on investment”. Business schools will complicate the idea of ROI in any number of creative ways, as they do many concepts. I once saw an accounting professor confuse an entire room of college students on the topic of averaging numbers with an overly complicated formula and Greek symbols.

Let’s imagine that you are dropped into a fictional time in the past, in a Rousseau-ish land of humans who lie around all day, plucking fruit from trees when they are hungry. Rousseau calls your fellow humans “noble savages”.

You look to your left, and see a savage MBA sitting, calculating how many calories they use per day, how many calories are contained in a piece of fruit, and how many calories they can allocate to climbing a particular tree to allow them to meet their caloric needs with a certain number of minutes spent gathering fruit each day…

…then you look right and see a tree whose branches have sunk low to the ground, heavy with fruit, and you walk over, and (savagely) grab a piece, because you don’t feel like climbing up and down the same tree all day. Return on investment.