I’ve been working with Spark recently. It’s awesome. Unfortunately, the Spark community leans towards Scala, while the Java API was the best choice for our team, so the documentation and examples out there are sometimes lacking. One thing that took me longer than I liked was figuring out how to specify a schema for the data.
Spark has a nice default behavior: it infers the schema of your data when you load it. The trouble is that the inferred schema only contains the fields actually present in the files it reads, so if the data changes over time (adding new attributes, for example), code that references a newer field will work against some versions of the data and fail against others.
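To make the failure mode concrete, here’s a minimal sketch of the inference behavior. The file name `old_whatever.json` is hypothetical; it stands in for an older batch of data written before a `rank` attribute existed.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf().setAppName("InferredSchemaExample");
JavaSparkContext sparkCtx = new JavaSparkContext(conf);
HiveContext context = new HiveContext(sparkCtx.sc());

// The schema is inferred from whatever fields happen to appear in this file.
DataFrame inferred = context.read().json("old_whatever.json");

// If old_whatever.json predates the "rank" attribute, the inferred schema
// has no "rank" column, and this select throws an AnalysisException.
inferred.select("rank").show();
```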
Fortunately, you can specify the schema yourself, so that missing fields simply show up as nulls. I found lots of examples of how to do this in Scala, but it was hard to find examples in Java. So, here’s how.
Let’s pretend the following record describes the shape of `whatever.json`:
{ "company_name":"", "address":{"street":"Spear St","city":"San Francisco"}, "rank": 100 }
This would correspond to the following code:
```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkConf conf = new SparkConf().setAppName("MyFunApp");
JavaSparkContext sparkCtx = new JavaSparkContext(conf);
HiveContext context = new HiveContext(sparkCtx.sc());

String sourceUrl = "whatever.json";

// Declare every field up front so each one exists (possibly as null)
// no matter which version of the data we happen to be reading.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("company_name", DataTypes.StringType, false),
    DataTypes.createStructField("address", DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("street", DataTypes.StringType, true),
        DataTypes.createStructField("city", DataTypes.StringType, true)
    )), true),
    DataTypes.createStructField("rank", DataTypes.LongType, true)
));

DataFrame changesRaw = context.read().schema(schema).json(sourceUrl).cache();
```
The third parameter to `createStructField` is “can this value be null?”
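To see the payoff, here’s a short sketch that reuses the `schema` from above against the same hypothetical older file, `old_whatever.json`, which predates the `rank` attribute. Because the field is declared in the schema, selecting it now returns nulls instead of throwing:

```java
// With the explicit schema, "rank" exists even in files written before
// the attribute was added; those rows just carry null for it.
DataFrame oldData = context.read().schema(schema).json("old_whatever.json");
oldData.select("company_name", "rank").show();

// Nested fields declared in the schema are reachable with dotted paths.
oldData.select("address.city").show();
```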