Hi, I'm looking for APIs to write Parquet files from the POJOs that I have.
I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter.
However, I am not able to find a way to convert the POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write the POJOs out to Parquet files.
Any suggestions?
If you want to go through Avro, you have two options:
1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs extend SpecificRecord, which can then be used with AvroParquetWriter.
2) Write the conversion from your POJO to GenericRecord yourself. You can do this manually, or a more generic solution would be to use reflection. However, I encountered difficulties with this approach when I tried to read the data back: based on the supplied schema, Avro found the POJO on the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.
Parquet now also supports writing POJOs directly. Here is the pull request on the Parquet GitHub page. However, I don't think this is part of an official release yet; in other words, I did not find this code in Maven.
DISCLAIMER: The following code was written when I was in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, it is a lightweight (if inefficient) approach to what you need. The strategy is POJO -> AVRO -> PARQUET.
POJO -> AVRO: Declare a schema via reflection and declare writers and readers based on it. At conversion time, write the object to a byte stream and read it back as Avro.
AVRO -> Parquet: use the AvroParquetWriter included in the parquet-mr project.
// Reflection-based Avro schema plus writer/reader for the POJO (nulls allowed)
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

// Serialize this POJO to a byte stream with the reflect writer, then read the
// bytes back as a GenericRecord that AvroParquetWriter can consume.
public GenericRecord toAvroGenericRecord() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null,
            DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}
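For the AVRO -> Parquet step, a rough sketch using AvroParquetWriter from parquet-mr could look like the following. This is only an illustration: the class name PojoParquetWriter, the output path, and the batch-of-POJOs signature are placeholders, and it assumes the toAvroGenericRecord() method above lives on YOURCLASS.

import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class PojoParquetWriter {
    // Convert each POJO to a GenericRecord and append it to the Parquet file
    public static void writeToParquet(List<YOURCLASS> pojos, Schema avroSchema, String file)
            throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(file))
                .withSchema(avroSchema)
                .build()) {
            for (YOURCLASS pojo : pojos) {
                writer.write(pojo.toAvroGenericRecord());
            }
        }
    }
}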
One more thing: it seems the Parquet writers are currently very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66
In short, it does the following:
uses reflection to get the Avro schema from the POJO
using the schema and reflection, it converts POJOs to GenericRecord objects
reflection is applied recursively if the POJO contains other POJOs or lists of POJOs
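For flat POJOs, the core of that reflection-based conversion boils down to something like the sketch below. This is illustrative only, not the gist's actual code: it assumes the POJO field names match the Avro field names and skips the nested POJOs and lists that the gist handles recursively.

import java.lang.reflect.Field;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.reflect.ReflectData;

public class PojoToGenericRecord {
    // Copy each POJO field into a GenericRecord built from the reflected schema
    public static GenericRecord convert(Object pojo) throws ReflectiveOperationException {
        Schema schema = ReflectData.get().getSchema(pojo.getClass());
        GenericRecord record = new GenericData.Record(schema);
        for (Schema.Field avroField : schema.getFields()) {
            // assumes the POJO field name matches the Avro field name
            Field pojoField = pojo.getClass().getDeclaredField(avroField.name());
            pojoField.setAccessible(true);
            record.put(avroField.name(), pojoField.get(pojo));
        }
        return record;
    }
}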
Related
I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate the data using it; basically, I don't need 50% of the information and I need to write some logic to change a couple of fields. That's just an example.
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer(props)
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach {
        println(it.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When trying to run the code with CustomObj I get "Error deserializing key/value for partition". All I want to do is consume an event and then deserialize it into a much smaller object with just the selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work. I'm not sure how I can deserialize it into CustomObj from the GenericRecord. Let me just add that I don't have any access to the stream or its config; I can only consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data, so you must parse the larger object and map it to the smaller one yourself (which isn't what deserialization does).
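A rough sketch of that mapping, shown in Java for brevity: the field names "id" and "name" and the builder calls on the generated CustomObj are placeholders for whatever your .avsc actually defines.

import java.time.Duration;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CustomObjMapper {
    // Consume the full payload as GenericRecord, then copy only the wanted fields
    public static void consume(KafkaConsumer<String, GenericRecord> consumer) {
        while (true) {
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                GenericRecord full = record.value();
                CustomObj small = CustomObj.newBuilder()
                        .setId((Long) full.get("id"))          // placeholder field
                        .setName(full.get("name").toString())  // placeholder field
                        .build();
                // ... apply your transformation logic to `small` here
            }
        }
    }
}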
We already have the FlatBuffers library embedded in our software for simple schemas with JSON output data generation.
Update: we generate the header files using the flatc compiler against the schema and integrate these files into our code, along with the FB library, for further serialization/deserialization.
Now we also need the following schema tree to be supported.
namespace SampleNS;

/// user defined key value pairs to add custom metadata
/// key namespacing is the responsibility of the user
table KeyValue {
  key:string (key, required);
  value:string (required);
}

enum SchemaVersion:byte {
  V1,
  V2
}

table Sometable {
  value1:ubyte;
  value2:ushort (key);
}

table ComponentData {
  inputs: [Sometable];
  outputs: [Sometable];
}

table Node {
  name:string (key);
  /// IO definition
  data:ComponentData;
  /// nested child
  child:[Components];
}

table Components {
  type:ubyte;
  index:ubyte;
  nodes:[Node];
}

table GroupMasterData {
  schemaversion:SchemaVersion = V1;
  metainfo:[KeyValue];
  /// List of expected components in the system
  components:[Components];
}

root_type GroupMasterData;
As shown above, the table Components is nested recursively. The intention is that components may have children with the same fields.
I have a few queries:
flatc didn't give me any error during schema compilation for such recursively nested tables, but is field access supported for such tables?
I tried to generate a sample JSON data file based on the above schema, but I could not see the field for schemaversion. I learned that FB doesn't serialize default values, so I removed the default value that I had assigned in the schema. But it still doesn't get written into the JSON data file. I also learned we can force defaults to be written using the force_defaults option, but I don't know where this is to be put: in an attribute or elsewhere?
Can I create a struct with an enum field?
Is there any API to set FlatBuffers options that we otherwise pass as compiler arguments? If not, I think we may have to tinker with the FB library code. Please suggest.
Method 1:
In our serialization method, we do this:
flatbuffers::Parser* parser = new flatbuffers::Parser();
// output scalar fields in the generated JSON even when they equal their default
parser->opts.output_default_scalars_in_json = true;
Is this the right method or should I use any other API?
Yes, tree (and even DAG) structures are fully supported. The type definition is recursive, but the data will eventually have leaf nodes with an empty vector of children, presumably.
The integer value for V1 is 0, and that is also the default value for all fields with no explicit default assigned. Use --defaults-json to see this field when converting. Note that an explicit version field in a schema is an anti-pattern, since schemas are naturally evolvable without breaking backwards compatibility.
You can put enum fields in structs, yes. Is that what you mean?
It is overly complicated to load a matrix with Apache Commons Math using the utility:
MatrixUtils.deserializeRealMatrix(Object instance, String fieldName, ObjectInputStream ois)
since you have to implement a new class just to hold the deserialized matrix in the "fieldName" field.
Do you know of a better way? All I want to do is:
RealMatrix A = loadMatrix("myrealmatrix.dat");
The serialization/deserialization methods are for Java object serialization. The simplest way to load a RealMatrix from a file is to use code like this (modified to handle whatever format you are using to represent the source data) to load the file data into a double[][] array, and then use the createRealMatrix method in MatrixUtils.
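For example, a minimal sketch assuming the file stores one matrix row per line with values separated by whitespace (the MatrixLoader class and loadMatrix name are just illustrative, as is the file format):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class MatrixLoader {
    // Parse a whitespace-separated text file into double[][] and wrap it in a RealMatrix
    public static RealMatrix loadMatrix(String file) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(file));
        double[][] data = new double[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] tokens = lines.get(i).trim().split("\\s+");
            data[i] = new double[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                data[i][j] = Double.parseDouble(tokens[j]);
            }
        }
        return MatrixUtils.createRealMatrix(data);
    }
}

With that in place, RealMatrix A = MatrixLoader.loadMatrix("myrealmatrix.dat"); gives you roughly the one-liner you asked for.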
I have a task that I have to complete. I know the solution, but I want to make sure that my solution follows proper OOP/design patterns. Here is the scenario:
1- I have 2 files with different formats (let's say FormatA and FormatB).
2- I want to convert data in FormatA to FormatB.
3- FormatA is a plain text file with keys; each new line is a new key/value pair. FormatB is XML.
4- The keys in the FormatA file can be the same as the keys in FormatB, but can also be different. Sometimes we might need to do some calculations to convert a value to FormatB.
5- There is some chance that more keys will be added in the future to either the old format or the new one.
My solution:
I want the solution to be generic, with no hardcoding, so that if some key mapping changes in the future I should not have to change the code.
1- First I created a "Mapping" XML file that records which key in FormatA maps to which field in FormatB. The XML structure is something like this:
//oldKey = name of the key in the old file format
//newKey = name of the key in the new format
//ignore = optional; set it to true if you want to ignore this field during conversion
//function = optional; name of a function that will be called. This function has all the logic to do the calculations
//functionparams = optional; key names from the old file that are needed for the calculation
//defaultvalue = optional; if given, this value will be used no matter what
<field oldKey="abc" newKey="def" ignore="false" function="MultiplyBy2" functionparams="abc" defaultvalue="4" />
2- I created a class named "TextFileParser" that loads the text file and creates a dictionary with all the keys.
3- I created a class named "MappingXMLParser" which loads the Mapping XML file and populates a dictionary with all the data.
4- I created a class named "TextFileToXML" that uses the above 2 classes to write the data into the XML file. No composition is used.
5- I created a class named "Conversion". If the Mapping XML file declares some function (like "MultiplyBy2"), then the definition of that function lives in this class. I will use reflection to call the methods of this class from the class "TextFileToXML".
This is my design, but I do not know whether it is correct in terms of OOP/design patterns/architecture. Can you point out the mistakes? What could be done better, or is there a better approach?
As you probably understand, there is no such thing as a single correct design. I would suggest following standard practices. I would follow these steps:
Translate the key-value file to a simple XML file following the simplest possible XML schema.
Use an XSLT stylesheet to describe the translation between the simple XML file generated in the first step and the final result.
Execute the transformation described in the XSLT using an XML processing library; I suppose you can find one for the specific programming language you use.
This way, in case something changes (adding more keys in the future, changing the target XML schema, etc.), you would only have to change the translation process described in the XSLT, which is not application code but an XML file. Nothing about the specific translation process is hard-coded in your application.
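A minimal sketch of the last step, executing the XSLT in Java with the built-in javax.xml.transform API (the file names mapping.xslt, intermediate.xml and output-formatB.xml are placeholders):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltConverter {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet that describes the FormatA -> FormatB mapping
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("mapping.xslt")));
        // Apply it to the intermediate XML produced from the key/value file
        transformer.transform(new StreamSource(new File("intermediate.xml")),
                new StreamResult(new File("output-formatB.xml")));
    }
}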
Regarding the overall design, I would choose to use the Factory pattern. I think it matches your situation perfectly.
Hope I helped!
I have a job whose output format is SequenceFileOutputFormat.
I set the output key and value class like this:
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(SplitInfo.class);
The SplitInfo class implements Serializable and Writable.
I set the io.serializations property as follows:
conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,"
+ "org.apache.hadoop.io.serializer.WritableSerialization");
However, on the reducer side I get this error, telling me that Hadoop couldn't find a serializer:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:961)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:892)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:569)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
Can anyone help, please?
The problem was that I was making a stupid mistake: I was not updating the jar. So basically, SplitInfo was not implementing the Writable interface in the old jar that was actually in use.
As a general observation: the underlying cause of the error specified in the OP is that Hadoop can't find a Serializer for a specific type which you're trying to serialize (either directly or indirectly, e.g. by using that type as an output key/value). Hadoop cannot find a Serializer for one of 2 reasons:
your type is not serializable (i.e. it doesn't implement Writable or Serializable)
there is no Serializer available to Hadoop for the type of serialization your type implements (e.g. your type implements Writable but Hadoop, for one reason or another, cannot use the org.apache.hadoop.io.serializer.WritableSerialization class)
I think you're trying to do something you don't need to. Your output value only needs to implement the Writable interface and you should just set the output format.
conf.setOutputFormatClass(SequenceFileOutputFormat.class);
You only use the "io.serializations" configuration if you want to use a different serialization framework, which it doesn't look like you need.
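For reference, a minimal sketch of a Writable value type (the fields here are illustrative placeholders, not the OP's actual SplitInfo):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class SplitInfo implements Writable {
    // Placeholder fields; a real SplitInfo would carry whatever the job needs
    private long offset;
    private long length;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(offset);
        out.writeLong(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        offset = in.readLong();
        length = in.readLong();
    }
}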