How to load an Apache Commons Math RealMatrix

It is overly complicated to load a matrix with Apache Commons Math using the utility
MatrixUtils.deserializeRealMatrix(Object instance, String fieldName, ObjectInputStream ois)
since you have to implement a new class to hold the result in "fieldName".
Do you know of a better way? All I want to do is:
RealMatrix A = loadMatrix("myrealmatrix.dat");

The serialization / deserialization methods are for Java object serialization. The simplest way to load a RealMatrix from a file is to read the file data into a double[][] array (adapting the parsing to whatever format your source data uses) and then pass it to the createRealMatrix method in MatrixUtils.
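For example, here is a minimal sketch assuming a plain-text file with one whitespace-delimited matrix row per line; the loadMatrix helper and the file format are assumptions, and only MatrixUtils.createRealMatrix is the Commons Math API:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

// Reads a whitespace-delimited text file, one matrix row per line,
// into a double[][] and wraps it in a RealMatrix.
public static RealMatrix loadMatrix(String fileName) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(fileName));
    double[][] data = new double[lines.size()][];
    for (int i = 0; i < lines.size(); i++) {
        String[] tokens = lines.get(i).trim().split("\\s+");
        data[i] = new double[tokens.length];
        for (int j = 0; j < tokens.length; j++) {
            data[i][j] = Double.parseDouble(tokens[j]);
        }
    }
    return MatrixUtils.createRealMatrix(data);
}

With that in place, RealMatrix A = loadMatrix("myrealmatrix.dat"); works as asked.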

Related

Kafka, Avro and Schema Registry

I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate data using it; basically I don't need 50% of the information, and I need to write some logic to change a couple of fields. That's just an example:
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer()
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach {
        println(it.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When trying to run the code with the CustomObj, I get Error deserializing key/value for partition. All I want to do is consume an event and then deserialize it into a much smaller object with just the selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work, and I'm not sure how I can deserialize it into CustomObj from the GenericRecord. Let me just add that I don't have any access to the stream or its config; I can only consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data with this setup, so you must parse the larger object and map it to the smaller one yourself (which isn't what deserialization does).
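A minimal Java sketch of that parse-and-map approach; the CustomObj builder methods and the "id" and "amount" field names are assumptions for illustration:

import java.time.Duration;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// Consume the full GenericRecord, then copy over only the fields you keep.
ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(10));
for (ConsumerRecord<String, GenericRecord> record : records) {
    GenericRecord full = record.value();
    CustomObj small = CustomObj.newBuilder()        // builder generated from your .avsc
        .setId(full.get("id").toString())           // assumed field
        .setAmount((Double) full.get("amount"))     // assumed field
        .build();
    // ...apply your custom logic to small
}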

Write pojo's to parquet file using reflection

Hi, I'm looking for APIs to write Parquet with the POJOs that I have.
I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter.
However, I am not able to find a way to convert POJOs to GenericRecords (Avro); otherwise I could use AvroParquetWriter to write the POJOs out to Parquet files.
Any suggestions?
If you want to go through Avro you have two options:
1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs extend SpecificRecord and can then be used with AvroParquetWriter.
2) Write the conversion from your POJO to GenericRecord yourself, as sketched below. You can do this manually, or a more generic solution would be to use reflection. However, I encountered difficulties with this approach when I tried to read the data: based on the supplied schema, Avro found the POJO in the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.
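A hedged sketch of the manual variant of option 2; MyPojo and its id and name fields are assumptions for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.reflect.ReflectData;

// Derive the schema from the POJO class via reflection, then copy fields by hand.
Schema schema = ReflectData.AllowNull.get().getSchema(MyPojo.class);
GenericRecord record = new GenericData.Record(schema);
record.put("id", pojo.getId());     // assumed getter
record.put("name", pojo.getName()); // assumed getter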
Parquet also now supports writing POJOs directly. Here is the pull request on the Parquet GitHub page. However, I don't think this is part of an official release yet; in other words, I did not find this code in Maven.
DISCLAIMER: The following code was written when I was in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, it is a lightweight (if inefficient) approach to what you need. The strategy is POJO -> AVRO -> PARQUET.
POJO -> AVRO: declare a schema via reflection and declare writers and readers based on that schema. At conversion time, write the object to a byte stream and read it back as Avro.
AVRO -> PARQUET: use the AvroParquetWriter included in the parquet-mr project.
// Schema, writer and reader are derived once from the POJO class via reflection.
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

// Round-trips this POJO through Avro binary encoding to obtain a GenericRecord.
public GenericRecord toAvroGenericRecord() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}
One more thing: it seems the Parquet writers are currently very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
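For the AVRO -> PARQUET step, here is a minimal sketch using AvroParquetWriter from parquet-avro; the output path and the pojo variable are assumptions:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Writes converted GenericRecords to a Parquet file using the same reflected schema.
try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("output.parquet")) // assumed output path
        .withSchema(avroSchema)
        .build()) {
    writer.write(pojo.toAvroGenericRecord());
}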
I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66
In short, it does the following:
uses reflection to get an Avro schema from the POJO
using the schema and reflection, converts POJOs to GenericRecord objects
applies reflection recursively if the POJO contains other POJOs or lists of POJOs

Hadoop Serializer Not Found Exception

I have a job whose output format is SequenceFileOutputFormat.
I set the output key and value class like this:
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(SplitInfo.class);
The SplitInfo class implements Serializable and Writable.
I set the io.serializations property as follows:
conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,"
+ "org.apache.hadoop.io.serializer.WritableSerialization");
However, on the reducer side I get this error, telling me that Hadoop couldn't find a serializer:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:961)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:892)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:569)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
Can anyone help, please?
The problem was that I was making a stupid mistake: I was not updating a jar. So basically, SplitInfo was not implementing the Writable interface in the old (in-use) jar.
As a general observation: the error in the OP has as its underlying cause the fact that Hadoop can't find a Serializer for a specific type you're trying to serialize (whether directly or indirectly, e.g. by using that type as an output key/value). Hadoop cannot find a Serializer for one of two reasons:
your type is not serializable (i.e. it doesn't implement Writable or Serializable)
there is no Serializer available to Hadoop for the kind of serialization your type implements (e.g. your type implements Writable but Hadoop, for one reason or another, cannot use the org.apache.hadoop.io.serializer.WritableSerialization class); see the minimal Writable sketch below
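A minimal Writable sketch; the SplitInfo field names here are assumptions for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class SplitInfo implements Writable {
    private int splitId;  // assumed field
    private String path;  // assumed field

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(splitId);
        out.writeUTF(path);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        splitId = in.readInt();
        path = in.readUTF();
    }
}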
I think you're trying to do something you don't need to. Your output value only needs to implement the Writable interface, and you should just set the output format:
conf.setOutputFormatClass(SequenceFileOutputFormat.class);
You only need the "io.serializations" configuration if you want to use a different serialization framework, which it doesn't look like you do.
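Putting it together, a hedged sketch of the relevant driver configuration using the new MapReduce API; the job name is an assumption, and the mapper/reducer/path settings are omitted:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

Job job = Job.getInstance(conf, "split-info-job"); // assumed job name
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(SplitInfo.class);          // SplitInfo must implement Writable
job.setOutputFormatClass(SequenceFileOutputFormat.class);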

design pattern for parsing object to specific model and vice versa

I have file-upload logic and some very specific business rules. According to them, I should serialize my file model to a row that looks like "Header:{processed field1},{processed field2},{processed field3},..." and so on for 19 params. It is essentially custom serialization.
I should also be able to parse such a row back into an object. So the question is: what is the common approach to coding this kind of thing?
Right now, for serializing the model to a row I just use string.Format with many arguments, and for parsing a row back into the model I split the row by ',' and then assign the parts to the model's fields. But this implementation involves a lot of low-level work and some hard-coded positions, and it does not look pretty to me.
There's not going to be any magic involved here, particularly since you are serializing the object to a non-standard format. You're probably going to have to live with the 'ugly' code.
You should put your serialization / deserialization inside a custom serializer. You can follow the same pattern as the other serializers in the .NET library and implement the IFormatter interface. This would give you a common interface that you can use to stream to and from a file (or any stream):
using (var fileStream = new FileStream(fileName, FileMode.Create))
{
    var formatter = new CustomFormatter();
    formatter.Serialize(fileStream, objectToSerialize);
}

using (var fileStream = new FileStream(fileName, FileMode.Read))
{
    var formatter = new CustomFormatter();
    return (CustomType)formatter.Deserialize(fileStream);
}
You can see an example of a custom formatter in this download.

vb.net object persisted in database

How can I go about storing a VB.NET user-defined object in a SQL database? I am not trying to replicate the properties with columns; I mean something along the lines of converting or encoding my object to a byte array and then storing that in a field in the DB, like when you store an instance of an object in Session, except that I need the info to persist past the current session.
#Orion Edwards
"It's not a matter of stances. It's because one day, you will change your code. Then you will try to de-serialize the old object, and YOUR PROGRAM WILL CRASH."
My program will not "crash"; it will throw an exception. Lucky for me, .NET has a whole set of classes dedicated to such an occasion, at which time I will refresh my stale data and put it back in the DB. That is the point of this one field (or stance, as the case may be).
You can use serialization - it allows you to store your object in at least three forms: binary (suitable for BLOBs), XML (taking advantage of MSSQL's XML data type), or plain text (stored in a varchar or text column).
Before you head down this road towards your own eventual insanity, take a look at this (or you may one day repeat it):
http://thedailywtf.com/Articles/The-Mythical-Business-Layer.aspx
Persisting objects in a database is not a good idea. It kills all the good things that a database is designed to do.
You could use the BinaryFormatter class to serialize your object to a binary format, then save the resulting byte array in your database.
The XmlSerializer, or the DataContractSerializer in .NET 3.x, will do the job for you.
#aku, lomaxx and bdukes - your solutions are what I was looking for.
#1800 INFORMATION - while I appreciate your stance on the matter, this is a special case of data that I get from a web service and that gets refreshed only about once a month. I don't need the data persisted in DB form because that's what the web service is for. Below is the code I finally got to work.
Serialize
' res is my object to serialize
Dim xml_serializer As System.Xml.Serialization.XmlSerializer
Dim string_writer As New System.IO.StringWriter()
xml_serializer = New System.Xml.Serialization.XmlSerializer(res.GetType)
xml_serializer.Serialize(string_writer, res)
Deserialize
' string_writer and xml_serializer from above
Dim serialization As String = string_writer.ToString()
Dim string_reader As System.IO.StringReader
string_reader = New System.IO.StringReader(serialization)
Dim res2 As testsedie.EligibilityResponse
res2 = CType(xml_serializer.Deserialize(string_reader), testsedie.EligibilityResponse)
What you want to do is called "serializing" your object, and .NET has a few different ways to go about it. One is the XmlSerializer class in the System.Xml.Serialization namespace.
Another is in the System.Runtime.Serialization namespace. This has support for a SOAP formatter, a binary formatter, and a base class you can inherit from, all of which implement a common interface.
For what you are talking about, the BinaryFormatter suggested earlier will probably have the best performance.
I'm backing #1800 INFORMATION on this one.
Serializing objects for long-term storage is never a good idea.
"while i appreciate your stance on the matter, this is a special case of data that I get from a webservice that gets refreshed only about once a month"
It's not a matter of stances. It's because one day you will change your code. Then you will try to de-serialize the old object, and YOUR PROGRAM WILL CRASH.
If it crashes (or throws an exception), all you are left with is a bunch of binary data to sift through to recreate your objects.
If you are only persisting binary data, why not just save straight to disk? You also might want to look at using something like XML since, as has been mentioned, if you alter your object definition you may not be able to deserialize the old data without some hard work.