I am very new to Hadoop and have to delve into its serialization. I know that Hadoop comes with its own serializers called Writables. I was curious to know whether Avro (or protobuf, Thrift) replaces the Writable interface, or whether Avro is just meant for serializing the MR client data but not the internal communication between, say, the namenode and datanode.
Avro is a serialization library (with APIs for a number of languages). Avro is an alternative to using/implementing your key/value objects as Writables, but Hadoop still uses its own RPC data structures when communicating between the various services (datanodes, namenodes, job and task trackers).
I've read somewhere that Avro may well end up being the standard internal data exchange mechanism/serialization framework within Hadoop, which makes sense as it is based on inheritance, much like the "new" Hadoop API (the one that uses the mapreduce namespace for its libraries), whereas the "old" API (mapred libraries) is based on interfaces. That means, in practice, that you can certainly use Avro with both APIs, although one or two things may require custom code if you're using the mapred libs (e.g. multiple output formats, chain mappers).
But Avro offers far more than "just" doing away with the need for your own Writables (although that is, in my view, a considerable plus): it offers fairly efficient serialization, the choice between serializing against generated entity classes (as Thrift requires) or using a so-called GenericRecord structure instead, and it does not require tagged data. This is possible because Avro always has its data schema available at read and write time (it is actually saved in JSON format as a header in the data file), which means you have the option of "projecting" from one set of fields onto a subset of those fields by simply providing that subset in the schema used to read the data. You can then adapt to changes in input data structure by tweaking your schemas, rather than changing your code in multiple places. You can also change the way your data is sorted by defining your schema appropriately (there is an optional ORDER attribute available).
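To make the projection idea concrete, here is a minimal sketch using the Python fastavro library (the schemas and field names are invented for illustration):

```python
# A minimal sketch of Avro schema projection with fastavro.
# The record and field names here are illustrative only.
import io
from fastavro import writer, reader

# Full writer schema: three fields.
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
    ],
}

# Reader schema: a projection onto a subset of the writer's fields.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

buf = io.BytesIO()
writer(buf, writer_schema, [{"id": 1, "name": "Ada", "email": "ada@example.com"}])
buf.seek(0)

# Reading with the narrower schema returns only the projected fields.
for record in reader(buf, reader_schema=reader_schema):
    print(record)   # {'id': 1, 'name': 'Ada'}
```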
Avro supports schema reuse. One can define a custom type in file a.avsc and reuse it in another file b.avsc. That is vitally important for more or less complex projects: it avoids having identical entities under different namespaces and being flooded with pointless mappings between them. But I see no such possibility mentioned in the official documentation of the Confluent Kafka platform. Does that mean there is no way to reuse schemas?
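For what it's worth, plain Avro does allow this: a named type defined in one schema can be referenced by its full name from another, as long as the parser is given both definitions. A small sketch of the idea with the Python fastavro library (the type names are invented, and whether the Confluent Schema Registry resolves such references is a separate question):

```python
# Sketch: reusing a named Avro type across schemas with fastavro.
# The record/type names are illustrative only.
from fastavro.schema import parse_schema

# Contents of (say) a.avsc: defines a reusable named type.
address_schema = {
    "type": "record",
    "name": "Address",
    "namespace": "com.example",
    "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"},
    ],
}

# Contents of (say) b.avsc: refers to com.example.Address by full name.
customer_schema = {
    "type": "record",
    "name": "Customer",
    "namespace": "com.example",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "address", "type": "com.example.Address"},
    ],
}

# Parse the shared type first, then parse the referring schema against the
# same registry of named schemas so the reference resolves.
named = {}
parse_schema(address_schema, named_schemas=named)
parsed_customer = parse_schema(customer_schema, named_schemas=named)
```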
I am using TensorFlow Data Validation and I am trying to build schemas around my datasets. I've built the initial schemas and I can see/edit them in Notepad, but I'm having a hard time actually finding a resource that shows me exactly what kind of parameters I can set in the file for a given data type (e.g. min or max values, or data shapes).
Does anyone know of a good resource or even a comprehensive schema I can use to further edit my schema file?
Schemas are just a kind of protocol buffer message, defined in TensorFlow Metadata. You can find the protocol buffer definition in tensorflow_metadata/proto/v0/schema.proto, which describes and documents all the possible properties and options.
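As a small, hedged example of what editing those properties can look like from Python (the feature names are invented; schema.proto remains the authoritative reference for what is available):

```python
# Sketch: tweaking a TFDV schema programmatically instead of in Notepad.
# Feature names are illustrative; the proto fields come from schema.proto.
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

schema = tfdv.load_schema_text('schema.pbtxt')

# Constrain a numeric feature to the range [0, 120].
tfdv.set_domain(schema, 'age', schema_pb2.IntDomain(name='age', min=0, max=120))

# Require the feature to be present in at least 90% of examples.
feature = tfdv.get_feature(schema, 'age')
feature.presence.min_fraction = 0.9

tfdv.write_schema_text(schema, 'schema_edited.pbtxt')
```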
I would like to (re)start with GemStone/S. I have done multiple ETL transformations for relational databases, but I'm still fuzzy on how this is done in GemStone/S.
I would like to load data into GemStone from different sources. It could be files (csv, excel, xml, plain text, etc.) or other DBs like SQL Server, Postgres, Oracle, etc.
From what I saw on the pages, there is GemConnect, which connects to Oracle databases. How do you do it for other databases or files? Is there any option to connect via ODBC? Is there any data pump to do so, or do you "just" have to write one yourself?
What I'm asking, in the end, is how you create a staging area where you would clean up, transform, and then load the data into the GemStone DB. Are there any examples or documentation showing how this is done?
Note: the only similar answer I have found is on SO, from Stephan Eggermont, but it was short and without any "real" information.
Staging
I suspect that the reason that most environments have "ETL/staging" as a separate step is because the two endpoints are somewhat rigid and don't have a good programming language for data manipulation. That is, if you have TXT, CSV, XML, JSON, or SQL, and need it in another format/schema, then someone has to do the "transformation." But if you are working in GemStone, then you can do the transformation in Smalltalk--there is no need for a separate step.
Files
If you have files (TXT, CSV, XML, JSON, etc.), then use GsFile. In fact, if the other endpoint can deal with files, then just export from one source in an agreed format and import in the other (with GemStone doing the "heavy lifting" of transforming). Files are simpler, they avoid the communications layer, and they make debugging trivial: if the source hasn't created the file, then it is the source's problem; if it is in the pending directory, then the destination hasn't processed it yet (destination problem); if it is in the completed directory, then the destination has processed it.
With this approach you start (one or more) background jobs in GemStone to watch a directory, open a file for read, process the file, and then move it to another directory. Other than basic string manipulation, you only need to work with GsFile. Then you create and update your objects in the database.
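The pattern itself is not GemStone-specific; here is a rough sketch of the loop in Python (the directory names and process_file hook are placeholders, and in GemStone the same loop would use GsFile and commit the resulting objects to the database):

```python
# Sketch of the pending/completed directory pattern described above.
# Directory names and process_file are placeholders.
import shutil
import time
from pathlib import Path

PENDING = Path('pending')
COMPLETED = Path('completed')

def process_file(path: Path) -> None:
    # Placeholder: parse the file and create/update domain objects here.
    for line in path.read_text().splitlines():
        print('importing', line)

def watch() -> None:
    while True:
        for path in sorted(PENDING.glob('*.csv')):
            process_file(path)
            # Move the file so a crash never re-processes completed work.
            shutil.move(str(path), COMPLETED / path.name)
        time.sleep(5)
```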
ODBC
While it would be possible to make FFI calls from GemStone to an ODBC library (or to a database's native library, as is done with GemConnect), this would probably be unnecessarily complex. Instead, I'd create another layer using tools that have better interaction with the foreign system. This layer could write text files (as described above), or, with the proper interface, could communicate with GemStone directly. My inclination would be to use Dolphin to extract the data (it has good ODBC support), and then communicate directly with GemStone from Dolphin. You could do something similar with other client Smalltalk dialects (Pharo, VA, or VW), or even from another language (I have a student working on a Python interface to GemStone).
O/R Mapping
Here again you are left with needing a way to take data in one format and translate it to another. These tend to be highly domain-specific and we find it easier to just write Smalltalk code. Alternatively, you could use something like GLORP in Pharo, VA, VW, etc.
Best Practices
I think you haven't found any "best practices" for ETL in GemStone because it isn't something we think of as an external process or separate step. There is just how to communicate with a file (GsFile), a socket (GsSocket), a library (CLibrary), or a client (GCI). From here we can look at internal processing issues like multiple producers and one consumer (RcQueue), or one producer and multiple consumers (locking).
So, it isn't that GemStone applications don't do ETL; they just do it internally, and the approach is much more situation-specific.
Hi, I am new to HBase and I wonder what the best approach is to serialize and store data in HBase. Is there any convenient way to transform "business objects" at the application level into HBase objects (Put), i.e. the transformation to byte[]? I doubt that it has to be converted manually via helper methods like Bytes.toBytes etc.
What are the best practices and experiences?
I read about Avro, Thrift, n-orm, ...
Can someone share his knowledge?
I would go with the default Java API and enable compression on HDFS rather than using a framework for serializing / deserializing efficiently during RPC calls.
Apparently, updates like adding a column to records serialized with Avro/Thrift would be difficult, as you are forced to delete and recreate them.
Secondly, I don't see support for Filters in Thrift/Avro, in case you need to filter data at the source.
My two cents.
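To make the "plain client API, encode the fields yourself" suggestion concrete, here is a rough sketch using the Python happybase client (the table, column family, and field names are invented; the Java Put/Bytes.toBytes version is analogous):

```python
# Rough sketch: mapping a "business object" onto an HBase row by hand.
# Uses the happybase client (which talks to the HBase Thrift gateway)
# for brevity; the Java Put API is analogous. Names are illustrative.
import struct
import happybase

class User:
    def __init__(self, user_id: int, name: str, balance: float):
        self.user_id = user_id
        self.name = name
        self.balance = balance

def put_user(table, user: User) -> None:
    row_key = str(user.user_id).encode('utf-8')
    table.put(row_key, {
        b'info:name': user.name.encode('utf-8'),
        # HBase only stores bytes, so numbers are packed explicitly.
        b'info:balance': struct.pack('>d', user.balance),
    })

connection = happybase.Connection('localhost')   # assumes a running Thrift gateway
table = connection.table('users')
put_user(table, User(42, 'Ada', 12.5))
```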
For an ORM solution, kindly have a look at https://github.com/impetus-opensource/Kundera .
How many of the software projects you have worked on used object serialization? I personally never came across a scenario where object serialization was used. One use case I can think of is server software storing objects to disk to save memory. Are there other types of software where object serialization is essential or preferred over a database?
I've used object serialization in a lot of my projects. Sometimes we use it to store computer-specific settings locally. I have also used XML serialization to simplify interaction and generation of XML documents. It is also very beneficial in communication protocols. Serialize on one end and re-inflate on the other end.
Well, converting objects to XML or JSON is a form of serialization that is quite common on the web. I've also worked on a project where objects were created and serialized to a binary file in one application and then imported into another custom application (though that's fragile since it uses C# and serialization has broken in the past between versions of the .NET framework). Also, application settings that have a complex structure may be useful to serialize. I also think remoting APIs use serialization to communicate. Basically, serialization in general is simply a way to store the states of your objects, and this has many different uses.
Here are a few uses I can think of:
Sending an object across the network; the most common example is serializing objects across a cluster
Serializing objects for (a sort of) caching, i.e. saving the state in a file and reading it back later
Serializing passive/huge data to a file to minimize memory consumption and reading it back whenever required (a small sketch of the last two points follows below)
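A minimal sketch of that caching / read-it-back-later idea in Python (pickle is just one choice of serializer here; CACHE_PATH and expensive_computation are placeholders):

```python
# Sketch: caching an object's state to disk and reading it back on demand.
# CACHE_PATH and expensive_computation are placeholders.
import pickle
from pathlib import Path

CACHE_PATH = Path('state.pkl')

def expensive_computation() -> dict:
    return {'answer': 42}

def load_or_build() -> dict:
    if CACHE_PATH.exists():
        with CACHE_PATH.open('rb') as f:
            return pickle.load(f)       # re-inflate the saved state
    state = expensive_computation()
    with CACHE_PATH.open('wb') as f:
        pickle.dump(state, f)           # serialize for next time
    return state

print(load_or_build())
```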
I'm using serialization to pass objects across a TCP socket. You put XmlSerializers on either side, and they parse your data into readily available objects. If you do a little groundwork, you get to the point where you're basically passing objects back and forth, and it makes socket communication extremely easy, reducing it to nothing more than socket.Send(myObject);.
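A rough Python analogue of the same idea (the answer above uses C# XmlSerializer; here pickle plus a 4-byte length prefix stands in for the framing and parsing):

```python
# Sketch: serialize an object on one end of a socket, rebuild it on the other.
import pickle
import socket
import struct

def send_obj(sock: socket.socket, obj) -> None:
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack('>I', len(payload)) + payload)   # length-prefixed frame

def recv_exact(sock: socket.socket, n: int) -> bytes:
    data = b''
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError('socket closed')
        data += chunk
    return data

def recv_obj(sock: socket.socket):
    (length,) = struct.unpack('>I', recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))

# Self-contained demo over a local socket pair.
a, b = socket.socketpair()
send_obj(a, {'command': 'ping', 'seq': 1})
print(recv_obj(b))
```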
Interprocess communication is a biggie.
You can combine the DB and serialization, for example when you have to store an object with a lot of attributes (often dynamic, i.e. one object's attribute set will differ from another's) in a relational DB and you don't want to create a new column for each attribute.
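A small sketch of that mixed approach, using SQLite with a JSON-serialized column (the table and column names are invented):

```python
# Sketch: fixed columns for common fields, one serialized column for the
# dynamic per-object attributes. Table/column names are illustrative.
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, attrs TEXT)')

attrs = {'color': 'red', 'weight_kg': 1.2}          # varies per object
conn.execute('INSERT INTO item (name, attrs) VALUES (?, ?)',
             ('widget', json.dumps(attrs)))

name, raw = conn.execute('SELECT name, attrs FROM item').fetchone()
print(name, json.loads(raw))                        # deserialize on read
```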
We started out with a system that serialized all of the thousands of in-memory objects to disk every 15 minutes or so. When that started taking too long we switched over to a mixed mode of saving the objects into a relational db and pickle file (this was a python system btw). Eventually the majority of the data was stored in a relational database. Interestingly, the system was written in such a way that all of the application code couldn't care less what was going on down there. It was all done using XP and thousands of automated tests.
Document based applications such as word processors and vector graphics editors will often serialize the document model to disk when the user invokes the Save command. Serialization is often preferred over complex databases in these apps.
Using serialization saves you time each time you want to implement import/export functionality.
Every time you need to export your system's data, create backups or store some kind of settings, you could use serialization instead and just save the state of the objects that represent the actual config, data or whatever else.
Only when you need a specific format for the exported/imported data does it make sense to build a custom parser and exporter/importer.
Serialization is also change-proof: whenever you change the format of the object involved in the exchange functionality, it remains automatically exportable and you don't have to change the logic behind your export/import parts.
We used it for backup & update functionality. It was basically serialized Hibernate objects being backed up; then the DB schema was altered through the update, and we delivered a helper class that "converted" the old objects to the new DB schema. This way we had a pretty solid update mechanism that wouldn't break easily and did an automatic backup at the same time.
I've used XML serialization heavily on one project. The technique was used to persist to the database data structures that had no common structure, so the data couldn't be stored in columns directly. I also used serialization for application settings that could be changed at runtime.