Avro schema storage - serialization

We are evaluating Avro vs. Thrift for storage. At this point Avro seems to be our choice; however, the documentation states that the schema is stored alongside the data when serialized. Is there a way to avoid this? Since we are in charge of both producing and consuming the data, we want to see if we can avoid serializing the schema. Also, is the serialized data with the schema much larger than the data without it?

A little late to the party, but you don't actually need to store the schema with each and every record. You do, however, need a way to get back to the original schema from each record's serialized form.
Thus, you could use a schema store plus a custom serializer that writes the Avro record content along with a schema ID. On read, you read back that schema ID, retrieve the schema from the store, and then use that schema to rehydrate the record content. Bonus points for using a local cache if your schema store is remote.
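For illustration, here is a rough sketch (in Java, using Avro's generic API) of what such a serializer and deserializer could look like. The SchemaStore interface is made up, and using Avro's 64-bit parsing fingerprint as the schema ID is just one possible choice:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

interface SchemaStore {                  // hypothetical: back it with a DB, a service, etc.
    void register(long id, Schema schema);
    Schema lookup(long id);              // add a local cache here if the store is remote
}

class SchemaIdSerializer {
    byte[] serialize(GenericRecord record, SchemaStore store) throws IOException {
        Schema schema = record.getSchema();
        long id = SchemaNormalization.parsingFingerprint64(schema);
        store.register(id, schema);                               // idempotent registration
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(8).putLong(id).array());    // 8-byte schema ID prefix
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    GenericRecord deserialize(byte[] bytes, SchemaStore store) throws IOException {
        Schema schema = store.lookup(ByteBuffer.wrap(bytes).getLong());  // first 8 bytes = schema ID
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(bytes, 8, bytes.length - 8, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}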
This is exactly the approach that Oracle's NoSQL DB takes to managing schema in a storage-efficient manner (it's also available under the AGPL license).
Full disclosure: I'm not currently, and have never been, employed by Oracle or Sun, nor have I worked on the above store. I just came across it recently :)

I'm pretty sure you will always need the schema to be stored with the data, because Avro uses it when reading from and writing to the .avro file.
According to http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html:
You apply a schema to the value portion of an Oracle NoSQL Database record using Avro bindings. These bindings are used to serialize values before writing them, and to deserialize values after reading them. The usage of these bindings requires your applications to use the Avro data format, which means that each stored value is associated with a schema.
As for the size difference, you only have to store the schema once, so in the grand scheme of things it doesn't make much of a difference. My schema takes up 105.5 KB (and that is a really large schema; yours shouldn't be that big) and each serialized value takes up 3.3 KB. I'm not sure what the difference would be for just the raw JSON of the data, but according to the link I posted:
Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size.
But I believe that may just be for single, simple values.
This is on HDFS for me btw.

Thanks JGibel. Our data will eventually end up in HDFS, and the object container file format does ensure that the schema is only written once, as a header on the file.
For uses other than HDFS, I was under the wrong assumption that the schema would be attached to every encoded record, but that's not the case: you need the schema to deserialize the data, but the serialized data does not have to have the schema string attached to it.
E.g.
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

// Write only the binary record content; no schema text goes into the output stream.
DatumWriter<TransactionInfo> eventDatumWriter = new SpecificDatumWriter<TransactionInfo>(TransactionInfo.class);
TransactionInfo t1 = getTransaction();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
eventDatumWriter.setSchema(t1.getSchema());
eventDatumWriter.write(t1, encoder);
encoder.flush();
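For completeness, the read side would look roughly like this, a sketch assuming the consumer has the same generated TransactionInfo class (since the bytes in baos carry no schema, the reader has to supply it):
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

// The generated class provides the schema; the serialized bytes do not contain it.
byte[] serialized = baos.toByteArray();
DatumReader<TransactionInfo> eventDatumReader = new SpecificDatumReader<TransactionInfo>(TransactionInfo.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serialized, null);
TransactionInfo t2 = eventDatumReader.read(null, decoder);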

Related

Vertica Large Objects

I am migrating a table from Oracle to Vertica that contains a LOB column. The largest actual value in the LOB column is about 800 MB. How can this data be accommodated in Vertica? Is it appropriate to use a Flex table?
Vertica's documentation says that data loaded into a Flex table is stored in a raw column of type LONG VARBINARY. By default, it has a maximum size of 32 MB, which, according to the documentation, can be changed (i.e., increased) using the parameter FlexTablesRawSize.
I'm thinking this is the approach for storing large objects in Vertica: we just need to update the FlexTablesRawSize parameter to handle 800 MB of data. I'd like to ask whether this is the optimal way, whether there's a better way, or whether this will conflict with Vertica's row-size limitation, which only allows up to 32 MB of data per row.
Thank you in advance.
If you use Vertica for what it's built for, running a Big Data database, you would, like in any analytical database, try to avoid large objects in your tables. BLOBs and CLOBs are usually used to store unstructured data: large documents, image files, audio files, video files. You can't filter by such a column, you can't run functions on it or sum it, and you can't group by it.
A safe and performant design stores the file name in a Vertica table column, stores the file itself elsewhere (maybe even in Hadoop), and lets the front end (usually a BI tool, and all BI tools support that) retrieve the file and bring it to a report screen ...
Good luck ...
Marco

Sqlite copying table resources

I'm looking at using SQLite to store some data that will be converted from binary to engineering units. The binary data will be kept in one table, and the engineering data will be converted from the binary data into another table. It's likely that I'll occasionally need to change the conversions. I read that SQLite doesn't support dropping columns, and I was wondering how expensive (in time and resources) it is to copy the old table to a new table, or whether I'm better off with a different database. It's preferable to have the database in a single file and not to have a server running.
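For reference, the usual workaround for dropping or changing columns in SQLite is exactly that copy: create a new table, copy the rows across, drop the old table, and rename. The cost is roughly one full scan plus a rewrite of the table. A minimal sketch via JDBC, assuming the sqlite-jdbc driver is available and using made-up table and column names:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RebuildEngineeringTable {
    public static void main(String[] args) throws Exception {
        // Assumes the xerial sqlite-jdbc driver is on the classpath; the database
        // file name, table names, and conversion factor below are made up.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:telemetry.db");
             Statement st = conn.createStatement()) {
            conn.setAutoCommit(false);
            // New table with the revised columns/conversion you want to keep.
            st.executeUpdate("CREATE TABLE engineering_new (sample_id INTEGER PRIMARY KEY, value REAL)");
            // The expensive part: one full scan of the old table plus a rewrite of every row.
            st.executeUpdate("INSERT INTO engineering_new (sample_id, value) " +
                             "SELECT sample_id, raw_value * 0.1 FROM engineering");
            st.executeUpdate("DROP TABLE engineering");
            st.executeUpdate("ALTER TABLE engineering_new RENAME TO engineering");
            conn.commit();
        }
    }
}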

Text file Vs SQL?

I am using a simple text file to store filenames and their hash values, which is later read to search for a particular file. Should I go for SQL for such a simple task?
It depends on your needs and operations.
If you only need simple operations like reading and writing (updates and deletions are harder in a flat file than in a DB), and your data volume is very low, it's OK to go that way (though I don't recommend it).
Relational databases are generally better than plain files because their row-and-tuple structure is suited to data manipulation operations.
If your needs are simple, use a JSON or XML structure; either is far better than a raw text file.
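As a tiny illustration of moving beyond a raw text file without a database, here is a sketch using java.util.Properties as a simple key/value store for filename-to-hash pairs. This is a stand-in for the JSON/XML suggestion above (no external libraries needed); the file name and hash value are made up:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class HashStore {
    public static void main(String[] args) throws Exception {
        Properties hashes = new Properties();
        hashes.setProperty("report.pdf", "9b74c9897bac770ffc029102a200c5de"); // filename -> hash

        try (FileOutputStream out = new FileOutputStream("hashes.properties")) {
            hashes.store(out, "filename=hash");   // write the whole map to disk
        }

        Properties loaded = new Properties();
        try (FileInputStream in = new FileInputStream("hashes.properties")) {
            loaded.load(in);                      // read it back
        }
        System.out.println(loaded.getProperty("report.pdf")); // lookup by filename
    }
}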

Is data appended to an existing table or does it overwrite it when streaming data into BigQuery?

When streaming data into a BigQuery table, is the default to append the JSON data to the table if it already exists? The API documentation for tabledata().insertAll() is very brief and doesn't mention parameters like configuration.load.writeDisposition, as in a load job.
There are no multiple choices here, so there is no default and no override. Don't forget that BigQuery is a WORM technology (append-only by design). It looks to me as though you are not aware of this, as there is no option like UPDATE.
You just set the path parameters (the trio of project, dataset, and table ID), then set the existing schema as JSON and the rows, and it will append to the table.
To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis.
In case of error you get a short error code that summarizes the problem. For help debugging the specific reason value you receive, see troubleshooting errors.
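For example, with the google-cloud-bigquery Java client, a streaming insert with an insertId and a basic error check could look roughly like this (the dataset, table, row fields, and insertId are made up):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class StreamRows {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");   // hypothetical names

        Map<String, Object> row = new HashMap<>();
        row.put("name", "example");
        row.put("value", 42);

        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(tableId)
                // The first argument is the insertId used for best-effort de-duplication.
                .addRow("row-key-1", row)
                .build());

        if (response.hasErrors()) {
            // Each entry maps the index of a failed row to its error details.
            response.getInsertErrors().forEach((index, errors) ->
                System.err.println("Row " + index + " failed: " + errors));
        }
    }
}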
Also worth reading:
Bigquery internalError when streaming data

Store file on file system or as varbinary(MAX) in SQL Server

I understand that there is a lot of controversy over whether it is bad practice to store files as blobs in a database, but I just want to understand whether it would make sense in my case.
I am creating an ASP.NET application, used internally at a large company, where users need to be able to attach files to a 'job' in the system. These files are generally PDFs or Word documents, probably never exceeding a couple of MB.
I am creating a new table like so:
ID (int)
JobID (int)
FileDescription (nvarchar(250))
FileData (varbinary(MAX))
Is the use of varbinary(MAX) here ideal, or should I be storing the path to the file and simply storing the file on the file system somewhere?
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean, mean, and very efficient, assuming you don't always need to select the employee photo as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it LARGE_DATA.
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(   ID INT NOT NULL PRIMARY KEY,     -- for example, the columns from the question above
    JobID INT NOT NULL,
    FileDescription NVARCHAR(250),
    FileData VARBINARY(MAX) )
ON Data                  -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!