Schema evolution for Avro, ORC and Parquet formats

In terms of schema evolution, my understanding is that it should be a binary answer (yes or no). In the picture above, though, it shows pie charts at 100%, 25% and 50% respectively. What does this represent?

Avro supports adding new columns, renaming columns (via aliases), and deleting columns.
ORC with Hive supports adding new columns and changing column types.
Parquet supports adding new columns.
That is most likely what the percentages in the images represent.
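To make the Avro case concrete, here is a minimal Python sketch using the fastavro library (the record and field names are made up, and alias resolution needs a reasonably recent fastavro): an old writer schema is read with a newer reader schema that adds a column with a default and renames one via an alias.

import io
from fastavro import writer, reader, parse_schema

# Writer schema: the "old" shape of the data (hypothetical fields).
old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})

# Reader schema: adds "email" with a default and renames "name" to "full_name" via an alias.
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "full_name", "type": "string", "aliases": ["name"]},
               {"name": "email", "type": "string", "default": ""}],
})

buf = io.BytesIO()
writer(buf, old_schema, [{"id": 1, "name": "Ada"}])
buf.seek(0)

# Old data stays readable with the new schema; the missing field falls back to its default.
for record in reader(buf, reader_schema=new_schema):
    print(record)  # expected: {'id': 1, 'full_name': 'Ada', 'email': ''}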

Related

Using AWS Glue to create table of parquet data stored in S3 in Athena

I want to preview in Athena data that resides in an S3 bucket. The data is in Parquet. This doc here describes the process of how to use AWS Glue to create a preview. One mandatory step here is to input the Column Details, which means entering each column name and its data type. I have two problems with this step:
1 - What if I have no idea what columns exist in the Parquet file beforehand (i.e. I have not seen the content of the Parquet file before)?
2 - What if there are hundreds, if not thousands, of columns in there?
Is there a way to make this work without entering the Column Details?
The link you provided answers your first question, I think:
What if I have no idea what columns exist in the Parquet file beforehand
Then you should use a Glue crawler to explore the files and have it create a Glue table for you. That table will show up in the AwsDataCatalog catalog as a queryable relation.
What if there are hundreds, if not thousands, of columns in there?
If you're worried about a column quota limitation: I spent some time looking through the documentation to see if there is any mention of a service quota for the maximum number of columns per table, and I could not find any. That doesn't mean there isn't one, but I would be surprised if someone had generated a Parquet file with more columns than Glue supports.
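For illustration, pointing a crawler at the bucket with boto3 might look roughly like this; the crawler name, IAM role, database, region, and S3 path are all placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Create a crawler that infers the parquet schema and registers a table in the Data Catalog.
glue.create_crawler(
    Name="my-parquet-crawler",                       # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical IAM role with S3 + Glue access
    DatabaseName="my_database",                      # Glue database that will hold the table
    Targets={"S3Targets": [{"Path": "s3://my-bucket/path/to/parquet/"}]},
)

glue.start_crawler(Name="my-parquet-crawler")
# Once the crawler finishes, the table (with all columns inferred) is queryable from Athena
# under the AwsDataCatalog catalog.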

Vertica Large Objects

I am migrating a table from Oracle to Vertica that contains an LOB column. The maximum actual size of the LOB column amounts to 800MB. How can this data be accommodated in Vertica? Is it appropriate to use a Flex table?
Vertica's documentation says that data loaded into a Flex table is stored in the raw column, which is a LONG VARBINARY data type. By default it has a maximum size of 32MB, which, according to the documentation, can be changed (i.e. increased) using the FlexTablesRawSize parameter.
I'm thinking this is the approach for storing large objects in Vertica: we just need to update the FlexTablesRawSize parameter to handle 800MB of data. I'd like to ask whether this is the optimal way or if there's a better way. Or will this conflict with Vertica's row-size limitation that only allows up to 32MB of data per row?
Thank you in advance.
If you use Vertica for what it's built for, running a Big Data database, you would, like in any analytical database, try to avoid large objects in your tables. BLOBs and CLOBs are usually used to store unstructured data: large documents, image files, audio files, video files. You can't filter by such a column, run functions on it, sum it, or group by it.
A safe and performant design stores the file name in a Vertica table column, stores the file itself elsewhere (possibly even in Hadoop), and lets the front end (usually a BI tool, and all BI tools support this) retrieve the file to bring it to a report screen ...
Good luck ...
Marco
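A minimal sketch of that design, assuming the vertica-python client and made-up connection details and table/column names (the 800MB files themselves would live in HDFS or object storage, not in Vertica):

import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,   # placeholders
             "user": "dbadmin", "password": "***",
             "database": "mydb"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Keep only metadata and a pointer to the file in Vertica; the large object stays outside.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id     INT NOT NULL,
            file_name  VARCHAR(255),
            file_path  VARCHAR(1024),  -- e.g. an HDFS or S3 URI the BI tool can resolve
            loaded_at  TIMESTAMP DEFAULT NOW()
        )
    """)
    cur.execute("INSERT INTO documents (doc_id, file_name, file_path) "
                "VALUES (1, 'contract.pdf', 'hdfs://namenode/lobs/contract.pdf')")
    conn.commit()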

Apache Pig - Best Hive file formats

Could someone explain which Hive file formats will be efficient to use from a Pig script via HCatalog?
I would like to understand which Hive file formats will be efficient, since currently we have a Hive table partitioned by date, and the underlying files are sequence files.
Reading 80 days of data creates around 70,000 mappers, which is very high. I tried changing the map split size to 2GB, and it did not reduce the count by much.
So, instead of sequence files, I am looking for other options that will reduce the number of mappers. The size of the data per day is 9GB.
Any suggestions or inspiration?
Thank you.
As far as I know, ORC is the most suitable file format for Hive: it has a high compression ratio, handles large amounts of data efficiently, and is also faster to read. ORC stores data as compressed columns, which leads to smaller disk reads. The columnar format is also ideal for vectorization optimizations in Hive.
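For example, converting the existing sequence-file table to ORC could look roughly like this; it is only a sketch using PyHive, and the server, table, and column names (including the dt partition column) are placeholders:

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="etl")  # placeholders
cur = conn.cursor()

# Allow writing all date partitions in a single INSERT.
cur.execute("SET hive.exec.dynamic.partition=true")
cur.execute("SET hive.exec.dynamic.partition.mode=nonstrict")

# New table with the same layout, stored as ORC.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_orc (
        user_id BIGINT,
        event   STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC
""")

# Copy the data; ORC's columnar layout and compression mean fewer and smaller splits to read.
cur.execute("""
    INSERT OVERWRITE TABLE events_orc PARTITION (dt)
    SELECT user_id, event, dt FROM events_seq
""")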

I/O for many small data tables in pandas?

I have many tables (about 200K of them), each small (typically fewer than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them, and stores the final result (without keeping the content of the tables in memory).
This is done many times over, and I can choose the storage format for these tables for the best (speed) performance.
What natively supported storage format would be the quickest?
IMO there are a few options in this case:
use an HDF store (a.k.a. PyTables, .h5), as @jezrael has already suggested. You can decide whether you want to group some or all of your tables and store them in the same .h5 file under different identifiers (or keys, in pandas terminology)
use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DataFrames in one Feather file, so you can't group them.
use a database for storing/reading tables. PS: it might be slower for your use case.
PS: you may also want to check this comparison, especially if you want to store your data in a compressed format
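A minimal sketch of the first two options (the file names and DataFrame are made up; HDF5 needs PyTables installed and Feather needs pyarrow):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000, 10),
                  columns=[f"c{i}" for i in range(10)])  # stands in for one of the ~200K small tables

# Option 1: HDF5 store - many tables can share one .h5 file under different keys.
df.to_hdf("tables.h5", key="table_00001", mode="a")
back = pd.read_hdf("tables.h5", key="table_00001")

# Option 2: Feather - one DataFrame per file, but very fast read/write.
df.to_feather("table_00001.feather")
back = pd.read_feather("table_00001.feather")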

Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with the write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not possible since I am using Shark. It would be great to hear suggestions from the SO community on the various methods I could try to improve the write performance.
Thanks,
TM
I don't really use Shark since it is deprecated, but I believe it has the ability to read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a JavaSchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving IO speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). This is because, if you have a table with many rows, Parquet allows you to deserialize only the needed columns, since the data is stored in a columnar format. In addition, that deserialization may be faster than with row-oriented storage, since keeping data of the same type next to each other in memory can offer better compression rates. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL since Shark is no longer being supported, and I don't believe there is much difference in terms of syntax.
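If you do move to Spark SQL, the same thing in the Python API would look roughly like this; it is only a sketch, the path and DataFrame are placeholders, and newer Spark versions replace saveAsParquetFile/parquetFile with the write/read API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-results").getOrCreate()

# 'results' stands in for the ~10M-row result set computed by the UDFs.
results = spark.range(10_000_000).withColumnRenamed("id", "record_id")

# Columnar, compressed output; later reads only deserialize the columns they need.
results.write.mode("overwrite").parquet("hdfs:///tmp/results.parquet")  # hypothetical path

parquet_df = spark.read.parquet("hdfs:///tmp/results.parquet")
parquet_df.createOrReplaceTempView("results")  # query it with spark.sql(...) if desired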