Apache Pig - best Hive file formats - hive

Could someone explain which Hive file formats would be efficient to use from a Pig script via HCatalog?
I would like to understand which Hive file formats will be efficient: we currently have a Hive table partitioned by date, and the underlying files are sequence files.
Reading 80 days of data creates around 70,000 mappers, which is very high: at about 9 GB per day that is roughly 720 GB in total, or only ~10 MB per mapper. Raising the map split size to 2 GB did not reduce the count much.
So, instead of sequence files, I am looking for other options that will reduce the number of mappers. The size of the data per day is 9 GB.
Are there any suggestions or some inspiration?
Thank you.

To my knowledge, ORC is the most suitable file format for Hive: it has a high compression ratio, works efficiently on large amounts of data, and is also faster to read. ORC stores data as compressed columns, which leads to smaller disk reads. The columnar format is also ideal for Hive's vectorization optimizations.
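As a sketch of the conversion (table and column names here are hypothetical), an existing sequence-file table can be copied partition by partition into an ORC-backed table; HCatalog readers such as Pig's HCatLoader pick up the new storage format from the table metadata automatically:

```sql
-- Hypothetical sketch: create an ORC copy of the partitioned
-- sequence-file table and backfill one date partition.
CREATE TABLE events_orc (event_id STRING, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Copy one date partition at a time from the old table.
INSERT OVERWRITE TABLE events_orc PARTITION (dt = '2015-01-01')
SELECT event_id, payload
FROM events_seq
WHERE dt = '2015-01-01';
```

With ORC's larger stripes and compression, far fewer input splits (and therefore mappers) are generated for the same 80 days of data.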

Related

How to store and aggregate data in ~300m JSON objects efficiently

I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.
The data it contains needs to be aggregated every day based on different properties.
We would like to build it with Apache Spark on Azure Databricks, because the size of the data will grow, we cannot scale this process vertically any more (it currently runs on a single Postgres server), and we also need something that is cost-effective.
Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
These are the options as I see:
Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data
Store the data in Azure Blob Storage in JSON format - we may hit the limit on the number of files that can be stored, and it also seems inefficient to read so many files
Store the JSON data in big chunks, e.g. 100,000 JSON objects per file - it could be slow to delete/reinsert when the data changes
Convert the data to CSV or some binary format with a fixed structure and store it in blobs in big chunks - changing the format would be a challenge, but that would rarely happen, and CSV/binary is quicker to parse
Any practical advice would be really appreciated. Thanks in advance.
There are multiple factors to consider:
If you read the data daily, then I strongly suggest storing it in Parquet format in Databricks. If it is not accessed daily, store it in Azure storage itself (compute cost will be minimised).
If the JSON data needs to be flattened, do all the data manipulation and write it into Delta tables, running OPTIMIZE on them.
If the 30-day retention is really mandatory, be cautious with file formats, because the data will grow quickly day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days.
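As a hedged sketch of the Parquet/Delta route (paths and column names below are made up for illustration), the daily JSON batch could be flattened into a date-partitioned Delta table and then compacted, which avoids both the many-small-files problem and the slow delete/reinsert problem:

```sql
-- Hypothetical Databricks SQL sketch: flatten raw JSON into a Delta
-- table partitioned by ingest date, then compact the small files.
CREATE TABLE IF NOT EXISTS events_delta (
  event_id    STRING,
  property    STRING,
  value       DOUBLE,
  ingest_date DATE
)
USING DELTA
PARTITIONED BY (ingest_date);

-- Append one day's batch, read directly from the raw JSON files.
INSERT INTO events_delta
SELECT event_id, property, value, current_date()
FROM json.`/mnt/raw/events/2023-01-01/`;

-- Compact the partition that was just written.
OPTIMIZE events_delta WHERE ingest_date = current_date();
```

Dropping expired days then becomes a cheap partition delete (`DELETE FROM events_delta WHERE ingest_date < ...`) rather than rewriting large chunk files.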

BigQuery loading Parquet, 30x space in BQ

A BigQuery noob here.
I have a pretty simple but large table coming from clickhouse and stored in a parquet file to be loaded into BQ.
Size: 50GB in parquet, About 10B rows
Schema:
key STRING (it was a UUID), type STRING (cardinality of 4, e.g. CategoryA, CategoryB, CategoryC), value FLOAT
Size in BigQuery: ~1.5TB
This is about a 30x increase.
Running SELECT 1 FROM myTable WHERE type = 'CategoryA' shows an expected billing of 500GB, which seems a rather large number given such a low cardinality.
It feels there are two paths:
making the query more efficient (how?)
or even better, making BQ understand the data more and avoid a 30x increase explosion.
Clustering and partitioning could come in handy for specific selection patterns; however, the 30x problem still remains, and the moment you start running the wrong query it will simply explode in cost.
Any idea?
Parquet is a compressed format, so the data is decompressed when it is loaded.
1.5TB is not huge in the BQ world, and neither is the 500GB. Note that the columns you touch in the WHERE clause are scanned as well.
What you need to do is reframe the problem into smaller data sets.
Leverage partitioning and clustering as well.
Never use * in SELECT.
Use materialized views for specific use cases, and turn on BI Engine for optimized queries; see a guide here.
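One concrete way to leverage clustering here (dataset and table names are illustrative) is to rebuild the table clustered on the low-cardinality column, so that queries filtering on type only bill for the storage blocks that match:

```sql
-- Hypothetical sketch: recreate the table clustered by `type` so a
-- WHERE type = '...' filter prunes blocks instead of scanning the
-- whole column.
CREATE TABLE mydataset.myTable_clustered
CLUSTER BY type
AS
SELECT key, type, value
FROM mydataset.myTable;

-- With clustering, the bytes billed for this query drop to roughly
-- the blocks that actually contain CategoryA rows.
SELECT key, value
FROM mydataset.myTable_clustered
WHERE type = 'CategoryA';
```

Note that clustering reduces bytes actually billed after pruning; the dry-run estimate may still show the full column size.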

S3 partition (file size) for efficient Athena queries

I have a pipeline that loads daily records into S3. I then use an AWS Glue Crawler to create partitions to facilitate AWS Athena queries. However, one partition's data is much larger than the others.
S3 folders/files are displayed as follows:
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/00/00/2019-00-00.parquet.gzip') 7.8 MB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/11/2019-01-11.parquet.gzip') 29.8 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/12/2019-01-12.parquet.gzip') 28.5 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/13/2019-01-13.parquet.gzip') 29.0 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/14/2019-01-14.parquet.gzip') 43.3 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/15/2019-01-15.parquet.gzip') 139.9 KB
with the file size displayed at the end of each line. Note that 2019-00-00.parquet.gzip contains all records before 2019-01-11, which is why its size is large. I have read this, and it says that "If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit."
So, I wonder whether I should split 2019-00-00.parquet.gzip into smaller parquet files with different partitions. For example,
key='database/table/2019/00/00/2019-00-01.parquet.gzip',
key='database/table/2019/00/00/2019-00-02.parquet.gzip',
key='database/table/2019/00/00/2019-00-03.parquet.gzip', ......
However, I suppose this partitioning is not so useful, as it does not reflect when the old records were stored. I am open to all workarounds. Thank you.
If the full size of your data is less than a couple of gigabytes in total, you don't need to partition your table at all. Partitioning small datasets hurts performance much more than it helps. Keep all the files in the same directory; deep directory structures in unpartitioned tables also hurt performance.
For small datasets you'll be better off without partitioning as long as there aren't too many files (try to keep it below a hundred). If you for some reason must have lots of small files you might get benefits from partitioning, but benchmark it in that case.
When the size of the data is small, like in your case, the overhead of finding the files on S3, opening, and reading them will be higher than actually processing them.
If your data grows to hundreds of megabytes you can start thinking about partitioning, and aim for a partitioning scheme where partitions are around a hundred megabytes to a gigabyte in size. If there is a time component to your data, which there seems to be in your case, time is the best thing to partition on. Start by looking at using year as partition key, then month, and so on. Exactly how to partition your data depends on the query patterns, of course.
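If the data does grow to the point where year/month partitioning pays off, an Athena CTAS statement (database, table, and column names here are hypothetical) can rewrite the oversized backlog file into month-sized partitions in one pass:

```sql
-- Hypothetical Athena sketch: split the oversized backlog file into
-- year/month partitions with a single CTAS rewrite.
CREATE TABLE mydb.table_by_month
WITH (
  format = 'PARQUET',
  write_compression = 'GZIP',
  external_location = 's3://bucket/database/table_by_month/',
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT col_a,
       col_b,
       year(record_date)  AS year,
       month(record_date) AS month
FROM mydb.table_raw;
```

The partition columns must come last in the SELECT list; Athena then writes one S3 prefix per year/month combination, so queries with a date predicate read only the relevant partitions.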

Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with the write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not possible since I am using Shark. It would be great to know the suggestions from the SO community on the various methods which I can try to improve the write performance.
Thanks,
TM
I don't really use Shark since it is deprecated, but I believe it can read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a JavaSchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving IO speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). This is because, with a table of many rows, Parquet allows you to deserialize only the needed columns, since the data is stored in a columnar format. In addition, that deserialization may be faster than with normal storage, because storing data of the same type next to each other in memory can offer better compression rates. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL, since Shark is no longer supported, and I don't believe there is much difference in terms of syntax.

Improve apache hive performance

I have 5GB of data in my HDFS sink. When I run any query on Hive it takes more than 10-15 minutes to complete. The number of rows I get when I run,
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know if creating an index on the table will give any performance improvement. Also, are there any other ways to tell Hive to use only a certain amount of data or rows, so as to get results faster? I am OK even if the queries run on a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of the data (using LIMIT) isn't really helpful, as Hive still scans the whole data set before limiting the output.
You can try using the RCFile/ORC format for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
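For example (table names hypothetical), converting an existing text-file table to ORC is a one-statement copy, after which the same queries can be pointed at the ORC table:

```sql
-- Hypothetical sketch: materialize the text/sequence-file table as
-- ORC and run the queries against the copy.
CREATE TABLE table_name_orc
STORED AS ORC
AS SELECT * FROM table_name;

-- Same query as before, now reading columnar, compressed data.
SELECT count(*) FROM table_name_orc;
```

Since ORC stores per-stripe statistics, even a full count touches far less disk than scanning the raw text files.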
Depending on the data you are querying, you can get gains by using different file formats such as ORC or Parquet. What kind of data are you querying: is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by loading your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
That said, you are working with 5GB of data (but data always grows; someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out:
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are interested in only a subset of the columns. ORC in Hive 0.12 comes with predicate pushdown (PPD), which will help you discard blocks while running queries, using the metadata it stores for each column.
We did an implementation on top of Hive to support bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
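For reference, later Hive releases (1.2+) expose ORC bloom filters as standard table properties, so a similar effect can now be achieved declaratively (table and column names below are hypothetical):

```sql
-- Hypothetical sketch: per-column ORC bloom filters via table
-- properties, built into Hive 1.2+ (no custom implementation needed).
CREATE TABLE events_orc (
  user_id    STRING,
  event_type STRING,
  ts         BIGINT
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'user_id',
  'orc.bloom.filter.fpp'     = '0.05'
);
```

Equality predicates on user_id can then skip whole stripes whose bloom filter rules the value out.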
What is average number of Mapper/Reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.