Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations themselves are fast enough, but I am hitting a roadblock with write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not an option since I am using Shark. It would be great to hear suggestions from the SO community on methods I could try to improve the write performance.
Thanks,
TM

I don't really use Shark since it is deprecated, but I believe it can read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving IO speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). The reason is that Parquet is a columnar format, so if you have a table with many columns you only deserialize the ones a query actually needs. That deserialization may also be faster than with row-oriented storage, because keeping values of the same type next to each other allows better compression. And, as I said in my comments, it would be a good idea to upgrade to Spark SQL since Shark is no longer supported, and I don't believe there is much difference in terms of syntax.
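If you do move to Spark SQL, a minimal PySpark sketch of the same idea might look like this (the paths, table name and column names are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()

# Write the UDF results once as Parquet; the columnar layout plus compression
# usually makes this cheaper than writing a text or sequence-file table.
results = spark.table("udf_results")  # hypothetical table holding the ~10M result rows
results.write.mode("overwrite").parquet("/tmp/udf_results.parquet")

# On read, only the referenced columns are deserialized (column pruning),
# so narrow queries stay fast even on wide tables.
df = spark.read.parquet("/tmp/udf_results.parquet").select("id", "score")
df.show(5)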

Related

BigQuery loading Parquet, 30x space in BQ

A BigQuery noob here.
I have a pretty simple but large table coming from ClickHouse and stored in a Parquet file to be loaded into BQ.
Size: 50GB in Parquet, about 10B rows
Schema:
key: STRING (it was a UUID), type: STRING (cardinality of 4, e.g. CategoryA, CategoryB, CategoryC), value: FLOAT
Size in BigQuery: ~1.5TB
This is about a 30x increase.
Running SELECT 1 FROM myTable WHERE type='CategoryA' shows an expected billing of 500GB, which seems a rather large number given such a low cardinality.
It feels there are two paths:
making the query more efficient (how?)
or even better, making BQ understand the data more and avoid a 30x increase explosion.
Clustering and partitioning could come in handy for specific selection patterns; however, the 30x problem seems to remain, and the moment you run the wrong query the cost will just explode.
Any idea?
Parquet is a compressed format, so when the data is loaded it gets decompressed.
1.5TB is not huge in the BQ world.
Neither is the 500GB; the columns you touch in the WHERE clause are scanned as well.
What you need to do is reframe the data into smaller data sets.
Leverage partitioning and clustering as well.
Never use SELECT *.
Use materialized views for specific use cases, and turn on BI Engine for optimized queries (see the guide here).
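For the clustering idea, a rough sketch with the google-cloud-bigquery Python client (the dataset and table names are placeholders) could look like this:
from google.cloud import bigquery

client = bigquery.Client()

# Recreate the table clustered on the low-cardinality column so that
# filters on `type` scan far fewer bytes.
ddl = """
CREATE TABLE mydataset.myTable_clustered
CLUSTER BY type AS
SELECT key, type, value FROM mydataset.myTable
"""
client.query(ddl).result()

# Dry-run the filtered query to see how many bytes BigQuery now estimates.
job_config = bigquery.QueryJobConfig(dry_run=True)
job = client.query(
    "SELECT key, value FROM mydataset.myTable_clustered WHERE type = 'CategoryA'",
    job_config=job_config,
)
print(f"Estimated bytes processed: {job.total_bytes_processed}")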

Delta Lake: Performance Challenges

Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which is written to an Azure SQL Database table.
Approach 2: I implemented Delta Lake, where the output pandas dataframe is converted to a Spark dataframe and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert the pandas dataframe to a Spark dataframe is in milliseconds. But the performance compared to Approach 1 is bad: with Approach 1, I am able to finish in less than half the time required by Approach 2.
I tried different optimization techniques like ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none of them really improved the performance.
Please let me know if I have missed any performance tuning methods. If there are none, I am curious to know why Delta Lake did not perform better than the pandas + database approach. I am also happy to hear about any other, better approaches; for example, I came across dask.
Many Thanks for your Answers in advance.
Regards,
Chaitanya
You don't give enough information to answer your question. What exactly is not performant: the whole process of data ingest?
Z-ordering doesn't give you an advantage while you are writing the data into the Delta lake; it is more likely to slow you down. It gives you an advantage when you are reading the data back afterwards. Z-ordering by, for example, ID tries to store rows with the same ID in the same file(s), which enables Spark to use data skipping to avoid reading unnecessary data.
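For completeness, a hedged sketch of what Z-ordering looks like on the read side (the OPTIMIZE ... ZORDER BY command is available in newer Delta Lake releases and on Databricks; the table path and column name here are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite existing files so rows with nearby id values end up in the same
# files; later reads that filter on id can then skip most files.
spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (id)")

# A read that benefits from data skipping on the Z-ordered column.
df = spark.read.format("delta").load("/mnt/delta/events").where("id = 42")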
Also, how big is your data actually? If we are talking about a few GBs of data at the end, pandas and a traditional database will perform faster.
I can give you an example:
Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing that 4 GB and storing it somewhere, Spark will not necessarily perform faster, as I already mentioned.
But now consider that job running for a year, which gives you ~1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.
As a side note, you say you are reading in a bunch of JSON files, converting them to pandas and then to Delta Lake.
If there is no specific reason to do so, in approach 2 I would just use:
spark.read.json("path")
To avoid the step of converting from pandas to Spark dataframes.
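A minimal sketch of approach 2 without the pandas detour (the paths and the partition column are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw JSON files straight into a Spark dataframe...
df = spark.read.json("/data/raw/*.json")

# ...and append them to the partitioned Delta table in one step.
(df.write
   .format("delta")
   .mode("append")
   .partitionBy("event_date")
   .save("/mnt/delta/events"))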

BigQueryIO Read vs fromQuery

Say that in a Dataflow/Apache Beam program I am trying to read a table whose data is growing exponentially. I want to improve the performance of the read.
BigQueryIO.Read.from("projectid:dataset.tablename")
or
BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")
Will the performance of my read improve if I select only the required columns, rather than reading the entire table as above?
I am aware that selecting fewer columns reduces cost, but I would like to know how it affects read performance.
You're right that it will reduce cost compared to referencing all the columns in the SQL/query. Also, when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery. I'm not sure if you were aware of that or not.
Under the hood, whenever Dataflow reads from BigQuery, it actually calls the export API and instructs BigQuery to dump the table(s) to GCS as sharded files. Dataflow then reads these files in parallel into your pipeline. It does not read "directly" from BigQuery.
As such, yes, this might improve performance, because the amount of data that needs to be exported to GCS under the hood and read into your pipeline will be smaller, i.e. fewer columns = less data.
However, I'd also consider using partitioned tables, and then even think about clustering them too. Also, use WHERE clauses to further reduce the amount of data to be exported and read.
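For what it's worth, here is an equivalent sketch in the Beam Python SDK (the question uses the Java SDK); the project, dataset, column names and filter are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    # fromQuery-style read: only columns A and B (and only the filtered rows)
    # are exported to GCS and read into the pipeline, so less data moves end to end.
    rows = p | "ReadAB" >> beam.io.ReadFromBigQuery(
        query="SELECT A, B FROM `projectid.dataset.tablename` WHERE A IS NOT NULL",
        use_standard_sql=True,
    )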

Many small data table I/O for pandas?

I have many tables (about 200K of them), each small (typically fewer than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them and stores the final result (without keeping the contents of the tables in memory).
This is done many times over, and I can choose the storage format for these tables for the best (speed) performance.
What natively supported storage format would be the quickest?
IMO there are a few options in this case:
use an HDF store (AKA PyTables, H5), as @jezrael has already suggested. You can decide whether you want to group some or all of your tables and store them in the same .h5 file under different identifiers (or keys, in pandas terminology)
use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DFs in one Feather file, so you can't group them.
use a database for storing/reading tables. PS: it might be slower for your use case.
PS you may also want to check this comparison, especially if you want to store your data in a compressed format
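A quick sketch of options 1 and 2 (the file names and keys here are made up):
import pandas as pd

tables = {"t0001": pd.DataFrame({"a": range(1000)})}  # stand-in for the ~200K small tables

# Option 1: group all the small tables in a single .h5 store, one key per table.
with pd.HDFStore("tables.h5", mode="w") as store:
    for name, df in tables.items():
        store.put(name, df)  # read back later with store[name]

# Option 2: one Feather file per table; very fast, but no grouping in one file.
for name, df in tables.items():
    df.reset_index(drop=True).to_feather(f"{name}.feather")
df = pd.read_feather("t0001.feather")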

Improve Apache Hive performance

I have 5GB of data in my HDFS sink. When I run any query on Hive it takes more than 10-15 minutes to complete. The number of rows I get when I run,
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know if creating an index on the table will give any performance improvement. Also, are there other ways to tell Hive to use only a certain amount of data or number of rows, so as to get results faster? I am OK even if the queries run on a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of the data (using LIMIT) isn't really helpful, as Hive still scans the whole data set before limiting the output.
You can try using the RCFile/ORCFile format for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
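If you want to try ORC, here is a hedged sketch of rebuilding the table through HiveServer2 with the PyHive client (host, port and table names are assumptions):
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Copy the existing table into an ORC-backed table; queries against the ORC
# copy should be noticeably faster than against the text-format original.
cursor.execute("CREATE TABLE table_name_orc STORED AS ORC AS SELECT * FROM table_name")

cursor.execute("SELECT count(*) FROM table_name_orc")
print(cursor.fetchone())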
Depending on the data you are querying, you can get gains by using different file formats like ORC or Parquet. What kind of data are you querying: is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
With that said, you are working with 5 GB of data (but data always grows! someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out.
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are only interested in a subset of the columns. ORC with Hive 0.12 comes with predicate pushdown (PPD), which helps you discard blocks while running queries, using the metadata it stores for each column.
We did an implementation on top of Hive to support bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.
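As an illustration of the parameter-tuning point, a hedged sketch (again via PyHive; the specific values are purely illustrative and depend on your cluster):
from pyhive import hive

cursor = hive.Connection(host="localhost", port=10000).cursor()

# Lower the bytes-per-reducer threshold so more reducers are launched in parallel...
cursor.execute("SET hive.exec.reducers.bytes.per.reducer=134217728")
# ...or pin the reducer count explicitly for a known workload.
cursor.execute("SET mapreduce.job.reduces=8")

cursor.execute("SELECT count(*) FROM table_name")
print(cursor.fetchone())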