I have 5GB of data in my HDFS sink. When I run any query on Hive it takes more than 10-15 minutes to complete. The number of rows I get when I run,
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know if creating an index on the table will give any performance improvement. Also, is there any way to tell Hive to use only a certain amount of data or rows, so I get results faster? I am OK with the queries running on a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of data (using LIMIT) isn't really helpful, as Hive still scans the whole dataset before limiting the output.
You can try using the RCFile/ORCFile format for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
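For example, a minimal sketch of copying an existing text-format table into an ORC table (the _orc table name is just illustrative):
-- Build a columnar, compressed copy of the table and query that instead.
CREATE TABLE table_name_orc STORED AS ORC AS SELECT * FROM table_name;
SELECT count(*) FROM table_name_orc;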
Depending on the data you are querying, you can get gains by using different file formats such as ORC or Parquet. What kind of data are you querying: structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
With that said, you are working with 5GB of data (though data always grows; someday it could be TBs), which is pretty small, so you can still get by with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks is making it better with their Stinger initiative; you might want to check that out.
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with predicate pushdown (PPD), which helps you discard blocks while running queries by using the metadata it stores for each column.
We did an implementation on top of Hive to support bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
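For comparison, a hedged sketch of how later Hive releases expose this natively, with predicate pushdown enabled and bloom filters declared as ORC table properties (the table and column names here are hypothetical):
SET hive.optimize.ppd = true;   -- predicate pushdown (on by default)

CREATE TABLE events_orc (
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns" = "user_id");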
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.
Related
A BigQuery noob here.
I have a pretty simple but large table coming from ClickHouse and stored in a Parquet file to be loaded into BQ.
Size: 50 GB in Parquet, about 10B rows
Schema:
key: STRING (it was a UUID), type: STRING (cardinality of 4, e.g. CategoryA, CategoryB, CategoryC), value: FLOAT
Size in BigQuery: ~1.5 TB
This is about a 30x increase.
Running SELECT 1 FROM myTable WHERE type = 'CategoryA' shows an expected billing of 500 GB, which seems rather large given such a low cardinality.
It feels there are two paths:
making the query more efficient (how?)
or, even better, making BQ understand the data better and avoid the 30x size explosion.
Clustering and partitioning could come in handy for specific selection patterns; however, the 30x problem still remains, and the moment you start running the wrong query the cost will just explode.
Any idea?
Parquet is a compressed format, so the data gets decompressed when it is loaded.
1.5 TB is not huge in the BQ world, and neither is 500 GB. Note that the columns you touch in the WHERE clause are scanned (and billed) as well.
What you need to do is reframe the problem into smaller data sets:
Leverage partitioning and clustering.
Never use * in SELECT.
Use materialized views for specific use cases, and turn on BI Engine for optimized queries.
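As a hedged sketch (the dataset name below is illustrative), clustering a copy of the table on the filter columns lets BigQuery prune storage blocks, so a query on one category no longer pays for the full column scan:
-- Clustered copy of the table; BigQuery sorts storage blocks by these columns.
CREATE TABLE mydataset.myTable_clustered
CLUSTER BY type, key
AS SELECT key, type, value FROM mydataset.myTable;

-- With on-demand pricing, bytes billed drop when the filter prunes clustered blocks.
SELECT key, value
FROM mydataset.myTable_clustered
WHERE type = 'CategoryA';
Note that the upfront estimate may still show the full column size, since block pruning for clustered tables is only applied when the query actually runs.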
I want to join a table with 1TB of data to another table which also has 1TB of data in Hive. Could you please suggest some best practices to follow?
I want to know how performance will improve in Hive if both tables are partitioned. Basically, how does MapReduce work in this case?
Below are a few performance-improvement rules to follow when dealing with large data:
Tez Execution Engine in Hive or Hive on Spark
Use the Tez execution engine (Hortonworks) to increase the performance of your Hive queries. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs of general data-processing tasks. You can consider it a much more flexible and powerful successor to the MapReduce framework.
Or
Use Hive on Spark (Cloudera)
In addition, Tez offers developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads. It allows those data-access applications to work with petabytes of data over thousands of nodes.
SET hive.execution.engine=tez;
SET hive.execution.engine=spark;
Usage of Suitable File Format in Hive
Use a suitable file format: if we use an appropriate file format for the data, it will drastically increase query performance. For query performance, the ORC (Optimized Row Columnar) file format is a very good choice, since it stores data in a more optimized way than the other file formats.
To be more specific, ORC can reduce the size of the original data by up to 75%, which also increases data-processing speed. Compared to the Text, Sequence and RC file formats, ORC shows better performance. It stores row data in groups called stripes, along with a file footer, and this layout improves performance when Hive processes the data.
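A minimal sketch of declaring an ORC table with an explicit compression codec (the table, columns and choice of SNAPPY are illustrative; ZLIB is the default):
CREATE TABLE sales_orc (
  id BIGINT,
  amount DOUBLE,
  sale_country STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");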
Hive Partitioning
Without partitioning, Hive reads all the data in the table's directory and then applies the query filters to it. Because all the data has to be read, this is slow and expensive.
Users frequently need to filter the data on specific column values. To apply partitioning in Hive, users need to understand the domain of the data they are analyzing.
With partitioning, all the entries for a given value of the partition column are segregated and stored in their own partition. When a query fetches values from the table, only the required partitions are read, which reduces the time the query takes to return a result.
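A minimal sketch, with hypothetical table and column names, of a partitioned table and a query that reads only one partition:
CREATE TABLE orders_part (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;

-- Only the directory for this partition value is scanned.
SELECT count(*) FROM orders_part WHERE order_date = '2015-06-01';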
Bucketing in Hive
Suppose there is a huge dataset and, even after partitioning on a particular field or fields, the partitions are still larger than expected, yet we want to manage the data in smaller parts. To solve this, Hive offers the concept of bucketing, which allows the user to divide table data sets into more manageable parts.
With bucketing, the user also controls how many of these parts (buckets) the data is split into.
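A hedged sketch of bucketing on a likely join key (the names, and the choice of 64 buckets, are hypothetical); older Hive versions also need hive.enforce.bucketing so that inserts actually split the data into the declared buckets:
SET hive.enforce.bucketing = true;

CREATE TABLE orders_bucketed (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 64 BUCKETS
STORED AS ORC;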
Vectorization In Hive
Vectorized query execution improves the performance of operations such as scans, aggregations, filters, and joins by performing them in batches of 1024 rows at a time instead of one row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization in Hive (CBO)
Before submitting a query for final execution, Hive optimizes its logical and physical execution plan. Historically, these optimizations were not based on the cost of the query.
Cost-based optimization (CBO), a more recent addition to Hive, performs further optimizations based on query cost. This results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.
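For example, a minimal sketch of collecting those statistics (the table name is hypothetical; for a partitioned table you would add a PARTITION(...) clause):
-- Basic table-level statistics (row counts, sizes, etc.)
ANALYZE TABLE my_table COMPUTE STATISTICS;
-- Column-level statistics used by the cost-based optimizer
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;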
Hive Indexing
Indexing is one of the best ways to increase query performance. Creating an index on the original table builds a separate index table that acts as a reference.
As we know, a Hive table can have a very large number of rows and columns. Performing queries on only some columns without an index can take a long time, because the query ends up reading the whole table.
The major advantage of indexing is that a query on an indexed table does not have to scan all the rows: it checks the index first and then goes to the relevant data to perform the operation.
Hence, maintaining indexes makes it easier for a Hive query to look into the index first and then perform the needed operations in less time, and time is ultimately the factor everyone focuses on.
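As a hedged sketch of the index DDL, using hypothetical table, column and index names (note that Hive indexes were removed in Hive 3.0 in favour of materialized views and the built-in indexes of columnar formats such as ORC):
CREATE INDEX idx_customer_id
ON TABLE orders_bucketed (customer_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate (and later refresh) the index table.
ALTER INDEX idx_customer_id ON orders_bucketed REBUILD;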
Say, in a Dataflow/Apache Beam program, I am trying to read a table whose data is growing exponentially. I want to improve the performance of the read.
BigQueryIO.Read.from("projectid:dataset.tablename")
or
BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")
Will the performance of my read improve if I select only the required columns from the table, rather than the entire table, as in the above?
I am aware that selecting a few columns reduces cost, but I would like to know about the read performance in the above.
You're right that selecting only the required columns, rather than referencing all the columns in the SQL/query, will reduce cost. Also, when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery. I'm not sure if you were aware of that or not.
Under the hood, whenever Dataflow reads from BigQuery, it actually calls its export API and instructs BigQuery to dump the table(s) to GCS as sharded files. Then Dataflow reads these files in parallel into your pipeline. It does not read "directly" from BigQuery.
As such, yes, this might improve performance, because the amount of data that needs to be exported to GCS under the hood and read into your pipeline will be smaller, i.e. fewer columns = less data.
However, I'd also consider using partitioned tables, and perhaps clustering them too. Also, use WHERE clauses to further reduce the amount of data to be exported and read.
I want to source a few hundred gigabytes from a database via JDBC and then process it using Spark SQL. Currently I do some partitioning of that data and process it in batches of a million records. The thing is, I would also like to apply some deduplication to my dataframes, so I was going to drop the idea of separate batch processing and try to process those hundreds of gigabytes as one dataframe, partitioned accordingly.
The main concern is: how will .distinct() work in such a case? Will Spark SQL first try to load ALL the data into RAM and then apply deduplication, involving many shuffles and repartitioning? Do I have to ensure that the cluster has enough RAM to hold that raw data, or can it help itself with HDD storage (thus killing the performance)?
Or maybe I should do it without Spark: move the data to the target storage, apply distinct counts there, detect the duplicates and get rid of them?
Spark SQL does NOT use predicate pushdown for distinct queries, meaning that the processing to filter out duplicate records happens at the executors rather than at the database. So your assumption that shuffles happen at the executors to process distinct is correct.
In spite of this, I would still advise you to go ahead and perform the de-duplication in Spark, rather than build a separate arrangement for it. My personal experience with distinct has been more than satisfactory; it has always been the joins that push my buttons.
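In case it helps, a minimal Spark SQL sketch of the deduplication (the table names are made up); Spark shuffles rows by a hash of all selected columns and spills shuffle data to local disk when it does not fit in memory, so the whole dataset does not have to fit in RAM at once:
-- More shuffle partitions mean smaller per-task shuffle blocks; tune for your cluster.
SET spark.sql.shuffle.partitions=400;

-- Full-row deduplication, equivalent to DataFrame.distinct().
CREATE TABLE events_dedup USING parquet AS
SELECT DISTINCT * FROM events_raw;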
I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with the write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not possible since I am using Shark. It would be great to hear suggestions from the SO community on the various methods I can try to improve the write performance.
Thanks,
TM
I don't really use Shark since it is deprecated, but I believe it can read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a JavaSchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving I/O speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). This is because, if you have a table with many columns, Parquet allows you to deserialize only the needed columns, since it is stored in a columnar format. In addition, that deserialization may be faster than with row-oriented storage, since storing data of the same type next to each other in memory can offer better compression rates. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL, since Shark is no longer supported and I don't believe there is much difference in terms of syntax.