I’m trying to join two tables in BigQuery but it hits the 6 hour query execution time limit. For context:
The primary table is ~150M rows; the joined table is ~250M rows.
It’s a simple SELECT * FROM a LEFT JOIN b with equality comparison on three columns (as ORs). These three columns are strings ranging from 10-80 or so characters in length.
The same query with the same schemas for those tables, but with a sample of only 100K rows each, works quickly and without issue.
Tried: Running with fewer join clauses (3 instead of 9), enabling large results
Expected: BQ to render a result, eventually
Actual: Timeout
Suggestions appreciated!
Table A
---------
col1, col2,Adate,qty
Table B
-------
col2,cost,Bdate
The table sizes are as follows:
A: 1 million
B: 700k
Consider this query:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty)*COLLECT_LIST(cost)[0] price
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.bdate;
The above hive query takes more than 3 hrs on a cluster of 4 slaves(8GB memory,100 GB disk) and 1 master(16 GB memory, 100 GB disk)
Can this query be optimized? If yes, where can the optimization be possible?
Use Tez and mapjoin.
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --adjust for your smaller table to fit in memory
set hive.execution.engine=tez;
Also this computation is not memory-efficient:
SUM(qty)*COLLECT_LIST(cost)[0] price
COLLECT_LIST will collect all cost values in the group into non unique(contains values from ALL rows in the group) and unordered (yes, unordered, because you have no any distribute + sort before collect_list) array. This array can be big enough (the number of elements = the number of rows in the group), depending on your data, then you are taking [0] element, it means that you are picking just any random cost from the group. Does it make any sense to collect array to get just any random element? Use min() or max instead. If it does not matter which cost should be taken, then min(cost) or max(cost) or some other scalar function will consume less memory. You can use first_value analytic function (may require sub-query, but it will be memory-efficient also)
I will try to give you some advices to improve query performance in Hive.
Check the execution engine you are using
set hive.execution.engine;
If you execution engine is mr, rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce.
set hive.execution.engine=tez;
Join queries are computationally expensive and can be slow, especially when you’re joining three or more tables, or if you’re working with very large data.
One strategy that can be used to remedy this problem is to join the data in advance and store the pre-joined result in a separate table, which you can then query.
this is one way of denormalizing a normalized database, to make it easier to run analytic queries.
This approach of pre-joining tables has some costs, but it can make analytic queries easier to write and faster to run.
There are some other techniques for improving Hive query performance
Join table ordering (Largest table last)
As with any type of tuning, it is important to understand the internal working of a system. When Hive executes a join,
it needs to select which table is streamed and which table is cached.
Hive takes the last table in the JOIN statement for streaming, so we need to ensure that this streaming table is largest among the two.
A: 1 million B: 700k
Hence, when these two tables are joined it is important that the larger table comes last in the query.
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables,
BucketedTables
bucketing-in-hive
Partitioning
Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in the hive can have one or more partition keys to identify a particular partition.
Using partition it is easy to do queries on slices of the data.
apache-hive-partitions
We are having a date partitioned table with 5 yrs(with daily incremental load) data running into millions & millions of records. To improve the performance, thinking of splitting the table based on a non-date field(id) as all the queries will include a where clause on that column(id). And also partition each of split tables with date partition so that we can query on a smaller dataset with a date range. we will not be using wildcarded table as we will know the id and planning to append that to the table and run a query against that specific table. Need to know whether that would be good option to pursue to improve performance and reduce the query cost.
[Update]: We went ahead and split the tables based on id column(tablename_id) and made the table date partitioned and clustered with 4 other columns(max supported) which are commonly used in queries. With that we were able to get a better performance and also reduced the data accessed for each query. Based on the testing looks like it is a good option to puruse as long as wildcarded querying of tables is avoided and till Bigquery supports partitioning based on non-date/non-datetime columns.
We split the tables based on the id columns creating multiple tables. Each of the split tables are partition on date column. Apart from that we had it as clustered table on 4 other columns as needed. Find below the performance on a sample dataset. Old Table(UserInfo) has more than 500,000 rows. The stats we captured is for a given date range and id, the performance of old table(non split/combined table) and split table(split based on the ID) in terms of amount of data processed and the time taken for the same query.
This is not possible. BigQuery doesn't support partitioned on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.
I am trying to execute a query with grouping on 26 columns. Data is stored in S3 in parquet format partitioned by day. Redshift Spectrum query is returning below error. I am not able to find any relevant documentation in aws regarding this.
Request ran out of memory in the S3 query layer
Total Number of rows in table : 770 Million
Total size of table in Parquet format : 45 GB
Number of records in each partition : 4.2 Million
Million Redshift configuration : Single node dc2.xlarge
Attached is the table ddl
Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
Also, two further suggestions:
Recommend always using at least 2 nodes of Redshift. This gives
you a free leader node and allows your compute nodes to use all
their RAM for query processing.
Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table consider hashing the columns into a single value and grouping on that. Here's an example:
SELECT MD5(ws_sold_date_sk
||ws_sold_time_sk
||ws_ship_date_sk
||ws_item_sk
||ws_bill_customer_sk
||ws_bill_cdemo_sk
||ws_bill_hdemo_sk
||ws_bill_addr_sk
||ws_ship_customer_sk
||ws_ship_cdemo_sk
||ws_ship_hdemo_sk
||ws_ship_addr_sk
||ws_web_page_sk
||ws_web_site_sk
||ws_ship_mode_sk)
, COUNT(*)
FROM spectrum.web_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;
My customer has a computing scenario that some data were stored in Hive in cluster A and some other data were stored in Hbase in cluster B, then they want to do some join operation with the 2 kinds of tables.
so is there a way to let me do it in Hive like this:
select hive_table.col1, hbase_table.col2 from hive_table inner join hbase_table on hive_table.id = hbase_table.id
hive table and hbase table exist in the different cluster.
Yes, it is possible to join Hive table(table1-assuming it is in HDFS) with Hbase table(table2-hbase table). But it is not recommended because HBase is not efficient with full table scans, when you try to join. Best way to do it is convert the Hbase table to parquet or AVRO. Now table1 form Hive and table2 from Hbase both resides in HDFS which makes it efficient.
In nutshell, we can join any tables that Hive metastore have stored. It doesn't matter whether the Hive tables are built over HDFS, Hbase. As long as we have schema in hive metastore we can join them.
Assuming Hive metastore contains the schemas for both tables.
select hive_table.col1, hbase_table.col2 from hive_table inner join hbase_table on (hive_table.id = hbase_table.id);