I use JDBC from Scala to get data from Hive. In Hive I have a simple table with 20 rows in the following format:
user_id, movie_title, rating, date
To group users by movie I do 3 nested select requests:
1) select distinct user_id
2) for each user_id:
select distinct movie_title //select all movies that user saw
3) for each movie_title:
select distinct user_id //select all users who saw this movie
On a local Hive table with 20 rows, these nested queries take 26 minutes! Hive returns the first user_id after a minute! Questions:
1) Why is Hive so slow?
2) Is there any way to optimize the 3 nested selects?
Hive uses the MapReduce framework to process queries. There is a decent amount of constant overhead attached to every MapReduce job you run. Each of your queries (and you have a fair number of them because of the nesting) has to spin up a MapReduce job, and that takes time regardless of how much data you have.
Newer versions of Hive are much more responsive, but still not ideal for this type of selection.
Your best bet is to try to minimize the number of queries by using group by or something similar.
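For instance, Hive's collect_set aggregate can answer both lookups in one job each. This is just a sketch, assuming the source table is named ratings and has the columns from the question:

-- All movies each user saw, one MapReduce job for the whole table:
SELECT user_id, collect_set(movie_title) AS movies_seen
FROM ratings
GROUP BY user_id;

-- All users who saw each movie, one more job:
SELECT movie_title, collect_set(user_id) AS viewers
FROM ratings
GROUP BY movie_title;

That is two jobs total instead of one per user and one per movie, so the constant MapReduce overhead is paid twice rather than dozens of times.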
Create two tables by inserting records into them based on SELECT DISTINCT queries: the first containing the distinct (user, movie rated) pairs, the second the distinct (movie, user who rated it) pairs. These two tables can then be joined to get the desired group-by result.
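One reading of this suggestion, as a sketch, assuming the same ratings table as in the answer above:

-- distinct (user, movie) pairs, keyed by user:
CREATE TABLE user_rated AS
SELECT DISTINCT user_id, movie_title FROM ratings;

-- distinct (movie, user) pairs, keyed by movie:
CREATE TABLE movie_rated AS
SELECT DISTINCT movie_title, user_id FROM ratings;

-- e.g. join them to list, per movie, every user who saw it:
SELECT mr.movie_title, mr.user_id
FROM movie_rated mr
JOIN user_rated ur
  ON mr.user_id = ur.user_id AND mr.movie_title = ur.movie_title;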
I am transitioning from SQL Server to BigQuery and noticed that the TOP function in BigQuery is only allowed in aggregate queries. Therefore the code below would not work:
SELECT TOP 5 * FROM TABLE
This is a habit I've had when trying to learn new tables and get more information on the data. Is there another way to select just a few rows from a table? The following select-all query works, but it is incredibly inefficient and takes a long time to run on large tables:
SELECT * FROM TABLE
In BigQuery, you can use LIMIT as in:
SELECT t.*
FROM TABLE t
LIMIT 5;
But I caution you to be very careful with this. BigQuery charges by the amount of data scanned in the columns you reference, not by the number of rows returned, so LIMIT does not reduce the cost. On a large table, such a query can be quite expensive.
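Since pricing follows the columns scanned, one way to keep an exploratory query cheap is to name only the columns you want to inspect instead of using *. A sketch, with hypothetical column names:

SELECT user_id, created_at
FROM TABLE
LIMIT 5;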
You can also go into the BigQuery GUI, navigate to the table, and click on "Preview". The preview functionality is free.
As Gordon Linoff mentioned, using the LIMIT clause in BigQuery can be very expensive when used with big tables. To make exploratory queries more cost-effective, BigQuery now supports the TABLESAMPLE operator; see also Using table sampling.
Sampling returns a variety of records while avoiding the costs associated with scanning and processing an entire table.
Query example:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (2 PERCENT)
If you are querying e.g. table views, or TABLESAMPLE SYSTEM is not working for other reasons, what you can do is use e.g. [...] WHERE RAND() < 0.05 to get 5% of the results, randomly selected. Make sure to put it at the end of your query, in the WHERE clause.
This also works with table views and when you are not the owner of a table. :)
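For example (the view name is hypothetical):

SELECT *
FROM dataset.my_view
WHERE RAND() < 0.05

Keep in mind that, unlike TABLESAMPLE, the RAND() filter is applied after the rows are read, so it shrinks the result set but not the bytes scanned, and therefore not the cost.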
I'm querying a table, let's call it 'customer_table', which has partitions by day in 'yyyyMMdd' format.
When looking at the partitions with 'show partitions customer_table', I can see there is data for yesterday and the day before.
However, when I run a query like:
select customer, customer_joined_date, name, address, city
from customer_table ct
left join addresses addr on ct.cust_id = addr.cust_id
where ct.customer_joined_date >= date_format(date_sub(current_date, 7), 'yyyyMMdd')
This query does not include the data from yesterday or the day before.
My instinct is that the previous 2 days' partitions have some kind of lock which prevents querying while data is still streaming into them.
Can you suggest what's happening? Is there an environment parameter I can set so the query ignores 'locks'?
Could you run the query below just to check that there's no funky stuff going on in the join:
select max(customer_joined_date)
from customer_table ct
where ct.customer_joined_date >= date_format(date_sub(current_date,7),'yyyyMMdd');
If you still don't see the data from the latest partitions, you can try gathering stats for the latest partitions once and see if that results in any change. Below is an example of the syntax.
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09') COMPUTE STATISTICS;
The below query scans 100 MB of data:
select * from table where column1 = 'val' and partition_id = '20190309';
However, the below query scans 15 GB of data (there are over 90 partitions):
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above, select max(partition_id) from table, requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens files of the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar subquery with one over the hidden "$partitions" table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one @PiotrFindeisen pointed out, and it has to do with the way that queries are planned and executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today is to run two separate queries: one to get the max partition_id, and a second one with the inlined value.
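Concretely, something like this, where the literal in the second query is whatever the first one returned (here reusing the value from the question):

-- query 1: cheap, metadata-only lookup of the newest partition
select max(partition_id) from "table$partitions";

-- query 2: re-run the real query with that value inlined by hand
select * from table where column1 = 'val' and partition_id = '20190309';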
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release, so I'd suggest not using the solution described in the Slow aggregation queries for partition keys section below in any production systems, since Athena 'transparently' updates Presto versions. I ended up just going with the naive SELECT max(partition_date) ... query, but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their Presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets, for when you only need to look back over a few partitions' worth of data for a match on the max. However, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As @Dain noted above, there are really two issues: the first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
    SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date,
           max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
    FROM information_schema.__internal_partitions__
    WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
    GROUP BY partition_number
)
WHERE
    -- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using Presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can use the datetime functions to limit how far back the query will look for the actual max, so that the amount of data scanned is limited:
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but I just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with join optimizations being able to take advantage of partitions.
I am creating a data pipeline which writes data into a BigQuery table every minute and eventually exceeds the quota limit. Will deleting the table after a few hours and then creating it again renew the quota limit for that table?
I'm using the Python API of BigQuery to achieve this task.
I need to update the same table in BigQuery without exceeding the quota limit.
As per the BQ documentation, it imposes an upper bound of 1,000 updates per table per day.
I think you have to "engineer" ways to get around your frequency of updates to a table. There are some very obvious ways around this (which are also pretty standard industry practices) and then there are some tricks. Here is what I can think of off the top of my head:
You can choose to update your target table (overwrite) less frequently.
You can compose a new table name that is valid only for updates coming in during a certain time interval of the day (example: between 2 and 3 AM, let your pipeline write query results to the table mydataset.my_table_[date]_02_03). Then, at query time, you can just use wildcard statements like:
select count(*) as cnt from `mydataset.my_table_[date]_*`
Which is equivalent to:
select count(*) as cnt from (
    select * from `mydataset.my_table_[date]_00_01`
    union all
    select * from `mydataset.my_table_[date]_01_02`
    union all
    ....
)
In this approach, however, make sure you are always "appending" (not overwriting) data to the table corresponding to the hour of the day. Also, don't forget that you can always take decent advantage of BQ's date-partitioned tables to achieve similar results, as sketched below.
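For example, with an ingestion-time date-partitioned table, the hourly union above collapses into a simple filter on the _PARTITIONTIME pseudo-column (the table name is hypothetical):

select count(*) as cnt
from `mydataset.my_table`
where date(_PARTITIONTIME) = '2019-01-01'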
Hope this helps.
I ran two queries to get the count of records for two different dates from a Hive managed table partitioned on a process-date field.
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01' --returned 2 million
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02' --returned 3 million
But if I run the below query with a UNION ALL clause, the counts returned differ from those of the individual queries above:
SELECT '2018-01-01', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01'
UNION ALL
SELECT '2018-01-02', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02'
What can be the root cause for this difference?
One of our teammates helped us identify the issue.
When we run a single count(*) query, the query is not physically executed against the table; rather, the count is taken from the table's statistics.
One remedy is to collect the stats on the table again; then count(*) on the single table will reflect the actual count.
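Another session-level option, assuming a Hive version that has the setting, is to tell Hive not to answer count queries from statistics at all:

-- force the count to run against the data instead of the stats
set hive.compute.query.using.stats=false;
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01';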
I too faced a similar issue with count(*) returning an incorrect count. I added the statements below to my code, and the counts are consistent now.
For a non-partitioned table, use:
ANALYZE TABLE your_table_name COMPUTE STATISTICS;
For a partitioned table, analyze the recently added partition by specifying the partition value:
ANALYZE TABLE your_table_name
PARTITION(your_partition_name=your_partition_value)
COMPUTE STATISTICS;