I have a fact table which joins to a dimension table (the dimension table has 16 million records). In order to optimize the join, is it a good idea to partition the dimension table on the SK field using BigQuery integer range partitioning?
What is the best way to efficiently join to this dimension, given that it has 16 million records?
Thanks
I'd recommend you cluster rather than partition - especially since you haven't indicated the range of the IDs or how it will change over time.
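For reference, the two options look roughly like this (a minimal sketch, assuming a hypothetical dimension table dataset.dim with an integer surrogate key sk in the 0-16M range; the bucket size is illustrative):
# Option A: integer range partitioning on the surrogate key
# (requires knowing the key range and staying under the partition limit)
CREATE TABLE dataset.dim_partitioned
PARTITION BY RANGE_BUCKET(sk, GENERATE_ARRAY(0, 16000000, 10000))
AS SELECT * FROM dataset.dim;

# Option B: clustering on the surrogate key - no range assumptions
# (the test below adds a dummy partition date, which older setups needed before clustering-only tables were allowed)
CREATE TABLE dataset.dim_clustered
CLUSTER BY sk
AS SELECT * FROM dataset.dim;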
However, I tested one query pattern with 20 million records, and at this scale there was no advantage to clustering over doing nothing:
CREATE TABLE temp.lookup_clustered
PARTITION BY fake_date
CLUSTER BY id
AS
# fake_date is assumed to be a dummy constant partition column, added here so the table can be clustered
SELECT FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, DATE('2000-01-01') fake_date, *
FROM `fh-bigquery.weather_gsod.all`
WHERE name<'C'
;
CREATE TABLE temp.lookup_plain
AS
SELECT FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, *
FROM `fh-bigquery.weather_gsod.all`
WHERE name<'C'
;
CREATE TABLE temp.ids AS
SELECT id FROM temp.lookup_plain
;
SELECT MAX(temp)
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_clustered`
USING(id)
# 2.1 sec elapsed, 440.2 MB processed
# Slot time consumed 32.846 sec
# Bytes shuffled 26.51 KB
;
SELECT MAX(temp)
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_plain`
USING(id)
# 1.8 sec elapsed, 440.2 MB processed
# Slot time consumed 34.740 sec
# Bytes shuffled 26.39 KB
Use a similar script to test the best strategy for your use cases (which are missing from the question). And please report results!
GBQ (Google BigQuery) provides views for streaming insert metadata, see STREAMING_TIMELINE_BY_*. I would like to use this data to understand the billing for "Streaming Inserts". However, the numbers don't add up and I'd like to understand if I made a mistake somewhere.
One of the data points in the streaming insert metadata view is total_input_bytes:
total_input_bytes (INTEGER): Total number of bytes from all rows within the 1 minute interval.
In addition, the Pricing for data ingestion says:
Streaming inserts (tabledata.insertAll)
$0.010 per 200 MB
You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.
So getting the cost for streaming inserts per day should be possible via
0.01/200 * (SUM(total_input_bytes)/1024/1024)
where 0.01/200 is the price per MB ($0.010 per 200 MB) and SUM(total_input_bytes)/1024/1024 converts the total bytes to MB.
This should be a lower bound, since we disregard the fact that rows smaller than 1 KB are rounded up to 1 KB.
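For example (with a made-up volume): streaming 100 GB in a day would be 102,400 MB / 200 * $0.010 ≈ $5.12, before the 1 KB minimum row size pushes the real number up.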
Full query:
SELECT
  project_id,
  dataset_id,
  table_id,
  SUM(total_rows) AS num_rows,
  ROUND(SUM(total_input_bytes)/1024/1024, 2) AS num_bytes_in_mb,
  # $0.01 per 200 MB
  # see https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
  ROUND(0.01*(SUM(total_input_bytes)/1024/1024)/200, 2) AS cost_in_dollar,
  SUM(total_requests) AS num_requests
FROM
  `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
WHERE
  start_timestamp BETWEEN "2021-04-10" AND "2021-04-14"
  AND error_code IS NULL
GROUP BY 1, 2, 3
ORDER BY table_id ASC
However, the results are not reflected in our actual billing report. The billing shows less than half the cost of what I would expect.
Now I'm wondering if the cost can even be calculated like this.
Your query rounds every line that is less than 0.49 KB down to 0 KB. This should explain why you are calculating lower costs.
Try inserting a CASE expression to handle these values:
SELECT
  project_id,
  dataset_id,
  table_id,
  SUM(total_rows) AS num_rows,
  CASE WHEN SUM(total_input_bytes)/1024/1024 < 0.001 THEN 0.001
       ELSE ROUND(SUM(total_input_bytes)/1024/1024, 2) END AS num_bytes_in_mb,
  # $0.01 per 200 MB
  # see https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
  CASE WHEN SUM(total_input_bytes)/1024/1024 < 0.001 THEN 0.001
       ELSE ROUND(0.01*(SUM(total_input_bytes)/1024/1024)/200, 2) END AS cost_in_dollar,
  SUM(total_requests) AS num_requests
FROM
  `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
WHERE
  start_timestamp BETWEEN "2021-04-10" AND "2021-04-14"
  AND error_code IS NULL
GROUP BY 1, 2, 3
ORDER BY table_id ASC
Which one is best to use from the perspective of cost, time and processing? Here etl_batch_date is the partition column for the table.
1. Query - This query will process 607.7 KB when run
Table size: 9.77 MB
SELECT count(*) from demo
WHERE etlbatchid = '20200003094244327' and etl_batch_date='2020-06-03'
2. Query - This query will process 427.6 KB when run
Table size: 9.77 MB
SELECT count(*) from demo WHERE etlbatchid = '20200003094244327'
Also, when you write the second query, does it read the data from every partition?
Your valuable comments will be appreciated.
Rule of thumb: Always use the partitioned column to filter data.
Play with this query:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
# 2.2 GB processed
For every day of datehour you add to the filter, roughly an extra gigabyte of data will be queried. That's because:
Filtering by datehour implies reading the datehour column, so the query goes over more data.
But since datehour is the partitioning column, only the matching days of data are scanned.
Now, if I add another filter:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
AND wiki='en'
# 686.8 MB processed
That processed less data!
That's because wiki is the main clustering column.
So try to always filter on the partitioning and clustering columns - even though for smaller tables the results might look less intuitive.
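For completeness, a partitioned and clustered table following this pattern can be created like below (just a sketch - the destination table name, the single-day filter, and the choice of clustering columns are assumptions):
CREATE TABLE temp.pageviews_sample
PARTITION BY DATE(datehour)   # partition on the date of the timestamp column
CLUSTER BY wiki, title        # cluster on the columns you filter by most often
AS
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) = '2020-01-01'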
I have a table with roughly 10M records, where each record is an ID and some probability (ranges between 0 and 1).
All the IDs are unique. I am trying to break this 10M dataset into 1,000 bins - meaning each bin will have 10k records in it.
But I want to compute these bins based on the probability, so I first arrange the table in descending order of probability and then try to create the bins.
--10M dataset
with predictions as
(
select id ,probability
from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
)
select *
from bin_groups
where bins = 1
limit 100
However, I get the error below while executing this query:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 102% of limit. Top memory consumer(s): JOIN operations: 96% other/unattributed: 4%
I read here - https://cloud.google.com/bigquery/docs/best-practices-performance-output#use_a_limit_clause_with_large_sorts - that we need to limit the results while querying, but it seems like LIMIT is not helping either.
The LIMIT that you have happens after materializing the two SELECT statements above, so adding the LIMIT outside won't work. You may have to put the LIMIT inside bin_groups, although I'm not sure if it would still fit your use case.
--10M dataset
with predictions as
(
  select id, probability
  from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
  select
    id,
    ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
  from predictions
  limit 100
)
select *
from bin_groups
where bins = 1
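If the goal is still to bin all 10M rows and the global ORDER BY is what exhausts memory, another option to consider (just a sketch, not tested against your data; it produces approximately equal-sized bins) is to compute approximate bin edges with APPROX_QUANTILES and assign bins with RANGE_BUCKET, avoiding the full sort:
--10M dataset
with predictions as
(
  select id, probability
  from table
)
-- 1001 approximate cut points => 1000 roughly equal-sized bins, no global sort needed
, edges as (
  select APPROX_QUANTILES(probability, 1000) as q
  from predictions
)
select *
from (
  select
    id,
    probability,
    -- RANGE_BUCKET returns 0..1001; clamp to 1..1000, then flip so bin 1 holds the highest probabilities
    1001 - LEAST(GREATEST(RANGE_BUCKET(probability, q), 1), 1000) as bins
  from predictions, edges
)
where bins = 1
limit 100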
I have a table with the following details:
- Table size: 39.6 MB
- Number of rows: 691,562
- 2 columns: contact_guid STRING, program_completed STRING
- column 1 is a uuid-like string, around 30 characters long
- column 2 is a string, around 50 characters long
I am trying this query:
#standardSQL
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2,
COUNT(*) AS cnt
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed=cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid
GROUP BY
  cp1.contact_guid,
  cp2.contact_guid
HAVING cnt > 1
ORDER BY cnt DESC
Time taken to execute: 1200 secs
I know I am doing a self join and it is mentioned in best practices to avoid self join.
My Questions:
I feel this table size in MB is quite small for BigQuery, so why is it taking so much time? And what does a "small table" mean for BigQuery in the context of a join, in terms of number of rows and size in bytes?
Is the number of rows too large? 700k^2 is about 5 * 10^11 candidate row pairs during the join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much regarding how big a table can be for joins and how much time can be expected for it to run. How do we estimate rough execution time?
Execution Details:
As shown on the screenshot you provided - you are dealing with an exploding join.
In this case step 3 takes 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query didn't provision enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: either avoid exploding joins, or accept that exploding joins will take a long time to run. But as explained in the question - you already knew that!
How about generating all the extra rows in one operation (do the join, materialize) and then running another query to process the 459 million rows? The first query will be slow for the reasons explained, but the second one will run quickly, as BigQuery will provision enough resources to deal with that amount of data.
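In other words, something along these lines (a sketch; the intermediate table name data.contact_pairs_exploded is made up):
# Step 1: materialize the exploding join once (slow, for the reasons above)
CREATE TABLE `data.contact_pairs_exploded` AS
SELECT
  cp1.contact_guid AS p1,
  cp2.contact_guid AS p2
FROM `data.contact_pairs_program_together` cp1
JOIN `data.contact_pairs_program_together` cp2
  ON cp1.program_completed = cp2.program_completed
WHERE cp1.contact_guid < cp2.contact_guid;

# Step 2: aggregate the materialized ~459 million rows - BigQuery now provisions for that input size
SELECT p1, p2, COUNT(*) AS cnt
FROM `data.contact_pairs_exploded`
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC;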
I agree with the suggestions below:
see if you can rephrase your query using analytic functions (by Tim)
Using analytic functions would be a much better idea (by Elliott)
Below is how I would do it:
#standardSQL
SELECT
p1, p2, COUNT(1) AS cnt
FROM (
SELECT
contact_guid AS p1,
ARRAY_AGG(contact_guid) OVER(my_win) guids
FROM `data.contact_pairs_program_together`
WINDOW my_win AS (
PARTITION BY program_completed
ORDER BY contact_guid DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try it and let us know if it helped.
I have the following (very simple) Hive query:
select user_id, event_id, min(time) as start, max(time) as end,
count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;
The table has the following structure:
user_id event_id time interaction
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 0
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 1
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512179696 0
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512217124 0
n0w4uQhOuXymj5jLaCMQ mqf38Xd6CAQtuvuKc5NlWQ 1430512179696 1
I know for a fact that rows are sorted first by user_id and then by event_id.
The question is: is there a way to "hint" the Hive engine to optimize the query, given that the rows are sorted? The purpose of the optimization is to avoid keeping all groups in memory, since it is only necessary to keep one group at a time.
Right now this query, running on a 6-node, 16 GB Hadoop cluster with roughly 300 GB of data, takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution will probably have a very small memory footprint and also be faster (since there is no need to look up group keys).
Create a bucketed, sorted table. The optimizer will know it is sorted from the metadata.
See an example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
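For example, something along these lines (a sketch only - the bucket count, storage format, and session settings are assumptions, not taken from your setup):
-- Bucketed, sorted copy of the table; the optimizer can pick up the sort order from the metadata
CREATE TABLE events_all_bucketed (
  user_id     STRING,
  event_id    STRING,
  `time`      BIGINT,
  interaction INT
)
CLUSTERED BY (user_id, event_id)
SORTED BY (user_id ASC, event_id ASC)
INTO 64 BUCKETS
STORED AS ORC;

-- On older Hive versions these may be needed so inserts honor the bucketing/sorting spec
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE events_all_bucketed
SELECT user_id, event_id, `time`, interaction
FROM events_all;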
Count only interaction = 1: count(case when interaction = 1 then 1 end) as clicks - the CASE marks each row with 1 or NULL, and COUNT counts only the 1s.
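Putting both points together, the query could look like this (a sketch; start/end are renamed to stay clear of reserved words):
SELECT user_id, event_id,
       MIN(`time`) AS start_time,
       MAX(`time`) AS end_time,
       COUNT(*) AS total,
       COUNT(CASE WHEN interaction = 1 THEN 1 END) AS clicks
FROM events_all
GROUP BY user_id, event_id;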