Partition column useful or not while taking count - sql

Which one is best to use from the perspective of cost, time, and processing? Here etl_batch_date is the partition column for the table.
1. Query - This query will process 607.7 KB when run
Table size : 9.77 MB
SELECT count(*) from demo
WHERE etlbatchid = '20200003094244327' and etl_batch_date = '2020-06-03'
2. Query - This query will process 427.6 KB when run
Table size : 9.77 MB
SELECT count(*) from demo WHERE etlbatchid = '20200003094244327'
Also, when you write the second query, does it read the data from every partition?
Your valuable comments will be appreciated.

Rule of thumb: Always use the partitioned column to filter data.
Play with this query:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
# 2.2 GB processed
For every day of datehour you add to the filter, roughly an extra gigabyte of data will be queried. That's because:
Filtering by datehour implies reading the datehour column, so the query goes over more data.
But since datehour is the partitioning column, only the partitions for those days are scanned.
Now, if I add another filter:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
AND wiki='en'
# 686.8 MB processed
That processed less data!
That's because wiki is the main clustering column.
So try to always use partitions and clusters - even though for smaller tables the results might look less intuitive.
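For reference, both savings come from how the table is defined. A minimal DDL sketch applying the same idea to the demo table from the first question (the dataset name is hypothetical, and it assumes etl_batch_date is a DATE column):
-- Sketch only: partition on the date column, cluster on the id column,
-- so filters on either one reduce the bytes scanned.
CREATE TABLE mydataset.demo_partitioned
PARTITION BY etl_batch_date
CLUSTER BY etlbatchid
AS
SELECT * FROM mydataset.demo;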

Related

Correct way to create one index with two columns in Timescaledb

I use the PostgreSQL extension called TimescaleDB. I have a table called timestampdb. I want to ask if this is the right way to create one index with two columns (timestamp1, id13).
Some of the queries that I use look like this:
select * from timestampdb where timestamp1 >='2020-01-01 00:05:00' and timestamp1<='2020-01-02 00:05:00' and id13>'5',
select date_trunc('hour',timestamp1) as hour,avg(id13) from timestampdb where timestamp1 >='2020-01-01 00:05:00' and timestamp1<='2020-01-02 00:05:00' group by hour order by hour ,
select date_trunc('hour',timestamp1) as hour,max(id13) from timestampdb where timestamp1<='2020-01-01 00:05:00' group by hour order by hour desc limit 5
After creating the table I do this:
SELECT create_hypertable('timestampdb', 'timestamp1'); and then CREATE INDEX ON timestampdb (timestamp1, id13)
Is this the proper way? Will this create one index with two columns? Or one index on timestamp1 and one index for (timestamp1, id13)?
Yes, this is the proper way. The provided call will create a single index combining those two columns. You want to make sure that the dominant index column (i.e. the first) is the time column, which it is in your code. That way TimescaleDB queries still find your data first by time (which is pretty important on large data sets), and your queries match that index too: they primarily search by time range.
You might want to check how Postgres executes your queries by running
EXPLAIN ANALYZE <your query>;
Or use pgAdmin and click the “explain” button.
That way you can make sure that your indexes are hit and see whether Postgres has enough buffer cache to hold the table pages or needs to read from disk (which is slower by effectively a factor of 1,000 to 10,000).
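For example, using the first query from the question (the BUFFERS option is an extra I'd add to see cache hits vs. disk reads):
-- Check whether the composite index is used and where the pages come from.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM timestampdb
WHERE timestamp1 >= '2020-01-01 00:05:00'
  AND timestamp1 <= '2020-01-02 00:05:00'
  AND id13 > '5';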
I always find those resources helpful:
TSDB YT Channel: https://youtube.com/c/TimescaleDB
TSDB Slack Channel: https://slack.timescale.com/

GCP: select query with unnest from array has very big process data to run compared to hardcoded values

In BigQuery on GCP, I am trying to grab some data from a table where the date matches a date in a list of values I have. If I hardcode the list of values in the select, it is vastly cheaper to run than if I use a temp structure like an array...
Is there a way to use the temp structure but avoid the enormous processing cost?
Why is it so expensive for something as small and simple as this?
Please see the examples below:
**-----1/ array structure example: this query processes 144.8 GB----------**
WITH
get_a AS (
SELECT
GENERATE_DATE_ARRAY('2000-01-01','2000-01-02') AS array_of_dates
)
SELECT
a.heading AS title,
a.ingest_time AS proc_date
FROM
`veiw_a.events` AS a,
get_a AS b,
UNNEST(b.array_of_dates) AS c
WHERE
c IN (CAST(a.ingest_time AS DATE))
**------2/ hardcoded example: this query processes 936.5 MB, over 154x less?--------**
SELECT
a.heading AS title,
a.ingest_time AS proc_date
FROM
`veiw_a.events` AS a
WHERE
CAST(a.ingest_time AS DATE) IN ('2000-01-01','2000-01-02')
Presumably, your view_a.events table is partitioned by the ingest_time.
The issue is that partition pruning is very conservative (buggy?). With the direct comparisons, BigQuery is smart enough to recognize exactly which partitions are used for the query. But with the generated version, BigQuery is not able to figure this out, so the entire table needs to be read.
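One workaround to try (my sketch, not something the answer above guarantees): compute the date list into a scripting variable first, so the filter compares the partition column against a value that is fixed before the main query runs. Verify the bytes processed with a dry run, since pruning behaviour with scripting variables and casts can differ:
-- Sketch only: check with a dry run that this actually prunes partitions.
DECLARE date_list ARRAY<DATE> DEFAULT GENERATE_DATE_ARRAY('2000-01-01', '2000-01-02');
SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `veiw_a.events` AS a
WHERE
  CAST(a.ingest_time AS DATE) IN UNNEST(date_list);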

Bigquery Integer range partitioning

I have a fact table which joins to a dimension table (the dimension table has 16 million records). In order to optimize the join, is it ideal to partition the dimension table on the SK field using BigQuery integer range partitioning?
What is the best way to efficiently join to this dimension, given that the dimension table has 16 million records?
Thanks
I'd recommend you cluster, not partition - especially since you have not indicated the range of the ids and how it will change through time.
However, I tested one query pattern with 20 million records, and at this scale clustering showed no advantage over doing nothing:
CREATE TABLE temp.lookup_clustered
PARTITION BY fake_date
CLUSTER BY id
AS
SELECT DATE('2000-01-01') fake_date,  -- constant date added here so PARTITION BY has a column; clustering by id is what is being tested
  FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, *
FROM `fh-bigquery.weather_gsod.all`
WHERE name<'C'
;
CREATE TABLE temp.lookup_plain
AS
SELECT FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, *
FROM `fh-bigquery.weather_gsod.all`
WHERE name<'C'
;
CREATE TABLE temp.ids AS
SELECT id FROM temp.lookup_plain
;
SELECT MAX(temp)
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_clustered`
USING(id)
# 2.1 sec elapsed, 440.2 MB processed
# Slot time consumed 32.846 sec
# Bytes shuffled 26.51 KB
;
SELECT MAX(temp)
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_plain`
USING(id)
# 1.8 sec elapsed, 440.2 MB processed
# Slot time consumed 34.740 sec
# Bytes shuffled 26.39 KB
Use a similar script to test the best strategy for your use cases (which are missing from the question). And please report results!
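If you still want to benchmark integer range partitioning on the SK, the DDL uses RANGE_BUCKET. A sketch with assumed names and bucket boundaries (the question doesn't give the id range, so adjust GENERATE_ARRAY to your real min/max and step):
-- Sketch only: dim_table, sk, and the 0..20,000,000 range in steps of 500,000 are assumptions.
CREATE TABLE mydataset.dim_range_partitioned
PARTITION BY RANGE_BUCKET(sk, GENERATE_ARRAY(0, 20000000, 500000))
CLUSTER BY sk
AS
SELECT * FROM mydataset.dim_table;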

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin, and I would like to extract the first transaction of each block that was mined. In other words, to replace a nested field by its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so it incurs less cost for your dev/test. The query below can help with this - you can run it in the BigQuery UI with a destination table which you will then use for your dev. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value 1529518619028 is taken from the query below (at the time of running) - the reason I went four days back is that I know the number of rows in this table at that time was just 912 vs. the current 528,858.
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still contains nested data, which you might want to flatten to its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

Optimizing Hive GROUP BY when rows are sorted

I have the following (very simple) Hive query:
select user_id, event_id, min(time) as start, max(time) as end,
count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;
The table has the following structure:
user_id event_id time interaction
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 0
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 1
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512179696 0
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512217124 0
n0w4uQhOuXymj5jLaCMQ mqf38Xd6CAQtuvuKc5NlWQ 1430512179696 1
I know for a fact that rows are sorted first by user_id and then by event_id.
The question is: is there a way to "hint" the Hive engine to optimize the query given that rows are sorted? The purpose of the optimization is to avoid keeping all groups in memory, since it is only necessary to keep one group at a time.
Right now this query, running on a 6-node, 16 GB Hadoop cluster with roughly 300 GB of data, takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution will probably have a very small memory footprint and also be faster (since there is no need to look up group keys).
Create a bucketed sorted table. The optimizer will know it is sorted from metadata.
See example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
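A sketch of such a table for this schema (the 64-bucket count is an assumption; pick it based on data volume):
-- Bucketed and sorted on the grouping keys, so the sortedness is recorded in the metastore.
CREATE TABLE events_all_bucketed (
  user_id     STRING,
  event_id    STRING,
  `time`      BIGINT,
  interaction INT
)
CLUSTERED BY (user_id, event_id)
SORTED BY (user_id, event_id)
INTO 64 BUCKETS;
-- Populate from the existing table (on Hive 1.x you may need to SET
-- hive.enforce.bucketing=true and hive.enforce.sorting=true first).
INSERT OVERWRITE TABLE events_all_bucketed
SELECT user_id, event_id, `time`, interaction FROM events_all;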
Count only interaction = 1: count(case when interaction = 1 then 1 end) as clicks - the CASE marks matching rows with 1 (others get NULL), and COUNT counts only the 1s.
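Putting both suggestions together, the query would look roughly like this (a sketch using the hypothetical events_all_bucketed table above; start, end, and time are backticked because they can be reserved words depending on the Hive version):
SELECT user_id,
       event_id,
       min(`time`) AS `start`,
       max(`time`) AS `end`,
       count(*) AS total,
       count(case when interaction = 1 then 1 end) AS clicks
FROM events_all_bucketed
GROUP BY user_id, event_id;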