Can I efficiently GROUP BY over a date partitioned table in BigQuery - google-bigquery

I have a table t in BigQuery that contains ~5 billion rows (~80 TB). It is partitioned by DAY on the column dateTimeCreated, which is of type TIMESTAMP. The table contains data for 5 years, so it has roughly 1825 partitions.
I'd like to find out how many rows exist in the table per day so I crafted this SQL query:
select timestamp_trunc(datetimecreated,DAY),count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc
I was hoping that BigQuery would be able to return the results rapidly because this is basically counting the number of rows in each partition which, I would expect, is a tally that BigQuery maintains as internal metadata anyway (that's certainly my experience when using ingestion time partitioned tables).
Unfortunately that seems to not be the case. It took BigQuery 73s to return the result:
Query complete (1 min 13 sec elapsed, 37.4 GB processed)
I'm curious if there's a more efficient way to query this table. If it were an ingestion-time partitioned table my query would be:
select _PARTITIONDATE,count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc
which I'm confident would return very quickly. This isn't an ingestion-time partitioned table though.
Is there a more efficient method to achieve my desired result?
Another question: does BigQuery provide queryable metadata per partition that includes the cardinality of the partition?

Found the answer, this does the job:
SELECT table_name, partition_id, total_rows
FROM `p.d.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
and table_name = 't'
order by partition_id desc
It returns quickly and, of course, queries much less data:
Query complete (1.7 sec elapsed, 10 MB processed)
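If the goal is still a per-day result like the original GROUP BY, the partition_id can be turned back into a date. This is only a minimal sketch: it assumes the table is DAY-partitioned, so that partition_id has the form YYYYMMDD, and that the special partitions __NULL__ and __UNPARTITIONED__ should be left out. The alias day is just illustrative.

```sql
-- Row count per daily partition, read from metadata instead of scanning the table.
SELECT
  PARSE_DATE('%Y%m%d', partition_id) AS day,  -- assumes DAY partitioning (YYYYMMDD ids)
  total_rows
FROM `p.d.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 't'
  AND partition_id IS NOT NULL
  -- the special partitions do not correspond to a calendar day
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
ORDER BY day DESC
```

Because the table is partitioned on the same column the original query grouped by, each total_rows value should correspond to one group of that query.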

Related

ClickHouse - SELECT row of data is too slow

The following problem occurred in our project, and we cannot solve it.
We have a huge amount of log data, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
I'm running a query like this:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
This is the result: 10 rows in set. Elapsed: 6.23 sec.
And the second one, without ORDER BY, LIMIT and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
This is the result: Elapsed: 7.994 sec. for each 500 rows of 130000+.
That is too slow.
It seems that CH processes all the rows in the table. What is wrong, and what needs to change to improve the speed of CH?
The same implementation on MongoDB takes 200-500 ms at most.
Egor! When you mentioned, "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as part of your WHERE clause, so that ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
AFAIK, unless you specified the exact partition format, CH will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly partitions, I would imagine the partitioning could look like this:
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries
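To see how much data a query actually touches, two checks may help. This is a sketch, assuming a ClickHouse version recent enough to support EXPLAIN with indexes = 1. Both queries reuse the logs table and the filter from above.

```sql
-- Show which partitions, parts and granules survive the partition key
-- and primary key filters for this query.
EXPLAIN indexes = 1
SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10;

-- Summarise the active parts per partition to see how the data is laid out.
SELECT
    partition,
    count() AS parts,
    sum(rows) AS rows
FROM system.parts
WHERE table = 'logs' AND active
GROUP BY partition
ORDER BY partition;
```

If the EXPLAIN output shows nearly all parts and granules being selected, the filter is not narrowing the read.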

General question on optimising a database for a large query

I have a database that stores data from sensors in a factory. The DB contains about 1.6 million rows per sensor per day. I have the following index on the DB.
CREATE INDEX sensor_name_time_stamp_index ON sensor_data (time_stamp, sensor_name);
I will be running the following query once per day.
SELECT
time_stamp, value
FROM
(SELECT
time_stamp,
value,
lead(value) OVER (ORDER BY time_stamp DESC) as prev_result
FROM
sensor_data WHERE time_stamp between '2021-02-24' and '2021-02-25' and sensor_name = 'sensor8'
ORDER BY
time_stamp DESC) as result
WHERE
result.value IS DISTINCT FROM result.prev_result
ORDER BY
result.time_stamp DESC;
The query returns a list of rows where the value is different from the previous row.
This query takes about 23 seconds to run.
Running on PostgreSQL 10.12 on Aurora serverless.
Questions: Besides the index, are there any other optimisations that I can perform on the DB to make the query run faster?
To support the query optimally, the index must be defined the other way around:
CREATE INDEX ON sensor_data (sensor_name, time_stamp);
Otherwise, PostgreSQL will have to read all index values in the time interval, then discard the ones for the wrong sensor, then fetch the rows from the table.
With the proper column order, only the required rows are scanned in the index.
You asked for other optimizations: since you have to sort rows, increasing work_mem can be beneficial. Other than that, more memory and faster storage will certainly not hurt.
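The two suggestions can be tried together in one session. This is a rough sketch: the index name sensor_name_time_stamp_idx and the 256MB value are assumptions chosen for illustration, not recommendations for this particular instance.

```sql
-- Equality column first, range column second, so only the rows for one
-- sensor within the time interval are visited in the index.
CREATE INDEX sensor_name_time_stamp_idx ON sensor_data (sensor_name, time_stamp);

-- Raise work_mem for the current session only (illustrative value).
SET work_mem = '256MB';

-- Inspect the plan: check that the new index is used, and whether any
-- Sort node reports "external merge  Disk", which means the sort spilled.
EXPLAIN (ANALYZE, BUFFERS)
SELECT time_stamp, value
FROM sensor_data
WHERE sensor_name = 'sensor8'
  AND time_stamp BETWEEN '2021-02-24' AND '2021-02-25';
```

Setting work_mem per session keeps the change local to the daily job instead of affecting every connection.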

Why does BigQuery scan entire table although it's partitioned by hour?

This table is partitioned by hour:
SELECT *
FROM `blockchain-etl-internal.crypto_ethereum_partitioned.logs_by_topic_0xd78`
WHERE block_timestamp >= '2020-11-14 00:00:00' and block_timestamp < '2020-11-14 01:00:00'
ORDER BY block_timestamp DESC
But whatever filter on block_timestamp I specify, BigQuery scans the entire table. You can compare the table size with the amount of data scanned by a query to confirm this.
Isn't BigQuery supposed to scan only the partitions that match the filter?
This is because all rows in the table are still in the __UNPARTITIONED__ partition and have not yet been repartitioned into their corresponding partitions. Repartitioning is triggered only when there is enough data, i.e. the byte size reaches a certain threshold (https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables).
At the moment, this threshold is set at 5 GB, while the table has around 400 MB, as you stated.

Hive Not Utilizing Partitions in Query

I have a view that works to pull the most recent data for a Hive history table. The history table is partitioned by day. The way that the view works is very straightforward—it has a subquery that does a max date on the date field (the one that is used as the partition) then filters the table based upon that value. The table contains hundreds of days (partitions), each with many millions of rows. In order to speed up the subquery, I am attempting to limit the partitions that are scanned to the last one created. To account for holiday weekends, I'm going back four days to ensure that the query returns data.
If I hard code the values with dates, the subquery runs very fast, and limits to the partitions correctly.
However, if I attempt to limit the partitions with a subquery to calculate the last partition, it doesn’t recognize the partitions and does a full table scan. The query will return correct results, as the filter works, but it takes a long time because it is not limiting the partitions scanned.
I tried doing the subquery as a WITH statement, then using an INNER JOIN on bus_date, but got the same results—partitions were not utilized.
The behavior is repeatable via a query, so I’ll use that rather than the view to demonstrate:
SELECT *
FROM a.transactions
WHERE bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
There are no error messages, and the query actually works (filters to pull the correct data), but it scans all partitions so it is extremely slow. How can I limit the query to utilize the partitions identified in the subquery?
I'm still hopeful that someone will have an answer for this, but I did want to post the workaround that I've come up with in case it is useful for someone else.
SELECT *
FROM a.transactions
WHERE bus_date >= date_sub (CURRENT_DATE, 4)
AND bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
The query is a little clumsy, as it is filtering on the business date twice. The first time it limits the main set of data to the last four days (which limits to those partitions and avoids a scan of all partitions) and the second pins it down to the last day for which data has been loaded (via the MAX bus_date). This is far from perfect, but performs CONSIDERABLY better than the query scanning all partitions. Thanks.
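To confirm which partitions a query will read without actually running it, EXPLAIN DEPENDENCY may help. This is a sketch, assuming a Hive version that supports it. Its JSON output should list the partitions to be read under input_partitions, so the workaround and the original query can be compared.

```sql
-- List the partitions the workaround query depends on.
EXPLAIN DEPENDENCY
SELECT *
FROM a.transactions
WHERE bus_date >= date_sub(CURRENT_DATE, 4)
  AND bus_date IN (SELECT MAX(bus_date)
                   FROM a.transactions maxtrans
                   WHERE bus_date >= date_sub(CURRENT_DATE, 4));

-- The metastore can also list the existing partitions directly,
-- without scanning any data.
SHOW PARTITIONS a.transactions;
```

If input_partitions lists only the last few days for the workaround but every partition for the original query, pruning is behaving as described above.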

How to get COUNT(*) from one partition of a table in SQL Server 2012?

My table has 7 million records, and I split it into 14 partitions according to ID. Each partition includes 5 million records and is 40 GB in size. I want to run a query to get the count in one partition, but it scans all partitions and the query takes a very long time.
SELECT COUNT(*)
FROM Item
WHERE IsComplated = 0
AND ID Between 1 AND 5000000
How can I run my query on one partition only, without scanning the other partitions?
Refer to http://msdn.microsoft.com/en-us/library/ms188071.aspx
B. Getting the number of rows in each nonempty partition of a partitioned table or index
The following example returns the number of rows in each partition of table TransactionHistory that contains data. The TransactionHistory table uses partition function TransactionRangePF1 and is partitioned on the TransactionDate column.
To execute this example, you must first run the PartitionAW.sql script against the AdventureWorks2012 sample database. For more information, see PartitioningScript.
USE AdventureWorks2012;
GO
SELECT $PARTITION.TransactionRangePF1(TransactionDate) AS Partition,
COUNT(*) AS [COUNT] FROM Production.TransactionHistory
GROUP BY $PARTITION.TransactionRangePF1(TransactionDate)
ORDER BY Partition ;
GO
C. Returning all rows from one partition of a partitioned table or index
The following example returns all rows that are in partition 5 of the table TransactionHistory.
Note
To execute this example, you must first run the PartitionAW.sql script against the AdventureWorks2012 sample database. For more information, see PartitioningScript.
SELECT * FROM Production.TransactionHistory
WHERE $PARTITION.TransactionRangePF1(TransactionDate) = 5 ;
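Applied to the Item table from the question, the same pattern might look like the sketch below. The partition function name ItemRangePF is hypothetical, since the question does not name it; substitute the function the table was actually partitioned with. The second query reads unfiltered per-partition row counts from metadata.

```sql
-- Count the matching rows in partition 1 only.
-- ItemRangePF is a placeholder for the table's real partition function.
SELECT COUNT(*)
FROM Item
WHERE $PARTITION.ItemRangePF(ID) = 1
  AND IsComplated = 0;

-- Row counts per partition from metadata, without touching the data.
-- index_id 0 is a heap and 1 is the clustered index.
SELECT partition_number, row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('Item')
  AND index_id IN (0, 1)
ORDER BY partition_number;
```

The metadata query cannot apply the IsComplated = 0 filter, so it is only useful when the total number of rows per partition is enough.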