ClickHouse - SELECT row of data is too slow - sql

We ran into the following problem in our project and cannot solve it.
We have a huge amount of log data, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
I'm running a query like this:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
This is the result: 10 rows in set. Elapsed: 6.23 sec.
And a second query without ORDER BY, LIMIT and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
This is the result: Elapsed: 7.994 sec. for each 500 rows of 130000+.
That is too slow.
It seems that CH processes all the rows in the table. What is wrong, and what do we need to change to improve the speed of CH?
The same implementation on MongoDB takes 200-500 ms at most.

Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have the timestamp (here, the partition key) as part of your WHERE clause, so ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
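To see why the extra week condition matters, here is a toy Python model of partition pruning. It is only a sketch and not how ClickHouse is implemented: the dictionary layout, the query function and the sample rows are all hypothetical, and they only illustrate that a filter on the partition key lets whole partitions be skipped without reading them.

```python
from datetime import date

# Hypothetical stand-in for a table partitioned by week:
# partition key (the Monday of the week) -> rows stored in that partition.
partitions = {
    date(2021, 6, 28): [("dev-a", "boot"), ("dev-b", "boot")],
    date(2021, 7, 5): [("dev-a", "login"), ("dev-b", "error"), ("dev-a", "logout")],
    date(2021, 7, 12): [("dev-b", "boot")],
}


def query(device_id, week=None):
    """Return (matching rows, number of rows read)."""
    rows_read = 0
    matches = []
    for part_week, rows in partitions.items():
        # With a filter on the partition key, other partitions are skipped
        # without reading any of their rows.
        if week is not None and part_week != week:
            continue
        for row in rows:
            rows_read += 1
            if row[0] == device_id:
                matches.append(row)
    return matches, rows_read


all_matches, all_read = query("dev-a")
week_matches, week_read = query("dev-a", week=date(2021, 7, 5))
print(all_read, week_read)
```

In this toy model the unfiltered query reads every row of every partition, while the query with the week filter reads only the rows of the one matching partition.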
AFAIK, unless you specified the exact partition format, CH will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine partitioning could be like
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries

Related

Can I efficiently GROUP BY over a date partitioned table in BigQuery

I have a table t in BigQuery that contains ~5 billion rows (~80 TB). It is partitioned by DAY on the column dateTimeCreated, which is of type TIMESTAMP. The table contains data for 5 years, so it has no more than 1825 partitions.
I'd like to find out how many rows exist in the table per day so I crafted this SQL query:
select timestamp_trunc(datetimecreated,DAY),count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc
I was hoping that BigQuery would be able to return the results rapidly because this is basically counting the number of rows in each partition which, I would expect, is a tally that BigQuery maintains as internal metadata anyway (that's certainly my experience when using ingestion time partitioned tables).
Unfortunately that seems to not be the case. It took BigQuery 73s to return the result:
Query complete (1 min 13 sec elapsed, 37.4 GB processed)
I'm curious if there's a more efficient way to query this table. If it were an ingestion-time partitioned table my query would be:
select _PARTITION_DATE,count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc
which I'm confident would return very quickly. This isn't an ingestion-time partitioned table though.
Is there a more efficient method to achieve my desired result?
Another question: does BigQuery provide queryable metadata per partition that includes the cardinality of the partition?
Found the answer, this does the job:
SELECT table_name, partition_id, total_rows
FROM `p.d.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
and table_name = 't'
order by partition_id desc
it returns quickly and, of course, queries much less data.
Query complete (1.7 sec elapsed, 10 MB processed)

Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing on the data and saves it into a new BigQuery table. This is a batch process run weekly through a cron job. Entries keep being added to the source table, so whenever I start the ETL process I want it to process only the new rows which have been added since the last time the ETL job was launched.
In order to achieve this, I have thought about making a query to my sink table asking for the most recent timestamp it contains. Then, as a data source I will perform another query to the source table filtering and asking for the entries having a timestamp higher than the one I have just recovered. Both my source and sink table are time partitioned ones.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel it is not the most efficient way of querying it. Does this query take advantage of the partitioning of my table? Is there a better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even fewer bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()

General question on optimising a database for a large query

I have a database that stores data from sensors in a factory. The DB contains about 1.6 million rows per sensor per day. I have the following index on the DB.
CREATE INDEX sensor_name_time_stamp_index ON sensor_data (time_stamp, sensor_name);
I will be running the following query once per day.
SELECT
time_stamp, value
FROM
(SELECT
time_stamp,
value,
lead(value) OVER (ORDER BY value DESC) as prev_result
FROM
sensor_data WHERE time_stamp between '2021-02-24' and '2021-02-25' and sensor_name = 'sensor8'
ORDER BY
time_stamp DESC) as result
WHERE
result.value IS DISTINCT FROM result.prev_result
ORDER BY
result.time_stamp DESC;
The query returns a list of rows where the value is different from the previous row.
This query takes about 23 seconds to run.
Running on PostgreSQL 10.12 on Aurora serverless.
Questions: Besides the index, are there any other optimisations that I can perform on the DB to make the query run faster?
To support the query optimally, the index must be defined the other way around:
CREATE INDEX ON sensor_data (sensor_name, time_stamp);
Otherwise, PostgreSQL will have to read all index values in the time interval, then discard the ones for the wrong sensor, then fetch the rows from the table.
With the proper column order, only the required rows are scanned in the index.
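The effect of the column order can be illustrated with a small Python sketch. This is not PostgreSQL's B-tree implementation; the sorted lists, the function names and the integer timestamps are hypothetical stand-ins for the two index definitions, and the sketch only counts how many index entries each ordering has to touch.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (sensor_name, time_stamp) pairs, time_stamp as an integer.
rows = [(f"sensor{s}", t) for t in range(100) for s in range(1, 11)]

# Index ordered as (time_stamp, sensor_name): entries for one sensor are
# spread across the whole time range.
idx_time_first = sorted((t, s) for s, t in rows)
# Index ordered as (sensor_name, time_stamp): entries for one sensor and a
# time range sit next to each other.
idx_sensor_first = sorted(rows)


def scan_time_first(sensor, t_lo, t_hi):
    """Read every entry in the time range, then filter by sensor."""
    lo = bisect_left(idx_time_first, (t_lo, ""))
    hi = bisect_left(idx_time_first, (t_hi + 1, ""))
    touched = idx_time_first[lo:hi]
    hits = [(s, t) for t, s in touched if s == sensor]
    return hits, len(touched)


def scan_sensor_first(sensor, t_lo, t_hi):
    """Jump straight to the contiguous slice for this sensor and range."""
    lo = bisect_left(idx_sensor_first, (sensor, t_lo))
    hi = bisect_right(idx_sensor_first, (sensor, t_hi))
    touched = idx_sensor_first[lo:hi]
    return touched, len(touched)


hits_a, touched_a = scan_time_first("sensor8", 20, 29)
hits_b, touched_b = scan_sensor_first("sensor8", 20, 29)
print(touched_a, touched_b)
```

Both scans return the same rows, but with ten sensors the time-first ordering touches ten times as many entries as the sensor-first ordering for the same result.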
You asked for other optimizations: Since you have to sort rows, increasing work_mem can be beneficial. Other than that, more memory and faster storage will definitely not harm.

Calculating per second peak values after summing up individual values in clickhouse

I am currently using a ClickHouse cluster (2 shards, 2 replicas) to read transaction logs from my server. The log contains fields like timestamp, bytes delivered, ttms, etc. The structure of my table is as follows:
CREATE TABLE db.log_data_local ON CLUSTER '{cluster}' (
timestamp DateTime,
bytes UInt64,
/*lots of other fields */
) ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/db/tables/logs/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + INTERVAL 1 MONTH;
CREATE TABLE db.log_data ON CLUSTER '{cluster}'
AS db.log_data_local
ENGINE = Distributed('{cluster}','db','log_data_local',rand());
I am ingesting data from Kafka and using a materialized view to populate this table. Now I need to calculate the peak throughput per second from this table. So basically I need to sum up the bytes field per second and then find the max value for each 5-minute period.
I tried using ReplicatedAggregatingMergeTree with aggregate functions for the throughput, but the peak value I get is much lower than the value I get when I query the raw table directly.
The problem is that, when creating the materialized view to populate the peak values, querying the distributed table directly does not give any results, while querying the local table only considers a partial data set. I tried using an intermediary table to compute the per-second total and then creating the materialized view from it, but I faced the same issue.
This is the schema for my peaks table and the materialized view I am trying to create:
CREATE TABLE db.peak_metrics_5m_local ON CLUSTER '{cluster}'
(
timestamp DateTime,
peak_throughput AggregateFunction(max,UInt64),
)
ENGINE=ReplicatedAggregatingMergeTree('/clickhouse/{cluster}/db/tables/peak_metrics_5m_local/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp)
TTL timestamp + toIntervalDay(90);
CREATE TABLE db.peak_metrics_5m ON CLUSTER '{cluster}'
AS db.peak_metrics_5m_local
ENGINE = Distributed('{cluster}','db','peak_metrics_5m_local',rand());
CREATE MATERIALIZED VIEW db.peak_metrics_5m_mv ON CLUSTER '{cluster}'
TO db.peak_metrics_5m_local
AS SELECT
toStartOfFiveMinute(timestamp) as timestamp,
maxState(bytes) as peak_throughput,
FROM (
SELECT
timestamp,
sum(bytes) as bytes,
FROM db.log_data_local
GROUP BY timestamp
)
GROUP BY timestamp;
Please help me out with a solution to this.
This is impossible to implement with a materialized view (MV). An MV is an insert trigger.
The inner query (sum(bytes) as bytes, ... GROUP BY timestamp) runs only against the inserted block and does not read the data already stored in the log_data_local table.
https://github.com/ClickHouse/ClickHouse/issues/14266#issuecomment-684907869
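A small Python sketch can show why this makes the stored peak lower than the one computed from the raw table. It is a simplified model, not ClickHouse code: the two insert blocks, the byte counts and the helper functions are hypothetical, and the merge of max states is modelled as a plain max.

```python
from collections import defaultdict

# Hypothetical log rows: (timestamp in whole seconds, bytes delivered).
# Rows for the same second arrive in different insert blocks.
block_1 = [(0, 100), (0, 200), (1, 50)]
block_2 = [(0, 300), (1, 400), (301, 10)]


def per_second_sums(rows):
    """Sum bytes per second over the given rows only."""
    sums = defaultdict(int)
    for ts, nbytes in rows:
        sums[ts] += nbytes
    return sums


def peak_per_window(sums, window=300):
    """Max per-second total within each 5-minute (300 s) window."""
    peaks = {}
    for ts, total in sums.items():
        start = ts - ts % window
        peaks[start] = max(peaks.get(start, 0), total)
    return peaks


# Query over the full table: sum per second first, then take the max.
true_peaks = peak_per_window(per_second_sums(block_1 + block_2))

# Insert-trigger behaviour: the inner sum only sees one block at a time,
# and the per-block maxima are merged afterwards.
mv_peaks = {}
for block in (block_1, block_2):
    for start, peak in peak_per_window(per_second_sums(block)).items():
        mv_peaks[start] = max(mv_peaks.get(start, 0), peak)

print(true_peaks, mv_peaks)
```

In this example second 0 carries 600 bytes in total, but no single block ever sees more than 300 bytes for it, so the per-block path reports a peak of 400 (from second 1 in the second block) for the first window instead of 600.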

Does pre-sorting a partitioned table by certain columns reduce the memory used for GROUP BYs?

Assuming we have a table:
CREATE TABLE dataset.user_activity_log
(
partition_time DATE
, user_id STRING
, description STRING
, activity_id int64
)
PARTITION BY partition_time
OPTIONS(
description="partitioned by partition_time"
)
;
Suppose I set it up so that I insert data into it daily and, while doing so, have the rows ordered by activity_id.
Later on, I would like to create a report over a range of time based on the partition_time field and do a GROUP BY on activity_id. Would having the activity_id field sorted help (for example, to avoid running out of memory)?
These are called "clustered tables", and you can create one using DDL.
Snippet:
PARTITION BY partition_time
CLUSTER BY
activity_id
OPTIONS (
Read this as well: Optimizing BigQuery: Cluster your tables
You need to cluster your table further by activity_id. If you run into a memory error, post your schema, table size, query, and query plan in a new question and you will get optimization tips.