Calculating per-second peak values after summing up individual values in ClickHouse - sql

I am currently using a ClickHouse cluster (2 shards, 2 replicas) to read transaction logs from my server. The log contains fields like timestamp, bytes delivered, ttms, etc. The structure of my table is as below:
CREATE TABLE db.log_data_local ON CLUSTER '{cluster}' (
timestamp DateTime,
bytes UInt64,
/*lots of other fields */
) ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/db/tables/logs/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + INTERVAL 1 MONTH;
CREATE TABLE db.log_data ON CLUSTER '{cluster}'
AS db.log_data_local
ENGINE = Distributed('{cluster}','db','log_data_local',rand());
I am ingesting data from Kafka and using a materialized view to populate this table. Now I need to calculate the peak throughput per second from this table. So basically I need to sum up the bytes field per second and then find the maximum of those per-second sums over each 5-minute period.
I tried using ReplicatedAggregatingMergeTree with aggregate functions for the throughput, but the peak value I get is much lower than the value I get when I query the raw table directly.
The problem is, while creating the materialized view to populate the peak values, querying the distributed table directly does not give any results, and if I query the local table, only a partial data set is considered. I tried using an intermediary table to compute the per-second totals and then creating the materialized view on top of it, but I faced the same issue.
This is the schema for my peaks table and the materialized view I am trying to create:
CREATE TABLE db.peak_metrics_5m_local ON CLUSTER '{cluster}'
(
timestamp DateTime,
peak_throughput AggregateFunction(max, UInt64)
)
ENGINE=ReplicatedAggregatingMergeTree('/clickhouse/{cluster}/db/tables/peak_metrics_5m_local/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp)
TTL timestamp + toIntervalDay(90);
CREATE TABLE db.peak_metrics_5m ON CLUSTER '{cluster}'
AS db.peak_metrics_5m_local
ENGINE = Distributed('{cluster}','db','peak_metrics_5m_local',rand());
CREATE MATERIALIZED VIEW db.peak_metrics_5m_mv ON CLUSTER '{cluster}'
TO db.peak_metrics_5m_local
AS SELECT
toStartOfFiveMinute(timestamp) as timestamp,
maxState(bytes) as peak_throughput
FROM (
SELECT
timestamp,
sum(bytes) as bytes
FROM db.log_data_local
GROUP BY timestamp
)
GROUP BY timestamp;
Please help me out with a solution to this.

It's impossible to implement this with an MV. An MV is an insert trigger.
sum(bytes) as bytes ... GROUP BY timestamp runs only against the freshly inserted block of rows and does not read data already stored in the log_data_local table.
https://github.com/ClickHouse/ClickHouse/issues/14266#issuecomment-684907869
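As a workaround, the peak can be computed at query time from the raw data instead of from an MV. A minimal sketch, assuming the db.log_data distributed table and the 5-minute bucketing from the question (alias names are illustrative):
SELECT
    five_min AS timestamp,
    max(bytes_per_sec) AS peak_throughput
FROM
(
    -- per-second totals across all shards
    SELECT
        toStartOfFiveMinute(timestamp) AS five_min,
        sum(bytes) AS bytes_per_sec
    FROM db.log_data
    GROUP BY five_min, timestamp
)
GROUP BY five_min
ORDER BY five_min;
If pre-aggregation is still required, one option is to materialize only the per-second sums (e.g. into a SummingMergeTree table) and take the 5-minute maximum at query time; since an MV only ever sees the rows of a single insert block, any aggregation that spans inserts has to happen either at query time or in a later step.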

Related

ClickHouse - SELECT row of data is too slow

The following problem occurred in our project, and we cannot solve it.
We have a huge amount of log data, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
And I'm running a query like so:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
This is the result: 10 rows in set. Elapsed: 6.23 sec.
And a second query, without ORDER BY, LIMIT and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
This is the result: Elapsed: 7.994 sec. for each 500 rows out of 130,000+.
This is too slow.
It seems that ClickHouse processes all the rows in the table. What is wrong, and what do we need to do to improve the speed of ClickHouse?
The same implementation on MongoDB takes 200-500 ms at most.
Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as part of your WHERE clause, so ClickHouse can determine which parts it needs to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
AFAIK, unless you specify the exact partition format, ClickHouse will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine the partitioning could look like:
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries

BigQuery - Create view with Partition but base table doesn't have

This may sound crazy, but I want to implement something like having a view with a partition.
Background:
I had a table, really huge in size, with a date partition on a column. We are running data ingestion into this table at a 2-minute interval. All the data loads are append-only. Every load inserts 10k+ rows. After some time, we encountered the partition limitation issue:
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause (from the GCP support team):
The root cause under the hood was that due to your partitioned tables
have pretty granular partition for instance by minutes, hours or date,
when the loaded data cover a wide range of partition period, the
number of partition get modified will be high and above 4000. As per
internal documentation, it was suggested the user who ran into this
issue to consider making a less granular partition for instance change
a date/hour/minute based partitioned table to a week based partitioned
table. Alternatively split the load to multiple and hence limit the
data range to cover less number of partitions that would be affected.
This is the best recommendation I could have now.
So I'm planning to keep this table un-partitioned and create a view (we need a view for eliminating duplicates), and the view should have a partition. Is this possible? Or is there any other alternative solution for this?
You can't partition a view; it's not physically materialized. Partitioning by day can be limiting with the 4,000-partition limit. Would year work? Then you can use an integer partition:
create or replace table BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) as
select 2000 as Year, 1 as value
union all
select 2001 as Year, 1 as value
union all
select 2002 as Year, 1 as value
Alternatively, I've used month (YYYYMM) or week (YYYYWW) as an integer field to partition by, which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))
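For illustration, a sketch of a month-based integer partition, reusing the monthasintegerfield name from the snippet above (the table name and the derived value are placeholders):
CREATE OR REPLACE TABLE BI.test_monthly
PARTITION BY RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1)) AS
SELECT
  CAST(FORMAT_DATE('%Y%m', DATE '2021-07-05') AS INT64) AS monthasintegerfield,  -- e.g. 202107
  1 AS value;
In the real table, monthasintegerfield would be derived from each row's ingestion or event date, so each load only touches a handful of monthly partitions.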

ClickHouse TTL on materialized column

I am trying to upgrade the ClickHouse cluster from version 18.8 to 19.9.2. Previously, I had a cron job that deleted old data from the database. I want to start using the TTL feature instead.
Simplified table definition:
CREATE TABLE myTimeseries(
timestamp_ns Int64,
source_id String,
data String,
date Date MATERIALIZED toDate(timestamp_ns/1e9),
time DateTime MATERIALIZED toDateTime(timestamp_ns/1e9))
ENGINE = MergeTree()
PARTITION BY (source_id, toStartOfHour(time))
TTL date + toIntervalDay(7)
SETTINGS index_granularity=8192, merge_with_ttl_timeout=43200
The problem is, it does not delete old data. I could not find anything in the documentation that would help me debug this issue.
Questions:
How can I debug this issue? (Is there a way to see when the data will be cleared in the future?)
Might this be because the date field is materialized? I have another table where date is not a materialized field, and everything works fine.
Yes, you can use materialized fields with the TTL feature.
I've attached a simple query that creates a table with a 5-minute TTL.
It works fine with ClickHouse server version 20.4.5:
CREATE TABLE IF NOT EXISTS test.profiling
(
headtime UInt64,
date DateTime MATERIALIZED toDateTime(headtime),
id Int64,
operation_name String,
duration Int64
)
ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, id)
TTL date + INTERVAL 5 MINUTE
And an important note from the ClickHouse documentation:
Data with an expired TTL is removed when ClickHouse merges data parts.
When ClickHouse see that data is expired, it performs an off-schedule
merge. To control the frequency of such merges, you can set
merge_with_ttl_timeout. If the value is too low, it will perform many
off-schedule merges that may consume a lot of resources.
If you perform the SELECT query between merges, you may get expired
data. To avoid it, use the OPTIMIZE query before SELECT.
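For example, with the test.profiling table from the sketch above, an off-schedule merge can be forced so that expired rows are dropped before reading:
-- Force a merge so rows past their TTL are removed immediately
OPTIMIZE TABLE test.profiling FINAL;
-- Rows older than 5 minutes should no longer be returned
SELECT count() FROM test.profiling;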

Does pre-sorting a partitioned table by certain columns reduce memory used for group bys?

assuming we have a table
CREATE TABLE dataset.user_activity_log
(
partition_time DATE
, user_id STRING
, description STRING
, activity_id int64
)
PARTITION BY partition_time
OPTIONS(
description="partitioned by partition_time"
)
;
And I set it up so that I insert data into it daily and, while doing so, have it ordered by activity_id.
Later on, I would like to create a report over a range of time based on the partition_time field and do a GROUP BY on activity_id. Would having the activity_id field sorted help with (potentially) not running out of memory?
This is called "Clustered Tables" and creating using DDL
snippet
PARTITION BY partition_time
CLUSTER BY
activity_id
OPTIONS (
Read this as well: Optimizing BigQuery: Cluster your tables
You need to cluster your table further by activity_id. If you run into a memory error, post your schema, table size, query, and query plan in a new question, and you will get optimization tips.
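As an illustration of the report described in the question (the date range is a placeholder), a query like the following prunes partitions via partition_time and benefits from the clustering on activity_id:
SELECT
  activity_id,
  COUNT(*) AS events
FROM dataset.user_activity_log
WHERE partition_time BETWEEN DATE '2021-01-01' AND DATE '2021-01-31'
GROUP BY activity_id;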

My data can’t be date partitioned, how do I use clustering?

Currently I am using the following query:
SELECT
ID,
Key
FROM
mydataset.mytable
where ID = 100077113 
and Key='06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 keys
If I know the Key, looking for the ID can be done on ~10,000 rows, which would work much faster and process much less data.
How can I use the new clustering capabilities in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
You can have one field of type DATE with a NULL value, so you will be able to partition by that field, and since the table is partitioned you will be able to enjoy clustering.
You need to recreate your table with an additional date column with all rows having NULL values, and then you set the partition to that date column. This way your table is partitioned.
After you are done with this, you add clustering based on the columns you identified in your query. Clustering will improve processing time, and query costs will be reduced.
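A minimal sketch of that approach, reusing the table and column names from the question (mytable_partitioned is a hypothetical target table):
CREATE TABLE mydataset.mytable_partitioned
PARTITION BY fake_date
CLUSTER BY Key AS
SELECT
  *,
  CAST(NULL AS DATE) AS fake_date  -- all-NULL partition column, present only so the table can be partitioned and clustered
FROM mydataset.mytable;
The lookup query from the question can then be pointed at mydataset.mytable_partitioned unchanged and should read only the blocks that contain the requested Key.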
Now you can partition a table on an integer column, so this might be a good solution; remember there is a limit of 4,000 partitions for each table. Because you have ~10,000 keys, I suggest creating a sort of group_key that bundles keys together, or maybe you have another column that you can leverage as an integer with a cardinality < 4,000.
Recently BigQuery introduced support for clustering tables even if they are not partitioned. So you can simply cluster on your integer field and not use partitioning altogether, although this solution will not be the most effective for data scan optimisation.
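A sketch of that partition-free variant, clustering on the integer ID column as suggested (the target table name is again hypothetical):
CREATE TABLE mydataset.mytable_clustered
CLUSTER BY ID AS  -- clustering only, no PARTITION BY clause
SELECT * FROM mydataset.mytable;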