ClickHouse TTL on materialized column - ttl

I am trying to upgrade the clickhouse cluster from version 18.8 to 19.9.2. Previously, I had a cronjob that deletes old data from the database. I want to start using TTL feature instead.
Simplified table definition:
CREATE TABLE myTimeseries(
timestamp_ns Int64,
source_id String,
data String,
date Date MATERIALIZED toDate(timestamp_ns/1e9),
time DateTime MATERIALIZED toDateTime(timestamp_ns/1e9))
ENGINE = MergeTree()
PARTITION BY (source_id, toStartOfHour(time))
TTL date + toInterValDay(7)
SETTINGS index_granularity=8192, merge_with_ttl_timeout=43200
The problem is, it does not delete old data. I could not find anything in the documentation that would help debug this issue.
Questions:
How can I debug this issue? (Is there a way to see when the data will be cleared in the future)?
Might this be because of date field being materialized? I have another table where date is not a materialized field and everything works fine.

Yes, you can use materialized fields with TTL feature.
I've attached simple query that create table with 5 minutes interval to delete.
It works fine with clickhouse server version 20.4.5
CREATE TABLE IF NOT EXISTS test.profiling
(
headtime UInt64,
date DateTime MATERIALIZED toDateTime(headtime),
id Int64,
operation_name String,
duration Int64
)
ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, id)
TTL date + INTERVAL 5 MINUTE
And important note from clickhouse documentation:
Data with an expired TTL is removed when ClickHouse merges data parts.
When ClickHouse see that data is expired, it performs an off-schedule
merge. To control the frequency of such merges, you can set
merge_with_ttl_timeout. If the value is too low, it will perform many
off-schedule merges that may consume a lot of resources.
If you perform the SELECT query between merges, you may get expired
data. To avoid it, use the OPTIMIZE query before SELECT.

Related

Is it possible to set expiration time for records in BigQuery

Is it possible to set a time to live for a column in BigQuery?
If there are two records in table payment_details and timestamp, the data in BigQuery table should be deleted automatically if the timestamp is current time - timestamp is greater is 90 days.
Solution 1:
BigQuery has a partition expiration feature. You can leverage that for your use case.
Essentially you need to create a partitioned table, and set the partition_expiration_days option to 90 days.
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
OPTIONS(
partition_expiration_days=90
)
or if you have a table partitioned already by the right column
ALTER TABLE mydataset.mytable
SET OPTIONS (
-- Sets partition expiration to 90 days
partition_expiration_days=90
)
When a partition expires, BigQuery deletes the data in that partition.
Solution 2:
You can setup a Scheduled Query that will prune hourly/daily your data that is older than 90 days. By writing a "Delete" query you have more control to actually combine other business logic, like only delete duplicate rows, but keep most recent entry even if it's older than 90 days.
Solution 3:
If you have larger business process that does the 90 day pruning based on other external factors, like an API response, and conditional evaluation, you can leverage Cloud Workflows to build and invoke a workflow regularly to automate the pruning of your data. See Automate the execution of BigQuery queries with Cloud Workflows article which can guide you with this.

Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing to the data and saves it into a new BigQuery table. This is a batch process performed on a weekly basis through a cron. Entries keep being added on the source table, so I want that whenever I start the ETL process it only process the new rows which have been added since the last time the ETL job was launched.
In order to achieve this, I have thought about making a query to my sink table asking for the most recent timestamp it contains. Then, as a data source I will perform another query to the source table filtering and asking for the entries having a timestamp higher than the one I have just recovered. Both my source and sink table are time partitioned ones.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel like if it is not the most efficient way of querying it. Does this query take advantage of the partitioned feature of my table? Is there any better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even less bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()

ClickHouse - SELECT row of data is too slow

The following problem occurred in our project, which we cannot solve.
We have a huge data of our logs, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
and I'm running a query like so
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
this is the result 10 rows in set. Elapsed: 6.23 sec.
And second without order, limit and offset:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
this is the result Elapsed: 7.994 sec. for each 500 rows of 130000+
Is too slow.
Seems that CH process all the rows in the table. What is wrong and what need to improve the speed of CH?
The same implementation on MongoDB takes 200-500ms max
Egor! When you mentioned, "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or you somehow connect to ClickHouse from MongoDB to run queries you're referring to?
I'm not sure how do you ingest your data, but let's focus on the reading part.
For MergeTree family ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as a part of your where clause, so ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
AFAIK, unless you specified the exact partition format, CH will use partitioning by month (ie toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine partitioning could be like
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries

Calculating per second peak values after summing up individual values in clickhouse

I am currently using Clickhouse cluster (2 shards, 2 replicas) to read transaction logs from my server. The log contains fields like timestamp, bytes delivered, ttms, etc. The structure of my table is as below:
CREATE TABLE db.log_data_local ON CLUSTER '{cluster}' (
timestamp DateTime,
bytes UInt64,
/*lots of other fields */
) ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/db/tables/logs/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + INTERVAL 1 MONTH;
CREATE TABLE db.log_data ON CLUSTER '{cluster}'
AS cdn_data.http_access_data_local
ENGINE = Distributed('{cluster}','db','log_data_local',rand());
I am ingesting data from Kafka and using materialized view to populate this table. Now I need to calculate the peak throughput per second from this table. So basically I need to sum up the bytes field per second and then find the max value for a 5 minute period.
I tried using ReplicatedAggregatingMergeTree with aggregate functions for the throughput, but the peak value I get is much less compared to the value I get when I directly query the raw table.
The problem is, while creating the material view to populate the peak values, querying the distributed table directly is not giving any results but if I query the local table then only partial data set is considered. I tried using an intermediary table to compute the per-second total and then to create the materialized but I faced the same issue.
This is the schema for my peaks table and the materialized view I am trying to create:
CREATE TABLE db.peak_metrics_5m_local ON CLUSTER '{cluster}'
(
timestamp DateTime,
peak_throughput AggregateFunction(max,UInt64),
)
ENGINE=ReplicatedAggregatingMergeTree('/clickhouse/{cluster}/db/tables/peak_metrics_5m_local/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp)
TTL timestamp + toIntervalDay(90);
CREATE TABLE db.peak_metrics_5m ON CLUSTER '{cluster}'
AS cdn_data.peak_metrics_5m_local
ENGINE = Distributed('{cluster}','db','peak_metrics_5m_local',rand());
CREATE MATERIALIZED VIEW db.peak_metrics_5m_mv ON CLUSTER '{cluster}'
TO db.peak_metrics_5m_local
AS SELECT
toStartOfFiveMinute(timestamp) as timestamp,
maxState(bytes) as peak_throughput,
FROM (
SELECT
timestamp,
sum(bytes) as bytes,
FROM db.log_data_local
GROUP BY timestamp
)
GROUP BY timestamp;
Please help me out with a solution to this.
It's impossible to implement with MV. MV is an insert trigger.
sum(bytes) as bytes, ... GROUP BY timestamp works against inserted buffer and does not read data from log_data_local table.
https://github.com/ClickHouse/ClickHouse/issues/14266#issuecomment-684907869

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling with querying efficiently the last partition of a table, using a date or datetime field. The first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But that, according to BigQuery's processing estimation, scans the entire table and does not use the partitions. Even Google states this happens in their documentation. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date then the query will not get any results and my automatic procesess will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimation of the bytes to be processed with the active query does not match what was actually processed, unless I'm not interpreting those numbers correctly. Find below the screenshot of the mismatching values.
BigQuery screen with aparrently mistmatching processed bytes
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimation seems to work, but it is not clear why. However, the actual processed bytes after running the query is not different than the case that filters the latest partition in the WHERE clause.
Using the same declared max_date in a table that is both partitioned and clustered, the estimation works only when using a filter for the partition, but fails if I include a filter for the cluster.
After some iterations I got an answer from Google and although it doesn't resolve the issue, it acknowledges it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made notes to try and improve this in the future but we have to deal with this for now. I hope this helps somebody that was trying to do the same thing.