I can insert IoT sensor data into an Azure SQL database via an Azure Stream Analytics query,
SELECT
*
INTO
myazuredb
FROM
mystreamin
except that each time a sensor sample is taken, Stream Analytics creates about 60 identical messages and inserts all of them into the database. I would like just 1 row to be inserted for each sample, based on the Date timestamp, which is identical across the duplicates. My first thought was to try GROUP BY, but after some reading about the Stream Analytics Query Language I tried this:
SELECT CollectTop(1) OVER (ORDER BY Date ASC) as Date
INTO
myazuredb
FROM
mystreamin TIMESTAMP BY Time
GROUP BY Date, TumblingWindow(second, 60)
This query doesn't insert anything, and I'm not sure I am even on the right track. Any ideas on how to approach the problem would be great.
Table columns: Date, DeviceId, Temperature, Humidity, Moisture, EventProcessedUtcTime, PartitionId, EventEnqueuedUtcTime, IoTHub, EventID
SELECT TopOne() OVER (ORDER BY Date ASC) as Date
INTO
myazuredb
FROM mystreamin TIMESTAMP BY Time
GROUP BY Date, TumblingWindow(second, 60)
TopOne() returns the top-ranked record in each group, based on the specified ordering.
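To illustrate what the query is meant to achieve, here is a small Python sketch of the same idea outside Stream Analytics: group the events by their Date value and keep one record per group. The field names follow the table in the question; the sample values and the helper name first_per_date are made up for illustration, and this is not how Stream Analytics implements it.

```python
from itertools import groupby
from operator import itemgetter

def first_per_date(events):
    """Keep one event per distinct Date value (the first after sorting by Date)."""
    ordered = sorted(events, key=itemgetter("Date"))  # sorted() is stable
    return [next(group) for _, group in groupby(ordered, key=itemgetter("Date"))]

# Hypothetical duplicated sensor messages
events = [
    {"Date": "2021-01-01T00:00:00", "DeviceId": "dev1", "Temperature": 20.5},
    {"Date": "2021-01-01T00:00:00", "DeviceId": "dev1", "Temperature": 20.5},
    {"Date": "2021-01-01T00:01:00", "DeviceId": "dev1", "Temperature": 20.7},
]
unique = first_per_date(events)
print(len(unique))  # number of rows that would be inserted
```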
We have run into the following problem in our project and cannot solve it.
We have a huge volume of log data, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
I'm running a query like this:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
The result: 10 rows in set. Elapsed: 6.23 sec.
The second query, without ORDER BY, LIMIT and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
The result: Elapsed: 7.994 sec. for each 500 rows of 130000+.
This is too slow.
It seems that CH processes all the rows in the table. What is wrong, and what do we need to change to improve the query speed?
The same implementation on MongoDB takes 200-500 ms at most.
Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. It is therefore vital to have a timestamp as part of your WHERE clause, so that ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would expect these queries to scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
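The week column in the question's table defaults to toMonday(ts), so a filter on week has to use the Monday of the week you are interested in. A minimal Python sketch of computing that value on the client side follows; the helper name to_monday is hypothetical and not part of any ClickHouse client library.

```python
from datetime import date, timedelta

def to_monday(d: date) -> date:
    """Round a date down to the Monday of its week, like ClickHouse's toMonday()."""
    return d - timedelta(days=d.weekday())  # weekday(): Monday == 0

# A Thursday in the week used by the example queries
week = to_monday(date(2021, 7, 8))
print(week.isoformat())  # value to use in: AND week = '...'
```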
AFAIK, unless you specify the exact partition format, CH will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine the partitioning could look like this:
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries
My data is partitioned by day in the standard Hive format:
/year=2020/month=10/day=01
/year=2020/month=10/day=02
/year=2020/month=10/day=03
/year=2020/month=10/day=04
...
I want to query all data from the last 60 days using Amazon Athena (i.e. Presto). I want this query to use the partition columns (year, month, day) so that only the necessary partition files are scanned. Assuming I can't change the file partition format, what is the best approach to this problem?
You don't have to use year, month, day as the partition keys for the table. You can have a single partition key called date and add the partitions like this:
ALTER TABLE the_table ADD
PARTITION (`date` = '2020-10-01') LOCATION 's3://the-bucket/data/year=2020/month=10/day=01'
PARTITION (`date` = '2020-10-02') LOCATION 's3://the-bucket/data/year=2020/month=10/day=02'
...
With this setup you can even set the type of the partition key to date:
PARTITIONED BY (`date` date)
Now you have a table with a date column typed as a DATE, and you can use any of the date and time functions to do calculations on it.
What you won't be able to do with this setup is use MSCK REPAIR TABLE to load partitions, but you really shouldn't do that anyway – it's extremely slow and inefficient and really something you only do when you have a couple of partitions to load into a new table.
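Since the partitions have to be added explicitly with this setup, the ALTER TABLE statement can be generated rather than written by hand. The following Python sketch builds the statement text for a run of consecutive days, using the table name and bucket path from the example above; the helper name add_partition_sql is made up, and actually submitting the statement to Athena is left out.

```python
from datetime import date, timedelta

def add_partition_sql(table, base, start, days):
    """Build one ALTER TABLE ... ADD statement covering `days` consecutive dates."""
    clauses = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        # Existing Hive-style layout: year=YYYY/month=MM/day=DD
        location = f"{base}/year={d:%Y}/month={d:%m}/day={d:%d}"
        clauses.append(f"PARTITION (`date` = '{d.isoformat()}') LOCATION '{location}'")
    return f"ALTER TABLE {table} ADD\n" + "\n".join(clauses)

sql = add_partition_sql("the_table", "s3://the-bucket/data", date(2020, 10, 1), 3)
print(sql)
```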
An alternative to the approach proposed by Theo is to use the following syntax, e.g.:
select ... from my_table where year||month||day between '20200630' and '20201010'
This works when the year, month and day columns are strings with zero-padded values, as in the paths above. It's particularly useful for querying across months.
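The comparison on the concatenated string is lexicographic, so both bounds need to be full eight-character YYYYMMDD values. A short Python sketch of building the bounds for the last N days and of why padding matters; the helper name bounds_last_n_days and the fixed "today" are assumptions for illustration.

```python
from datetime import date, timedelta

def bounds_last_n_days(today: date, n: int):
    """Return (lower, upper) as zero-padded YYYYMMDD strings."""
    start = today - timedelta(days=n)
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")

lower, upper = bounds_last_n_days(date(2020, 10, 10), 60)
print(lower, upper)  # plug into: year||month||day between lower and upper

# Zero-padded strings sort in date order...
padded_ok = "20200630" < "20200701" < "20201010"
# ...but a seven-character bound such as '2020630' sorts after '20201010'.
unpadded_breaks = "2020630" > "20201010"
print(padded_ok, unpadded_breaks)
```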
I am currently using a ClickHouse cluster (2 shards, 2 replicas) to read transaction logs from my server. The log contains fields like timestamp, bytes delivered, ttms, etc. The structure of my table is as follows:
CREATE TABLE db.log_data_local ON CLUSTER '{cluster}' (
timestamp DateTime,
bytes UInt64,
/*lots of other fields */
) ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/db/tables/logs/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + INTERVAL 1 MONTH;
CREATE TABLE db.log_data ON CLUSTER '{cluster}'
AS db.log_data_local
ENGINE = Distributed('{cluster}','db','log_data_local',rand());
I am ingesting data from Kafka and using a materialized view to populate this table. Now I need to calculate the peak throughput per second from this table. So basically I need to sum up the bytes field per second and then find the max value over each 5-minute period.
I tried using ReplicatedAggregatingMergeTree with aggregate functions for the throughput, but the peak value I get is much lower than the value I get when I query the raw table directly.
The problem is that when I create the materialized view to populate the peak values, querying the distributed table directly gives no results, while querying the local table considers only part of the data set. I tried using an intermediate table to compute the per-second total and then creating the materialized view on top of it, but I faced the same issue.
This is the schema for my peaks table and the materialized view I am trying to create:
CREATE TABLE db.peak_metrics_5m_local ON CLUSTER '{cluster}'
(
timestamp DateTime,
peak_throughput AggregateFunction(max,UInt64),
)
ENGINE=ReplicatedAggregatingMergeTree('/clickhouse/{cluster}/db/tables/peak_metrics_5m_local/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp)
TTL timestamp + toIntervalDay(90);
CREATE TABLE db.peak_metrics_5m ON CLUSTER '{cluster}'
AS db.peak_metrics_5m_local
ENGINE = Distributed('{cluster}','db','peak_metrics_5m_local',rand());
CREATE MATERIALIZED VIEW db.peak_metrics_5m_mv ON CLUSTER '{cluster}'
TO db.peak_metrics_5m_local
AS SELECT
toStartOfFiveMinute(timestamp) as timestamp,
maxState(bytes) as peak_throughput,
FROM (
SELECT
timestamp,
sum(bytes) as bytes,
FROM db.log_data_local
GROUP BY timestamp
)
GROUP BY timestamp;
Please help me out with a solution to this.
This is impossible to implement with a materialized view (MV). An MV is an insert trigger.
The inner query (sum(bytes) as bytes, ... GROUP BY timestamp) runs against the inserted buffer only and does not read data from the log_data_local table.
https://github.com/ClickHouse/ClickHouse/issues/14266#issuecomment-684907869
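A toy Python simulation of that effect, under the assumption that rows for the same second arrive in more than one insert: the per-second sum is computed per inserted buffer, so the stored maximum can end up lower than the one computed over the full table. The row values and the helper name per_second_peak are invented for illustration.

```python
from collections import defaultdict

def per_second_peak(rows):
    """Sum bytes per second over `rows`, then return the largest per-second sum."""
    totals = defaultdict(int)
    for second, nbytes in rows:
        totals[second] += nbytes
    return max(totals.values())

# Hypothetical (second, bytes) rows that all fall into the same second
rows = [(0, 100), (0, 200), (0, 300), (0, 400)]

# Querying the raw table sees every row at once.
true_peak = per_second_peak(rows)

# The MV sees each inserted buffer separately; the target table then
# keeps the max over those per-buffer results.
blocks = [rows[:2], rows[2:]]
mv_peak = max(per_second_peak(block) for block in blocks)

print(true_peak, mv_peak)
```

Here the second's real total is split across two inserts, so the value derived from the per-insert sums is smaller than the one from the raw data, which matches the symptom described in the question.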
When doing queries on a partitioned table in SQL Server, does one have to do anything special?
The reason I am asking is that we have a fairly large SQL Server table that is partitioned by day on a `datetime2(2)` column.
Each day is mapped to its own filegroup, with a file in that filegroup named appropriately, such as Logs_2014-09-15.ndf.
If I run a query on this table that spans, say, only 2 days, I see in Resource Monitor that SQL Server is accessing more than 2 of the daily .ndf files. (Edit: in fact, I have noticed that it searches through every single one, even if I select from a day that falls in partition 1.)
From my understanding of partitioned tables, it should only search the partitions that it needs?
So my questions:
Is this the case?
Does how I compare the DateTime2 column affect the query?
For example, I could query like so:
select * from LogsTable
where [date] like '2014-09-15'
or I could do:
select * from LogsTable
where [date] = CAST('2014-09-15'AS DATETIME2)
Does the partition function automatically look at the [time] element if it is in the query and then send the SQL to the correct partition?
Have you tried this:
select * from LogsTable
where Dateadd(D, 0, Datediff(D, 0, [date])) = CAST('2014-09-15'AS DATETIME2)
I have a table in my database, say mytable, which contains requests coming from another source. One column in this table, Time, stores the date and time (e.g. 2010/07/10 01:21:43) when the request was received. I want to fetch the data from this table on an hourly basis for each day, meaning I want the count of requests the database received in each hour of the day, e.g. from 1 o'clock to 2 o'clock the count is 50, and so on. I will run this query at the end of the day, so I will get the requests received in that day grouped by hour.
Can anybody help me with this?
I want a query that takes as little time as possible to fetch the data, as my database is huge.
Is there any other way than OMG Ponies' answer?
Use the TO_CHAR function to format the DATETIME column, so you can GROUP BY it for aggregate functions:
SELECT TO_CHAR(t.time, 'YYYY-MM-DD HH24') AS hourly,
COUNT(*) AS numPerHour
FROM YOUR_TABLE t
GROUP BY TO_CHAR(t.time, 'YYYY-MM-DD HH24')
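For readers who want to see the grouping logic outside the database, here is a minimal Python sketch of the same idea: truncate each timestamp to its hour and count the rows per bucket. The sample timestamps and the helper name count_per_hour are hypothetical.

```python
from collections import Counter
from datetime import datetime

def count_per_hour(timestamps):
    """Count timestamps per 'YYYY-MM-DD HH' bucket, like GROUP BY TO_CHAR(..., 'YYYY-MM-DD HH24')."""
    return Counter(ts.strftime("%Y-%m-%d %H") for ts in timestamps)

# Hypothetical request times
times = [
    datetime(2010, 7, 10, 1, 21, 43),
    datetime(2010, 7, 10, 1, 59, 0),
    datetime(2010, 7, 10, 2, 5, 10),
]
hourly = count_per_hour(times)
for hour, n in sorted(hourly.items()):
    print(hour, n)
```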
Why don't you create another table that stores the count and the date? Create a database job that runs hourly and puts the count and sysdate into the new table. Your code will then just query the new table.
create table othertable_count (no_of_rows number, time_of_count date);
Your database job, which will run hourly, will be something like
insert into othertable_count
select count(1), sysdate
from othertable;
And you will query the table othertable_count instead of querying your original table.
SELECT To_char(your_column, 'YYYY-MM-DD HH24') AS hourly,
Count(*) AS numPerHour
FROM your_table
WHERE your_column > '1-DEC-18'
AND your_column < '4-DEC-18'
GROUP BY To_char(your_column, 'YYYY-MM-DD HH24')
ORDER BY hourly;