I have a database that stores data from sensors in a factory. The DB contains about 1.6 million rows per sensor per day. I have the following index on the table.
CREATE INDEX sensor_name_time_stamp_index ON sensor_data (time_stamp, sensor_name);
I will be running the following query once per day.
SELECT
    time_stamp, value
FROM
    (SELECT
        time_stamp,
        value,
        lead(value) OVER (ORDER BY time_stamp DESC) AS prev_result
    FROM
        sensor_data
    WHERE
        time_stamp BETWEEN '2021-02-24' AND '2021-02-25'
        AND sensor_name = 'sensor8'
    ORDER BY
        time_stamp DESC) AS result
WHERE
    result.value IS DISTINCT FROM result.prev_result
ORDER BY
    result.time_stamp DESC;
The query returns a list of rows where the value is different from the previous row.
This query takes about 23 seconds to run.
Running on PostgreSQL 10.12 on Aurora serverless.
Question: besides the index, are there any other optimisations I can perform on the DB to make the query run faster?
To support the query optimally, the index must be defined the other way around:
CREATE INDEX ON sensor_data (sensor_name, time_stamp);
Otherwise, PostgreSQL will have to read all index values in the time interval, then discard the ones for the wrong sensor, then fetch the rows from the table.
With the proper column order, only the required rows are scanned in the index.
You asked for other optimizations: since the query has to sort rows, increasing work_mem can be beneficial. Other than that, more memory and faster storage certainly won't hurt.
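For example, a minimal sketch, assuming the daily query runs in its own session (the 256MB value is an arbitrary starting point, not a recommendation; tune it to your instance size):
-- Raise the per-sort memory budget for the current session only,
-- so the ORDER BY and window sorts can happen in memory.
SET work_mem = '256MB';
-- Then run the daily query in the same session.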
We have run into the following problem in our project and cannot solve it.
We have a huge amount of log data, and we are moving from MongoDB to ClickHouse.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
and I'm running a query like so
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
This is the result: 10 rows in set. Elapsed: 6.23 sec.
And a second query, without the ORDER BY, LIMIT, and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
This is the result: Elapsed: 7.994 sec., for each 500 rows out of 130,000+.
This is too slow.
It seems that ClickHouse processes all the rows in the table. What is wrong, and what needs to change to improve the speed of ClickHouse?
The same implementation on MongoDB takes 200-500 ms at most.
Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as part of your WHERE clause, so ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
AFAIK, unless you specify the exact partition format, ClickHouse will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine the partitioning could look like this:
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries
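On recent ClickHouse versions (21.4+, if I remember correctly) you can also confirm that the partition key and primary key are actually being used. A sketch, reusing the names from the question:
EXPLAIN indexes = 1
SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10;
The output shows, for the partition (MinMax) key and the primary key, how many parts and granules were selected versus dropped, which tells you whether pruning actually happened.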
I have a postgres database with a table of reviews that has 15 columns and 10,000,000 rows of data.
Columns:
id
product_id
_description
stars
comfort_level
fit
quality
recommend
created_at
email
_yes
_no
report
I want to get every review and put it onto my front end, but since that is a bit impractical, I decided to only get 4,000 using this query: SELECT * FROM reviews ORDER BY created_at LIMIT 4000;. With an index, this is pretty fast (6.819ms). I think this could be faster, so would partitioning help in this case? Or even in the case of retrieving all 10,000,000 reviews? Or would it make more sense to split my table and use JOIN clauses in my queries?
This query will certainly become slower with partitioning:
At best, you have the table partitioned by created_at, so that only one partition would be scanned. But the other partitions have to be excluded, which takes extra time during query planning or execution.
At worst, the table is not partitioned by created_at, and you have to scan all partitions.
Note that the speed of an index scan does not depend on the size of the table.
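For reference, the kind of index that makes the quoted query fast is a plain B-tree on the sort column (a sketch; the question doesn't show the actual index definition):
-- Lets ORDER BY created_at LIMIT 4000 read the first
-- 4000 entries straight from the index, in order.
CREATE INDEX reviews_created_at_idx ON reviews (created_at);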
I don't know what your problem is. 7 milliseconds for 4000 rows doesn't sound so bad to me.
10M rows is at the top end of the range for a table classed as "small". I wouldn't worry about anything fancy until you get to at least 50M rows.
One of our Synapse tables has 300 million rows and keeps growing. Every row has a status column, active_row, which is either 0 or 1. active_row has an int datatype. Users only query with active_row = 1, which matches only 28 million rows; the rest of the data, i.e. 270 million rows, is inactive.
To improve performance and avoid a full table scan on active_row, I converted the table into a partitioned table on active_row, as below:
CREATE TABLE [repo].[STXXXXX]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX (
[ID] ASC
),
PARTITION
(
active_Row RANGE LEFT FOR VALUES (0,1)
)
)
as
select * from repo.nonptxx;
Users reported that there is no performance improvement after moving to the partitioned table. When I compare the queries below (partitioned vs. non-partitioned), I don't see any difference in the query explain plans in terms of estimated subtree cost, operations, etc., and all the statistics show the same figures. From sys.dm_pdw_nodes_db_partition_stats I can see three partitions were created: partition 1 holds 270 million rows split across 60 nodes, partition 2 holds 30 million rows split across 60 nodes, and partition 3 is empty.
select * from [repo].[STXXXXX] where active_row =1
vs
select * from repo.nonptxx where active_row =1
Please advise: what is wrong, why is there no improvement after moving to the partitioned table, and how can it be tuned?
Are statistics updated?
Run UPDATE STATISTICS [schema_name].[table_name] and rerun your tests (or create the statistics if they don't exist).
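For example, a sketch using the table name from the question (the statistics object name is made up):
-- Create single-column statistics on the partitioning column if missing...
CREATE STATISTICS stat_active_row ON [repo].[STXXXXX] ([active_Row]);
-- ...or refresh all statistics on the table.
UPDATE STATISTICS [repo].[STXXXXX];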
You should see a Filter step with the smaller number of rows returned right after the Get step when querying a single partition in the T-SQL query plan. You won't see it in the DSQL query plan. You won't see any subtree cost for a SELECT *, which translates to a single Return operation from the individual nodes; however, you will see the estimated number of rows per execution get smaller as you filter by partition (with stats up to date). Missing or outdated stats can produce some odd query plan results, because the optimizer essentially doesn't have enough information to make a good decision, and therefore gives unpredictable and sometimes poor results.
Another option you may want to consider, if partitioning doesn't give you the performance you're looking for, is keeping the data without partitions and simply creating a non-clustered index on the column. Indexes don't always get used or behave exactly how you'd expect with SQL Server; however, in this use case a one-column index will typically help performance a great deal. The benefit of the index is that if you have data moving from active to inactive, it doesn't need to move records between physical partitions.
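A minimal sketch of that alternative on the original, non-partitioned table (the index name is made up):
-- Non-clustered index so that active_row = 1 predicates
-- no longer require a full table scan.
CREATE INDEX ix_nonptxx_active_row ON [repo].[nonptxx] ([active_Row]);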
I am using the following query to get the last row of the table. This code seems to be what should be done for this type of query, so I am wondering why it takes so long to produce the result.
I am looking for the potential reasons that could lead to such a long query time. For example, maybe the row is not near the top (I use some conditions) and it takes time to reach it if the query reads the table from top to bottom? Or does the query need to read the whole table before it can conclude which row is correct (I don't think that is the case)?
Any insight into how the sorting works here is appreciated.
The code:
SELECT created_at , currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
order by created_at desc
limit 1
This is your query:
select created_at, currency, balance
from YYY
where id in (ZZZ) and currency = 'XXX'
order by created_at desc
limit 1;
If your query has no indexes, then the engine needs to scan the entire table, choose the rows that match the where clause, sort those rows and choose the one with the maximum created_at.
If you have an index on YYY(currency, id, created_at desc), then the engine can simply look up the appropriate value right away using the index.
So, if you have the right optimizations in the database, then the query is likely to be much, much faster.
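A sketch of that index, using the anonymized names from the question (PostgreSQL and MySQL 8.0+ honor the DESC modifier; older MySQL versions parse but ignore it, which is usually still fine here because the index can be read backwards):
-- Equality columns first, then the sort column, so the engine
-- can read the single newest matching row directly from the index.
CREATE INDEX yyy_currency_id_created_at ON YYY (currency, id, created_at DESC);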
I have a table partitioned by:
HASH(timestamp DIV 43200 )
When I perform this query
SELECT max(id)
FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807
it scans all partitions, even though both numbers 1581708508 and 1581708807, and every number between them, fall in the same partition. How can I make it scan only that partition?
You have discovered one of the reasons why PARTITION BY HASH is useless.
In your situation, the Optimizer sees the range (BETWEEN) and says "punt, I'll just scan all the partitions".
That is, "partition pruning" does not work when the WHERE clause involves a range and you are using PARTITION BY HASH. PARTITION BY RANGE, on the other hand, may be able to prune. But... What's the advantage? It does not make the query any faster.
I have found only four uses for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint . It sounds like your application does not fit any of those cases.
That particular query would best be done without partitioning. Instead have a non-partitioned table with this 'composite' index:
INDEX(timestamp, id)
It must scan all the rows to discover the MAX(id), but with this index, it is
Scanning only the 2-column index
Not touching any rows outside the timestamp range.
Hence it will be as fast as possible. Even if PARTITION BY HASH were smart enough to do the desired pruning, it would not run any faster.
In particular, when you ask for a range on the partition key, such as with WHERE timestamp BETWEEN 1581708508 AND 1581708807, the execution looks in all partitions for the desired rows. This is one of the major failings of Hash. Even if it could realize that only one partition is needed, it would be no faster than simply using the index I suggest.
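For concreteness, a sketch of that suggestion against a non-partitioned copy of the table (the index name is made up):
-- Covering index: the range on timestamp and MAX(id) are both
-- answered from the index alone, without touching the table rows.
ALTER TABLE messages ADD INDEX idx_timestamp_id (timestamp, id);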
You can determine the individual partition by using modular arithmetic:
MOD(<expression that is the argument of the hash function>, <number of partitions>)
Assuming you have 2 partitions:
CREATE TABLE messages(ID int, timestamp int)
PARTITION BY HASH( timestamp DIV 43200 )
PARTITIONS 2;
Look up the partition names with:
SELECT CONCAT( 'p',MOD(timestamp DIV 43200,2)) AS partition_name, timestamp
FROM messages;
and determine the related partition name for the value 1581708508 of the timestamp column (assume it is p1). Then use
SELECT MAX(id)
FROM messages PARTITION(p1)
to get all the records in partition p1 only, without the need for a WHERE condition such as
WHERE timestamp BETWEEN 1581708508 AND 1581708807
By the way, all partitions can be listed through:
SELECT *
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_name='messages'