Azure Synapse partitioned table - no performance improvement - SQL

One of our Synapse tables has 300 million rows and keeps growing. Every row has a status column, active_row, which is either 0 or 1 (int datatype). Users only query on active_row = 1, which matches only 28 million rows; the rest of the data, i.e. 270 million rows, is inactive.
To improve performance and avoid a full table scan on active_row, I converted the table into a partitioned table on active_row, as below:
CREATE TABLE [repo].[STXXXXX]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX ([ID] ASC),
    PARTITION
    (
        active_Row RANGE LEFT FOR VALUES (0, 1)
    )
)
AS
SELECT * FROM repo.nonptxx;
Users reported no performance improvement after moving to the partitioned table. When I compare the queries below (partitioned vs. non-partitioned), I don't see any difference in the query explain plans in terms of estimated subtree cost, operations, etc., and all the stats show the same figures. From sys.dm_pdw_nodes_db_partition_stats I can see 3 partitions were created: partition 1 holds the 270 million inactive rows split across the 60 distributions, partition 2 holds the 30 million active rows across the 60 distributions, and partition 3 is empty.
select * from [repo].[STXXXXX] where active_row = 1
vs
select * from repo.nonptxx where active_row = 1
Please advise what's wrong, why there is no improvement after moving to the partitioned table, and how to tune it.
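For reference, partition-level row counts like those quoted above can be pulled from that DMV along these lines (a sketch; the per-table filter on object_id is omitted here):

SELECT partition_number,
       SUM(row_count) AS total_rows
FROM sys.dm_pdw_nodes_db_partition_stats
GROUP BY partition_number
ORDER BY partition_number;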

Are statistics updated?
Run UPDATE STATISTICS [schema_name].[table_name] and rerun your tests (or create stats if they don't exist).
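For this table that might look like the following (a sketch; the statistics-object name is hypothetical):

-- Refresh all statistics on the partitioned table
UPDATE STATISTICS [repo].[STXXXXX];
-- Or, if no statistics exist yet on the filter column:
CREATE STATISTICS stat_active_row ON [repo].[STXXXXX] (active_Row);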
You should see a Filter step with the smaller number of rows returned when querying a single partition in the tsql query plan, right after the Get step. You won't see it in the dsql query plan. You won't see any subtree cost for a SELECT *, which translates to a single Return operation from the individual nodes; however, you will see the estimated number of rows per execution get smaller as you filter by partition (with stats up to date). Missing or outdated stats can produce some odd query plan results, because the optimizer essentially doesn't have enough information to make a good decision; the results are therefore unpredictable and sometimes poor.
Another option you may want to consider, if partitioning doesn't give you the performance you're looking for, is keeping the data without partitions and simply creating a non-clustered index on the column. Indexes don't always get used or behave exactly how you'd expect with SQL Server; however, in this use case a single-column index will typically help performance a great deal. The benefit of the index is that if you have data moving from active to inactive, records don't need to move between physical partitions.
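A minimal sketch of that alternative (the index name is hypothetical):

-- Single-column non-clustered index on the original, non-partitioned table
CREATE INDEX ix_active_row ON repo.nonptxx (active_Row);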

Related

PostgreSQL index reduces data size but makes the query slower

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to perform aggregations on the whole table on a daily basis; the aggregation results will later be used for analytical reports in Google Data Studio.
One of the queries I'm trying to run looks as follows:
explain analyze
select tender->>'procurementMethodType' as procurement_method,
tender->>'status' as tender_status,
sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2
The query plan and execution time are not reproduced here; the plan scans the whole table, and the query finishes in about 30 seconds.
The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:
create index on tenders((tender->>'procurementMethodType'), (tender->>'status'), (cast(tender->'value'->>'amount' as decimal)))
The size of the index is 44MB, which is much smaller than the size of the entire table, so I expected the query to be much faster. However, when I run the same query with the index created (plan not shown), the query with the index is slower! How can this be possible?
EDIT: the table itself contains two columns: the ID column and the jsonb data column:
create table tenders (
id uuid primary key,
tender jsonb
)
The code that does an index-only scan is somewhat deficient in this case. It thinks it needs "tender" to be available in the index in order to fulfill the demand for cast(tender->'value'->>'amount' as decimal). It fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender" itself. So it is doing a regular index scan, in which it has to jump from the index to the table for every row it will return, to fish out "tender" and then compute the cast. This means it is jumping all over the table doing random I/O, which is much slower than just reading the table sequentially and then doing a sort.
You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table), if it can even be built, but it would take away the need for a sort.
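Spelled out as DDL, that suggestion would look like this (a sketch; rows whose tender value exceeds the btree size limit would make the build fail):

create index on tenders ((tender->>'procurementMethodType'), (tender->>'status'), tender);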
But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?

Poorly performing query on order lines table

I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan: (not shown)
Stats are up to date; we re-gather once a week.
The query takes 30+ minutes to finish.
UPDATE
After adding an index (definition not shown), the explain plan shows the query is using the index now.
The query runs faster, but not 'fast': it completes in about 6 minutes.
UPDATE 2
I created a covering index as suggested by Matthew and Gordon (definition not shown).
The query now completes in less than 1 second.
Explain plan: (not shown)
I still wonder whether a function-based index would also have been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index becomes worse than a full table scan (or at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
A materialized view, possibly for the prior 11 months, together with a live query against OE_ORDER_LINES_ALL for the current month-to-date (a sketch follows this list).
A covering index (see below).
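A sketch of the materialized-view option (names are hypothetical; SYSDATE in the defining query limits you to complete refreshes, e.g. a nightly job):

CREATE MATERIALIZED VIEW mv_shipped_last_365  -- hypothetical name
REFRESH COMPLETE ON DEMAND
AS
SELECT inventory_item_id,
       SUM(NVL(shipped_quantity, 0)) AS shipped_quantity_last_365
FROM oe_order_lines_all
WHERE actual_shipment_date >= TRUNC(SYSDATE) - 365
  AND cancelled_flag = 'N'
  AND fulfilled_flag = 'Y'
GROUP BY inventory_item_id;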
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job of identifying the rows.
You can add the additional columns inventory_item_id and shipped_quantity to the index as well, as sketched below.
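As DDL, that would be something like the following (index names are hypothetical; the second, covering variant makes the first redundant):

CREATE INDEX ix_ool_flags_shipdate
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date);

-- Extended with the two extra columns so the query can be answered from the index alone
CREATE INDEX ix_ool_covering
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date,
                           inventory_item_id, shipped_quantity);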
Let's recapitulate the facts:
a) you access about 300K rows from your table (see the cardinality in the 3rd line of the execution plan)
b) you use a FULL TABLE SCAN to get the data
c) the query is very slow
The first thing is to check why the FULL TABLE SCAN is so slow - if the table is extremely large (check BYTES in user_segments), you need to optimize the access to your data.
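For example (a sketch):

SELECT segment_name, ROUND(bytes / 1024 / 1024) AS size_mb
FROM user_segments
WHERE segment_name = 'OE_ORDER_LINES_ALL';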
But remember that no index will help you get 300K rows from, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not much used and a large part of it is on disk.
What you need is partitioning - in your case, range partitioning on actual_shipment_date - for your data size, on a monthly or yearly basis.
This will eliminate the need to scan the old data (partition pruning) and make the query much more effective.
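A sketch of monthly range partitioning, using interval partitioning so new partitions are created automatically (the table name and initial boundary are hypothetical):

CREATE TABLE oe_order_lines_all_part
PARTITION BY RANGE (actual_shipment_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01')
)
AS SELECT * FROM oe_order_lines_all;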
Another possibility - if the number of rows is small but the table size is very large - is to reorganize the table to get a better full-scan time.

Return First Row For Given Value in a Column - BigQuery

I have a very large table with a column that holds a custom ID of string type for each row. For each ID, there are 50 properties in that table. The ID is guaranteed to be unique in the table.
My main task is to get those 50 properties in the row for a given ID.
When I run a normal query like the one below, it takes 5 sec to scan only 1 million rows.
SELECT * FROM `mytable` WHERE id='123'
As per my understanding, BigQuery does a parallel search for a match after partitioning the rows into different clusters. And I believe that for a given ID value it will check all the rows in all the different clusters, so that even if a match is found in one cluster, the other clusters will continue looking for other matches.
But as the values in the ID column are unique here, can we somehow "break" the jobs running on the other clusters as soon as a match is found in one cluster, and just return the row?
I hope this will speed up the query run time.
Also, this table will grow really large in the future, so if this can be done it will be very helpful for my purpose.
Any suggestions are welcome.
You can use the recently introduced Clustered Tables.
This will allow you to bring down cost and improve performance
Please note: currently clustering is supported for partitioned tables only - but support for clustering non-partitioned tables is under development
If your table is partitioned, you can just cluster it by id - and you are done.
If not, you can introduce a 'fake' date field and partition by it, so that clustering becomes available for the table.
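A sketch of that workaround (dataset, table, and column names are hypothetical):

-- Create a copy partitioned by a constant 'fake' date and clustered by id
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY id
AS
SELECT t.*, DATE '2000-01-01' AS fake_date
FROM `mydataset.mytable` t;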
Meantime, if you are just interested in one row for a given id, try the query below:
SELECT * FROM mytable WHERE id='123' LIMIT 1

SQL Server : OFFSET FETCH performs scan while TOP WHERE performs seek?

I have the following two queries. One is fast, the other is slow.
The table has a clustered index on the Id column.
-- Slow, uses clustered index scan reading 100100 rows
SELECT *
FROM [dbo].[Foo]
ORDER BY Id
OFFSET 100000 ROWS FETCH FIRST 100 ROWS ONLY
-- Fast, uses clustered index seek reading 100 rows
SELECT TOP 100 *
FROM [dbo].[Foo]
WHERE Id > 100000
ORDER BY Id
The plans are identical except that one uses a scan and the other a seek.
Can anyone explain why, or is this simply how OFFSET works?
The table is very wide with a few NVARCHAR(100-200) and a single NVARCHAR(2500) column.
The two queries are not equivalent. Although you might assume that the ids have no gaps and start at 1, the database engine does not know that.
Indexes are organized to find particular values quickly. They generally do this by traversing a tree structure, and one which is generally balanced. You can read more about this in the documentation.
However, they are not organized to quickly get to the nth row in the table. Hence, the query needs to scan the clustered index from the start to count off the rows to skip.
That said, an index could do what you want if it kept the number of rows in each child node. Do realize that this would complicate modifications to the table, because the counts along the entire hierarchy would need to be updated for each update, insert, and delete.

Index created but doesn't speed up the retrieval process

I have created a table as below:
create table T1(num varchar2(20))
Then I inserted 3 lakh (300,000) numbers into the table, so now it looks like below:
num
1
2
3
.
.
300000
Now if I do
select * from T1
then it takes 1 min 15 sec to completely fetch the records. I also created an index on column num, so I expected the query below to be faster at fetching the 300,000 records, but it too takes 1 min 15 sec:
select * from T1 where num between '1' and '300000'
So how has the index improved my retrieval process?
The index does not improve the retrieval process when you are trying to fetch all rows.
The index makes it possible to find a subset of rows much more quickly.
An index can help if you want to retrieve a few rows from a large table. But since you retrieve all rows and since your index contains all the columns of your table, it won't speed up the query.
Furthermore, you don't tell us what tool you use to retrieve the data. I guess you use SQL Developer or Toad. So what you measure is the time it takes SQL Developer or Toad to store 300,000 rows in memory in such a way that they can be easily displayed on screen in a scrollable table. You aren't really measuring how long it takes to retrieve them.
To test the effect of having an index in place, you might want to try a query such as:
SELECT *
FROM T1
WHERE NUM IN ('288888', '188888', '88888')
both with the index in place, and again after removing the index. You should also collect statistics on the table prior to running the query with the index in place, or you may still get a query which performs a full table scan. Share and enjoy.
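Gathering statistics might look like this (a sketch, assuming the table is in your own schema):

BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'T1');
END;
/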