Time consuming query to Skip First inserted record of Id list - sql

In PostgreSQL I have a table that holds multiple rows per articleId. Whenever I query it, the result should skip the first inserted record of a particular userId for each articleId in a specified list.
select * from (
select * , row_number() over (partition by articleId order by date) rn
from table where articleId in (1200) and userId = 1
) t
where t.rn > 1
It returns the expected records by skipping the first inserted record of each articleId for the given userId.
But the query above takes a long time to execute when the table is large.
table:
id  name  articleId  date                 userId
1   abc   1200       2021-05-01 06:09:35  1
2   bcd   1400       2021-05-02 06:08:35  1
3   xyz   1200       2021-05-03 09:09:35  2
4   pqr   1200       2021-05-04 08:09:35  1
5   xyz   1200       2021-05-05 09:09:35  3
Expected query output:
id  name  articleId  date                 userId
4   pqr   1200       2021-05-04 08:09:35  1

Try adding the following index, which should cover the call to ROW_NUMBER as well as the WHERE clause:
CREATE INDEX idx ON yourTable (articleId, date, userId);
This should speed up your current query. As always, check the execution plan before and after using EXPLAIN.

I would suggest using a correlated subquery with the right indexing:
select *
from t
where t.articleid = 1200 and t.userId = 1 and
t.date > (select min(t2.date)
from t t2
where t2.articleId = t.articleId
);
Then for this query, you want two indexes: (articleid, userId) and (articleId, date).
Note: I'm a bit surprised that userId is not in the partition by clause.
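As a rough check of the two approaches above, the sketch below loads the sample rows into an in-memory SQLite database (a stand-in for PostgreSQL; it needs SQLite 3.25 or later for window functions) and runs both queries. The table name `articles` is hypothetical, and the correlated subquery here also matches on userId, which is an assumption added so that it skips the first record of that user rather than the first record of the article overall.

```python
import sqlite3

# SQLite stands in for PostgreSQL; "articles" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles "
    "(id INTEGER, name TEXT, articleId INTEGER, date TEXT, userId INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?)",
    [
        (1, "abc", 1200, "2021-05-01 06:09:35", 1),
        (2, "bcd", 1400, "2021-05-02 06:08:35", 1),
        (3, "xyz", 1200, "2021-05-03 09:09:35", 2),
        (4, "pqr", 1200, "2021-05-04 08:09:35", 1),
        (5, "xyz", 1200, "2021-05-05 09:09:35", 3),
    ],
)

# Window-function version from the question.
window_ids = [row[0] for row in conn.execute("""
    SELECT id FROM (
        SELECT id,
               row_number() OVER (PARTITION BY articleId ORDER BY date) AS rn
        FROM articles
        WHERE articleId IN (1200) AND userId = 1
    ) t
    WHERE t.rn > 1
""")]

# Correlated-subquery version; the extra userId match is an assumption
# made here so both queries skip the same "first" row.
subquery_ids = [row[0] for row in conn.execute("""
    SELECT t.id
    FROM articles t
    WHERE t.articleId = 1200 AND t.userId = 1
      AND t.date > (SELECT min(t2.date)
                    FROM articles t2
                    WHERE t2.articleId = t.articleId
                      AND t2.userId = t.userId)
""")]

print(window_ids, subquery_ids)
```

On the sample data both lists contain only id 4. Whether the subquery form is actually faster on a large PostgreSQL table depends on the indexes, so it is still worth comparing plans with EXPLAIN.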

Related

Getting aggregate data in MySql

I am attempting to write a sql query to fetch aggregate data from a table. I have a table with data that looks as follows (example data):
trackingId  numberOfRecords  totalRecords  dateSubmitted  fileName     checkpoint   status
1           10               100           01/01/2021     example.doc  gateway      in-progress
1           20               100           02/01/2021     null         checkpoint1  in-progress
1           20               100           03/01/2021     null         checkpoint2  in-progress
The aggregate data I would like to query would look like:
trackingId  numberOfRecords  totalRecords  dateSubmitted  fileName     checkpoint   status
1           50               100           03/01/2021     example.doc  checkpoint2  in-progress
In summary, I would like to:
group on trackingId (done)
Sum of all records fetched (done)
get the latest date (done)
name of original document (not sure how to fetch a value from the first row only, I am trying to avoid subqueries due to inefficiencies)
latest checkpoint (get value from the newest record)
latest status (get value from the newest record)
My issue mainly is fetching specific data from either the newest or oldest record.
Thanks.
Consider the query below (note that it uses BigQuery syntax, not MySQL):
select trackingId,
sum(numberOfRecords) as numberOfRecords,
any_value(totalRecords) as totalRecords,
max(dateSubmitted) as dateSubmitted,
array_agg(fileName order by dateSubmitted limit 1)[offset(0)] as fileName,
array_agg(checkpoint order by dateSubmitted desc limit 1)[offset(0)] as checkpoint,
array_agg(status order by dateSubmitted desc limit 1)[offset(0)] as status,
from `project.dataset.table`
group by trackingId
If applied to the sample data in your question, the output is: (output image not preserved)
Look for this:
CREATE TABLE test ( id INT, -- will be used for ordering
cat INT, -- will be used for aggregation
col1 INT, -- will be used for to get SUM
col2 INT, -- will be used for to get value from 1st row
col3 INT -- will be used for to get value from last row
);
INSERT INTO test VALUES
(1,1,11,111,1111), (2,1,22,222,2222), (3,1,33,333,3333),
(4,2,4,4,4), (5,2,5,5,5);
SELECT * FROM test;
id  cat  col1  col2  col3
1   1    11    111   1111
2   1    22    222   2222
3   1    33    333   3333
4   2    4     4     4
5   2    5     5     5
SELECT cat,
SUM(col1) col1_sum,
SUBSTRING_INDEX(GROUP_CONCAT(col2 ORDER BY id), ',', 1) col2_first,
SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id), ',', -1) col3_last
FROM test
GROUP BY cat;
cat  col1_sum  col2_first  col3_last
1    66        111         3333
2    9         4           5
The values processed by GROUP_CONCAT() must not contain a comma, because the comma is used as the separator here.
PS. Do not forget about group_concat_max_len (1024 by default), especially when a single value in the column may be long; a longer result is truncated.
PPS. The expression for the last value may also be written as SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id DESC), ',', 1) col3_last.
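If window functions are available (MySQL 8.0 or later), a different technique avoids the comma and length limitations of GROUP_CONCAT(): FIRST_VALUE() over an ascending and a descending ordering. The sketch below is not the method from the answer above; it runs the idea against the same test table in an in-memory SQLite database (3.25 or later), used here only as a convenient stand-in for MySQL.

```python
import sqlite3

# SQLite is used as a stand-in; the window-function syntax shown
# is also accepted by MySQL 8.0+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INT, cat INT, col1 INT, col2 INT, col3 INT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 11, 111, 1111),
        (2, 1, 22, 222, 2222),
        (3, 1, 33, 333, 3333),
        (4, 2, 4, 4, 4),
        (5, 2, 5, 5, 5),
    ],
)

rows = conn.execute("""
    SELECT DISTINCT
           cat,
           SUM(col1)         OVER (PARTITION BY cat)                  AS col1_sum,
           -- value from the first row of each group (lowest id)
           FIRST_VALUE(col2) OVER (PARTITION BY cat ORDER BY id)      AS col2_first,
           -- value from the last row of each group (highest id)
           FIRST_VALUE(col3) OVER (PARTITION BY cat ORDER BY id DESC) AS col3_last
    FROM test
    ORDER BY cat
""").fetchall()

for row in rows:
    print(row)
```

Every row of a group carries the same three window results, so DISTINCT collapses them to one row per cat, giving the same values as the GROUP_CONCAT() query.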

Postgres query with limit that selects all records with similar identifier

I have a table that looks something like this:
customer_id  data
1            123
1            456
2            789
2            101
2            121
2            123
3            123
4            456
What I would like to do is perform a SELECT combined with a LIMIT X to get X number of records as well as any other records that have the same customer_id
Example query: SELECT customer_id, data FROM table ORDER BY customer_id LIMIT 3;
This query returns:
customer_id  data
1            123
1            456
2            789
I'd like a query that will look at the last customer_id value and return all remaining records that match beyond the LIMIT specified. Is it possible to do this in a single operation?
Desired output:
customer_id  data
1            123
1            456
2            789
2            101
2            121
2            123
In Postgres 13 and later you can use WITH TIES:
select t.*
from t
order by customer_id
fetch first 3 rows with ties;
In earlier versions you can use in:
select t.*
from t
where t.customer_id in (select t2.customer_id
from t t2
order by t2.customer_id
limit 3
);
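To see what these queries return, the sketch below runs the IN version against the sample rows in an in-memory SQLite database, which is only a stand-in here because SQLite has no WITH TIES. As a cross-check it also emulates WITH TIES with RANK() <= 3, which is an assumption about equivalent behaviour rather than something taken from the answer. SQLite's implicit rowid is used to keep the rows in insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, data INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 123), (1, 456), (2, 789), (2, 101), (2, 121), (2, 123), (3, 123), (4, 456)],
)

# IN version: every row whose customer_id appears among the first 3 rows.
in_rows = conn.execute("""
    SELECT t.customer_id, t.data
    FROM t
    WHERE t.customer_id IN (SELECT t2.customer_id
                            FROM t t2
                            ORDER BY t2.customer_id
                            LIMIT 3)
    ORDER BY t.customer_id, t.rowid
""").fetchall()

# Emulation of "FETCH FIRST 3 ROWS WITH TIES": keep rows whose rank is <= 3.
rank_rows = conn.execute("""
    SELECT customer_id, data
    FROM (SELECT rowid AS rid, customer_id, data,
                 RANK() OVER (ORDER BY customer_id) AS rnk
          FROM t) ranked
    WHERE rnk <= 3
    ORDER BY customer_id, rid
""").fetchall()

print(in_rows)
print(rank_rows)
```

Both queries return the six rows for customers 1 and 2, matching the desired output in the question.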
You can use a correlated subquery with count as follows. A row is kept when fewer than 3 rows have a smaller customer_id:
Select t.*
From t
Where 3 > (select count(*)
From t tt
where tt.customer_id < t.customer_id)

Get time stamp of change in column value

I have a table that tracks a certain status using a bit column. I want to get the first timestamp of the latest status change. I have got the desired output using a temp table, but is there a better way to do this?
I get the max time stamp for status 1, then I get the min timestamp for status 0 and if the min timestamp for status 0 is greater than max timestamp for status 1 then I include it in the result set.
Sample data
123 0 2016-12-21 20:04:56.217
123 0 2016-12-21 19:00:28.980
123 0 2016-12-21 17:00:10.207 <-- Get this record because this is the latest status change from 1 to 0
123 1 2016-12-20 16:15:58.787
123 1 2016-12-20 16:11:36.523
123 1 2016-12-20 14:20:02.467
123 1 2016-12-20 13:57:57.623
123 0 2016-12-20 13:55:31.421 <-- This should not be included in the result even though it is a status change but since it is not the latest
123 1 2016-12-20 13:54:57.307
123 0 2016-12-19 12:23:46.103
123 0 2016-12-18 11:47:21.267
SQL
CREATE TABLE #temp_status_changed
(
id VARCHAR(22) NOT NULL,
enabled BIT NOT NULL,
dt_create DATETIME NOT null
)
INSERT INTO #temp_status_changed
SELECT id,enabled,MAX(dt_create) FROM mytable WHERE enabled=1
GROUP BY id,enabled
SELECT a.id,a.enabled,MIN(a.dt_create) FROM mytable a
JOIN #temp_status_changed b ON a.id=b.id
WHERE a.enabled=0
GROUP BY a.id,a.enabled
HAVING MIN(a.dt_create) > (SELECT dt_create FROM #temp_status_changed WHERE id=a.id)
DROP TABLE #temp_status_changed
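There are several ways to achieve that.
For example, using the LAG() function you can always get the previous value and compare it (this returns every change from 1 to 0, so keep only the latest row per id if that is all you need):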
SELECT * FROM
(
SELECT *, LAG(Enabled) OVER (PARTITION BY id ORDER BY dt_create) PrevEnabled
FROM YourTable
) x
WHERE Enabled = 0 AND PrevEnabled = 1
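A minimal sketch of the LAG() idea, narrowed to the latest change per id, is shown below. It uses an in-memory SQLite database (3.25 or later) instead of SQL Server, and the table name `status_log` is hypothetical. Note one assumed difference from the temp-table version in the question: this returns the last 1-to-0 change even when the current status is 1 again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "status_log" is a hypothetical name standing in for the real table.
conn.execute("CREATE TABLE status_log (id TEXT, enabled INTEGER, dt_create TEXT)")
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?, ?)",
    [
        ("123", 0, "2016-12-21 20:04:56.217"),
        ("123", 0, "2016-12-21 19:00:28.980"),
        ("123", 0, "2016-12-21 17:00:10.207"),
        ("123", 1, "2016-12-20 16:15:58.787"),
        ("123", 1, "2016-12-20 16:11:36.523"),
        ("123", 1, "2016-12-20 14:20:02.467"),
        ("123", 1, "2016-12-20 13:57:57.623"),
        ("123", 0, "2016-12-20 13:55:31.421"),
        ("123", 1, "2016-12-20 13:54:57.307"),
        ("123", 0, "2016-12-19 12:23:46.103"),
        ("123", 0, "2016-12-18 11:47:21.267"),
    ],
)

latest_change = conn.execute("""
    SELECT id, enabled, MAX(dt_create) AS dt_create
    FROM (
        SELECT id, enabled, dt_create,
               -- status of the previous row for the same id, by time
               LAG(enabled) OVER (PARTITION BY id ORDER BY dt_create) AS prev_enabled
        FROM status_log
    ) x
    WHERE enabled = 0 AND prev_enabled = 1   -- rows where the status flipped 1 -> 0
    GROUP BY id, enabled                     -- keep only the latest flip per id
""").fetchall()

print(latest_change)
```

On the sample data the earlier change at 13:55:31.421 is discarded and only the 2016-12-21 17:00:10.207 row remains.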
Another approach without window functions would be:
SELECT
sc.id,
sc.enabled,
dt_create = MIN(sc.dt_create)
FROM
YourTable AS sc
JOIN (
SELECT
id,
max_dt_create = MAX(dt_create)
FROM
YourTable
WHERE
enabled = 1
GROUP BY
id
) as MaxStatusChanges
ON sc.id = MaxStatusChanges.id AND
sc.dt_create > MaxStatusChanges.max_dt_create
GROUP BY
sc.id,
sc.enabled
The query returns no rows for an id if there are no rows with status 1 for that id, or if the most recent status for the id is 1. A nonclustered index on the enabled column that includes the id and dt_create columns could improve query performance.

SQL - How to get Top 5 record with sub-record

I have a query that returns the following table
ID SubId Rate Time
1 1 10.00 '00:00:10'
2 1 11.00 '00:00:15'
3 2 12.00 '00:00:20'
4 3 13.00 '00:00:25'
5 4 14.00 '00:00:30'
6 5 15.00 '00:00:35'
7 6 16.00 '00:00:40'
Now the problem is that I need all the records whose SubId is among the top 5 when ordered by Time.
ID SubId Rate Time
1 1 10.00 '00:00:10'
2 1 11.00 '00:00:15'
3 2 12.00 '00:00:20'
4 3 13.00 '00:00:25'
5 4 14.00 '00:00:30'
6 5 15.00 '00:00:35'
My Approach
Select ID,SubId,Rate from Query1 where SubId In (Select Top 5 SubId from Query1)
--Time was not included in it
Note : Please do not suggest an answer like above because it needs to use the query twice as the query is already taking too much time to return the above records.
If you don't want to use the same query twice, I suggest you insert the result into a temporary table. That way, you don't have to execute the complex query twice.
CREATE TABLE #TopFive(Id INT)
INSERT INTO #TopFive
SELECT TOP 5 SubId FROM Query1 ORDER BY [Time] DESC
Then in your subsequent queries, you can just use the temporary table:
SELECT * FROM <tbl> WHERE subId IN(SELECT Id FROM #TopFive)
You could also add a NONCLUSTERED INDEX on the temporary table for added performance gain:
CREATE NONCLUSTERED INDEX NCI_TopFive ON #TopFive(Id)
with x as
(select row_number() over(order by time) as rn, * from tablename)
select ID,SubId,Rate from x where rn <=5
This will assign row numbers based on ascending order of time in your table. You can also partition and order by your desired columns. Thereafter, you can select whatever row numbers from the cte you need.
My answer differs only slightly from Felix's: I would rather create a covering nonclustered index. That way, I/O is reduced when the temporary table is used down the line.
Store the results once in a temporary table and create a covering nonclustered index on SubId:
Select ID, SubId, Rate, [Time]
INTO #results
FROM Query1
CREATE NONCLUSTERED INDEX IX_SubID ON #results(SubId) INCLUDE(Id, Rate, [Time])
SELECT A.ID, A.SubId, A.Rate, A.[Time]
FROM
#results A
JOIN
(SELECT TOP 5 SubID from #results order by [Time] desc) B
on A.SubID = B.SubID
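The expected output in the question keeps SubIds 1 to 5, which suggests the five distinct SubIds with the earliest Time rather than the five latest rows. Under that assumption, the sketch below stores the expensive query's result once in a temporary table and then picks the SubIds from it. It uses SQLite in place of SQL Server, so LIMIT replaces TOP, and the table name `results` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Materialise the (expensive) query result once; the rows are the sample
# data from the question.
conn.execute("CREATE TEMP TABLE results (ID INTEGER, SubId INTEGER, Rate REAL, Time TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10.00, "00:00:10"),
        (2, 1, 11.00, "00:00:15"),
        (3, 2, 12.00, "00:00:20"),
        (4, 3, 13.00, "00:00:25"),
        (5, 4, 14.00, "00:00:30"),
        (6, 5, 15.00, "00:00:35"),
        (7, 6, 16.00, "00:00:40"),
    ],
)

rows = conn.execute("""
    SELECT r.ID, r.SubId, r.Rate, r.Time
    FROM results r
    JOIN (SELECT SubId, MIN(Time) AS first_time   -- earliest Time per SubId
          FROM results
          GROUP BY SubId
          ORDER BY first_time
          LIMIT 5) top5                            -- five earliest distinct SubIds
      ON top5.SubId = r.SubId
    ORDER BY r.ID
""").fetchall()

ids = [row[0] for row in rows]
print(ids)
```

This returns IDs 1 through 6, the same rows as the expected output. Grouping by SubId first matters: taking the first five rows instead would only cover SubIds 1 to 4, because SubId 1 appears twice.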

How to use a select command to find all the records that has the maximum date value for a specific item?

Say I have a table like this, we call it tbl_test
ID thedate actionid songid
1 2014-10-01 100 10
2 2014-09-30 100 10
3 2014-10-01 80 10
4 2014-09-30 80 10
5 2014-10-01 80 21
6 2014-09-30 100 21
Now I want to find all the records in tbl_test where actionid=100 with the latest [thedate] value for each songid. In this case, I want the final select result to be
(this is the result I want, not an existing table)
ID thedate actionid songid
1 2014-10-01 100 10
6 2014-09-30 100 21
Question: how can I do that using nothing but a single SELECT statement in MS SQL Server?
Use a join to a query that returns the latest date for each song:
select tbl_test.*
from tbl_test
join (select songid, max(theDate) maxDate
from tbl_test
where actionId = 100
group by songid) t on t.songId = tbl_test.songId and theDate = maxDate
where actionid = 100
This should perform pretty well, as it makes only 2 passes over the table: one for the inner query that determines the latest date per song, and another to output the matching rows.
A general SQL way to get this is using not exists:
select t.*
from tbl_test t
where actionid = 100 and
not exists (select 1
from tbl_test t2
where t2.songid = t.songid and t2.actionid = 100 and t2.thedate > t.thedate
);
For performance, you want an index on songid, actionid, thedate.
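Both answers can be checked against the sample data. The sketch below runs them in an in-memory SQLite database, used here only as a stand-in for SQL Server; the queries themselves are plain SQL and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_test (ID INTEGER, thedate TEXT, actionid INTEGER, songid INTEGER)")
conn.executemany(
    "INSERT INTO tbl_test VALUES (?, ?, ?, ?)",
    [
        (1, "2014-10-01", 100, 10),
        (2, "2014-09-30", 100, 10),
        (3, "2014-10-01", 80, 10),
        (4, "2014-09-30", 80, 10),
        (5, "2014-10-01", 80, 21),
        (6, "2014-09-30", 100, 21),
    ],
)
# Index suggested in the second answer (the name is hypothetical).
conn.execute("CREATE INDEX ix_song_action_date ON tbl_test (songid, actionid, thedate)")

# Join against the latest date per song.
join_ids = [row[0] for row in conn.execute("""
    SELECT tbl_test.ID
    FROM tbl_test
    JOIN (SELECT songid, MAX(thedate) AS maxDate
          FROM tbl_test
          WHERE actionid = 100
          GROUP BY songid) t
      ON t.songid = tbl_test.songid AND tbl_test.thedate = t.maxDate
    WHERE tbl_test.actionid = 100
    ORDER BY tbl_test.ID
""")]

# NOT EXISTS: keep a row when no later row exists for the same song and action.
not_exists_ids = [row[0] for row in conn.execute("""
    SELECT t.ID
    FROM tbl_test t
    WHERE t.actionid = 100
      AND NOT EXISTS (SELECT 1
                      FROM tbl_test t2
                      WHERE t2.songid = t.songid
                        AND t2.actionid = 100
                        AND t2.thedate > t.thedate)
    ORDER BY t.ID
""")]

print(join_ids, not_exists_ids)
```

Both queries return IDs 1 and 6, matching the result requested in the question.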