Getting aggregate data in MySQL - google-bigquery

I am attempting to write a sql query to fetch aggregate data from a table. I have a table with data that looks as follows (example data):
trackingId  numberOfRecords  totalRecords  dateSubmitted  fileName     checkpoint   status
----------  ---------------  ------------  -------------  -----------  -----------  -----------
1           10               100           01/01/2021     example.doc  gateway      in-progress
1           20               100           02/01/2021     null         checkpoint1  in-progress
1           20               100           03/01/2021     null         checkpoint2  in-progress
The aggregate data I would like to query would look like:
trackingId  numberOfRecords  totalRecords  dateSubmitted  fileName     checkpoint   status
----------  ---------------  ------------  -------------  -----------  -----------  -----------
1           50               100           03/01/2021     example.doc  checkpoint2  in-progress
In summary, I would like to:
- group on trackingId (done)
- sum all records fetched (done)
- get the latest date (done)
- get the name of the original document (not sure how to fetch a value from the first row only; I am trying to avoid subqueries because of their inefficiency)
- get the latest checkpoint (value from the newest record)
- get the latest status (value from the newest record)
My main issue is fetching specific data from either the newest or the oldest record.
Thanks.

Consider the BigQuery query below:
select trackingId,
  sum(numberOfRecords) as numberOfRecords,
  any_value(totalRecords) as totalRecords,
  max(dateSubmitted) as dateSubmitted,
  array_agg(fileName order by dateSubmitted limit 1)[offset(0)] as fileName,
  array_agg(checkpoint order by dateSubmitted desc limit 1)[offset(0)] as checkpoint,
  array_agg(status order by dateSubmitted desc limit 1)[offset(0)] as status
from `project.dataset.table`
group by trackingId
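One caveat: BigQuery's ARRAY_AGG raises an error if a NULL ends up in the result array. The LIMIT 1 above happens to keep only the earliest, non-null fileName, but a more defensive sketch skips nulls explicitly:
select trackingId,
  sum(numberOfRecords) as numberOfRecords,
  any_value(totalRecords) as totalRecords,
  max(dateSubmitted) as dateSubmitted,
  array_agg(fileName ignore nulls order by dateSubmitted limit 1)[offset(0)] as fileName,
  array_agg(checkpoint ignore nulls order by dateSubmitted desc limit 1)[offset(0)] as checkpoint,
  array_agg(status ignore nulls order by dateSubmitted desc limit 1)[offset(0)] as status
from `project.dataset.table`
group by trackingId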
Either way, applied to the sample data in your question, the output is:

trackingId  numberOfRecords  totalRecords  dateSubmitted  fileName     checkpoint   status
----------  ---------------  ------------  -------------  -----------  -----------  -----------
1           50               100           03/01/2021     example.doc  checkpoint2  in-progress

In MySQL, try this:
CREATE TABLE test (
  id   INT,  -- will be used for ordering
  cat  INT,  -- will be used for aggregation
  col1 INT,  -- will be used to get the SUM
  col2 INT,  -- will be used to get the value from the first row
  col3 INT   -- will be used to get the value from the last row
);
INSERT INTO test VALUES
  (1,1,11,111,1111), (2,1,22,222,2222), (3,1,33,333,3333),
  (4,2,4,4,4), (5,2,5,5,5);
SELECT * FROM test;
id  cat  col1  col2  col3
--  ---  ----  ----  ----
1   1    11    111   1111
2   1    22    222   2222
3   1    33    333   3333
4   2    4     4     4
5   2    5     5     5
SELECT cat,
       SUM(col1) AS col1_sum,
       SUBSTRING_INDEX(GROUP_CONCAT(col2 ORDER BY id), ',', 1) AS col2_first,
       SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id), ',', -1) AS col3_last
FROM test
GROUP BY cat;
cat  col1_sum  col2_first  col3_last
---  --------  ----------  ---------
1    66        111         3333
2    9         4           5
The values processed by GROUP_CONCAT() must not contain a comma.
PS. Do not forget about group_concat_max_len, especially when a single value in the column may be long.
PPS. The expression for last value may be SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id DESC), ',', 1) col3_last.
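For example, a minimal sketch of raising that limit for the current session (the default is only 1024 bytes):
SET SESSION group_concat_max_len = 1000000;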

Related

Time consuming query to Skip First inserted record of Id list

In PostgreSQL I have a table with data containing multiple articleIds. Whenever I query it, the query should skip the first inserted record of a particular userId within the specified list of articleIds.
select *
from (
    select *, row_number() over (partition by articleId order by date) rn
    from table
    where articleId in (1200) and userId = 1
) t
where t.rn > 1
This returns the expected records, skipping the first inserted record for each articleId of the given userId. But the query above takes a long time to execute when there is a lot of data.
table:

id  name  articleId  date                 userId
--  ----  ---------  -------------------  ------
1   abc   1200       2021-05-01 06:09:35  1
2   bcd   1400       2021-05-02 06:08:35  1
3   xyz   1200       2021-05-03 09:09:35  2
4   pqr   1200       2021-05-04 08:09:35  1
5   xyz   1200       2021-05-05 09:09:35  3
Expected query output:

id  name  articleId  date                 userId
--  ----  ---------  -------------------  ------
4   pqr   1200       2021-05-04 08:09:35  1
Try adding the following index, which should cover the call to ROW_NUMBER as well as the WHERE clause:
CREATE INDEX idx ON yourTable (articleId, date, userId);
This should speed up your current query. As always, check the execution plan before and after using EXPLAIN.
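For instance, a minimal before-and-after check in PostgreSQL (yourTable stands in for the real table name):
EXPLAIN ANALYZE
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY articleId ORDER BY date) rn
    FROM yourTable
    WHERE articleId IN (1200) AND userId = 1
) t
WHERE t.rn > 1;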
I would suggest using a correlated subquery with the right indexing:
select *
from t
where t.articleId = 1200
  and t.userId = 1
  and t.date > (select min(t2.date)
                from t t2
                where t2.articleId = t.articleId
                  and t2.userId = t.userId
               );
Then for this query, you want an index on (articleId, userId, date).
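A sketch of the corresponding DDL, with a hypothetical index name:
CREATE INDEX idx_article_user_date ON t (articleId, userId, date);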
Note: I'm a bit surprised that userId is not in the partition by clause.
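If per-user partitioning is what's wanted, a sketch of the windowed version with userId added to the partition (yourTable stands in for the real table name):
select *
from (
    select *, row_number() over (partition by articleId, userId order by date) rn
    from yourTable
    where articleId in (1200)
) t
where t.rn > 1;
This skips the first inserted record per (articleId, userId) pair for every user in the article list, not just userId = 1.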

SQL - Need to find duplicates where one column can have multiple values

I am pretty sure this SQL requires using GROUP BY and HAVING, but I am not sure how to write it.
I have a table similar to this:
ID  Cust#  Order#  ItemCode  DataPoint1  DataPoint2
--  -----  ------  --------  ----------  ----------
1   001    123     I         xxxyyyxxx   123456
2   001    123     Insert    xxxyyyxxx   123456
3   001    123     Delete    asdf        9999
4   001    123     D         asdf        9999
In this table Rows 1 & 2 are effectively duplicates, as are rows 3 & 4.
This is determined by the ItemCode having the value of 'I' or 'Insert' in rows 1 & 2. And 'D' or 'Delete' in rows 3 & 4.
How could I write a SQL select statement to return rows 2 and 4? I am interested in pulling out the duplicated rows with the higher ID value.
Thanks for any help.
Replace the "offending" column with a consistent value. Then, you can use row_number() or a similar mechanism:
select t.*
from (select t.*,
             row_number() over (partition by Cust#, Order#, left(ItemCode, 1), DataPoint1, DataPoint2
                                order by id asc
                               ) as seqnum
      from t
     ) t
where seqnum > 1;
Note: Not all databases support left(), but all support the functionality somehow. This does assume that the first character of the ItemCode is sufficient to identify identical rows, regardless of the value.
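If the first character is not reliable, a hedged alternative is to normalize ItemCode with an explicit CASE expression (the mappings below assume 'Insert'/'I' and 'Delete'/'D' are the only variants, as in the sample data):
select t.*
from (select t.*,
             row_number() over (partition by Cust#, Order#,
                                             case ItemCode when 'Insert' then 'I'
                                                           when 'Delete' then 'D'
                                                           else ItemCode
                                             end,
                                             DataPoint1, DataPoint2
                                order by id asc
                               ) as seqnum
      from t
     ) t
where seqnum > 1;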

How to "fetch first 1 row only" per IDs in 3 different columns?

I am attempting to fetch only one row per unique set of IDs in three other columns. In my database there are many records per date for each unique combination of the three ID columns (imagine product numbers, for instance). The stock status is issued only per date change, not as one final quantity, so the only way to get the actual stock status is to read the stock status update with the latest date, which is always the top row for a given combination of product IDs.
This is what my table looks like at the moment:

Code1         Code2        Code3           Date        Stock status
arti-cd-base  arti-cd-sfx  adfc-cd-diffco  stof-dt     stof-qty-op
------------  -----------  --------------  ----------  -----------
1             15           0               2019-08-31  200
1             15           0               2019-08-25  290
2             16           2               2019-08-28  100
2             16           2               2019-08-26  80
2             16           2               2019-08-21  200
3             18           25              2019-08-18  75
And this is how I wish it would look: only the rows with the latest date (stof-dt) and stock status (stof-qty-op) per combination of arti-cd-base, arti-cd-sfx and adfc-cd-diffco.
Code1         Code2        Code3           Date        Stock status
arti-cd-base  arti-cd-sfx  adfc-cd-diffco  stof-dt     stof-qty-op
------------  -----------  --------------  ----------  -----------
1             15           0               2019-08-31  200
2             16           2               2019-08-28  100
3             18           25              2019-08-18  75
Is there any way to achieve this via SQL? I found the clause "offset 0 rows fetch first 1 row only", but it simply returns a single row overall and does not respect one row per set of product IDs in the other three columns (arti-cd-base, arti-cd-sfx and adfc-cd-diffco). Does anyone see a way through?
Check out this option:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY arti_cd_base, arti_cd_sfx, adfc_cd_diffco
                              ORDER BY stof_dt DESC) AS rwn
    FROM yourTable
) x
WHERE x.rwn = 1
Or in your case:
SELECT *
FROM (
SELECT stof_0."arti-cd-base"
, stof_0."arti-cd-sfx"
, stof_0."adfc-cd-diffco"
, stof_0."dcmf-nr"
, stof_0."stof-dt"
, stof_0."stof-qty-op"
, stof_0."stof-qty-nop"
, stof_0."stof-qty-brok"
, stof_0."stof-qty-brok-ok"
, stof_0."stof-qty-ext"
, stof_0."stof-qty-dock"
, stof_0."stof-qty-trans"
, stof_0."stof-qty-diff"
, stof_0."stof-qty-fin-qty"
, stof_0."stof-qty-fin-temp"
, stof_0."stof-am"
, ROW_NUMBER() OVER (PARTITION BY stof_0."arti-cd-base" ORDER BY stof_0."stof-dt" DESC ) AS rwn
FROM NILOS.PUB.stof stof_0
WHERE (stof_0."arti-cd-base"=1)
AND (stof_0."arti-cd-sfx"=15)
AND (stof_0."adfc-cd-diffco"=0)
) x
WHERE x.rwn = 1
Or:
SELECT x.*
FROM (
SELECT stof_0."arti-cd-base"
, stof_0."arti-cd-sfx"
, stof_0."adfc-cd-diffco"
, stof_0."dcmf-nr"
, stof_0."stof-dt"
, stof_0."stof-qty-op"
, stof_0."stof-qty-nop"
, stof_0."stof-qty-brok"
, stof_0."stof-qty-brok-ok"
, stof_0."stof-qty-ext"
, stof_0."stof-qty-dock"
, stof_0."stof-qty-trans"
, stof_0."stof-qty-diff"
, stof_0."stof-qty-fin-qty"
, stof_0."stof-qty-fin-temp"
, stof_0."stof-am"
FROM NILOS.PUB.stof stof_0
WHERE (stof_0."arti-cd-base"=1)
AND (stof_0."arti-cd-sfx"=15)
AND (stof_0."adfc-cd-diffco"=0)
) x
INNER JOIN ( SELECT stof_1."arti-cd-base"
, MAX(stof_1."stof-dt") AS max_stof_dt
FROM NILOS.PUB.stof stof_1
GROUP BY stof_1."arti-cd-base" ) y ON x."arti-cd-base" = y."arti-cd-base"
AND x."stof-dt" = y."max_stof_dt"
ROW_NUMBER() returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.
So in your case the partition is arti_cd_base, and we order by the date descending, so the newest date in each partition always gets row number 1; that's why the condition requires the result of this function to equal 1.
It looks like you want to keep the first value of the last column in an aggregation. Oracle offers this functionality, using keep:
select "arti-cd-base", "arti-cd-sfx", "adfc-cd-diffco",
max("stof-dt") as "stof-dt",
max("stof-qty-op") keep (dense_rank first order by "stof-dt" desc) as "stof-qty-op"
from t
group by "arti-cd-base", "arti-cd-sfx", "adfc-cd-diffco";

Get time stamp of change in column value

I have a table that tracks a certain status using a bit column. I want to get the first timestamp of the status change. I got the desired output using a temp table, but is there a better way to do this?
I get the max timestamp for status 1, then the min timestamp for status 0, and if the min timestamp for status 0 is greater than the max timestamp for status 1, I include it in the result set.
Sample data
id   enabled  dt_create
---  -------  -----------------------
123  0        2016-12-21 20:04:56.217
123  0        2016-12-21 19:00:28.980
123  0        2016-12-21 17:00:10.207  <-- get this record: it is the latest status change from 1 to 0
123  1        2016-12-20 16:15:58.787
123  1        2016-12-20 16:11:36.523
123  1        2016-12-20 14:20:02.467
123  1        2016-12-20 13:57:57.623
123  0        2016-12-20 13:55:31.421  <-- a status change, but not the latest, so it should not be included
123  1        2016-12-20 13:54:57.307
123  0        2016-12-19 12:23:46.103
123  0        2016-12-18 11:47:21.267
SQL
CREATE TABLE #temp_status_changed
(
    id        VARCHAR(22) NOT NULL,
    enabled   BIT NOT NULL,
    dt_create DATETIME NOT NULL
)

INSERT INTO #temp_status_changed
SELECT id, enabled, MAX(dt_create)
FROM mytable
WHERE enabled = 1
GROUP BY id, enabled

SELECT a.id, a.enabled, MIN(a.dt_create)
FROM mytable a
JOIN #temp_status_changed b ON a.id = b.id
WHERE a.enabled = 0
GROUP BY a.id, a.enabled
HAVING MIN(a.dt_create) > (SELECT dt_create FROM #temp_status_changed WHERE id = a.id)

DROP TABLE #temp_status_changed
There are several ways to achieve that.
For example, using the LAG() function you can always get the previous value and compare it:
SELECT *
FROM (
    SELECT *, LAG(enabled) OVER (PARTITION BY id ORDER BY dt_create) AS PrevEnabled
    FROM YourTable
) x
WHERE enabled = 0 AND PrevEnabled = 1
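Note that this returns every change from 1 to 0. If, as in the sample, only the latest such change per id is wanted, a sketch (assuming the same YourTable) that ranks the transitions and keeps the newest one:
SELECT id, enabled, dt_create
FROM (
    SELECT t.*,
           -- rank the 1->0 transitions, newest first
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt_create DESC) AS rn
    FROM (
        SELECT *,
               LAG(enabled) OVER (PARTITION BY id ORDER BY dt_create) AS PrevEnabled
        FROM YourTable
    ) t
    WHERE enabled = 0 AND PrevEnabled = 1
) x
WHERE x.rn = 1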
Another approach without window functions would be:
SELECT sc.id,
       sc.enabled,
       dt_create = MIN(sc.dt_create)
FROM YourTable AS sc
JOIN (
    SELECT id,
           max_dt_create = MAX(dt_create)
    FROM YourTable
    WHERE enabled = 1
    GROUP BY id
) AS MaxStatusChanges
    ON sc.id = MaxStatusChanges.id
   AND sc.dt_create > MaxStatusChanges.max_dt_create
GROUP BY sc.id,
         sc.enabled
The query returns no rows for an id if there are no rows with status 1 for that id, as well as if the most recent status for the id is 1. A nonclustered index on the enabled column with included id and dt_create columns could improve query performance.
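A sketch of that index in T-SQL; the index name is hypothetical:
CREATE NONCLUSTERED INDEX IX_YourTable_enabled
    ON YourTable (enabled)
    INCLUDE (id, dt_create);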

SQL Server GROUP BY COUNT Consecutive Rows Only

I have a table called DATA on Microsoft SQL Server 2008 R2 with three non-nullable integer fields: ID, Sequence, and Value. Sequence values with the same ID will be consecutive, but can start with any value. I need a query that will return a count of consecutive rows with the same ID and Value.
For example, let's say I have the following data:
ID Sequence Value
-- -------- -----
1 1 1
5 1 100
5 2 200
5 3 200
5 4 100
10 10 10
I want the following result:
ID Start Value Count
-- ----- ----- -----
1 1 1 1
5 1 100 1
5 2 200 2
5 4 100 1
10 10 10 1
I tried
SELECT ID, MIN([Sequence]) AS Start, Value, COUNT(*) AS [Count]
FROM DATA
GROUP BY ID, Value
ORDER BY ID, Start
but that gives
ID Start Value Count
-- ----- ----- -----
1 1 1 1
5 1 100 2
5 2 200 2
10 10 10 1
which groups all rows with the same values, not just consecutive rows.
Any ideas? From what I've seen, I believe I have to left join the table with itself on consecutive rows using ROW_NUMBER(), but I am not sure exactly how to get counts from that.
Thanks in advance.
You can use Sequence - ROW_NUMBER() OVER (ORDER BY ID, Value, Sequence) AS g to create a group:
SELECT ID,
       MIN(Sequence) AS Start,
       Value,
       COUNT(*) AS [Count]
FROM (
    SELECT ID,
           Sequence,
           Sequence - ROW_NUMBER() OVER (ORDER BY ID, Value, Sequence) AS g,
           Value
    FROM DATA
) AS s
GROUP BY ID, Value, g
ORDER BY ID, Start
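As a sanity check, here is a self-contained sketch that recreates the sample table from the question and runs the query above (table and column names as in the question):
CREATE TABLE DATA (ID INT NOT NULL, Sequence INT NOT NULL, Value INT NOT NULL);

INSERT INTO DATA (ID, Sequence, Value) VALUES
    (1, 1, 1),
    (5, 1, 100),
    (5, 2, 200),
    (5, 3, 200),
    (5, 4, 100),
    (10, 10, 10);

-- Gaps-and-islands: consecutive rows with the same ID and Value share the
-- same (ID, Value, g) key, so each run collapses into one group.
SELECT ID,
       MIN(Sequence) AS Start,
       Value,
       COUNT(*) AS [Count]
FROM (
    SELECT ID,
           Sequence,
           Sequence - ROW_NUMBER() OVER (ORDER BY ID, Value, Sequence) AS g,
           Value
    FROM DATA
) AS s
GROUP BY ID, Value, g
ORDER BY ID, Start;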