How to group by similar integer values in Postgres sql? - sql

I have a very simple Database Table where a new entry is inserted everytime a product is scanned (RFID-Scanner).
Scans Table:
ID (PK)   Product_ID (FK)   Created_At
1         1                 2023-01-26 10:39:00.0000
2         2                 2023-01-26 10:39:02.0000
3         3                 2023-01-26 10:39:04.0000
4         4                 2023-01-26 10:47:00.0000
My goal is to cluster the product ids by the time they were scanned with a specified tolerance (in seconds), so for example for the entries in my table and a tolerance of 10 seconds, the desired result would be
Product_IDs
{1, 2, 3}
{4}
My first attempt to solve the issue was something like this:
SELECT ARRAY_AGG(DISTINCT Product_ID) FROM scans GROUP BY ROUND(EXTRACT(EPOCH FROM created_at) / 10);
This approach works somewhat, but it fails in edge cases: if one product is scanned at second 19 and another at second 21, they fall into different buckets and are not grouped together, although they should be.
What is a better, more reliable way to solve this problem?

I will assume that groups are separated when the time between two consecutive rows is more than 10 seconds. For example, with this test data:
create table scans(ID int, Product_ID int, Created_At TimeStamp);
insert into scans values
(1, 1,cast('2023-01-26 10:39:00.000' as TimeStamp))
,(2, 2,cast('2023-01-26 10:39:02.000' as TimeStamp))
,(3, 3,cast('2023-01-26 10:39:11.000' as TimeStamp))
,(4, 4,cast('2023-01-26 10:47:00.000' as TimeStamp))
;
Calculate the time difference between the current row and the preceding one. When the difference is greater than 10 seconds, a new group of scans starts.
with ScansDif as (
    select *
          ,Created_At - lag(Created_At, 1, Created_At) over (order by Created_At) as dif
    from scans
)
,ScansGroup as (
    select *
          ,sum(case when dif > cast('10' || ' second' as interval) then 1 else 0 end)
               over (order by Created_At rows unbounded preceding) as grN
    from ScansDif
)
SELECT ARRAY_AGG(DISTINCT Product_ID)
FROM ScansGroup
GROUP BY grN;
Group numbers:
id   product_id   created_at            dif        grn
1    1            2023-01-26 10:39:00   00:00:00   0
2    2            2023-01-26 10:39:02   00:00:02   0
3    3            2023-01-26 10:39:11   00:00:09   0
4    4            2023-01-26 10:47:00   00:07:56   1
Note that the time difference between the first and the last row of group 0 is 11 seconds (00:00:11); the rows still belong to one group because no single gap between consecutive scans exceeds the 10-second tolerance.
Result
array_agg
{1,2,3}
{4}
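If it is also useful to see when each cluster starts and ends, the final aggregation can return the time span of every group as well. A sketch of that variant (the same query, with the threshold written as the equivalent interval literal '10 seconds'):
with ScansDif as (
    select *
          ,Created_At - lag(Created_At, 1, Created_At) over (order by Created_At) as dif
    from scans
)
,ScansGroup as (
    select *
          ,sum(case when dif > interval '10 seconds' then 1 else 0 end)
               over (order by Created_At rows unbounded preceding) as grN
    from ScansDif
)
select array_agg(distinct Product_ID) as product_ids
      ,min(Created_At) as cluster_start
      ,max(Created_At) as cluster_end
from ScansGroup
group by grN
order by cluster_start;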

Related

Unpivoting for large dataset and greater number of unique columns

The PIVOT and UNPIVOT functions in Snowflake are not efficient for turning 30+ unique columns into rows.
Use case: I have 35 different month columns which need to be turned into rows, and another 35 columns holding the quantity for the corresponding month.
So in the end there will be 2 columns (one for the month and another for the quantity) in place of the 70 unique columns, and the quantity will be aggregated by month.
But unpivoting is not at all efficient. The query below scans 15 GB of data from the main table:
select part_num, concat(date_part(year, dates),'-',date_part(month, dates)) as month_year,
       sum(quantity) as quantities
from table_name
    unpivot(dates for cols in (month_1, /* 30 other unique cols */))
    unpivot(quantity for cols in (qunatity_1, /* 30 other unique cols */))
group by part_num, month_year
Is there any other approach to unpivot a large dataset?
Thanks.
An alternative approach could be conditional aggregation:
with cte as (
    select part_num
          ,concat(date_part(year, dates),'-',date_part(month, dates)) as month_year
          ,sum(quantity) as quantities
    from table_name
    group by part_num, month_year
)
SELECT part_num
-- lowest date
      ,'2020-01' AS "2020-01"
      ,MAX(IFF(month_year='2020-01', quantities, NULL)) AS "quantities_2020-01"
-- next date
      ,...
-- last date
      ,'2022-04' AS "2022-04"
      ,MAX(IFF(month_year='2022-04', quantities, NULL)) AS "quantities_2022-04"
FROM cte
GROUP BY part_num;
A version using a single GROUP BY and TO_VARCHAR with a format string:
SELECT part_num
-- lowest date
      ,MAX(IFF(TO_VARCHAR(dates,'YYYY-MM')='2020-01','2020-01',NULL)) AS "2020-01"
      ,MAX(IFF(TO_VARCHAR(dates,'YYYY-MM')='2020-01',quantities,NULL)) AS "quantities_2020-01"
-- next date
      ,...
-- last date
      ,MAX(IFF(TO_VARCHAR(dates,'YYYY-MM')='2022-04','2022-04',NULL)) AS "2022-04"
      ,MAX(IFF(TO_VARCHAR(dates,'YYYY-MM')='2022-04',quantities,NULL)) AS "quantities_2022-04"
FROM table_name
GROUP BY part_num;
So let's get some example data and test whether what is happening is what is wanted.
Here is a trivial and tiny CTE's worth of data:
with table_name(part_num, month_1, month_2, month_3, qunatity_1, qunatity_2, qunatity_3) as (
select * from values
(1, '2022-01-01'::date, '2022-02-01'::date, '2022-03-01'::date, 4, 5, 6)
)
Now, pointing your SQL at it (after making it compile):
select
part_num
,to_char(dates, 'yyyy-mm') as month_year
,sum(quantity) as quantities
from table_name
unpivot(dates for month in (month_1, month_2, month_3))
unpivot(quantity for quan in (qunatity_1, qunatity_2, qunatity_3))
group by part_num, month_year
gives:
PART_NUM   MONTH_YEAR   QUANTITIES
1          2022-01      15
1          2022-02      15
1          2022-03      15
which is not what I think you are after.
If we look at the unaggregated rows:
PART_NUM   MONTH     DATES        QUAN         QUANTITY
1          MONTH_1   2022-01-01   QUNATITY_1   4
1          MONTH_1   2022-01-01   QUNATITY_2   5
1          MONTH_1   2022-01-01   QUNATITY_3   6
1          MONTH_2   2022-02-01   QUNATITY_1   4
1          MONTH_2   2022-02-01   QUNATITY_2   5
1          MONTH_2   2022-02-01   QUNATITY_3   6
1          MONTH_3   2022-03-01   QUNATITY_1   4
1          MONTH_3   2022-03-01   QUNATITY_2   5
1          MONTH_3   2022-03-01   QUNATITY_3   6
We are getting a cross join, which is not what I believe you want.
My understanding is that you want a one-to-one relationship between month (1-35) and quantity (1-35),
thus a mix like:
PART_NUM   MONTH     DATES        QUAN         QUANTITY
1          MONTH_1   2022-01-01   QUNATITY_1   4
1          MONTH_2   2022-02-01   QUNATITY_2   5
1          MONTH_3   2022-03-01   QUNATITY_3   6
Guessed Answer:
My guess at what you really want is:
select
part_num
,to_char(dates, 'yyyy-mm') as month_year
,array_construct(qunatity_1, qunatity_2, qunatity_3)[split_part(month,'_',2)::number - 1] as qunatity
from table_name
unpivot(dates for month in (month_1, month_2, month_3))
order by 1,2;
which gives (for the same CTE data above):
PART_NUM   MONTH_YEAR   QUNATITY
1          2022-01      4
1          2022-02      5
1          2022-03      6
Another way to get that guessed answer:
select
part_num
,to_char(dates, 'yyyy-mm') as month_year
,sum(iff(split_part(month,'_',2)=split_part(q_name,'_',2), q_val, null)) as qunatity
from table_name
unpivot(dates for month in (month_1, month_2, month_3))
unpivot(q_val for q_name in (qunatity_1, qunatity_2, qunatity_3))
group by 1,2
order by 1,2;
which uses the double unpivot, so it might be slow, but it only aggregates the values when the month and quantity suffixes match. That feels almost as gross as building an array just to rip it apart, but this version doesn't need to do large joins, just some per-row grossness.
Assuming your data is already aggregated at the part_num level, you could divide and conquer like this:
with year_month as
(select a.part_num, b.index+1 as month_num, left(b.value,7) as year_month
from my_table a,table(flatten(input=>array_construct(m1,m2,m3...))) b),
quantities as
(select a.part_num, b.index+1 as month_num, b.value::int as quantity
from my_table a,table(flatten(input=>array_construct(q1,q2,q3...))) b)
select a.part_num, a.year_month, b.quantity
from year_month a
join quantities b on a.part_num=b.part_num and a.month_num=b.month_num
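To make that concrete, here is a sketch of the same flatten idea pointed at the tiny CTE from earlier (column names as above; treat it as illustrative rather than tested):
with table_name(part_num, month_1, month_2, month_3, qunatity_1, qunatity_2, qunatity_3) as (
    select * from values
    (1, '2022-01-01'::date, '2022-02-01'::date, '2022-03-01'::date, 4, 5, 6)
),
year_month as (
    -- one row per (part_num, month position), keeping yyyy-mm of the date
    select a.part_num, b.index + 1 as month_num, left(b.value, 7) as year_month
    from table_name a,
         table(flatten(input => array_construct(month_1, month_2, month_3))) b
),
quantities as (
    -- one row per (part_num, quantity position)
    select a.part_num, b.index + 1 as month_num, b.value::int as quantity
    from table_name a,
         table(flatten(input => array_construct(qunatity_1, qunatity_2, qunatity_3))) b
)
select a.part_num, a.year_month, b.quantity
from year_month a
join quantities b
  on a.part_num = b.part_num
 and a.month_num = b.month_num
order by 1, 2;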

Calculating time difference between 2 rows

I have a table which holds information on races that have taken place: the participants who took part, where they finished in the race, and what time they finished in. I would like to add a time difference column which shows how far behind the winner each participant was.
Race ID   Finish place   Time       Name
1         1              00:00:10   Matt
1         2              00:00:11   Mick
1         3              00:00:17   Shaun
2         1              00:00:13   Claire
2         2              00:00:15   Helen
What I would like to see:
Race ID   Finish place   Time       Time Dif   Name
1         1              00:00:10              Matt
1         2              00:00:11   00:00:01   Mick
1         3              00:00:17   00:00:07   Shaun
2         1              00:00:13              Claire
2         2              00:00:15   00:00:02   Helen
I have seen similar questions asked, but I was unable to relate them to my problem.
My initial idea was to have a number of derived tables, each filtered by finish place, but there could be more than 10 racers, so things would start to get messy. I'm using Management Studio 2012.
You can use min() as a window function:
select t.*,
(case when time <> min_time then time - min_time
end) as diff
from (select t.*, min(t.time) over (partition by t.race_id) as min_time
from t
) t
I would be more inclined to express this as seconds:
(case when time <> min_time then datediff(second, min_time, time)
end) as diff
Using http://www.convertcsv.com/csv-to-sql.htm to build example data:
DROP TABLE IF EXISTS mytable;
CREATE TABLE mytable(
Race_ID INTEGER
,Finish_place INTEGER
,Time VARCHAR(30)
,Name VARCHAR(30)
);
INSERT INTO mytable(Race_ID,Finish_place,Time,Name) VALUES (1, 1,'00:00:10','Matt');
INSERT INTO mytable(Race_ID,Finish_place,Time,Name) VALUES (1, 2,'00:00:11','Mick');
INSERT INTO mytable(Race_ID,Finish_place,Time,Name) VALUES (1, 3,'00:00:17','Shaun');
INSERT INTO mytable(Race_ID,Finish_place,Time,Name) VALUES (2, 1,'00:00:13','Claire');
INSERT INTO mytable(Race_ID,Finish_place,Time,Name) VALUES (2, 2,'00:00:15','Helen');
A CTE with only the first-place finishers would be easier to understand.
WITH CTE_FIRST
AS (
SELECT
M.Race_ID
,M.Finish_place
,M.Time
,M.Name
FROM mytable M
WHERE M.Finish_place = 1
)
SELECT
M.Race_ID
,M.Finish_place
,M.Time
,CASE
WHEN m.Finish_place = 1
THEN NULL
ELSE CONVERT(VARCHAR, DATEADD(ss, DATEDIFF(SECOND, c.Time, M.Time), 0), 108)
END AS [Time Dif]
,M.Name
FROM mytable M
INNER JOIN CTE_FIRST c
ON M.Race_ID = c.Race_ID
You can use window functions. MIN([time]) OVER (PARTITION BY race_id ORDER BY finish_place) gives the first row's time value within the same race. DATEDIFF(SECOND, (MIN([time]) OVER (PARTITION BY race_id ORDER BY finish_place)), time) gives the difference.
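A minimal, self-contained sketch of that suggestion against the mytable example above (my own variant: it casts the varchar Time column to the TIME type and uses NULLIF so the winner shows NULL instead of 0):
SELECT Race_ID,
       Finish_place,
       [Time],
       -- seconds behind the fastest time in the same race; NULL for the winner
       NULLIF(DATEDIFF(SECOND,
                       MIN(CAST([Time] AS time)) OVER (PARTITION BY Race_ID),
                       CAST([Time] AS time)),
              0) AS time_dif_seconds,
       Name
FROM mytable
ORDER BY Race_ID, Finish_place;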

postgresql - how to calculate the percentage of number of entries in one table w.r.t to no of entries in another table

I have two tables (A & B) in my database (C). Both tables contain two columns, as shown below:
column name    Datatype
eventtime      timestamp without time zone
serialnumber   numeric
Table A contains the total number of products (good and downgraded -- though in this table they are not flagged as such) produced each day, and table B contains only the downgraded products.
I want to make a quality process control chart using the percentage of downgraded products w.r.t. the total number of products produced (using serialnumber to join, for example).
Could someone tell me how I can get the percentage value for each day (and also for each hour)?
Use date_trunc() to group rows by the desired period, e.g.:
select
a.r_date::date date,
downgraded,
total,
round(downgraded::numeric/total* 100, 2) percentage
from (
select date_trunc('day', eventtime) r_date, count(*) downgraded
from b
group by 1
) b
join (
select date_trunc('day', eventtime) r_date, count(*) total
from a
group by 1
) a
using (r_date)
order by 1;
date | downgraded | total | percentage
------------+------------+-------+------------
2015-05-05 | 3 | 4 | 75.00
2015-05-06 | 1 | 4 | 25.00
(2 rows)
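The question also asks about hourly values; the same query works if you truncate to the hour instead of the day. A sketch:
select
    r_hour,
    downgraded,
    total,
    round(downgraded::numeric / total * 100, 2) as percentage
from (
    -- downgraded products per hour
    select date_trunc('hour', eventtime) as r_hour, count(*) as downgraded
    from b
    group by 1
) b
join (
    -- all products per hour
    select date_trunc('hour', eventtime) as r_hour, count(*) as total
    from a
    group by 1
) a
using (r_hour)
order by 1;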

How to add a running count to rows in a 'streak' of consecutive days

Thanks to Mike for the suggestion to add the create/insert statements.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');
I want to add a new column that is 'days in current streak'
so the result would look like:
pid | date | in_streak
-------|-----------|----------
1 | 2014-10-1 | 1
1 | 2014-10-2 | 2
1 | 2014-10-3 | 3
1 | 2014-10-5 | 1
1 | 2014-10-7 | 1
2 | 2014-10-1 | 1
2 | 2014-10-2 | 2
2 | 2014-10-3 | 3
2 | 2014-10-5 | 1
2 | 2014-10-7 | 1
I've been trying to use the answers from
PostgreSQL: find number of consecutive days up until now
Return rows of the latest 'streak' of data
but I can't work out how to use the dense_rank() trick with other window functions to get the right result.
Building on this table (not using the SQL keyword "date" as a column name):
CREATE TABLE tbl(
pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);
Query:
SELECT pid, the_date
, row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM (
SELECT *
, the_date - '2000-01-01'::date
- row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM tbl
) sub
ORDER BY pid, the_date;
Subtracting a date from another date yields an integer. Since you are looking for consecutive days, every next row would be greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it's simple to deal out numbers per group.
grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:
the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
One multiplication, one subtraction. String concatenation and casting is more expensive. Test with EXPLAIN ANALYZE.
Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.
Using a subquery, since that is typically faster than a CTE. There is nothing here that a plain subquery couldn't do.
And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.
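To see why the trick works, it can help to expose the intermediate values. A small sketch that just shows grp for the sample data:
SELECT pid, the_date
     , the_date - '2000-01-01'::date                            AS days_offset
     , row_number() OVER (PARTITION BY pid ORDER BY the_date)   AS rn
     , the_date - '2000-01-01'::date
       - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM   tbl
ORDER  BY pid, the_date;
-- grp stays constant while the dates are consecutive and jumps by the size of
-- the gap whenever a day is skipped, so it identifies each streak per pid.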
You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');
The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can group by the constant, and take the dense_rank() over that result.
with grouped_dates as (
select pid, date,
(date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
from test
)
select * , dense_rank() over (partition by grouping_date order by date) as in_streak
from grouped_dates
order by pid, date
pid   date         grouping_date   in_streak
1     2014-10-01   2014-09-30      1
1     2014-10-02   2014-09-30      2
1     2014-10-03   2014-09-30      3
1     2014-10-05   2014-10-01      1
1     2014-10-07   2014-10-02      1
2     2014-10-01   2014-09-30      1
2     2014-10-02   2014-09-30      2
2     2014-10-03   2014-09-30      3
2     2014-10-05   2014-10-01      1
2     2014-10-07   2014-10-02      1

Oracle: Find previous record for a ranked list of forecasts

Hi I am faced with a difficult problem:
I have a table (Oracle 9i) of weather forecasts, many hundreds of millions of records in size, whose makeup looks like this:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk) integer (pk) datetime (pk) integer
where:
stationid refers to one of the many weather stations that may create a forecast;
forecastdate refers to the date the forecast is for (date only, not time);
forecastinterval refers to the hour of the forecastdate the forecast is for (0 - 23);
forecastcreated refers to the time the forecast was made, which can be many days beforehand;
forecastvalue refers to the actual value of the forecast (as the name implies).
I need to determine, for a given stationid and a given forecastdate and forecastinterval pair, the records where the forecastvalue increases by more than a nominal amount (say 100). I'll show a table of the condition here:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
'stationa' 13-dec-09 10 10-dec-09 04:50:10 0
'stationa' 13-dec-09 10 10-dec-09 17:06:13 0
'stationa' 13-dec-09 10 12-dec-09 05:20:50 300
'stationa' 13-dec-09 10 13-dec-09 09:20:50 300
In the above scenario, I'd like to pull out the third record. This is the record where the forecast value increased by a nominal (say 100) amount.
The task is proving to be very difficult due to the sheer size of the table (many hundreds of millions of records); queries take so long to finish that mine has never returned.
Here is my attempt so far to grab these values:
select
wtr.stationid,
wtr.forecastcreated,
wtr.forecastvalue,
(wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
(select inner.*,
rank() over (partition by stationid,
(inner.forecastdate + inner.forecastinterval),
inner.forecastcreated
order by stationid,
(inner.forecastdate + inner.forecastinterval) asc,
inner.forecastcreated asc
) rk
from weathertable inner) wtr
where
wtr.forecastvalue - 100 > (
select lastvalue
from (select y.*,
rank() over (partition by stationid,
(forecastdate + forecastinterval),
forecastcreated
order by stationid,
(forecastdate + forecastinterval) asc,
forecastcreated asc) rk
from weathertable y
) z
where z.stationid = wtr.stationid
and z.forecastdate = wtr.forecastdate
and (z.forecastinterval =
wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
and wtr.rk = z.rk + 1)
Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...
SQL> select * from t
2 /
STATIONID FORECASTDATE INTERVAL FORECASTCREATED FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa 13-12-2009 10 10-12-2009 04:50:10 0
stationa 13-12-2009 10 10-12-2009 17:06:13 0
stationa 13-12-2009 10 12-12-2009 05:20:50 300
stationa 13-12-2009 10 13-12-2009 09:20:50 300
stationa 13-12-2009 11 13-12-2009 09:20:50 400
stationb 13-12-2009 11 13-12-2009 09:20:50 500
6 rows selected.
SQL> SELECT v.stationid,
2 v.forecastcreated,
3 v.forecastvalue,
4 (v.forecastdate + v.forecastinterval / 24) fcst_date
5 FROM (SELECT t.stationid,
6 t.forecastdate,
7 t.forecastinterval,
8 t.forecastcreated,
9 t.forecastvalue,
10 t.forecastvalue - LAG(t.forecastvalue, 1)
11 OVER (ORDER BY t.forecastcreated) as difference
12 FROM t) v
13 WHERE v.difference >= 100
14 /
STATIONID FORECASTCREATED FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa 12-12-2009 05:20:50 300 13-12-2009 10:00:00
stationa 13-12-2009 09:20:50 400 13-12-2009 11:00:00
stationb 13-12-2009 09:20:50 500 13-12-2009 11:00:00
SQL>
To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.
SQL> SELECT v.stationid,
2 v.forecastcreated,
3 v.forecastvalue,
4 (v.forecastdate + v.forecastinterval / 24) fcst_date
5 FROM (SELECT t.stationid,
6 t.forecastdate,
7 t.forecastinterval,
8 t.forecastcreated,
9 t.forecastvalue,
10 t.forecastvalue - LAG(t.forecastvalue, 1)
11 OVER (PARTITION BY t.stationid
12 , t.forecastdate
13 , t.forecastinterval
14 ORDER BY t.forecastcreated) as difference
15 FROM t) v
16 WHERE v.difference >= 100
17 /
STATIONID FORECASTCREATED FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa 12-12-2009 05:20:50 300 13-12-2009 10:00:00
SQL>
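If relying on that NULL in the first row of each partition feels fragile, LAG's optional third argument supplies a default so the first row compares against itself instead. A sketch of the same query with that change:
SELECT v.stationid,
       v.forecastcreated,
       v.forecastvalue,
       (v.forecastdate + v.forecastinterval / 24) fcst_date
FROM  (SELECT t.stationid,
              t.forecastdate,
              t.forecastinterval,
              t.forecastcreated,
              t.forecastvalue,
              -- default of t.forecastvalue makes the first row's difference 0
              t.forecastvalue - LAG(t.forecastvalue, 1, t.forecastvalue)
                  OVER (PARTITION BY t.stationid
                                   , t.forecastdate
                                   , t.forecastinterval
                        ORDER BY t.forecastcreated) AS difference
       FROM t) v
WHERE v.difference >= 100;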
Working with large volumes of data
You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes: they have different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five-minute answer.
But here's the five-minute answer anyway.
Assuming your table is the live table, it is presumably being populated by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently, indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.
On that assumption, the one thing I would suggest you try first is building an index on (STATIONID, FORECASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN, without touching the table at all.
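For what it's worth, a sketch of that suggested index (the index name is illustrative; weathertable is the table name used in the query above):
-- covering index so the range scan can answer the query without table access
CREATE INDEX weather_fcst_covering_ix
    ON weathertable (stationid, forecastdate, forecastinterval, forecastcreated, forecastvalue);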