Detect and delete gaps in time series - SQL

I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Therefore I want to exclude all companies which have 3 or more consecutive missing values, and furthermore all companies which have more than 50% missing values between their first and final date in the dataset.
We can work with the following example data:
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
So A has to be excluded because it has a gap of three consecutive missing values, and C because it has more than 50% missing values between its first and final date.
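For anyone who wants to reproduce this, a setup sketch (assuming the table is called mytable, as in the queries below):
CREATE TABLE mytable (date date, company text, value integer);
INSERT INTO mytable (date, company, value) VALUES
('2012-01-01','A',5), ('2012-01-01','B',2),
('2012-01-02','A',NULL), ('2012-01-02','B',2), ('2012-01-02','C',4),
('2012-01-03','A',NULL), ('2012-01-03','B',NULL), ('2012-01-03','C',NULL),
('2012-01-04','A',NULL), ('2012-01-04','B',NULL), ('2012-01-04','C',NULL),
('2012-01-05','A',8), ('2012-01-05','B',9), ('2012-01-05','C',3),
('2012-01-06','A',8), ('2012-01-06','B',9), ('2012-01-06','C',NULL);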
Combining other answers from this forum I put together the following code:
Add an autoincrement primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq');
UPDATE test SET id = NEXTVAL('id_seq');
ALTER TABLE test ADD PRIMARY KEY (id);
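Aside: on PostgreSQL 10 and later the sequence juggling above can be collapsed into a single identity column. A sketch:
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
-- backfills ids for the existing rows; the assignment order is not formally
-- guaranteed, but in practice follows the insert order of the fresh table
ALTER TABLE test ADD COLUMN id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;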
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS
( SELECT *,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
AS cnt FROM test)
SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
It seems to detect and delete gaps of 3 or more consecutive missing values from the time series. But this approach is very cumbersome, and I can't figure out how to additionally exclude all companies with more than 50% missing values.
Can you think of a more effective solution than mine (I'm just learning to work with PostgreSQL), one that also manages to exclude companies with more than 50% missing values?

I would create only one query:
DELETE FROM mytable
WHERE company in (
SELECT Company
FROM (
SELECT Company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company)
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
A composite index on the (company, value) columns can help this query reach maximum speed.
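For example (an illustrative sketch; the index name is arbitrary):
CREATE INDEX mytable_company_value_idx ON mytable (company, value);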
EDIT
The above query doesn't work. I've corrected it slightly; here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- an explicit cast to numeric (because integer division truncated the ratio to 0):
OVER (PARTITION BY company)::numeric
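A quick illustration of the truncation (with made-up numbers):
SELECT 2 / 6;            -- 0: integer division truncates
SELECT 2::numeric / 6;   -- 0.3333...: what the 50% check needs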
SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
-- WHERE cnt >= 3 OR p50 > 0.5
and now the delete query should work:
DELETE FROM mytable
WHERE company in (
SELECT company
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
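On the example data only company B should survive: A is removed by the cnt >= 3 condition and C by the 50% condition. A quick check after the delete:
SELECT DISTINCT company FROM mytable;
-- expected: B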

For the 50% criterion, you could select all the companies for which the number of distinct dates is lower than half the number of days between the min and max dates.
I have not tested this but it should give you an idea. I used a CTE to make it easier to read.
WITH MinMax AS
(
SELECT Company,
DATE_PART('day', AGE(MAX(Date), MIN(Date))) AS calendar_days,
COUNT(DISTINCT Date) AS days
FROM mytable
GROUP BY Company
)
SELECT Company FROM MinMax
WHERE (calendar_days / 2) > days
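Note that this flags companies whose date rows are missing altogether; in the example data the gaps are NULL values in rows that do exist, so you would still combine it with a NULL-based check like the one above.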

Related

Why adding column in group by increases result of sum()?

For example, I have a table
Date        ID  Column2  Result of count()
01-01-2022  1   Yes      3
01-02-2022  1   No       2
01-03-2022  2   Yes      5
And when I check totals by date I get the same result as when I count directly from table1:
select date, sum(cnt) from (select date, count(distinct ID) as cnt from table1 group by date) t group by date
Result: 10
But when I add another column to count by in the subquery, like this
select date, sum(cnt) from (select date, column2, count(distinct ID) as cnt from table1 group by date, column2) t group by date
I get a result with more values: 13
How does the addition of another column affect the rows counted?
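A minimal illustration of the likely cause (hypothetical rows, not from the table above): when the same ID appears under two different Column2 values on one date, grouping by date counts it once, while grouping by date and column2 counts it once per group, so the outer SUM adds it twice.
-- suppose table1 contains ('01-01-2022', 1, 'Yes') and ('01-01-2022', 1, 'No')
select date, count(distinct ID) from table1 group by date;
-- one row for the date; ID 1 is counted once
select date, column2, count(distinct ID) from table1 group by date, column2;
-- two rows for the date; ID 1 is counted in each, so summing double-counts it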

PostgreSQL Pivot by Last Date

I need to make a PIVOT table from Source like this table
FactID UserID Date Product QTY
1 11 01/01/2020 A 600
2 11 02/01/2020 A 400
3 11 03/01/2020 B 500
4 11 04/01/2020 B 200
6 22 06/01/2020 A 1000
7 22 07/01/2020 A 200
8 22 08/01/2020 B 300
9 22 09/01/2020 B 100
Need a pivot like this, where each product's QTY is the QTY on the last date:
UserID A B
11 400 200
22 200 100
My try in PostgreSQL:
Select
UserID,
MAX(CASE WHEN Product='A' THEN QTY END) AS "A",
MAX(CASE WHEN Product='B' THEN QTY END) AS "B"
FROM table
GROUP BY UserID
And Result
UserID A B
11 600 500
22 1000 300
I mean, I get the result by the maximum QTY and not by the maximum date!
What do I need to add to get results by the maximum (last) date?
Postgres doesn't have "first" and "last" aggregation functions. One method for doing this (without a subquery) uses arrays:
select userid,
(array_agg(qty order by date desc) filter (where product = 'A'))[1] as a,
(array_agg(qty order by date desc) filter (where product = 'B'))[1] as b
from tab
group by userid;
Another method uses select distinct with first_value():
select distinct userid,
first_value(qty) over (partition by userid order by product = 'A' desc, date desc) as a,
first_value(qty) over (partition by userid order by product = 'B' desc, date desc) as b
from tab;
With the appropriate indexes, though, distinct on might be the fastest approach:
select userid,
max(qty) filter (where product = 'A') as a,
max(qty) filter (where product = 'B') as b
from (select distinct on (userid, product) t.*
from tab t
order by userid, product, date desc
) t
group by userid;
In particular, this can use an index on (userid, product, date desc). The improvement in performance will be most notable if there are many dates for a given user.
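For example (a sketch; the index name is arbitrary):
create index tab_userid_product_date_idx on tab (userid, product, date desc);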
You can use the DENSE_RANK() window function in order to filter by the last date for each Product and UserID before applying conditional aggregation, such as
SELECT UserID,
MAX(CASE WHEN Product='A' THEN QTY END) AS "A",
MAX(CASE WHEN Product='B' THEN QTY END) AS "B"
FROM
(
SELECT t.*, DENSE_RANK() OVER (PARTITION BY Product,UserID ORDER BY Date DESC) AS rn
FROM tab t
) q
WHERE rn = 1
GROUP BY UserID
This presumes all date values are distinct (no ties occur for dates).

SQL: count last equal values

I need to solve this problem in pure SQL:
I have to count all the records with a specific value.
In my table there is a column flag with values 0 or 1. I need to count all the 1s after the last 0, and sum the amount column values of those records.
Example:
Flag | Amount
0 | 5
1 | 8
0 | 10
1 | 20
1 | 30
Output:
2 | 50
If the last value is 0 I don't need to do anything.
I hasten to add that I need the query to be fast (ideally reading the table just once).
I assumed that your example table is logically ordered by Amount. Then you can do this:
select
count(*) as cnt
,sum(Amount) as Amount
from yourTable
where Amount > (select max(Amount) from yourTable where Flag = 0)
If the biggest value is from a row where Flag = 0 then nothing will be returned.
If your table may not contain any zeros, then you are safer with:
select count(*) as cnt, sum(Amount) as Amount
from t
where Amount > all (select Amount from t where Flag = 0)
Or, using window functions:
select count(*) as cnt, sum(amount) as amount
from (select t.*,
             max(case when flag = 0 then amount end) over () as flag0_amount
      from t
     ) t
where flag0_amount is null or amount > flag0_amount
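Assuming, as above, that the table is logically ordered by Amount: on the sample rows flag0_amount is 10, so only the trailing amounts 20 and 30 pass the filter, giving 2 | 50. The flag0_amount is null branch also covers the no-zeros case, where every row passes.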
I found the solution by myself:
select decode(lv,0,0,tot-prog) somma ,decode(lv,0,0,cnt-myrow) count
from(
select * from
(
select pan,dt,flag,am,
last_value(flag) over() lv,
row_number() OVER (order by dt) AS myrow,
count(*) over() cnt,
case when lead(flag) OVER (ORDER BY dt) != flag then rownum end AS change,
sum(am) over() tot,
sum(am) over(order by dt) prog
from test
where pan=:pan and dt > :dt and flag is not null
order by dt
) t
where change is not null
order by change desc
) where rownum =1

Return last amount for each element with same ref_id

I have 2 tables; one is credit and the other is creditdetails.
Creditdetails gets a new row every day for each credit.
ID Amount ref_id date
1 2 1 16.03
2 3 1 17.03
3 4 1 18.03
4 1 2 16.03
5 2 2 17.03
6 0 2 18.03
I want to sum up the amount of the row with the last date for each unique ref_id. So the output should be 4 + 0.
You can use ROW_NUMBER to filter on the latest amount per ref_id.
Then SUM it.
SELECT SUM(q.Amount) AS TotalLatestAmount
FROM
(
SELECT
cd.ref_id,
cd.Amount,
ROW_NUMBER() OVER (PARTITION BY cd.ref_id ORDER BY cd.date DESC) AS rn
FROM Creditdetails cd
) q
WHERE q.rn = 1;
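With the sample data, rn = 1 picks amount 4 for ref_id 1 and amount 0 for ref_id 2 (both dated 18.03), so the query returns 4.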
With this query:
select ref_id, max(date) maxdate
from creditdetails
group by ref_id
you get all the last dates for each ref_id, so you can join it to the table creditdetails and sum over amount:
select sum(amount) total
from creditdetails c inner join (
select ref_id, max(date) maxdate
from creditdetails
group by ref_id
) g
on g.ref_id = c.ref_id and g.maxdate = c.date
I think you want something like this,
select sum(amount)
from table
where date = ( select max(date) from table);
with the understanding that your date column doesn't appear to be in a standard format, so I can't tell whether it needs to be converted in the query to work properly.

how to select a value based on unique id column

Please help me.
I have a table with 3 columns; when I select from it I need to duplicate the values based on the id.
Id Days Values
1 5 7
1 NULL NULL
1 NULL NULL
2 7 25
2 NULL NULL
2 8 274
2 NULL NULL
I need a Result as
Id Days Values
1 5 7
1 5 7
1 5 7
2 7 25
2 7 25
2 8 274
2 8 274
Generate a set of data with the desired repeating values (B). Then join back to the base set (A) containing the # of records to repeat. This assumes that each ID will only have one record populated. If this is not the case, then you will not get the desired results.
SELECT B.ID, B.mDays AS Days, B.mValues AS "Values"
FROM TABLE A
INNER JOIN (SELECT ID, MAX(Days) mDays, MAX("Values") mValues
FROM TABLE
GROUP BY ID) B
ON A.ID = B.ID
And due to updates in the question...
This will get you close, but without a way to define grouping within IDs I can't subdivide the records into 2 and 2:
SELECT B.ID, B.Days AS Days, B."Values" AS "Values"
FROM TABLE A
INNER JOIN (SELECT DISTINCT ID, Days, "Values"
FROM TABLE) B
ON A.ID = B.ID
AND A.Days IS NULL
This isn't even close enough, as we still don't know how to order the rows...
It assumes an order within the table, which can't be trusted. We generate a row number for each row using the ROW_NUMBER() OVER syntax, grouping (PARTITION BY) on ID and Days with an order of ID, Days (which doesn't work well because of the null values).
We then join this data set back to a distinct set of values on ID and Days to get us close... but we still need some grouping logic beyond ID that handles the null records and the lack of order or grouping.
WITH CTE AS (
SELECT ID, Days, "Values", ROW_NUMBER() OVER (PARTITION BY ID, Days ORDER BY ID, Days) RN
FROM TABLE)
SELECT *
FROM (SELECT ID, Days, "Values", MAX(RN) mRN FROM CTE GROUP BY ID, Days, "Values") A
INNER JOIN CTE B
ON A.ID = B.ID
AND A.Days = B.Days
AND A.mRN <= B.RN
ORDER BY B.RN