SQL query to track the presence of distinct values in time - sql

Using PostgreSQL, I would like to track the presence of each distinct id from one day to the next in the following table. How many were added or removed since the previous date? How many were present on both dates?
Date        Id
2021-06-28  1
2021-06-28  2
2021-06-28  3
2021-06-29  3
2021-06-29  4
2021-06-29  5
2021-06-30  4
2021-06-30  5
2021-06-30  6
I am thus looking for a SQL query that returns this kind of result:
Date        Added  Idle  Removed
2021-06-28  3      0     0
2021-06-29  2      1     2
2021-06-30  1      2     1
Do you have any idea ?

First, for each day and id, find the date on which that id previously occurred.
The number of retained ids is the count of rows per day where the id also occurred on the previous day.
Added ids are all rows of the day that were not retained, and removed ids are the rows from the previous day that were not retained.
In SQL:
SELECT d,
       total - retained AS added,
       retained AS idle,
       lag(total, 1, 0::bigint) OVER (ORDER BY d) - retained AS removed
FROM (SELECT d,
             count(prev_d) FILTER (WHERE d - 1 = prev_d) AS retained,
             count(*) AS total
      FROM (SELECT d,
                   lag(d, 1, '-infinity') OVER (PARTITION BY id ORDER BY d) AS prev_d
            FROM mytable) AS q
      GROUP BY d) AS p;
     d      │ added │ idle │ removed
════════════╪═══════╪══════╪═════════
 2021-06-28 │     3 │    0 │       0
 2021-06-29 │     2 │    1 │       2
 2021-06-30 │     1 │    2 │       1
(3 rows)
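As a cross-check outside the database, the same bookkeeping can be sketched with plain sets. This is only an illustration, not the query above: the function name presence_changes and the in-memory rows list are assumptions, and the sketch compares each date with the previous date present in the data (which coincides with the previous calendar day for the sample table).

```python
from collections import defaultdict
from datetime import date


def presence_changes(rows):
    """Return (day, added, idle, removed) tuples, one per distinct day."""
    ids_by_day = defaultdict(set)
    for day, id_ in rows:
        ids_by_day[day].add(id_)

    result = []
    previous = set()  # ids seen on the previous date present in the data
    for day in sorted(ids_by_day):
        current = ids_by_day[day]
        added = len(current - previous)    # new on this date
        idle = len(current & previous)     # present on both dates
        removed = len(previous - current)  # gone since the previous date
        result.append((day, added, idle, removed))
        previous = current
    return result


# Sample data copied from the question.
rows = [
    (date(2021, 6, 28), 1), (date(2021, 6, 28), 2), (date(2021, 6, 28), 3),
    (date(2021, 6, 29), 3), (date(2021, 6, 29), 4), (date(2021, 6, 29), 5),
    (date(2021, 6, 30), 4), (date(2021, 6, 30), 5), (date(2021, 6, 30), 6),
]

for line in presence_changes(rows):
    print(line)
```

For the sample rows this yields the same added / idle / removed figures as the query result above.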

Related

Creating 2 additional columns based on past dates - PostgresSQL

I'm seeking some help after spending a lot of time searching to no avail. I'm rather new to SQL, so any help is greatly appreciated. I've tried a few functions (e.g. GROUP BY, BETWEEN) but can't seem to get it right.
On the PrestoSQL server, I have a table as shown below, starting with the columns Date, ID and COVID. Using GROUP BY ID, I would like to create a column EverCOVIDBefore, which looks back at all past dates of the COVID column to see whether there was ever COVID = 1, as well as another column called COVID_last_2_mth, which checks whether there was ever COVID = 1 within the past 2 months.
(Highlighted columns are my expected outcomes)
Link to dataset: https://drive.google.com/file/d/1Sc5Olrx9g2A36WnLcCFMU0YTQ3-qWROU/view?usp=sharing
You can do:
select *,
       max(covid) over (partition by id order by date) as ever_covid_before,
       max(covid) over (partition by id order by date
                        range between interval '2 month' preceding and current row)
           as covid_last_two_months
from t
Result:
date        id  covid  ever_covid_before  covid_last_two_months
----------  --  -----  -----------------  ---------------------
2020-01-15  1   0      0                  0
2020-02-15  1   0      0                  0
2020-03-15  1   1      1                  1
2020-04-15  1   0      1                  1
2020-05-15  1   0      1                  1
2020-06-15  1   0      1                  0
2020-01-15  2   0      0                  0
2020-02-15  2   1      1                  1
2020-03-15  2   0      1                  1
2020-04-15  2   0      1                  1
2020-05-15  2   0      1                  0
2020-06-15  2   1      1                  1
See running example at db<>fiddle.
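To see what the two window frames compute, here is a small Python sketch of the same idea: a running maximum per id, and a maximum restricted to rows dated within the last two months. It is an illustration under assumptions, not the query itself: covid_flags and months_before are hypothetical helper names, and the sketch assumes at most one row per id and date.

```python
import calendar
from datetime import date


def months_before(d, n):
    """Return the date n months before d, clamping the day to the month's end."""
    month_index = d.year * 12 + (d.month - 1) - n
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def covid_flags(rows):
    """rows: (day, id, covid) tuples. Returns rows extended with two flags."""
    out = []
    history = {}  # id -> list of (day, covid) seen so far, in date order
    for day, id_, covid in sorted(rows, key=lambda r: (r[1], r[0])):
        seen = history.setdefault(id_, [])
        seen.append((day, covid))
        # Frame 1: everything up to and including the current row.
        ever = max(c for _, c in seen)
        # Frame 2: only rows dated within the two months before the current row.
        start = months_before(day, 2)
        recent = max(c for d, c in seen if d >= start)
        out.append((day, id_, covid, ever, recent))
    return out


# Rows for id 1, copied from the result table above.
sample = [
    (date(2020, 1, 15), 1, 0), (date(2020, 2, 15), 1, 0),
    (date(2020, 3, 15), 1, 1), (date(2020, 4, 15), 1, 0),
    (date(2020, 5, 15), 1, 0), (date(2020, 6, 15), 1, 0),
]

for row in covid_flags(sample):
    print(row)
```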

Counting number of transactions in the following week SQL

I am trying to get a count of the number of transactions that occur in the 7 days following each transaction in a dataset (including the transaction itself). For example, in the following set of transactions I want the count to be as follows.
┌─────────────┬──────────────┐
│ Txn_Date    │ Count        │
├─────────────┼──────────────┤
│ 2020-01-01  │ 3            │
│ 2020-01-04  │ 3            │
│ 2020-01-05  │ 2            │
│ 2020-01-10  │ 1            │
│ 2020-01-18  │ 3            │
│ 2020-01-20  │ 2            │
│ 2020-01-24  │ 2            │
│ 2020-01-28  │ 1            │
└─────────────┴──────────────┘
For each one it needs to be a rolling week, so row 1 counts all the transactions between 2020-01-01 and 2020-01-08, the second all transactions between 2020-01-04 and 2020-01-11 etc.
The code I have is:
select Txn_Date,
count(Txn_Date) over (partition by Txn_Date(column) where Txn_Date(rows) between Txn_Date and date_add('day', 14, Txn_Date) as Count
This code will not work in its current state, but hopefully it gives an idea of what I am trying to achieve. The database I am working in is Hive.
A good way to provide demo data is to put it into a table variable.
DECLARE @table TABLE (Txn_Date DATE, Count INT)
INSERT INTO @table (Txn_Date, Count) VALUES
('2020-01-01', 3),
('2020-01-04', 3),
('2020-01-05', 2),
('2020-01-10', 1),
('2020-01-18', 3),
('2020-01-20', 2),
('2020-01-24', 2),
('2020-01-28', 1)
If you're using T-SQL, you can do this with the window function LAG after grouping the data by week.
SELECT DATEPART(WEEK, Txn_Date) AS Week,
       SUM(Count) AS Count,
       LAG(SUM(Count), 1) OVER (ORDER BY DATEPART(WEEK, Txn_Date)) AS LastWeekCount
FROM @table
GROUP BY DATEPART(WEEK, Txn_Date)
Week  Count  LastWeekCount
--------------------------
1     6      NULL
2     3      6
3     3      3
4     4      3
5     1      4
LAG lets you go back n rows for a column in a specific order. Here we go back 1 row in week-number order. To move in the opposite direction, you can use LEAD in the same way.
We're also using the T-SQL function DATEPART to get the week number of the date, and grouping by that. Note that DATEPART(WEEK, ...) depends on the session's DATEFIRST setting; ISO 8601 week numbers would need DATEPART(ISO_WEEK, ...) instead.
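Note that the question asks for a rolling window (each transaction plus everything in the 7 days after it), which is not the same thing as calendar-week grouping. A minimal Python sketch of that rolling count follows, to pin down the expected numbers before translating them to a particular SQL dialect. The function name count_following_week is hypothetical, and the window is taken as inclusive on both ends, as in the question's "between 2020-01-01 and 2020-01-08".

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def count_following_week(txn_dates, days=7):
    """For each transaction date, count transactions in [date, date + days]."""
    ordered = sorted(txn_dates)
    counts = []
    for d in ordered:
        start = bisect_left(ordered, d)                         # first txn on d
        end = bisect_right(ordered, d + timedelta(days=days))   # one past the last txn in the window
        counts.append((d, end - start))
    return counts


# Transaction dates copied from the question.
txn_dates = [
    date(2020, 1, 1), date(2020, 1, 4), date(2020, 1, 5), date(2020, 1, 10),
    date(2020, 1, 18), date(2020, 1, 20), date(2020, 1, 24), date(2020, 1, 28),
]

for d, n in count_following_week(txn_dates):
    print(d, n)
```

For the dates in the question this reproduces the Count column the asker listed.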

Frequency of Words in Column of Strings

I'm looking for a query that will allow me to
Imagine this is the table, "News_Articles", with two columns: "ID" & "Headline". How can I make a result from this showing
NEWS_ARTICLES
ID    Headline
0001  Today's News: Local Election Today!
0002  COVID-19 Rates Drop Today
0003  Today's the day to shop local
One word per row (from the headline column)
A count of how many unique IDs it appears in
A count of how many total times the word appears in the whole dataset
DESIRED RESULT
Word      Unique_Count  Total_Count
Today     3             4
Local     2             2
Election  1             1
Ideally, we'd like to strip any possessive or contraction suffixes from the words as well (note that "Today's" above is counted as "Today").
I'd also like to be able to remove filler words such as "the" or "a". Ideally this would be through some existing library but if not, I can always manually remove the ones I see with a where clause.
I would also change all characters to lowercase if needed.
Thank you!
You can use full text search and unnest to extract the lexemes, then aggregate:
SELECT parts.lexeme AS word,
       count(*) AS unique_count,
       sum(cardinality(parts.positions)) AS total_count
FROM news_articles
   CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
GROUP BY parts.lexeme;
 word  │ unique_count │ total_count
═══════╪══════════════╪═════════════
 -19   │            1 │           1
 covid │            1 │           1
 day   │            1 │           1
 drop  │            1 │           1
 elect │            1 │           1
 local │            2 │           2
 news  │            1 │           1
 rate  │            1 │           1
 shop  │            1 │           1
 today │            3 │           4
(10 rows)
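If stemming is not wanted (note that full text search turns "election" into "elect"), the two counts can also be sketched outside the database. The Python below is a simplified substitute for the full text search approach, not an equivalent: it uses a plain regular-expression tokenizer, strips a trailing "'s", and filters against a deliberately tiny, hypothetical stop-word list. The name word_counts is an assumption.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to"}  # hypothetical, deliberately tiny list


def word_counts(headlines):
    """headlines: {id: headline}. Returns {word: (unique_count, total_count)}."""
    unique = Counter()  # number of headlines containing the word
    total = Counter()   # number of occurrences across all headlines
    for headline in headlines.values():
        text = re.sub(r"'s\b", "", headline.lower())  # "today's" -> "today"
        words = [w for w in re.findall(r"[a-z0-9]+", text) if w not in STOP_WORDS]
        total.update(words)
        unique.update(set(words))
    return {word: (unique[word], total[word]) for word in total}


# Sample rows copied from the question.
news_articles = {
    "0001": "Today's News: Local Election Today!",
    "0002": "COVID-19 Rates Drop Today",
    "0003": "Today's the day to shop local",
}

for word, (unique_count, total_count) in sorted(word_counts(news_articles).items()):
    print(word, unique_count, total_count)
```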

Renumber rows based on createdDate

I'm in need of a complicated SQL query. Essentially, there's a table called layer and it has a number of columns, the important ones being:
created_date, layer_id, layer_number, section_fk_id
The problem we have is there are some rows where layerId got duplicated per sectionFkId like this:
01/01/2021 4564 L01 1
03/01/2021 5689 L02 1
04/01/2021 6333 L02 1 <<problem row L02 duped
05/01/2021 8495 L03 1
03/01/2021 5603 L01 2
07/01/2021 6210 L02 2
10/01/2021 7345 L03 2
This would need to be fixed by incrementing layer id for those where duplicated and every row following so it ends up like this:
01/01/2021 4564 L01 1
03/01/2021 5689 L02 1
04/01/2021 6333 L03 1 << incremented layer id
05/01/2021 8495 L04 1 << incremented layer id
03/01/2021 5603 L01 2
07/01/2021 6210 L02 2
10/01/2021 7345 L03 2
Again, this is per sectionFkId.
Despite what the terribly named column may suggest, layer_number is prefixed with an L and is therefore a varchar.
I've done the best I can in pseudocode, hoping someone can finish it:
startLayerId = L01
for each row in app.layer per section_fk_id order by created_date
Update app.layer set layer_id = (startLayerId)
startLayerId++
You can use row_number() in an update. Your description and pseudo-code suggest that layer_number is a number:
-- Table and column names follow the question (layer, section_fk_id).
update layer l
    set layer_number = ll.seqnum
    from (select l.*,
                 row_number() over (partition by section_fk_id order by created_date) as seqnum
          from layer l
         ) ll
    where ll.section_fk_id = l.section_fk_id and
          ll.created_date = l.created_date;
If it is a string, you can use:
    set layer_number = 'L' || lpad(ll.seqnum::text, 2, '0')
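To check the expected numbering before running the update, the row_number() logic can be sketched in Python: sort by section and creation date, then number the rows within each section. This is an illustration only. The function name renumber is hypothetical, and the dates from the question are read as day/month/year, which is an assumption (either reading gives the same order here).

```python
from datetime import date
from itertools import groupby


def renumber(rows):
    """rows: (created_date, layer_id, layer_number, section_fk_id) tuples."""
    ordered = sorted(rows, key=lambda r: (r[3], r[0]))
    fixed = []
    for _, group in groupby(ordered, key=lambda r: r[3]):
        # Equivalent of row_number() over (partition by section order by date).
        for seq, (created, layer_id, _, section) in enumerate(group, start=1):
            fixed.append((created, layer_id, f"L{seq:02d}", section))
    return fixed


# Sample rows copied from the question, including the duplicated L02.
layers = [
    (date(2021, 1, 1), 4564, "L01", 1),
    (date(2021, 1, 3), 5689, "L02", 1),
    (date(2021, 1, 4), 6333, "L02", 1),
    (date(2021, 1, 5), 8495, "L03", 1),
    (date(2021, 1, 3), 5603, "L01", 2),
    (date(2021, 1, 7), 6210, "L02", 2),
    (date(2021, 1, 10), 7345, "L03", 2),
]

for row in renumber(layers):
    print(row)
```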

Possible to play back table of inserts/deletes to get the current state?

I have a table which stores equipment as a history of changes, e.g.:
Sample table PartHistory (also includes a numerical primary key):
ID  Hst  Data   Set  Date        (desired result)
1   I    partA  1    2014-07-01
2   I    partB  1    2014-07-01
3   I    partC  1    2014-07-01  Parts A, B, C
4   D    partC  2    2014-07-03  Parts A, B
5   I    partZ  3    2014-07-06  Parts A, B, Z
6   D    partA  4    2014-07-20
7   D    partZ  4    2014-07-20
8   I    partC  4    2014-07-20
9   I    partQ  4    2014-07-20  Parts B, C, Q
Each set involves one or more changes: I for inserts (new equipment) and D for deletes (equipment removal). Valid data (no removal of non-existent equipment, etc.) can be assumed.
Is it possible to use standard SQL (not PL/SQL) to query this data such that I can get the state of equipment on a specific date or set?
For example:
SELECT data, set, date FROM (...ninja SQL...) WHERE ChangeDate = DATE '2014-07-03'
partA 2 2014-07-03
partB 2 2014-07-03
SELECT data, set, date FROM (...ninja SQL...) WHERE set = 4
partB 4 2014-07-20
partC 4 2014-07-20
partQ 4 2014-07-20
If you want the parts and the most recent counts "as-of" a particular date, you can do this in SQL with this structure. Basically, add up the number of "I" values and subtract the number of "D" values to get the count.
select data,
       sum(case when hst = 'I' then 1 when hst = 'D' then -1 else 0 end) as num
from parthistory ph
where ChangeDate <= DATE '2014-07-03'
group by data
having sum(case when hst = 'I' then 1 when hst = 'D' then -1 else 0 end) > 0;
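The +1 / -1 bookkeeping is easy to verify by hand with a short Python sketch that replays the history up to a cutoff date. This is an illustration under assumptions: parts_as_of is a hypothetical name, and the history list holds only the Hst, Data and Date columns from the sample table.

```python
from collections import Counter
from datetime import date


def parts_as_of(history, as_of):
    """Replay inserts (+1) and deletes (-1) up to as_of; return parts still present."""
    balance = Counter()
    for hst, part, change_date in history:
        if change_date <= as_of:
            if hst == "I":
                balance[part] += 1
            elif hst == "D":
                balance[part] -= 1
    return sorted(part for part, n in balance.items() if n > 0)


# Sample rows copied from the question (Hst, Data, Date).
part_history = [
    ("I", "partA", date(2014, 7, 1)),
    ("I", "partB", date(2014, 7, 1)),
    ("I", "partC", date(2014, 7, 1)),
    ("D", "partC", date(2014, 7, 3)),
    ("I", "partZ", date(2014, 7, 6)),
    ("D", "partA", date(2014, 7, 20)),
    ("D", "partZ", date(2014, 7, 20)),
    ("I", "partC", date(2014, 7, 20)),
    ("I", "partQ", date(2014, 7, 20)),
]

print(parts_as_of(part_history, date(2014, 7, 3)))
print(parts_as_of(part_history, date(2014, 7, 20)))
```

Filtering on the set number instead of the date works the same way, since sets are numbered in chronological order in the sample.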