Renumber rows based on createdDate - sql

I'm in need of a complicated SQL query. Essentially, there's a table called layer and it has a number of columns, the important ones being:
created_date, layer_id, layer_number, section_fk_id
The problem we have is that there are some rows where layer_number got duplicated per section_fk_id, like this:
01/01/2021 4564 L01 1
03/01/2021 5689 L02 1
04/01/2021 6333 L02 1 <<problem row L02 duped
05/01/2021 8495 L03 1
03/01/2021 5603 L01 2
07/01/2021 6210 L02 2
10/01/2021 7345 L03 2
This would need to be fixed by incrementing the layer_number for the duplicated row and every row following it, so it ends up like this:
01/01/2021 4564 L01 1
03/01/2021 5689 L02 1
04/01/2021 6333 L03 1 << incremented layer_number
05/01/2021 8495 L04 1 << incremented layer_number
03/01/2021 5603 L01 2
07/01/2021 6210 L02 2
10/01/2021 7345 L03 2
Again, this is per section_fk_id.
Despite what the terribly named column may suggest, layer_number values are prefixed with an L, so the column is a varchar.
I've done the best I can in pseudocode, hoping someone can finish it:
startLayerId = L01
for each row in app.layer per section_fk_id order by created_date
Update app.layer set layer_id = (startLayerId)
startLayerId++

You can use row_number() in an update. Your pseudo-code suggests that layer_number is a number:
update app.layer l
    set layer_number = ll.seqnum
from (select l.*,
             row_number() over (partition by section_fk_id order by created_date) as seqnum
      from app.layer l
     ) ll
where ll.section_fk_id = l.section_fk_id and
      ll.created_date = l.created_date;
If it is a string, you can use:
set layer_number = 'L' || lpad(ll.seqnum::text, 2, '0')
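Putting the two together, a complete statement for the varchar case could look like the following (a sketch assuming Postgres, reusing the column names and the app.layer table from the question; the lpad to two digits matches the L01/L02 format shown):
update app.layer l
    set layer_number = 'L' || lpad(ll.seqnum::text, 2, '0')
from (select section_fk_id, created_date,
             row_number() over (partition by section_fk_id order by created_date) as seqnum
      from app.layer
     ) ll
where ll.section_fk_id = l.section_fk_id and
      ll.created_date = l.created_date;
This renumbers every row within each section_fk_id in created_date order, which removes the duplicate and shifts the later layers up, as in the desired output.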

Related

Generating columns for daily stats in SQL

I have a table that currently looks like this (simplified to illustrate my issue):
Thing  Date
1      2022-12-12
2      2022-11-05
3      2022-11-18
4      2022-12-01
1      2022-11-02
2      2022-11-21
5      2022-12-03
5      2022-12-08
2      2022-11-18
1      2022-11-20
I would like to generate the following:
Thing  2022-11  2022-12
1      2        1
2      3        0
3      1        0
4      0        1
5      0        2
I'm new to SQL and can't quite figure this out - would I use some sort of FOR loop equivalent in my SELECT clause? I'm happy to figure out the exact syntax myself, I just need someone to point me in the right direction.
Thank you!
You may use conditional aggregation, as follows:
Select Thing,
Count(Case When Date Between '2022-11-01' And '2022-11-30' Then 1 End) As '2022-11',
Count(Case When Date Between '2022-12-01' And '2022-12-31' Then 1 End) As '2022-12'
From table_name
Group By Thing
Order By Thing
The COUNT function counts only non-null values, so for each row that does not match the condition inside COUNT, the CASE expression returns NULL and the row is not counted.
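For a runnable illustration, here is a minimal self-contained version of the above using the sample data from the question (a sketch assuming MySQL; the things table name is made up for the demo):
-- sample data from the question
Create Table things (Thing Int, `Date` Date);
Insert Into things (Thing, `Date`) Values
(1, '2022-12-12'), (2, '2022-11-05'), (3, '2022-11-18'), (4, '2022-12-01'),
(1, '2022-11-02'), (2, '2022-11-21'), (5, '2022-12-03'), (5, '2022-12-08'),
(2, '2022-11-18'), (1, '2022-11-20');
-- one row per Thing, one column per month; rows outside the month yield NULL and are not counted
Select Thing,
       Count(Case When `Date` Between '2022-11-01' And '2022-11-30' Then 1 End) As '2022-11',
       Count(Case When `Date` Between '2022-12-01' And '2022-12-31' Then 1 End) As '2022-12'
From things
Group By Thing
Order By Thing;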

How to process Row number window function on incremental data

I have a table which has a row_number window function running for some IDs.
Every refresh is currently a full load, so the row numbers are assigned again over the entire data set. This is quite inefficient: a lot of resources get consumed and it is CPU intensive, and the table is rebuilt every 15 to 30 minutes. I am trying to achieve the same result incrementally, by adding the row numbers of the incremental batch to the last row number already stored for a particular customer_ID.
For example, if the saved maximum row_num for an ID is max_row_num = 4 and two new records come in for that ID, the incremental row numbers are 1 and 2, and the final output should be 4+1 and 4+2, so the row numbers for that ID end up as 1, 2, 3, 4, 5, 6.
I actually want to implement the logic in PySpark, but I am open to a Python solution that I could convert to a PySpark DataFrame later.
Please suggest possible solutions.
Full load -- initial table
Row_num  customer_ID
1        ABC123
2        ABC123
3        ABC123
1        ABC125
2        ABC125
1        ABC225
2        ABC225
3        ABC225
4        ABC225
5        ABC225
Incremental load
Row_num  customer_ID
1        ABC123
2        ABC123
1        ABC125
1        ABC225
2        ABC225
1        ABC330
Desired output
Row_num  customer_ID
1        ABC123
2        ABC123
3        ABC123
4        ABC123
1        ABC125
2        ABC125
3        ABC125
1        ABC225
2        ABC225
3        ABC225
4        ABC225
5        ABC225
6        ABC225
1        ABC330
If you are trying to insert the values with the new row number, you can join in the maximum existing row number:
insert into full (row_num, customer_id)
    select i.row_num + coalesce(f.max_row_number, 0), i.customer_id
    from incremental i left join
         (select f.customer_id, max(f.row_num) as max_row_number
          from full f
          group by f.customer_id
         ) f
         on i.customer_id = f.customer_id;
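If the incremental batch arrives without a precomputed Row_num, the numbering and the offset can be combined in one statement (a sketch assuming a dialect with window functions; full and incremental are the table names used above, and load_ts is a hypothetical column that defines the order of the new rows):
insert into full (row_num, customer_id)
    select row_number() over (partition by i.customer_id
                              order by i.load_ts)  -- hypothetical ordering column; replace with whatever defines row order
               + coalesce(f.max_row_number, 0),
           i.customer_id
    from incremental i left join
         (select customer_id, max(row_num) as max_row_number
          from full
          group by customer_id
         ) f
         on i.customer_id = f.customer_id;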

Select maximum value where another column is used for the grouping

I'm trying to join several tables, where one of the tables is acting as a
key-value store, and then after the joins find the maximum value in a
column less than another column. As a simplified example, I have the following three tables:
Documents:
DocumentID  Filename      LatestRevision
1           D1001.SLDDRW  18
2           P5002.SLDPRT  10
Variables:
VariableID  VariableName
1           DateReleased
2           Change
3           Description
VariableValues:
DocumentID  VariableID  Revision  Value
1           2           1         Created
1           3           1         Drawing
1           2           3         Changed Dimension
1           1           4         2021-02-01
1           2           11        Corrected typos
1           1           16        2021-02-25
2           3           1         Generic part
2           3           5         Screw
2           2           4         2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of
variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0).
What I want is the latest value of each variable at or below each revision that has a DateReleased, for example:
DocumentID  Filename      Variable     Value              VariableRev  DateReleased  ReleasedRev
1           D1001.SLDDRW  Change       Changed Dimension  3            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-25    16
1           D1001.SLDDRW  Change       Corrected Typos    11           2021-02-25    16
2           P5002.SLDPRT  Description  Generic Part       1            2021-02-24    4
How do I do this?
I figured this out. Add another JOIN at the start to bring in another copy of the VariableValues table selecting only the DateReleased variables, then make sure that all the VariableValues revisions selected are less than or equal to the revision of that DateReleased row. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
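For readers who cannot open the fiddle, a sketch of that approach using the table and column names from the question (illustrative only, not tested against the original schema):
select d.DocumentID,
       d.Filename,
       var.VariableName as Variable,
       vv.Value,
       vv.Revision      as VariableRev,
       rel.Value        as DateReleased,
       rel.Revision     as ReleasedRev
from Documents d
-- rel: the DateReleased rows, one per released revision
join VariableValues rel on rel.DocumentID = d.DocumentID
join Variables relv     on relv.VariableID = rel.VariableID
                       and relv.VariableName = 'DateReleased'
-- vv: every other variable value at or below the released revision
join VariableValues vv  on vv.DocumentID = d.DocumentID
                       and vv.VariableID <> rel.VariableID
                       and vv.Revision  <= rel.Revision
join Variables var      on var.VariableID = vv.VariableID
-- newer: a later value of the same variable that is still at or below the released revision
left join VariableValues newer
                        on newer.DocumentID = vv.DocumentID
                       and newer.VariableID = vv.VariableID
                       and newer.Revision   > vv.Revision
                       and newer.Revision  <= rel.Revision
where newer.DocumentID is null
order by d.DocumentID, rel.Revision, var.VariableName;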

Calculate time difference from previous record

I have a set of data and I want to determine the difference in days between the End_Time of one record and the Begin_Time of the next, taken two records at a time, to determine the processing time. I'm familiar with DateDiff('d', End_Time, Begin_Time) for getting the processing time on the same row, but how do I determine this against the previous record? For example, something like DateDiff(Record2.Begin_time, Record1.End_Time), then DateDiff(Record4.Begin_time, Record3.End_Time), then DateDiff(Record6.Begin_time, Record5.End_Time), and so on. It doesn't have to use the DateDiff function, I'm just using that to illustrate my question. Thanks
Record  Begin_Time  End_Time    Processing_Time
1       11/23/2020  11/24/2020  1
2       11/23/2020  11/24/2020  1
3       11/30/2020  11/30/2020  0
4       11/30/2020  11/30/2020  0
5       11/2/2020   11/3/2020   1
6       11/2/2020   11/3/2020   1
7       11/3/2020   11/5/2020   2
8       11/3/2020   11/5/2020   2
An approach could be like this:
Select DateDiff(YourTableEven.Begin_time, YourTableOdd.End_Time)
From YourTable AS YourTableEven
Join YourTable AS YourTableOdd ON YourTableOdd.Record = YourTableEven.Record - 1
Where YourTableEven.Record % 2 = 0
This pairs each even-numbered record with the odd-numbered record immediately before it.
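If the database supports window functions, the same pairing can also be written with LAG(), which reads the previous record's End_Time directly (a sketch assuming MySQL 8+ style DateDiff(later, earlier) returning days; YourTable and Record follow the naming above):
Select Record,
       DateDiff(Begin_Time, Prev_End) As Gap_Days
From (Select Record,
             Begin_Time,
             Lag(End_Time) Over (Order By Record) As Prev_End
      From YourTable
     ) t
Where Record % 2 = 0;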

Drop Duplicates based on Nearest Datetime condition

import pandas as pd
def nearest(items, pivot):
return min(items, key=lambda x: abs(x - pivot))
df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print(duplicatesDf)
print(df['START_TIME'].dt.date)
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one above. As you can see, there are multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, based on the condition that the retained record should have its END_TIME nearest to its START_TIME (at minute-level precision).
In this case, for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearest to its START_TIME 2020/06/01 20:47:26.000, compared to the record with ID 2 whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit? Many thanks.
IIUC, we can use .loc and idxmin() after creating a computed column that measures the elapsed time between the start and the end (this assumes START_TIME and END_TIME have already been converted to datetimes, e.g. with pd.to_datetime); we then apply idxmin() as a groupby operation on your CLASS_ID column.
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
    .groupby("CLASS_ID")["mins"]
    .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps:
Time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Minimum index from the time delta column whilst grouping by CLASS_ID:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
      .groupby("CLASS_ID")["mins"]
      .idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64