I have a table that currently looks like this (simplified to illustrate my issue):
Thing | Date
1     | 2022-12-12
2     | 2022-11-05
3     | 2022-11-18
4     | 2022-12-01
1     | 2022-11-02
2     | 2022-11-21
5     | 2022-12-03
5     | 2022-12-08
2     | 2022-11-18
1     | 2022-11-20
I would like to generate the following:
Thing | 2022-11 | 2022-12
1     | 2       | 1
2     | 3       | 0
3     | 1       | 0
4     | 0       | 1
5     | 0       | 2
I'm new to SQL and can't quite figure this out - would I use some sort of FOR loop equivalent in my SELECT clause? I'm happy to figure out the exact syntax myself, I just need someone to point me in the right direction.
Thank you!
You may use conditional aggregation, as follows:
Select Thing,
Count(Case When Date Between '2022-11-01' And '2022-11-30' Then 1 End) As '2022-11',
Count(Case When Date Between '2022-12-01' And '2022-12-31' Then 1 End) As '2022-12'
From table_name
Group By Thing
Order By Thing
The COUNT function counts only non-NULL values, so for each row that does not match the condition inside the CASE expression a NULL is returned and the row is not counted.
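For comparison only, here is a minimal pandas sketch of the same conditional-count idea; the DataFrame df and the month bucketing via strftime are my own illustration, not part of the SQL answer above:

import pandas as pd

# small frame mirroring the sample table above
df = pd.DataFrame({
    "Thing": [1, 2, 3, 4, 1, 2, 5, 5, 2, 1],
    "Date": pd.to_datetime([
        "2022-12-12", "2022-11-05", "2022-11-18", "2022-12-01", "2022-11-02",
        "2022-11-21", "2022-12-03", "2022-12-08", "2022-11-18", "2022-11-20",
    ]),
})

# count rows per Thing and per year-month bucket, filling absent buckets with 0
pivot = (
    df.assign(month=df["Date"].dt.strftime("%Y-%m"))
      .pivot_table(index="Thing", columns="month", values="Date", aggfunc="count", fill_value=0)
      .reset_index()
)
print(pivot)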
I have a table in which a row number window function runs over some IDs.
Every time new data arrives it is a full load, and the row numbers are assigned to the entire data set again, which is quite inefficient: a lot of resources get consumed and it is CPU intensive. This table is rebuilt every 15 to 30 minutes. I am trying to achieve the same thing incrementally, and then add the result of the incremental load to the last row number of each customer_ID.
So when new records come in, I want to save the max row_num for that customer, say max_row_num = 4. Now two new records arrive for that ID, so the row_num values in the incremental load are 1 and 2. The final output should be 4+1 and 4+2, so the new row numbers look like 1, 2, 3, 4, 5, 6, adding 1 and 2 to the max of the previous Row_num.
I want to implement the logic in PySpark, but I am open to a Python solution that I could convert to a PySpark DataFrame later.
Please suggest possible solutions.
Full load -- initial table
Row_num | customer_ID
1       | ABC123
2       | ABC123
3       | ABC123
1       | ABC125
2       | ABC125
1       | ABC225
2       | ABC225
3       | ABC225
4       | ABC225
5       | ABC225
Incremental load
Row_num | customer_ID
1       | ABC123
2       | ABC123
1       | ABC125
1       | ABC225
2       | ABC225
1       | ABC330
DESIRED OUTPUT
Row_num | customer_ID
1       | ABC123
2       | ABC123
3       | ABC123
4       | ABC123
1       | ABC125
2       | ABC125
3       | ABC125
1       | ABC225
2       | ABC225
3       | ABC225
4       | ABC225
5       | ABC225
6       | ABC225
1       | ABC330
If you are trying to insert the values with the new row number, you can join in the maximum existing row number:
insert into full (row_num, customer_id)
select i.row_num + coalesce(f.max_row_num, 0), i.customer_id
from incremental i
left join (
    select customer_id, max(row_num) as max_row_num
    from full
    group by customer_id
) f on i.customer_id = f.customer_id;
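Since the question asks for PySpark, here is a minimal sketch of the same join-and-shift idea with DataFrames. The names full_df and inc_df are assumptions standing for the full and incremental loads shown above; treat this as an untested outline rather than exact, production code:

from pyspark.sql import functions as F

# maximum existing row number per customer in the already-loaded table
max_per_cust = full_df.groupBy("customer_ID").agg(F.max("Row_num").alias("max_row_num"))

# shift the incremental row numbers by that maximum (0 for brand-new customers such as ABC330)
shifted = (
    inc_df.join(max_per_cust, on="customer_ID", how="left")
          .withColumn("Row_num", F.col("Row_num") + F.coalesce(F.col("max_row_num"), F.lit(0)))
          .select("Row_num", "customer_ID")
)

# append the shifted rows to the full table to get the desired output
result = full_df.unionByName(shifted)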
I'm trying to join several tables, where one of the tables is acting as a key-value store, and then after the joins find the maximum value in a column less than another column. As a simplified example, I have the following three tables:
Documents:
DocumentID | Filename     | LatestRevision
1          | D1001.SLDDRW | 18
2          | P5002.SLDPRT | 10
Variables:
VariableID | VariableName
1          | DateReleased
2          | Change
3          | Description
VariableValues:
DocumentID | VariableID | Revision | Value
1          | 2          | 1        | Created
1          | 3          | 1        | Drawing
1          | 2          | 3        | Changed Dimension
1          | 1          | 4        | 2021-02-01
1          | 2          | 11       | Corrected typos
1          | 1          | 16       | 2021-02-25
2          | 3          | 1        | Generic part
2          | 3          | 5        | Screw
2          | 2          | 4        | 2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0). What I want is the latest version of variables that are less than or equal to a revision which has a DateReleased, for example:
DocumentID | Filename     | Variable    | Value             | VariableRev | DateReleased | ReleasedRev
1          | D1001.SLDDRW | Change      | Changed Dimension | 3           | 2021-02-01   | 4
1          | D1001.SLDDRW | Description | Drawing           | 1           | 2021-02-01   | 4
1          | D1001.SLDDRW | Description | Drawing           | 1           | 2021-02-25   | 16
1          | D1001.SLDDRW | Change      | Corrected Typos   | 11          | 2021-02-25   | 16
2          | P5002.SLDPRT | Description | Generic Part      | 1           | 2021-02-24   | 4
How do I do this?
I figured this out. Add another JOIN at the start that brings in a second copy of the VariableValues table, selecting only the DateReleased variable, then make sure that all of the VariableValues revisions selected are less than or equal to the revision of that DateReleased row. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
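For illustration only, the same logic can be sketched in pandas, assuming the three tables are loaded as DataFrames named documents, variables and variable_values (those names are mine): join a second copy of the variable values restricted to the DateReleased variable, keep rows whose revision is at or below the released revision, and take the highest such revision per document, variable and release.

import pandas as pd

# attach variable names to the value rows
vv = variable_values.merge(variables, on="VariableID")

# one frame for the releases, one for everything else
released = (vv[vv["VariableName"] == "DateReleased"]
            [["DocumentID", "Revision", "Value"]]
            .rename(columns={"Revision": "ReleasedRev", "Value": "DateReleased"}))
others = vv[vv["VariableName"] != "DateReleased"]

# pair each variable row with every release of the same document,
# keep revisions at or below the release, then take the latest per group
paired = others.merge(released, on="DocumentID")
paired = paired[paired["Revision"] <= paired["ReleasedRev"]]
idx = paired.groupby(["DocumentID", "VariableName", "ReleasedRev"])["Revision"].idxmax()
result = paired.loc[idx].merge(documents, on="DocumentID")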
I have a set of data for which I want to determine the difference in days between the Begin_Time of one record and the End_Time of the previous record, for every 2 records, to determine the processing time. I'm familiar with DateDiff('d', End_Time, Begin_Time) for determining the processing time on the same row, but how do I determine this for the previous record? For example, something like DateDiff(Record2.Begin_Time, Record1.End_Time), then DateDiff(Record4.Begin_Time, Record3.End_Time), then DateDiff(Record6.Begin_Time, Record5.End_Time), etc. It doesn't have to use the DateDiff function; I'm just using that to illustrate my question. Thanks.
Record | Begin_Time | End_Time   | Processing_Time
1      | 11/23/2020 | 11/24/2020 | 1
2      | 11/23/2020 | 11/24/2020 | 1
3      | 11/30/2020 | 11/30/2020 | 0
4      | 11/30/2020 | 11/30/2020 | 0
5      | 11/2/2020  | 11/3/2020  | 1
6      | 11/2/2020  | 11/3/2020  | 1
7      | 11/3/2020  | 11/5/2020  | 2
8      | 11/3/2020  | 11/5/2020  | 2
An approach could be like this:
-- pair each even-numbered record with the odd-numbered record just before it
Select DateDiff(YourTableEven.Begin_time, YourTableOdd.End_Time)
From YourTable AS YourTableEven
Join YourTable AS YourTableOdd ON YourTableOdd.Record = YourTableEven.Record - 1
Where YourTableEven.Record % 2 = 0
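If you end up doing this in pandas instead, the same even/odd pairing can be sketched like this; the DataFrame df below is hypothetical and simply mirrors the sample rows above:

import pandas as pd

# hypothetical frame mirroring the sample data
df = pd.DataFrame({
    "Record": [1, 2, 3, 4, 5, 6, 7, 8],
    "Begin_Time": pd.to_datetime(["11/23/2020", "11/23/2020", "11/30/2020", "11/30/2020",
                                  "11/2/2020", "11/2/2020", "11/3/2020", "11/3/2020"], format="%m/%d/%Y"),
    "End_Time": pd.to_datetime(["11/24/2020", "11/24/2020", "11/30/2020", "11/30/2020",
                                "11/3/2020", "11/3/2020", "11/5/2020", "11/5/2020"], format="%m/%d/%Y"),
})

# pair each even record with the odd record just before it and take the gap in days
odd = df[df["Record"] % 2 == 1].reset_index(drop=True)
even = df[df["Record"] % 2 == 0].reset_index(drop=True)
gap_days = (even["Begin_Time"] - odd["End_Time"]).dt.days
print(gap_days)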
import pandas as pd

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df = pd.read_csv("C:/Files/input.txt", dtype=str)
# parse START_TIME up front so both frames support datetime operations
df['START_TIME'] = pd.to_datetime(df['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
print(duplicatesDf)
print(df['START_TIME'].dt.date)
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one mentioned above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, based on the condition that the retained record should have its END_TIME nearest to its START_TIME (with minute-level precision).
In this case:
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearest to its START_TIME 2020/06/01 20:47:26.000, compared to the record with ID 2 whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution, but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit? Many thanks.
IIUC, we can use .loc and idxmin() after creating a helper column that measures the elapsed time between the start and the end; we then apply idxmin() as a groupby operation on your CLASS_ID column.
df.loc[
    df.assign(mins=df["END_TIME"] - df["START_TIME"])
    .groupby("CLASS_ID")["mins"]
    .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps:
Time Delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Minimum index from the time delta column whilst grouping by CLASS_ID:
print(
    df.assign(mins=df["END_TIME"] - df["START_TIME"])
    .groupby("CLASS_ID")["mins"]
    .idxmin()
)
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
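One caveat: the subtraction used for the mins column assumes START_TIME and END_TIME are already datetime columns. Because the read_csv call in the question uses dtype=str, both columns would need converting first, for example:

# parse both timestamp columns before computing the time delta
for col in ["START_TIME", "END_TIME"]:
    df[col] = pd.to_datetime(df[col], format="%Y/%m/%d %H:%M:%S.%f")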