calculating the difference between two rows or alternative - sql

I would like to take the query below and filter it with a where win = ... condition. I would also like to add a column showing how many races it took to reach each row that satisfies the condition, e.g. for win = 2.
I've tried calculating the number of rows between matches, but my attempt was wildly wrong.
select
date, time, raceid, win
from master
where date = @date
order by time
DATE TIME RACEID WIN
2019-01-06 00:40:00 4445 2
2019-01-06 00:50:00 4432 0
2019-01-06 01:00:00 4441 2
2019-01-06 01:10:00 4446 2
2019-01-06 01:20:00 4433 1
2019-01-06 01:30:00 4439 1
2019-01-06 01:40:00 4447 2
2019-01-06 01:50:00 4434 2
2019-01-06 02:00:00 4442 0
2019-01-06 02:10:00 4448 0
2019-01-06 02:20:00 4435 2
2019-01-06 02:30:00 4443 2
2019-01-06 02:40:00 4449 2
2019-01-06 02:50:00 4436 0
2019-01-06 02:50:00 4444 2
Desired output for win = 2, with the extra column counting the races it took to reach each win:
DATE TIME RACEID WIN RacestoWin
2019-01-06 00:40:00 4445 2 1
2019-01-06 01:00:00 4441 2 2
2019-01-06 01:10:00 4446 2 1
2019-01-06 01:40:00 4447 2 3
2019-01-06 01:50:00 4434 2 1
2019-01-06 02:20:00 4435 2 3
2019-01-06 02:30:00 4443 2 1
2019-01-06 02:40:00 4449 2 1
2019-01-06 02:50:00 4444 2 2
Is there a simple way of doing this? I'm not the best at SQL, so any guidance would be greatly appreciated!

I see. You are counting the rows between the wins. Basically, you want to assign each row to a group, where the group is the cumulative number of 2s on or after that row when ordered by time. Then, within each group, you can use row_number() or, in this case, plain aggregation, because you know the last row of each group is a 2:
select date, max(time) as time, 2 as win, count(*) as racestowin
from (select m.*,
             sum(case when m.win = 2 then 1 else 0 end)
                 over (partition by m.date order by m.time desc) as grp
      from master m
     ) m
where grp > 0  -- ignore races after the last win of the day
group by date, grp
order by date, max(time);
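As a cross-check of the grouping idea outside the database, here is a minimal Python sketch. It assumes the rows are already sorted by time and that only the win values matter; the function name races_to_win is hypothetical and not part of the original query.

```python
def races_to_win(wins, target=2):
    """For each row equal to `target`, count the rows it took to reach it.

    `wins` must be ordered by time. Rows after the last match are ignored.
    """
    counts = []
    since_last = 0
    for w in wins:
        since_last += 1
        if w == target:
            counts.append(since_last)
            since_last = 0  # start counting again after each win
    return counts


# WIN column from the sample data, in time order
wins = [2, 0, 2, 2, 1, 1, 2, 2, 0, 0, 2, 2, 2, 0, 2]
print(races_to_win(wins))  # [1, 2, 1, 3, 1, 3, 1, 1, 2]
```

The printed list matches the RacestoWin column in the desired output.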


Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!
import pandas as pd

p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
# absolute change relative to the previous row (the end of the preceding period)
df['CHG'] = df['val'].diff().abs()
df = df.drop(columns='val')
df
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
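For readers who want to check the start/stop pairing without pandas, the same idea can be sketched with the standard library. This is only an illustration under two assumptions: the dates are ISO strings (so they sort chronologically as text), and the first row, which has no start date, is dropped. The names period1, period2, and to_ranges are hypothetical.

```python
period1 = {"2020-06-22": 310.62, "2020-06-26": 300.05,
           "2020-09-23": 322.64, "2020-10-30": 326.54}
period2 = {"2020-06-23": 312.05, "2020-09-02": 357.70,
           "2020-10-12": 352.43, "2021-01-25": 384.39}


def to_ranges(p1, p2):
    """Return (start, stop, change, period) tuples for consecutive end dates."""
    # Tag every end date with its period type, then sort chronologically.
    points = sorted([(d, v, 1) for d, v in p1.items()] +
                    [(d, v, 2) for d, v in p2.items()])
    ranges = []
    # Each period starts where the previous one stopped.
    for (start, v0, _), (stop, v1, period) in zip(points, points[1:]):
        ranges.append((start, stop, round(abs(v1 - v0), 2), period))
    return ranges


ranges = to_ranges(period1, period2)
for r in ranges:
    print(r)
```

Each tuple carries both the start and the stop date, so it can serve as a key for any further grouping by date range.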
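# df1 and df2 are assumed to be the two series from the question (period1 and period2).
# If needed, convert their indexes to datetimes first: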
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

SQL - Count total IDs each day between dates

Here is what my data looks like
ID StartDate EndDate
1 1/1/2019 1/15/2019
2 1/10/2019 1/11/2019
3 2/5/2020 3/10/2020
4 3/10/2019 3/19/2019
5 5/1/2020 5/4/2020
I am trying to get a list of every date in my data set and how many IDs fall on each date, aggregated to the date level. So ID 1 would appear in the records for 1/1/2019, 1/2/2019, and so on through 1/15/2019.
I am not sure how to do this. All help is appreciated.
If you don't have a calendar table (highly recommended), you can perform this task with an ad-hoc tally table in concert with a CROSS APPLY
Example
Declare @YourTable Table ([ID] varchar(50),[StartDate] date,[EndDate] date)
Insert Into @YourTable Values
 (1,'1/1/2019','1/15/2019')
,(2,'1/10/2019','1/11/2019')
,(3,'2/5/2020','3/10/2020')
,(4,'3/10/2019','3/19/2019')
,(5,'5/1/2020','5/4/2020')

Select A.ID
      ,B.Date
 From  @YourTable A
 Cross Apply (
    -- one row per day from StartDate through EndDate
    Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) Date=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
     From  master..spt_values n1,master..spt_values n2
 ) B
Returns
ID Date
1 2019-01-01
1 2019-01-02
1 2019-01-03
1 2019-01-04
1 2019-01-05
1 2019-01-06
1 2019-01-07
1 2019-01-08
1 2019-01-09
....
1 2019-01-15
2 2019-01-10
2 2019-01-11
....
5 2020-05-01
5 2020-05-02
5 2020-05-03
5 2020-05-04
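The result above lists one row per ID and date; the question also asks how many IDs fall on each date, which is a count grouped by date. A minimal Python sketch of that last step, assuming the sample rows from the question (the function name ids_per_day is hypothetical):

```python
from collections import Counter
from datetime import date, timedelta

rows = [
    (1, date(2019, 1, 1), date(2019, 1, 15)),
    (2, date(2019, 1, 10), date(2019, 1, 11)),
    (3, date(2020, 2, 5), date(2020, 3, 10)),
    (4, date(2019, 3, 10), date(2019, 3, 19)),
    (5, date(2020, 5, 1), date(2020, 5, 4)),
]


def ids_per_day(rows):
    """Expand each (id, start, end) row to its days and count IDs per day."""
    counts = Counter()
    for _id, start, end in rows:
        # +1 because the end date is inclusive
        for offset in range((end - start).days + 1):
            counts[start + timedelta(days=offset)] += 1
    return counts


counts = ids_per_day(rows)
for day in sorted(counts):
    print(day, counts[day])
```

In the SQL version, the equivalent would be to group the expanded rows by the date column and count the IDs.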

Query required for inventory

I have a table in which I have some inventory of Rooms available.
HotelID RoomID InventoryDate Qty
600 12 2019-01-01 10
600 12 2019-01-02 10
600 12 2019-01-03 10
600 12 2019-01-04 10
600 12 2019-01-05 15
600 12 2019-01-06 15
600 12 2019-01-07 10
600 12 2019-01-08 20
600 12 2019-01-09 20
I need the result set below:
HotelID RoomID StartDate EndDate Qty
600 12 2019-01-01 2019-01-04 10
600 12 2019-01-05 2019-01-06 15
600 12 2019-01-07 2019-01-07 10
600 12 2019-01-08 2019-01-09 20
I am not sure from where to start. Please guide. Thanks.
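You can try the query below. Be aware that it groups by Qty, so the two separate runs with Qty = 10 in the sample data are merged into one row: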
select HotelID,RoomID,min(InventoryDate),max(InventoryDate),Qty
from tablename
group by HotelID,RoomID,Qty
You can use aggregate functions for this; in your context, MIN() and MAX() give the start and end dates. Note that grouping only by HotelID and RoomID returns a single row covering the whole date range:
SELECT HotelID,RoomID,MIN(InventoryDate) as StartDate,MAX(InventoryDate) as EndDate,MAX(Qty)as Qty
FROM Tablename
GROUP BY HotelID,RoomID
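You can use the query below to get the desired output. Because the same Qty can occur in separate date runs (Qty = 10 does in the sample), each run needs its own group key; the difference between two row numbers provides one:
SELECT hotelid,
       roomid,
       Min(inventorydate) AS StartDate,
       Max(inventorydate) AS EndDate,
       qty
FROM   (SELECT t.*,
               -- constant within a run of consecutive rows with the same qty
               Row_number() OVER (PARTITION BY hotelid, roomid ORDER BY inventorydate)
             - Row_number() OVER (PARTITION BY hotelid, roomid, qty ORDER BY inventorydate) AS grp
        FROM   inventory_table t) x
GROUP  BY hotelid,
          roomid,
          qty,
          grp
ORDER  BY hotelid, roomid, StartDate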
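In the sample data, Qty = 10 occurs in two separate runs (2019-01-01 to 2019-01-04, and 2019-01-07 alone), so the desired output needs one row per run of consecutive equal values rather than one row per distinct Qty. A minimal Python sketch of that run-based grouping, assuming one row per day with no missing dates (the function name collapse_runs is hypothetical):

```python
from itertools import groupby

rows = [
    (600, 12, "2019-01-01", 10),
    (600, 12, "2019-01-02", 10),
    (600, 12, "2019-01-03", 10),
    (600, 12, "2019-01-04", 10),
    (600, 12, "2019-01-05", 15),
    (600, 12, "2019-01-06", 15),
    (600, 12, "2019-01-07", 10),
    (600, 12, "2019-01-08", 20),
    (600, 12, "2019-01-09", 20),
]


def collapse_runs(rows):
    """Collapse consecutive rows sharing (hotel, room, qty) into date ranges."""
    # Sort by hotel, room, date so that groupby sees each run contiguously.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    result = []
    # groupby only merges adjacent items, which is exactly what a "run" is.
    for (hotel, room, qty), run in groupby(ordered, key=lambda r: (r[0], r[1], r[3])):
        run = list(run)
        result.append((hotel, room, run[0][2], run[-1][2], qty))
    return result


for row in collapse_runs(rows):
    print(row)
```

This prints four rows, matching the desired result set, including the separate single-day row for 2019-01-07.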

Pandas time difference calculation error

I have two time columns in my dataframe, called date1 and date2.
I had always assumed both were in datetime format. However, I now have to calculate the difference in days between the two, and it doesn't work.
I run the following code to analyse the data:
df['month1'] = pd.DatetimeIndex(df['date1']).month
df['month2'] = pd.DatetimeIndex(df['date2']).month
print(df[["date1", "date2", "month1", "month2"]].head(10))
print(df["date1"].dtype)
print(df["date2"].dtype)
The output is:
date1 date2 month1 month2
0 2016-02-29 2017-01-01 1 1
1 2016-11-08 2017-01-01 1 1
2 2017-11-27 2009-06-01 1 6
3 2015-03-09 2014-07-01 1 7
4 2015-06-02 2014-07-01 1 7
5 2015-09-18 2017-01-01 1 1
6 2017-09-06 2017-07-01 1 7
7 2017-04-15 2009-06-01 1 6
8 2017-08-14 2014-07-01 1 7
9 2017-12-06 2014-07-01 1 7
datetime64[ns]
object
As you can see, the month for date1 is not calculated correctly!
The final operation, which does not work is:
df["date_diff"] = (df["date1"]-df["date2"]).astype('timedelta64[D]')
which leads to the following error:
incompatible type [object] for a datetime/timedelta operation
I first thought it might be due to date2, so I tried:
df["date2_new"] = pd.to_datetime(df['date2'] - 315619200, unit = 's')
leading to:
unsupported operand type(s) for -: 'str' and 'int'
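Does anyone have an idea what I need to change?
date2 has dtype object (it holds strings), so convert both columns with pd.to_datetime first, then use the .dt accessor with the days attribute: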
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df['date_diff'] = (df['date1'] - df['date2']).dt.days
Output:
date1 date2 month1 month2 date_diff
0 2016-02-29 2017-01-01 1 1 -307
1 2016-11-08 2017-01-01 1 1 -54
2 2017-11-27 2009-06-01 1 6 3101
3 2015-03-09 2014-07-01 1 7 251
4 2015-06-02 2014-07-01 1 7 336
5 2015-09-18 2017-01-01 1 1 -471
6 2017-09-06 2017-07-01 1 7 67
7 2017-04-15 2009-06-01 1 6 2875
8 2017-08-14 2014-07-01 1 7 1140
9 2017-12-06 2014-07-01 1 7 1254
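The underlying rule is the same outside pandas: date strings have to be parsed before they can be subtracted. A minimal standard-library sketch, assuming the dates are formatted as YYYY-MM-DD (the helper name day_diff is hypothetical):

```python
from datetime import datetime


def day_diff(d1, d2, fmt="%Y-%m-%d"):
    """Parse two date strings and return d1 - d2 in whole days."""
    return (datetime.strptime(d1, fmt) - datetime.strptime(d2, fmt)).days


print(day_diff("2016-02-29", "2017-01-01"))  # -307
print(day_diff("2017-09-06", "2017-07-01"))  # 67
```

Both values agree with the corresponding date_diff rows in the output above.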

PostgreSQL - rank over rows listed in blocks of 0 and 1

I have a table that looks like:
id code date1 date2 block
--------------------------------------------------
20 1234 2017-07-01 2017-07-31 1
15 1234 2017-06-01 2017-06-30 1
13 1234 2017-05-01 2017-05-31 0
11 1234 2017-03-01 2017-03-31 0
9 1234 2017-02-01 2017-02-28 1
8 1234 2017-01-01 2017-01-31 0
7 1234 2016-11-01 2016-11-30 0
6 1234 2016-10-01 2016-10-31 1
2 1234 2016-09-01 2016-09-30 1
I need to rank the rows according to the blocks of 0's and 1's, like:
id code date1 date2 block desired_rank
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 1
13 1234 2017-05-01 2017-05-31 0 2
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 4
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 5
2 1234 2016-09-01 2016-09-30 1 5
I've tried to use rank() and dense_rank(), but the result I end up with is:
id code date1 date2 block dense_rank()
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 2
13 1234 2017-05-01 2017-05-31 0 1
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 3
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 4
2 1234 2016-09-01 2016-09-30 1 5
In the last table, the rank doesn't care about the rows, it just takes all the 1's and 0's as a unit and sets an ascending count starting at the first 1 and 0.
My query goes like this:
CREATE TEMP TABLE data (id integer,code text, date1 date, date2 date, block integer);
INSERT INTO data VALUES
(20,'1234', '2017-07-01','2017-07-31',1),
(15,'1234', '2017-06-01','2017-06-30',1),
(13,'1234', '2017-05-01','2017-05-31',0),
(11,'1234', '2017-03-01','2017-03-31',0),
(9, '1234', '2017-02-01','2017-02-28',1),
(8, '1234', '2017-01-01','2017-01-31',0),
(7, '1234', '2016-11-01','2016-11-30',0),
(6, '1234', '2016-10-01','2016-10-31',1),
(2, '1234', '2016-09-01','2016-09-30',1);
SELECT *,dense_rank() OVER (PARTITION BY code,block ORDER BY date2 DESC)
FROM data
ORDER BY date2 DESC;
By the way, the database is PostgreSQL.
I hope there's a workaround... Thanks :)
Edit: Note that the blocks of 0's and 1's aren't equal.
There's no way to get this result using a single window function; you need two nested ones. The inner query flags the first row of each block, and the outer query sums those flags to produce the rank:
SELECT *,
       Sum(flag) -- now sum the 0/1 to create the "rank"
       Over (PARTITION BY code
             ORDER BY date2 DESC) AS desired_rank
FROM
 (
   SELECT *,
          CASE
            WHEN Lag(block) -- check if this is the 1st row of a new block
                 Over (PARTITION BY code
                       ORDER BY date2 DESC) = block
            THEN 0
            ELSE 1
          END AS flag
   FROM data
 ) AS dt
ORDER BY date2 DESC;
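The flag-then-sum logic can be traced by hand with a short Python sketch. It assumes the block values are already in the desired order (date2 descending) for a single code; the function name block_ranks is hypothetical.

```python
def block_ranks(blocks):
    """Number each run of equal consecutive values, starting at 1."""
    ranks = []
    rank = 0
    previous = None  # plays the role of Lag(block) on the first row
    for b in blocks:
        if b != previous:  # first row of a new block: flag = 1
            rank += 1
        ranks.append(rank)  # running sum of the flags
        previous = b
    return ranks


# block column from the sample table, ordered by date2 descending
print(block_ranks([1, 1, 0, 0, 1, 0, 0, 1, 1]))  # [1, 1, 2, 2, 3, 4, 4, 5, 5]
```

The printed list matches the desired_rank column in the question.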