Last change column in Impala SQL

I would like to create a new column indicating when the last change was made to the price column. If the price changes, the column should show the current date; if the price is unchanged, it should show the date of the last change. Everything should be written without loops or DECLARE blocks, because it has to work in Impala.
Input:
date price
2023-01-31 150
2023-01-30 150
2023-01-29 100
2023-01-28 100
2023-01-27 100
2023-01-26 50
Output:
date price valid_from
2023-01-31 150 2023-01-30
2023-01-30 150 2023-01-30
2023-01-29 100 2023-01-27
2023-01-28 100 2023-01-27
2023-01-27 100 2023-01-27
2023-01-26 50 2023-01-26
Thanks.
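For reference, a common loop-free way to express this is with window functions, which Impala supports: use LAG(price) OVER (ORDER BY date) to flag rows where the price differs from the previous day, turn a running SUM of that flag into a run id, and take MIN(date) over each run as valid_from (in practice this is usually split across subqueries or CTEs). The pandas sketch below only illustrates that grouping logic on the sample data; the names df, chg and run are invented for the example, and it is not the Impala answer itself.

import pandas as pd

# Illustration of the run-grouping idea (not Impala SQL): flag price changes,
# number the constant-price runs, and take each run's earliest date.
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-26', '2023-01-27', '2023-01-28',
                            '2023-01-29', '2023-01-30', '2023-01-31']),
    'price': [50, 100, 100, 100, 150, 150],
})
df = df.sort_values('date')
chg = df['price'].ne(df['price'].shift())   # True where the price changed
run = chg.cumsum()                          # id of each constant-price run
df['valid_from'] = df.groupby(run)['date'].transform('min')
print(df.sort_values('date', ascending=False))

This reproduces the expected output above: each row's valid_from is the first date of its constant-price run.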

Related

Get the value in a dataframe based on a value and a date in another dataframe

I tried countless answers to similar problems here on SO but couldn't find anything that works for this scenario. It's driving me nuts.
I have these two Dataframes:
df_op:
index  Date                 Close   Name  LogRet
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059
1      2022-11-29 00:00:00  280.57  QQQ   -0.0076
2      2022-12-13 00:00:00  342.46  ADBE   0.0126
3      2022-12-13 00:00:00  256.92  MSFT   0.0173
df_quotes:
index  Date                 Close   Name
72     2022-11-29 00:00:00  141.17  AAPL
196    2022-11-29 00:00:00  240.33  MSFT
73     2022-11-30 00:00:00  148.03  AAPL
197    2022-11-30 00:00:00  255.14  MSFT
11     2022-11-30 00:00:00  293.36  QQQ
136    2022-12-01 00:00:00  344.11  ADBE
198    2022-12-01 00:00:00  254.69  MSFT
12     2022-12-02 00:00:00  293.72  QQQ
I would like to add a column to df_op that indicates the close of the stock in df_quotes 2 days later. For example, the first row of df_op should become:
index  Date                 Close   Name  LogRet   Next
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059  254.69
In other words:
for each row in df_op, find the row in df_quotes with the same Name and a Date two days later, and copy its Close into a new column 'Next' in df_op.
I tried tens of combinations like this without success:
df_quotes[df_quotes['Date'].isin(df_op['Date'] + pd.DateOffset(days=2)) & df_quotes['Name'].isin(df_op['Name'])]
How can I do this without resorting to loops?
Try this:
#first convert to datetime
df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])
#merge on Date and Name, but the date is offset 2 business days
(pd.merge(df_op,
          df_quotes[['Date','Close','Name']].rename({'Close':'Next'}, axis=1),
          left_on=['Date','Name'],
          right_on=[df_quotes['Date'] - pd.tseries.offsets.BDay(2), 'Name'],
          how='left')
   .drop(['Date_x','Date_y'], axis=1))
Output:
Date index Close Name LogRet Next
0 2022-11-29 0 240.33 MSFT -0.0059 254.69
1 2022-11-29 1 280.57 QQQ -0.0076 NaN
2 2022-12-13 2 342.46 ADBE 0.0126 NaN
3 2022-12-13 3 256.92 MSFT 0.0173 NaN
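The same join can also be written with an explicit helper column instead of mixing labels and arrays in right_on. This is just a sketch of the idea; the names q, MatchDate and out are invented here, and 'Next' mirrors the answer above.

# Build an explicit match key on the quotes side: the quote date minus
# 2 business days, so it lines up with the df_op date it should feed.
q = df_quotes[['Date', 'Close', 'Name']].rename(columns={'Close': 'Next'})
q['MatchDate'] = q['Date'] - pd.tseries.offsets.BDay(2)

out = (df_op.merge(q[['MatchDate', 'Name', 'Next']],
                   left_on=['Date', 'Name'],
                   right_on=['MatchDate', 'Name'],
                   how='left')
            .drop(columns='MatchDate'))

Rows of df_op with no matching quote two business days later end up with NaN in Next, just as in the output above.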

Transpose a table with multiple ID rows and different assessment dates

I would like to transpose my table to see trends in the data. The data is formatted as such:
A UserId can occur multiple times because of different assessment periods. Let's say a user with ID 1 incurred some charges in January, February, and March; there are currently three rows containing the data from these periods.
I would like to see everything as one row per user ID, independently of the number of periods (up to 12 months).
This would enable me to see and compare changes between assessment periods and attributes.
Current format:
UserId AssessmentDate Attribute1 Attribute2 Attribute3
1 2020-01-01 00:00:00.000 -01:00 20.13 123.11 405.00
1 2021-02-01 00:00:00.000 -01:00 1.03 78.93 11.34
1 2021-03-01 00:00:00.000 -01:00 15.03 310.10 23.15
2 2021-02-01 00:00:00.000 -01:00 14.31 41.30 63.20
2 2021-03-01 00:03:45.000 -01:00 0.05 3.50 1.30
Desired format:
UserId LastAssessmentDate Attribute1_M-2 Attribute2_M-1 ... Attribute3_M0
1 2021-03-01 00:00:00.000 -01:00 20.13 123.11 23.15
2 2021-03-01 00:03:45.000 -01:00 NULL 41.30 1.30
Either SQL or Pandas - both work for me. Thanks for the help!
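Since pandas is an option, one possible sketch is to rank each user's assessments backwards from the latest one (M0) and then pivot, so every attribute/period pair becomes its own column. The frame name df and the helper column period are assumptions for the example, and the date parsing is assumed to succeed on your timestamp format.

import pandas as pd

# period = 0 for the latest assessment (M0), 1 for the one before (M-1), ...
df['AssessmentDate'] = pd.to_datetime(df['AssessmentDate'])
df = df.sort_values(['UserId', 'AssessmentDate'])
df['period'] = df.groupby('UserId').cumcount(ascending=False)

wide = df.pivot(index='UserId', columns='period',
                values=['Attribute1', 'Attribute2', 'Attribute3'])
wide.columns = [f'{attr}_M0' if p == 0 else f'{attr}_M-{p}'
                for attr, p in wide.columns]

last = df.groupby('UserId')['AssessmentDate'].max().rename('LastAssessmentDate')
result = last.to_frame().join(wide).reset_index()

This produces every Attribute_M-n combination (NaN where a user has fewer periods); from there you can keep whichever columns you need for the comparison.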

Pandas resample only when makes sense

I have a time series that is very irregular. The time difference between two records can be 1 s or 10 days.
I want to resample the data every 1 h, but only when the sequential records are less than 1 h apart.
How do I approach this without making too many loops?
In the example below, I would like to resample only rows 5-6 (delta difference is 10s) and rows 6-7 (delta difference is 50min).
The others should remain as they are.
tmp=vals[['datumtijd','filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0
You can be a little explicit about this by using groupby on the hour-floor of the time stamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all fall within the same hour (they differ only in minutes and seconds), so they all get grouped together.
Also, see this related post.
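If you really only want to aggregate stretches where consecutive records are less than an hour apart, and leave isolated readings untouched, a slightly different grouping can be built from the gaps themselves. This is a sketch, not from the original answer; it assumes the frame is tmp with columns 'datumtijd' (already datetime) and 'filter data' as above, and the names gap, run, hourly_or_asis and out are invented here.

import pandas as pd

# Start a new run whenever the gap to the previous row is >= 1 hour, then
# resample only the runs that actually contain more than one record.
tmp = tmp.sort_values('datumtijd')
gap = tmp['datumtijd'].diff()
run = (gap >= pd.Timedelta('1H')).cumsum()      # run id per burst of close records

def hourly_or_asis(g):
    if len(g) == 1:                             # isolated reading: keep as-is
        return g
    return g.resample('1H', on='datumtijd').mean().reset_index()

out = (tmp.groupby(run, group_keys=False)
          .apply(hourly_or_asis)
          .reset_index(drop=True))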

How to fill pandas dataframe with max() values

I have a dataframe where each day starts at 7:00 and ends at 22:10, in 5-minute intervals.
The df contains around 200 days (weekends and some specific days are excluded).
Date Time Volume
0 2019-09-03 07:00:00 70000 778
1 2019-09-03 07:05:00 70500 1267
2 2019-09-03 07:10:00 71000 1208
3 2019-09-03 07:15:00 71500 715
4 2019-09-03 07:20:00 72000 372
I need another column, let's call it 'lastdayVolume', with the maximum Volume of the prior day.
For example, if on 2019-09-03 (between 7:00 and 22:10) the maximum Volume in a single row is 50000, then every row of 2019-09-04 should have the value 50000 in the column 'lastdayVolume'.
How would you do this without decreasing the length of the dataframe?
Have you tried
df.resample('1D', on='Date').max()
This should give you one row per day with the maximal value at this day.
EDIT: To combine that with the old data, you can use a left join. It's a bit messy, but
pd.merge(df,
         df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'),
         left_on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date),
         right_index=True,
         how='left')
which gives:
Date Time Volume lastdayVolume
0 2019-09-03 07:00:00 70000 778 800.0
1 2019-09-03 07:05:00 70500 1267 800.0
2 2019-09-03 07:10:00 71000 1208 800.0
3 2019-09-03 07:15:00 71500 715 800.0
4 2019-09-03 07:20:00 72000 372 800.0
0 2019-09-02 08:00:00 70000 800 NaN
seems to work out.
Equivalently you can use the slightly shorter
df.join(df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'),
        on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))
here.
The first DataFrame is your old one, the second is the one I calculated above (with appropriate renaming). For the values to merge on, the left side uses your 'Date' column (which contains timestamps), offset by one day and converted to an actual date; the right side simply uses the index.
The left join ensures you don't accidentally drop rows if you have no transactions the day before.
EDIT 2: To compute that maximum over a certain time range only, you can use
df.set_index('Date').between_time('15:30:00', '22:10:00')
to filter the DataFrame. Afterwards resample as before
df.join(df.set_index('Date').between_time('15:30:00', '22:10:00').resample('1D')...
where the on parameter in the resample is no longer necessary as the Date went into the index.
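Put together, the EDIT 2 variant could look roughly like this; it is only a sketch assembled from the pieces above (the 15:30-22:10 window comes from the example, and daily_max and result are invented names):

# Restrict each day to 15:30-22:10, take that window's daily maximum, then
# attach it to every row of the following day via the shifted-date join key.
daily_max = (df.set_index('Date')
               .between_time('15:30:00', '22:10:00')
               .resample('1D')['Volume'].max()
               .rename('lastdayVolume'))

result = df.join(daily_max,
                 on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))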

subtraction in SQL giving incorrect value

I have a table that contains Id, Date and a float value as below:
ID startDt Days
1328 2015-04-01 00:00:00.000 15
2444 2015-04-03 00:00:00.000 5.7
1658 2015-05-08 00:00:00.000 6
1329 2015-05-12 00:00:00.000 28.5
1849 2015-06-23 00:00:00.000 28.5
1581 2015-06-30 00:00:00.000 25.5
3535 2015-07-03 00:00:00.000 3
3536 2015-08-13 00:00:00.000 13.5
2166 2015-09-22 00:00:00.000 28.5
3542 2015-11-05 00:00:00.000 13.5
3543 2015-12-18 00:00:00.000 6
2445 2015-12-25 00:00:00.000 5.7
4096 2015-12-31 00:00:00.000 7.5
2446 2016-01-01 00:00:00.000 5.7
4287 2016-02-11 00:00:00.000 13.5
4288 2016-02-18 00:00:00.000 13.5
4492 2016-03-02 00:00:00.000 19.7
2447 2016-03-25 00:00:00.000 5.7
I am using a stored procedure which adds up the Days then subtracts it from a fixed value stored in a variable.
The total in the table is 245 and the variable is set to 245, so I should get a value of 0 when subtracting the two. However, I am getting a value of 5.6843418860808E-14 instead. I can't figure out why this is the case, and I have gone and re-entered each number in the table but I still get the same result.
This is my sql statement that I am using to calculate the result:
Declare @AL_Taken as float
Declare @AL_Remaining as float
Declare @EntitledLeave as float
Set @EntitledLeave = 245
Set @AL_Taken = (select sum(Days) from tblALMain)
Set @AL_Remaining = @EntitledLeave - @AL_Taken
Select @EntitledLeave, @AL_Taken, @AL_Remaining
The select returns the following:
245, 245, 5.6843418860808E-14
Can anyone suggest why I am getting this number when I should be getting 0?
Thanks for the help
Rob
I changed the data type to Decimal as Tab Allenman suggested and this resolved my issue. I still don't understand why I didn't get zero when using float, as all the values added up to 245 exactly (I even re-entered the values manually) and 245 - 245 should have given me 0.
Thanks again for all the comments and explanations.
Rob
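For what it's worth, the usual explanation: FLOAT is a binary (IEEE 754) type, and values such as 5.7 or 19.7 have no exact binary representation, so each one is stored as a very close approximation. Adding up a column of such approximations can leave a residue on the order of 1e-14 at this magnitude, which is exactly what you saw; DECIMAL stores base-10 digits exactly, which is why switching fixed it. A quick illustration in Python (any language with IEEE 754 doubles behaves the same way; the numbers below are the standard textbook example, not the table's values):

from decimal import Decimal

# Binary doubles: 0.1 and 0.2 are stored as approximations, so their sum
# is not exactly 0.3 -- the leftover is on the order of 1e-17.
print(0.1 + 0.2 == 0.3)    # False
print(0.1 + 0.2 - 0.3)     # about 5.55e-17, not 0

# Base-10 decimals represent these values exactly, so the residue vanishes.
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))    # True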