How to merge two dataframes based on time? - pandas

Trying to merge two dataframes. The first is big, the second is small, and both have a datetime set as the index.
I want each row of the second one merged in between the rows of the first one, sorted by time.
df1:
df1 = pd.read_csv(left_inputfile_to_read, decimal=".", sep=';', low_memory=False)
df1.columns = ['FLIGHT_ID', 'X', 'Y', 'MODE_C', 'SPEED', 'HEADING', 'TRK_ROCD', 'TIJD']
df1['datetime'] = pd.to_datetime(df1['TIJD'], format="%d-%m-%Y %H:%M:%S")
df1.set_index('datetime', inplace=True)
print(df1)
FLIGHT_ID X Y MODE_C SPEED HEADING TRK_ROCD TIJD
datetime
2019-01-28 00:26:56 20034026 -13345 -1923 230.0 414 88 NaN 28-1-2019 00:26:56
2019-01-28 00:27:00 20034026 -13275 -1923 230.0 414 88 NaN 28-1-2019 00:27:00
2019-01-28 00:27:05 20034026 -13204 -1923 230.0 414 88 NaN 28-1-2019 00:27:05
2019-01-28 00:27:10 20034026 -13134 -1923 230.0 414 88 NaN 28-1-2019 00:27:10
2019-01-28 00:27:15 20034026 -13064 -1923 230.0 414 88 NaN 28-1-2019 00:27:15
... ... ... ... ... ... ... ... ...
2019-01-29 00:08:32 20035925 13443 -531 230.0 257 85 NaN 29-1-2019 00:08:32
2019-01-29 00:08:37 20035925 13487 -526 230.0 257 85 NaN 29-1-2019 00:08:37
2019-01-29 00:08:42 20035925 13530 -520 230.0 257 85 NaN 29-1-2019 00:08:42
2019-01-29 00:08:46 20035925 13574 -516 230.0 257 85 NaN 29-1-2019 00:08:46
2019-01-29 00:08:51 20035925 13617 -510 230.0 257 85 NaN 29-1-2019 00:08:51
551446 rows × 8 columns
df2:
df2 = pd.read_csv(right_inputfile_to_read, decimal=".", sep=';', low_memory=False)
df2['datetime'] = pd.to_datetime(df2['T_START'], dayfirst=True)
df2.set_index('datetime', inplace=True)
df2.drop(columns=['T_START', 'T_END', 'AIRFIELD'], inplace=True)
print(df2)
QNH MODE_C_CORRECTION
datetime
2019-01-28 02:14:00 1022 235
2019-01-28 02:14:00 1022 235
2019-01-28 02:16:00 1019 155
2019-01-28 02:21:00 1019 155
2019-01-28 02:36:00 1019 155
... ... ...
2019-01-28 21:56:00 1014 21
2019-01-28 22:56:00 1014 21
2019-01-28 23:26:00 1014 21
2019-01-28 23:29:00 1014 21
2019-01-28 23:52:00 1014 21
[69 rows x 2 columns]
The idea is that the first row of df2 should be inserted somewhere at 2019-01-28 02:14:00.
I have spent hours on Stack Overflow and in the pandas documentation (merge, join, concat) but cannot find the right solution.
The next step would be to interpolate the values in column 'QNH' onto the rows of df1, based on time.
Any help greatly appreciated!

Just concatenate the two DataFrames and sort on the shared DatetimeIndex:
df = pd.concat([df1, df2]).sort_index()
For the next step you can use pandas.DataFrame.interpolate.
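A minimal, self-contained sketch of that approach, using two tiny frames with invented values shaped like the ones in the question (timestamps and QNH readings are made up for illustration):

```python
import pandas as pd

# Big dataframe: dense 5-second samples (values invented for illustration)
df1 = pd.DataFrame(
    {"SPEED": [414, 414, 414]},
    index=pd.to_datetime(
        ["2019-01-28 02:13:58", "2019-01-28 02:14:03", "2019-01-28 02:16:02"]
    ),
)

# Small dataframe: sparse QNH observations
df2 = pd.DataFrame(
    {"QNH": [1022, 1019]},
    index=pd.to_datetime(["2019-01-28 02:14:00", "2019-01-28 02:16:00"]),
)

# Concatenate and sort on the shared DatetimeIndex, so df2's rows land
# between the df1 rows that surround them in time
df = pd.concat([df1, df2]).sort_index()

# Time-weighted interpolation then fills QNH for the rows that came from df1
df["QNH"] = df["QNH"].interpolate(method="time")
print(df)
```

Note that `method="time"` weights the interpolation by the actual time gaps, which matters here because df1's sampling interval and df2's observation interval differ.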

Related

impute missing values in a multi index data frame with taking the last available value of second index level

I'm struggling with the following problem:
I have a multilevel index data frame of time series data of following structure:
import pandas as pd
import numpy as np
multi_index = pd.MultiIndex.from_tuples(
    [('2022-02-18', '2022-02-17'),
     ('2022-02-19', '2022-02-17'),
     ('2022-02-20', '2022-02-17'),
     ('2022-02-21', '2022-02-17'),
     ('2022-02-19', '2022-02-18'),
     ('2022-02-20', '2022-02-18'),
     ('2022-02-21', '2022-02-18'),
     ('2022-02-22', '2022-02-18'),
     ('2022-02-20', '2022-02-19'),
     ('2022-02-21', '2022-02-19'),
     ('2022-02-22', '2022-02-19'),
     ('2022-02-23', '2022-02-19')],
    names=['date1', 'date2'])
data = [[45, 365],
        [91, 254],
        [60, 268],
        [57, 781],
        [68, 236],
        [36, np.nan],
        [87, 731],
        [12, 452],
        [np.nan, 214],
        [33, 654],
        [74, 113],
        [65, 381]]
df = pd.DataFrame(data, columns=['value1', 'value2'], index=multi_index)
df looks like the following table:
                       value1  value2
date1      date2
2022-02-18 2022-02-17    45.0   365.0
2022-02-19 2022-02-17    91.0   254.0
2022-02-20 2022-02-17    60.0   268.0
2022-02-21 2022-02-17    57.0   781.0
2022-02-19 2022-02-18    68.0   236.0
2022-02-20 2022-02-18    36.0     NaN
2022-02-21 2022-02-18    87.0   731.0
2022-02-22 2022-02-18    12.0   452.0
2022-02-20 2022-02-19     NaN   214.0
2022-02-21 2022-02-19    33.0   654.0
2022-02-22 2022-02-19    74.0   113.0
2022-02-23 2022-02-19    65.0   381.0
date1 and date2 form the multi index. I would like to impute the missing values with the last available value along date2: the imputed row should keep its own date1, and take the value from the last available date2 within that date1. In this example that would be 36 for value1 and 268 for value2.
I tried pandas fillna() with different settings of its 'method' parameter, but nothing seems to be a proper solution for my problem.
This should give you what you've described:
df.groupby('date1').ffill()
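A small sketch of why this works, using just the three rows for date1 = 2022-02-20 from the question's sample (values copied from it):

```python
import pandas as pd
import numpy as np

# Three rows sharing date1='2022-02-20', ordered by date2
multi_index = pd.MultiIndex.from_tuples(
    [("2022-02-20", "2022-02-17"),
     ("2022-02-20", "2022-02-18"),
     ("2022-02-20", "2022-02-19")],
    names=["date1", "date2"],
)
df = pd.DataFrame(
    {"value1": [60, 36, np.nan], "value2": [268, np.nan, 214]},
    index=multi_index,
)

# Forward-fill within each date1 group: each NaN takes the last
# available value for the same date1, walking forward along date2
filled = df.groupby("date1").ffill()
print(filled)
```

Grouping on the index level name keeps the fill from leaking across different date1 values, which is exactly the constraint the question states.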

Pandas DF fill in for Missing Months

I have a dataframe of values that are mostly (but not always) quarterly values.
I need to fill in for any missing months so it is complete.
Here I need to put it into a complete df from 2015-12 to 2021-03.
Thank you.
id date amt rate
0 15856 2015-12-31 85.09 0.0175
1 15857 2016-03-31 135.60 0.0175
2 15858 2016-06-30 135.91 0.0175
3 15859 2016-09-30 167.27 0.0175
4 15860 2016-12-31 173.32 0.0175
....
19 15875 2020-09-30 305.03 0.0175
20 15876 2020-12-31 354.09 0.0175
21 15877 2021-03-31 391.19 0.0175
You can use pd.date_range() with freq='M' to generate the list of month ends, then reindex the datetime index:
df_ = (df.set_index('date')
         .reindex(pd.date_range('2015-12', '2021-03', freq='M'))
         .reset_index()
         .rename(columns={'index': 'date'}))
print(df_)
date id amt rate
0 2015-12-31 15856.0 85.09 0.0175
1 2016-01-31 NaN NaN NaN
2 2016-02-29 NaN NaN NaN
3 2016-03-31 15857.0 135.60 0.0175
4 2016-04-30 NaN NaN NaN
.. ... ... ... ...
58 2020-10-31 NaN NaN NaN
59 2020-11-30 NaN NaN NaN
60 2020-12-31 15876.0 354.09 0.0175
61 2021-01-31 NaN NaN NaN
62 2021-02-28 NaN NaN NaN
To fill the NaN value, you can use df_.fillna(0).
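An end-to-end sketch of the same idea on a shorter window, with two quarterly rows borrowed from the question's data:

```python
import pandas as pd

# Two quarterly observations (from the question's sample)
df = pd.DataFrame(
    {"date": pd.to_datetime(["2015-12-31", "2016-03-31"]),
     "amt": [85.09, 135.60],
     "rate": [0.0175, 0.0175]}
)

# Every month end in the target window
full_index = pd.date_range("2015-12", "2016-04", freq="M")

# Reindex onto the complete monthly index; missing months become NaN rows
df_ = (df.set_index("date")
         .reindex(full_index)
         .rename_axis("date")
         .reset_index())
print(df_)
```

The intermediate months (2016-01-31, 2016-02-29) appear as NaN rows, ready for fillna() or interpolation.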

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1 minute blocks. The code appears to work fine but when I look into the resulting dataframe it is changing the order of the dates incorrectly. Below is what it looks like pre resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe = pd.DataFrame()
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace=True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top, because it thinks it's January.
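One way to see (and guard against) that swap, assuming the raw ticks arrive as day-first strings: parse them with an explicit format instead of letting pandas guess. The sample strings below are invented for illustration:

```python
import pandas as pd

# Day-first strings: "01/07/2020" is meant to be 1 July 2020
raw = pd.Series(["01/07/2020 17:00:21", "02/07/2020 17:00:27"])

# Ambiguous parse: pandas defaults to month-first, reading these as January
wrong = pd.to_datetime(raw)

# Explicit format removes the ambiguity
right = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

print(wrong[0])  # 2020-01-07 17:00:21  (misread as 7 January)
print(right[0])  # 2020-07-01 17:00:21  (1 July, as intended)
```

Passing dayfirst=True to pd.to_datetime would also work, but an explicit format fails loudly on malformed rows instead of silently guessing.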

how do i access only specific entries of a dataframe having date as index

This is the tail of my DataFrame, of around 1000 entries:
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I have to select only the entries of certain dates, for example, the 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144

seaborn multiple variables group bar plot

I have a pandas dataframe with one index (datetime) and three variables (int):
date A B C
2017-09-05 25 261 31
2017-09-06 261 1519 151
2017-09-07 188 1545 144
2017-09-08 200 2110 232
2017-09-09 292 2391 325
I can create grouped bar plot with basic pandas plot.
df.plot(kind='bar', legend=False)
However, I want to reproduce this in Seaborn or another library to improve my skills. I found a very close answer (Pandas: how to draw a bar plot with two categories and four series each?).
Its suggested answer has this code:
ax=sns.barplot(x='', y='', hue='', data=data)
If I apply this code to mine, I do not know what my `y` would be:
ax=sns.barplot(x='date', y=??, hue=??, data=data)
How can I plot multiple variables with Seaborn or other libraries?
I think you need melt if you want to use barplot:
data = df.melt('date', var_name='a', value_name='b')
print (data)
date a b
0 2017-09-05 A 25
1 2017-09-06 A 261
2 2017-09-07 A 188
3 2017-09-08 A 200
4 2017-09-09 A 292
5 2017-09-05 B 261
6 2017-09-06 B 1519
7 2017-09-07 B 1545
8 2017-09-08 B 2110
9 2017-09-09 B 2391
10 2017-09-05 C 31
11 2017-09-06 C 151
12 2017-09-07 C 144
13 2017-09-08 C 232
14 2017-09-09 C 325
ax=sns.barplot(x='date', y='b', hue='a', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Pandas solution with DataFrame.plot.bar and set_index:
df.set_index('date').plot.bar()