Impute missing values in a MultiIndex DataFrame using the last available value of the second index level - pandas

I'm struggling with the following problem:
I have a multilevel index data frame of time series data of following structure:
import pandas as pd
import numpy as np

multi_index = pd.MultiIndex.from_tuples(
    [('2022-02-18', '2022-02-17'),
     ('2022-02-19', '2022-02-17'),
     ('2022-02-20', '2022-02-17'),
     ('2022-02-21', '2022-02-17'),
     ('2022-02-19', '2022-02-18'),
     ('2022-02-20', '2022-02-18'),
     ('2022-02-21', '2022-02-18'),
     ('2022-02-22', '2022-02-18'),
     ('2022-02-20', '2022-02-19'),
     ('2022-02-21', '2022-02-19'),
     ('2022-02-22', '2022-02-19'),
     ('2022-02-23', '2022-02-19')],
    names=['date1', 'date2'])
data = [[45, 365],
        [91, 254],
        [60, 268],
        [57, 781],
        [68, 236],
        [36, np.nan],
        [87, 731],
        [12, 452],
        [np.nan, 214],
        [33, 654],
        [74, 113],
        [65, 381]]
df = pd.DataFrame(data, columns=['value1', 'value2'], index=multi_index)
df looks like the following table:
date1       date2       value1  value2
2022-02-18  2022-02-17    45     365
2022-02-19  2022-02-17    91     254
2022-02-20  2022-02-17    60     268
2022-02-21  2022-02-17    57     781
2022-02-19  2022-02-18    68     236
2022-02-20  2022-02-18    36     NaN
2022-02-21  2022-02-18    87     731
2022-02-22  2022-02-18    12     452
2022-02-20  2022-02-19   NaN     214
2022-02-21  2022-02-19    33     654
2022-02-22  2022-02-19    74     113
2022-02-23  2022-02-19    65     381
date1 and date2 form the MultiIndex. I would like to impute the missing values with the last available value along date2, keeping date1 of the imputed value the same. In this case that would be 36 for value1 and 268 for value2.
I tried pandas' fillna() with different values of the 'method' parameter, but none of them properly solved my problem.

This should give you what you've described:
df.groupby('date1').ffill()
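As a quick check, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and shows the group-wise forward fill producing exactly the values asked for:

```python
import pandas as pd
import numpy as np

multi_index = pd.MultiIndex.from_tuples(
    [('2022-02-18', '2022-02-17'), ('2022-02-19', '2022-02-17'),
     ('2022-02-20', '2022-02-17'), ('2022-02-21', '2022-02-17'),
     ('2022-02-19', '2022-02-18'), ('2022-02-20', '2022-02-18'),
     ('2022-02-21', '2022-02-18'), ('2022-02-22', '2022-02-18'),
     ('2022-02-20', '2022-02-19'), ('2022-02-21', '2022-02-19'),
     ('2022-02-22', '2022-02-19'), ('2022-02-23', '2022-02-19')],
    names=['date1', 'date2'])
data = [[45, 365], [91, 254], [60, 268], [57, 781],
        [68, 236], [36, np.nan], [87, 731], [12, 452],
        [np.nan, 214], [33, 654], [74, 113], [65, 381]]
df = pd.DataFrame(data, columns=['value1', 'value2'], index=multi_index)

# within each date1 group, carry the last available date2 value forward
filled = df.groupby(level='date1').ffill()

# the two NaNs are filled with 36 and 268, as described in the question
print(filled.loc[('2022-02-20', '2022-02-19'), 'value1'])  # 36.0
print(filled.loc[('2022-02-20', '2022-02-18'), 'value2'])  # 268.0
```

Grouping on the index level name works here because date1 is part of the MultiIndex; within each group, ffill() takes the previous row along date2.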

Related

Grouping data month-wise with Categorical data in pandas

How can I group data into months from dates when a data frame has both categorical and numerical data in pandas? I tried the groupby function, but I don't think it works with categorical data. There are multiple values in the categorical column. Sample data:
Date        Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
2021-06-12  Dove_birds         1268   7656
2021-02-05  Pantene_winner      657   8964
2021-09-15  Budweiser_wazap    7642     76
2021-05-13  Pantene_winner      425   6578
2021-12-12  Budweiser_wazap    9867    111
2021-09-09  Dove_birds         1578  11456
2021-05-24  Pantene_winner      678   7475
2021-09-27  Budweiser_wazap    8742     96
2021-09-09  Dove_soft          1175  15486
Now I need to group the data month-wise and show, for example, that Budweiser_wazap gained a total audience of xxxx in September and xxxx in December, and so on for the other campaigns.
Expected output sample:
Month      Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
February   Pantene_winner     657   8964
September  Budweiser_wazap  16384    172
Since Budweiser_wazap campaign ran twice in September, the resulting output for No_of_Male_Viewers is: 7642 + 8742 = 16384, and for No_of_Female_Viewers is: 76 + 96 = 172.
USE-
# get the month name for each date
df['Month'] = df['Date'].dt.month_name()
# group by `Month` & `Campaign_Name`
df.groupby(['Month', 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum().reset_index()
Sample reproducible code-
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-06-08', '2015-08-05', '2015-05-06', '2015-05-05', '2015-07-08', '2015-05-07', '2015-06-05', '2015-07-05'],
    'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()
df
df output-
Date Sym Data2 Data3 Month
0 2015-06-08 aapl 11 5 June
1 2015-08-05 aapl 8 8 August
2 2015-05-06 aapl 10 6 May
3 2015-05-05 aapl 15 1 May
4 2015-07-08 aaww 110 50 July
5 2015-05-07 aaww 60 100 May
6 2015-06-05 aaww 100 60 June
7 2015-07-05 aaww 40 120 July
Groupby Condition-
df.groupby(['Month', 'Sym'])[['Data2', 'Data3']].sum().reset_index()
Output-
Month Sym Data2 Data3
0 August aapl 8 8
1 July aaww 150 170
2 June aapl 11 5
3 June aaww 100 60
4 May aapl 25 7
5 May aaww 60 100
Ref link- Pandas - dataframe groupby - how to get sum of multiple columns
If you use strftime('%B'), which extracts month names directly, you can reach the same result with one line of code :)
# download dataframe from Stack Overflow and convert the column to datetime
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# '%B' returns the full month name; '%b' the 3-letter abbreviation, like Dec, Sep
df.groupby([df['Date'].dt.strftime('%B'), 'Campaign_Name']).sum()
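A self-contained sketch of the strftime('%B') approach, using a few made-up rows in place of read_clipboard() (the numbers mirror the question's sample):

```python
import pandas as pd

# hypothetical stand-in for the clipboard data
df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-09-15', '2021-09-27', '2021-02-05']),
    'Campaign_Name': ['Budweiser_wazap', 'Budweiser_wazap', 'Pantene_winner'],
    'No_of_Male_Viewers': [7642, 8742, 657],
    'No_of_Female_Viewers': [76, 96, 8964],
})

# group by month name and campaign in one line; numeric_only skips the Date column
out = df.groupby([df['Date'].dt.strftime('%B'), 'Campaign_Name']).sum(numeric_only=True)
print(out.loc[('September', 'Budweiser_wazap'), 'No_of_Male_Viewers'])  # 16384
```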

How to merge two dataframes based on time?

Trying to merge two dataframes. The first is big, the second is small; both have a datetime set as index.
I want the rows of the second one merged in between the rows of the first one, sorted by time.
df1:
df1 = pd.read_csv(left_inputfile_to_read, decimal=".", sep=';', parse_dates=True, low_memory=False)
df1.columns = ['FLIGHT_ID', 'X', 'Y', 'MODE_C', 'SPEED', 'HEADING', 'TRK_ROCD', 'TIJD']
# TIJD looks like '28-1-2019 00:26:56', i.e. day first
df1['datetime'] = pd.to_datetime(df1['TIJD'], dayfirst=True)
df1.set_index(['datetime'], inplace=True)
print(df1)
FLIGHT_ID X Y MODE_C SPEED HEADING TRK_ROCD TIJD
datetime
2019-01-28 00:26:56 20034026 -13345 -1923 230.0 414 88 NaN 28-1-2019 00:26:56
2019-01-28 00:27:00 20034026 -13275 -1923 230.0 414 88 NaN 28-1-2019 00:27:00
2019-01-28 00:27:05 20034026 -13204 -1923 230.0 414 88 NaN 28-1-2019 00:27:05
2019-01-28 00:27:10 20034026 -13134 -1923 230.0 414 88 NaN 28-1-2019 00:27:10
2019-01-28 00:27:15 20034026 -13064 -1923 230.0 414 88 NaN 28-1-2019 00:27:15
... ... ... ... ... ... ... ... ...
2019-01-29 00:08:32 20035925 13443 -531 230.0 257 85 NaN 29-1-2019 00:08:32
2019-01-29 00:08:37 20035925 13487 -526 230.0 257 85 NaN 29-1-2019 00:08:37
2019-01-29 00:08:42 20035925 13530 -520 230.0 257 85 NaN 29-1-2019 00:08:42
2019-01-29 00:08:46 20035925 13574 -516 230.0 257 85 NaN 29-1-2019 00:08:46
2019-01-29 00:08:51 20035925 13617 -510 230.0 257 85 NaN 29-1-2019 00:08:51
551446 rows × 8 columns
df2:
df2 = pd.read_csv(right_inputfile_to_read, decimal=".", sep=';', parse_dates=True, low_memory=False)
df2['datetime'] = pd.to_datetime(df2['T_START'], dayfirst=True)
df2.set_index(['datetime'], inplace=True)
df2.drop(columns=['T_START', 'T_END', 'AIRFIELD'], inplace=True)
print(df2)
print(df2)
QNH MODE_C_CORRECTION
datetime
2019-01-28 02:14:00 1022 235
2019-01-28 02:14:00 1022 235
2019-01-28 02:16:00 1019 155
2019-01-28 02:21:00 1019 155
2019-01-28 02:36:00 1019 155
... ... ...
2019-01-28 21:56:00 1014 21
2019-01-28 22:56:00 1014 21
2019-01-28 23:26:00 1014 21
2019-01-28 23:29:00 1014 21
2019-01-28 23:52:00 1014 21
[69 rows x 2 columns]
The idea is that the first row of df2 should be inserted somewhere at 2019-01-28 02:14:00.
I have spent hours on Stackoverflow and pandas documentation (merge, join, concat) but cannot find the right solution.
The next step would be to interpolate the values in column 'QNH' to the rows that are in df1, based on that time.
Any help greatly appreciated!
Just concatenate the two DataFrames and sort by date:
df = pd.concat([df1,df2]).sort_values(by='datetime')
For the next step you can use pandas.DataFrame.interpolate.
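A minimal sketch of both steps, assuming small made-up frames with a datetime index (not the real flight data): concatenate, sort by the index, then time-interpolate QNH onto df1's timestamps:

```python
import pandas as pd

# df1: dense observations without QNH (stand-in for the flight data)
df1 = pd.DataFrame(
    {'SPEED': [414, 415, 416]},
    index=pd.to_datetime(['2019-01-28 02:13:00',
                          '2019-01-28 02:15:00',
                          '2019-01-28 02:17:00']))

# df2: sparse QNH readings
df2 = pd.DataFrame(
    {'QNH': [1022.0, 1019.0]},
    index=pd.to_datetime(['2019-01-28 02:14:00', '2019-01-28 02:16:00']))

merged = pd.concat([df1, df2]).sort_index()
# method='time' weights the interpolation by the actual gaps between timestamps
merged['QNH'] = merged['QNH'].interpolate(method='time')
print(merged.loc[pd.Timestamp('2019-01-28 02:15:00'), 'QNH'])  # 1020.5
```

Note that rows before the first QNH reading stay NaN; a follow-up bfill() would fill those if needed.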

How to extract the last value of a group in pandas dataframe

I have a huge dataset that needs to be grouped by its 'IDs', and only the last value of each ID should be exported into a SINGLE csv/excel file.
incl = ['A', 'B', 'C']
for k, g in df[df['ID'].isin(incl)].groupby('ID'):
    g.tail(1).to_csv(f'{k}.csv')
I have tried this, but it makes a different csv file for each ID instead of one big file containing the last value of each group.
Sample data:
ID Date Open High Low
30 UNITY 2020-06-18 11.50 11.75 11.41
31 UNITY 2020-06-21 11.44 11.50 10.88
32 UNITY 2020-06-22 11.26 11.78 11.26
33 UNITY 2020-06-23 11.72 12.08 11.53
34 UNITY 2020-06-24 11.51 11.59 11.40
35 UNITY 2020-06-25 11.85 11.85 11.11
36 SSOM 2020-05-03 27.50 27.95 27.00
37 SSOM 2020-05-05 27.50 27.50 27.50
38 SSOM 2020-05-06 29.20 29.56 29.20
39 SSOM 2020-05-07 31.77 31.77 31.77
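One way to get a single file: call tail(1) on the grouped frame as a whole instead of looping over groups. A sketch with a few of the sample rows (the filename last_values.csv is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   ['UNITY', 'UNITY', 'SSOM', 'SSOM'],
    'Date': ['2020-06-24', '2020-06-25', '2020-05-06', '2020-05-07'],
    'Open': [11.51, 11.85, 29.20, 31.77],
})

# tail(1) on the GroupBy returns the last row of every group in one frame
last_rows = df.groupby('ID').tail(1)
last_rows.to_csv('last_values.csv', index=False)  # one file for all IDs
print(last_rows['Open'].tolist())  # [11.85, 31.77]
```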

how do i access only specific entries of a dataframe having date as index

This is the tail of my DataFrame of around 1000 entries:
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I have to sort out the entries of only certain dates, for example, 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144

seaborn multiple variables group bar plot

I have a pandas dataframe with one index (datetime) and three variables (int):
date A B C
2017-09-05 25 261 31
2017-09-06 261 1519 151
2017-09-07 188 1545 144
2017-09-08 200 2110 232
2017-09-09 292 2391 325
I can create a grouped bar plot with the basic pandas plot:
df.plot(kind='bar', legend=False)
However, I want to make it in Seaborn or another library to improve my skills. I found a very close answer (Pandas: how to draw a bar plot with two categories and four series each?).
Its suggested answer has this code:
ax = sns.barplot(x='', y='', hue='', data=data)
If I apply this code to mine, I do not know what my 'y' would be:
ax = sns.barplot(x='date', y=??, hue=??, data=data)
How can I plot multiple variables with Seaborn or other libraries?
I think you need melt if you want to use barplot:
data = df.melt('date', var_name='a', value_name='b')
print (data)
date a b
0 2017-09-05 A 25
1 2017-09-06 A 261
2 2017-09-07 A 188
3 2017-09-08 A 200
4 2017-09-09 A 292
5 2017-09-05 B 261
6 2017-09-06 B 1519
7 2017-09-07 B 1545
8 2017-09-08 B 2110
9 2017-09-09 B 2391
10 2017-09-05 C 31
11 2017-09-06 C 151
12 2017-09-07 C 144
13 2017-09-08 C 232
14 2017-09-09 C 325
ax=sns.barplot(x='date', y='b', hue='a', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
A pandas solution with DataFrame.plot.bar and set_index:
df.set_index('date').plot.bar()