I have two tables with the following formats:
Table1: key = Date, Index
Date Index Value1
0 2015-01-01 A -1.292040
1 2015-04-01 A 0.535893
2 2015-02-01 B -1.779029
3 2015-06-01 B 1.129317
Table2: key = Date
Date Value2
0 2015-01-01 2.637761
1 2015-02-01 -0.496927
2 2015-03-01 0.226914
3 2015-04-01 -2.010917
4 2015-05-01 -1.095533
5 2015-06-01 0.651244
6 2015-07-01 0.036592
7 2015-08-01 0.509352
8 2015-09-01 -0.682297
9 2015-10-01 1.231889
10 2015-11-01 -1.557481
11 2015-12-01 0.332942
Table2 has more rows and I want to join Table1 into Table2 on Date so I can do stuff with the Values. However, I also want to bring in Index and fill in, for each index, all the Dates it doesn't have, like this:
Result:
Date Index Value1 Value2
0 2015-01-01 A -1.292040 2.637761
1 2015-02-01 A NaN -0.496927
2 2015-03-01 A NaN 0.226914
3 2015-04-01 A 0.535893 -2.010917
4 2015-05-01 A NaN -1.095533
5 2015-06-01 A NaN 0.651244
6 2015-07-01 A NaN 0.036592
7 2015-08-01 A NaN 0.509352
8 2015-09-01 A NaN -0.682297
9 2015-10-01 A NaN 1.231889
10 2015-11-01 A NaN -1.557481
11 2015-12-01 A NaN 0.332942
.... and so on with Index B
I suppose I could manually filter out each Index value from Table1 into Table2, but that would be really tedious and troublesome if I didn't actually know all the index values. Essentially I want to do a "group Table1 by Index and right join to Table2 on Date" in one step, but I'm stuck on how to express this.
Running the latest versions of Pandas and Jupyter.
EDIT: I have a program to fill in the NaNs, so they're not a problem right now.
It seems you want to merge 'Value1' of df1 with df2 on 'Date', while assigning each Index value to every date. You can use pd.concat with a list comprehension:
import pandas as pd

# for each Index value, stamp every df2 row with that value,
# then left-merge that group's rows from df1
pd.concat([df2.assign(Index=i).merge(gp, how='left')
           for i, gp in df1.groupby('Index')],
          ignore_index=True)
Output:
Date Value2 Index Value1
0 2015-01-01 2.637761 A -1.292040
1 2015-02-01 -0.496927 A NaN
2 2015-03-01 0.226914 A NaN
3 2015-04-01 -2.010917 A 0.535893
4 2015-05-01 -1.095533 A NaN
5 2015-06-01 0.651244 A NaN
6 2015-07-01 0.036592 A NaN
7 2015-08-01 0.509352 A NaN
8 2015-09-01 -0.682297 A NaN
9 2015-10-01 1.231889 A NaN
10 2015-11-01 -1.557481 A NaN
11 2015-12-01 0.332942 A NaN
12 2015-01-01 2.637761 B NaN
13 2015-02-01 -0.496927 B -1.779029
14 2015-03-01 0.226914 B NaN
15 2015-04-01 -2.010917 B NaN
16 2015-05-01 -1.095533 B NaN
17 2015-06-01 0.651244 B 1.129317
18 2015-07-01 0.036592 B NaN
19 2015-08-01 0.509352 B NaN
20 2015-09-01 -0.682297 B NaN
21 2015-10-01 1.231889 B NaN
22 2015-11-01 -1.557481 B NaN
23 2015-12-01 0.332942 B NaN
Because the merge keys are not specified, merge automatically uses the intersection of the columns, which is ['Date', 'Index'] for each group.
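If you prefer a single merge chain over the loop, a cross join builds the same (Date, Index) scaffold. A minimal sketch, assuming pandas >= 1.2 for how='cross':
# build every (Date, Index) pair, then left-merge Table1's values
idx = df1[['Index']].drop_duplicates()
out = (df2.merge(idx, how='cross')
          .merge(df1, on=['Date', 'Index'], how='left')
          .sort_values(['Index', 'Date'], ignore_index=True))
Either way you get one copy of df2 per distinct Index, so you never need to know the index values up front.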
Related
I have a dataframe of values that are mostly (but not always) quarterly.
I need to fill in any missing months so it is complete.
Here I need to put it into a complete df covering 2015-12 to 2021-03.
Thank you.
id date amt rate
0 15856 2015-12-31 85.09 0.0175
1 15857 2016-03-31 135.60 0.0175
2 15858 2016-06-30 135.91 0.0175
3 15859 2016-09-30 167.27 0.0175
4 15860 2016-12-31 173.32 0.0175
....
19 15875 2020-09-30 305.03 0.0175
20 15876 2020-12-31 354.09 0.0175
21 15877 2021-03-31 391.19 0.0175
You can use pd.date_range() with freq='M' to generate the month-end dates, then reindex on the datetime index.
df_ = (df.set_index('date')
         .reindex(pd.date_range('2015-12', '2021-03', freq='M'))
         .reset_index()
         .rename(columns={'index': 'date'}))
print(df_)
date id amt rate
0 2015-12-31 15856.0 85.09 0.0175
1 2016-01-31 NaN NaN NaN
2 2016-02-29 NaN NaN NaN
3 2016-03-31 15857.0 135.60 0.0175
4 2016-04-30 NaN NaN NaN
.. ... ... ... ...
58 2020-10-31 NaN NaN NaN
59 2020-11-30 NaN NaN NaN
60 2020-12-31 15876.0 354.09 0.0175
61 2021-01-31 NaN NaN NaN
62 2021-02-28 NaN NaN NaN
To fill the NaN values, you can use df_.fillna(0).
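If a flat zero isn't right for every column, you can also fill per column. A small sketch, under the assumption that amt should default to 0 while rate carries forward:
# assumption: missing months get amt 0 but keep the last known rate
df_['amt'] = df_['amt'].fillna(0)
df_['rate'] = df_['rate'].ffill()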
I have a data frame as shown below.
df:
cust_id lead_date dob
1 2016-12-25 1989-12-20
2 2017-10-25 1980-09-20
3 2016-11-25 NaN
4 2014-04-25 1989-12-20
5 2019-12-21 NaN
From the above I would like to calculate age as the difference between lead_date and dob in years.
If dob is NaN, then make the age 0.
Expected output:
cust_id lead_date dob age
1 2016-12-25 1989-12-20 27
2 2017-10-25 1980-09-20 37
3 2016-11-25 NaN 0
4 2014-04-25 1989-12-20 25
5 2019-12-21 NaN 0
You can do:
# convert to datetime type
df['lead_date'] = pd.to_datetime(df.lead_date)
df['dob'] = pd.to_datetime(df.dob)
# year difference; rows with NaT dob produce NaN, which we fill with 0
df['age'] = (df.lead_date.dt.year - df.dob.dt.year).fillna(0)
Output:
cust_id lead_date dob age
0 1 2016-12-25 1989-12-20 27.0
1 2 2017-10-25 1980-09-20 37.0
2 3 2016-11-25 NaT 0.0
3 4 2014-04-25 1989-12-20 25.0
4 5 2019-12-21 NaT 0.0
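The NaT rows force the age column to float. If you want the integer ages from your expected output, you could cast after filling; a small addition, not part of the original snippet:
# once the NaNs are filled it is safe to cast to int
df['age'] = df['age'].astype(int)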
I am facing an issue with groupby ffill. It does not seem to apply the forward fill in the correct order.
Here is my starting data:
group date stage_2
0 A 2014-01-01 NaN
1 A 2014-01-03 NaN
2 A 2014-01-04 NaN
3 A 2014-01-05 1.0
4 B 2014-01-02 NaN
5 B 2014-01-06 NaN
6 B 2014-01-10 NaN
7 C 2014-01-03 1.0
8 C 2014-01-05 3.0
9 C 2014-01-08 NaN
10 C 2014-01-09 NaN
11 C 2014-01-10 NaN
12 C 2014-01-11 NaN
13 D 2014-01-01 NaN
14 D 2014-01-03 NaN
15 D 2014-01-04 NaN
16 E 2014-01-04 1.0
17 E 2014-01-06 3.0
18 E 2014-01-07 4.0
19 E 2014-01-08 NaN
20 E 2014-01-09 NaN
21 E 2014-01-10 NaN
22 F 2014-01-08 NaN
Here is the ffill I apply:
df['stage_2'] = df.groupby('group')['stage_2'].ffill()
This is what I get; I was expecting different values at indexes 9 through 12 and 21:
group date stage_2
0 A 2014-01-01 NaN
1 A 2014-01-03 NaN
2 A 2014-01-04 NaN
3 A 2014-01-05 1.0
4 B 2014-01-02 NaN
5 B 2014-01-06 NaN
6 B 2014-01-10 NaN
7 C 2014-01-03 1.0
8 C 2014-01-05 3.0
9 C 2014-01-08 1.0
10 C 2014-01-09 NaN
11 C 2014-01-10 NaN
12 C 2014-01-11 NaN
13 D 2014-01-01 NaN
14 D 2014-01-03 NaN
15 D 2014-01-04 NaN
16 E 2014-01-04 1.0
17 E 2014-01-06 3.0
18 E 2014-01-07 4.0
19 E 2014-01-08 4.0
20 E 2014-01-09 4.0
21 E 2014-01-10 NaN
22 F 2014-01-08 NaN
The only way I can reproduce this is by putting non-ASCII characters, e.g. Cyrillic С and Е, into the group column at indexes 9-12 and 21 respectively.
EDIT
OK, most likely you're using pandas v0.23.0, which had a bug (fixed in later releases, at least by v0.23.4) that makes .ffill() give the exact output you posted. So please upgrade your pandas.
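As a quick sanity check for both explanations, a sketch along these lines should help (str.isascii needs Python 3.7+):
import pandas as pd

print(pd.__version__)  # 0.23.0 is the buggy release

# flag lookalike non-ASCII group labels, e.g. Cyrillic С or Е
print(df.loc[~df['group'].map(str.isascii), 'group'].unique())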
When grouping data, how can I cumsum the milliseconds in my df?
Input:
time key isValue
2018-03-04 00:00:06.520 1 NaN
2018-03-04 00:00:07.230 1 NaN
2018-03-04 00:00:08.140 1 1
2018-03-04 00:00:08.720 1 1
2018-03-04 00:00:09.110 1 1
2018-03-04 00:00:09.650 1 NaN
2018-03-04 00:00:10.360 1 NaN
2018-03-04 00:00:11.150 1 NaN
2018-03-04 00:00:11.770 2 NaN
2018-03-04 00:00:12.320 2 NaN
2018-03-04 00:00:12.910 2 1
2018-03-04 00:00:13.250 2 1
2018-03-04 00:00:13.960 2 1
2018-03-04 00:00:14.550 2 NaN
2018-03-04 00:00:15.250 2 NaN
....
And I want the output below.
Output:
key : time
1 : 1.030
2 : 1.050
3 : X.xxx
4 : X.xxx
....
Well, I'm using this code:
df.groupby(["key"])["time"].cumsum()
but I don't think it's correct.
I think you need:
# take only the sub-second (microsecond) part of each timestamp,
# cumulative-sum it within each key, then convert to milliseconds
df['new'] = df["time"].dt.microsecond.groupby(df["key"]).cumsum() / 1000
print (df)
time key isValue new
0 2018-03-04 00:00:06.520 1 NaN 520.0
1 2018-03-04 00:00:07.230 1 NaN 750.0
2 2018-03-04 00:00:08.140 1 1.0 890.0
3 2018-03-04 00:00:08.720 1 1.0 1610.0
4 2018-03-04 00:00:09.110 1 1.0 1720.0
5 2018-03-04 00:00:09.650 1 NaN 2370.0
6 2018-03-04 00:00:10.360 1 NaN 2730.0
7 2018-03-04 00:00:11.150 1 NaN 2880.0
8 2018-03-04 00:00:11.770 2 NaN 770.0
9 2018-03-04 00:00:12.320 2 NaN 1090.0
10 2018-03-04 00:00:12.910 2 1.0 2000.0
11 2018-03-04 00:00:13.250 2 1.0 2250.0
12 2018-03-04 00:00:13.960 2 1.0 3210.0
13 2018-03-04 00:00:14.550 2 NaN 3760.0
14 2018-03-04 00:00:15.250 2 NaN 4010.0
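Note that dt.microsecond extracts only the sub-second component of each timestamp, not the elapsed time between rows; that is why the first row becomes 520.0. A tiny demo of that behaviour:
import pandas as pd

ts = pd.Series(pd.to_datetime(['2018-03-04 00:00:06.520']))
print(ts.dt.microsecond)  # 520000, i.e. 520.0 after the /1000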
This is DataFrame 1:
Date Serial Number Type
0 2014-12-17 1N4AL2EP8DC270200 New
1 2015-10-28 1N4AL2EP8DC270200 Used
2 2015-01-22 1N4AL3AP1EN239307 New
3 2015-11-22 1N4AL3AP1EN239307 Used
4 2015-05-22 1N4AL3AP1FC235402 New
5 2016-12-02 1N4AL3AP1FC235402 Used
6 2015-01-22 1N4AL3AP2FC213098 New
7 2016-05-13 1N4AL3AP2FC213098 Used
8 2014-05-14 1N4AL3AP3EC132416 New
9 2016-04-07 1N4AL3AP3EC132416 Used
10 2014-05-24 1N4AL3AP5EC316644 New
11 2014-12-18 1N4AL3AP5EC316644 Used
12 2014-12-11 1N4AL3AP6EC322517 New
13 2015-10-04 1N4AL3AP6EC322517 Used
14 2016-06-06 1N4AL3AP6EC322517 Used
...
This is DataFrame 2:
Date Serial Number
0 2014-03-12 5N1AA08C78N611573
1 2014-03-12 JN8AS5MT3EW604277
2 2014-03-12 1N6AF0LX5DN114710
3 2014-03-12 1N4AL3AP8DN447876
4 2014-03-12 JN8AZ1MU8AW021145
5 2014-03-12 JN1AZ4EH0AM500138
6 2014-03-12 JN8AF5MR3BT013548
7 2014-03-12 3N1AB61E17L629049
8 2014-03-12 3N1BC13E87L368844
9 2014-03-13 1N6AD07W95C431183
10 2014-03-13 1N6AA07A25N543180
11 2014-03-13 1N4CL2AP1BC110185
12 2014-03-13 JN8AZ1MW1BW181306
13 2014-03-13 5N1BV28U46N116791
...
Just given a sample of each DataFrame, not the entire thing. I need to retrieve the first Date of every Serial Number that has its Type as Used in DataFrame 1 (for example, for serial number '1N4AL3AP6EC322517' that Date is 2015-10-04). Then compare this Date to the Date recorded for the same Serial Number in DataFrame 2: if the Date in DataFrame 2 is earlier than the one in DataFrame 1, mark it with 'A', otherwise mark it with 'B'.
I have to do this for over 2000 serial numbers; what's an efficient way to do it?
I think you can use merge_asof:
print (df2)
Date Serial Number
0 2016-03-12 1N4AL3AP6EC322517
1 2013-03-12 1N4AL3AP5EC316644
2 2014-03-12 1N4AL3AP3EC132416
3 2016-08-12 1N4AL3AP2FC213098
4 2014-03-12 JN8AZ1MU8AW021145
#if necessary cast Date columns to datetime
df1.Date = pd.to_datetime(df1.Date)
df2.Date = pd.to_datetime(df2.Date)
#keep the first 'Used' row for each Serial Number
df = df1[df1.Type == 'Used'].drop_duplicates(['Serial Number'])
print (df)
Date Serial Number Type
1 2015-10-28 1N4AL2EP8DC270200 Used
3 2015-11-22 1N4AL3AP1EN239307 Used
5 2016-12-02 1N4AL3AP1FC235402 Used
7 2016-05-13 1N4AL3AP2FC213098 Used
9 2016-04-07 1N4AL3AP3EC132416 Used
11 2014-12-18 1N4AL3AP5EC316644 Used
13 2015-10-04 1N4AL3AP6EC322517 Used
#add value B
df2['Mark'] = 'B'
df = pd.merge_asof(df.sort_values(['Date']),
df2.sort_values(['Date']), on='Date', by='Serial Number')
print (df)
Date Serial Number Type Mark
0 2014-12-18 1N4AL3AP5EC316644 Used B
1 2015-10-04 1N4AL3AP6EC322517 Used NaN
2 2015-10-28 1N4AL2EP8DC270200 Used NaN
3 2015-11-22 1N4AL3AP1EN239307 Used NaN
4 2016-04-07 1N4AL3AP3EC132416 Used B
5 2016-05-13 1N4AL3AP2FC213098 Used NaN
6 2016-12-02 1N4AL3AP1FC235402 Used NaN
#add value A
mask = df['Serial Number'].isin(df2['Serial Number'])
df.loc[mask, 'Mark'] = df.loc[mask, 'Mark'].fillna('A')
print (df)
Date Serial Number Type Mark
0 2014-12-18 1N4AL3AP5EC316644 Used B
1 2015-10-04 1N4AL3AP6EC322517 Used A
2 2015-10-28 1N4AL2EP8DC270200 Used NaN
3 2015-11-22 1N4AL3AP1EN239307 Used NaN
4 2016-04-07 1N4AL3AP3EC132416 Used B
5 2016-05-13 1N4AL3AP2FC213098 Used A
6 2016-12-02 1N4AL3AP1FC235402 Used NaN
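Note that merge_asof with the default direction='backward' only matches df2 rows whose Date falls on or before each Used Date, which is why serials whose DataFrame 2 date is earlier end up with 'B' and later ones with 'A'. If you want the marking exactly as worded in the question (earlier in DataFrame 2 means 'A'), simply swap the two labels in the code above.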