KeyError: Timestamp is not in index - pandas

I am trying to overlap timestamps from df2 with the time windows in df1. Whenever there is no match, I get the following error. How can I get the expected output without this error?
Error
KeyError: "[Timestamp('2022-01-01 03:12:02')] not in index"
Input
from datetime import datetime, date
import pandas as pd
df1 = pd.DataFrame({'id': ['aa0', 'aa1', 'aa2', 'aa3'],
                    'number': [1, 2, 2, 1],
                    'color': ['blue', 'red', 'yellow', 'green'],
                    'date1': [datetime(2022,1,1,1,1,1),
                              datetime(2022,1,1,2,4,1),
                              datetime(2022,1,1,3,8,1),
                              datetime(2022,1,1,4,12,1)],
                    'date2': [datetime(2022,1,1,2,1,1),
                              datetime(2022,1,1,3,6,1),
                              datetime(2022,1,1,3,10,1),
                              datetime(2022,1,1,4,14,1)]})
Input2
df2 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'],
                    'value': [10, 20, 30, 40],
                    'date': [datetime(2022,1,1,1,12,1),
                             datetime(2022,1,1,1,40,1),
                             datetime(2022,1,1,3,12,2),
                             datetime(2022,1,1,4,12,2)]})
Expected output
(2022-01-01 01:01:01, 2022-01-01 02:01:01] 15.0
(2022-01-01 04:12:01, 2022-01-01 04:14:01] 40.0
Code
idx = pd.IntervalIndex.from_arrays(pd.to_datetime(df1['date1']),
                                   pd.to_datetime(df1['date2']))
mapper = pd.Series(idx, index=idx)
df2.groupby(mapper[pd.to_datetime(df2['date'])].values)['value'].mean()
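The error occurs because mapper is indexed with timestamps that fall inside no interval. A minimal sketch of a fix, assuming the frames above: look the timestamps up with IntervalIndex.get_indexer, which returns -1 for misses, and drop those rows before grouping.
idx = pd.IntervalIndex.from_arrays(pd.to_datetime(df1['date1']),
                                   pd.to_datetime(df1['date2']))
# position of the matching interval for each timestamp, or -1 if none matches
pos = idx.get_indexer(pd.to_datetime(df2['date']))
matched = pos != -1
df2[matched].groupby(idx[pos[matched]])['value'].mean()
This reproduces the expected output while silently skipping timestamps such as 2022-01-01 03:12:02 that fall outside every window.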

One option is conditional_join from pyjanitor, which handles inequality joins such as this one:
# pip install pyjanitor
import pandas as pd
import janitor
df1['date1'] = pd.to_datetime(df1['date1'])
df1['date2'] = pd.to_datetime(df1['date2'])
df2['date'] = pd.to_datetime(df2['date'])
(
    df1
    .filter(like='date')
    .conditional_join(
        df2.filter(['value', 'date']),
        ('date1', 'date', '<='),
        ('date2', 'date', '>='))
    .groupby(['date1', 'date2'])
    .value
    .mean()
)
date1 date2
2022-01-01 01:01:01 2022-01-01 02:01:01 15.0
2022-01-01 04:12:01 2022-01-01 04:14:01 40.0
Name: value, dtype: float64
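For comparison, a pure-pandas sketch of the same join, assuming pandas >= 1.2 for how='cross': cross-join the two frames and filter on the interval condition. This materializes the full cross product, which conditional_join avoids, so it only suits small frames.
out = (
    df1.filter(like='date')
       .merge(df2[['value', 'date']], how='cross')
       .query('date1 <= date <= date2')   # keep rows inside a window
       .groupby(['date1', 'date2'])['value']
       .mean()
)
print(out)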

I think I figured it out. It is not the best, but it works.
# helper 'day' columns give a same-day merge key without overwriting
# df2's timestamp column ('date')
df1['day'] = pd.to_datetime(df1['date1']).dt.date
df2['day'] = pd.to_datetime(df2['date']).dt.date
df3 = pd.merge(df1, df2, on=['day'], how='left')
mask = (df3['date'] > df3['date1']) & (df3['date'] < df3['date2'])
df4 = df3.loc[mask]
df4.groupby(['date1', 'date2'])['value'].mean()
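A tidier sketch using pd.merge_asof, assuming the windows in df1 do not overlap: pair each df2 row with the nearest earlier window start, then keep only rows that also fall before the window end.
left = df2.sort_values('date')
right = df1.sort_values('date1')
# a backward asof-merge matches each 'date' with the latest 'date1' <= 'date'
m = pd.merge_asof(left, right, left_on='date', right_on='date1')
m = m[m['date'] <= m['date2']]
print(m.groupby(['date1', 'date2'])['value'].mean())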

Related

geopandas spatial join with conditions / additional columns

I have two DataFrames with Lat, Long columns and other additional columns. For example,
import pandas as pd
import geopandas as gpd
df1 = pd.DataFrame({
    'id': [0, 1, 2],
    'dt': ['01-01-2022', '02-01-2022', '03-01-2022'],
    'Lat': [33.155480, 33.155480, 33.155480],
    'Long': [-96.731630, -96.731630, -96.731630]
})
df2 = pd.DataFrame({
    'val': ['a', 'b', 'c'],
    'dt': ['01-01-2022', '02-01-2022', '03-01-2022'],
    'Lat': [33.155480, 33.155480, 33.155480],
    'Long': [-96.731630, -96.731630, -96.731630]
})
I'd like to do a spatial join not just on Lat/Long but also on the date column. Expected output:
id          dt        Lat       Long  val
 0  01-01-2022  33.155480 -96.731630    a
 1  02-01-2022  33.155480 -96.731630    b
 2  03-01-2022  33.155480 -96.731630    c
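A sketch of one possible approach, assuming geopandas >= 0.10 and its default left/right suffixes for overlapping column names: build GeoDataFrames from the Lat/Long columns, spatially join them, then keep only the rows whose dates also agree.
# point geometries from the coordinate columns (x=Long, y=Lat)
gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat']))
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2['Long'], df2['Lat']))
joined = gpd.sjoin(gdf1, gdf2, how='inner', predicate='intersects')
# keep spatial matches only where the dates match too
out = joined[joined['dt_left'] == joined['dt_right']]
print(out[['id', 'dt_left', 'Lat_left', 'Long_left', 'val']])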

Transform dictionary value in a single pandas dataframe column into multiple columns

import numpy as np
import pandas as pd
data = {'ID': [112, 113],
        'empDetails': [[{'key': 'score', 'value': 2},
                        {'key': 'Name', 'value': 'Ajay'},
                        {'key': 'Department', 'value': 'HR'}],
                       [{'key': 'salary', 'value': 7.5},
                        {'key': 'Name', 'value': 'Balu'}]]}
dataDF = pd.DataFrame(data)
# trials
# dataDF['newColumns'] = dataDF['empDetails'].apply(lambda x: x[0].get('key'))
# dataDF = dataDF['empDetails'].apply(pd.Series)
# create dataframe
# dataDF = pd.DataFrame(dataDF['empDetails'], columns=dataDF['empDetails'].keys())
# create the dataframe
# df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in dataDF['empDetails'].items()], axis=1)
# print(dataDF['empDetails'].items())
display(dataDF)
I am trying to iterate through the empDetails column and fetch the values of Name, salary and Department into 3 different columns.
Using pd.Series I am able to split the dictionaries into different columns, but I am not able to rename the columns, as the column order may change.
What would be an effective way to do this?
Use a lambda function to extract the keys and values into new dictionaries, then pass them to the DataFrame constructor:
f = lambda x: {y['key']:y['value'] for y in x}
df = dataDF.join(pd.DataFrame(dataDF['empDetails'].apply(f).tolist(), index=dataDF.index))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Alternative solution:
f = lambda x: pd.Series({y['key']:y['value'] for y in x})
df = dataDF.join(dataDF['empDetails'].apply(f))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Or use a list comprehension (a pandas-only solution):
df1 = pd.DataFrame([{y['key']:y['value'] for y in x} for x in dataDF['empDetails']],
index=dataDF.index)
df = dataDF.join(df1)
If you are using Python 3.5+, you can unpack the dict elements and append the "ID" column in one line:
df.apply(lambda row: pd.Series({**{"ID":row["ID"]}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Update: if you want all columns from the original df, use a dict comprehension:
df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Full example:
data = {'ID': [112, 113],
        'empDetails': [[{'key': 'score', 'value': 2},
                        {'key': 'Name', 'value': 'Ajay'},
                        {'key': 'Department', 'value': 'HR'}],
                       [{'key': 'salary', 'value': 7.5},
                        {'key': 'Name', 'value': 'Balu'}]]}
df = pd.DataFrame(data)
df = df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
[Out]:
Department ID Name salary score
0 HR 112 Ajay NaN 2.0
1 NaN 113 Balu 7.5 NaN

Change index of multiple dataframes at once

Here is the earlier part of my code:
import pandas as pd
import itertools as it
import numpy as np
a = pd.read_excel(r'D:\Ph.D. IEE\IEE 6570 - IR Dr. Greene\3machinediverging.xlsx')
question = pd.DataFrame(a).set_index('Job')
df2 = question.index
permutations = list(it.permutations(question.index))
dfper = pd.DataFrame(permutations)
for i in range(len(dfper)):
    fr = dfper.iloc[0:len(dfper)]
    fr.index.name = ''
    print(fr)
for i in range(0, fr.shape[0], 1):
    print(fr.iloc[i:i+1].T)
This gives me 120 dataframes.
0
0 A
1 B
2 C
3 D
4 E
1
0 A
1 B
2 C
3 E
4 D
and so on...
I would like to change the index of these dataframes to the alphabet column (using a for loop). Any help would be really appreciated. Thank you.
import pandas as pd
df1 = pd.DataFrame({0: ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({1: ['A', 'B', 'C', 'D', 'E']})
df_list = [df1, df2]
for df in df_list:
    # set the first column as index
    df.set_index(df.columns[0], inplace=True)
I would organize all the dataframes in a list:
df_list = [df1, df2, df3, ...]
and then rebuild the list (plain assignment to the loop variable would not change the stored frames):
df_list = [df.set_index(df.columns[0]) for df in df_list]
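Applied to the permutation frames from the question (a sketch, assuming fr from the code above): build each transposed slice, then set its single column as the index.
frames = [fr.iloc[i:i+1].T for i in range(fr.shape[0])]
# each slice has one column; its letters become the index
frames = [f.set_index(f.columns[0]) for f in frames]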

Pandas extracting value from column in Dataframe

I have a column in my Dataframe that has data in the below format
id,value
101,[{'self': 'https://www.web.com/rest/api/101', 'value': 'Yes', 'id': '546'}]
The column (value) is of type pandas.core.series.Series.
I am trying to extract the text corresponding to 'value' in the above dataframe.
Expected output:
id, output
101,Yes
See if this works for you:
a = df['value'].str[0].apply(pd.Series)
df['value'] = a['value']
print(df)
Output
id value
0 101 Yes
import pandas as pd
import numpy as np
cols = ['id', 'value']
data = [
    [101, [{'self': 'https://www.web.com/rest/api/101', 'value': 'Yes', 'id': '546'}]]
]
df = pd.DataFrame(data=data, columns=cols)
df.value = df.apply(lambda x: x['value'][0]['value'], axis=1)
print(df)
Result
id value
0 101 Yes
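Another sketch, rebuilding df since the code above overwrites its value column: take the first dict from each list and let pd.json_normalize spread its keys into columns (this also recovers 'self' and 'id' if they are needed).
df = pd.DataFrame(data=data, columns=cols)
# one dict per row -> one column per key ('self', 'value', 'id')
wide = pd.json_normalize(df['value'].str[0].tolist())
df['output'] = wide['value'].to_numpy()
print(df[['id', 'output']])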

Getting standard deviation on a specific number of dates

In this dataframe...
import pandas as pd
import numpy as np
import datetime
tf = 365
dt = datetime.datetime.now()-datetime.timedelta(days=365)
df = pd.DataFrame({
    'Cat': np.repeat(['a', 'b', 'c'], tf),
    'Date': np.tile(pd.date_range(dt, periods=tf), 3),
    'Val': np.random.rand(3*tf)
})
How can I get a dictionary of the standard deviation of 'Val' for each 'Cat' over a specific number of days, counting back from the last day, for a large dataset?
This code gives the standard deviation for 10 days...
today = datetime.datetime.now()  # 'today' was undefined in the original snippet
{s: np.std(df[(df.Cat == s) &
              (df.Date > today - datetime.timedelta(days=10))].Val)
 for s in df.Cat.unique()}
...but it looks clunky.
Is there a better way?
First filter by boolean indexing and then aggregate std; because std's default ddof=1 differs from np.std's ddof=0, set it to 0:
d1 = df[(df.Date>dt-datetime.timedelta(days=10))].groupby('Cat')['Val'].std(ddof=0).to_dict()
print (d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
Another solution is to use a custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
The difference between the solutions: if no values in a group match the condition, the first solution drops the group, while the second assigns NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
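If "back from the last day" should be anchored on the data rather than the wall clock, a small sketch of that variant (assuming the df built above):
# 10-day window counted back from the last date present in the data
cutoff = df['Date'].max() - pd.Timedelta(days=10)
d = df[df['Date'] > cutoff].groupby('Cat')['Val'].std(ddof=0).to_dict()
print(d)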