Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to reindex the second DataFrame to match the first DataFrame's index, with its values interpolated against the index.
This is the code I have:
from pandas import DataFrame
df1 = DataFrame([
{'time': 0.2, 'v': 1},
{'time': 0.4, 'v': 2},
{'time': 0.6, 'v': 3},
{'time': 0.8, 'v': 4},
{'time': 1.0, 'v': 5},
{'time': 1.2, 'v': 6},
{'time': 1.4, 'v': 7},
{'time': 1.6, 'v': 8},
{'time': 1.8, 'v': 9},
{'time': 2.0, 'v': 10}
]).set_index('time')
df2 = DataFrame([
{'time': 0.25, 'v': 1},
{'time': 0.5, 'v': 2},
{'time': 0.75, 'v': 3},
{'time': 1.0, 'v': 4},
{'time': 1.25, 'v': 5},
{'time': 1.5, 'v': 6},
{'time': 1.75, 'v': 7},
{'time': 2.0, 'v': 8},
{'time': 2.25, 'v': 9}
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and exactly what I need; however, the statement seems more complicated than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could use curve_fit, but again I feel that's more complicated than it may need to be.

One idea with numpy.interp, if the values in both indices are increasing and only the single column v needs processing:
import numpy as np

df1['v1'] = np.interp(df1.index, df2.index, df2['v'])
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
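Note that np.interp clamps to the edge values outside df2's range, which is why index 0.2 gets 1.0 rather than an extrapolated 0.8. If linear extrapolation is also wanted, one option (a sketch, assuming SciPy is available and using the original df2 before it is overwritten by the reindex above) is scipy.interpolate.interp1d with fill_value='extrapolate':

from scipy.interpolate import interp1d

f = interp1d(df2.index, df2['v'], fill_value='extrapolate')
df1['v1'] = f(df1.index)  # index 0.2 now gets the extrapolated 0.8 instead of the clamped 1.0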

Related

Handling queries in pandas when a CSV input contains multiple duplicate columns?

I have a fairly simple CSV (food.csv, shown further below).
When I use pandas to read the CSV, columns that have the same name automatically get renamed with a ".n" notation, as follows:
>>> import pandas as pd
>>> food = pd.read_csv("food.csv")
>>> food
Order Number Item Description Item Cost Item Description.1 Item Cost.1 Item Description.2 Item Cost.2
0 110 Chow Mein 5.00 NaN NaN NaN NaN
1 111 Cake 1.50 Chocolate 13.10 Noodle 3.75
2 112 Chocolate 11.00 Chips 5.75 NaN NaN
3 113 Sandwich 6.25 Milk 2.00 Ice 0.50
4 114 Chocolate 13.10 Water 0.25 NaN NaN
5 115 Tea 1.00 Milkshake 2.80 Chocolate 13.10
6 116 Green Tea 1.25 NaN NaN NaN NaN
7 117 Burger 2.00 Fries 3.50 NaN NaN
8 118 Chocolate 5.00 Green Tea 1.50 NaN NaN
9 119 Tonic 3.00 Burger 3.75 Milk 2.00
10 120 Orange 1.50 Milkshake 4.20 NaN NaN
food.csv:
Order Number,Item Description,Item Cost,Item Description,Item Cost,Item Description,Item Cost
110,Chow Mein,5,,,,
111,Cake,1.5,Chocolate,13.1,Noodle,3.75
112,Chocolate,11,Chips,5.75,,
113,Sandwich,6.25,Milk,2,Ice,0.5
114,Chocolate,13.1,Water,0.25,,
115,Tea,1,Milkshake,2.8,Chocolate,13.1
116,Green Tea,1.25,,,,
117,Burger,2,Fries,3.5,,
118,Chocolate,5,Green Tea,1.5,,
119,Tonic,3,Burger,3.75,Milk,2
120,Orange,1.5,Milkshake,4.2,,
As such, queries that rely on the column names will only work if they match the first column (e.g.):
>>> print(food[(food['Item Description'] == "Chocolate") & (food['Item Cost'] == 13.10)]['Order Number'].to_string(index=False))
114
While I can technically lengthen the masks to include the .1 and .2 columns, this seems relatively inefficient, especially when the number of duplicated columns is large (in this example there are only 3 sets of duplicated columns, but other datasets have many more, and constructing a mask for each column does not scale well).
I am not sure whether I am approaching this the right way, whether I am missing something simple (perhaps when loading the CSV), or whether there are some groupbys that can answer the same question (i.e. find the order numbers where the order contains a chocolate item that costs $13.10).
Would the problem be different if the question were something like: average all the costs of chocolate paid across all the orders?
Thanks in advance.
Here's a slightly simpler approach with pandas' wide_to_long function (I will use the df provided by @mitoRibo in another answer).
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
df.rename(columns={'Item Description': 'Item Description.0', 'Item Cost': 'Item Cost.0'}, inplace=True)
long = pd.wide_to_long(df, stubnames=['Item Description', 'Item Cost'], i="Order Number", j="num_after_col_name", sep='.')
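From here, a short usage sketch (not part of the original answer) shows how the reshaped table can be queried; the column names follow the stubnames above:

long = long.dropna(how='all').reset_index()  # drop the all-NaN placeholder rows, flatten the index

# e.g. which orders contain a Chocolate item in this toy data
print(long.loc[long['Item Description'].eq('Chocolate'), 'Order Number'].unique())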
It's often easier to operate on a table in "long" form instead of the "wide" form you currently have.
There's example code below to convert from an example wide_df to a long df version.
In the long_df version each row is a unique Order/Item, and we no longer have to store any null values. Pandas also makes it easy to perform grouping operations on tables in long form; the code below also builds an aggregated group_info table per order.
You can also easily make your query of finding orders where a chocolate cost $13.10 with long_df[long_df['Description'].eq('Chocolate') & long_df['Cost'].eq(13.10)]['Order Number'].unique()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
# Convert table to long form
desc_cols = [c for c in df.columns if 'Desc' in c]
cost_cols = [c for c in df.columns if 'Cost' in c]
desc_df = df.melt(id_vars='Order Number', value_vars=desc_cols, value_name='Description')
cost_df = df.melt(id_vars='Order Number', value_vars=cost_cols, value_name='Cost')
long_df = pd.concat((desc_df[['Order Number','Description']], cost_df[['Cost']]), axis=1).dropna()
long_df.insert(1,'Item Number',long_df.groupby('Order Number').cumcount().add(1))
long_df = long_df.sort_values(['Order Number','Item Number'])
# Calculate group info
group_info = long_df.groupby('Order Number').agg(
ordered_chocolate = ('Description', lambda d: d.eq('Chocolate').any()),
total_cost = ('Cost','sum'),
)
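To tie this back to the questions in the post, the $13.10 chocolate lookup quoted above and the average-chocolate-cost follow-up become one-liners on long_df (a usage sketch; this toy data happens to contain no $13.10 chocolate, so the first result is empty):

# orders containing a chocolate item costing $13.10
# (for real data, comparing floats with np.isclose is safer than eq(13.10))
orders = long_df[long_df['Description'].eq('Chocolate')
                 & long_df['Cost'].eq(13.10)]['Order Number'].unique()

# average cost paid for chocolate across all orders
avg_choc_cost = long_df.loc[long_df['Description'].eq('Chocolate'), 'Cost'].mean()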

Combine several columns into an array, find correlations, and save to another column

Assuming I have the following toy dataframe:
Firm num1 num2 num3
A 0.1 0.2 0.3
B 0.4 1.5 9.7
C 2.1 3.7 1.5
D 6.2 2.3 5.5
I want to combine columns num1, num2, num3 into arrays (lists), so that A = [0.1, 0.2, 0.3] and create a new column that is this new array. My method so far involves converting them to strings first.
Firm arrone
A [0.1, 0.2, 0.3]
B [0.4, 1.5, 9.7]
C [2.1, 3.7, 1.5]
D [6.2, 2.3, 5.5]
Next, I want to create pairwise combinations of firms (e.g. with itertools or a cross merge). This has been achieved here:
out = df.merge(df, how='cross', suffixes=('_1', '_2')).query('Firm_1 < Firm_2')
Firm_1 Firm_2 arrone arrtwo
A B [0.1, 0.2, 0.3] [0.4, 1.5, 9.7]
A C [0.1, 0.2, 0.3] [2.1, 3.7, 1.5]
A D [0.1, 0.2, 0.3] [6.2, 2.3, 5.5]
B C [0.4, 1.5, 9.7] [2.1, 3.7, 1.5]
B D [0.4, 1.5, 9.7] [6.2, 2.3, 5.5]
C D [2.1, 3.7, 1.5] [6.2, 2.3, 5.5]
Finally, I want to create df['corrcoef'] = the correlation between arrone and arrtwo. This link has been helpful. However, the manually inputted toy example has the columns in list form, whereas my working dataset is loaded from a CSV file.
For that, I get the following error when I try the above:
ValueError: x and y must have the same length.
I wish to have a final dataset that looks like this:
Firm_1 Firm_2 arrone arrtwo corrcoef
A B [0.1, 0.2, 0.3] [0.4, 1.5, 9.7] 0.8
A C [0.1, 0.2, 0.3] [2.1, 3.7, 1.5] 0.3
A D [0.1, 0.2, 0.3] [6.2, 2.3, 5.5] 0.2
B C [0.4, 1.5, 9.7] [2.1, 3.7, 1.5] 0.5
B D [0.4, 1.5, 9.7] [6.2, 2.3, 5.5] 0.7
C D [2.1, 3.7, 1.5] [6.2, 2.3, 5.5] 0.9
If you want to convert your initial DataFrame of columns into one column with lists, you could use:
df[['Firm']].join(pd.Series(df.filter(like='num').to_numpy().tolist(),
index=df.index, name='arrone'))
output:
Firm arrone
0 A [0.1, 0.2, 0.3]
1 B [0.4, 1.5, 9.7]
2 C [2.1, 3.7, 1.5]
3 D [6.2, 2.3, 5.5]
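Building on that, here is a hedged end-to-end sketch for the pairwise correlations (my own assembly, using the Firm/num1..num3 frame from the question). If the list column was round-tripped through a CSV as strings, it has to be parsed back into lists first (e.g. with ast.literal_eval), which is the likely cause of the "x and y must have the same length" error:

import numpy as np

# list column built from the numeric columns
df['arrone'] = df.filter(like='num').to_numpy().tolist()

# all unordered firm pairs via a cross merge
out = (df[['Firm', 'arrone']]
       .merge(df[['Firm', 'arrone']], how='cross', suffixes=('_1', '_2'))
       .query('Firm_1 < Firm_2')
       .rename(columns={'arrone_1': 'arrone', 'arrone_2': 'arrtwo'}))

# Pearson correlation of the two arrays in each row
out['corrcoef'] = [np.corrcoef(a, b)[0, 1] for a, b in zip(out['arrone'], out['arrtwo'])]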

First 'Group by' then plot/save as png from pandas

First I need to filter the data, then plot each group separately and save the files to a directory:
from os import path

import matplotlib.pyplot as plt
import seaborn as sns

outpath = "path/of/your/folder/"

for id in df["set"].unique():
    df2 = df.loc[df["set"] == id]
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=df2, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig(path.join(outpath, "{0}.png".format(id)), dpi=300)
This worked for me, but it is very slow:
groups = df.groupby("set")
for name, group in groups:
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=group, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    ax1.set_yticks((0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), minor=False)
    ax1.set_xticks([0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.6], minor=False)
    fig = ax1.get_figure()
    fig.savefig("directory/{0}.png".format(name), dpi=300)
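One likely reason the loop is slow (and memory keeps climbing) is that every figure stays open after saving. Explicitly closing each figure is an addition not in the original code, but it usually helps; a condensed sketch, assuming the same df with set/x/y/result columns:

import matplotlib.pyplot as plt
import seaborn as sns

for name, group in df.groupby("set"):
    sns.set_style("whitegrid", {'grid.linestyle': '-'})
    fig = plt.figure(figsize=(12, 8))
    ax1 = sns.scatterplot(data=group, x="x", y="y", hue="result", markers=['x'], s=1000)
    ax1.get_legend().remove()
    fig.savefig("directory/{0}.png".format(name), dpi=300)
    plt.close(fig)  # release the figure; without this every figure stays in memory and the loop slows down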

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if the sum of the absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Revised answer. I believe there could be a shorter answer out there; maybe put it up as a bounty.
Data
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row:
df['absdelta']=df['delta'].abs()
Absolute delta sum and maximum sigma for each group, in a new dataframe df2:
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3=df.merge(df2, left_on='strike', right_on='strike',
suffixes=('', '_right'))
df3
Mask the groups whose absolute delta sum is not equal to 1:
m=df3['absdelta_right']!=1
m
Using the mask, apply the maximum sigma to the rows in the groups masked above:
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to get back to the original dataframe's columns:
df3.iloc[:,:-4]
Output
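Since the answer itself invites a shorter solution, here is a hedged sketch with groupby().transform() (my own suggestion, not part of the original answer), run against the df defined in the question; np.isclose guards against floating-point sums like 0.7 + 0.3 not being exactly 1:

import numpy as np

keys = ['timestamp', 'opt_expiry', 'strike']  # 'timestamp' is resolved from the index, the rest from columns

abs_sum = df.groupby(keys)['delta'].transform(lambda d: d.abs().sum()).to_numpy()
grp_max = df.groupby(keys)['sigma'].transform('max').to_numpy()

bad = ~np.isclose(abs_sum, 1)        # groups whose absolute deltas do not sum to 1
df.loc[bad, 'sigma'] = grp_max[bad]  # overwrite both rows of those groups with the larger sigma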

Why does pandas change the index value in this example?

First we create a raw dataset with a MultiIndex -
In [166]: import numpy as np; import pandas as pd
In [167]: data_raw = pd.DataFrame([
...: {'frame': 1, 'face': np.NaN, 'lmark': np.NaN, 'x': np.NaN, 'y': np.NaN},
...: {'frame': 197, 'face': 0, 'lmark': 1, 'x': 969, 'y': 737},
...: {'frame': 197, 'face': 0, 'lmark': 2, 'x': 969, 'y': 740},
...: {'frame': 197, 'face': 0, 'lmark': 3, 'x': 970, 'y': 744},
...: {'frame': 197, 'face': 0, 'lmark': 4, 'x': 972, 'y': 748},
...: {'frame': 197, 'face': 0, 'lmark': 5, 'x': 973, 'y': 752},
...: {'frame': 300, 'face': 0, 'lmark': 1, 'x': 745, 'y': 367},
...: {'frame': 300, 'face': 0, 'lmark': 2, 'x': 753, 'y': 411},
...: {'frame': 300, 'face': 0, 'lmark': 3, 'x': 759, 'y': 455},
...: {'frame': 301, 'face': 0, 'lmark': 1, 'x': 741, 'y': 364},
...: {'frame': 301, 'face': 0, 'lmark': 2, 'x': 746, 'y': 408},
...: {'frame': 301, 'face': 0, 'lmark': 3, 'x': 750, 'y': 452}]).set_index(['frame', 'face', 'lmark'])
Next we calculate the z-scores for each lmark -
In [168]: ((data_raw - data_raw.mean(level='lmark')).abs()) / data_raw.std(level='lmark')
Out[168]:
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
4.0 NaN NaN
5.0 NaN NaN
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
The index values don't change, as expected.
Now we filter out records where lmark > 3 -
In [170]: data_filtered = data_raw.loc[(slice(None), slice(None), [np.NaN, slice(3)]),:]
In [171]: data_filtered
Out[171]:
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 969.0 737.0
2.0 969.0 740.0
3.0 970.0 744.0
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
301 0.0 1.0 741.0 364.0
2.0 746.0 408.0
3.0 750.0 452.0
and recalculate the z-scores -
In [172]: ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
Out[172]:
x y
frame face lmark
1 NaN 1.0 NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
Why has the value of the first record's lmark index changed from NaN to 1.0?
I think it is a bug.
The solution is to use MultiIndex.remove_unused_levels:
data_filtered.index = data_filtered.index.remove_unused_levels()
a = ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
print (a)
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
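To see what remove_unused_levels actually changes (a small check of my own): after the .loc slice the 'lmark' level still carries every original value, and rebuilding the levels to match what is actually present is presumably why the level-based arithmetic then behaves:

print(data_filtered.index.levels[2])                         # still lists 1.0 through 5.0
print(data_filtered.index.remove_unused_levels().levels[2])  # only 1.0 through 3.0 remain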