Handling queries in pandas when a CSV input contains multiple duplicate columns?

I have a fairly simple CSV (food.csv, reproduced below).
When I use pandas to read the CSV, columns that have the same name automatically get renamed with a ".n" suffix, as follows:
>>> import pandas as pd
>>> food = pd.read_csv("food.csv")
>>> food
Order Number Item Description Item Cost Item Description.1 Item Cost.1 Item Description.2 Item Cost.2
0 110 Chow Mein 5.00 NaN NaN NaN NaN
1 111 Cake 1.50 Chocolate 13.10 Noodle 3.75
2 112 Chocolate 11.00 Chips 5.75 NaN NaN
3 113 Sandwich 6.25 Milk 2.00 Ice 0.50
4 114 Chocolate 13.10 Water 0.25 NaN NaN
5 115 Tea 1.00 Milkshake 2.80 Chocolate 13.10
6 116 Green Tea 1.25 NaN NaN NaN NaN
7 117 Burger 2.00 Fries 3.50 NaN NaN
8 118 Chocolate 5.00 Green Tea 1.50 NaN NaN
9 119 Tonic 3.00 Burger 3.75 Milk 2.00
10 120 Orange 1.50 Milkshake 4.20 NaN NaN
>>>
food.csv:
Order Number,Item Description,Item Cost,Item Description,Item Cost,Item Description,Item Cost
110,Chow Mein,5,,,,
111,Cake,1.5,Chocolate,13.1,Noodle,3.75
112,Chocolate,11,Chips,5.75,,
113,Sandwich,6.25,Milk,2,Ice,0.5
114,Chocolate,13.1,Water,0.25,,
115,Tea,1,Milkshake,2.8,Chocolate,13.1
116,Green Tea,1.25,,,,
117,Burger,2,Fries,3.5,,
118,Chocolate,5,Green Tea,1.5,,
119,Tonic,3,Burger,3.75,Milk,2
120,Orange,1.5,Milkshake,4.2,,
As such, queries that rely on the column names only work against the first pair of columns, e.g.:
>>> print(food[(food['Item Description'] == "Chocolate") & (food['Item Cost'] == 13.10)]['Order Number'].to_string(index=False))
114
While I can technically lengthen the masks to include the .1 and .2 columns, this seems relatively inefficient, especially when the number of duplicated columns is large (in this example there are only 3 sets of duplicated columns, but other datasets have many more, and constructing a mask for every column does not scale well).
I am not sure whether I am approaching this the right way, whether I am missing something simple (perhaps when loading the CSV), or whether there is a groupby that can answer the same question (i.e. find the order numbers for orders that contain chocolate priced at $13.10).
Would the problem be different if the question were something like: average the cost of chocolate across all orders?
Thanks in advance.

Here's a slightly simpler approach with pandas' wide_to_long function
(I will use the df provided by @mitoRibo in another answer).
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
    'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
    'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
    'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
    'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
    'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
    'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
    'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
    'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
    'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
    'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
    'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
    'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
    'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
    'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
# Give the first Description/Cost pair an explicit ".0" suffix so that it
# matches the same ".n" pattern as the other stub columns
df.rename(columns={'Item Description': 'Item Description.0', 'Item Cost': 'Item Cost.0'}, inplace=True)

# Reshape to long form: one row per (Order Number, item slot) pair
long = pd.wide_to_long(df, stubnames=['Item Description', 'Item Cost'],
                       i="Order Number", j="num_after_col_name", sep='.')
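With the table in long form, the queries from the question become single-column filters; here's a short sketch (note that @mitoRibo's sample data uses different prices than the original food.csv, so the 13.10 filter below is only illustrative):
# Drop the empty item slots that came from the padding columns
long = long.dropna(subset=['Item Description'])

# Order numbers of orders containing a Chocolate item at a given price
chocolate = long['Item Description'].eq('Chocolate') & long['Item Cost'].eq(13.10)
print(long[chocolate].reset_index()['Order Number'].unique())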

It's often easier to operate on a table in "long" form instead of the "wide" form you currently have.
The example code below converts an example wide_df to a long_df version.
In the long_df version each row is a unique Order/Item pair, and we no longer have to store any null values. Pandas also makes it easy to perform grouping operations on tables in long form; the group_info aggregation in the code below is an example.
You can also easily express your query of finding orders where a chocolate costs $13.10 as long_df[long_df['Description'].eq('Chocolate') & long_df['Cost'].eq(13.10)]['Order Number'].unique()
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Order Number': ['Order_01', 'Order_02', 'Order_03', 'Order_04', 'Order_05', 'Order_06', 'Order_07', 'Order_08', 'Order_09', 'Order_10'],
    'Item Description': ['Burger', 'Cake', 'Cake', 'Tonic', 'Green Tea', 'Sandwich', 'Orange', 'Burger', 'Cake', 'Chow Mein'],
    'Item Cost': [7, 10, 4, 1, 10, 7, 9, 9, 6, 3],
    'Item Description.1': ['Tonic', 'Burger', 'Green Tea', 'Sandwich', 'Orange', None, 'Chocolate', None, 'Chocolate', 'Tea'],
    'Item Cost.1': [4.0, 1.0, 7.0, 7.0, 8.0, np.nan, 6.0, np.nan, 8.0, 3.0],
    'Item Description.2': [None, 'Chow Mein', 'Chow Mein', 'Chocolate', 'Tea', None, 'Burger', None, 'Tea', 'Green Tea'],
    'Item Cost.2': [np.nan, 8.0, 1.0, 9.0, 9.0, np.nan, 2.0, np.nan, 1.0, 9.0],
    'Item Description.3': [None, 'Sandwich', 'Orange', 'Cake', 'Tonic', None, None, None, 'Sandwich', 'Burger'],
    'Item Cost.3': [np.nan, 5.0, 9.0, 2.0, 7.0, np.nan, np.nan, np.nan, 8.0, 4.0],
    'Item Description.4': [None, 'Green Tea', 'Burger', 'Green Tea', 'Cake', None, None, None, None, 'Orange'],
    'Item Cost.4': [np.nan, 4.0, 4.0, 3.0, 10.0, np.nan, np.nan, np.nan, np.nan, 1.0],
    'Item Description.5': [None, None, 'Tea', 'Burger', 'Chocolate', None, None, None, None, 'Sandwich'],
    'Item Cost.5': [np.nan, np.nan, 8.0, 5.0, 1.0, np.nan, np.nan, np.nan, np.nan, 4.0],
    'Item Description.6': [None, None, 'Tonic', 'Tea', 'Burger', None, None, None, None, 'Chocolate'],
    'Item Cost.6': [np.nan, np.nan, 8.0, 2.0, 8.0, np.nan, np.nan, np.nan, np.nan, 9.0],
})
# Convert table to long form
desc_cols = [c for c in df.columns if 'Desc' in c]
cost_cols = [c for c in df.columns if 'Cost' in c]
desc_df = df.melt(id_vars='Order Number', value_vars=desc_cols, value_name='Description')
cost_df = df.melt(id_vars='Order Number', value_vars=cost_cols, value_name='Cost')
long_df = pd.concat((desc_df[['Order Number','Description']], cost_df[['Cost']]), axis=1).dropna()
long_df.insert(1,'Item Number',long_df.groupby('Order Number').cumcount().add(1))
long_df = long_df.sort_values(['Order Number','Item Number'])
# Calculate group info
group_info = long_df.groupby('Order Number').agg(
    ordered_chocolate=('Description', lambda d: d.eq('Chocolate').any()),
    total_cost=('Cost', 'sum'),
)
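The follow-up question from the original post (average cost of chocolate across all orders) also falls out of the long table; a minimal sketch using the long_df built above:
# Average cost paid for Chocolate across all orders
avg_chocolate_cost = long_df.loc[long_df['Description'].eq('Chocolate'), 'Cost'].mean()
print(avg_chocolate_cost)

# Or per order, with the same kind of groupby used for group_info
per_order_chocolate = (long_df[long_df['Description'].eq('Chocolate')]
                       .groupby('Order Number')['Cost'].mean())
print(per_order_chocolate)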

Related

Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to update the second dataframe to match the first dataframe's index, updating its values to be interpolated using the index.
This is the code I have:
from pandas import DataFrame
df1 = DataFrame([
    {'time': 0.2, 'v': 1},
    {'time': 0.4, 'v': 2},
    {'time': 0.6, 'v': 3},
    {'time': 0.8, 'v': 4},
    {'time': 1.0, 'v': 5},
    {'time': 1.2, 'v': 6},
    {'time': 1.4, 'v': 7},
    {'time': 1.6, 'v': 8},
    {'time': 1.8, 'v': 9},
    {'time': 2.0, 'v': 10}
]).set_index('time')
df2 = DataFrame([
    {'time': 0.25, 'v': 1},
    {'time': 0.5, 'v': 2},
    {'time': 0.75, 'v': 3},
    {'time': 1.0, 'v': 4},
    {'time': 1.25, 'v': 5},
    {'time': 1.5, 'v': 6},
    {'time': 1.75, 'v': 7},
    {'time': 2.0, 'v': 8},
    {'time': 2.25, 'v': 9}
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and what I need; however, it seems like a more complicated statement than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could use curve_fit, but again I feel that's more complicated than it needs to be.
One idea is numpy.interp, if the values in both indexes are increasing and only the single column v needs to be processed:
import numpy as np

df1['v1'] = np.interp(df1.index, df2.index, df2['v'])
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
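np.interp only clamps to the edge values outside the range of df2.index (hence the 1.0 at index 0.2 rather than an extrapolated 0.8). If linear extrapolation is also wanted, one option is scipy's interp1d with fill_value='extrapolate'; a minimal sketch, assuming scipy is available:
from scipy.interpolate import interp1d

f = interp1d(df2.index, df2['v'], kind='linear', fill_value='extrapolate')
df1['v2'] = f(df1.index)  # index 0.2 now extrapolates to 0.8 instead of clamping to 1.0
print(df1)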

Inconsistent behavior of pandas MultiIndex.union() when keys involve np.nan

Python version 3.7 - 3.9
pandas 1.4.4
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'name': [np.nan, np.nan, 'John', 'Mary', 'David'],
...                     'age': [np.nan, np.nan, 20, 30, 40],
...                     'state': ['NY', 'CA', 'CA', 'NY', 'IL']})
>>> df1
name age state
0 NaN NaN NY
1 NaN NaN CA
2 John 20.0 CA
3 Mary 30.0 NY
4 David 40.0 IL
>>> df2 = pd.DataFrame({'name':['Lee', 'David', 'Mary', np.nan],
... 'age':[np.nan, 40, 30, np.nan],
... 'city':['Boston', 'Chicago', 'New York', 'Seattle']})
>>> df2
name age city
0 Lee NaN Boston
1 David 40.0 Chicago
2 Mary 30.0 New York
3 NaN NaN Seattle
>>> df1_new = df1.set_index(['name', 'age'])
>>> df2_new = df2.set_index(['name', 'age'])
>>> df1_new.index
MultiIndex([( nan, nan),
( nan, nan),
( 'John', 20.0),
( 'Mary', 30.0),
('David', 40.0)],
names=['name', 'age'])
>>> df2_new.index
MultiIndex([( 'Lee', nan),
('David', 40.0),
( 'Mary', 30.0),
( nan, nan)],
names=['name', 'age'])
>>> df1_new.index.union(df2_new.index)
MultiIndex([('David', 40.0),
( 'John', 20.0),
( 'Mary', 30.0),
( nan, nan),
( nan, nan)],
names=['name', 'age'])
>>> df2_new.index.union(df1_new.index)
MultiIndex([('David', 40.0),
( 'John', 20.0),
( 'Lee', nan),
( 'Mary', 30.0),
( nan, nan),
( nan, nan)],
names=['name', 'age'])
So the output of df1_new.index.union(df2_new.index) is missing the key ('Lee', nan) that appears in df2_new.index.union(df1_new.index). This is unexpected behavior, since mathematically a union operation should not depend on the order of its operands.
I noticed this issue while trying to figure out how to ensure that pandas.merge(df1, df2, left_index=True, right_index=True, how='outer') will include all indices.

Merge similar columns and add extracted values to dict

Given this input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': [6, np.NaN, 16, np.NaN], 'C2': [17, np.NaN, 1, np.NaN],
                   'D1': [8, np.NaN, np.NaN, 6], 'D2': [15, np.NaN, np.NaN, 12]},
                  index=[1, 1, 2, 2])
I'd like to combine columns beginning with the same letter (the Cs and Ds), as well as rows with the same index (1 and 2), and extract the non-null values into the simplest representation without duplicates, which I think is something like:
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
Using stack or groupby gets me part of the way there, but I feel like there is a more efficient way to do it.
You can rename the columns with a lambda that keeps only the first letter, use DataFrame.stack and aggregate lists per (index, letter) pair, and then create the nested dictionary with a dict comprehension:
s = df.rename(columns=lambda x: x[0]).stack().groupby(level=[0,1]).agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print (d)
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
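The nested dictionary can then be indexed by row label and first letter, for example:
print(d[1]['C'])   # [6.0, 17.0]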

Pandas unique does not work on a groupby object when applied to several columns

Let's say I have a dataframe with 3 columns, one containing the groups, and I would like to collect the unique values of the 2 other columns for each group.
Normally I would use the pandas groupby function and apply the unique method. Well, this does not work if unique is applied to more than 1 column...
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param1': [1, 5, 8, np.nan, 2, 3, np.nan],
    'param2': [5, 6, 9, 10, 11, 12, 1]
})
Apply unique on 1 column:
df.groupby('group')['param1'].unique()
group
1 [1.0, 5.0]
2 [8.0]
3 [nan, 2.0, 3.0]
4 [nan]
Name: param1, dtype: object
Apply unique on 2 columns:
df.groupby('group')[['param1', 'param2']].unique()
I get an AttributeError:
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
Instead I would expect this dataframe:
param1 param2
group
1 [1.0, 5.0] [5, 6]
2 [8.0] [9]
3 [nan, 2.0, 3.0] [10,11,12]
4 [nan] [1]
The reason for the error is that unique works only for Series, so only SeriesGroupBy.unique is implemented.
What works for me is Series.unique with conversion to a list:
df = df.groupby('group')[['param1', 'param2']].agg(lambda x: list(x.unique()))
print (df)
param1 param2
group
1 [1.0, 5.0] [5, 6]
2 [8.0] [9]
3 [nan, 2.0, 3.0] [10, 11, 12]
4 [nan] [1]
Or, starting again from the original df, pass 'unique' per column to agg:
df = df.groupby('group').agg({'param1': 'unique',
                              'param2': 'unique'})
print(df)
param1 param2
group
1 [1.0, 5.0] [5, 6]
2 [8.0] [9]
3 [nan, 2.0, 3.0] [10, 11, 12]
4 [nan] [1]
If you have many such columns and you want the same behavior (i.e. unique), you can use .stack before the groupby so you don't need to name each column manually.
df.set_index('group').stack(dropna=False).groupby(level=[0,1]).unique().unstack()
param1 param2
group
1 [1.0, 5.0] [5.0, 6.0]
2 [8.0] [9.0]
3 [nan, 2.0, 3.0] [10.0, 11.0, 12.0]
4 [nan] [1.0]

pandas same attribute comparison

I have the following dataframe:
import pandas as pd

df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
                   {'name': 'a', 'label': 'true', 'score': 8},
                   {'name': 'c', 'label': 'false', 'score': 10},
                   {'name': 'c', 'label': 'true', 'score': 4},
                   {'name': 'd', 'label': 'false', 'score': 10},
                   {'name': 'd', 'label': 'true', 'score': 6},
                   ])
I want to return the names whose "false" label score is at least double the score of the "true" label. In my example, it should return only the name "c".
First you can pivot the data, then look at the ratio and filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label false true
name
c 10 4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby().count(), you could instead compute the fraction of "true" labels per name on the original frame:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3.
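Since orig_df is not shown in the post, here is a hedged sketch with a hypothetical pre-aggregation frame just to illustrate the idea:
# Hypothetical orig_df (not the poster's actual data): one row per observed label.
# 'c' has 2 false / 1 true, so its true-fraction is 1/3 and it is selected.
orig_df = pd.DataFrame({'name':  ['a', 'a', 'a', 'c', 'c', 'c', 'd', 'd'],
                        'label': ['false', 'true', 'true',
                                  'false', 'false', 'true',
                                  'false', 'true']})

true_ratio = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
print(true_ratio[true_ratio <= 1/3].index.values)   # ['c']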