pandas convert lists in multiple columns within DataFrame to separate columns - pandas

I am trying to convert lists stored in multiple columns of a pandas DataFrame into separate columns.
Say I have a dataframe like this:
           0          1
0  [1, 2, 3]  [4, 5, 6]
1  [1, 2, 3]  [4, 5, 6]
2  [1, 2, 3]  [4, 5, 6]
And would like to convert it to something like this:
   0  1  2  0  1  2
0  1  2  3  4  5  6
1  1  2  3  4  5  6
2  1  2  3  4  5  6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import pandas as pd
df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)

If you don't care about the column names you could do:
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6

Using sum (summing the lists across each row concatenates them):
pd.DataFrame(df.sum(1).tolist())
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
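If the column names do matter, one possible sketch (not taken from the answers above) is to expand each column separately and let pd.concat label the pieces with the original column as an outer level. The name expanded and the dict comprehension are my own choices for illustration:

```python
import pandas as pd

df = pd.DataFrame([[[1, 2, 3], [4, 5, 6]]] * 3)

# Expand every list column into its own small frame, keyed by the source column.
# Passing a dict to pd.concat makes the keys the outer level of the new columns.
expanded = pd.concat(
    {col: pd.DataFrame(df[col].tolist(), index=df.index) for col in df.columns},
    axis=1,
)
print(expanded)
```

The result has two-level column labels such as (0, 0) and (1, 2), so the inner 0 1 2 numbering from the desired output is kept per source column.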

Related

pandas sort_values with condition

I have a dataframe that I'd like to sort on the columns time and b, where the sort direction of b depends on the value of a. If a == 1, b should be sorted from highest to lowest, and if a == -1, from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']), but I think that always sorts b from lowest to highest.
df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
desired output
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use it as sorting key:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
alternative (negate the key so that the default ascending sort can be used):
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Pass ascending after creating an additional column for sorting:
out = df.assign(key = df.a*df.b).sort_values(['time','key'],ascending=[True,False]).drop(columns='key')
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
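For what it's worth, the same ordering can be obtained without adding a temporary column. The following is only a sketch using np.lexsort, which is my own substitution and not something the answers above use; it assumes a only ever takes the values 1 and -1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 3, 2, 2, 1],
                   'a': [1, -1, 1, 1, -1],
                   'b': [4, 5, 1, 6, 2]})

# np.lexsort sorts by the LAST key first, so 'time' is the primary key
# and -(a*b) breaks ties within each time value (largest a*b first).
order = np.lexsort((-(df['a'] * df['b']).to_numpy(), df['time'].to_numpy()))
out = df.iloc[order]
print(out)
```

Since df itself is never modified, there is no helper column to drop afterwards.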

Insert a level 0 in the existing data frame such that 4 columns are grouped as one

I want to add a MultiIndex level to my data frame so that MAE, MSE, RMSE, MPE are grouped together under a new index level. Similarly, the remaining four columns should be grouped together at the same level but under a different name.
mux3 = pd.MultiIndex.from_product([list('ABCD'), list('1234')],
names=['one', 'two'])  # dummy data
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)  # dummy data frame
print(df3)  # intended output required for the data frame in the picture given below
Assuming the column groups are already in the appropriate order, we can simply create an np.arange over the length of the columns, floor divide by 4 to get the groups, and build a MultiIndex with MultiIndex.from_arrays.
Sample Input and Output:
import numpy as np
import pandas as pd
initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, len(initial_index))), columns=initial_index
)
1 2 3 4 1 2 3 4 1 2 3 4 # Column headers are in repeating order
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
# Create New Columns
df3.columns = pd.MultiIndex.from_arrays([
np.arange(len(df3.columns)) // 4, # Group Each set of 4 columns together
df3.columns # Keep level 1 the same as current columns
], names=['one', 'two']) # Set Names (optional)
df3
one 0 1 2
two 1 2 3 4 1 2 3 4 1 2 3 4
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
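Since the question asks for named groups rather than 0, 1, 2, here is a small sketch of the same idea with string labels. The labels model_a and model_b and the dummy values are purely hypothetical; only the metric names come from the question:

```python
import numpy as np
import pandas as pd

metrics = ['MAE', 'MSE', 'RMSE', 'MPE']
# Dummy frame: two blocks of the same four metric columns, already in order.
df = pd.DataFrame(np.arange(24).reshape(3, 8), columns=metrics * 2)

group_labels = ['model_a', 'model_b']  # hypothetical names for the new level
df.columns = pd.MultiIndex.from_arrays(
    [np.repeat(group_labels, len(metrics)),  # each label repeated 4 times
     df.columns],                            # keep the metric names as level 1
    names=['one', 'two'],
)
print(df)
```

Afterwards df['model_b'] selects the second block of four columns as an ordinary DataFrame.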
If columns are in mixed order:
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, 8)), columns=[1, 1, 3, 2, 4, 3, 2, 4]
)
df3
1 1 3 2 4 3 2 4 # Cannot select groups positionally
0 3 6 6 0 9 8 4 7
1 0 0 7 1 5 7 0 1
2 4 6 2 9 9 9 9 1
We can convert the columns with Index.to_series, enumerate repeated labels using groupby cumcount, then sort_index if needed to put them in order:
df3.columns = pd.MultiIndex.from_arrays([
# Enumerate Groups to create new level 0 index
df3.columns.to_series().groupby(df3.columns).cumcount(),
df3.columns
], names=['one', 'two']) # Set Names (optional)
# Sort to Order Correctly
# (Do not sort before setting the columns; it will break alignment with the data)
df3 = df3.sort_index(axis=1)
df3
one 0 1
two 1 2 3 4 1 2 3 4 # Notice Data has moved with headers
0 3 0 6 9 6 4 8 7
1 0 1 7 5 0 0 7 1
2 4 9 2 9 6 9 9 1
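To see what the cumcount step produces on its own, here is a minimal sketch with a single row of dummy data, where each value equals its original column position so the movement is easy to follow. The names cols and occurrence are just for illustration:

```python
import numpy as np
import pandas as pd

cols = pd.Index([1, 1, 3, 2, 4, 3, 2, 4])
df = pd.DataFrame([np.arange(8)], columns=cols)  # value == original position

# For each label, count how many times it has already appeared to its left.
occurrence = cols.to_series().groupby(cols).cumcount()
print(occurrence.tolist())

df.columns = pd.MultiIndex.from_arrays(
    [occurrence.to_numpy(), cols], names=['one', 'two']
)
df = df.sort_index(axis=1)  # data moves together with its header
print(df)
```

The first appearance of every label lands in group 0 and the second in group 1, regardless of where it sat originally.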

Pandas Dataframe get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day':[3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
'price':np.random.randint(1,30,11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to get the trend from the groupby series and put it back into the dataframe column price. The new price is the variation of the item price with respect to the previous day's price:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me to code the last step. A single/double line code will be most helpful. As the actual dataframe is huge, I would like to avoid iterations.
Hope this helps!
#get the average values
mean_df=df1.groupby(['day','item'])['price'].mean().reset_index()
#rename columns
mean_df.columns=['day','item','average_price']
#sort by day and item in ascending order
mean_df=mean_df.sort_values(by=['day','item'])
#shift the price for each item and each day
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
#combine with original df
df1=pd.merge(df1,mean_df,on=['day','item'])
#replace the price by difference of previous day's
df1['price']=df1['price']-df1['shifted_average_price']
#drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
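A somewhat shorter variant of the same steps is sketched below. It keeps the groupby result as a Series and joins it back on day and item instead of merging. The prices are typed in from the table in the question rather than regenerated with the random seed, and prev_avg is a name I made up:

```python
import pandas as pd

df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': [6, 12, 13, 9, 10, 12, 6, 16, 1, 17, 2]})

# Average price per (day, item); groupby sorts the keys, so days are ascending.
avg = df1.groupby(['day', 'item'])['price'].mean()
# Within each item, take the average from the previous day on which it appears.
prev_avg = avg.groupby(level='item').shift().rename('prev_avg')

# Join the shifted averages back onto the original rows and take the difference.
out = df1.join(prev_avg, on=['day', 'item'])
out['price'] = out['price'] - out['prev_avg']
out = out.drop(columns='prev_avg')
print(out)
```

Note that, like the answer above, this compares against the previous day on which the item was observed, which is not necessarily the previous calendar day.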

groupby list of lists of indexes

I have a list of np.arrays representing indexes of a pandas dataframe.
I need to group by index so that each array yields one group.
let's say, that is the df:
index values
0 2
1 3
2 2
3 2
4 4
5 4
6 1
7 4
8 4
9 4
and that is the list of np.arrays:
[array([0, 1, 2, 3]), array([6, 7, 8])]
From this data I expect to get 2 groups, without loop operations, as a single groupby object:
group1:
index values
0 2
1 3
2 2
3 2
group2:
index values
6 1
7 4
8 4
I would stress again that finally I need to get a single groupby object.
Thank you!
I am still using a for-loop to create the groupby key dict:
l=[np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
df=pd.DataFrame([2, 3, 2, 2, 4, 4, 1, 4, 4, 4],columns=['values'])
from collections import ChainMap
L=dict(ChainMap(*[dict.fromkeys(y,x) for x, y in enumerate(l)]))
list(df.groupby(L))
Out[33]:
[(0.0, values
index
0 2
1 3
2 2
3 2), (1.0, values
index
6 1
7 4
8 4)]
df=pd.DataFrame([2,3,2,2,4,4,1,4,4,4],columns=['values'])
df.index.name ='index'
l=[np.array([0, 1, 2, 3]), np.array([6, 7, 8])]
group1= df.loc[pd.Series(l[0])]
group2= df.loc[pd.Series(l[1])]
This seems like an X-Y problem:
l = [np.array([0,1,2,3]), np.array([6,7,8])]
df_indx = pd.DataFrame(l).stack().reset_index()
df_new = df.assign(foo=df['index'].map(df_indx.set_index(0)['level_0']))
for n,g in df_new.groupby('foo'):
print(g)
Output:
index values foo
0 0 2 0.0
1 1 3 0.0
2 2 2 0.0
3 3 2 0.0
index values foo
6 6 1 1.0
7 7 4 1.0
8 8 4 1.0
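Another way to get a single groupby object is to build the key as a Series indexed by the row labels, relying on groupby aligning a Series key on the index. This is only a sketch under that assumption; group_ids and key are names I introduced, and rows that appear in none of the arrays are simply left out of every group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [2, 3, 2, 2, 4, 4, 1, 4, 4, 4]})
l = [np.array([0, 1, 2, 3]), np.array([6, 7, 8])]

# One group id per array, repeated once for every index label in that array.
group_ids = np.repeat(np.arange(len(l)), [len(a) for a in l])
key = pd.Series(group_ids, index=np.concatenate(l))

# groupby aligns the Series on df's index; unmatched rows get NaN and are dropped.
grouped = df.groupby(key)
for name, group in grouped:
    print(name, group.index.tolist())
```

The only Python-level loop left is over the arrays themselves (to get their lengths), not over the individual index values.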

Merging dataframes in pandas

I am new to pandas and I am facing the following problem:
I have 2 data frames:
df1 :
x y
1 3 4
2 nan
3 6
4 nan
5 9 2
6 1 4 9
df2:
x y
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
The size of the two is same.
I want to merge the two dataframes such that the resulting dataframe is the following:
result :
x y
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 6 7
So in the result, priority is given to df2. If there is a value in df2, it is put first and the remaining values are put from df1 (they have the same position as in df1). There should be no repeated values in the result (i.e if a value is in position 1 in df1 and position 3 in df2, then that value should come only in position 1 in the result and not repeat)
Any kind of help will be appreciated.
Thanks!
IIUC
Setup
df1 = pd.DataFrame(dict(x=range(1, 7),
y=[[3, 4], None, [6], None, [9, 2], [1, 4, 9]]))
df2 = pd.DataFrame(dict(x=range(1, 7), y=[[2, 3, 6, 1, 5], [4, 1, 8, 7, 5],
[6, 3, 1, 4, 5], [2, 1, 3, 5, 4],
[9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]))
print(df1)
print()
print(df2)
x y
0 1 [3, 4]
1 2 None
2 3 [6]
3 4 None
4 5 [9, 2]
5 6 [1, 4, 9]
x y
0 1 [2, 3, 6, 1, 5]
1 2 [4, 1, 8, 7, 5]
2 3 [6, 3, 1, 4, 5]
3 4 [2, 1, 3, 5, 4]
4 5 [9, 2, 3, 8, 7]
5 6 [1, 4, 5, 3, 7]
convert to something more usable:
df1_ = df1.set_index('x').y.apply(pd.Series)
df2_ = df2.set_index('x').y.apply(pd.Series)
print(df1_)
print()
print(df2_)
0 1 2
x
1 3.0 4.0 NaN
2 NaN NaN NaN
3 6.0 NaN NaN
4 NaN NaN NaN
5 9.0 2.0 NaN
6 1.0 4.0 9.0
0 1 2 3 4
x
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
Combine with priority given to df1 (I think you meant df1, as that is what was consistent with my interpretation of your question and the expected output you provided), then reduce to eliminate duplicates:
print(df1_.combine_first(df2_).apply(lambda x: x.unique(), axis=1))
0 1 2 3 4
x
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 9 3 7
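For reference, here is a Python 3 sketch of just the combine step. It expands the lists with the DataFrame constructor instead of apply(pd.Series), which is my own substitution, and expand is a hypothetical helper. It does not attempt the duplicate removal, and the final astype(int) assumes no NaN is left after combining, which holds for this sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'x': range(1, 7),
                    'y': [[3, 4], None, [6], None, [9, 2], [1, 4, 9]]})
df2 = pd.DataFrame({'x': range(1, 7),
                    'y': [[2, 3, 6, 1, 5], [4, 1, 8, 7, 5], [6, 3, 1, 4, 5],
                          [2, 1, 3, 5, 4], [9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]})

def expand(frame):
    # Hypothetical helper: one column per list position, indexed by x.
    # Missing lists (None) are treated as empty and become all-NaN rows.
    lists = [v if isinstance(v, list) else [] for v in frame['y']]
    return pd.DataFrame(lists, index=frame['x'])

# Values from df1 win; gaps are filled position by position from df2.
combined = expand(df1).combine_first(expand(df2)).astype(int)
print(combined)
```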