Convert tabular pandas DataFrame into nested pandas DataFrame - pandas

Suppose I have a simple pd.DataFrame like so:
d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)
df
   col1  col2  col3
0     1     3     5
1    20    40    50
Is there a way to convert this to a nested pandas DataFrame (df_new), so that calling df_new.values[0] gives this output:
array(
[0 1
1 3
2 5
Length: 3, dtype: int], dtype=object)

I still don't think I understand the exact requirement, but here is something:
One way of getting the desired output is this:
>>> pd.Series(df.T[0].values)
0 1
1 3
2 5
dtype: int64
If you want to have these as 2d arrays:
>>> np.array(pd.DataFrame(df.T[0].values).reset_index())
array([[0, 1],
[1, 3],
[2, 5]])
>>> np.array(pd.DataFrame(df.T[1].values).reset_index())
array([[ 0, 20],
[ 1, 40],
[ 2, 50]])
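If the goal is literally a DataFrame whose cells each hold a pandas Series, one possible reading of the question can be sketched as below. The column name nested and the variable cells are made up for illustration. Storing Series objects inside cells is generally discouraged, so treat this as a sketch of one interpretation rather than a recommended layout.

```python
import numpy as np
import pandas as pd

d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)

# Pre-allocate an object array so each slot can hold a whole Series.
cells = np.empty(len(df), dtype=object)
for i, row in enumerate(df.to_numpy()):
    cells[i] = pd.Series(row)

# 'nested' is a hypothetical column name, not taken from the question.
df_new = pd.DataFrame({'nested': cells})

# One-element object array wrapping the Series built from the first row.
first = df_new.values[0]
print(first[0])
```

Here df_new.values[0] is an object array with a single element, and that element is the Series built from the first row of df.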

Related

How to group dataframe rows on unique elements in a specific column?

As an example, how do I convert df to df1, by gathering rows into matrices based on shared values in a specific column tidx?
>>> df = pd.DataFrame({'col3':[[1,40],[2,50],[3,60],[4,70]], 'tidx':[21,22,23,21]})
>>> df['col3'] = df['col3'].apply(np.array)
>>> df
col3 tidx
0 [1, 40] 21
1 [2, 50] 22
2 [3, 60] 23
3 [4, 70] 21
>>> df1 = pd.DataFrame({'col3':[[[1,40],[4,70]],[[2,50]],[[3,60]]], 'tidx':[21,22,23]})
>>> df1['col3'] = df1['col3'].apply(np.array)
>>> df1
col3 tidx
0 [[1, 40], [4, 70]] 21
1 [[2, 50]] 22
2 [[3, 60]] 23
You can use .groupby and then apply the list function, as shown in the example below.
df = pd.DataFrame({'col3':[[1,40],[2,50],[3,60],[4,70]], 'tidx':[21,22,23,21]})
df1 = df.groupby('tidx')['col3'].apply(list).reset_index()
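The groupby result holds plain lists, while the question asks for matrices. A minimal sketch of the extra step, reusing the question's own .apply(np.array) conversion on the grouped column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col3': [[1, 40], [2, 50], [3, 60], [4, 70]],
                   'tidx': [21, 22, 23, 21]})

# Collect the rows of each tidx group into a list of [a, b] pairs.
df1 = df.groupby('tidx')['col3'].apply(list).reset_index()

# Turn each list of pairs into a 2-D array, as in the question's df1.
df1['col3'] = df1['col3'].apply(np.array)
print(df1)
```

Note that reset_index puts tidx first and col3 second, so the column order differs from the df1 shown in the question.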

pandas: most elegant way to pivot table on pattern in name of columns

Given the following DataFrame:
df = pd.DataFrame({
'x': [0, 1],
'y': [0, 1],
'a_idx': [0, 1],
'a_val': [2, 3],
'b_idx': [4, 5],
'b_val': [6, 7],
})
What is the cleanest way to pivot the DataFrame based on the prefix of the idx and val columns if you have an indeterminate amount of unique prefixes (a, b, ... n), so as to obtain the following DataFrame?
pd.DataFrame({
'x': [0, 1, 0, 1],
'y': [0, 1, 0, 1],
'key': ['a','a','b','b'],
'idx': [0, 1, 4, 5],
'val': [2, 3, 6, 7]
})
I am not very knowledgeable in pandas, so my easiest solution was to go earlier in the data generation process and generate a subset of the result DataFrame for each prefix in SQL, and then concat the result sets into a final DataFrame. I'm curious however if there is a simple way to do this using the API of pandas.DataFrame. Is there such a thing?
Let's try wide_to_long with extras:
(pd.wide_to_long(df,stubnames=['a','b'],
i=['x','y'],
j='key',
sep='_',
suffix='\\w+'
)
.unstack('key').stack(level=0).reset_index()
)
Or manually with melt:
out = df.melt(['x', 'y'])
out = (out.join(out['variable'].str.split('_', expand=True))
.rename(columns={0: 'key'})
.pivot_table(index=['x', 'y', 'key'], columns=[1], values='value')
.reset_index()
)
Output:
key  x  y level_2  idx  val
0    0  0       a    0    2
1    0  0       b    4    6
2    1  1       a    1    3
3    1  1       b    5    7
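The concat idea from the question can also be done directly in pandas. The sketch below is a different technique from wide_to_long and melt: it builds one block per prefix and stacks them. The names prefixes and parts are illustrative, and it assumes every prefix has both a _idx and a _val column.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    'a_idx': [0, 1],
    'a_val': [2, 3],
    'b_idx': [4, 5],
    'b_val': [6, 7],
})

# Discover the prefixes from the *_idx columns instead of hard-coding them.
prefixes = sorted(c[:-len('_idx')] for c in df.columns if c.endswith('_idx'))

# One x/y/key/idx/val block per prefix, stacked on top of each other.
parts = [
    df[['x', 'y']].assign(key=p, idx=df[f'{p}_idx'], val=df[f'{p}_val'])
    for p in prefixes
]
out = pd.concat(parts, ignore_index=True)
print(out)
```

This keeps the rows grouped by prefix (a, a, b, b), which matches the row order of the target DataFrame in the question.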

pandas.reorder_levels with only one index

Pandas offers a way to reorder index levels with the reorder_levels function:
pandas.DataFrame({"A" : [1, 2, 3], "B" : [4,5,6], "C" : [7,8,9]}).set_index(["A", "B"]).reorder_levels(["B", "A"])
However, it doesn't seem to work with single-indexed DataFrames:
pandas.DataFrame({"A" : [1, 2, 3], "B" : [4,5,6]}).set_index("A").reorder_levels(["A"])
Am I doing something incorrectly?
PS: I know it doesn't make sense to reorder an Index that has only one level, but it is an edge case and I usually try to avoid unnecessary if statements for code clarity.
Doing the following is equivalent to reordering:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}).set_index(["A", "B"])
print(df.reset_index().set_index(['B', 'A']))
Output
     C
B A
4 1  7
5 2  8
6 3  9
And it works with a single index:
odf = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).set_index("A")
print(odf.reset_index().set_index(['A']))
Output
   B
A
1  4
2  5
3  6
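To avoid repeating the reset_index/set_index pair at each call site, the idea can be wrapped in a small helper. The function name reorder_index is hypothetical (it is not a pandas API), and the sketch assumes every index level has a name, since reset_index needs names to create the columns that set_index then picks up.

```python
import pandas as pd

def reorder_index(frame, order):
    """Hypothetical helper: rebuild the index with its levels in `order`.

    Works for single and multi-level indexes alike, provided all levels
    are named.
    """
    return frame.reset_index().set_index(list(order))

multi = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}).set_index(["A", "B"])
single = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).set_index("A")

multi_reordered = reorder_index(multi, ["B", "A"])
single_reordered = reorder_index(single, ["A"])
print(multi_reordered)
print(single_reordered)
```

Unlike reorder_levels, this rebuilds the index from columns, so it does more work on large frames. That is the trade-off for not needing a branch on the index type.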

Sort a dictionary in a column in pandas

I have a dataframe as shown below.
user_id Recommended_modules Remaining_modules
1 {A:[5,11], B:[4]} {A:2, B:1}
2 {A:[8,4,2], B:[5], C:[6,8]} {A:7, B:1, C:2}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[8,4,2], B:[5,1,2], C:[6]} {A:3, B:4, C:1}
Brief about the dataframe:
In the column Recommended_modules A, B and C are courses and the numbers inside the list are modules.
Key(Remaining_modules) = Course name
value(Remaining_modules) = Number of modules remaining in that course
From the above I would like to reorder the recommended_modules column based on the values in the Remaining_modules as shown below.
Expected Output:
user_id Ordered_Recommended_modules Ordered_Remaining_modules
1 {B:[4], A:[5,11]} {B:1, A:2}
2 {B:[5], C:[6,8], A:[8,4,2]} {B:1, C:2, A:7}
3 {B:[8], A:[2,3,9]} {B:1, A:5}
4 {C:[6], A:[8,4,2], B:[5,1,2]} {C:1, A:3, B:4}
Explanation:
For user_id = 2, Remaining_modules = {A:7, B:1, C:2}, sort like this {B:1, C:2, A:7}
similarly arrange Recommended_modules also in the same order as shown below
{B:[5], C:[6,8], A:[8,4,2]}.
It is possible, but it relies on dicts keeping insertion order, so it needs Python 3.6+ (guaranteed by the language from 3.7):
def f(x):
    # https://stackoverflow.com/a/613218/2901002
    d1 = {k: v for k, v in sorted(x['Remaining_modules'].items(), key=lambda item: item[1])}
    L = d1.keys()
    # https://stackoverflow.com/a/21773891/2901002
    d2 = {key: x['Recommended_modules'][key] for key in L if key in x['Recommended_modules']}
    x['Remaining_modules'] = d1
    x['Recommended_modules'] = d2
    return x
df = df.apply(f, axis=1)
print (df)
user_id Recommended_modules \
0 1 {'B': [4], 'A': [5, 11]}
1 2 {'B': [5], 'C': [6, 8], 'A': [8, 4, 2]}
2 3 {'B': [8], 'A': [2, 3, 9]}
3 4 {'C': [6], 'A': [8, 4, 2], 'B': [5, 1, 2]}
Remaining_modules
0 {'B': 1, 'A': 2}
1 {'B': 1, 'C': 2, 'A': 7}
2 {'B': 1, 'A': 5}
3 {'C': 1, 'A': 3, 'B': 4}
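For a self-contained variant, the sketch below rebuilds the first two rows of the sample data and writes the result into new columns named as in the expected output, instead of overwriting the originals. The helper name order_by_remaining is made up for illustration, and the dict keys are assumed to be plain strings.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2],
    'Recommended_modules': [{'A': [5, 11], 'B': [4]},
                            {'A': [8, 4, 2], 'B': [5], 'C': [6, 8]}],
    'Remaining_modules': [{'A': 2, 'B': 1},
                          {'A': 7, 'B': 1, 'C': 2}],
})

def order_by_remaining(recommended, remaining):
    # Course names sorted by how many modules remain, fewest first.
    keys = sorted(remaining, key=remaining.get)
    ordered_recommended = {k: recommended[k] for k in keys if k in recommended}
    ordered_remaining = {k: remaining[k] for k in keys}
    return ordered_recommended, ordered_remaining

pairs = [order_by_remaining(rec, rem)
         for rec, rem in zip(df['Recommended_modules'], df['Remaining_modules'])]
df['Ordered_Recommended_modules'] = [p[0] for p in pairs]
df['Ordered_Remaining_modules'] = [p[1] for p in pairs]
print(df[['user_id', 'Ordered_Recommended_modules', 'Ordered_Remaining_modules']])
```

Because sorted is stable, courses with the same number of remaining modules keep their original relative order.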

Pandas: Replace every value in every column matching a pattern with a value from another column in that row

I have a dataframe with 1000 columns. I want to replace every -9 value in every column with that row's df['a'] value.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
What I want is
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
I have tried
df.replace(-9, df['a'], inplace = True)
And
df.replace(-9, np.nan, inplace = True)
df.fillna(df.a, inplace = True)
But they don't change the df.
My solution right now is to use a for loop:
df.replace(-9, np.nan, inplace = True)
col_list = list(df)
for i in col_list:
    df[i].fillna(df['a'], inplace = True)
This solution works, but it also replaces any np.nan values. Any ideas as to how I can replace just the -9 values without first converting it into np.nan? Thanks.
I think you need mask:
df = df.mask(df == -9, df['a'], axis=0)
print (df)
a b c
0 1 6.0 1
1 2 2.0 19
2 3 8.0 3
3 4 NaN 4
4 5 5.0 5
Or:
df = pd.DataFrame(np.where(df == -9, df['a'].values[:, None], df), columns=df.columns)
print (df)
a b c
0 1.0 6.0 1.0
1 2.0 2.0 19.0
2 3.0 8.0 3.0
3 4.0 NaN 4.0
4 5.0 5.0 5.0
You can also do something like this:
import numpy as np
import pandas as pd
df_tar = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
df.loc[df['b']==-9,'b']=df.loc[df['b']==-9,'a']
df.loc[df['c']==-9,'c']=df.loc[df['c']==-9,'a']
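With 1000 columns, writing one .loc line per column does not scale, so the same assignment can be put in a loop. This is a sketch that assumes 'a' is the only column to skip. The variable name hit is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, -9, 8, np.nan, -9],
                   'c': [-9, 19, -9, -9, -9]})

# Visit every column except the source column 'a'.
for col in df.columns.drop('a'):
    hit = df[col] == -9                  # rows where this column holds -9
    df.loc[hit, col] = df.loc[hit, 'a']  # copy that row's 'a' value in
print(df)
```

Existing NaN values are left alone, because NaN == -9 evaluates to False and those rows are never selected.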