Pivoting and transposing using pandas dataframe - pandas

Suppose that I have a pandas dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
|fk ID| value | valId |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
and I would like to transpose and pivot it in such a way that I get the following table and the same order of column names:
fk ID value valID fk ID value valID
| 1 | 3 | 1 | 1 | 3 | 2 |
| 2 | 4 | 1 | 2 | 5 | 2 |

The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())
res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal and can be very slow if each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
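>>> # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
>>> df2 = pd.concat([df, pd.DataFrame([{'fk ID': 3, 'value': 10, 'valID': 4}])],
...                 ignore_index=True)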
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0 fk ID 1
value 3
valID 1
fk ID 1
value 3
valID 2
1 fk ID 2
value 4
valID 1
fk ID 2
value 5
valID 2
2 fk ID 3
value 10
valID 4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np
def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)
# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)
# get the maximum group size
max_group_size = groups.size().max()
# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)
# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)
# update the columns
res.columns = new_cols
Output:
# df
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1 3 2
1 2 4 1 2 5 2
# df2
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1.0 3.0 2.0
1 2 4 1 2.0 5.0 2.0
2 3 10 4 NaN NaN NaN

You can convert df to a NumPy array, reshape it, convert it back to a DataFrame, and then rename the columns (0..5).
This also works if the values are strings rather than numbers.
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
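The reshape above hard-codes nrows = 2. A minimal sketch of deriving it from the data instead, assuming the rows are already ordered by 'fk ID' and every group has the same number of rows (the name `wide` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# Assumption: rows are sorted by 'fk ID' and all groups have equal size,
# so one output row per distinct 'fk ID' is the right shape.
nrows = df['fk ID'].nunique()
wide = pd.DataFrame(df.to_numpy().reshape(nrows, -1))

# Repeat the original column names across the widened frame.
repeats = wide.shape[1] // len(df.columns)
wide.columns = list(df.columns) * repeats
print(wide)
```

If the groups have different sizes, the reshape raises a ValueError or silently misaligns rows, so this only fits the equal-size case.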

If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
fk ID value valID fk ID_2 value_2 valID_2
0 1 3 2 1 3 1
1 2 5 2 2 4 1
I'd expect this to be pretty performant. It could be generalized to any number of rows per group (rather than assuming odd/even for two rows per group), but it requires the same number of rows for every fk ID.
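One possible way to sketch that generalization is to number the rows inside each group with groupby(...).cumcount() and place the k-th rows of all groups side by side. This is an illustration rather than the answer's own code; it uses a three-rows-per-group example and assumes equal group sizes, and the names `pos`, `parts`, and `wide` are made up here:

```python
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 1, 2, 2, 2],
                   'value': [3, 3, 7, 4, 5, 8],
                   'valID': [1, 2, 3, 1, 2, 3]})

# Position of each row within its 'fk ID' group: 0, 1, 2, 0, 1, 2
pos = df.groupby('fk ID').cumcount()

# One slice per position; the suffix keeps column names unique.
parts = [
    df[pos == k].reset_index(drop=True).add_suffix(f'_{k + 1}')
    for k in range(pos.max() + 1)
]

# Assumption: every group has the same size, so the slices line up row by row.
wide = pd.concat(parts, axis=1)
print(wide)
```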

Related

insert column to df on sequenced location

I have a df like this:
id month
1 1
1 3
1 4
1 6
I want to transform it so that it looks like this:
id 1 2 3 4 5 6
1 1 0 1 1 0 1
I've tried using this code:
ndf = df[['id']].join(pd.get_dummies(
df['month'])).groupby('id').max()
but the result looks like this:
id 1 3 4 6
1 1 1 1 1
How can I insert the middle columns (2 and 5) even if they're not in the data?
You can use pd.crosstab instead, then create the full set of columns using pd.RangeIndex based on the min and max month, and finally use DataFrame.reindex (and optionally DataFrame.reset_index afterwards):
import pandas as pd
# RangeIndex excludes its stop value, so add 1 to include the max month
new_cols = pd.RangeIndex(df['month'].min(), df['month'].max() + 1)
res = (
    pd.crosstab(df['id'], df['month'])
    .reindex(columns=new_cols, fill_value=0)
    .reset_index()
)
Output:
>>> res
id 1 2 3 4 5 6
0 1 1 0 1 1 0 1
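The same reindex idea can also be applied to the get_dummies attempt from the question. A self-contained sketch, where the sample frame is rebuilt from the table in the question and `all_months` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1], 'month': [1, 3, 4, 6]})

# Every month between the smallest and largest one, inclusive.
all_months = range(df['month'].min(), df['month'].max() + 1)

ndf = (
    pd.get_dummies(df['month'])      # one indicator column per observed month
    .groupby(df['id']).max()         # collapse to one row per id
    .reindex(columns=all_months, fill_value=0)  # add the missing months as 0
    .astype(int)                     # get_dummies may return booleans
    .reset_index()
)
print(ndf)
```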

Leave the first TWO dates for each id

I have a dataframe of id number and dates:
import pandas as pd
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
df
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 1 04/05/2003
3 2 01/05/2010
4 2 08/08/2009
5 3 12/11/2008
I am looking for a way to keep, for each id, only the first TWO dates (i.e. the two earliest dates).
For the example above the output would be:
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 2 08/08/2009
3 2 01/05/2010
4 3 12/11/2008
Thanks!
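Ensure the column is a timestamp:
df['start_date']=pd.to_datetime(df['start_date'])
Sort the values:
df=df.sort_values(by=['id','start_date'])
Group and select only the first 2 rows per id (note the double brackets; selecting several columns with a bare tuple is no longer supported):
df_=df.groupby('id')[['id','start_date']].head(2)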
Just group by id and then you can call head. Be sure to sort your values first.
df = df.sort_values(['id', 'start_date'])
df.groupby('id').head(2)
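Put together as a runnable sketch, using the question's data. The explicit format string is an assumption that the dates are month/day/year, which is consistent with the parsed output shown in this thread (e.g. 01/07/2002 becoming 2002-01-07); `first_two` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame([['1', '01/01/2000'], ['1', '01/07/2002'], ['1', '04/05/2003'],
                   ['2', '01/05/2010'], ['2', '08/08/2009'],
                   ['3', '12/11/2008']], columns=['id', 'start_date'])

# Assumption: dates are month/day/year.
df['start_date'] = pd.to_datetime(df['start_date'], format='%m/%d/%Y')

first_two = (
    df.sort_values(['id', 'start_date'])  # earliest dates first within each id
      .groupby('id')
      .head(2)                            # keep at most two rows per id
      .reset_index(drop=True)
)
print(first_two)
```

Sorting by both id and start_date keeps the rows of each id together, which matches the layout of the desired output in the question.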
full code:
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
# 1. convert the 'start_date' column to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
# 2. sort the dataframe ascending by 'start_date'
df.sort_values(by='start_date', ascending=True, inplace=True)
# 3. select only the first two occurrences of each id
df.groupby('id').head(2)
output:
id start_date
0 1 2000-01-01
1 1 2002-01-07
5 3 2008-12-11
4 2 2009-08-08
3 2 2010-01-05

Using .loc and shift() to add one to a serialnumber

I'm trying to combine two dataframes using concat with axis = 0, so the columns stay the same but the index grows. One of the dataframes contains a column with a serial number (going from one upwards, but not necessarily consecutively, e.g. 1, 2, 5 as below).
import pandas as pd
import numpy as np
a = pd.DataFrame(data = {'Name': ['A', 'B','C'],
'Serial Number': [1, 2,5]} )
b = pd.DataFrame(data = {'Name': ['D','E','F'],
'Serial Number': [np.nan,np.nan,np.nan]})
c = pd.concat([a,b],axis=0).reset_index()
I would like the 'Serial Number' column in dataframe c to continue from the last value: 5+1 for the first new row, then 6+1 for the next, and so on.
I've tried a variety of things eg:
c.loc[c['Serial Number'].isna(), 'Serial Number'] = c['Serial Number'].shift(1)+1
But it doesn't seem to work.
Desired output:
| Name | Serial Number|
-------------------------
1 A | 1
2 B | 2
3 C | 5
4 D | 6
5 E | 7
6 F | 8
One idea is to create a range with np.arange whose length is the number of missing values, then add the maximum value plus 1:
a = np.arange(c['Serial Number'].isna().sum()) + c['Serial Number'].max() + 1
c.loc[c['Serial Number'].isna(), 'Serial Number'] = a
print (c)
index Name Serial Number
0 0 A 1.0
1 1 B 2.0
2 2 C 5.0
3 0 D 6.0
4 1 E 7.0
5 2 F 8.0
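The output above keeps float serial numbers and an extra 'index' column. A hedged variation that avoids both, assuming that is what's wanted: pass ignore_index=True to pd.concat and cast back to int once no NaN remains. The names `missing` and `start` are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Serial Number': [1, 2, 5]})
b = pd.DataFrame({'Name': ['D', 'E', 'F'], 'Serial Number': [np.nan, np.nan, np.nan]})

# ignore_index=True renumbers the rows instead of keeping the old index.
c = pd.concat([a, b], ignore_index=True)

missing = c['Serial Number'].isna()
start = c['Serial Number'].max() + 1   # first free serial number

# Fill the gaps with start, start + 1, ...
c.loc[missing, 'Serial Number'] = np.arange(missing.sum()) + start

# Safe to cast now that no NaN is left.
c['Serial Number'] = c['Serial Number'].astype(int)
print(c)
```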

Sort data in Pandas dataframe alphabetically

I have a dataframe where I need to sort the contents of one column (comma separated) alphabetically:
ID Data
1 Mo,Ab,ZZz
2 Ab,Ma,Bt
3 Xe,Aa
4 Xe,Re,Fi,Ab
Output:
ID Data
1 Ab,Mo,ZZz
2 Ab,Bt,Ma
3 Aa,Xe
4 Ab,Fi,Re,Xe
I have tried:
df.sort_values(by='Data')
But this does not work
You can split, sort, and then join back:
df['Data'] = df['Data'].apply(lambda x: ','.join(sorted(x.split(','))))
Or use a list comprehension as an alternative:
df['Data'] = [','.join(sorted(x.split(','))) for x in df['Data']]
print (df)
ID Data
0 1 Ab,Mo,ZZz
1 2 Ab,Bt,Ma
2 3 Aa,Xe
3 4 Ab,Fi,Re,Xe
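Note that sorted compares strings case-sensitively, so uppercase letters sort before lowercase ones. If the data mixes cases, a small sketch like the following may help; `sort_items` is a hypothetical helper, and the sample values are altered from the question to include lowercase entries:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Data': ['mo,Ab,ZZz', 'ab,Ma,Bt']})

def sort_items(cell, ignore_case=False):
    """Sort the comma-separated items of one cell (hypothetical helper)."""
    key = str.lower if ignore_case else None
    return ','.join(sorted(cell.split(','), key=key))

# Default ordering: uppercase letters come before lowercase ones.
case_sensitive = df['Data'].apply(sort_items)

# Case-insensitive ordering via key=str.lower.
case_insensitive = df['Data'].apply(sort_items, ignore_case=True)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```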
IIUC get_dummies
s=df.Data.str.get_dummies(',')
df['n']=s.dot(s.columns+',').str[:-1]
df
Out[216]:
ID Data n
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
IIUC you can use a list comprehension:
[','.join(sorted(i.split(','))) for i in df['Data']]
#['Ab,Mo,ZZz', 'Ab,Bt,Ma', 'Aa,Xe', 'Ab,Fi,Re,Xe']
using explode and sort_values
df["Sorted_Data"] = (
df["Data"].str.split(",").explode().sort_values().groupby(level=0).agg(','.join)
)
print(df)
ID Data Sorted_Data
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
Using row iteration:
for index, row in df.iterrows():
    # write back through df.at; assigning to `row` may only change a copy
    df.at[index, 'Data'] = ','.join(sorted(row['Data'].split(',')))
In [29]: df
Out[29]:
Data
0 Ab,Mo,ZZz
1 Ab,Bt,Ma
2 Aa,Xe
3 Ab,Fi,Re,Xe

Dataframe merge by row

I have two pandas DataFrames and I want to merge df2 onto each row of df1 based on the ID in df1. The final DataFrame should look like df3.
How do I do it? I tried merge, join and concat and didn't get what I wanted.
df1
ID Division
1 10
2 2
3 4
... ...
df2
Product type Level
1 0
1 1
1 2
2 0
2 1
2 2
2 3
df3
ID Product type Level Division
1 1 0 10
1 1 1 10
1 1 2 10
1 2 0 10
1 2 1 10
1 2 2 10
1 2 3 10
and repeat for ID 2 and ......
It looks like you want the Cartesian product of the two dataframes. The following approach should achieve what you want:
(df1.assign(key=1)
.merge(df2.assign(key=1))
.drop('key', axis=1))
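In pandas 1.2 and later, merge also accepts how='cross', which builds the Cartesian product without a helper key. A minimal sketch, with sample frames rebuilt from the tables in the question and the final column selection only there to match the df3 layout:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Division': [10, 2, 4]})
df2 = pd.DataFrame({'Product type': [1, 1, 1, 2, 2, 2, 2],
                    'Level': [0, 1, 2, 0, 1, 2, 3]})

# Assumption: pandas >= 1.2, where how='cross' is available.
df3 = df1.merge(df2, how='cross')

# Reorder the columns to match the layout shown in the question.
df3 = df3[['ID', 'Product type', 'Level', 'Division']]
print(df3.head(7))
```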
Consider this option:
set the index in both DataFrames to 0,
perform an outer join (on indices, so the result is just the Cartesian product),
reset the index.
The code to do it is:
df1.index = [0] * df1.index.size
df2.index = [0] * df2.index.size
result = df1.join(df2, how='outer').reset_index(drop=True)