Transform pandas dataframe rename columns based on row values

I have the following data frame df_in
data = [{'s':123, 'x': 5, 'a': 1, 'b': 2, 'c': 3},
{'s':123, 'x': 22, 'a': 4, 'b': 5, 'c': 6},
{'s':123, 'x': 33, 'a': 7, 'b': 8, 'c': 9},
{'s':124, 'x': 5, 'a': 11, 'b': 12, 'c': 3},
{'s':124, 'x': 22, 'a': 14, 'b': 15, 'c': 16},
{'s':124, 'x': 33, 'a': 17, 'b': 18, 'c': 19}]
df = pd.DataFrame(data, columns=['s', 'x', 'a', 'b', 'c'])
and would like to produce df_out, in which each value of x is appended as a suffix (_x) to the column names a, b and c, giving one row per s. Later I will index df_out on s and call df_out.to_dict('index') to produce the output I need. I have tried transposing df_in and renaming the rows with a lambda function based on x, but I am having trouble getting the desired df_out. Any help would be great.
Thanks

Pivot on x, then merge the two levels of the resulting column labels, converting x to str so that it can be joined:
df2=df.pivot(index='s', columns='x').reset_index()
df2.columns= [str(col[0]+'_'+ str(col[1])).strip('_') for col in df2.columns]
df2
s a_5 a_22 a_33 b_5 b_22 b_33 c_5 c_22 c_33
0 123 1 4 7 2 5 8 3 6 9
1 124 11 14 17 12 15 18 3 16 19
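As a minimal, self-contained sketch of the same idea carried through to the final step mentioned in the question (indexing on s and calling to_dict('index')), assuming the sample data from the question. The names wide and result are only illustrative:

```python
import pandas as pd

data = [{'s': 123, 'x': 5,  'a': 1,  'b': 2,  'c': 3},
        {'s': 123, 'x': 22, 'a': 4,  'b': 5,  'c': 6},
        {'s': 123, 'x': 33, 'a': 7,  'b': 8,  'c': 9},
        {'s': 124, 'x': 5,  'a': 11, 'b': 12, 'c': 3},
        {'s': 124, 'x': 22, 'a': 14, 'b': 15, 'c': 16},
        {'s': 124, 'x': 33, 'a': 17, 'b': 18, 'c': 19}]
df = pd.DataFrame(data)

# One row per s; the columns become a MultiIndex of (name, x) pairs
wide = df.pivot(index='s', columns='x')

# Flatten each (name, x) pair into a single label such as 'a_5'
wide.columns = [f'{name}_{x}' for name, x in wide.columns]

# s is already the index, so to_dict('index') is keyed by s
result = wide.to_dict('index')
print(result[123])
```

Keeping s as the index (rather than calling reset_index) means no extra set_index step is needed before to_dict('index').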

You can use a pivot:
(df
.pivot(index='s', columns='x')
.pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str, x))), axis=1))
.reset_index()
)
Output:
s a_5 a_22 a_33 b_5 b_22 b_33 c_5 c_22 c_33
0 123 1 4 7 2 5 8 3 6 9
1 124 11 14 17 12 15 18 3 16 19

Here's another option:
df_out = df.melt(['s','x']).set_index(['s','x', 'variable']).unstack([2,1])['value']
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
print(df_out.reset_index())
Output:
s a_5 a_22 a_33 b_5 b_22 b_33 c_5 c_22 c_33
0 123 1 4 7 2 5 8 3 6 9
1 124 11 14 17 12 15 18 3 16 19

One option is with pivot_wider from pyjanitor :
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_wider(index = 's', names_from = 'x')
s a_5 a_22 a_33 b_5 b_22 b_33 c_5 c_22 c_33
0 123 1 4 7 2 5 8 3 6 9
1 124 11 14 17 12 15 18 3 16 19

Related

Retrieving values from different columns in Pandas based on a column condition [duplicate]

The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0", and has since invalidated a lot of previous answers.
This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.
Standard LookUp Values With Default Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 B 4 8
I would like to be able to lookup the corresponding value in the column specified in Col:
I would like my result to look like:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Standard LookUp Values With a Non-Default Index
Non-Contiguous Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
Col A B
0 B 1 5
2 A 2 6
8 A 3 7
9 B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
Col A B
C E B 1 5
F A 2 6
D E A 3 7
F B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
LookUp with Default For Unmatched/Not-Found Values
Given the following DataFrame
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 C 4 8 # Column C does not correspond with any column
I would like to look up the corresponding values if one exists otherwise I'd like to have it default to 0
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0 # Default value 0 since C does not correspond
LookUp with Missing Values in the lookup Col
Given the following DataFrame:
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 NaN 4 8 # <- Missing Lookup Key
I would like any NaN values in Col to result in a NaN value in Val
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # NaN to indicate missing

Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
factorize is used to encode the values of the column as an "enumerated type".
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that columns appear in the same order as the enumeration:
df.reindex(columns=col)
B A # B appears first (location 0), A appears second (location 1)
0 5 1
2 6 2
8 7 3
9 8 4
We need to create an appropriate range indexer compatible with NumPy indexing.
The standard approach is to use np.arange based on the length of the DataFrame:
np.arange(len(df))
[0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
[5 2 3 8]
*Note: This approach will always work regardless of type of index.
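The steps above can be wrapped in a small helper. This is only a sketch: lookup_by_column is a hypothetical name, not a pandas function, and it assumes every value in the key column matches a column label.

```python
import numpy as np
import pandas as pd


def lookup_by_column(df, key_col):
    """Return, for each row, the value in the column named by df[key_col]."""
    # Encode the keys as integer codes plus the unique labels they refer to
    idx, col = pd.factorize(df[key_col])
    # Order the columns to match the codes, then pick one value per row
    # by position, which is why the index type does not matter
    return df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]


df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
df['Val'] = lookup_by_column(df, 'Col')
print(df)
```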
MultiIndex
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
Why use np.arange and not df.index directly?
Standard Contiguous Range Index
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error as the result from np.arange is the same as the df.index.
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Non-Contiguous Range Index Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches.
First let's look at what happens by default if there is a non-corresponding value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 C 4 8
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN values are introduced, we will find that when factorize goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.
For this reason, when we reindex the DataFrame we will end up with the following result:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
B A C
0 5 1 NaN
1 6 2 NaN
2 7 3 NaN
3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can specify the fill_value argument of reindex which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
B A C
0 5 1 0
1 6 2 0
2 7 3 0
3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
columns=col,
fill_value=0 # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0
*Notice the dtype of the column is int, since NaN was never introduced, and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
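By default, factorize encodes NaN values with the sentinel -1 (the na_sentinel=-1 default in older pandas versions; use_na_sentinel=True since pandas 1.5), meaning that when NaN values appear in the column being factorized the resulting idx value is -1: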
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 NaN 4 8 # <- Missing Lookup Key
idx, col = pd.factorize(df['Col'])
# idx = array([ 0, 1, 1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
# Col A B Val
# 0 B 1 5 5
# 1 A 2 6 2
# 2 A 3 7 3
# 3 NaN 4 8 4 <- Value From A
This -1 means that, by default, the NumPy indexing step pulls from the last column of the reindexed DataFrame. Notice that col still contains only the values B and A, so the last row ends up with the value from A in Val.
The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.
Here I use the empty string '':
idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the '' column will contain NaN values meaning that the lookup produces the desired result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # Missing as expected
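An alternative sketch, not taken from the pandas documentation, that avoids choosing a placeholder string: keep the -1 codes that factorize produces and blank out those rows afterwards. It assumes a float result is acceptable, since NaN forces a float dtype, and the name vals is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

# NaN keys are encoded as -1 and do not appear in col
idx, col = pd.factorize(df['Col'])

# Convert to float up front so NaN can be stored in the result
vals = df.reindex(columns=col).to_numpy(dtype=float)[np.arange(len(df)), idx]

# Rows whose key was missing picked up the last column; overwrite them
vals[idx == -1] = np.nan
df['Val'] = vals
print(df)
```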

Other Approaches to LookUp
There are 2 other approaches to performing this operation:
apply (Intuitive, but quite slow)
apply can be used on axis=1 in order to use the Column values as the key:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This operation will work regardless of index type:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
# Col A B
# 0 B 1 5
# 2 A 2 6
# 8 A 3 7
# 9 B 4 8
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
When dealing with missing or non-corresponding values, Series.get can be used to remedy this issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'C', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 C 3 7 <- Non Corresponding
# 3 NaN 4 8 <- Missing
df['Val'] = df.apply(lambda row: row.get(row['Col']), axis=1)
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN # Missing value
3 NaN 4 8 NaN # Missing value
With Default Value
df['Val'] = df.apply(lambda row: row.get(row['Col'], default=-1), axis=1)
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 C 3 7 -1 # Default -1
3 NaN 4 8 -1 # Default -1
apply is extremely flexible and modifications are straightforward. However, the row-by-row iteration, together with all the individual Series lookups, can become extremely costly for large DataFrames.
get_indexer (limited)
Index.get_indexer can be used to convert the values in the lookup column into integer positions that index the DataFrame's columns directly. This means there is no need to reindex the DataFrame, as the indexer corresponds to the DataFrame as a whole.
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This approach is reasonably fast; however, missing values are represented by -1, meaning that if a value is missing it will grab the value from column position -1 (the last column in the DataFrame).
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'Col': ['B', 'A', 'A', 'C']})
# A B Col <- Col is now the Last Col
# 0 1 5 B
# 1 2 6 A
# 2 3 7 A
# 3 4 8 C <- Notice Col `C` does not correspond to a Valid Column Header
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df:
A B Col Val
0 1 5 B 5
1 2 6 A 2
2 3 7 A 3
3 4 8 C C # <- Value from the last column in the DataFrame (index -1)
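One way to guard against this, shown here only as a sketch, is to mask the positions where get_indexer returned -1 and substitute a default. The default of 0 and the names pos and picked are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Col': ['B', 'A', 'A', 'C']})

# Column position for each key; -1 where no column matches
pos = df.columns.get_indexer(df['Col'])

# Positional row indexer, so the index type does not matter
picked = df.to_numpy()[np.arange(len(df)), pos]

# Replace the values that were wrongly taken from the last column
df['Val'] = np.where(pos == -1, 0, picked)
print(df)
```

Because the whole DataFrame is converted to a single NumPy array, the result here has object dtype; it may need converting afterwards (for example with pd.to_numeric) if a numeric column is required.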
It is also notable that not reindexing the DataFrame means converting the entire DataFrame to NumPy. This can be very costly if there are many unrelated columns that all need to be converted:
import numpy as np
import pandas as pd
df = pd.DataFrame({1: 10,
2: 20,
3: 't',
4: 40,
5: np.nan,
'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df.to_numpy()
[[10 20 't' 40 nan 'B' 1 5 5]
[10 20 't' 40 nan 'A' 2 6 2]
[10 20 't' 40 nan 'A' 3 7 3]
[10 20 't' 40 nan 'B' 4 8 8]]
Compared to the reindexing approach which only contains columns relevant to the column values:
df.reindex(columns=['B', 'A']).to_numpy()
[[5 1]
[6 2]
[7 3]
[8 4]]

Another option is to build a tuple of the lookup columns, pivot the dataframe, and select the relevant columns with the tuples:
cols = [(ent, ent) for ent in df.Col.unique()]
df.assign(Val = df.pivot(index = None, columns = 'Col')
.reindex(columns = cols)
.ffill(axis=1)
.iloc[:, -1])
Col A B Val
0 B 1 5 5.0
2 A 2 6 2.0
8 A 3 7 3.0
9 B 4 8 8.0

Another possible method is to use melt:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output:
Col A B value
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This method also works with Missing/Non-Corresponding Values:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output
Col A B value
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN
3 NaN 4 8 NaN
You can replace .loc[...] with query(...), which is a little slower but more expressive:
df['value'] = df.melt('Col', ignore_index=False).query('Col == variable')['value']

pandas: join dataframe on condition in next row

I have a pandas data frame df1 that follows a chronological order where "id" is unique:
df1 = pd.DataFrame([[0, 17], [1, 5], [2, 11], [3, 15], [4, 10]], columns = ['seq', 'id'])
seq id
0 17
1 5
2 11
3 15
4 10
I need to join data from another data frame df2 where "id" can appear many times in the "id_1" column, but where the id1-id2 combination is unique:
df2 = pd.DataFrame([[17, 7, 'a'], [17, 5, 'b'], [17, 8, 'c'], [5, 4, 'd'], [5, 11, 'e'], [11, 9, 'f'], [11, 15, 'g'], [15, 21, 'h'], [15, 10, 'i']], columns = ['id_1', 'id_2', 'x1'])
id_1 id_2 x1
17 7 a
17 5 b
17 8 c
5 4 d
5 11 e
11 9 f
11 15 g
15 21 h
15 10 i
The result joined to df1 should be based on the next row. For example, for the first row (seq=0), the data from df2 must be joined only where id_1 equals id on the current row and id_2 equals id on the next row (seq=1). The final result should look like this:
seq id x1
0 17 b
1 5 e
2 11 g
3 15 i
4 10
Any idea how to achieve that?
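One possible approach, sketched under the assumption that "next row" means the following row in df1's current order: shift id up by one row to get the next id, then left-merge on the pair. The column name next_id and the variable out are introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 17], [1, 5], [2, 11], [3, 15], [4, 10]],
                   columns=['seq', 'id'])
df2 = pd.DataFrame([[17, 7, 'a'], [17, 5, 'b'], [17, 8, 'c'],
                    [5, 4, 'd'], [5, 11, 'e'], [11, 9, 'f'],
                    [11, 15, 'g'], [15, 21, 'h'], [15, 10, 'i']],
                   columns=['id_1', 'id_2', 'x1'])

# id of the following row; the last row has no successor, so it becomes NaN
next_id = df1['id'].shift(-1)

# A left merge keeps every row of df1; rows without a match get NaN in x1
out = (df1.assign(next_id=next_id)
          .merge(df2, how='left',
                 left_on=['id', 'next_id'], right_on=['id_1', 'id_2'])
          [['seq', 'id', 'x1']])
print(out)
```

Since the id_1/id_2 combination is unique in df2, the merge cannot duplicate rows of df1.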

Multiplying Dataframe by Column Value

I'm trying to convert a dataframe of local currency values to their Canadian-dollar values by multiplying each row by its FX rate.
However, I keep getting this error:
ValueError: operands could not be broadcast together with shapes (12252,) (1021,)
This is the code I'm working with right now. It works when I have a handful of rows of data, but raises the ValueError once I use it on the full file (1022 rows of data incl. headers).
import pandas as pd
Local_File = ('RawData.xlsx')
df = pd.read_excel(Local_File, sheet_name = 'Local')
df2 = df.iloc[:,[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].multiply(df['FX Spot Rate'],axis='index')
print (df2)
My dataframe looks something like this with 1022 rows of data (incl. header)
Appreciate any help! Thank you!

df = pd.DataFrame({'A': [1, 2, 3, 3, 1],
'B': [1, 2, 3, 3, 1],
'C': [9, 7, 4, 3, 9]})
A B C
0 1 1 9
1 2 2 7
2 3 3 4
3 3 3 3
4 1 1 9
df.iloc[:, 1:] = df.iloc[:, 1:].multiply(df['A'], axis="index")
df
A B C
0 1 1 9
1 2 4 14
2 3 9 12
3 3 9 9
4 1 1 9

How to concatenate multiple dataframes from multiple sources in pandas

I have three dataframes as follows.
dummy_data1 = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
dummy_data2 = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
dummy_data3 = {
'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'Feature1': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23],
'Feature2': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
I want to concatenate these three dataframes where I do it as follows.
df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2'])
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature1', 'Feature2'])
df = pd.concat([df1, df2], ignore_index=True)
df_ = pd.concat([df, df3], ignore_index=True)
The output I get is as follows.
id Feature1 Feature2
0 1 A B
1 2 C D
2 3 E F
3 4 G H
4 5 I J
5 1 K L
6 2 M N
7 6 O P
8 7 Q R
9 8 S T
10 1 12 12
11 2 13 13
12 3 14 14
13 4 15 15
14 5 16 16
15 7 17 17
16 8 15 15
17 9 12 12
18 10 13 13
19 11 23 23
Now, I want to add a separate column to the merged dataframe indicating the source of each row, i.e. my output should look as follows.
id Feature1 Feature2 source
0 1 A B df1
1 2 C D df1
2 3 E F df1
3 4 G H df1
4 5 I J df1
5 1 K L df2
6 2 M N df2
7 6 O P df2
8 7 Q R df2
9 8 S T df2
10 1 12 12 df3
11 2 13 13 df3
12 3 14 14 df3
13 4 15 15 df3
14 5 16 16 df3
15 7 17 17 df3
16 8 15 15 df3
17 9 12 12 df3
18 10 13 13 df3
19 11 23 23 df3
Just wondering how to do this in pandas. Moreover, I would also like to know whether I can concatenate the three dataframes in one line (without doing it one by one).
I am happy to provide more details if needed.

Add new column by DataFrame.assign and pass all 3 DataFrames to concat:
df = pd.concat([df1.assign(source='df1'),
df2.assign(source='df2'),
df3.assign(source='df3')], ignore_index=True)
print (df)
id Feature1 Feature2 source
0 1 A B df1
1 2 C D df1
2 3 E F df1
3 4 G H df1
4 5 I J df1
5 1 K L df2
6 2 M N df2
7 6 O P df2
8 7 Q R df2
9 8 S T df2
10 1 12 12 df3
11 2 13 13 df3
12 3 14 14 df3
13 4 15 15 df3
14 5 16 16 df3
15 7 17 17 df3
16 8 15 15 df3
17 9 12 12 df3
18 10 13 13 df3
19 11 23 23 df3
Another idea is to use the keys parameter of concat:
df = (pd.concat([df1, df2, df3], keys=('df1','df2','df3'))
.rename_axis(('source', 'tmp'))
.reset_index(level=0)
.reset_index(drop=True))
print (df)
source id Feature1 Feature2
0 df1 1 A B
1 df1 2 C D
2 df1 3 E F
3 df1 4 G H
4 df1 5 I J
5 df2 1 K L
6 df2 2 M N
7 df2 6 O P
8 df2 7 Q R
9 df2 8 S T
10 df3 1 12 12
11 df3 2 13 13
12 df3 3 14 14
13 df3 4 15 15
14 df3 5 16 16
15 df3 7 17 17
16 df3 8 15 15
17 df3 9 12 12
18 df3 10 13 13
19 df3 11 23 23
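When the number of frames is not fixed, the same keys idea can be driven by a dict, since concat uses the dict keys as the outer index level. A minimal sketch with shortened stand-in data; the frames dict and the level name 'row' are assumptions for illustration:

```python
import pandas as pd

# Shortened stand-ins for df1, df2 and df3
frames = {
    'df1': pd.DataFrame({'id': ['1', '2'], 'Feature1': ['A', 'C']}),
    'df2': pd.DataFrame({'id': ['1', '6'], 'Feature1': ['K', 'O']}),
    'df3': pd.DataFrame({'id': ['1', '9', '10'], 'Feature1': [12, 13, 14]}),
}

combined = (pd.concat(frames, names=['source', 'row'])  # dict keys -> 'source' level
              .reset_index(level='source')              # move 'source' into a column
              .reset_index(drop=True))                  # discard the per-frame row numbers
print(combined)
```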

import pandas as pd
dummy_data1 = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
dummy_data2 = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
dummy_data3 = {
'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'Feature1': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23],
'Feature2': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}
df1 = pd.DataFrame(dummy_data1)
df1['source'] = 'df1'
df2 = pd.DataFrame(dummy_data2)
df2['source'] = 'df2'
df3 = pd.DataFrame(dummy_data3)
df3['source'] = 'df3'
df = pd.concat([df1, df2, df3], axis=0)
Output:
id Feature1 Feature2 source
0 1 A B df1
1 2 C D df1
2 3 E F df1
3 4 G H df1
4 5 I J df1
0 1 K L df2
1 2 M N df2
2 6 O P df2
3 7 Q R df2
4 8 S T df2
0 1 12 12 df3
1 2 13 13 df3
2 3 14 14 df3
3 4 15 15 df3
4 5 16 16 df3
5 7 17 17 df3
6 8 15 15 df3
7 9 12 12 df3
8 10 13 13 df3
9 11 23 23 df3

Add rows as columns in pandas

I'm trying to change my dataset by making all the rows into columns in pandas.
5 6 7
8 9 10
Needs to be changed as
5 6 7 8 9 10
with different headers of course, any suggestions??

Use pd.DataFrame([df.values.flatten()]) as follows:
In [18]: df
Out[18]:
0 1 2
0 5 6 7
1 8 9 10
In [19]: pd.DataFrame([df.values.flatten()])
Out[19]:
0 1 2 3 4 5
0 5 6 7 8 9 10
Explanation:
df.values returns numpy.ndarray:
In [18]: df.values
Out[18]:
array([[ 5, 6, 7],
[ 8, 9, 10]], dtype=int64)
In [19]: type(df.values)
Out[19]: numpy.ndarray
and numpy arrays have .flatten() method:
In [20]: df.values.flatten?
Docstring:
a.flatten(order='C')
Return a copy of the array collapsed into one dimension.
In [21]: df.values.flatten()
Out[21]: array([ 5, 6, 7, 8, 9, 10], dtype=int64)
Pandas.DataFrame constructor expects lists/arrays of rows:
If we try this:
In [22]: pd.DataFrame([ 5, 6, 7, 8, 9, 10])
Out[22]:
0
0 5
1 6
2 7
3 8
4 9
5 10
Pandas treats it as a list of rows, where each row has one element.
So I've enclosed that array in square brackets:
In [23]: pd.DataFrame([[ 5, 6, 7, 8, 9, 10]])
Out[23]:
0 1 2 3 4 5
0 5 6 7 8 9 10
which will be understood as one row with 6 columns.
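Since the question also asks for different headers, here is a short sketch that flattens the frame and labels each new column with the row and column it came from. The r{row}_c{col} naming scheme is purely an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 7], [8, 9, 10]])

# Row-major order: all of row 0 first, then row 1, matching ravel()
new_cols = [f'r{r}_c{c}' for r in df.index for c in df.columns]

# Wrap the flattened array in a list so it is read as a single row
flat = pd.DataFrame([df.to_numpy().ravel()], columns=new_cols)
print(flat)
```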

or just in one line:
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.values.flatten()
#out: array([1, 2, 3, 4, 5, 6])

You can also use reduce():
import pandas as pd
from functools import reduce
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
# Start from an empty list so that this works for any number of rows
df = pd.DataFrame([reduce(lambda acc, row: acc + list(row[1]), df.iterrows(), [])])
df
0 1 2 3 4 5
0 5 6 7 8 9 10

Use the reshape function from numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
nparray = df.to_numpy()
x = np.reshape(nparray, (1, -1)) # a single row containing all the values
df = pd.DataFrame(x) # convert back to a dataframe
