Using pandas, how to join two tables on variable index? - pandas

There are two tables, the entries may have different id type. I need to join two tables based on id_type of df1 and the correct column of df2. For the background of the problem, the ids are security id in financial world, the id type may be CUSIP, ISIN, RIC etc..
print(df1)
id id_type value
0 11 type_A 0.1
1 22 type_B 0.2
2 13 type_A 0.3
print(df2)
type_A type_B type_C
0 11 21 xx
1 12 22 yy
2 13 23 zz
The desired output is
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3

Here is an alternative approach, which generalizes to many security types (CUSIP, ISIN, RIC, SEDOL, etc.).
First, create df1 and df2 along the lines of the original example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'sec_id': [11, 22, 33],
'sec_id_type': ['CUSIP', 'ISIN', 'RIC'],
'value': [100, 200, 300]})
df2 = pd.DataFrame({'CUSIP': [11, 21, 31],
'ISIN': [21, 22, 23],
'RIC': [31, 32, 33],
'SEDOL': [41, 42, 43]})
Second, create an intermediate data frame x1. We will use the first column for one join, and the second and third columns for a different join:
index = [idx for idx in df2.index for _ in df2.columns]
sec_id_types = df2.columns.to_list() * df2.shape[0]
sec_ids = df2.values.ravel()
data = [
(idx, sec_id_type, sec_id)
for idx, sec_id_type, sec_id in zip(index, sec_id_types, sec_ids)
]
x1 = pd.DataFrame.from_records(data, columns=['index', 'sec_id_type', 'sec_id'])
Join df1 and x1 to extract values from df1:
x2 = (x1.merge(df1, on=['sec_id_type', 'sec_id'], how='left')
.dropna()
.set_index('index'))
Finally, join df2 and x1 (from previous step) to get final result
print(df2.merge(x2, left_index=True, right_index=True, how='left'))
CUSIP ISIN RIC SEDOL sec_id_type sec_id value
0 11 21 31 41 CUSIP 11 100.0
1 21 22 32 42 ISIN 22 200.0
2 31 23 33 43 RIC 33 300.0
The columns sec_id_type and sec_id show the joins work as expected.

NEW Solution 1: create a temporary column that determines the ID with np.where
df2['id'] = np.where(df2['type_A'] == df1['id'], df2['type_A'], df2['type_B'])
df = pd.merge(df2,df1[['id','value']],how='left',on='id').drop('id', axis=1)
NEW Solution 2: Can you simply merge on the index? If not go with solution #1.
df = pd.merge(df2, df1['value'], how ='left', left_index=True, right_index=True)
output:
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3
OLD Solution:
Through a combination of pd.merge, pd.melt and pd.concat, I found a solution, although I wonder if there is a shorter way (probably):
df_A_B = pd.merge(df2[['type_A']], df2[['type_B']], how='left', left_index=True, right_index=True) \
.melt(var_name = 'id_type', value_name='id')
df_C = pd.concat([df2[['type_C']]] * 2).reset_index(drop=True)
df_A_B_C = pd.merge(df_A_B, df_C, how='left', left_index=True, right_index=True)
df3 = pd.merge(df_A_B_C, df1, how='left', on=['id_type', 'id']).dropna().drop(['id_type', 'id'], axis=1)
df4 = pd.merge(df2, df3, how='left', on=['type_C'])
df4
output:
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3

Related

Properly map values between two dataframes

I have dataframe df
d = {'Col1': [10,67], 'Col2': [30,10],'Col3': [70,40]}
df = pd.DataFrame(data=d)
which results in
Col1 Col2 Col3
0 10 30 70
1 67 10 40
and df2
df2=pd.DataFrame(data=([25,36,47,(0,20)],[70,85,95,(20,40)],
[12,35,49,(40,60)],[50,49,21,(60,80)],[60,75,38,(80,100)]),
columns=["Col1","Col2","Col3","Range"])
which results in:
Col1 Col2 Col3 Range
0 25 36 47 (0, 20)
1 70 85 95 (20, 40)
2 12 35 49 (40, 60)
3 50 49 21 (60, 80)
4 60 75 38 (80, 100)
Both frames are just for example purposes and might be much bigger in reality. Both frames have the same columns but one.
I want to apply some function (x/y) between each value from df and a value in df2 from the same column. The value from df2 however maybe in varying rows depending on the Range column.
For example 10 from df (Col1) falls in range (0,20) in df2 therefore I want to use 25 from Col1 (df2) and do 10/25.
30 from df (Col2) falls in range (20,40) in df2 therefore I want to take 85 from Col2 (df2) and do 30/85.
70 from df (Col3) falls in range (60,80) in df2 therefore I want to take 21 from Col3 (df2) and do 70/21.
I want to do this for each row in df.
Don't really know how to do the proper mapping; I always tend to start with some for loops which are not very pretty especially if both dataframes are of bigger shape. Expected output can be any array, dataframe or the like composed of the resulting numbers.
Here is one way to do it by defining a helper function:
def find_denominator_for(v):
"""Helper function.
>>> find_denominator_for(10)
{'Col1': 25, 'Col2': 36, 'Col3': 47}
"""
for tup, sub_dict in df2.set_index("Range").to_dict(orient="index").items():
if min(tup) <= v <= max(tup):
return sub_dict
for col in df.columns:
df[col] = df[col] / df[col].apply(lambda x: find_denominator_for(x)[col])
Then:
print(df)
# Output
Col1 Col2 Col3
0 0.40 0.352941 3.333333
1 1.34 0.277778 0.421053

Split rows in Pandas into multiple rows based on a semicolon, but the change appears only in one column and not in others [duplicate]

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a should become b:
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [8]: b
Out[8]:
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!
Example data:
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
{'var1': 'b', 'var2': 1},
{'var1': 'c', 'var2': 1},
{'var1': 'd', 'var2': 2},
{'var1': 'e', 'var2': 2},
{'var1': 'f', 'var2': 2}])
I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:
def fun(row):
letters = row['var1']
letters = letters.split(',')
out = np.array([row] * len(letters))
out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
UPDATE 3: it makes more sense to use Series.explode() / DataFrame.explode() methods (implemented in Pandas 0.25.0 and extended in Pandas 1.3.0 to support multi-column explode) as is shown in the usage example:
for a single column:
In [1]: df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
...: 'B': 1,
...: 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
In [2]: df
Out[2]:
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
In [3]: df.explode('A')
Out[3]:
A B C
0 0 1 [a, b, c]
0 1 1 [a, b, c]
0 2 1 [a, b, c]
1 foo 1 NaN
2 NaN 1 []
3 3 1 [d, e]
3 4 1 [d, e]
for multiple columns (for Pandas 1.3.0+):
In [4]: df.explode(['A', 'C'])
Out[4]:
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
UPDATE 2: more generic vectorized function, which will work for multiple normal and multiple list columns
def explode(df, lst_cols, fill_value='', preserve_index=False):
# make sure `lst_cols` is list-alike
if (lst_cols is not None
and len(lst_cols) > 0
and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
# preserve original index values
idx = np.repeat(df.index.values, lens)
# create "exploded" DF
res = (pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols},
index=idx)
.assign(**{col:np.concatenate(df.loc[lens>0, col].values)
for col in lst_cols}))
# append those rows that have empty lists
if (lens == 0).any():
# at least one list in cells is empty
res = (res.append(df.loc[lens==0, idx_cols], sort=False)
.fillna(fill_value))
# revert the original index order
res = res.sort_index()
# reset index if requested
if not preserve_index:
res = res.reset_index(drop=True)
return res
Demo:
Multiple list columns - all list columns must have the same # of elements in each row:
In [134]: df
Out[134]:
aaa myid num text
0 10 1 [1, 2, 3] [aa, bb, cc]
1 11 2 [] []
2 12 3 [1, 2] [cc, dd]
3 13 4 [] []
In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
aaa myid num text
0 10 1 1 aa
1 10 1 2 bb
2 10 1 3 cc
3 11 2
4 12 3 1 cc
5 12 3 2 dd
6 13 4
preserving original index values:
In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
Out[136]:
aaa myid num text
0 10 1 1 aa
0 10 1 2 bb
0 10 1 3 cc
1 11 2
2 12 3 1 cc
2 12 3 2 dd
3 13 4
Setup:
df = pd.DataFrame({
'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
'myid': {0: 1, 1: 2, 2: 3, 3: 4},
'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
})
CSV column:
In [46]: df
Out[46]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
using this little trick we can convert CSV-like column to list column:
In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
UPDATE: generic vectorized approach (will work also for multiple columns):
Original DF:
In [177]: df
Out[177]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
Solution:
first let's convert CSV strings to lists:
In [178]: lst_col = 'var1'
In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
In [180]: x
Out[180]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
Now we can do this:
In [181]: pd.DataFrame({
...: col:np.repeat(x[col].values, x[lst_col].str.len())
...: for col in x.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
...:
Out[181]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
OLD answer:
Inspired by #AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):
In [2]: df = pd.DataFrame(
...: [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
...: {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
...: )
In [3]: df
Out[3]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [4]: (df.set_index(df.columns.drop('var1',1).tolist())
...: .var1.str.split(',', expand=True)
...: .stack()
...: .reset_index()
...: .rename(columns={0:'var1'})
...: .loc[:, df.columns]
...: )
Out[4]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.
If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.
b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1
Pandas >= 0.25
Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode on that column.
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
var1 var2
0 a,b,c 1
1 d,e,f 2
df.assign(var1=df['var1'].str.split(',')).explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Note that explode only works on a single column (for now). To explode multiple columns at once, see below.
NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.
df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
var1 var2
0 d,e,f 1
1 2
2 NaN 3
df['var1'].str.split(',')
0 [d, e, f]
1 []
2 NaN
df.assign(var1=df['var1'].str.split(',')).explode('var1')
var1 var2
0 d 1
0 e 1
0 f 1
1 2 # empty list entry becomes empty string after exploding
2 NaN 3 # NaN left un-touched
This is a serious advantage over ravel/repeat -based solutions (which ignore empty lists completely, and choke on NaNs).
Exploding Multiple Columns
pandas 1.3 update
df.explode works on multiple columns starting from pandas 1.3:
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'],
'var2': ['i,j,k', 'l,m,n'],
'var3': [1, 2]})
df
var1 var2 var3
0 a,b,c i,j,k 1
1 d,e,f l,m,n 2
(df.set_index(['var3'])
.apply(lambda col: col.str.split(','))
.explode(['var1', 'var2'])
.reset_index()
.reindex(df.columns, axis=1))
var1 var2 var3
0 a i 1
1 b j 1
2 c k 1
3 d l 2
4 e m 2
5 f n 2
On older versions, you would move the explode column inside the apply which is a lot less performant:
(df.set_index(['var3'])
.apply(lambda col: col.str.split(',').explode())
.reset_index()
.reindex(df.columns, axis=1))
The idea is to set as the index, all the columns that should NOT be exploded, then explode the remaining columns via apply. This works well when the lists are equally sized.
How about something like this:
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Then you just have to rename the columns
Here's a function I wrote for this common task. It's more efficient than the Series/stack methods. Column order and names are retained.
def tidy_split(df, column, sep='|', keep=False):
"""
Split the values of a column and expand so the new DataFrame has one split
value per row. Filters rows where the column is missing.
Params
------
df : pandas.DataFrame
dataframe with the column to split and expand
column : str
the column to split and expand
sep : str
the string used to split the column's values
keep : bool
whether to retain the presplit value as it's own row
Returns
-------
pandas.DataFrame
Returns a dataframe with the same columns as `df`.
"""
indexes = list()
new_values = list()
df = df.dropna(subset=[column])
for i, presplit in enumerate(df[column].astype(str)):
values = presplit.split(sep)
if keep and len(values) > 1:
indexes.append(i)
new_values.append(presplit)
for value in values:
indexes.append(i)
new_values.append(value)
new_df = df.iloc[indexes, :].copy()
new_df[column] = new_values
return new_df
With this function, the original question is as simple as:
tidy_split(a, 'var1', sep=',')
Similar question as: pandas: How do I split text in a column into multiple rows?
You could do:
>> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)
>> del a['var1']
>> a.join(s)
var2 var1
0 1 a
0 1 b
0 1 c
1 2 d
1 2 e
1 2 f
There is a possibility to split and explode the dataframe without changing the structure of dataframe
Split and expand data of specific columns
Input:
var1 var2
0 a,b,c 1
1 d,e,f 2
#Get the indexes which are repetative with the split
df['var1'] = df['var1'].str.split(',')
df = df.explode('var1')
Out:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Edit-1
Split and Expand of rows for Multiple columns
Filename RGB RGB_type
0 A [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402... [r, g, b]
1 B [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141... [r, g, b]
Re indexing based on the reference column and aligning the column value information with stack
df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
df.reset_index(drop=True).ffill()
Out:
Filename RGB_type Top 1 colour Top 1 frequency Top 2 colour Top 2 frequency
Filename
A 0 A r 0 1650 6 39
1 A g 0 1691 1 59
2 A b 50 1402 49 187
B 0 B r 0 1423 16 38
1 B g 0 1445 16 46
2 B b 0 1419 16 39
TL;DR
import pandas as pd
import numpy as np
def explode_str(df, col, sep):
s = df[col]
i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
def explode_list(df, col):
s = df[col]
i = np.arange(len(s)).repeat(s.str.len())
return df.iloc[i].assign(**{col: np.concatenate(s)})
Demonstration
explode_str(a, 'var1', ',')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Let's create a new dataframe d that has lists
d = a.assign(var1=lambda d: d.var1.str.split(','))
explode_list(d, 'var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
General Comments
I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.
FAQ
Why don't I use loc?
Because the index may not be unique and using loc will return every row that matches a queried index.
Why don't you use the values attribute and slice that?
When calling values, if the entirety of the the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype that is object. By using iloc instead of slicing the values attribute, I alleviate myself from having to deal with that.
Why do you use assign?
When I use assign using the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.
Why are the index values repeat?
By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern. One repeat for each element the list or string.
This can be reset with reset_index(drop=True)
For Strings
I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument assuming that if I were to split, the length of the resulting list would be one more than the number of separators.
I then use that sep to join the strings then split.
def explode_str(df, col, sep):
s = df[col]
i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
For Lists
Similar as for strings except I don't need to count occurrences of sep because its already split.
I use Numpy's concatenate to jam the lists together.
import pandas as pd
import numpy as np
def explode_list(df, col):
s = df[col]
i = np.arange(len(s)).repeat(s.str.len())
return df.iloc[i].assign(**{col: np.concatenate(s)})
I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).
def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split
returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row,row_accumulator,target_column,separator):
split_row = row[target_column].split(separator)
for s in split_row:
new_row = row.to_dict()
new_row[target_column] = s
row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pandas.DataFrame(new_rows)
return new_df
Here is a fairly straightforward message that uses the split method from pandas str accessor and then uses NumPy to flatten each row into a single array.
The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.
var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) / len(df))
pd.DataFrame({'var1': var1,
'var2': var2})
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
I have been struggling with out-of-memory experience using various way to explode my lists so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of the list length to the number of lists. Sharing the results below:
Time: (less is better, click to view large version)
Peak memory usage: (less is better)
Conclusions:
#MaxU's answer (update 2), codename concatenate offers the best speed in almost every case, while keeping the peek memory usage low,
see #DMulligan's answer (codename stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory,
the accepted #Chang's answer works well for data frames that have a few rows but very large lists.
Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include splitting of strings into the list - which most solutions performed in a similar fashion.
One-liner using split(___, expand=True) and the level and name arguments to reset_index():
>>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
>>> b
var2 var1
0 1 a
1 1 b
2 1 c
0 2 d
1 2 e
2 2 f
If you need b to look exactly like in the question, you can additionally do:
>>> b = b.reset_index(drop=True)[['var1', 'var2']]
>>> b
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Based on the excellent #DMulligan's solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows, and merges it back to the original dataframe. It also uses a great generic change_column_order function from this answer.
def change_column_order(df, col_name, index):
cols = df.columns.tolist()
cols.remove(col_name)
cols.insert(index, col_name)
return df[cols]
def split_df(dataframe, col_name, sep):
orig_col_index = dataframe.columns.tolist().index(col_name)
orig_index_name = dataframe.index.name
orig_columns = dataframe.columns
dataframe = dataframe.reset_index() # we need a natural 0-based index for proper merge
index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
df_split = pd.DataFrame(
pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
.stack().reset_index(level=1, drop=1), columns=[col_name])
df = dataframe.drop(col_name, axis=1)
df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
df = df.set_index(index_col_name)
df.index.name = orig_index_name
# merge adds the column to the last place, so we need to move it back
return change_column_order(df, col_name, orig_col_index)
Example:
df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
Name A B
10 a:b 1 4
12 c:d 2 5
13 e:f:g:h 3 6
split_df(df, 'Name', ':')
Name A B
10 a 1 4
10 b 1 4
12 c 2 5
12 d 2 5
13 e 3 6
13 f 3 6
13 g 3 6
13 h 3 6
Note that it preserves the original index and order of the columns. It also works with dataframes which have non-sequential index.
The string function split can take an option boolean argument 'expand'.
Here is a solution using this argument:
(a.var1
.str.split(",",expand=True)
.set_index(a.var2)
.stack()
.reset_index(level=1, drop=True)
.reset_index()
.rename(columns={0:"var1"}))
I do appreciate the answer of "Chang She", really, but the iterrows() function takes long time on large dataset. I faced that issue and I came to this.
# First, reset_index to make the index a column
a = a.reset_index().rename(columns={'index':'duplicated_idx'})
# Get a longer series with exploded cells to rows
series = pd.DataFrame(a['var1'].str.split('/')
.tolist(), index=a.duplicated_idx).stack()
# New df from series and merge with the old one
b = series.reset_index([0, 'duplicated_idx'])
b = b.rename(columns={0:'var1'})
# Optional & Advanced: In case, there are other columns apart from var1 & var2
b.merge(
a[a.columns.difference(['var1'])],
on='duplicated_idx')
# Optional: Delete the "duplicated_index"'s column, and reorder columns
b = b[a.columns.difference(['duplicated_idx'])]
One-liner using assign and explode:
col1 col2
0 a,b,c 1
1 d,e,f 2
df.assign(col1 = df.col1.str.split(',')).explode('col1', ignore_index=True)
Output:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Just used jiln's excellent answer from above, but needed to expand to split multiple columns. Thought I would share.
def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split
returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row, row_accumulator, target_columns, separator):
split_rows = []
for target_column in target_columns:
split_rows.append(row[target_column].split(separator))
# Seperate for multiple columns
for i in range(len(split_rows[0])):
new_row = row.to_dict()
for j in range(len(split_rows)):
new_row[target_columns[j]] = split_rows[j][i]
row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pd.DataFrame(new_rows)
return new_df
upgraded MaxU's answer with MultiIndex support
def explode(df, lst_cols, fill_value='', preserve_index=False):
"""
usage:
In [134]: df
Out[134]:
aaa myid num text
0 10 1 [1, 2, 3] [aa, bb, cc]
1 11 2 [] []
2 12 3 [1, 2] [cc, dd]
3 13 4 [] []
In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
aaa myid num text
0 10 1 1 aa
1 10 1 2 bb
2 10 1 3 cc
3 11 2
4 12 3 1 cc
5 12 3 2 dd
6 13 4
"""
# make sure `lst_cols` is list-alike
if (lst_cols is not None
and len(lst_cols) > 0
and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
# preserve original index values
idx = np.repeat(df.index.values, lens)
res = (pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols},
index=idx)
.assign(**{col:np.concatenate(df.loc[lens>0, col].values)
for col in lst_cols}))
# append those rows that have empty lists
if (lens == 0).any():
# at least one list in cells is empty
res = (res.append(df.loc[lens==0, idx_cols], sort=False)
.fillna(fill_value))
# revert the original index order
res = res.sort_index()
# reset index if requested
if not preserve_index:
res = res.reset_index(drop=True)
# if original index is MultiIndex build the dataframe from the multiindex
# create "exploded" DF
if isinstance(df.index, pd.MultiIndex):
res = res.reindex(
index=pd.MultiIndex.from_tuples(
res.index,
names=['number', 'color']
)
)
return res
My version of the solution to add to this collection! :-)
# Original problem
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
{'var1': 'b', 'var2': 1},
{'var1': 'c', 'var2': 1},
{'var1': 'd', 'var2': 2},
{'var1': 'e', 'var2': 2},
{'var1': 'f', 'var2': 2}])
### My solution
import pandas as pd
import functools
def expand_on_cols(df, fuse_cols, delim=","):
def expand_on_col(df, fuse_col):
col_order = df.columns
df_expanded = pd.DataFrame(
df.set_index([x for x in df.columns if x != fuse_col])[fuse_col]
.apply(lambda x: x.split(delim))
.explode()
).reset_index()
return df_expanded[col_order]
all_expanded = functools.reduce(expand_on_col, fuse_cols, df)
return all_expanded
assert(b.equals(expand_on_cols(a, ["var1"], delim=",")))
I have come up with the following solution to this problem:
def iter_var1(d):
for _, row in d.iterrows():
for v in row["var1"].split(","):
yield (v, row["var2"])
new_a = DataFrame.from_records([i for i in iter_var1(a)],
columns=["var1", "var2"])
Another solution that uses python copy package
import copy
new_observations = list()
def pandas_explode(df, column_to_explode):
new_observations = list()
for row in df.to_dict(orient='records'):
explode_values = row[column_to_explode]
del row[column_to_explode]
if type(explode_values) is list or type(explode_values) is tuple:
for explode_value in explode_values:
new_observation = copy.deepcopy(row)
new_observation[column_to_explode] = explode_value
new_observations.append(new_observation)
else:
new_observation = copy.deepcopy(row)
new_observation[column_to_explode] = explode_values
new_observations.append(new_observation)
return_df = pd.DataFrame(new_observations)
return return_df
df = pandas_explode(df, column_name)
There are a lot of answers here but I'm surprised no one has mentioned the built in pandas explode function. Check out the link below:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode
For some reason I was unable to access that function, so I used the below code:
import pandas_explode
pandas_explode.patch()
df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')
Above is a sample of my data. As you can see the people column had series of people, and I was trying to explode it. The code I have given works for list type data. So try to get your comma separated text data into list format. Also since my code uses built in functions, it is much faster than custom/apply functions.
Note: You may need to install pandas_explode with pip.
I had a similar problem, my solution was converting the dataframe to a list of dictionaries first, then do the transition. Here is the function:
import re
import pandas as pd
def separate_row(df, column_name):
ls = []
for row_dict in df.to_dict('records'):
for word in re.split(',', row_dict[column_name]):
row = row_dict.copy()
row[column_name]=word
ls.append(row)
return pd.DataFrame(ls)
Example:
>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
>>> a
var1 var2
0 a,b,c 1
1 d,e,f 2
>>> separate_row(a, "var1")
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
You can also change the function a bit to support separating list type rows.
Upon adding few bits and pieces from all the solutions on this page, I was able to get something like this(for someone who need to use it right away).
parameters to the function are df(input dataframe) and key(column that has delimiter separated string). Just replace with your delimiter if that is different to semicolon ";".
def split_df_rows_for_semicolon_separated_key(key, df):
df=df.set_index(df.columns.drop(key,1).tolist())[key].str.split(';', expand=True).stack().reset_index().rename(columns={0:key}).loc[:, df.columns]
df=df[df[key] != '']
return df
Try:
vals = np.array(a.var1.str.split(",").values.tolist())
var = np.repeat(a.var2, vals.shape[1])
out = pd.DataFrame(np.column_stack((var, vals.ravel())), columns=a.columns)
display(out)
var1 var2
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 2 f
In recent version of pandas you can use split followed by explode
a.assign(var1=a['var1'].str.split(',')).explode('var1')
a
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
A short and simple way to change the format of the column using .apply() so that it can be used by .explod():
import string
import pandas as pd
from io import StringIO
file = StringIO(""" var1 var2
0 a,b,c 1
1 d,e,f 2""")
df = pd.read_csv(file, sep=r'\s\s+')
df['var1'] = df['var1'].apply(lambda x : str(x).split(','))
df.explode('var1')
Output:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2

Pandas Column Transformation with list of dict in column

I am getting the data from a nosql database own by third party. Post data fetch the dataframe look like below: I wish to explode perfomance column but can't figure out a way. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
['bob', [{'dates': '15-12-2021', 'gdp': 19},
{'dates': '16-12-2021', 'gdp': 36},
{'dates': '12-12-2022', 'gdp': 39},
{'dates': '13-12-2022', 'gdp': 35},
{'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
['bob', '15-12-2021', 19],
['bob', '16-12-2021', 36],
['bob', '12-12-2022', 39],
['bob', '13-12-2022', 35],
['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)
Use DataFrame.explode with DataFrame.reset_index first and then flatten dictionaries by json_normalize, DataFrame.pop is used for remove column performance in ouput DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
Another solutions with list comprehension - if only 2 columns input DataFrame:
L = [{**{'name':a},**x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
If multiple columns use DataFrame.join with original DataFrame:
L = [{**{'i':a},**x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35

regarding controlling the setup of index column [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is to add a new extra column to my dataframe (named 'Time) which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Possible solution, but I think inplace is not good practice, check this and this:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
I think you can read_csv better:
df = pd.read_csv('university2.csv',
sep=";",
skiprows=1,
index_col='YYYY-MO-DD HH-MI-SS_SSS',
parse_dates='YYYY-MO-DD HH-MI-SS_SSS') #if doesnt work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If MultiIndex or Index is from groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can directly access in the index and get it plotted, following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertiacal axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

Replace cell values in df based on complex condition

Hello friends,
I would like to iterate trough all the numeric columns in the df (in a generic way).
For each unique df["Type"] group in each numeric column:
Replace all values that are greater than each column mean + 2 standard
deviation values with "nan"
df = pd.DataFrame(data=d)
df = pd.DataFrame(data=d)
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
Sample df:
PRODUCT Test1 Test2 Type
A 7 99 Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C 90 87 X
Expected output:
RODUCT Test1 Test2 Type
A 7 nan Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C nan nan X
Logically, it can go like this:
test_cols = ['Test1', 'Test2']
# calculate mean and std with groupby
groups = df.groupby('Type')
test_mean = groups[test_cols].transform('mean')
test_std = groups[test_cols].transform('std')
# threshold
thresh = test_mean + 2 * test_std
# thresholding
df[test_cols] = np.where(df[test_cols]>thresh, np.nan, df[test_cols])
However, from your sample data set, thresh is:
Test1 Test2
0 10.443434 141.707912
1 133.195890 123.898159
2 133.195890 123.898159
3 10.443434 141.707912
4 10.443434 141.707912
5 133.195890 123.898159
So, it wouldn't change anything.
You can get this through a groupby and transform:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['Product'] = ['A', 'B', 'C', 'A', 'B', 'C']
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
df = df.set_index('Product')
def nan_out_values(type_df):
type_df[type_df > type_df.mean() + 2*type_df.std()] = np.nan
return type_df
df[['Test1', 'Test2']] = df.groupby('Type').transform(nan_out_values)