pandas joining strings in a group, skipping na values - pandas

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouping column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as seen in the str.join documentation, joining NaN results in an empty string:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried converting col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and are therefore considered non-empty.
Not only does str.join now include 'nan' strings, but other calculations in the script that rely on those NaNs are also broken.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']),
                         df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So, I tried fillna('') and even dropna(). Neither provided the desired results.
You get the vicious cycle here, right?
astype(str) => 'nan' strings in the join, and other calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Convertin col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:

First remove missing values with DataFrame.dropna or Series.notna in boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [1,2,3,4,1,2,3,4,1,2],
                   'col_str': ['a','b','c','d',np.nan, np.nan, np.nan, np.nan,'a','s']})

df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                                 .groupby('col_a')['col_str'].unique()
                                 .str.join(', ')).rename('labels')))
print (df1)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                                 .groupby('col_a')['col_str']
                                 .unique().str.join(', ')).rename('labels')))
print (df2)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
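If you prefer to handle the missing values inside the aggregation itself, here is a minimal sketch (assuming the same df as above) that drops NaNs per group before joining; unlike the variants above, an all-NaN group would yield an empty string rather than NaN:
# aggregate the unique, non-missing strings per group
labels = df.groupby('col_a')['col_str'].agg(
    lambda s: ', '.join(s.dropna().unique()))

df3 = df.join(df['col_a'].map(labels).rename('labels'))
print (df3)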


Add rows from another df based on keys pandas

EDITED*
I have a large df with many rows that share the same value in some of the columns.
I want to do the following:
new_df = identify the rows in df that have a value (not empty) in a certain column.
df = pd.DataFrame({"a": [1, 2, 2, 2, 3, 4],
                   "b": ['A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})
df1 = df[~df['c'].isnull()]
Then add to 'new_df' the rows from df that share the 2 keys.
I tried to use merge:
df2 = pd.merge(df1,df,on=['a','b'], how='left')
But the result was that it added the same row a few times instead of the unique rows:
a b c_x c_y
0 2 B 2.0 2.0
1 2 B 2.0 NaN
2 2 B 2.0 NaN
I want to keep only one 'c' column with all the values. Not sure what approach to use.
Hope I made it clear...
Thanks!
As far as I understand, you would like to group by 'a' and 'b' and return only those groups where at least one row does not have a NaN in column 'c'. If that's the case, here you go.
Load the df:
df = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[None, None,None,2,None,None,None,None]})
Filter for groups with any non-NaNs:
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
output:
a b c
3 2 B 2.0
4 2 B NaN
5 2 B NaN
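A sketch of an equivalent approach with groupby().transform instead of filter (same df as above); transform broadcasts the per-group boolean back to every row, which can then be used as a mask:
# True for every row whose ('a', 'b') group has at least one non-NaN 'c'
mask = df.groupby(['a', 'b'])['c'].transform(lambda s: s.notna().any())
print (df[mask])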

adding lists with different length to a new dataframe

I have two lists with different lengths, like a=[1,2,3] and b=[2,3].
I would like to generate a pd.DataFrame from them by padding NaN at the beginning of the shorter list, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest with reversed:
import numpy as np
import pandas as pd
from itertools import zip_longest

a = [1, 2, 3]
b = [2, 3]

L = [a, b]
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
Alternatively, if b is known to be the shorter list:
df = pd.DataFrame(list(zip(a, ([np.nan]*(len(a)-len(b)))+b)), columns=['a','b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.append(np.nan)  # append NaN
b = list(set(b))  # use a set to rearrange, then convert back to a list
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # dataframe
Alternatively
b.append(np.nan)  # append NaN
b = list(dict.fromkeys(b))  # use a dict to rearrange, then convert back to a list; this creates a dict with the list items as keys (and None values) in an ordered manner, getting NaN to the top
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # dataframe
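For an arbitrary number of lists, a small front-padding helper is another option (pad_front is just an illustrative name, not a pandas function):
import numpy as np
import pandas as pd

def pad_front(lst, length):
    # prepend NaNs until the list reaches the target length
    return [np.nan] * (length - len(lst)) + list(lst)

a = [1, 2, 3]
b = [2, 3]
n = max(len(a), len(b))
df = pd.DataFrame({'a': pad_front(a, n), 'b': pad_front(b, n)})
print (df)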

Pandas columns headers split

I have a dataframe whose column headers are made up of 3 tags separated by '__'.
E.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and just use the first tag with this code: df.columns = df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example tags 1 and 3,
giving
A__66 B__45
0
1
2
3
4
5
I've tried the below, but it's not working:
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]
With a specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__')
Out[124]: Index(['A__66', 'B__45'], dtype='object')
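A side note: on pandas 2.0 and later, str.replace defaults to regex=False, so the flag may need to be passed explicitly for the pattern to be treated as a regex:
# explicit regex flag needed on pandas 2.0+
df.columns = df.columns.str.replace(r'__[^_]+__', '__', regex=True)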
Use Index.map with f-strings to select the first and third values of the lists:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
You can also try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I found your own solution perfectly fine, and probably the most readable. It just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or for the sake of efficiency, we do not want to call str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
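If you would rather keep an explicit old-to-new mapping around (for example to inspect it before applying), a short sketch with rename works as well:
# build an old -> new mapping from the first and last tags, then rename
mapping = {}
for col in df.columns:
    parts = col.split('__')
    mapping[col] = f'{parts[0]}__{parts[-1]}'
df = df.rename(columns=mapping)
print (df.columns)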

How to concatenate two dfs having a similar datetime column?

I have two dfs which have one identical datetime column. I want to concatenate columns from one df to another, skipping where the data is missing. I want to print NaN for missing data.
I tried writing a while loop to concatenate. It gave this error:
ValueError: Can only compare identically-labeled Series objects
while df['TIMESTAMP'] == x['TIMESTAMP']:
    z = pd.concat([df, x], axis=1)
I expect to concatenate the two dfs, x and df. df has the full timestamp range and x has some missing values. I want to write the data from x into df with respect to the datetime column, writing NaN for missing values.
When you concatenate dataframes, one is appended to the bottom of the other:
DF1:
A B C
1 2 5
2 5 3
DF2:
A D E
1 2 3
3 4 7
Given my two example dataframes, if you concatenate you will get:
DF_Concat:
A B C D E
1 2 5 NaN NaN
2 5 3 NaN NaN
1 NaN NaN 2 3
3 NaN NaN 4 7
Whereas an outer merge will return:
DF_Merge:
A B C D E
1 2 5 2 3
2 5 3 NaN NaN
3 NaN NaN 4 7
It sounds to me like you are looking for a merge:
pd.merge(DF1, DF2, on='A', how='outer')
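For the timestamp case in the question specifically, a left merge should be closest to what is described: df keeps its full timestamp range and rows that are missing from x come out as NaN. A minimal sketch, assuming both frames have a TIMESTAMP column as in the question:
# keep every row of df; timestamps absent from x become NaN
z = pd.merge(df, x, on='TIMESTAMP', how='left')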

In DataFrame, how do you concatenate two columns with separator only if both values exist?

For example, say I have the following DataFrame
col1 col2
a b
NaN b
a NaN
If I simply do
df['col1'].fillna('')+'-'+df['col2'].fillna('')
I'll get
a-b
-b
a-
What I want instead is
a-b
b
a
I only want to include the separator if there are values on both sides
Add str.strip to remove the dangling separators:
(df['col1'].fillna('')+'-'+df['col2'].fillna('')).str.strip('-')
Out[367]:
0 a-b
1 b
2 a
dtype: object
Use stack with agg:
df.stack().groupby(level=0).agg('-'.join)
Out[371]:
0 a-b
1 b
2 a
dtype: object
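Another option is a row-wise sketch that drops the missing values before joining, so the separator only appears when both values are present (slower than the vectorized answers on large frames):
# join only the non-missing values in each row
out = df[['col1', 'col2']].apply(lambda r: '-'.join(r.dropna()), axis=1)
print (out)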