Pandas columns headers split - pandas

I have a dataframe whose column headers are each made up of 3 tags separated by '__'
E.g
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and use just the first tag with this code: df.columns=df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example 1 and 3.
giving
A__66 B__45
0
1
2
3
4
5
I've tried the below but it's not working
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]

With specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__', regex=True)
Out[124]: Index(['A__66', 'B__45'], dtype='object')

Use Index.map with an f-string to select the first and third values of each list:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN

Also you can try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]

I found your own solution perfectly fine, and probably the most readable. It just needs a small adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or for the sake of efficiency, we do not want to call str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
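If you need other combinations of tags, the split-and-rejoin idea from the answers above generalizes. Here is a minimal sketch, assuming every header contains at least as many tags as the highest position requested; select_tags is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def select_tags(columns, positions, sep="__"):
    """Keep only the tags at the given positions of each column name."""
    parts = columns.str.split(sep)  # Index of lists, e.g. ['A', '2', '66']
    return parts.map(lambda tags: sep.join(tags[i] for i in positions))

# Empty frame with the headers from the question
df = pd.DataFrame(columns=["A__2__66", "B__4__45"], index=range(6))
df.columns = select_tags(df.columns, positions=(0, 2))
print(list(df.columns))
```

Passing positions=(0, 1) or (2, 0) picks a different combination or order without touching the rest of the code.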

Related

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouping column col_a) to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as noted in the str.join documentation, joining NaN gives an empty result:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried converting col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and so are considered non-empty.
Not only does str.join now include the 'nan' strings, but other calculations in the script that rely on those NaNs are also broken.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
df['col_str'].astype(str))
But now str.join returns empty values again :-(
So, I tried fillna('') and even dropna(). None provided me with the desired results.
You get the vicious cycle here, right?
astype(str) => nan strings in join and calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Converting col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove missing values by DataFrame.dropna or Series.notna in boolean indexing:
df = pd.DataFrame({'col_a':[1,2,3,4,1,2,3,4,1,2],
'col_str':['a','b','c','d',np.nan, np.nan, np.nan, np.nan,'a', 's']})
df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
.groupby('col_a')['col_str'].unique()
.str.join(', ')).rename('labels')))
print (df1)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
.groupby('col_a')['col_str']
.unique().str.join(', ')).rename('labels')))
print (df2)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
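The same result can also be reached without the join/map round trip. The sketch below is an alternative, not what the answer above uses: it drops NaN inside each group and broadcasts the joined string back with groupby(...).transform. The helper name join_unique is made up for illustration; the data is the sample from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_a": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    "col_str": ["a", "b", "c", "d", np.nan, np.nan, np.nan, np.nan, "a", "s"],
})

def join_unique(values):
    # Drop NaN first, then join the distinct strings in order of appearance.
    return ", ".join(values.dropna().unique())

# transform broadcasts the per-group string back to every row of the group
df["labels"] = df.groupby("col_a")["col_str"].transform(join_unique)
print(df)
```

col_str itself is left untouched, so other calculations that rely on its NaN values keep working.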

selecting nan values in a pandas dataframe using loc [duplicate]

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
Col1 Col2 Col3
0 0 1 2
1 0 NaN 0
2 0 0 NaN
3 0 1 2
4 0 1 2
The result should be this one:
Col1 Col2 Col3
1 0 NaN 0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method IMO
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you want to filter the rows where some column has no NaN using query, you can do so by passing the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, as in the answer by @MaxU - stop WAR against UA:
df.query('Col2==Col2')
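Since the question asks about loc specifically, here is a short sketch of the same boolean masks passed through df.loc, run on the dataframe from the question. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2], [0, 1, 2]],
    columns=["Col1", "Col2", "Col3"],
)

col2_missing = df.loc[df["Col2"].isna()]      # rows where Col2 is NaN
any_missing = df.loc[df.isna().any(axis=1)]   # rows with at least one NaN
complete = df.loc[df.notna().all(axis=1)]     # rows with no NaN at all
nan_counts = df.isna().sum(axis=1)            # number of NaNs in each row

print(col2_missing)
```

df.loc[mask] and df[mask] select the same rows here; loc additionally lets you pick columns in the same call, e.g. df.loc[mask, ['Col1', 'Col2']].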

adding lists with different length to a new dataframe

I have two lists with different lengths, like a=[1,2,3] and b=[2,3]
I would like to generate a pd.DataFrame from them, padding the shorter list with nan at the beginning, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest with reversed method:
from itertools import zip_longest
a=[1,2,3]
b=[2,3]
L = [a, b]
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a','b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
Alternatively, if b has fewer values than a:
df = pd.DataFrame(list(zip(a, ([np.nan]*(len(a)-len(b)))+b)), columns=['a','b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.insert(0, np.nan)  # prepend NaN so that b has the same length as a
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # build the dataframe
Alternatively
b = [np.nan] * (len(a) - len(b)) + b  # pad by the length difference, which also covers gaps larger than one
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # build the dataframe
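For more than two lists, the front-padding step can be pulled into a small function. This is a sketch under the assumption that all lists should be right-aligned to the longest one; left_pad is a hypothetical helper, not part of pandas or itertools.

```python
import numpy as np
import pandas as pd

def left_pad(values, length, fill=np.nan):
    """Return a new list padded at the front with `fill` up to `length`."""
    return [fill] * (length - len(values)) + list(values)

lists = {"a": [1, 2, 3], "b": [2, 3]}
longest = max(len(v) for v in lists.values())

# Every column is padded to the same length, so the dict can be passed directly
df = pd.DataFrame({name: left_pad(v, longest) for name, v in lists.items()})
print(df)
```

The original lists are not modified, which avoids surprises if a or b are reused later.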

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to lookup value from one data frame (conditionally) and place the results in a different dataframe with a new column name
df_1 = pd.DataFrame({'user_id': [1,2,1,4,5],
'name': ['abc','def','ghi','abc','abc'],
'rank': [6,7,8,9,10]})
df_2 = pd.DataFrame ({'user_id': [1,2,3,4,5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and only fill "priority_rank" based on the conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am interested in only df_1['name'] == 'abc' all else should be blank
df_2 should end up looking like this:
|user_id|priority_rank|
1 6
2
3
4 9
5 10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name=='abc',:],how='left').drop(columns='name')
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
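One thing to watch with the map approach: it expects each user_id to appear only once in the lookup Series. A hedged sketch, assuming that when several 'abc' rows share a user_id you want the first one (that rule is my assumption, not something stated in the question):

```python
import pandas as pd

df_1 = pd.DataFrame({"user_id": [1, 2, 1, 4, 5],
                     "name": ["abc", "def", "ghi", "abc", "abc"],
                     "rank": [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({"user_id": [1, 2, 3, 4, 5]})

lookup = (
    df_1.loc[df_1["name"] == "abc"]
        .drop_duplicates(subset="user_id", keep="first")  # assumption: first match wins
        .set_index("user_id")["rank"]
)

# user_ids without an 'abc' row get NaN
df_2["priority_rank"] = df_2["user_id"].map(lookup)
print(df_2)
```

With the sample data the drop_duplicates step changes nothing, since each user_id has at most one 'abc' row.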

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't in alphabetical order but makes logical sense, ex.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent df.append() from ordering the columns alphabetically?
Seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# select the columns in that order (assigning to merged.columns would only relabel them, not move the data)
merged = merged[cols]
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
from numpy.random import randn
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data = randn(5,3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=randn(5,3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged = merged[cols]
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 1.528579 0.028935 -0.687143 NaN NaN
1 -0.720132 0.943432 -2.055357 NaN NaN
2 1.556319 0.035234 0.020756 NaN NaN
3 -1.458852 1.447863 0.847496 NaN NaN
4 -0.222660 0.132337 -0.255578 NaN NaN
0 NaN NaN NaN 0.727619 0.131085
1 NaN NaN NaN 0.022209 -1.942110
2 NaN NaN NaN -0.350757 0.944052
3 NaN NaN NaN 1.116637 -1.796448
4 NaN NaN NaN 1.947526 0.961545
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.850022
1 0.672965
2 1.274509
3 0.130338
4 -0.741825
The same alpha sorting of the columns happens with concat also so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found "advanced indexing" to work quite well
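df2 = df.loc[:, order_of_columns]  # order_of_columns: a list of column names in the desired order (.ix is no longer available in current pandas)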
As I see it, the column order is lost on append, but the original data already has the correct order. To keep it, assuming a dataframe 'alldata' and a dataframe of data to be appended, 'newdata', append and then select the columns in the order of 'alldata':
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
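On pandas 2.0 and later, DataFrame.append no longer exists, so the same idea has to go through pd.concat. A minimal sketch, with made-up values and a hypothetical one-row dictionary standing in for the result of a run:

```python
import pandas as pd

alldata = pd.DataFrame({"Org_Goals_1": [1.0],
                        "Calc_Goals_1": [0.8],
                        "Diff_Goals_1": [0.2]})

# Result of a new run, with keys in a different order (illustrative values)
new_row = {"Diff_Goals_1": 0.1, "Org_Goals_1": 2.0, "Calc_Goals_1": 1.9}
newdata = pd.DataFrame([new_row])

column_order = list(alldata.columns)  # remember the logical order
alldata = pd.concat([alldata, newdata], ignore_index=True)[column_order]
print(alldata)
```

Selecting with [column_order] at the end keeps the log in its original order regardless of how the new row's keys were ordered. Columns that exist only in newdata would be dropped by that selection, so extend column_order first if new columns are expected.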