Add rows from another df based on keys in pandas

I have a large df with many rows that share the same value in some of the columns.
I want to do the following:
new_df = the rows in df that have a value (not empty) in a certain column:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 2, 3, 4],
                   "b": ['A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})
df1 = df[~df['c'].isnull()]
Then add to new_df the rows from df that share the two keys 'a' and 'b'.
I tried to use merge:
df2 = pd.merge(df1, df, on=['a', 'b'], how='left')
But the result was that it added the same row several times instead of the unique rows:
a b c_x c_y
0 2 B 2.0 2.0
1 2 B 2.0 NaN
2 2 B 2.0 NaN
I want to keep only one 'c' column with all the values. Not sure what approach to use.
Hope I made it clear...
Thanks!

As far as I understand, you would like to group by 'a' and 'b' and return only those groups where at least one row has a non-NaN value in column 'c'. If that's the case, here you go.
Load the df:
df = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[None, None,None,2,None,None,None,None]})
filter for any non-NaNs:
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
output:
a b c
3 2 B 2.0
4 2 B NaN
5 2 B NaN
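An equivalent approach (a sketch, not part of the original answer) builds a boolean mask with groupby().transform, which is often faster than filter on large frames because the result can be used for plain boolean indexing:
# True for every row whose ('a', 'b') group has at least one non-NaN 'c'
mask = df.groupby(['a', 'b'])['c'].transform(lambda g: g.notna().any())
new_df = df[mask]
print(new_df)  # same three rows as the filter version above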

Related

Copying and pasting values from one dataframe to another dataframe

I have two dataframes (df1, df2), defined in the edit below.
I would like a final dataframe in which df1's missing values are filled in from df2, matched on 'name'.
How do I do this using Pandas?
EDIT: The following solution was suggested by @sophocles.
df1 = pd.DataFrame({'name': ['a', 'b', 'c'],
                    'val1': [1, None, 3],
                    'val2': [4, 5, 6]})
df2 = pd.DataFrame({'name': ['b'],
                    'val1': [2]})
df1 and df2:
name val1 val2
0 a 1.0 4
1 b NaN 5
2 c 3.0 6
name val1
0 b 2
Simply use fillna:
df1.set_index('name').fillna(df2.set_index('name')).reset_index()
This is much faster than using the merge method.
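For the frames above, a quick self-contained check of the approach (the printed result is what fillna produces here):
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'],
                    'val1': [1, None, 3],
                    'val2': [4, 5, 6]})
df2 = pd.DataFrame({'name': ['b'], 'val1': [2]})

# fillna accepts another DataFrame and aligns on both index and columns,
# so only matching ('name', column) cells are filled
out = df1.set_index('name').fillna(df2.set_index('name')).reset_index()
print(out)
#   name  val1  val2
# 0    a   1.0     4
# 1    b   2.0     5
# 2    c   3.0     6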

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouped column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as seen in the str.join documentation, joining over NaN produces an empty result:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried to convert col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then empty values literally hold the string 'nan', and are hence considered non-empty.
Not only does str.join now include 'nan' strings, but other calculations in the script that rely on those NaNs are ruined too.
To address that, I thought about converting just the non-empty values, as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']),
                         df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So I tried fillna('') and even dropna(). Neither gave me the desired results.
You get the vicious cycle here, right?
astype(str) => 'nan' strings in the join, and other calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a CSV. Code to test:
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Convertin col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove the missing values, either with Series.notna in boolean indexing or with DataFrame.dropna:
df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan, np.nan, np.nan, 'a', 's']})
df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
.groupby('col_a')['col_str']
.unique().str.join(', ')).rename('labels')))
print (df2)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
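A more compact variant (a sketch, not from the answer above) drops the NaNs inside the aggregation itself, producing the same labels column:
df['labels'] = df['col_a'].map(
    df.groupby('col_a')['col_str']
      .agg(lambda s: ', '.join(s.dropna().unique()))
)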

adding lists with different lengths to a new dataframe

I have two lists with different lengths, like a=[1,2,3] and b=[2,3].
I would like to generate a pd.DataFrame from them by padding NaN at the beginning of the shorter list, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest on the reversed lists:
import numpy as np
import pandas as pd
from itertools import zip_longest

a = [1, 2, 3]
b = [2, 3]
L = [a, b]

# Reverse each list, pad at the (now) end, then reverse back
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print(df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
Alternatively, if b is the shorter list, prepend the padding directly:
df = pd.DataFrame(list(zip(a, [np.nan] * (len(a) - len(b)) + b)), columns=['a', 'b'])
print(df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.insert(0, np.nan)  # pad NaN at the beginning so b lines up with a
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # build the dataframe
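A pandas-native variant (a sketch, not one of the answers above) shifts the shorter Series' index so that alignment does the front-padding:
import pandas as pd

a = [1, 2, 3]
b = [2, 3]
n = max(len(a), len(b))

# Give b an index that ends at n - 1; concat aligns it to the bottom rows
sa = pd.Series(a, index=range(n))
sb = pd.Series(b, index=range(n - len(b), n))
df = pd.concat([sa, sb], axis=1, keys=['a', 'b'])
print(df)
#    a    b
# 0  1  NaN
# 1  2  2.0
# 2  3  3.0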

Map a pandas column with column names

I have two data frames:
import pandas as pd
# "Column" holds the name of the df2 column to match "Item" against
df1 = pd.DataFrame({"Column": pd.Series(['a', 'b', 'b', 'c']),
                    "Item": pd.Series(['x', 'y', 'z', 'x']),
                    "Result": pd.Series([3, 4, 5, 6])})
df2 = pd.DataFrame({"a": pd.Series(['x', 'n', 'n']),
                    "b": pd.Series(['x', 'y', 'n']),
                    "c": pd.Series(['x', 'z', 'n'])})
How can I add "Result" to df2, based on looking up the "Item" value in the df2 column named by "Column"?
Expected dataframe df2 is:
a b c Result
- - - ------
x x x 3
n y z 4
n n n null
This is a lot more complicated than at first glance. df1 is in long-form, it has two entries for 'b'. So first it needs to be stacked/unstacked/pivoted into a 3x3 table of 'Result' where 'Column' becomes the index, and the values from 'Item' = 'x'/'y'/'z' are expanded to a full 3x3 matrix with NaN for missing values:
>>> df1_full = df1.pivot(index='Column', columns='Item', values='Result')
Item x y z
Column
a 3.0 NaN NaN
b NaN 4.0 5.0
c 6.0 NaN NaN
(Note the unwanted type-conversion to float, this is because numpy doesn't have NaN for integers, see Issue 17013 in pre-pandas-0.22.0 versions. No problem, we'll just cast back to int at the end.)
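If you'd rather avoid the float conversion altogether, a side note (not part of the original walkthrough): pandas' nullable integer dtype keeps integers alongside missing values.
# Optional: 'Int64' (capital I) is the nullable integer dtype
df1_full = df1.pivot(index='Column', columns='Item', values='Result').astype('Int64')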
Now we want to do df1_full.merge(df2, left_index=True, right_on=??)
But first we need another trick/intermediate column to find the leftmost valid value in df2 which corresponds to a valid column-name from df1; the value n is invalid, maybe we replace it with NaN to make life easier:
>>> df2_nan = df2.replace('n', np.nan)
>>> df2_nan
a b c
0 x x x
1 NaN y z
2 NaN NaN NaN
>>> df2_nan.columns = [0, 1, 2]
>>> df2_nan
0 1 2
0 x x x
1 NaN y z
2 NaN NaN NaN
And we want to successively test df2's columns from left to right as to whether their value is in df1_full.columns, similar to "Computing the first non-missing value from each column in a DataFrame", except testing successive columns (axis=1). Then store that intermediate column name in a new column, 'join_col':
>>> df2['join_col'] = df2.replace('n', np.nan).apply(pd.Series.first_valid_index, axis=1)
a b c join_col
0 x x x a
1 n y z b
2 n n n None
Actually we want to index into the column-names of df1, but it blows up on the NaN:
>>> df1.columns[ df2_nan.apply(pd.Series.first_valid_index, axis=1) ]
(Well that's not exactly working, but you get the idea.)
Finally we do the merge: df1_full.merge(df2, left_index=True, right_on='join_col'). Then maybe take the desired column slice ['a','b','c','Result'], and cast Result back to int, or map NaN -> 'null'.
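Putting the steps together, a minimal end-to-end sketch (it swaps the final merge for a MultiIndex lookup, which sidesteps the NaN in 'join_col'; the lookup names below are my own, not from the answer):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Column": ['a', 'b', 'b', 'c'],
                    "Item": ['x', 'y', 'z', 'x'],
                    "Result": [3, 4, 5, 6]})
df2 = pd.DataFrame({"a": ['x', 'n', 'n'],
                    "b": ['x', 'y', 'n'],
                    "c": ['x', 'z', 'n']})

# Leftmost column per row whose value is not the invalid marker 'n'
join_col = df2.replace('n', np.nan).apply(pd.Series.first_valid_index, axis=1)

# Look up each (Column, Item) pair in df1 through a MultiIndex Series
lookup = df1.set_index(['Column', 'Item'])['Result']
df2['Result'] = [
    lookup.get((col, df2.at[i, col])) if col is not None else np.nan
    for i, col in join_col.items()
]
print(df2)
#    a  b  c  Result
# 0  x  x  x     3.0
# 1  n  y  z     4.0
# 2  n  n  n     NaN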

Pandas: merge miscellaneous keys into the "others" row

I have a DataFrame like this
DataFrame({"key":["a","b","c","d","e"], "value": [5,4,3,2,1]})
I am mainly interested in rows "a", "b" and "c". I want to merge everything else into an "others" row, like this:
key value
0 a 5
1 b 4
2 c 3
3 others 3
I wonder how can this be done.
First create a dataframe without d and e:
df2 = df[df.key.isin(["a","b","c"])]
Then find the value that you want the "others" row to have (using the sum in this example):
val = df[~df["key"].isin(["a", "b", "c"])]["value"].sum()
Finally, append this as a new row to df2 (DataFrame.append was removed in pandas 2.0, so use pd.concat):
df2 = pd.concat([df2, pd.DataFrame([{"key": "others", "value": val}])], ignore_index=True)
df2 is now:
key value
0 a 5
1 b 4
2 c 3
3 others 3
I have found a way to do it. Not sure if it is the best way.
In [3]: key_map = {"a":"a", "b":"b", "c":"c"}
In [4]: data['key1'] = data['key'].map(lambda k: key_map.get(k, "others"))
In [5]: data.groupby("key1").sum()
Out[5]:
value
key1
a 5
b 4
c 3
others 3
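The same idea in fewer steps (a sketch built on Series.where, not part of the answers above): overwrite every key outside the whitelist with 'others' before grouping.
df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"], "value": [5, 4, 3, 2, 1]})

# Keep 'a', 'b', 'c'; everything else becomes 'others'
df["key"] = df["key"].where(df["key"].isin(["a", "b", "c"]), "others")
print(df.groupby("key", as_index=False)["value"].sum())
#       key  value
# 0       a      5
# 1       b      4
# 2       c      3
# 3  others      3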