map one column in a df to another df where all words are present - pandas

I am trying to map a column onto a dataframe from another dataframe, matching a row whenever all of the words in the lookup column appear in the target column.
Multiple matches are fine, as I can filter them out afterwards.
Thanks in advance!
df1
ColA
this is a sentence
with some words
in a column
and another
for fun
df2
ColB        ColC
this a      123
in column   456
fun times   789
Some attempts
dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)
dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))
desired output
dfResult
ColA                ColC
this is a sentence  123
with some words     NaN
in a column         456
and another         NaN
for fun             NaN

Convert to sets and look for subsets with NumPy broadcasting
Disclaimer: No assurances that this will be fast.
A = df1.ColA.str.split().apply(set).to_numpy() # If pandas version is < 0.24 use `.values`
B = df2.ColB.str.split().apply(set).to_numpy() # instead of `.to_numpy()`
C = df2.ColC.to_numpy()
# When `dtype` is `object` Numpy falls back on performing
# the operation on each pair of values. Since these are `set` objects
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])
out = pd.array([np.nan] * len(A), pd.Int64Dtype()) # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]
df1.assign(ColC=out)
ColA ColC
0 this is a sentence 123
1 with some words NaN
2 in a column 456
3 and another NaN
4 for fun NaN
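For intuition, here is a minimal standalone sketch of the subset test that the broadcasting relies on (toy sets chosen to mirror the data above): between Python sets, <= means "is a subset of", and NumPy applies it element-wise when the arrays hold object dtype.
import numpy as np
# Each element is a set; `dtype=object` makes NumPy apply `<=` pairwise.
A = np.array([{'in', 'a', 'column'}, {'for', 'fun'}], dtype=object)
B = np.array([{'in', 'column'}, {'fun', 'times'}], dtype=object)
print(B <= A[:, None])
# [[ True False]
#  [False False]]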

Using a loop and set.issubset
pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan
               for z, y in zip(df2.ColB, df2.ColC)]
              for x in df1.ColA]).max(1)
Out[34]:
0 123.0
1 NaN
2 456.0
3 NaN
4 NaN
dtype: float64
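To get the layout in the question's desired output, the same expression can be attached back to df1 (a sketch; it assumes df1 has a default RangeIndex so the positions line up):
out = pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan
                     for z, y in zip(df2.ColB, df2.ColC)]
                    for x in df1.ColA]).max(1)
df1.assign(ColC=out)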

Related

Pandas: If condition on multiple columns having null values and fillna with 0

I have the dataframe below, and my requirement is: if both columns are np.nan then make no change, but if only one of the columns is empty then fill it with 0. I wrote this code, but why is it not working? Please suggest.
import pandas as pd
import numpy as np
data = {'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
        'Salary': [217, np.nan, 262, 352, 570, np.nan]}
df = pd.DataFrame(data)
print(df)
cond1 = (df['Age'].isnull()) & (df['Salary'].isnull())
if cond1 is False:
    df['Age'] = df['Age'].fillna(0)
    df['Salary'] = df['Salary'].fillna(0)
print(df)
You can just assign it with update
c = ['Age', 'Salary']
# Fill NaN with 0 only on rows that are not entirely NaN, then write the result back in place.
df.update(df.loc[~df[c].isna().all(1), c].fillna(0))
df
Out[341]:
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
c1 = df['Age'].isna()
c2 = df['Salary'].isna()
# Boolean mask that is True where exactly one of the two columns is NaN; assign 0 there.
df[np.c_[c1 & ~c2, ~c1 & c2]] = 0
df
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
tmp=df.loc[(df['Age'].isna() & df['Salary'].isna())]
df.fillna(0,inplace=True)
df.loc[tmp.index]=np.nan
This might be a bit less sophisticated than the other answers, but it worked for me:
first save the row(s) where both values are NaN at the same time,
then fillna on the original df as per normal,
then set np.nan back at the saved locations.
Get the rows that are all nulls and use where to exclude them during the fill:
bools = df.isna().all(axis = 1)
df.where(bools, df.fillna(0))
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
Your if statement won't work because you need to check each row for True or False: cond1 is a Series, and cond1 is False always evaluates to False (a Series is never the singleton False object); the Series can contain a mix of True and False values, one per row.
An inefficient way would be to traverse the rows:
for row, index in zip(cond1, df.index):
    if not row:
        df.loc[index] = df.loc[index].fillna(0)
Apart from the inefficiency, you have to keep track of index positions yourself; the pandas options abstract that away while staying quite efficient, since the looping happens in C.
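A vectorized equivalent of that loop, reusing cond1 from the question, would be something like this sketch:
# Fill NaN with 0 only on rows where cond1 (both columns NaN) is False.
df.loc[~cond1] = df.loc[~cond1].fillna(0)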

Generate dataframe with same key but different value

How can I generate a dataframe containing only the rows that share a common key but have different values?
import pandas as pd
A = {"ID":["A", "B","C"], "Weight":[500,300,200]}
B = {"ID":["A", "B","D"], "Weight":[500,100,100]}
dfA = pd.DataFrame(data=A)
dfB = pd.DataFrame(data=B)
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
dfC
Current output is:
ID Weight_x Weight_y
0 A 500.0 500.0
1 B 300.0 100.0
2 C 200.0 NaN
3 D NaN 100.0
But my expected output is below (ID is the common key; A has identical values on both sides, while C and D are not common to both):
ID Weight_x Weight_y
0 B 300.0 100.0
Use a simple merge with the default how='inner' (as suggested by @ALollz) and query to keep rows whose weights differ:
>>> pd.merge(dfA, dfB, on='ID').query("Weight_x != Weight_y")
ID Weight_x Weight_y
1 B 300 100
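If you prefer plain boolean indexing over query, an equivalent sketch:
m = dfA.merge(dfB, on='ID')          # inner join keeps only the common IDs
m[m['Weight_x'] != m['Weight_y']]    # keep rows where the weights differ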

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost on how to approach this problem: I have a dataframe where the information I need is mostly grouped in blocks of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filter by each ID and unstack the rows into a new dataframe, but I cannot get far with the unstack or groupby functions. Is there an easy function or combination that can do this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe usually has the information in 4 rows per ID, but not always. The resulting dataframe should have only one row per ID occurrence, with all the information in columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with non-ASCII characters that I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
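Applied to the semicolon-separated sample in the question, the same pattern might look like the sketch below. The file name, header=None (the sample has no header row) and the encoding are assumptions on my part; adjust them to your data. The encoding argument is also how read_csv deals with the non-ASCII characters mentioned in the question.
import pandas as pd

# Hypothetical file name and encoding; the ID sits in column 0 because there is no header.
df = pd.read_csv('measurements.csv', sep=';', header=None, encoding='latin-1')
g = df.groupby(0).cumcount()
wide = df.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b+1}' for a, b in wide.columns]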

selecting nan values in a pandas dataframe using loc [duplicate]

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.NaN, 0], [0, 0, np.NaN], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
   Col1  Col2  Col3
0 0 1 2
1 0 NaN 0
2 0 0 NaN
3 0 1 2
4 0 1 2
The result should be this one:
   Col1  Col2  Col3
1 0 NaN 0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you're looking to filter the rows where there is no NaN in some column using query, you could do so with the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, as @MaxU - stop WAR against UA did:
df.query('Col2==Col2')
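For completeness, a minimal runnable setup to try the selections above (it just mirrors the question's dataframe):
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)],
                  columns=["Col1", "Col2", "Col3"])
print(df[df['Col2'].isnull()])    # only row 1
print(df[df.isna().any(axis=1)])  # rows 1 and 2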

Map a pandas column with column names

I have two data frames:
import pandas as pd
# Column contains column name
df1 = pd.DataFrame({"Column": pd.Series(['a', 'b', 'b', 'c']),
"Item": pd.Series(['x', 'y', 'z', 'x']),
"Result": pd.Series([3, 4, 5, 6])})
df2 = pd.DataFrame({"a": pd.Series(['x', 'n', 'n']),
"b": pd.Series(['x', 'y', 'n']),
"c": pd.Series(['x', 'z', 'n'])})
How can I add "Result" to df2 based on the "Item" in the "Column"?
Expected dataframe df2 is:
a b c Result
- - - ------
x x x 3
n y z 4
n n n null
This is a lot more complicated than it looks at first glance. df1 is in long form and has two entries for 'b'. So first it needs to be stacked/unstacked/pivoted into a 3x3 table of 'Result', where 'Column' becomes the index and the values from 'Item' ('x'/'y'/'z') are expanded to a full 3x3 matrix with NaN for missing values:
>>> df1_full = df1.pivot(index='Column', columns='Item', values='Result')
Item x y z
Column
a 3.0 NaN NaN
b NaN 4.0 5.0
c 6.0 NaN NaN
(Note the unwanted type conversion to float; this happens because numpy doesn't have NaN for integers, see Issue 17013 in pre-0.22.0 pandas versions. No problem, we'll just cast back to int at the end.)
Now we want to do df1_full.merge(df2, left_index=True, right_on=??)
But first we need another trick: an intermediate column holding the leftmost valid value in df2 that corresponds to a column name from df1. The value n is invalid, so we replace it with NaN to make life easier:
>>> df2_nan = df2.replace('n', np.NaN)
>>> df2_nan
a b c
0 x x x
1 NaN y z
2 NaN NaN NaN
>>> df2_nan.columns = [0, 1, 2]
>>> df2_nan
0 1 2
0 x x x
1 NaN y z
2 NaN NaN NaN
And we want to successively test df2's columns from left to right for whether their value is in df1_full.columns, similar to Computing the first non-missing value from each column in a DataFrame, except testing successive columns (axis=1). Then store that intermediate column name in a new column, 'join_col':
>>> df2['join_col'] = df2.replace('n', np.NaN).apply(pd.Series.first_valid_index, axis=1)
a b c join_col
0 x x x a
1 n y z b
2 n n n None
Actually we want to index into the column-names of df1, but it blows up on the NaN:
>>> df1.columns[ df2_nan.apply(pd.Series.first_valid_index, axis=1) ]
(Well that's not exactly working, but you get the idea.)
Finally we do the merge df1_full.merge(df2, left_index=True, right_on='join_col'). Then maybe take the desired column slice ['a', 'b', 'c', 'Result'], cast Result back to int, or map NaN -> 'null'.
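Putting the pieces together, one possible completion is sketched below. It swaps the final merge for a direct .loc lookup, because after merging you would still need to pick the right 'Item' column per row; this is my interpretation of the last step, not the answer author's exact code, and the nullable Int64 dtype is my choice for representing the null.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Column": ['a', 'b', 'b', 'c'],
                    "Item": ['x', 'y', 'z', 'x'],
                    "Result": [3, 4, 5, 6]})
df2 = pd.DataFrame({"a": ['x', 'n', 'n'],
                    "b": ['x', 'y', 'n'],
                    "c": ['x', 'z', 'n']})

# Wide lookup table: index = column names, columns = items.
df1_full = df1.pivot(index='Column', columns='Item', values='Result')

# Leftmost column of df2 whose value is not 'n' (None if the whole row is 'n').
join_col = df2.replace('n', np.NaN).apply(pd.Series.first_valid_index, axis=1)

# For each row, look up Result at (column name, item value); NaN when there is no match.
result = [df1_full.loc[c, df2.loc[i, c]] if pd.notna(c) else np.nan
          for i, c in join_col.items()]
df2.assign(Result=pd.array(result, dtype='Int64'))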