Left merge two dataframes based on recordlinkage pair matches (multi index) - pandas

import pandas as pd
import recordlinkage as rl
lst_left = [...]
lst_right = [...]
df_left = pd.DataFrame(lst_left, columns=pd.Index(["city_id", "street_name"]))
df_right = pd.DataFrame(lst_right, columns=pd.Index(["city_id", "street_name"]))
indexer = rl.Index()
indexer.block("city_id")
pairs = indexer.index(df_left, df_right)
compare = rl.Compare(indexing_type="label")
compare.string("street_name", "street_name", method="damerau_levenshtein", threshold=0.7)
features = compare.compute(pairs, df_left, df_right)
matches = features[features[0] == 1.0]
And I get a MultiIndex of matched pairs:
Out[4]:
          0
0  0    1.0
1  1    1.0
2  2    1.0
4  3    1.0
6  5    1.0
7  6    1.0
8  7    1.0
10 8    1.0
12 9    1.0
13 10   1.0
14 11   1.0
15 12   1.0
Now I want to left join (SQL left outer join) the df_left and df_right DataFrames based on those matched pairs, keeping unmatched rows from df_left.
How can I do that?
P.S. To get only the matched records I use:
df_left.loc[matches.index.get_level_values(0)].reset_index().merge(
    df_right.loc[matches.index.get_level_values(1)].reset_index(),
    how="left", left_index=True, right_index=True)
But I don't know how to merge while keeping the unmatched rows from the left DataFrame.
Thank You
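For reference, one possible approach (a sketch, not from the original thread): turn the matched MultiIndex into a small link table and chain two left merges, so rows of df_left without a match survive with NaN in the df_right columns.
import pandas as pd

# Link table built from the matched pairs: left row label -> right row label
links = pd.DataFrame({
    "left_idx": matches.index.get_level_values(0),
    "right_idx": matches.index.get_level_values(1),
})

# Left-merge df_left onto the links, then onto df_right;
# unmatched df_left rows keep NaN in the df_right columns.
result = (
    df_left.reset_index()
    .merge(links, how="left", left_on="index", right_on="left_idx")
    .merge(df_right.reset_index(), how="left",
           left_on="right_idx", right_on="index",
           suffixes=("_left", "_right"))
)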

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas DataFrame with some very extreme values, more than 5 standard deviations from the column mean.
I want to replace, per column, each such value with the maximum of the other values in that column.
For example,
df =
     A    B
     1    2
     1    6
     2    8
     1  115
   191    1
Will become:
df =
   A  B
   1  2
   1  6
   2  8
   1  8
   2  1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B:
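For example, a direct mirror of the line above, using the same assumed threshold q:
df.loc[df['B'] > q, 'B'] = max(df.loc[df['B'] < q, 'B'])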
Calculate a column-wise z-score (if you deem something an outlier when it lies outside a given number of standard deviations of the column), then calculate a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
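For instance, to fill each outlier with the maximum of the remaining values in its column (a sketch building on the mask above):
# Replace outliers with NaN, then fill each column with the max of its non-outlier values
df_filled = df.mask(outlier_mask)
df_filled = df_filled.fillna(df_filled.max())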

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost on how to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows and filter for each ID, unstacking the rows into a new dataframe. I cannot obtain much from the unstack or groupby functions. Is there an easy function or combination that can do this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has information in 4 rows, but not always. The final dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []  # list to accumulate the raw rows
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() commands have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
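If it helps, the same pattern could be applied directly to the original semicolon-separated file. A sketch, where the file name and the encoding (to work around the non-ASCII issue) are assumptions, and the ID is taken from the first, unnamed column:
import pandas as pd

# Read the raw file; encoding is a guess to cope with the non-ASCII characters
raw = pd.read_csv("measurements.csv", sep=";", header=None, encoding="latin-1")

# Number the rows within each ID group, then unstack them into columns
g = raw.groupby(0).cumcount()  # column 0 holds the ID
wide = raw.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b+1}' for a, b in wide.columns]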

How to map different column values to one column

I have a data frame below:
import pandas as pd
df = pd.DataFrame({"SK":["EYF","EYF","RMK","MB","RMK","GYF","RMK","MYF"],
"SA":["a","b","tm","tmb","tm","cd","tms","alb"],
"C":["","11","12","13","","15","16","17"]})
df
I want to map some values of "SK","SA" and "C" to a new column:
df["D"]= df["SK"].map({"EYF":1,"MB":2,"GYF":3})
df
df["D"]= df["SA"].map({"tm":4})
df
df["D"]= df["C"].map({"16":5,"17":6})
df
But when I run the next map call, the "D" values mapped by the previous call turn to NaN.
I want to get the df below:
Any help will be appreciated.
You can create 3 Series and then replace missing values from the previous Series with Series.fillna or Series.combine_first:
a = df["SK"].map({"EYF":1,"MB":2,"GYF":3})
b = df["SA"].map({"tm":4})
c = df["C"].map({"16":5,"17":6})
df["D"] = a.fillna(b).fillna(c)
#alternative
df["D"] = a.combine_first(b).combine_first(c)
print (df)
SK SA C D
0 EYF a 1.0
1 EYF b 11 1.0
2 RMK tm 12 4.0
3 MB tmb 13 2.0
4 RMK tm 4.0
5 GYF cd 15 3.0
6 RMK tms 16 5.0
7 MYF alb 17 6.0
Order is important for priority if the same row matches in more than one column:
df = pd.DataFrame({"SK":["EYF","EYF"],
"SA":["a","tm"],
"C":["16","17"]})
a = df["SK"].map({"EYF":1,"MB":2,"GYF":3})
b = df["SA"].map({"tm":4})
c = df["C"].map({"16":5,"17":6})
df["D1"] = a.fillna(b).fillna(c)
df["D2"] = b.fillna(a).fillna(c)
df["D3"] = c.fillna(b).fillna(a)
print (df)
SK SA C D1 D2 D3
0 EYF a 16 1 1.0 5
1 EYF tm 17 1 4.0 6
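As a side note (not from the original answer), a similar priority-based mapping can be sketched with numpy.select, where the first matching condition wins:
import numpy as np

conditions = [df["SK"].isin(["EYF", "MB", "GYF"]),
              df["SA"].eq("tm"),
              df["C"].isin(["16", "17"])]
choices = [df["SK"].map({"EYF": 1, "MB": 2, "GYF": 3}),
           df["SA"].map({"tm": 4}),
           df["C"].map({"16": 5, "17": 6})]
df["D"] = np.select(conditions, choices, default=np.nan)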

Adding lists with different lengths to a new dataframe

I have two lists with different lengths, like a = [1, 2, 3] and b = [2, 3].
I would like to generate a pd.DataFrame from them, padding the shorter list with NaN at the beginning, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest together with reversed:
import numpy as np
import pandas as pd
from itertools import zip_longest

a = [1, 2, 3]
b = [2, 3]

L = [a, b]
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
An alternative, if b is known to be the shorter list:
df = pd.DataFrame(list(zip(a, [np.nan] * (len(a) - len(b)) + b)), columns=['a', 'b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.append(np.nan)  # append NaN
b = list(set(b))  # use a set to rearrange, then return to a list
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # dataframe
Alternatively
b.append(np.nan)  # append NaN
b = list(dict.fromkeys(b))  # use a dict to rearrange, then return to a list; this creates a dict with the list items as keys (values None) in an ordered manner, getting NaN to the top
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])  # dataframe
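Another possible sketch (not from the original answers): build Series on shifted indices so pandas aligns them when constructing the DataFrame, which avoids reversing or mutating the lists:
import pandas as pd

a = [1, 2, 3]
b = [2, 3]
n = max(len(a), len(b))
df = pd.DataFrame({
    "a": pd.Series(a, index=range(n - len(a), n)),  # fills rows 0..2
    "b": pd.Series(b, index=range(n - len(b), n)),  # fills rows 1..2, row 0 becomes NaN
}).sort_index()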

Iterating over rows and columns in Pandas

I am trying to fill all NaN values in each column with the mean of that column.
import numpy as np
import pandas as pd
table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})
def impute_missing_values(table):
    for column in table:
        for value in column:
            if value == 'NaN':
                value = column.mean(skipna=True)
            else:
                value = value

impute_missing_values(table)
table
Why am I getting an error with this code?
IIUC:
table.fillna(table.mean())
Output:
A B C
0 1.0 3.0 4
1 2.0 3.0 5
2 1.5 3.0 6
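One small addition (not in the original answer): fillna returns a new DataFrame rather than modifying table in place, so assign the result back to keep it:
table = table.fillna(table.mean())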
Okay, I am adding this as another answer because this isn't something I recommend at all. Using pandas methods vectorizes operations for better performance.
Loops are best avoided whenever possible.
However, here is a quick fix to your code:
import pandas as pd
import numpy as np
import math

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})

def impute_missing_values(df):
    for column in df:
        for idx, value in df[column].iteritems():
            if math.isnan(value):
                df.loc[idx, column] = df[column].mean(skipna=True)
            else:
                pass
    return df

impute_missing_values(table)
table
Output:
A B C
0 1.0 3.0 4
1 2.0 3.0 5
2 1.5 3.0 6
You can try the SimpleImputer from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) using the mean strategy.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})
print(table, '\n')

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
table_means = pd.DataFrame(imp.fit_transform(table), columns={'C', 'B', 'A'})
print(table_means)
The print commands result in:
A B C
0 1.0 3.0 4
1 2.0 NaN 5
2 NaN NaN 6
A C B
0 1.0 3.0 4.0
1 2.0 3.0 5.0
2 1.5 3.0 6.0
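One caveat (not in the original answer): passing a Python set for columns does not guarantee their order, which is why the second frame above prints A C B while the data stays in A, B, C order. Reusing the input frame's own columns keeps the labels aligned:
table_means = pd.DataFrame(imp.fit_transform(table), columns=table.columns)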
To correct your code (as per my comment below):
def impute_missing_values(table):
    for column in table:
        table.loc[:, column] = np.where(table[column].isna(), table[column].mean(), table[column])
    return table
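A quick usage sketch on the frame from the question:
import numpy as np
import pandas as pd

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})
table = impute_missing_values(table)
print(table)
#      A    B  C
# 0  1.0  3.0  4
# 1  2.0  3.0  5
# 2  1.5  3.0  6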