Data Imputation in Pandas Dataframe column - pandas

I have two tables that I am merging (left join) on a common column, but the other table does not contain an exact match for every key, so some of the merged column values come out blank. I want to fill each missing value using the closest ten. For example, I have these two dataframes:
import pandas as pd

d = {'col1': [1.31, 2.22, 3.33, 4.44, 5.55, 6.66],
     'col2': ['010100', '010101', '101011', '111100', '114100', '166100']}
df1 = pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102', '010144', '114218', '121212', '166110'],
      'col4': ['a', 'b', 'c', 'd', 'e', 'f']}
df2 = pd.DataFrame(data=d2)
# df1
col1 col2
0 1.31 010100
1 2.22 010101
2 3.33 101011
3 4.44 111100
4 5.55 114100
5 6.66 166100
# df2
col2 col4
0 010100 a
1 010102 b
2 010144 c
3 114218 d
4 121212 e
5 166110 f
After left merging on col2,
I get:
df1.merge(df2,how='left',on='col2')
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 NaN
2 3.33 101011 NaN
3 4.44 111100 NaN
4 5.55 114100 NaN
5 6.66 166100 NaN
Versus what I want: for every row where col4 is NaN, first round my col2 value to the closest ten and look it up again in df2's col2; if there is a match, take that row's col4. If not, try the closest hundred, then the closest thousand, then ten thousand, and so on.
Ideally my answer should be:
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 a
2 3.33 101011 f
3 4.44 111100 d
4 5.55 114100 d
5 6.66 166100 f
Please help me code this.
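No rounding-based matcher is built into pandas, so below is a minimal sketch of the successive-rounding fallback described above. The fuzzy_lookup helper and its first-match tie-breaking are assumptions, not an established recipe:

import pandas as pd

# df2's col2 -> col4 mapping, used for exact and rounded lookups
lookup = df2.set_index('col2')['col4']
key_ints = lookup.index.astype(int)

def fuzzy_lookup(code):
    """Exact match first; otherwise round both sides to the nearest
    10, 100, 1000, ... and return the first df2 key in the same bucket."""
    if code in lookup.index:
        return lookup[code]
    val = int(code)
    for power in range(1, len(code)):
        base = 10 ** power
        bucket = round(val / base)
        hits = [v for k, v in zip(key_ints, lookup) if round(k / base) == bucket]
        if hits:
            return hits[0]  # first match wins; adjust if ties matter
    return None

merged = df1.merge(df2, how='left', on='col2')
merged['col4'] = merged['col4'].fillna(merged['col2'].map(fuzzy_lookup))

On the sample data this yields a, a, d, d, d, f for the six rows: 101011 first shares a bucket with 114218 (at the hundred-thousands level), so it comes back as d rather than the f in the ideal output above, which suggests that row may need a different matching rule.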

How to align several different dataframes with different shapes on a common column?

I have a few different data frames like the ones below.
df1
idx col1 col2 col3
2020-11-20 01:00:00 1 5 9
2020-11-20 02:00:00 2 6 10
2020-11-20 03:00:00 3 7 11
2020-11-20 04:00:00 4 8 12
df2
idx col4 col5 col6
2020-11-20 02:00:00 13 15 17
2020-11-20 03:00:00 14 16 18
df3
idx col7 col8 col9
2020-11-20 01:00:00 19 20 21
and essentially I need to keep all the columns from all the DFs but align the values on the timestamp index that each dataframe shares. My expected output is this:
df_merged
idx col1 col2 col3 col4 col5 col6 col7 col8 col9
2020-11-20 01:00:00 1 5 9 NaN NaN NaN 19 20 21
2020-11-20 02:00:00 2 6 10 13 15 17 NaN NaN NaN
2020-11-20 03:00:00 3 7 11 14 16 18 NaN NaN NaN
2020-11-20 04:00:00 4 8 12 NaN NaN NaN NaN NaN NaN
I have tried various things like merge, concat, join, and doing it manually, for hours now, and I am stumped as to why it won't work. These DFs are simplified versions; my real df1 has a length of 1619, df2 has a length of 1619, df3 has a length of 1617, and df4 (not shown here, but it follows the same idea) has a length of 1613. When I try this:
df_merged = reduce(lambda left,right: pd.merge(left,right,how='left'), [df1,df2,df3,df4])
df_merged ends up with 12k rows (not 1619 like the original df1). I tried dropping duplicates on the final df_merged as well, and that left me with only about 600 rows. I have also tried manually combining them with loc, iloc and isin(), but still no luck.
Really any help would be greatly appreciated!
Use merge with how = 'outer'.
Demonstration:
import pandas as pd

# data preparation
string = """idx col1 col2 col3
2020-11-20 01:00:00 1 5 9
2020-11-20 02:00:00 2 6 10
2020-11-20 03:00:00 3 7 11
2020-11-20 04:00:00 4 8 12"""
data = [x.rsplit(' ', 3) for x in string.split('\n')]  # rsplit keeps the timestamp in one idx field
df = pd.DataFrame(data[1:], columns = data[0])
string = """idx col4 col5 col6
2020-11-20 02:00:00 13 15 17
2020-11-20 03:00:00 14 16 18"""
data = [x.rsplit(' ', 3) for x in string.split('\n')]
df2 = pd.DataFrame(data[1:], columns = data[0])
string = """idx col7 col8 col9
2020-11-20 01:00:00 19 20 21"""
data = [x.rsplit(' ', 3) for x in string.split('\n')]
df3 = pd.DataFrame(data[1:], columns = data[0])
# solution
df.merge(df2, on='idx', how='outer').merge(df3, on='idx', how='outer')
Output:
idx col1 col2 col3 col4 col5 col6 col7 col8 col9
0 2020-11-20 01:00:00 1 5 9 NaN NaN NaN 19 20 21
1 2020-11-20 02:00:00 2 6 10 13 15 17 NaN NaN NaN
2 2020-11-20 03:00:00 3 7 11 14 16 18 NaN NaN NaN
3 2020-11-20 04:00:00 4 8 12 NaN NaN NaN NaN NaN NaN
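For the four-frame case in the question, the asker's reduce approach then works once the merges key on idx and use how='outer'; a sketch (df4 stands in for the fourth frame, which is not shown here):

from functools import reduce

frames = [df1, df2, df3, df4]  # df4 = the asker's fourth frame (not shown)
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on='idx', how='outer'),
    frames,
)

Note that if any frame contains duplicate idx values, even an outer merge multiplies rows, which would explain a 12k-row result; de-duplicate the timestamps first in that case.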

One column as denominator and many as numerators - pandas

I have a data frame with many columns. I want col1 as the denominator and all the other columns as numerators. I have done this for just col2 (see the code below); I want to do it for all the other columns concisely.
df
Town col1 col2 col3 col4
A 8 7 5 2
B 8 4 2 3
C 8 5 8 5
here is my code for col2:
df['col2'] = df['col2'] / df['col1']
here is my result:
df
Town col1 col2 col3 col4
A 8 0.875000 5 2
B 8 0.500000 2 3
C 8 0.625000 8 5
I want to do the same with all the other cols (i.e. col3, col4, ...).
If this could be done with pivot_table, that would be awesome.
Thanks for your help!
Use df.iloc with df.div:
In [2084]: df.iloc[:, 2:] = df.iloc[:, 2:].div(df.col1, axis=0)
In [2085]: df
Out[2085]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
Or use df.filter and pd.concat with df.div:
In [2073]: x = df.filter(like='col').set_index('col1')
In [2078]: out = pd.concat([df.Town, x.div(x.index, axis=0).reset_index()], axis=1)
In [2079]: out
Out[2079]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
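If the numerator columns are easier to name than to position, the same row-wise division can be written with an explicit column list; a sketch assuming everything except Town and col1 is a numerator:

# divide every numerator column by col1, row by row
num_cols = df.columns.difference(['Town', 'col1'])
df[num_cols] = df[num_cols].div(df['col1'], axis=0)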

How to use SQL minus query equivalent between two dataframes properly

I have two dataframes, each having 1000 rows. The dataframes contain the same rows, but not in the same order. The following examples can be taken as truncated versions of the dataframes.
df1:
col1 col2 col3
1 2 3
2 3 4
5 6 6
8 9 9
df2:
col1 col2 col3
5 6 6
8 9 9
1 2 3
2 3 4
The dataframes don't have meaningful indices, and I expect an empty result when I apply an SQL-style minus to them. I used the following, but did not get the result I expected. Is there any way to achieve this?
df3 = df1.merge(df2.drop_duplicates(),how='right', indicator=True)
print(df3)
For instance, if I treated df1 as table1 and df2 as table2 and ran the following query in SQL Server, I would get an empty table back.
SELECT * FROM table1
EXCEPT
SELECT * FROM table2
Yes, you can use the indicator like this:
df1.merge(df2, how='left', indicator='ind').query('ind=="left_only"')
Where df1 is:
col1 col2 col3
0 1.0 2.0 3.0
1 2.0 3.0 4.0
2 5.0 6.0 6.0
3 8.0 9.0 9.0
4 10.0 10.0 10.0
and df2 is:
col1 col2 col3
0 5 6 6
1 8 9 9
2 1 2 3
3 2 3 4
Output:
col1 col2 col3 ind
4 10.0 10.0 10.0 left_only
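To finish the EXCEPT emulation, drop the indicator column after filtering; a sketch combining the answer's merge with the drop_duplicates from the question (the name minus is arbitrary):

# rows of df1 with no exact counterpart in df2 (SQL EXCEPT)
minus = (df1.merge(df2.drop_duplicates(), how='left', indicator='ind')
            .query('ind == "left_only"')
            .drop(columns='ind'))
print(minus)  # empty when df1 and df2 hold the same rows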

Appending a list to a dataframe

I have a dataframe let's say:
col1 col2 col3
1 x 3
1 y 4
and I have a list:
2
3
4
5
Can I append the list to the data frame like this:
col1 col2 col3
1 x 3
1 y 4
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
Thank you.
Use concat or append with the DataFrame constructor (DataFrame.append was removed in pandas 2.0, so prefer concat on current versions):
df = df.append(pd.DataFrame([2,3,4,5], columns=['col1']))  # pandas < 2.0 only
df = pd.concat([df, pd.DataFrame([2,3,4,5], columns=['col1'])])
print (df)
col1 col2 col3
0 1 x 3.0
1 1 y 4.0
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
3 5 NaN NaN
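If the duplicated 0/1 index labels in the result are unwanted, ignore_index=True gives a fresh 0..5 RangeIndex:

# same concat, but renumber the rows 0..5
df = pd.concat([df, pd.DataFrame([2, 3, 4, 5], columns=['col1'])],
               ignore_index=True)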

Updating pandas dataframe values assigns NaN

I have a dataframe with 3 columns: Col1, Col2 and Col3.
Toy example
d = {'Col1':['hello','k','hello','we','r'],
'Col2':[10,20,30,40,50],
'Col3':[1,2,3,4,5]}
df = pd.DataFrame(d)
Which gets:
Col1 Col2 Col3
0 hello 10 1
1 k 20 2
2 hello 30 3
3 we 40 4
4 r 50 5
I am selecting the values of Col2 such that the value in Col1 is 'hello'
my_values = df.loc[df['Col1']=='hello']['Col2']
This returns a Series where I can see the values of Col2 as well as their index:
0 10
2 30
Name: Col2, dtype: int64
Now suppose I want to assign these values to Col3.
I only want to replace those values (index 0 and 2), keeping the other values in Col3 unmodified.
I tried:
df['Col3'] = my_values
But this assigns Nan to the other values (the ones where Col1 is not hello)
Col1 Col2 Col3
0 hello 10 10
1 k 20 NaN
2 hello 30 30
3 we 40 NaN
4 r 50 NaN
How can I update certain values in Col3 leaving the others untouched?
Col1 Col2 Col3
0 hello 10 10
1 k 20 2
2 hello 30 30
3 we 40 4
4 r 50 5
So, in short: having my_values, I want to put them into Col3 at the matching indices.
You can base it on np.where:
df['Col3'] = np.where(df['Col1'] == 'hello', df.Col2, df.Col3)
If you want to use your my_values, assign through its index:
df.loc[my_values.index, 'Col3'] = my_values
Or you can just use update:
df['Col3'].update(my_values)
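A plain boolean-mask assignment also works and skips the intermediate Series entirely; a minimal sketch:

# copy Col2 into Col3 only where Col1 equals 'hello'
mask = df['Col1'] == 'hello'
df.loc[mask, 'Col3'] = df.loc[mask, 'Col2']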