Is there a way to merge pandas dataframes on row and column index?

I want to merge two pandas DataFrames that share the same index as well as some columns. pd.merge creates duplicate columns, but I would like to merge on both axes at the same time.
I tried pd.merge and pd.concat but did not get the right result.
My try: df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
df1
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7
ID
323 7 6 8 7.0 2.0 2.0 10.0
324 2 1 5 3.0 4.0 2.0 1.0
675 9 8 1 NaN NaN NaN NaN
676 3 7 2 NaN NaN NaN NaN
df2
Var#6 Var#7 Var#8 Var#9
ID
675 1 9 2 8
676 3 2 0 7
ideally I would get:
df3
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1 9 2 8
676 3 7 2 NaN NaN 3 2 0 7

IIUC, use df.combine_first():
df3 = df1.combine_first(df2)
print(df3)
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1.0 9.0 2.0 8.0
676 3 7 2 NaN NaN 3.0 2.0 0.0 7.0

You can concat and group the data:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).first()
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7.0 6.0 8.0 7.0 2.0 2.0 10.0 NaN NaN
324 2.0 1.0 5.0 3.0 4.0 2.0 1.0 NaN NaN
675 9.0 8.0 1.0 NaN NaN 1.0 9.0 2.0 8.0
676 3.0 7.0 2.0 NaN NaN 3.0 2.0 0.0 7.0
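For reference, the combine_first approach can be run end to end like this (the frame construction below is a sketch rebuilt from the data shown above):

```python
import numpy as np
import pandas as pd

# Rebuild the example frames from the question
df1 = pd.DataFrame(
    {"Var#1": [7, 2, 9, 3], "Var#2": [6, 1, 8, 7], "Var#3": [8, 5, 1, 2],
     "Var#4": [7.0, 3.0, np.nan, np.nan], "Var#5": [2.0, 4.0, np.nan, np.nan],
     "Var#6": [2.0, 2.0, np.nan, np.nan], "Var#7": [10.0, 1.0, np.nan, np.nan]},
    index=pd.Index([323, 324, 675, 676], name="ID"),
)
df2 = pd.DataFrame(
    {"Var#6": [1, 3], "Var#7": [9, 2], "Var#8": [2, 0], "Var#9": [8, 7]},
    index=pd.Index([675, 676], name="ID"),
)

# combine_first aligns on both axes: df1 wins where it has data,
# df2 fills the holes and contributes the extra columns
df3 = df1.combine_first(df2)
print(df3)
```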

How to keep True and None Value using pandas?

I've one DataFrame
import pandas as pd
data = {'a': [1,2,3,None,4,None,2,4,5,None],'b':[6,6,6,'NaN',4,'NaN',11,11,11,'NaN']}
df = pd.DataFrame(data)
condition = (df['a']>2) | (df['a'] == None)
print(df[condition])
a b
0 1.0 6
1 2.0 6
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
6 2.0 11
7 4.0 11
8 5.0 11
9 NaN NaN
Here, I have to keep the rows where the condition comes out True, and where the value is None I want to keep those rows as well.
Expected output is :
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
Thanks in Advance
You can use another | (or) condition. (Note: see @ALollz's comment; you shouldn't compare a series with np.nan.)
condition = (df['a']>2) | (df['a'].isna())
df[condition]
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
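As a side note on why the original `== None` attempt fails: an elementwise comparison with None never matches missing values, because NaN compares unequal to everything. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN compares unequal to everything, including None, so this finds nothing:
print((s == None).any())  # noqa: E711 -- shown only to illustrate the pitfall

# isna() is the supported way to test for missing values:
print(s.isna().tolist())
```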

How to perform a rolling window on a pandas DataFrame, where rows contain NaN values that should not be replaced?

I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan,1],
[0, 1, 2 ,np.nan, np.nan, np.nan,np.nan,1],
[0, 2, 2 ,np.nan, 2, np.nan,1,1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling summation is applied:
df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
I, however, have a dataframe in which all columns have the same object dtype:
df = df.astype('object')
which has output:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output however, stops and starts again after a nan value appears. This would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figured there must be a way that NaN values are not considered but also not filled in with values obtained from the rolling window.
Anything would help!
A workaround:
Find where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Only show the values that were not originally NaN:
df[~nan]
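For reference, a self-contained version of this workaround. Newer pandas deprecates rolling along `axis='columns'`, so this sketch rolls over the transpose instead, which is equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])

nan_mask = df.isna()                        # remember where the NaNs were

# Rolling over the transpose sums across columns; with min_periods=1,
# NaNs inside the window are simply skipped
rolled = df.T.rolling(window=7, min_periods=1).sum().T

result = rolled[~nan_mask]                  # restore the original NaNs
print(result)
```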

Add header to .data file in Pandas

Given a file with the extension .data, I have read it with pd.read_fwf("./input.data", sep=",", header=None):
Out:
0
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3...
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5...
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6...
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5...
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4...
... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2...
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2...
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4...
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2...
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0...
How can I add the following column names to it? Thanks.
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
Update:
pd.read_fwf("./input.data", names = col_names)
Out:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If you check read_fwf:
Read a table of fixed-width formatted lines into DataFrame.
So if there is a separator (a comma here), use read_csv instead:
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("input.data", names=col_names)
print (df)
age sex cp restbp chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
.. ... ... ... ... ... ... ... ... ... ...
292 57.0 0.0 4.0 140.0 241.0 0.0 0.0 123.0 1.0 0.2
293 45.0 1.0 1.0 110.0 264.0 0.0 0.0 132.0 0.0 1.2
294 68.0 1.0 4.0 144.0 193.0 1.0 0.0 141.0 0.0 3.4
295 57.0 1.0 4.0 130.0 131.0 0.0 0.0 115.0 1.0 1.2
296 57.0 0.0 2.0 130.0 236.0 0.0 2.0 174.0 0.0 0.0
slope ca thal num
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 1
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
.. ... ... ... ...
292 2.0 0.0 7.0 1
293 2.0 0.0 7.0 1
294 2.0 2.0 7.0 1
295 2.0 1.0 7.0 1
296 2.0 1.0 3.0 1
[297 rows x 14 columns]
Just do a read_csv without a header and pass col_names:
df = pd.read_csv('input.data', header=None, names=col_names)
Output (head):
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
-- ----- ----- ---- -------- ------ ----- --------- --------- ------- --------- ------- ---- ------ -----
0 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
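Either way, the key is passing names= with header=None. A self-contained sketch with an in-memory sample (the data here is made up; only the mechanism matters):

```python
import io
import pandas as pd

# Stand-in for the .data file: two comma-separated rows, no header
sample = "63.0,1.0,1.0\n67.0,1.0,4.0\n"

# header=None tells pandas the first row is data, names= supplies the header
df = pd.read_csv(io.StringIO(sample), header=None,
                 names=["age", "sex", "cp"])
print(df)
```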

How to select NaN values in pandas in specific range

I have a dataframe like this:
df = pd.DataFrame({'col1': [5,6,np.nan, np.nan,np.nan, 4, np.nan, np.nan,np.nan, np.nan,7,8,8, np.nan, 5 , np.nan]})
df:
col1
0 5.0
1 6.0
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
10 7.0
11 8.0
12 8.0
13 NaN
14 5.0
15 NaN
These NaN values should be replaced in the following way. The first selection should look like this.
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
And then these Nan values should be replace with the only value in that selection, 4.
The second selection is:
13 NaN
14 5.0
15 NaN
and these NaN values should be replaced with 5.
With isnull() you can select the NaN values in a dataframe, but how can I filter/select these specific ranges in pandas?
Solution if the missing values surround a single non-missing value: create unique groups, then replace within each group by forward and back filling:
#test missing values
s = df['col1'].isna()
#create unique groups
v = s.ne(s.shift()).cumsum()
#count group sizes; keep size-1 groups together with the missing-value groups
mask = v.map(v.value_counts()).eq(1) | s
#groups for replacement per groups
g = mask.ne(mask.shift()).cumsum()
df['col2'] = df.groupby(g)['col1'].apply(lambda x: x.ffill().bfill())
print (df)
col1 col2
0 5.0 5.0
1 6.0 6.0
2 NaN 4.0
3 NaN 4.0
4 NaN 4.0
5 4.0 4.0
6 NaN 4.0
7 NaN 4.0
8 NaN 4.0
9 NaN 4.0
10 7.0 7.0
11 8.0 8.0
12 8.0 8.0
13 NaN 5.0
14 5.0 5.0
15 NaN 5.0
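A runnable version of the steps above; transform is swapped in for apply here, since transform guarantees the result aligns with the original index in current pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, 6, np.nan, np.nan, np.nan, 4, np.nan, np.nan,
                            np.nan, np.nan, 7, 8, 8, np.nan, 5, np.nan]})

s = df['col1'].isna()                     # where the missing values are
v = s.ne(s.shift()).cumsum()              # runs of missing / non-missing
mask = v.map(v.value_counts()).eq(1) | s  # lone values plus surrounding NaN runs
g = mask.ne(mask.shift()).cumsum()        # final replacement groups

# within each group, spread the single value in both directions
df['col2'] = df.groupby(g)['col1'].transform(lambda x: x.ffill().bfill())
print(df)
```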

Compute a sequential rolling mean in pandas as an array function?

I am trying to calculate a rolling mean on a dataframe with NaNs in pandas, but pandas seems to reset the window when it meets a NaN. Here's some code as an example...
import numpy as np
from pandas import *
foo = DataFrame(np.arange(0.0,13.0))
foo['1'] = np.arange(13.0,26.0)
foo.ix[4:6,0] = np.nan
foo.ix[4:7,1] = np.nan
bar = rolling_mean(foo, 4)
This gives a rolling mean that resets the window after each NaN, rather than just skipping the NaNs:
bar =
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 8.5 NaN
11 9.5 22.5
12 10.5 23.5
I have found an ugly iterate/dropna() workaround that gives the right answer:
def sparse_rolling_mean(df_data, window):
    f_data = DataFrame(np.nan, index=df_data.index, columns=df_data.columns)
    for i in f_data.columns:
        f_data.ix[:, i] = rolling_mean(df_data.ix[:, i].dropna(), window)
    return f_data
bar = sparse_rolling_mean(foo,4)
bar
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
Does anyone know if it is possible to do this as an array function?
Many thanks in advance.
you may do:
>>> def sparse_rolling_mean(ts, window):
... return rolling_mean(ts.dropna(), window).reindex_like(ts)
...
>>> foo.apply(sparse_rolling_mean, args=(4,))
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
[13 rows x 2 columns]
You can control what gets NaN'd out with the min_periods arg:
In [12]: rolling_mean(foo, 4,min_periods=1)
Out[12]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 2.0 15.0
5 2.5 15.5
6 3.0 16.0
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
You can do this if you want results everywhere except where the original was NaN:
In [27]: rolling_mean(foo, 4,min_periods=1)[foo.notnull()]
Out[27]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
Your expected output is a bit odd, as the first 3 rows should have values.
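For readers on current pandas: rolling_mean and .ix were removed long ago, but the same dropna-then-reindex idea translates directly to the .rolling() method. A sketch (the .iloc slices reproduce the rows the label-inclusive .ix slices above touched):

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(np.arange(0.0, 13.0))
foo['1'] = np.arange(13.0, 26.0)
foo.iloc[4:7, 0] = np.nan   # rows 4-6, as foo.ix[4:6, 0] did
foo.iloc[4:8, 1] = np.nan   # rows 4-7, as foo.ix[4:7, 1] did

def sparse_rolling_mean(ts, window):
    # drop the NaNs, compute the rolling mean on the surviving values,
    # then re-align the result to the original index
    return ts.dropna().rolling(window).mean().reindex_like(ts)

bar = foo.apply(sparse_rolling_mean, args=(4,))
print(bar)
```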