How to keep all values from a dataframe except where NaN is present in another dataframe? - pandas

I am new to Pandas and I am stuck on a specific problem where I have two DataFrames, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2, i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
The DataFrames have 10,000 rows each, so I can't do this manually. The indices and columns are exactly the same in both, and df1 has no NaN values.
As far as I understand, df.update() will either overwrite all values, including NaN, or update only those that are NaN.

You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # only needed if your dtypes are not already floats
m = df2.notna()  # True where df2 holds a value, False where it has NaN
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
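For reference, here is a self-contained sketch of this approach, with the frame contents taken from the question; df1.where(m) is an equivalent spelling of df1[m]:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

m = df2.notna()       # boolean mask: True where df2 holds a value
df3 = df1[m]          # keeps df1's values under True, inserts NaN under False
# df3 = df1.where(m)  # equivalent
print(df3)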

Related

How to select the rows having same id and have all missing value in another column

I have the following dataframe:
ID col_1
1 NaN
2 NaN
3 4.0
2 NaN
2 NaN
3 NaN
3 3.0
1 NaN
I need the following output:
ID col_1
1 NaN
1 NaN
2 NaN
2 NaN
2 NaN
How can I do this in pandas?
You can create a boolean mask with isna, group this mask by ID, and transform it with all; then you can filter the rows with the help of this mask:
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
df[mask].sort_values('ID')
Alternatively, you can use groupby + filter to keep only the groups where all values in col_1 are NaN, but this method should be slower than the one above:
df.groupby('ID').filter(lambda g: g['col_1'].isna().all()).sort_values('ID')
ID col_1
0 1 NaN
7 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
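For reference, a self-contained sketch of the mask approach, with the data taken from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 2, 3, 2, 2, 3, 3, 1],
    'col_1': [np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 3.0, np.nan],
})

# True for every row whose entire ID group is NaN in col_1
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
print(df[mask].sort_values('ID'))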
Let us try isin after a groupby with all:
s = df['col_1'].isna().groupby(df['ID']).all()    # one True/False per ID
df = df.loc[df.ID.isin(s[s].index.tolist())]      # keep IDs whose groups are all NaN
df
Out[73]:
ID col_1
0 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
7 1 NaN
import pandas as pd

df = pd.read_excel(r"D:\Stack_overflow\test12.xlsx")
df1 = df[df['col_1'].isnull()].sort_values(by=['ID'])
I think we can simply take out the null values.

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame with NaN.
Given:
import pandas as pd
import numpy as np
d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where columns c and d are affected, but column b is NOT, as it only has one zero (not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines, but that solution keeps the first 0 in a given column, which is not desired in my case.
Let us use shift with mask; a cell is replaced when it is 0 and equals the value in the row directly above or below it:
df = df.mask((df.shift().eq(df) | df.eq(df.shift(-1))) & (df == 0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
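The same logic reads more easily when the one-liner is split into named masks; this is just a restatement of the answer above, not a different method:
import numpy as np
import pandas as pd

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])

is_zero = df.eq(0)
same_as_above = df.eq(df.shift())    # equals the value in the previous row
same_as_below = df.eq(df.shift(-1))  # equals the value in the next row

# a 0 is consecutive if it repeats the row above or the row below
print(df.mask(is_zero & (same_as_above | same_as_below)))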

For every row in pandas, do until sample ID change

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2", "sample3"]
out = pd.DataFrame()
for sample in samples:
    if my_df["ID"] == sample:
        my_list = []
        for index, row in my_df.iterrows():
            other_list = [row.loc_start]
            my_list.append(other_list)
        my_list = pd.DataFrame(my_list)
        out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize, of course, that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
(df.pivot(columns='ID', values='loc_start')
   .rename_axis(None, axis=1)
   .apply(lambda x: pd.Series(x.dropna().values)))
output
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from Jezreal's answer:
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data, you would use groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()
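For example, on the my_df from the question, the groupby version collapses each sample to a single aggregated value (a minimal sketch, using mean as the operation):
import pandas as pd

my_df = pd.DataFrame({
    'ID': ['sample1', 'sample1', 'sample2', 'sample2', 'sample3'],
    'loc_start': [10, 15, 10, 20, 5],
})
print(my_df.groupby('ID')['loc_start'].mean())
# sample1    12.5
# sample2    15.0
# sample3     5.0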

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe, df2, that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 through Experiment 12 in df2, and the first column is filled with the right values, but all following columns are filled with NaN:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values.
Assigning a Series aligns on the index, and df1.iloc[3*x:3*x+3] keeps its original labels (3*x to 3*x+2), which only match df2's index (0 to 2) on the first iteration; .values drops the index so the numbers are assigned positionally. So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output.
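A loop-free alternative, sketched here under the assumption that df1 holds exactly 36 SQ values (12 experiments of 3 cycles each); the stand-in data below just makes the snippet runnable:
import numpy as np
import pandas as pd

# stand-in for the question's df1: a single 'SQ' column with 36 values
df1 = pd.DataFrame({'SQ': np.arange(36, dtype=float)})

# reshape to (12 experiments x 3 cycles), transpose so each experiment is a column
df2 = pd.DataFrame(df1['SQ'].to_numpy().reshape(12, 3).T,
                   columns=['Experiment %d' % (x + 1) for x in range(12)])
print(df2)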

Pandas DataFrame + object type + HDF + PyTables 'table'

(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None and a valid trial can produce [] or [1,2] etc etc.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure as a regular table as I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable length nature of data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas: the NaN solution leads me to potentially extremely wide tables and requires an additional column marking valid/invalid trials.
If I used a database I would make the data a binary blob column. With Pandas my current working solution is to save data as an object array in a regular frame and load it all in and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype
Here's a simple example along the lines of what you have described:
In [16]: from pandas import DataFrame; from numpy.random import randn; import numpy as np
In [17]: df = DataFrame(randn(10,10))
In [18]: df.iloc[5:10,7:9] = np.nan
In [19]: df.iloc[7:10,4:9] = np.nan
In [22]: df.iloc[7:10,-1] = np.nan
In [23]: df
Out[23]:
0 1 2 3 4 5 6 7 8 9
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN
In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)
In [25]: df
Out[25]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
Note that in 0.12 you should use table=True rather than fmt='t' (this is in the process of changing):
In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [27]: pd.read_hdf('test.h5','df')
Out[27]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
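For completeness, a hedged sketch (not from the original answer) of going back from the wide, NaN-padded frame to per-trial lists using the stop column; note that a zero-length valid trial and an invalid trial would both produce an all-NaN row and a NaN stop, so disambiguating them still needs a separate flag column, as the question points out:
import numpy as np
import pandas as pd

# wide frame as in the answer: data columns 0..9 plus a 'stop' column
df = pd.DataFrame(np.random.randn(10, 10))
df.iloc[7:10, 4:] = np.nan
df['stop'] = df.apply(lambda x: x.last_valid_index(), axis=1)

def row_to_list(row):
    # valid values run from column 0 through column 'stop' inclusive
    return row.drop('stop').iloc[:int(row['stop']) + 1].tolist()

trials = [row_to_list(df.iloc[i]) for i in range(len(df))]
print([len(t) for t in trials])  # [10, 10, 10, 10, 10, 10, 10, 4, 4, 4]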