Find out null values between two columns in a DataFrame - pandas

I have to check whether there are any null values between the two columns I have in a DataFrame. I have fetched the locations of the first non-null value and the last non-null value in the DataFrame using:
x.first_valid_index()
x.last_valid_index()
Now I need to find out whether there are any null values between these two locations.

I think this is the same as converting all NaN values to a boolean mask and summing the True values:
import numpy as np
import pandas as pd

x = pd.DataFrame({'a':[1,2,np.nan, np.nan],
                  'b':[np.nan, 7,np.nan,np.nan],
                  'c':[4,5,6,np.nan]})
print (x)
     a    b    c
0  1.0  NaN  4.0
1  2.0  7.0  5.0
2  NaN  NaN  6.0
3  NaN  NaN  NaN
cols = ['a','b']
f = x[cols].first_valid_index()
l = x[cols].last_valid_index()
print (f)
0
print (l)
1
print (x.loc[f:l, cols].isnull().sum().sum())
1
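If you only need a yes/no answer rather than a count, a minimal variation (reusing the same x, cols, f and l) is to swap the double sum for a double any:
print (x.loc[f:l, cols].isnull().any().any())
True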

How to split a dictionary column in a dataframe and make a new column for each key

I have a dataframe which has a column containing multiple values in each row, separated by ",".
id data
0 {'1':A, '2':B, '3':C}
1 {'1':A}
2 {'0':0}
How can I split up the key-value pairs in the 'data' column and make a new column for each key present in it, without removing the original 'data' column?
Desired output:
id data 1 2 3 0
0 {'1':A, '2':B, '3':C} A B C NaN
1 {'1':A} A NaN NaN NaN
2 {'0':0} NaN NaN NaN 0
Thank you in advance :).
You'll need a regular expression to convert the data into a form that can be parsed as a Python literal. Then pd.json_normalize will do the job nicely:
import ast

df['data'] = df['data'].str.replace(r'(["\'])\s*:(.+?)\s*(,?\s*["\'}])', '\\1:\'\\2\'\\3', regex=True)
df['data'] = df['data'].apply(ast.literal_eval)
df = pd.concat([df, pd.json_normalize(df['data'])], axis=1)
Output:
>>> df
data 1 2 3 0
0 {'1': 'A', '2': 'B', '3': 'C'} A B C NaN
1 {'1': 'A'} A NaN NaN NaN
2 {'0': '0'} NaN NaN NaN 0
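If the 'data' column already held real Python dicts rather than strings (an assumption, not what the question shows), the regex and ast.literal_eval steps could be skipped and pd.json_normalize applied directly:
# assumes df['data'] already contains dict objects, not strings
df = pd.concat([df, pd.json_normalize(df['data'])], axis=1)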

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a DataFrame with NaN.
Given:
import pandas as pd
import numpy as np
d = [[1,np.NaN,3,4],[2,0,0,np.NaN],[3,np.NaN,0,0],[4,np.NaN,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where columns c & d are affected but column b is NOT affected, as it only has one zero (not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.
Let us do shift with mask:
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
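To make the condition easier to read, the mask can be built in steps; a small sketch using the same df:
# a cell that equals its neighbour above or its neighbour below...
is_repeated = df.shift().eq(df) | df.eq(df.shift(-1))
# ...and is itself 0, is part of a consecutive run of zeros
consecutive_zero = is_repeated & df.eq(0)
df.mask(consecutive_zero)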

Subtract each value in a column from the entire column

I have the following df1 :
prueba
12-03-2018 7
08-03-2018 1
06-03-2018 9
05-03-2018 5
I would like to take each value in the column, beginning with the last (5), and subtract it from the entire column, then iterate upwards and subtract the remaining values in the column. For each subtraction I would like to generate a column, and build a df with the results of all the subtractions:
The desired output would be something like this:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2 -2 6 0
08-03-2018 -4 -8 0 NaN
06-03-2018 4 0 NaN NaN
05-03-2018 0 NaN NaN NaN
To obtain the desired output, I first took df1 and
df2=df1.sort_index(ascending=True)
created an empty df:
main_df=pd.DataFrame()
and then iterated over the values in df2, subtracting each from the df1 column:
for index, row in df2.iterrows():
    datos = df1 - row['prueba']
    df = pd.DataFrame(data=datos, index=index)
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)
print(main_df)
However, the following error is raised:
TypeError: Index(...) must be called with a collection of some kind, '05-03-2018' was passed
The error itself comes from passing a single label as the index (pd.DataFrame expects a collection there, e.g. index=[index]), but the loop can be avoided entirely. You can use np.triu with array subtraction:
s = df.prueba.values.astype(float)
s = np.triu((s - s[:, None]).T)
s[np.tril_indices(s.shape[0], -1)] = np.nan
pd.DataFrame(s, columns=df.index, index=df.index).reindex(columns=df.index[::-1])
Out[482]:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2.0 -2.0 6.0 0.0
08-03-2018 -4.0 -8.0 0.0 NaN
06-03-2018 4.0 0.0 NaN NaN
05-03-2018 0.0 NaN NaN NaN
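The core trick is the broadcasted subtraction s - s[:, None], which builds every pairwise difference in one step; a small sketch of the first row it produces for this data:
import numpy as np

s = np.array([7., 1., 9., 5.])   # the 'prueba' values, top to bottom
diff = (s - s[:, None]).T        # diff[i, j] = s[i] - s[j]
print (diff[0])                  # [ 0.  6. -2.  2.]  i.e. 7 minus each value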
Kind of messy, but it does the job:
temp = 0
count = 0
df_new = pd.DataFrame()
for i, v, date in zip(df.index, df["prueba"][::-1], df.index[::-1]):
    print(i, v)
    new_val = df["prueba"] - v
    if count > 0:
        new_val[-count:] = np.nan
    df_new[date] = new_val
    temp += v
    count += 1
df_new

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))
df
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
IIUC:
pandas
dropna with the thresh parameter (thresh=2 keeps only columns with at least two non-NaN values, which drops C here)
df.dropna(axis=1, thresh=2)
A B
0 1 2.0
1 3 NaN
2 5 6.0
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
A B
0 1 2.0
1 3 NaN
2 5 6.0
I used sample DF from #piRSquared's answer.
If you want to "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
In [26]: df.loc[:, df.isnull().any()]
Out[26]:
B C
0 2.0 NaN
1 NaN 4.0
2 6.0 NaN
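As a side note, inverting the same kind of mask keeps only the fully populated columns, should you ever need the complement (same df as above):
df.loc[:, df.notnull().all()]
   A
0  1
1  3
2  5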

How to work with 'NA' in pandas?

I am merging two DataFrames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records that have the value 'NA'?
For me it works nicely:
df1 = pd.DataFrame({'A':[np.nan, 2, 1],
                    'B':[5, 7, 8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A':[np.nan, 2, 3],
                    'C':[4, 5, 6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem - your NA values are converted to NaN.
You can use pandas.read_excel; it is possible to define which values are converted to NaN with the parameters keep_default_na and na_values:
df = pd.read_excel('test.xlsx', keep_default_na=False, na_values=['NaN'])
print (df)
a b
0 NaN NA
1 20.0 40
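A self-contained way to see the same behaviour is pd.read_csv (which takes the same keep_default_na and na_values parameters) on an in-memory string; the column names and values here are only illustrative:
from io import StringIO
import pandas as pd

data = StringIO('a,b\nNaN,NA\n20,40')
df = pd.read_csv(data, keep_default_na=False, na_values=['NaN'])
print (df)
      a   b
0   NaN  NA
1  20.0  40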