Drop 'NaN' value in dataframe - pandas

Is there any way to drop only 'nan' from a dataset not to remove the entire row or column which contains 'nan'? I have tried below code but the result was not the one that i wanted.
df = pd.read_csv('...csv')
df.stack()
Here is the part of csv
And here is after '.stack()'
The headers are mixed up with the actual data. I don't want to be mixed up!

You can use:
df.fillna('')
Which will fill na with an empty string ''. Or you can fill it whatever you like.

using dropna with condition.
nan vlaue not equal itself.
and
you can drop column or row by using,
column: del df.column_name
row: df.drop([row_index])

Consider the dataframe df
df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.iloc[1, 1] = np.nan
print(df)
0 1 2
0 0 1.0 2
1 3 NaN 5
2 6 7.0 8
You can drop just the middle cell but only if we stack
df.stack()
0 0 0.0
1 1.0
2 2.0
1 0 3.0
2 5.0
2 0 6.0
1 7.0
2 8.0
dtype: float64

Related

Pandas transformation, duplicate index values to column values

I have the following pandas dataframe:
0
0
A 0
B 0
C 0
C 4
A 1
A 7
Now there are some index letter (A and C) that appear multiple times. I want the values of these index letters on a extra column beside instead of a extra row. The desired pandas dataframe looks like:
0 1 3
0
A 0 1 7
B 0 np.nan np.nan
C 0 4 np.nan
Anything would help!
IIUC, you need to add a helper column:
(df.assign(group=df.groupby(level=0).cumcount())
.set_index('group', append=True)[0] # 0 is the name of the column here
.unstack('group')
)
or:
(df.reset_index()
.assign(group=lambda d: d.groupby('index').cumcount())
.pivot('index', 'group', 0) # col name here again
)
output:
group 0 1 2
A 0.0 1.0 7.0
B 0.0 NaN NaN
C 0.0 4.0 NaN

Filling data in Pandas Series with help of a function

I want to fill values of a column on a certain condition, as in the example in the image:
What's the reason for the TypeError? How can I go about it?
I do not think you are using df.apply() correctly. Remember to post the code as text next time. Here is a working example:
df = pd.DataFrame({'A': [x for x in range (5,11)], 'B':[np.nan, np.nan, 5,11,4,np.nan]})
df['C'] = df.apply(lambda row: '' if pd.isna(row['B']) else row['A'], axis=1)
df
Output:
A B C
0 5 NaN
1 6 NaN
2 7 5.0 7
3 8 11.0 8
4 9 4.0 9
5 10 NaN

subtract each value in column by entire column

I have the following df1 :
prueba
12-03-2018 7
08-03-2018 1
06-03-2018 9
05-03-2018 5
I would like to get each value in the column beggining by the last (5) and substract the entire column by that value. then iterate upwards and subtract the remaining values in the column. for each subtraction I would like to generate a column and generate a df with the results of each subtraction:
The desired output would be something like this:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2 -2 6 0
08-03-2018 -4 -8 0 NaN
06-03-2018 4 0 NaN NaN
05-03-2018 0 NaN NaN NaN
What I tried to obtain the desired output was, first take df1 and
df2=df1.sort_index(ascending=True)
create an empty df:
main_df=pd.DataFrame()
and then iterate over the values in the column df2 and subtract to the df1 column
for index, row in df2.iterrows():
datos=df1-row['pruebas']
df=pd.DataFrame(data=datos,index=index)
if main_df.empty:
main_df= df
else:
main_df=main_df.join(df)
print(main_df)
However the following error outputs:
TypeError: Index(...) must be called with a collection of some kind, '05-03-2018' was passed
You can using np.triu, with array subtract
s=df.prueba.values.astype(float)
s=np.triu((s-s[:,None]).T)
s[np.tril_indices(s.shape[0], -1)]=np.nan
pd.DataFrame(s,columns=df.index,index=df.index).reindex(columns=df.index[::-1])
Out[482]:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2.0 -2.0 6.0 0.0
08-03-2018 -4.0 -8.0 0.0 NaN
06-03-2018 4.0 0.0 NaN NaN
05-03-2018 0.0 NaN NaN NaN
Kind of messy but does the work:
temp = 0
count = 0
df_new = pd.DataFrame()
for i, v, date in zip(df.index, df["prueba"][::-1], df.index[::-1]):
print(i,v)
new_val = df["prueba"] - v
if count > 0:
new_val[-count:] = np.nan
df_new[date] = new_val
temp += v
count += 1
df_new

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
[1, 2, None],
[3, None, 4],
[5, 6, None]
], columns=list('ABC'))
df
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
IIUC:
pandas
dropna with thresh parameter
df.dropna(1, thresh=2)
A B
0 1 2.0
1 3 NaN
2 5 6.0
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
A B
0 1 2.0
1 3 NaN
2 5 6.0
I used sample DF from #piRSquared's answer.
If you want to "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
In [26]: df.loc[:, df.isnull().any()]
Out[26]:
B C
0 2.0 NaN
1 NaN 4.0
2 6.0 NaN

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When joining fields contain 'NA', pandas automatically exclude those records. How can I keep the records having the value 'NA'?
For me it works nice:
df1 = pd.DataFrame({'A':[np.nan,2,1],
'B':[5,7,8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A':[np.nan,2,3],
'C':[4,5,6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem - your NA values are converted to NaN.
You can use pandas.read_excel, there is possible define which values are converted to NaN with parameter keep_default_na and na_values:
df = pd.read_excel('test.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NaN NA
1 20.0 40