Filling data in Pandas Series with help of a function - pandas

I want to fill the values of a column based on a certain condition, as in the example in the image.
What's the reason for the TypeError? How can I go about it?

I do not think you are using df.apply() correctly. Remember to post the code as text next time. Here is a working example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list(range(5, 11)),
                   'B': [np.nan, np.nan, 5, 11, 4, np.nan]})
df['C'] = df.apply(lambda row: '' if pd.isna(row['B']) else row['A'], axis=1)
df
Output:
    A     B  C
0   5   NaN
1   6   NaN
2   7   5.0  7
3   8  11.0  8
4   9   4.0  9
5  10   NaN
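For larger frames, a vectorized alternative avoids the per-row apply; here is a minimal sketch of the same logic using Series.where (my addition, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list(range(5, 11)),
                   'B': [np.nan, np.nan, 5, 11, 4, np.nan]})

# Keep the value from A wherever B is present; fall back to '' where B is NaN.
df['C'] = df['A'].where(df['B'].notna(), '')
print(df)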


Make all values after a label have the same value of that label

I have a data frame:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 2)), columns=['col1', 'col2'])
Which generates the following frame:
   col1  col2
0     6     3
1     7     4
2     6     9
3     2     6
4     7     4
I want to replace all values from row 2 forward with whatever value on row 1. So I type:
df.loc[2:] = df.loc[1:1]
But the resulting frame is filled with nan:
   col1  col2
0   6.0   3.0
1   7.0   4.0
2   NaN   NaN
3   NaN   NaN
4   NaN   NaN
I know I can use fillna(method='ffill') to get what I want, but why did the broadcasting not work, and why is the result NaN? Expected result:
   col1  col2
0     6     3
1     7     4
2     7     4
3     7     4
4     7     4
Edit: pandas version 0.24.2
The NaN comes from index alignment, not from df.loc[1:1] being empty: assignment with a DataFrame on the right-hand side aligns on index labels, and df.loc[1:1] only carries the label 1, so rows 2 through 4 find no matching label and are filled with NaN. To broadcast the single row, strip the index off the right-hand side, as sketched below.
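A minimal sketch of that fix, on the same frame as above: .values turns the row into a plain array, so the assignment broadcasts instead of aligning.

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 2)), columns=['col1', 'col2'])

# df.loc[1] is a Series; .values drops its index, so plain
# NumPy broadcasting fills every row from 2 onward.
df.loc[2:] = df.loc[1].values
print(df)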

How to use pivot to get 2 columns into a column multiindex in pandas

I have a data frame with 4 columns (a, b, c, d are column names):
df =
   a  b  c  d
   1  2  3  4
   5  2  7  8
Is it possible to use df.pivot() to get 2 columns into the column multiindex? The following doesn't work:
df.pivot('a', ['b', 'c'])
I want
b    2
c    3    7
a
1    4   NA
5   NA    8
I know I can use pivot_table to get this done easily (pd.pivot_table(df, index='a', columns=['b', 'c'])) but I'm curious about the flexibility of pivot as the documentation isn't clear.
There are obviously missing bits of implementation and I think you've found one. We have workarounds, but you are correct: the documentation says that the columns parameter can be an object, yet nothing seems to work. I trust @MaxU and @jezrael gave it a good try, and none of us seem to be able to get it to work as the documentation says it should. I call it a bug! I may report it if someone else hasn't already or doesn't before I get to it.
That said, I found something bizarre. I planned on passing a list to the index parameter instead and then transposing. But instead, the strings 'c' and 'b' are used as index values... that isn't at all what I wanted.
What's stranger is this:
df.pivot(['c', 'b'], 'a', 'd')

a    1    5
b  NaN  8.0
c  4.0  NaN
Also, this looks fine:
df.pivot('a', 'b', 'd')

b  2
a
1  4
5  8
But the error here is confusing:
print(df.pivot('a', ['b'], 'd'))

KeyError: 'Level b not found'
The quest continues...
Using pivot_table:
df.pivot_table(values=None, index=None, columns=None, aggfunc='mean',
               fill_value=None, margins=False, dropna=True, margins_name='All')

df.pivot_table('d', 'a', ['b', 'c'])

b      2
c      3    7
a
1    4.0  NaN
5    NaN  8.0
We can also use pd.crosstab:
In [80]: x
Out[80]:
   a  b  c  d
0  1  2  3  4
1  5  2  7  8

In [81]: pd.crosstab(x.a, [x.b, x.c], x.d, aggfunc='mean')
Out[81]:
b      2
c      3    7
a
1    4.0  NaN
5    NaN  8.0
The closest solution without aggregating is set_index + unstack:
df = df.set_index(['b','c','a'])['d'].unstack([0,1])
print (df)

b      2
c      3    7
a
1    4.0  NaN
5    NaN  8.0
A solution with pivot is possible too, but it is a bit convoluted - you need to create the MultiIndex first and transpose at the end:
df = df.set_index(['b','c'])
df = df.pivot(columns='a')['d'].T
print (df)

b      2
c      3    7
a
1    4.0  NaN
5    NaN  8.0
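For what it's worth, this gap was eventually closed: if I recall the release notes correctly, pivot accepts list-likes for index and columns from pandas 1.1 onward, so on a recent version the original attempt works directly (keyword arguments shown, since positional use is deprecated in newer releases):

import pandas as pd

df = pd.DataFrame({'a': [1, 5], 'b': [2, 2], 'c': [3, 7], 'd': [4, 8]})

# On pandas 1.1+ this produces the ('b', 'c') column MultiIndex
# the question asked for, without falling back to pivot_table.
print(df.pivot(index='a', columns=['b', 'c'], values='d'))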

Drop 'NaN' value in dataframe

Is there any way to drop only 'nan' from a dataset, rather than removing the entire row or column that contains it? I have tried the code below but the result was not the one that I wanted.
import pandas as pd

df = pd.read_csv('...csv')
df.stack()
Here is the part of the csv, and here is the result after '.stack()': the headers are mixed up with the actual data. I don't want them mixed up!
You can use:
df.fillna('')
which will fill NaN with an empty string ''. Or you can fill it with whatever you like.
You can use dropna, or a condition: a NaN value is not equal to itself, as sketched below. And you can drop a column or row by using:
column: del df['column_name']
row: df.drop([row_index])
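A small sketch of that self-equality trick (my example, not the answerer's): comparing a Series to itself is False exactly at the NaN positions, which gives you a mask to drop or keep them.

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN is the only value that does not compare equal to itself.
print(s[s == s])   # drops the NaN, keeps 1.0 and 3.0
print(s[s != s])   # keeps only the NaN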
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.iloc[1, 1] = np.nan
print(df)
   0    1  2
0  0  1.0  2
1  3  NaN  5
2  6  7.0  8
You can drop just the middle cell, but only if you stack:
df.stack()

0  0    0.0
   1    1.0
   2    2.0
1  0    3.0
   2    5.0
2  0    6.0
   1    7.0
   2    8.0
dtype: float64
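The NaN vanishes because stack drops missing values by default; if you want to keep the hole visible, the dropna parameter of DataFrame.stack retains it (shown here on the same frame):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.iloc[1, 1] = np.nan

# dropna=False keeps the missing cell as an explicit NaN
# in the stacked Series instead of silently dropping it.
print(df.stack(dropna=False))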

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
import pandas as pd

df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))
df

   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN
IIUC:
pandas
dropna with thresh parameter
df.dropna(axis=1, thresh=2)

   A    B
0  1  2.0
1  3  NaN
2  5  6.0
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
I used the sample DF from @piRSquared's answer.
If you want to "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN

In [26]: df.loc[:, df.isnull().any()]
Out[26]:
     B    C
0  2.0  NaN
1  NaN  4.0
2  6.0  NaN
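And for the flip side - dropping the columns that do contain a NaN instead - invert the condition; a quick sketch on the same frame:

import pandas as pd

df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))

# notnull().all() is True only for columns with no missing values,
# so this keeps just column A.
print(df.loc[:, df.notnull().all()])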

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records having the value 'NA'?
For me it works fine:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A':[np.nan,2,1],
                    'B':[5,7,8]})
print (df1)

     A  B
0  NaN  5
1  2.0  7
2  1.0  8

df2 = pd.DataFrame({'A':[np.nan,2,3],
                    'C':[4,5,6]})
print (df2)

     A  C
0  NaN  4
1  2.0  5
2  3.0  6

print (pd.merge(df1, df2, on=['A']))

     A  B  C
0  NaN  5  4
1  2.0  7  5

print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem - your 'NA' values are being converted to NaN on read.
You can use pandas.read_excel, where the parameters keep_default_na and na_values let you define exactly which values are converted to NaN:
df = pd.read_excel('test.xlsx', keep_default_na=False, na_values=['NaN'])
print (df)

      a   b
0   NaN  NA
1  20.0  40
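The same knobs exist on pd.read_csv if the source is a CSV rather than an Excel file (the file name here is hypothetical):

import pandas as pd

# keep_default_na=False stops pandas from treating the string 'NA'
# (among others) as missing; na_values then lists exactly which
# strings should become NaN.
df = pd.read_csv('test.csv', keep_default_na=False, na_values=['NaN'])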