Why do I get a warning with this function? - pandas

I am trying to generate a new column containing boolean values of whether a value of each row is Null or not. I wrote the following function:
def not_null(row):
    null_list = []
    for value in row:
        null_list.append(pd.isna(value))
    return null_list

df['not_null'] = df.apply(not_null, axis=1)
But I get the following warning message:
A value is trying to be set on a copy of a slice from a DataFrame.
Is there a better way to write this function?
Note: I want to be able to apply this function to each row regardless of knowing the header row name or not
Final output:
Column1 | Column2 | Column3 | null_idx
NaN     | NaN     | NaN     | [0, 1, 2]
1       | 23      | 34      | []
test1   | NaN     | NaN     | [1, 2]

First, the warning means there is some filtering earlier in your code, and you need DataFrame.copy:
df = df[df['col'].gt(100)].copy()
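A minimal sketch of how that warning typically arises (the 'col' filter here is hypothetical):
import pandas as pd

df_all = pd.DataFrame({'col': [50, 150, 200], 'val': [1, 2, 3]})

sub = df_all[df_all['col'].gt(100)]          # a slice of df_all
# sub['flag'] = True                         # may raise SettingWithCopyWarning

sub = df_all[df_all['col'].gt(100)].copy()   # independent copy
sub['flag'] = True                           # safe, no warning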
Then your solution can be improved:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, np.nan],
                   'b': [np.nan, 4, 6],
                   'c': [4, 5, 3]})
df['list_boolean_for_missing'] = [x[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c list_boolean_for_missing
0 NaN NaN 4 [True, True]
1 1.0 4.0 5 []
2 NaN 6.0 3 [True]
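The trick here is that indexing a boolean array with itself keeps only the True entries. A tiny sketch:
import numpy as np

row = np.array([True, True, False])   # one row of df.isna().to_numpy()
print(row[row].tolist())              # [True, True]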
Your function, written as a lambda:
not_null = lambda x: [pd.isna(value) for value in x]
df['list_boolean_for_missing'] = df.apply(not_null, axis=1)
If you instead need a single boolean per row ("whether a value of each row is Null or not"):
df['not_null'] = df.notna().all(axis=1)
print (df)
a b c not_null
0 NaN NaN 4 False
1 1.0 4.0 5 True
2 NaN 6.0 3 False
EDIT: For the list of positions, create a helper array with np.arange and filter it:
arr = np.arange(len(df.columns))
df['null_idx'] = [arr[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c null_idx
0 NaN NaN 4 [0, 1]
1 1.0 4.0 5 []
2 NaN 6.0 3 [0]
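An equivalent sketch without the helper array, using np.flatnonzero to return the positions of the True entries directly:
df['null_idx'] = [np.flatnonzero(x).tolist() for x in df.isna().to_numpy()]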

Related

Filling data in Pandas Series with help of a function

I want to fill the values of a column based on a certain condition, as in the example in the image.
What's the reason for the TypeError? How can I go about it?
I do not think you are using df.apply() correctly. Remember to post the code as text next time. Here is a working example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list(range(5, 11)),
                   'B': [np.nan, np.nan, 5, 11, 4, np.nan]})
df['C'] = df.apply(lambda row: '' if pd.isna(row['B']) else row['A'], axis=1)
df
Output:
A B C
0 5 NaN
1 6 NaN
2 7 5.0 7
3 8 11.0 8
4 9 4.0 9
5 10 NaN
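If the per-row apply ever becomes a bottleneck, a vectorized sketch with Series.where gives the same result (assuming the df defined above):
df['C'] = df['A'].where(df['B'].notna(), '')   # keep A where B is present, else ''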

What is the difference between the 'set' operation using loc vs iloc?

What is the difference between the 'set' operation using loc vs iloc?
df.iloc[2, df.columns.get_loc('ColName')] = 3
#vs#
df.loc[2, 'ColName'] = 3
Why does the website of iloc (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not have any set examples like those shown in loc website (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?
There isn't much of a difference; it comes down to what you need.
If you have the index label and the column name (most of the time), you should use the loc (location) operator to assign values.
If, as with an ordinary matrix, you only have the integer positions of the row and column, you should use iloc (integer-based location) for assignment.
Pandas DataFrames support indexing both by integer position and by label.
The problem arises when the index itself is integer rather than string. To make clear which operation the user wants to perform, integer-based or label-based indexing, the two separate operators are provided.
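A minimal sketch of the distinction (the frame and 'ColName' column are made up):
import pandas as pd

df = pd.DataFrame({'ColName': [10, 20, 30]}, index=['x', 'y', 'z'])

df.loc['y', 'ColName'] = 99                      # label-based assignment
df.iloc[1, df.columns.get_loc('ColName')] = 99   # position-based, same cell here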
The main difference is that iloc sets values by position and loc by label.
Here are some examples:
Sample:
Non-default integer index (if label 2 exists, the cell is overwritten; otherwise a new row with that label is appended):
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3, 3)),
                  columns=['A', 'B', 'C'], index=[2, 1, 8])
print (df)
A B C
2 2 2 6
1 1 3 9
8 6 1 0
df.iloc[2, df.columns.get_loc('A')] = 30   # position 2 = the third row, which has label 8
print (df)
A B C
2 2 2 6
1 1 3 9
8 30 1 0
Setting the non-existing label 0 appends a new row:
df.loc[0, 'A'] = 70
print (df)
      A    B    C
2   2.0  2.0  6.0
1   1.0  3.0  9.0
8  30.0  1.0  0.0
0  70.0  NaN  NaN
Setting the existing label 2 overwrites that cell (starting again from the DataFrame after the iloc assignment):
df.loc[2, 'A'] = 50
print (df)
A B C
2 50 2 6
1 1 3 9
8 30 1 0
Default index (works the same, because the third row has label 2):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
0 2 2 6
1 1 3 9
2 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
0 2 2 6
1 1 3 9
2 50 1 0
Non-integer index (setting by position works; setting by label 2 appends a new row):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=list('abc'))
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
a 2 2 6
b 1 3 9
c 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
a 2.0 2.0 6.0
b 1.0 3.0 9.0
c 30.0 1.0 0.0
2 50.0 NaN NaN
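As an aside, if you only ever set a single scalar, at and iat are the scalar counterparts of loc and iloc and are typically faster (a sketch reusing the frame above):
df.at[2, 'A'] = 50                        # label-based scalar set
df.iat[2, df.columns.get_loc('A')] = 30   # position-based scalar set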

Construct DataFrame from list of dicts

Trying to construct a pandas DataFrame from a list of dicts.
List of dicts:
a = [{'1': 'A'},
     {'2': 'B'},
     {'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
1 2 3
0 A NaN NaN
1 NaN B NaN
2 NaN NaN C
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
Key Value
0 NaN NaN
1 NaN NaN
2 NaN NaN
Expected results:
Key Value
0 1 A
1 2 B
2 3 C
Try this:
from collections import ChainMap

data = dict(ChainMap(*a))
pd.DataFrame(data.items(), columns=['Key', 'Value'])
Output:
Key Value
0 1 A
1 2 B
2 3 C
Something like this with a list comprehension:
pd.DataFrame([(k, v) for d in a for k, v in d.items()], columns=['Key', 'Value'])
Key Value
0 1 A
1 2 B
2 3 C
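One caveat: the two answers differ once dicts share a key, because dict(ChainMap(*a)) keeps only one value per key while the list comprehension keeps every pair. A small sketch:
from collections import ChainMap

a = [{'1': 'A'}, {'1': 'Z'}]
print(dict(ChainMap(*a)))                          # {'1': 'A'} - one value per key
print([(k, v) for d in a for k, v in d.items()])   # [('1', 'A'), ('1', 'Z')]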

Find out null values between two columns in a DataFrame

I have to check if there are any Null values between two columns in a DataFrame. I have fetched the locations of the first non-null value and the last non-null value in the DataFrame using:
x.first_valid_index()
x.last_valid_index()
Now I need to find whether there are any null values between these two locations.
I think this is the same as converting all NaN values to a boolean mask and summing the True values:
import numpy as np
import pandas as pd

x = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                  'b': [np.nan, 7, np.nan, np.nan],
                  'c': [4, 5, 6, np.nan]})
print (x)
a b c
0 1.0 NaN 4.0
1 2.0 7.0 5.0
2 NaN NaN 6.0
3 NaN NaN NaN
cols = ['a','b']
f = x[cols].first_valid_index()
l = x[cols].last_valid_index()
print (f)
0
print (l)
1
print (x.loc[f:l, cols].isnull().sum().sum())
1
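If you only need a yes/no answer rather than a count, the same mask works with any (a sketch using the frame above):
print(x.loc[f:l, cols].isna().any().any())
True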

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records having the value 'NA'?
For me it works fine:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [np.nan, 2, 1],
                    'B': [5, 7, 8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A': [np.nan, 2, 3],
                    'C': [4, 5, 6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem: your 'NA' values are converted to NaN when the file is read.
You can use pandas.read_excel, where the parameters keep_default_na and na_values define which values are converted to NaN:
df = pd.read_excel('test.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NaN NA
1 20.0 40
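The same idea works for read_csv; a self-contained sketch showing that keep_default_na=False stops pandas from converting the literal string 'NA' to NaN, so merge keys survive intact:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('a,b\nNA,40\n20,NA\n'), keep_default_na=False)
print(df)
    a   b
0  NA  40
1  20  NA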