How to regularize numeric/non-numeric entries in pandas dataframes

I want to control for non-numeric entries in my pandas dataframe.
Say I have the following:
>>> df
col_1 col_2 col_3
0 0.01 NaN 0.1
1 NaN 0.9 0.2
2 0.01 NaN 0.3
3 0.01 0.9 0.4
I can take the row means as follows, and pandas properly skips over the NaN values:
>>> df.mean(axis=1)
0 0.055000
1 0.550000
2 0.155000
3 0.436667
dtype: float64
Great! But now suppose one of the values from my imported table is a string:
>>> df.iloc[0,1]="str1"
>>> df
col_1 col_2 col_3
0 0.01 str1 0.1
1 NaN 0.9 0.2
2 0.01 NaN 0.3
3 0.01 0.9 0.4
>>> df.mean(axis=1)
0 0.055
1 0.200
2 0.155
3 0.205
dtype: float64
DANGER: the output looks plausible, but it is wrong. Once I changed the value at position [0,1] to a string, col_2 as a whole stopped being treated as numeric (the values at positions [1,1] and [3,1] seem to have changed from the number 0.9 to the string "0.9"), and the entire column is omitted from the averaging. (I guess each column has to be of a single type? There's probably a reason, but boy, this is dangerously subtle.)
What I want to do now is force all the entries of the dataframe back into numeric type. Anything that can be sensibly coerced into a number should become that number, and anything that cannot should become nan (regardless of what string or type it might have been).
Pandas series have a function pandas.to_numeric where you can set errors='coerce', but unfortunately the analogous function for df's (DataFrame.astype()) doesn't allow this option.
Is there a function for "make every element of the dataFrame that looks like a number numeric, and make everything else nan"?

I think you can use to_numeric on a subset of columns with apply. This answer might help.

You can use apply, which operates on the columns by default:
df.apply(pd.to_numeric, errors='coerce').mean(1)
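As a minimal sketch of the one-liner above, the frame from the question can be rebuilt and coerced column by column. The variable names `numeric` and `row_means` are only illustrative, and the frame is typed in by hand rather than imported.

```python
import numpy as np
import pandas as pd

# The frame from the question, with a stray string in col_2
df = pd.DataFrame({
    "col_1": [0.01, np.nan, 0.01, 0.01],
    "col_2": ["str1", 0.9, np.nan, 0.9],
    "col_3": [0.1, 0.2, 0.3, 0.4],
})

# apply() passes each column to pd.to_numeric; errors="coerce" turns
# anything that cannot be parsed as a number into NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Row means now skip only genuine NaN values
row_means = numeric.mean(axis=1)
print(numeric.dtypes)
print(row_means)
```

After coercion every column is float64, so col_2 takes part in the row means again.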

Related

Pandas: If condition on multiple columns having null values and fillna with 0

I have the dataframe below. My requirement is: if both columns are np.nan, make no change; if only one of the columns is empty, fill it with 0. I wrote this code, but why is it not working? Please suggest.
import pandas as pd
import numpy as np
data = {'Age': [np.nan, np.nan, 22, np.nan, 50,99],
'Salary': [217, np.nan, 262, 352, 570, np.nan]}
df = pd.DataFrame(data)
print(df)
cond1 = (df['Age'].isnull()) & (df['Salary'].isnull())
if cond1 is False:
    df['Age'] = df['Age'].fillna(0)
    df['Salary'] = df['Salary'].fillna(0)
print(df)
You can just assign it with update
c = ['Age','Salary']
df.update(df.loc[~df[c].isna().all(1),c].fillna(0))
df
Out[341]:
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
c1 = df['Age'].isna()
c2 = df['Salary'].isna()
df[np.c_[c1 & ~c2, ~c1 & c2]]=0
df
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
tmp=df.loc[(df['Age'].isna() & df['Salary'].isna())]
df.fillna(0,inplace=True)
df.loc[tmp.index]=np.nan
This might be a bit less sophisticated than the other answers, but it worked for me:
I first save the row(s) where both values are NaN at the same time,
then fillna the original df as normal,
then set np.nan back on the rows that were saved in the first step.
Get the rows that are all nulls and use where to exclude them during the fill:
bools = df.isna().all(axis = 1)
df.where(bools, df.fillna(0))
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
Your if statement won't work because the condition has to be checked for each row. cond1 is a Series that can hold many True and False values, while `cond1 is False` is an identity test against the single object False. A Series is never that object, so the test always evaluates to False and the block under the if never runs.
An inefficient way would be to traverse the rows:
for row, index in zip(cond1, df.index):
    if not row:
        df.loc[index] = df.loc[index].fillna(0)
Apart from the inefficiency, you have to keep track of index positions yourself; the pandas options above abstract that process away while being quite efficient, since the looping is done in C.
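The following sketch illustrates both points under the question's data: why the `is False` test never fires, and how a row-wise boolean mask does the same job without an if. The names `both_missing`, `branch_taken`, and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [np.nan, np.nan, 22, np.nan, 50, 99],
    "Salary": [217, np.nan, 262, 352, 570, np.nan],
})

# One boolean per row: True where both columns are missing
both_missing = df["Age"].isna() & df["Salary"].isna()

# `is` compares object identity; a Series is never the object False,
# so this is always False and the original if-block is skipped
branch_taken = both_missing is False

# Fill NaN with 0 only on rows that are not entirely missing
filled = df.copy()
filled.loc[~both_missing] = filled.loc[~both_missing].fillna(0)
print(filled)
```

The mask keeps the decision per row, which is what the single if statement could not express.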

return list by dataframe linear interpolation

I have a dataframe that has, let's say 5 entries.
moment stress strain
0 0.12 13 0.11
1 0.23 14 0.12
2 0.56 15 0.56
I would like to get a 1D list of floats in the order [moment, stress, strain], obtained by linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method in pandas, but it is meant for filling in values where you have NaN entries.
How do I accomplish a similar task with my case?
Thank you
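One method is to add a new row with NaN values to your dataframe and sort it:
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
new_row = pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])
df = pd.concat([df, new_row], ignore_index=True)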
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
moment stress
strain
0.11 0.1200 13.00
0.12 0.2300 14.00
0.45 0.4775 14.75
0.56 0.5600 15.00
To get the values back:
df = df.reset_index()
print(
df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
.to_numpy()
.tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
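As an alternative sketch (not the method of the answer above), the same numbers can be obtained without adding a row by calling numpy.interp once per column. This assumes strain can be sorted into increasing order, which numpy.interp requires; `target_strain`, `ordered`, and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "moment": [0.12, 0.23, 0.56],
    "stress": [13, 14, 15],
    "strain": [0.11, 0.12, 0.56],
})

target_strain = 0.45

# np.interp expects the x-coordinates (strain here) in increasing order
ordered = df.sort_values("strain")

# Interpolate each dependent column at the target strain
result = [
    float(np.interp(target_strain, ordered["strain"], ordered["moment"])),
    float(np.interp(target_strain, ordered["strain"], ordered["stress"])),
    target_strain,
]
print(result)
```

This leaves the original dataframe untouched, which may be preferable when several target values have to be looked up.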

selecting nan values in a pandas dataframe using loc [duplicate]

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
Col1 Col2 Col3
0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0
4 0 1.0 2.0
The result should be this one:
Col1 Col2 Col3
1 0 NaN 0.0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you want to filter for the rows where some column has no NaN using query, you can do so by passing the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like @MaxU - stop WAR against UA did:
df.query('Col2==Col2')
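To see several of these selections side by side, here is a small sketch on the dataframe from the question. The result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)],
    columns=["Col1", "Col2", "Col3"],
)

# Rows where Col2 is NaN
col2_missing = df[df["Col2"].isna()]

# Rows with at least one NaN in any column
any_missing = df[df.isna().any(axis=1)]

# Rows with no NaN at all
complete = df[df.notna().all(axis=1)]

# Same as col2_missing, relying on NaN != NaN
via_query = df.query("Col2 != Col2")

print(col2_missing.index.tolist())
print(any_missing.index.tolist())
print(complete.index.tolist())
```

Each selection returns a filtered view of the rows, so the original index labels are preserved.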

How to manipulate data in arrays using pandas

I have data in a DataFrame and need to compare the current value of one column with the prior value of another column. The current time is row 5 in this dataframe, and here's the desired output:
The target data is streamed and captured into a DataFrame, and that array is then multiplied by a constant to generate another column. However, I am unable to generate the third column, comp, which should compare the current value of prod with the prior value of comp.
df['temp'] = self.temp
df['prod'] = df['temp'].multiply(other=const1)
Another user suggested this logic, but it generates errors because the routine's array doesn't match the size of the DataFrame:
for i in range(2, len(df['temp'])):
    df['comp'].append(max(df['prod'][i], df['comp'][i - 1]))
Let's try this, I think this will capture your intended logic:
df = pd.DataFrame({'col0':[1,2,3,4,5]
,'col1':[5,4.9,5.5,3.5,6.3]
,'col2':[2.5,2.45,2.75,1.75,3.15]
})
df['col3'] = df['col2'].shift(-1).cummax().shift()
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 6.3 3.15 3.15
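If the intended recurrence is comp[i] = max(prod[i], comp[i-1]), it amounts to a running maximum. The sketch below compares an explicit loop with Series.cummax, assuming the first comp value is seeded with the first prod value, which the question does not state. It differs from the answer above only in that the first row is included instead of being left as NaN.

```python
import pandas as pd

# Same values as col2 in the answer above, standing in for prod
prod = pd.Series([2.5, 2.45, 2.75, 1.75, 3.15])

# Explicit recurrence, seeded with the first prod value (an assumption)
comp_loop = [prod.iloc[0]]
for value in prod.iloc[1:]:
    comp_loop.append(max(value, comp_loop[-1]))

# Vectorized running maximum
comp_vec = prod.cummax()

print(comp_loop)
print(comp_vec.tolist())
```

Both versions give the same values; the vectorized one avoids appending to a column row by row.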

map one column in a df to another df where all words are present

I am trying to map a column from one dataframe onto another: a row of df1 should receive ColC from df2 when all the words in df2's ColB appear in df1's ColA.
Multiple matches are fine, as I can filter them out afterwards.
Thanks in advance!
df1
ColA
this is a sentence
with some words
in a column
and another
for fun
df2
ColB ColC
this a 123
in column 456
fun times 789
Some attempts
dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)
dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))
desired output
dfResult
ColA ColC
this is a sentence 123
with some words NaN
in a column 456
and another NaN
for fun NaN
Turn to set and look for subsets with Numpy broadcasting
Disclaimer: No assurances that this will be fast.
A = df1.ColA.str.split().apply(set).to_numpy() # If pandas version is < 0.24 use `.values`
B = df2.ColB.str.split().apply(set).to_numpy() # instead of `.to_numpy()`
C = df2.ColC.to_numpy()
# When `dtype` is `object` Numpy falls back on performing
# the operation on each pair of values. Since these are `set` objects
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])
out = pd.array([np.nan] * len(A), pd.Int64Dtype()) # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]
df1.assign(ColC=out)
ColA ColC
0 this is a sentence 123
1 with some words NaN
2 in a column 456
3 and another NaN
4 for fun NaN
By using loop and set.issubset
pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan for z,y in zip(df2.ColB,df2.ColC)] for x in df1.ColA ]).max(1)
Out[34]:
0 123.0
1 NaN
2 456.0
3 NaN
4 NaN
dtype: float64
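For comparison, here is a plain-Python sketch of the same subset test that keeps the first matching ColC value per sentence. The helper `first_match` is hypothetical, and taking the first match rather than the maximum is an assumption; the question only says that multiple matches are acceptable.

```python
import pandas as pd

df1 = pd.DataFrame({"ColA": [
    "this is a sentence", "with some words", "in a column",
    "and another", "for fun",
]})
df2 = pd.DataFrame({
    "ColB": ["this a", "in column", "fun times"],
    "ColC": [123, 456, 789],
})

# Pre-split the lookup phrases into word sets, paired with their ColC value
keys = [(set(b.split()), int(c)) for b, c in zip(df2["ColB"], df2["ColC"])]

def first_match(sentence):
    """Return ColC of the first phrase whose words all occur in sentence."""
    words = set(sentence.split())
    for needed, value in keys:
        if needed <= words:  # subset test on sets
            return value
    return None

values = [first_match(s) for s in df1["ColA"]]

# Nullable integer dtype keeps 123/456 as integers next to missing values
result = df1.assign(ColC=pd.array(values, dtype="Int64"))
print(result)
```

"for fun" stays unmatched because "times" is missing from it, which agrees with the desired output in the question.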