Pandas: check that Excel-imported DataFrame columns are numeric

I am importing data from Excel into pandas and need to verify that the data in certain columns is numeric.
month value dp wd ... mg fee pr comment
0 2013-07-31 208372.33 4206.84 4692.22 ... 0 0 0 some comment
1 2013-08-31 210669.77 0.00 1270.28 ... 0 0 0
There are about 20 columns and I only need to exclude the "month" and "comment" columns.
Is there something like df.iloc[:, 2:18].isnumeric(), or will this require a loop?
I would like to get a True / False response.
thank you.

One way is to use select_dtypes and compare the result with the expected columns:
np.array_equal(df.select_dtypes(include='number').columns, df.columns[1:-1])
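As a minimal sketch, assuming numpy is imported as np and the file name data.xlsx is hypothetical, the full check could look like this:
import numpy as np
import pandas as pd
df = pd.read_excel('data.xlsx')  # hypothetical file name
# everything except the first ('month') and last ('comment') column should be numeric
expected = df.columns[1:-1]
# columns that pandas actually parsed with a numeric dtype
numeric = df.select_dtypes(include='number').columns
print(np.array_equal(numeric, expected))  # single True / False answer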

You can use the apply method to apply Series string methods to each column of a DataFrame:
df2 = df.drop(["month", "comment"], axis=1)
df2 = df2.apply(lambda x: x.str.isnumeric())
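Note that .str.isnumeric() only works on columns that hold strings; on columns pandas already parsed as a numeric dtype the .str accessor raises an error. A hedged alternative sketch that works either way is to coerce each column with pd.to_numeric and check for parse failures:
import pandas as pd
df2 = df.drop(["month", "comment"], axis=1)
# a value counts as numeric if pd.to_numeric can parse it; errors='coerce' turns failures into NaN
is_numeric = df2.apply(lambda col: pd.to_numeric(col, errors='coerce').notna())
# one True / False per column (missing values also count as non-numeric here)
print(is_numeric.all())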

Related

Create new column based on two columns

I have two columns in a dataframe. I want to create a third column such that if the first column > the second column then 1, otherwise 0. As below:
   Value1  value 2  Newcolumn
0     100      101          0
1     101       97          1
Comparing two columns in a pandas DataFrame and writing the result of the comparison to a third column can be done easily with np.select:
conditions = [(condition1), (condition2)]
choices = ["choice1", "choice2"]
df["new_column_name"] = np.select(conditions, choices, default=default_value)
conditions are the conditions to check for between the two columns
choices are the results to return based on the conditions
np.select is used to return the results to the new column
The dataframe is:
import numpy as np
import pandas as pd
#create DataFrame
df = pd.DataFrame({'Value1': [100, 101],
                   'value 2': [101, 97]})
#define conditions
conditions = [df['Value1'] < df['value 2'],
              df['Value1'] > df['value 2']]
#define choices
choices = ['0', '1']
#create new column in DataFrame that displays results of comparisons
df['Newcolumn'] = np.select(conditions, choices, default='Tie')
Final dataframe
print(df)
Output:
Value1 value 2 Newcolumn
0 100 101 0
1 101 97 1

Drop rows where a subset of columns are empty in Pandas

I have a pandas dataframe in the below format
No ei1 ei2 ei3 ei4 ei1_val ei2_val ei3_val ei4_val
123
124
125 0 0 0 1 low low high high
To simplify, I have shown only a subset of columns here but actually the pandas dataframe has columns from ei1 to ei24 and ei1_val to ei24_val.
I have retrieved the column names using the below code:
val_cols = df[[col for col in df.columns if col.endswith("_val")]]
cols = [col.replace('_val', '') for col in val_cols.columns]
After that, I need to drop the rows from dataframe df if all columns in val_cols and all columns in cols are empty. Hence the output dataframe would drop the rows with No 123 and 124. I am not sure whether there is a way to do this efficiently in pandas rather than looping over the columns and checking the values.
Any suggestions would be appreciated.
IIUC, try:
m = ~df.filter(regex='.*_val').isna().all(axis=1)
df[m]
Output:
No ei1 ei2 ei3 ei4 ei1_val ei2_val ei3_val ei4_val
2 125 0.0 0.0 0.0 1.0 low low high high
Find all the columns whose header ends with _val using the regex in the pd.DataFrame.filter method.
Check whether all values in those columns are NaN using isna and all with axis=1; the ~ inverts the mask so only rows with at least one value are kept.
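The question also asks for the ei1..ei24 columns to be empty as well; a sketch under that reading (using a made-up three-row frame) uses dropna with a subset covering both groups of columns:
import numpy as np
import pandas as pd
df = pd.DataFrame({'No': [123, 124, 125],
                   'ei1': [np.nan, np.nan, 0],
                   'ei1_val': [np.nan, np.nan, 'low']})
val_cols = [c for c in df.columns if c.endswith('_val')]
cols = [c.replace('_val', '') for c in val_cols]
# drop rows where every column in both groups is empty (NaN)
print(df.dropna(subset=cols + val_cols, how='all'))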

Count occurrences of inf in pandas dataframe

We can count occurrences of NaN with df.isna().sum().
Is there a similar function to count inf?
This worked for me:
number_inf = df[df == np.inf].count()
Use np.isinf():
df = pd.DataFrame({'data' : [0,0,float('inf'),float('inf')]})
print(df)
data
0 0.0
1 0.0
2 inf
3 inf
df.groupby(np.isinf(df['data'])).count()
data
data
False 2
True 2
You can use isinf() and ravel() in one line:
# .values returns the underlying data as an ndarray and ravel() flattens it
np.isinf(df["Col"]).values.ravel().sum()
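One caveat: np.isinf counts both positive and negative infinity, while df == np.inf only matches positive infinity. A small sketch with made-up data to show the difference:
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': [0.0, np.inf, -np.inf]})
print((df == np.inf).sum())   # only +inf -> 1
print(np.isinf(df).sum())     # +inf and -inf -> 2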

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100, 300, 500],
    'C': list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is the following (i.e. remove C first, normalise the data, and add the column back).
df_new = df.drop('C', axis=1)
df_c = df['C']
from sklearn import preprocessing
x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_c
However, I am sure there is an easier way of doing this in pandas (given the column names that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to get the numeric columns, normalize them by subtracting the minimum and dividing by the range, then assign only the normalized columns back:
df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other function to the data frame, you can use df[columns] = df[columns].apply(func).
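If you would rather keep scikit-learn's MinMaxScaler from the question, a sketch that scales only the numeric columns in place (no dropping and re-adding of C) could look like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'C': list('abc')})
num_cols = df.select_dtypes(np.number).columns
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)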

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way: call np.min on the df values, use this to create a boolean mask, and drop the columns that don't have at least a single non-NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
a b
0 -0.860548 -2.427571
1 0.136942 1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4 1.227465 1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but no general which as far as I can see.
(_quantiles - _quantiles.mean()).abs().idxmin()
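As a small illustration with a made-up Series named quantiles, the idxmin version of R's which.min(abs(x - mean(x))) pattern would be:
import pandas as pd
quantiles = pd.Series([0.1, 0.5, 0.9, 2.0], index=['q10', 'q50', 'q90', 'q99'])
# label of the value closest to the mean
print((quantiles - quantiles.mean()).abs().idxmin())   # 'q90' (0.9 is closest to the mean, 0.875)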