How to remove columns that have all values below a certain threshold - dataframe

I am trying to remove any columns in my dataframe that do not have at least one value above 0.9. I know this probably isn't the most efficient way to do it, but I can't find the problem with it. I know it isn't correct because it only removes one column, and I know it should be closer to 20. I count how many values in a column are below 0.9, and if the count equals the length of the column's list of values I drop that column. Thanks in advance.
for i in range(len(df3.columns)):
    count = 0
    for j in df3.iloc[:, i].tolist():
        if j < .9:
            count += 1
    if len(df3.iloc[:, i].tolist()) == count:
        df4 = df3.drop(df3.columns[i], axis=1)
df4

You can loop through each column in the dataframe and check its maximum value against your defined threshold (0.9 in this case); if no value exceeds the threshold, drop the column. (The reason your own loop removes only one column is that df4 = df3.drop(...) always drops from the original df3, so df4 only ever reflects the last column dropped.)
The input:
col1 col2 col3
0 0.2 0.8 1.0
1 0.3 0.5 0.5
Code:
import pandas as pd

# define dataframe
df = pd.DataFrame({'col1': [0.2, 0.3], 'col2': [0.8, 0.5], 'col3': [1, 0.5]})
# define threshold
threshold = 0.9
# loop through each column in the dataframe
for col in df:
    # get the maximum value in the column and
    # check whether it is less than or equal to the defined threshold
    if df[col].max() <= threshold:
        # if true, drop the column
        df = df.drop([col], axis=1)
This outputs:
col3
0 1.0
1 0.5
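If you prefer to avoid the explicit loop, an equivalent vectorized sketch (assuming the same df and threshold defined above) keeps only the columns whose maximum value exceeds the threshold:
# column-wise maxima; the boolean mask selects the columns to keep
df = df.loc[:, df.max() > threshold]
This gives the same result as the loop: only col3 survives.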

Related

return column name of min greater than 0 pandas

I have a dataframe with one date column and the rest numeric columns, something like this:
date col1 col2 col3 col4
2020-1-30 0 1 2 3
2020-2-1 0 2 3 4
2020-2-2 0 2 2 5
I now want to find the name of the column that gives the minimum column sum, but only among sums greater than 0. So in the above case it should give me col2, because its sum (5) is the smallest of all columns apart from col1, whose sum is 0. I appreciate any help with this.
I would use:
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the columns with 0, compute the sum
# get index of min
out = df2.loc[:, df2.ne(0).all()].sum().idxmin()
If you want to ignore a column only if all values are 0, use any in place of all:
df2.loc[:, df2.ne(0).any()].sum().idxmin()
Output: 'col2'
all minima
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the all-zero columns, compute the sum
s = df2.loc[:, df2.ne(0).any()].sum()
# get all minima
out = s[s.eq(s.min())].index.tolist()
Output:
['col2']
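For reference, a self-contained sketch built from the sample table in the question (the values are copied from there):
import pandas as pd

df = pd.DataFrame({'date': ['2020-1-30', '2020-2-1', '2020-2-2'],
                   'col1': [0, 0, 0],
                   'col2': [1, 2, 2],
                   'col3': [2, 3, 2],
                   'col4': [3, 4, 5]})
df2 = df.select_dtypes('number')
print(df2.loc[:, df2.ne(0).all()].sum().idxmin())  # col2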

Transforming a dataframe of dicts of dicts into a specific format

I have this df dataset:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
Where I'm trying to transform it into this:
And also considering that:
Epochs always appear in the same order in every cell, but instead of only 3 epochs there could be 1,000 or 10,000.
The column names and axes could change. For example, another day the data could have f1 instead of logloss, or val instead of train. But no matter the names, each row of df will always be a metric name, and each column will always be a dataset name.
The number of columns and rows in df could change too. Some models have 5 datasets and 7 metrics, for example (which would give a df with 5 columns and 7 rows).
The column names of the output table should be datasetname_metricname.
So I'm trying to build a generic transformation, while avoiding brute-force, hard-coded steps. In case it's helpful, the source of df is:
df = pd.DataFrame(model_xgb.evals_result())
df.columns = ['train', 'test'] # This is the line that can change (and the metrics inside `model_xgb`)
Where model_xgb = xgboost.XGBClassifier(..), after model_xgb.fit(..) has been called.
Here's a generic way to get the result you've specified, irrespective of the number of epochs or the number or labels of rows and columns:
df2 = df.stack().apply(pd.Series)
df2.index = ['_'.join(reversed(x)) for x in df2.index]
df2 = df2.T.assign(epochs=range(1, len(df2.columns) + 1)).set_index('epochs').reset_index()
Output:
epochs train_auc test_auc train_logloss test_logloss
0 1 0.432 0.456 0.123 0.321
1 2 0.543 0.567 0.234 0.432
2 3 0.523 0.678 0.345 0.543
Explanation:
Use stack() to convert the input dataframe to a series (of lists) with a multiindex that matches the desired column sequence in the question
Use apply(pd.Series) to convert the series of lists to a dataframe with each list converted to a row and with column count equal to the uniform length of the list values in the input series (in other words, equal to the number of epochs)
Build the desired column labels by reversing each multiindex entry and joining it with _ as a separator, then use T to transpose the dataframe so these index labels (which are the desired column labels) become column labels
Use assign() to add a column named epochs enumerating the epochs beginning with 1
Use set_index() followed by reset_index() to make epochs the leftmost column.
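As a small illustration of the first two steps (a sketch reusing the df from the question), the stacked-and-expanded frame has one row per (metric, dataset) pair and one column per epoch:
df2 = df.stack().apply(pd.Series)
print(df2.shape)  # (4, 3): 2 metrics x 2 datasets, 3 epochs
# joining each reversed index pair yields the final column labels,
# e.g. ('auc', 'train') -> 'train_auc'
print(['_'.join(reversed(x)) for x in df2.index])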
Try this:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
de=df.explode(['train', 'test'])
df_out = de.set_index(de.groupby(level=0).cumcount()+1, append=True).unstack(0)
df_out.columns = df_out.columns.map('_'.join)
df_out = df_out.reset_index().rename(columns={'index':'epochs'})
print(df_out)
Output:
epochs train_auc train_logloss test_auc test_logloss
0 1 0.432 0.123 0.456 0.321
1 2 0.543 0.234 0.567 0.432
2 3 0.523 0.345 0.678 0.543
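Note that passing a list of columns to explode, as in df.explode(['train', 'test']), requires pandas 1.3 or newer; on older versions each column has to be exploded separately.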

Python Pandas - how to delete some columns

I have a file with many columns to be analysed with Pandas. How can I delete columns if the percentage of missing values is higher than a certain percentage value?
threshold = 0.4  # Your value here
cols_to_be_dropped = []
for column in df.columns:
    if df[column].isna().sum() / len(df[column]) > threshold:
        cols_to_be_dropped.append(column)
df.drop(cols_to_be_dropped, axis=1, inplace=True)
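If you'd rather skip the loop, a vectorized sketch with the same semantics (assuming the same df and threshold) is:
# keep only the columns whose fraction of missing values is at or below the threshold
df = df.loc[:, df.isna().mean() <= threshold]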

How can I create a column from 2 related columns of lists in Python?

sampleID  testnames                  results
23939332  [32131,34343,35566]        [NEGATIVE,0.234,3.331]
32332323  [34343,96958,39550,88088]  [0.312,0.008,0.1,0.2]
The table above is what I have, and the one below is what I want to achieve:
sampleID  32131     34343  39550  88088  96958  35566
23939332  NEGATIVE  0.234  NaN    NaN    NaN    3.331
32332323  NaN       0.312  0.1    0.2    0.008  NaN
So I need to create columns of unique values from the testnames column and fill the cells with the corresponding values from the results column.
Consider this as a sample from a very large dataset (table).
Here is a commented solution:
(df.set_index(['sampleID']) # keep sampleID out of the expansion
.apply(pd.Series.explode) # expand testnames and results
.reset_index() # reset the index
.groupby(['sampleID', 'testnames']) #
.first() # set the expected shape
.unstack()) #
It gives the result you expected, though with a different column order:
results
testnames 32131 34343 35566 39550 88088 96958
sampleID
23939332 NEGATIVE 0.234 3.331 NaN NaN NaN
32332323 NaN 0.312 NaN 0.1 0.2 0.008
Let's see how it does on generated data:
import numpy as np
import pandas as pd

def build_df(n_samples, n_tests_per_sample, n_test_types):
    df = pd.DataFrame(columns=['sampleID', 'testnames', 'results'])
    test_types = np.random.choice(range(0, 100000), size=n_test_types, replace=False)
    for i in range(n_samples):
        testnames = list(np.random.choice(test_types, size=n_tests_per_sample))
        results = list(np.random.random(size=n_tests_per_sample))
        # DataFrame.append was removed in pandas 2.0; on current versions use pd.concat instead
        df = df.append({'sampleID': i, 'testnames': testnames, 'results': results}, ignore_index=True)
    return df

def reshape(df):
    df2 = (df.set_index(['sampleID'])                 # keep the sampleID out of the expansion
             .apply(pd.Series.explode)                # expand testnames and results
             .reset_index()                           # reset the index
             .groupby(['sampleID', 'testnames'])
             .first()                                 # set the expected shape
             .unstack())
    return df2
%time df = build_df(60000, 10, 100)
# Wall time: 9min 48s (yes, it was ugly)
%time df2 = reshape(df)
# Wall time: 1.01 s
reshape() breaks when n_test_types becomes too large, with ValueError: Unstacked DataFrame is too big, causing int32 overflow.
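As an alternative sketch (assuming pandas 1.3+ for multi-column explode; not benchmarked against the approach above), the same reshape can be written with explode and pivot:
d = df.explode(['testnames', 'results'])
# pivot raises a ValueError if a sampleID repeats a testname
df2 = d.pivot(index='sampleID', columns='testnames', values='results')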

How can I filter all the rows of my dataframe

My data frame holds election stats: the ages of voters and 6 parties with their percentages.
I then filter the data frame by age, dropping every age below 18,
which gives me a new data frame.
Now I have all the voters older than 18:
df_election_filter1 = self.election
filter_below_age = df_election_filter1.age > 18
print( df_election_filter1[filter_below_age] )
df = df_election_filter1[filter_below_age]
Now I want to filter my data frame to the rows where no party has a score below 0.2.
How can I do it? Each party needs to be above 0.2.
You can select all columns except the age column, compare them against the threshold, and keep only the rows where every party clears it:
df = df[(df.loc[:, df.columns != 'age'] >= 0.2).all(axis=1)]
You might try:
df = df.loc[(df['Column1'] > 0.2) | (df['Column2'] > 0.2), :] # you can add other columns if needed
In this code you select all rows whose values are larger than 0.2 in columns 'Column1' or 'Column2'. You can add more columns, e.g. for each of the parties.
Using your column names:
df = df.loc[(df['HopeforChange'] > 0.2) | (df['WeWillWin'] > 0.2) | (df[etc] > 0.2), :] # fill etc with other columns
Let me know if this is not what you need.