How to filter df by column's value NOT in value list with Polars?

My previous question about filtering a df by a value list had a nice solution:
How to filter df by value list with Polars?
But now I have the inverse task.
I have a list of int values: black_list = [45, 87, 555]
And I have a df with some values in the column cid1.
df = pl.DataFrame(
    {
        "cid1": [45, 99, 177],
        "cid2": [4, 5, 6],
        "cid3": [7, 8, 9],
    }
)
How can I filter the df by my black_list so that the resulting df contains only rows without blacklisted values in the "cid1" column?
I can't filter by some white_list according to the conditions of my task.
The code .filter(pl.col("cid1").is_not(black_list)) is not suitable. I tried it, but it gives me the error TypeError: Expr.is_not() takes 1 positional argument but 2 were given, and I don't see another way.

You can just add ~ to negate the boolean expression
df.filter(~pl.col("cid1").is_in(black_list))
or you can use .is_not() to invert the boolean values
df.filter(pl.col("cid1").is_in(black_list).is_not())
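For completeness, a minimal runnable sketch combining the snippets above; the expected outcome is described in a comment, since Polars' printed table layout varies by version:
import polars as pl
black_list = [45, 87, 555]
df = pl.DataFrame(
    {
        "cid1": [45, 99, 177],
        "cid2": [4, 5, 6],
        "cid3": [7, 8, 9],
    }
)
# Keep only the rows whose cid1 value is NOT in the black list
filtered = df.filter(~pl.col("cid1").is_in(black_list))
print(filtered)
# The row with cid1 == 45 is dropped; the rows with 99 and 177 remain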

Related

I need to return a value from a dataframe cell as a variable, not a series

I have the following issue:
when I use the .loc function it returns a Series, not a single value with no index.
I need to do some math operations with the selected cells. The code I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]
.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the job:
value = df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(value)
[Out]: 30.0
So, in OP's specific case, the operation 30 + 3.7 could be done as follows:
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
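Putting it together as a small self-contained sketch, using the toy data from the question:
import pandas as pd
data = [[82, 1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns=['Ah-Step', 'State'])
# .values[0] extracts the underlying scalar from the one-element selection
result = (df.loc[df['State'] == 2, 'Ah-Step'].values[0]
          + df.loc[df['State'] == 3, 'Ah-Step'].values[0])
print(result)
[Out]: 33.7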

Identifying indices where any one of a plurality of columns has a certain value?

I need to determine which indices in a dataframe have any one of a set of columns having a specified value. The dataframe has several hundred columns, and a few dozen I need to use for the filtering, so it's impractical to write them all out. My strategy is as follows, to determine indices where any column having 'temp' in its name is equal to 1:
columns = [col for col in df.columns if 'temp' in col]
indices = list(np.where(df[columns]==1)[0])
However, this is returning an unexpected result: it seems to return a value for every single index in the df. Any clues where this is going wrong?
You could try this:
import pandas as pd
# Toy dataframe: two columns have "temp" in their name
# and rows 0 and 3 have a value of 1
df = pd.DataFrame(
    {"SJDRtemp": [0, 0, 0, 1], "TR": [0, 0, 2, 1], "LDtemp": [1, 3, 0, 0]}
)
# Select the columns whose name contains "temp"
columns = [col for col in df.columns if "temp" in col]
# Get indices of rows where "temp columns" have a value of 1
indices = list(df[df[columns] == 1].dropna(how="all").index)
print(indices)
# Outputs
[0, 3]
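As a side note (a sketch, not taken from the answer above): the same indices can be obtained with a plain boolean row mask via .eq(1).any(axis=1), which avoids relying on the NaN-dropping trick:
import pandas as pd
df = pd.DataFrame(
    {"SJDRtemp": [0, 0, 0, 1], "TR": [0, 0, 2, 1], "LDtemp": [1, 3, 0, 0]}
)
columns = [col for col in df.columns if "temp" in col]
# True for every row where any "temp" column equals 1
mask = df[columns].eq(1).any(axis=1)
indices = list(df.index[mask])
print(indices)
# [0, 3]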

Apply function to list of columns from a dataframe

I'm creating a function that accepts 3 inputs: a dataframe, a column and a list of columns.
The function should apply a short calculation to the single column, and a different short calculation to the list of other columns. It should return a dataframe containing just the amended columns (and their amended rows) from the original dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4], [1, 3, 5, 6], [4, 6, 7, 8], [5, 4, 3, 6]],
                  columns=['A', 'B', 'C', 'D'])
def pre_process(dataframe, y_col_name, x_col_names):
    # ...
    return new_dataframe
The calculation to be applied to y_col_name's rows is each value of y_col_name divided by the mean of y_col_name.
The calculation to be applied to each of the columns in the x_col_names list is each value of the column divided by that column's standard deviation.
I would like some help to write the function. I think I need to use an "apply" or a "lambda" function but I'm unsure.
This is what calling the command would look like:
pre_process_data = pre_process(df, 'A', ['B', 'D'])
Thanks
def pre_process(dataframe, y_col_name, x_col_names):
    new_dataframe = dataframe.copy()
    new_dataframe[y_col_name] = new_dataframe[y_col_name] / new_dataframe[y_col_name].mean()
    new_dataframe[x_col_names] = new_dataframe[x_col_names] / new_dataframe[x_col_names].std()
    return new_dataframe
Is this what you mean?
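A quick usage sketch with the toy frame from the question (exact numbers omitted; .std() is pandas' sample standard deviation by default):
result = pre_process(df, 'A', ['B', 'D'])
print(result)
# Column 'A' is divided by its mean, columns 'B' and 'D' are each divided
# by their standard deviation, and column 'C' is returned unchanged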

Numpy remove rows with same column values

How do I remove rows from an ndarray which have the same nth column value?
For example,
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
And I want to have the rows unique by the third column.
I want to have just the [1, 3, 5] row left.
numpy.unique does not do it. It will check for uniqueness in every column; I can't specify the
column by which to check uniqueness.
How can I do this efficiently for thousand + rows?
Thank you.
You could try a combination of bincount, nonzero and in1d
import numpy as np
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
# A tuple holding the values that appear exactly once in the third column
unique_in_column = (np.bincount(a[:, 2]) == 1).nonzero()
# Boolean mask of the rows whose third-column value is one of those
unique_index = np.in1d(a[:, 2], unique_in_column[0])
unique_a = a[unique_index]
This should do the trick. However, I'm not sure how this method scales with 1000+ rows.
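As a hedged alternative sketch (not from the answer above), the same idea can be written with np.unique, assuming a NumPy version that provides return_counts and np.isin:
import numpy as np
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
# Third-column values that occur exactly once
values, counts = np.unique(a[:, 2], return_counts=True)
keep_values = values[counts == 1]
unique_a = a[np.isin(a[:, 2], keep_values)]
print(unique_a)
# [[1 3 5]]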
I had done this finally:
repeatdict = {}
todel = []
for i, row in enumerate(kplist):
    if repeatdict.get(row[2], 0):
        todel.append(i)
    else:
        repeatdict[row[2]] = 1
kplist = np.delete(kplist, todel, axis=0)
Basically, I iterated over the list, storing the values of the third column, and if in a later iteration the same value is already present in the repeatdict dict, that row is marked for deletion by storing its index in the todel list.
Then we can get rid of the unwanted rows by calling np.delete with the list of all row indices we want to delete.
Also, I'm not marking my own answer as the accepted one, because I know there's probably a better way to do this with just numpy magic.
I'll wait.
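For what it's worth, the keep-the-first-row-per-value behaviour of the loop above can also be sketched with np.unique's return_index (assuming kplist is a NumPy array, as in the question):
# Indices of the first occurrence of each distinct third-column value
_, first_idx = np.unique(kplist[:, 2], return_index=True)
# Restore the original row order and keep only those rows
kplist = kplist[np.sort(first_idx)]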

Seaborn groupby pandas Series

I want to visualize my data as box plots that are grouped by another variable, shown here in my terrible drawing:
So what I do is use a pandas Series variable to tell pandas that I have grouped variables; this is what I do:
import pandas as pd
import seaborn as sns
# example data for reproducibility
a = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I expected was two boxplots, each describing only the first column, grouped by the corresponding value in the second column (the column converted to a Series), while the plot above shows each column separately, which is not what I want.
A column in a DataFrame is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
# example data for reproducibility
df = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer, in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and every other column. So if your DataFrame looks like this:
a b grouper
0 2 5 1
1 4 9 2
2 5 3 1
3 10 6 2
4 9 7 2
5 3 11 1
And you want boxplots for columns a and b grouped by the column grouper, then you should flatten the columns and change the groupby column to contain values like a1, a2, b1, etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. Especially the flattening of the hierarchy after pivoting is hard to read; I don't like it.
This is a new answer to an old question, because seaborn and pandas have changed across version updates. Because of these changes, Rutger's answer no longer works.
The most important changes came between seaborn==v0.5.x and seaborn==v0.6.0. I quote the changelog:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1], [4, 2], [5, 1],
                   [10, 2], [9, 2], [3, 1]],
                  columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]],
                  columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
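To see what the reshaping does: with the toy frame above, df_long should look roughly like this after the two steps (long form, with group-suffixed category labels):
print(df_long.head())
#   grouper   a   b
# 0       1  a1   2
# 1       2  a2   4
# 2       1  a1   5
# 3       2  a2  10
# 4       2  a2   9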
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''This function takes a DataFrame, groups it by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]],
                  columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
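For reference, with the toy frame above df_new should come out with one column per (variable, group) pair, roughly:
print(df_new)
#    a1  b1  a2  b2
# 0   2   5   4   9
# 1   5   3  10   6
# 2   3  11   9   7
so sns.boxplot draws one box per group/variable combination.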
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You are probably going to see
TypeError: boxplot() got an unexpected keyword argument 'groupby'
The best approach is to group the data first and pass the grouped DataFrame to the boxplot.
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here the B column contains numeric values and the grouping is done on the basis of A. The values within each group are summed, and the boxplot is plotted from the aggregated result. Hope this helps.
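A minimal runnable sketch of that idea with hypothetical toy data (the names nameDataFrame, grouDataFrame and the columns A/B simply mirror the snippet above):
import pandas as pd
import seaborn as sns
# Hypothetical toy data: B is numeric, A is the grouping key
nameDataFrame = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                              'B': [1, 2, 3, 4, 5]})
# Sum B within each A group, then plot the aggregated values per group
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)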