Boolean masking on a pandas DataFrame

I have multiple columns in a DataFrame. I set a condition on one column and got back a True/False array. Now I want to drop the rows where the mask is False, which should also drop the corresponding values from all the other columns.
Example:
import pandas as pd

sample = {
    "COL_1": [10, 45, 747, 120, 45, 78],
    "COL_2": [11, 45, 78, 45, 10, 25],
    "COL_3": [44, 55, 77, 50, 60, 40],
}
df = pd.DataFrame(sample)
mask = df['COL_1'] > 100
Now I want to filter the whole DataFrame using this mask.
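A minimal sketch of the usual approach: indexing the DataFrame with the boolean Series keeps only the rows where the mask is True.

# keep the rows where COL_1 > 100; df.loc[mask] is equivalent
filtered = df[mask]
print(filtered)
#    COL_1  COL_2  COL_3
# 2    747     78     77
# 3    120     45     50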

Related

How to filter rows where any column is null in a PySpark dataframe

It has to be somewhere on Stack Overflow already, but I'm only finding ways to filter the rows of a PySpark dataframe where one specific column is null, not where any column is null.
import pandas as pd
import pyspark.sql.functions as F

my_dict = {"column1": list(range(100)),
           "column2": ["a", "b", "c", None] * 25,
           "column3": ["a", "b", "c", "d", None] * 20}
my_pandas_df = pd.DataFrame(my_dict)

# 'spark' is an existing SparkSession
sparkDf = spark.createDataFrame(my_pandas_df)
sparkDf.show(5)
I'm trying to keep any row with null values in any column of my dataframe, basically the opposite of this:
sparkDf.na.drop()
For including rows having any column with null (F.greatest over the per-column isNull() flags acts as a row-wise OR, keeping rows where at least one column is null):
sparkDf.filter(F.greatest(*[F.col(i).isNull() for i in sparkDf.columns])).show(5)
For excluding the same:
sparkDf.na.drop(how='any').show(5)
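An equivalent formulation, if you prefer to spell out the OR explicitly (this functools.reduce variant is an assumption, not part of the original answer):

from functools import reduce
from operator import or_

sparkDf.filter(reduce(or_, [F.col(c).isNull() for c in sparkDf.columns])).show(5)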

Different behaviour between two ways of dropping duplicate values in a dataframe

I tested two ways of dropping duplicated rows in a dataframe, but they didn't produce the same result and I don't understand why.
First code:
file_df1 = open('df1.csv', 'r')
df1_list = []
for line in file_df1:
    new_line = line.rsplit(',')
    df1_firstcolumn = new_line[0]
    if df1_firstcolumn not in df1_list:
        df1_list.append(df1_firstcolumn)
    #else:
    #    print('firstcolumn: ' + df1_firstcolumn + ' is duplicated')
file_df1.close()
The second way, using pandas:
import pandas as pd
df1 = pd.read_csv('df1.csv', header=None, names=['firstcolumn','second','third','forth'])
df1.drop_duplicates(inplace=True)
I obtained more unique values using pandas.
The first approach you posted "drops duplicates" based on the data in the first column only.
The pandas drop_duplicates function, by default, checks whether the values in all four columns are duplicated, so rows that share a first-column value but differ elsewhere are kept; that is why pandas left you with more unique rows. The version below removes duplicates based on the first column only:
df1.drop_duplicates(subset=['firstcolumn'], inplace=True)
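A tiny illustration of the difference, using made-up rows (hypothetical data, not from the question):

import pandas as pd

demo = pd.DataFrame({'firstcolumn': ['a', 'a'], 'second': [1, 2]})
print(len(demo.drop_duplicates()))                        # 2: the full rows differ
print(len(demo.drop_duplicates(subset=['firstcolumn'])))  # 1: first column repeats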

How to put the first value in one column and the remaining values into another column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas dataframe from a CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do that?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
raw_data = open(csv_path).readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
              filenames                       class
0  ROCO2_CLEF_00001.jpg           C3277934,C0002978
1  ROCO2_CLEF_00002.jpg  C3265939,C0002942,C2357569
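A slightly tighter variant of the same idea, splitting each line just once on the first comma (this sketch assumes the same test_csv.csv with no header row and at least one comma per line):

import pandas as pd

with open('test_csv.csv') as f:
    rows = [line.strip().split(',', 1) for line in f]

temp_df = pd.DataFrame(rows, columns=['filenames', 'class'])
print(temp_df)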

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a dataframe based on the columns selected dynamically in a Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. Now I am trying to see how I could apply value_counts/groupby to this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get this error:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to the dataframe. When you pass a single column in [], you get back a pandas.Series (which has the value_counts method), but when you pass a list of columns you get back a pandas.DataFrame, which has no value_counts in older pandas versions. (DataFrame.value_counts was added in pandas 1.1.0, so on newer versions the call above works.)
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable to a Series and not to a DataFrame (again, on pandas versions before 1.1.0).
You can try converting the .value_counts() output to a dataframe.
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Return pd.value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe on which to run value_counts(); must have column `col`.
    col : str
        Name of the column in `df` for which to generate counts.

    Returns
    -------
    pandas.DataFrame
        The returned dataframe has a single column named "count" containing the
        value_counts() for each unique value of df[col]. The index name of this
        dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df

# selected_col is a single column name chosen in the app
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp

val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the resulting dataframe in your Streamlit app.
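A minimal sketch of how the pieces might fit together in the app (the selectbox label and variable names are assumptions, not from the question):

# let the user pick one column for the single-column counts
selected_col = st.selectbox("Column for value_counts", all_columns)
st.table(value_counts_df(new_df, selected_col))
st.dataframe(valueCountDF(df, selected_columns))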

How to apply a custom string-matching function to a pandas dataframe and return a summary dataframe about correct/incorrect patterns?

I have written a pattern-matching function to classify whether a dataframe column value matches a given pattern or not. I created a column 'Correct_Pattern' to store the boolean answers in that dataframe. I also created a new dataframe called Incorrect_Pattern_df, which only contains the values that do not match the desired pattern. I did this because I later want to see if I can correct those incorrect numbers. Now, every time I have corrected a batch of numbers, I would like to check the number format again and regenerate the Incorrect_Pattern_df. Please see my code below. What do I need to do to make it work?
#data
mylist = ['850/07-498745', '850/07-148465', '07-499015']

#create dataframe
df = pd.DataFrame(mylist)
df.rename(columns={df.columns[0]: "mycolumn"}, inplace=True)

#function to check if my numbers follow the correct pattern
def check_number_format(dataframe, rm_pattern, column_name):
    #create a column Correct_pattern that contains a boolean 'true or false'
    #depending on whether the pattern was matched or not
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(pattern)
    #filter all incorrect patterns and put them in a dataframe called Incorrect_Pattern_df
    Incorrect_Pattern_df = dataframe[dataframe.Correct_pattern == False]
    #return both the original dataframe with the added Correct_pattern column
    #and the dataframe containing the Incorrect_Pattern_df
    return Incorrect_Pattern_df

#apply the check_number_format to the dataframe
Incorrect_Pattern_df = df['mycolumn'].apply(check_number_format,
                                            args=(df, r'^\d{2}-\d+$', 'mycolumn'))
The desired output should be the regenerated Incorrect_Pattern_df, containing only the rows whose values do not match the pattern.
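Two things prevent this from working: inside the function, str.match is called with the undefined name pattern instead of the rm_pattern argument, and the function expects the whole dataframe, so it should be called directly rather than through Series.apply. A minimal corrected sketch, reusing the names from the question:

import pandas as pd

mylist = ['850/07-498745', '850/07-148465', '07-499015']
df = pd.DataFrame(mylist, columns=['mycolumn'])

def check_number_format(dataframe, rm_pattern, column_name):
    # flag each value depending on whether it matches the pattern
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(rm_pattern)
    # keep only the rows whose values do not match
    return dataframe[~dataframe['Correct_pattern']]

# rerun this after every batch of corrections to regenerate Incorrect_Pattern_df
Incorrect_Pattern_df = check_number_format(df, r'^\d{2}-\d+$', 'mycolumn')
print(Incorrect_Pattern_df)
#         mycolumn  Correct_pattern
# 0  850/07-498745            False
# 1  850/07-148465            False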