How to filter rows where any column is null in a pyspark dataframe

It has to be somewhere on stackoverflow already but I'm only finding ways to filter the rows of a pyspark dataframe where 1 specific column is null, not where any column is null.
import pandas as pd
import pyspark.sql.functions as F

# small example dataframe with nulls scattered across the columns
my_dict = {"column1": list(range(100)),
           "column2": ["a", "b", "c", None] * 25,
           "column3": ["a", "b", "c", "d", None] * 20}
my_pandas_df = pd.DataFrame(my_dict)
sparkDf = spark.createDataFrame(my_pandas_df)
sparkDf.show(5)
I'm trying to include any row with null values on any column of my dataframe, basically the opposite of this:
sparkDf.na.drop()

To keep the rows that have a null in any column:
sparkDf.filter(F.greatest(*[F.col(c).isNull() for c in sparkDf.columns])).show(5)
To exclude those rows instead:
sparkDf.na.drop(how='any').show(5)
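An equivalent way to build the same condition, not from the original answer but a common alternative, is to OR the per-column isNull checks together with functools.reduce:
from functools import reduce
import pyspark.sql.functions as F

# a row qualifies as soon as any single column is null
any_null = reduce(lambda a, b: a | b, [F.col(c).isNull() for c in sparkDf.columns])
sparkDf.filter(any_null).show(5)
F.greatest over the boolean columns and the reduced OR express the same predicate; pick whichever reads better to you.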

Related

Pandas remove all columns before a column with a matching value is found

I have multiple CSV files that I want to import, remove every column that comes before the column containing a date, and then concatenate them so that the date sits in the first column of every dataframe. The date column can be at a different index in each file. I also insert the filename as a column.
So far I'm removing the rows with no dates, but I have no idea how to find the index of the matching column so that I can drop all the columns before it.
import os
import glob
import pandas as pd

globbed_files = glob.glob("*.csv")
data = []
for csv in globbed_files:
    frame = pd.read_csv(csv)
    frame['g'] = os.path.basename(csv).split(".")[0]  # filename (without extension) as a column
    data.append(frame)
bigframe = pd.concat(data, ignore_index=True)  # don't want pandas to try to align row indexes
bigframe.to_csv("processed.csv")
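The part the question leaves open, finding the index of the date column so everything before it can be dropped, could be sketched like this. This is my own sketch, not from the original post; it assumes the dates are stored as strings that pd.to_datetime can parse and that each file has one such column:
import pandas as pd

def drop_columns_before_date(frame):
    # walk the columns left to right and stop at the first one whose values parse as dates
    for idx, col in enumerate(frame.columns):
        if frame[col].dtype != object:      # only string-like columns can hold date text here
            continue
        sample = frame[col].dropna().head(5)
        if sample.empty:
            continue
        try:
            pd.to_datetime(sample)          # raises if the sampled values are not date-like
            return frame.iloc[:, idx:]      # keep the date column and everything after it
        except (ValueError, TypeError):
            continue
    return frame                            # no date column found: leave the frame unchanged
It would be called on each frame right after pd.read_csv, before appending it to data.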

Boolean Masking on pandas dataframe

I have multiple columns in a dataframe. I have set a condition on one column and got a boolean (true/false) array. Now I want to remove the rows where the mask is false, which should also remove the corresponding rows from all the other columns.
Example:
import pandas as pd

sample = {
    "COL_1": [10, 45, 747, 120, 45, 78],
    "COL_2": [11, 45, 78, 45, 10, 25],
    "COL_3": [44, 55, 77, 50, 60, 40],
}
df = pd.DataFrame(sample)
mask = df['COL_1'] > 100  # boolean Series: True where COL_1 is greater than 100
Now I want to filter the whole dataframe using that mask.
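The question stops before the actual filtering step; the standard pandas way to apply the mask (not part of the original post) is boolean indexing:
filtered = df[mask]        # or df.loc[mask]; keeps only rows where COL_1 > 100
print(filtered)            # COL_2 and COL_3 values from the dropped rows disappear as well
Because the mask is index-aligned with df, the rows are removed consistently across every column.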

SQL - Count All Cells In The Entire Table That Are Not NULL And Not Empty

I have recently been asked to do a count of all the cells in some tables that are not NULL and not empty/blank.
The issue is, I have about 80 tables and some of those tables have dozens of columns and others have hundreds of columns.
Is there a query I could use to count all cells from all columns that fit a specific criteria (in this case not NULL and not empty/blank)?
I have done some searching and it seems most answers revolve around single columns or tables that only have like 3-5 columns.
Thanks!
Try connecting to the SQL database from pandas with a pymysql or pyodbc connector, then iterate over the columns in a for loop and apply the count on each one.
import pymysql
import pandas as pd

con = pymysql.connect(host='[host name]', user='[user name]',
                      password='[your password]', database='[database name]')
df = pd.read_sql('select * from [table name]', con)  # SQL table loaded into a pandas dataframe
print(df)

for col in df.columns:        # loop over the columns
    count_ = df[col].count()  # count() skips NULL/NaN values
    print(col, count_)
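count() above only skips NULLs/NaNs, while the question also wants empty/blank cells excluded. A small extension, my own sketch assuming blanks are stored as empty or whitespace-only strings, gives one total for the whole table:
import numpy as np

# treat empty / whitespace-only strings as missing, then count whatever is left
cleaned = df.replace(r'^\s*$', np.nan, regex=True)
total_filled_cells = int(cleaned.count().sum())  # per-column counts summed over all columns
print(total_filled_cells)
Repeating the read_sql call in a loop over the ~80 table names would cover every table.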

Replace pyspark column based on other columns

In my "data" dataframe, I have 2 columns, 'time_stamp' and 'hour'. I want to insert 'hour' column values where 'time_stamp' values is missing. I do not want to create a new column, instead fill missing values in 'time_stamp'
What I'm trying to do is replace this pandas code to pyspark code:
data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)
Something like this should work:
from pyspark.sql import functions as f

df = df.withColumn(
    'time_stamp',
    f.expr('case when time_stamp is null then hour else time_stamp end')  # note the closing 'end'
)
Alternatively, if you don't like SQL expressions:
df = df.withColumn('time_stamp',
                   f.when(f.col('time_stamp').isNull(), f.col('hour'))
                    .otherwise(f.col('time_stamp')))
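A shorter equivalent, not mentioned in the original answer but standard in Spark, is coalesce, which returns the first non-null of its arguments:
# fall back to 'hour' only when 'time_stamp' is null
df = df.withColumn('time_stamp', f.coalesce(f.col('time_stamp'), f.col('hour')))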

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta node of a data column inside the HDF5 file can be used as a hack:
import pandas as pd

store = pd.HDFStore('store.h5')
print(store.groups())  # groups() is a method; it lists the nodes stored in the file

# values recorded in the column's /meta node, used here as the set of non-null values
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
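Another workaround, my own sketch built on pandas' documented "where mask" pattern and assuming my_data_column was saved as a data column, reads just that column, keeps the coordinates of the non-null rows, and selects with them:
# read only the queryable column, find the rows where it is not null,
# then pull exactly those rows out of the store
c = store.select_column('df', 'my_data_column')
where = c[c.notnull()].index
df = store.select('df', where=where)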