pyspark sql dataframe keep only null [duplicate]

This question already has answers here:
Filter Pyspark dataframe column with None value
(10 answers)
Closed 6 years ago.
I have a Spark SQL dataframe df with a column user_id. How do I filter the dataframe to keep only the rows where user_id is actually null, for further analysis? From the pyspark module page here, one can drop NA rows easily, but it does not say how to do the opposite.
I tried df.filter(df.user_id == 'null'), but it returns 0 rows; it is probably looking for the literal string "null". df.filter(df.user_id == null) won't work either, as Python looks for a variable named null.

Try
df.filter(df.user_id.isNull())
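As a minimal sketch of how that plays out (the toy data here is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (None, "b")], ["user_id", "item"])  # toy data
nulls_only = df.filter(df.user_id.isNull())  # keep only rows where user_id is null
nulls_only.show()  # only the (None, "b") row survives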

Related

How to print the value of a row that returns false using .isin method in python [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 4 months ago.
I am new to writing code and am currently working on a project to compare two columns of an Excel sheet using Python and return the rows that do not match.
I tried using the .isin function and was able to output the values from comparing the columns; however, I am not sure how to print the actual row that returns the value False.
For Example:
import pandas as pd
data = ["Darcy Hayward","Barbara Walters","Ruth Fraley","Minerva Ferguson","Tad Sharp","Lesley Fuller","Grayson Dolton","Fiona Ingram","Elise Dolton"]
df = pd.DataFrame(data, columns=['Names'])
df
data1 = ["Darcy Hayward","Barbara Walters","Ruth Fraley","Minerva Ferguson","Tad Sharp","Lesley Fuller","Grayson Dolton","Fiona Ingram"]
df1 = pd.DataFrame(data1, columns=['Names'])
df1
data_compare = df["Names"].isin(df1["Names"])
for data in data_compare:
    if data == False:
        print(data)
However, I want to know that index 8 returned False, something like the format below.
Could you please advise how I can modify the code to print the output with the index and the Name that returned False?
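A minimal sketch of one way to get there, using boolean indexing with the mask that isin already gives you (reusing the df and df1 defined above):
mask = df["Names"].isin(df1["Names"])
missing = df[~mask]  # rows whose Names are not present in df1
print(missing)       # prints the index together with the Name
# or iterate explicitly over index/value pairs:
for idx, name in missing["Names"].items():
    print(idx, name)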

How do I create lists of items for every unique ID in a Pandas DataFrame? [duplicate]

This question already has answers here:
How to get unique values from multiple columns in a pandas groupby
(3 answers)
Python pandas unique value ignoring NaN
(4 answers)
Closed 1 year ago.
Imagine I have a table that looks like this.
original table
How do I convert it into this?
converted table
Attached sample data. Thanks.
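Without the attached data it is hard to be precise, but based on the linked duplicates, a groupby with unique is probably what you want. A minimal sketch, assuming the columns are called "id" and "item" (both names are made up here):
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "item": ["a", "b", "c", "c", "d"]})  # toy data
lists_per_id = df.groupby("id")["item"].unique().apply(list).reset_index()
print(lists_per_id)  # one row per id, with the list of its unique items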

Pandas - list of unique strings in a column [duplicate]

This question already has answers here:
Find the unique values in a column and then sort them
(8 answers)
Closed 1 year ago.
I have a dataframe column which contains these values:
A
A
A
F
R
R
B
B
A
...
I would like to make a list summarizing the different strings, as [A,B,F,...].
I've used groupby with nunique(), but I don't need the counts.
How can I make the list?
Thanks
unique() is enough
df['col'].unique().tolist()
pandas.Series.nunique() only returns the number of unique items.
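For example, with the sample values above (assuming the column is named "col"):
import pandas as pd

df = pd.DataFrame({"col": ["A", "A", "A", "F", "R", "R", "B", "B", "A"]})
print(df["col"].unique().tolist())  # ['A', 'F', 'R', 'B']
print(df["col"].nunique())          # 4 -- just the count, not the values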

Filter multiple separate rows in a DataFrame that meet a condition from another DataFrame with pandas? [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 2 years ago.
This is my DataFrame
df = pd.DataFrame({'uid': [109200005, 108200056, 109200060, 108200085, 108200022],
                   'grades': [69.233627, 70.130900, 83.357011, 88.206387, 74.342212]})
This is my condition list which comes from another DataFrame
condition_list = [109200005, 108200085]
I use this code to filter records that meet the condition
idx_list = []
for i in condition_list:
    idx_list.append(df[df['uid'] == i].index.values[0])
and get what I need
>>> df.iloc[idx_list]
uid grades
0 109200005 69.233627
3 108200085 88.206387
Job is done. I'd just like to know: is there a simpler way to do the job?
Yes, use isin:
df[df['uid'].isin(condition_list)]
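And if you need the opposite ("not in"), negate the mask with ~:
df[~df['uid'].isin(condition_list)]  # rows whose uid is NOT in condition_list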

What is the use of lit() in Spark? The two pieces of code below return the same output, so what is the benefit of using lit()? [duplicate]

This question already has answers here:
Where do you need to use lit() in Pyspark SQL?
(2 answers)
Closed 2 years ago.
I have two pieces of code here
gooddata = gooddata.withColumn("Priority",when(gooddata.years_left < 5 & (gooddata.Years_left >= 0),lit("CRITICAL"))).fillna("LOW").show(5)
gooddata=gooddata.withColumn("Priority",when((gooddata.Years_left < 5) & (gooddata.Years_left >= 0),"CRITICAL").otherwise("LOW")).show(5)
For both Spark and PySpark, lit() is needed for:
literals in certain statements
comparing with nulls
getting the name of a dataframe column as a literal value instead of the contents of that column
E.g.
val nonNulls = df.columns.map(x => when(col(x).isNotNull, concat(lit(","), lit(x))).otherwise(",")).reduce(concat(_, _))
from question: Add a column to spark dataframe which contains list of all column names of the current row whose value is not null
val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_value1"))
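A minimal PySpark sketch of the same ideas (the data and column names below are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("e1", 3), ("e2", 7)], ["EmpId", "Years_left"])

# lit() wraps a plain Python value so it can be used where a Column is expected
df2 = df.select(col("EmpId"), col("Years_left"), lit("1").alias("lit_value1"))

# when()/otherwise() also accept plain literals directly, which is why
# lit("CRITICAL") and "CRITICAL" produce the same result in the snippets above
df3 = df.withColumn(
    "Priority",
    when((col("Years_left") < 5) & (col("Years_left") >= 0), lit("CRITICAL")).otherwise("LOW"),
)
df3.show()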