Remove null rows in pyspark dataframe [duplicate]

This question already has answers here:
Filter Pyspark dataframe column with None value
(10 answers)
Closed 4 years ago.
When I loaded a fairly large dataset (e.g. Wikipedia's archives) into a Spark dataframe, I received the error below:
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
What is the best way to remove null values from a pyspark dataframe?

You can use na.drop() to remove all rows containing null values:
df.na.drop()
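For example, since the stack trace points at Tokenizer hitting a null, a minimal sketch (assuming the text sits in a hypothetical column named text) that drops only the rows where that column is null:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical example; 'text' stands in for whatever column feeds the Tokenizer
df = spark.createDataFrame([("a b c",), (None,), ("d e",)], ["text"])

df.na.drop()                 # drops rows with a null in any column
df.na.drop(subset=["text"])  # drops rows only when 'text' is null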

Related

Pandas Stack Column Number Mismatch [duplicate]

This question already has answers here:
Pandas: Adding new column to dataframe which is a copy of the index column
(3 answers)
Closed 1 year ago.
Trying to stack and end up with 3 columns, not 1
Hello, I am trying to use the stack function in pandas, but when I use it, shape reports only 1 column even though 3 are displayed. I see that they are on different index levels, and I have tried various things with the levels without success. What can I do? I need 3 columns!
-Thanks
Use new_cl_traff.reset_index()
As you can see in your screenshot, you have a multi-index on your dataframe with Year and Month - see the line where you name the two index levels:
new_cl_traf.index.set_names(["Year","Month"], inplace=True)
You can see the documentation for pandas.stack here.
If you use new_cl_traff.reset_index(), the index (or a subset of its levels) will be reset - see the documentation here.
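A minimal sketch of what reset_index does here, with hypothetical traffic numbers standing in for your data:

import pandas as pd

# hypothetical stand-in for new_cl_traff: data indexed by a (Year, Month) MultiIndex
idx = pd.MultiIndex.from_tuples([(2020, 1), (2020, 2), (2021, 1)], names=["Year", "Month"])
new_cl_traff = pd.DataFrame({"traffic": [100, 120, 90]}, index=idx)

flat = new_cl_traff.reset_index()  # Year, Month and traffic become ordinary columns
print(flat.shape)                  # (3, 3) - three columns, as wanted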

How do I create lists of items for every unique ID in a Pandas DataFrame? [duplicate]

This question already has answers here:
How to get unique values from multiple columns in a pandas groupby
(3 answers)
Python pandas unique value ignoring NaN
(4 answers)
Closed 1 year ago.
Imagine I have a table that looks like this:
[screenshot: original table]
How do I convert it into this?
[screenshot: converted table]
Sample data attached. Thanks.
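Since the tables were only attached as screenshots, here is a hedged sketch along the lines of the linked duplicates, assuming a hypothetical id column and an item column:

import pandas as pd

# hypothetical data; the original tables were only attached as screenshots
df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "item": ["a", "b", "c", "c", "d"]})

# one list of unique items per id: 1 -> [a, b], 2 -> [c, d]
out = df.groupby("id")["item"].unique().reset_index()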

Filter multiple separate rows in a DataFrame that meet a condition from another DataFrame with pandas? [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 2 years ago.
This is my DataFrame:
df = pd.DataFrame({'uid': [109200005, 108200056, 109200060, 108200085, 108200022],
                   'grades': [69.233627, 70.130900, 83.357011, 88.206387, 74.342212]})
This is my condition list, which comes from another DataFrame:
condition_list = [109200005, 108200085]
I use this code to filter records that meet the condition:
idx_list = []
for i in condition_list:
    idx_list.append(df[df['uid'] == i].index.values[0])
and get what I need:
>>> df.iloc[idx_list]
         uid     grades
0  109200005  69.233627
3  108200085  88.206387
Job done. I'd just like to know: is there a simpler way to do this?
Yes, use isin:
df[df['uid'].isin(condition_list)]
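For the inverse ("not in", as in the linked duplicate), negate the boolean mask with ~:

df[df['uid'].isin(condition_list)]   # rows whose uid is in the list
df[~df['uid'].isin(condition_list)]  # rows whose uid is not in the list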

How to convert ndarray to pandas DataFrame [duplicate]

This question already has answers here:
Convert two numpy array to dataframe
(3 answers)
Closed 3 years ago.
I have an ndarray with shape (231, 31). Now I want to convert this ndarray to a pandas DataFrame with 31 columns. I am using this code:
for i in range(1, 32):
    dataset = pd.DataFrame({'Column{}'.format(i): data[:, i-1]})
but this code only ends up creating the last column, i.e. 231 rows and just 1 column, while I need all 31 columns. Is there any way to fix this problem, and why does it happen?
Every iteration creates a brand-new dataframe, overwriting the previous one - that is why only the last column remains.
You need to create the dataframe in a single call with pd.DataFrame(data).
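A minimal sketch, reusing your Column{} naming (with random numbers as a stand-in for the real (231, 31) array):

import numpy as np
import pandas as pd

data = np.random.rand(231, 31)  # hypothetical stand-in for the real ndarray

# build all 31 columns in one call instead of overwriting in a loop
dataset = pd.DataFrame(data, columns=['Column{}'.format(i) for i in range(1, 32)])
print(dataset.shape)  # (231, 31)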

pyspark sql dataframe keep only null [duplicate]

This question already has answers here:
Filter Pyspark dataframe column with None value
(10 answers)
Closed 6 years ago.
I have a SQL dataframe df with a column user_id. How do I filter the dataframe and keep only the rows where user_id is actually null, for further analysis? From the pyspark module page here, one can easily drop NA rows, but it does not say how to do the opposite.
I tried df.filter(df.user_id == 'null'), but the result is empty - presumably it is comparing against the string "null". df.filter(df.user_id == null) won't work either, as Python looks for a variable named null.
Try
df.filter(df.user_id.isNull())
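Should you later need the complement (rows where user_id is not null), isNotNull() is the counterpart:

df.filter(df.user_id.isNull())     # keep only rows where user_id is null
df.filter(df.user_id.isNotNull())  # keep only rows where user_id is present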