Filter negative values from a PySpark dataframe

I have a Spark dataframe with >40 columns containing mixed values. How can I select only positive values from all columns at once and filter out the negative ones? I visited Python Pandas: DataFrame filter negative values, but none of the solutions are working. I would like to fit Naive Bayes in PySpark, where one of the assumptions is that all the features have to be positive. How can I prepare the data by selecting only positive values from my features?

Suppose you have a dataframe like this:
data = [(0, -1, 3, 4, 5, 'a'), (0, -1, 3, -4, 5, 'b'), (5, 1, 3, 4, 5, 'c'),
        (10, 1, 13, 14, 5, 'a'), (7, 1, 3, 4, 2, 'b'), (0, 1, 23, 4, -5, 'c')]
df = sc.parallelize(data).toDF(['f1', 'f2', 'f3', 'f4', 'f5', 'class'])
Use a VectorAssembler to assemble all the feature columns into a vector:
from pyspark.ml.feature import VectorAssembler
transformer = VectorAssembler(inputCols=['f1', 'f2', 'f3', 'f4', 'f5'], outputCol='features')
df2 = transformer.transform(df)
Now, filter the dataframe using a UDF:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
foo = udf(lambda x: not np.any(np.array(x) < 0), BooleanType())
df2.drop('f1', 'f2', 'f3', 'f4', 'f5').filter(foo('features')).show()
Result:
+-----+--------------------+
|class| features|
+-----+--------------------+
| c|[5.0,1.0,3.0,4.0,...|
| a|[10.0,1.0,13.0,14...|
| b|[7.0,1.0,3.0,4.0,...|
+-----+--------------------+
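As a sketch of an alternative not shown in the answer above, you could also filter on the feature columns directly before assembling the vector, which avoids the Python UDF (the column names are taken from the example data):
from functools import reduce
from pyspark.sql import functions as F

feature_cols = ['f1', 'f2', 'f3', 'f4', 'f5']

# Keep only the rows where every feature column is non-negative
all_non_negative = reduce(lambda a, b: a & b, [F.col(c) >= 0 for c in feature_cols])
df_pos = df.filter(all_non_negative)
The VectorAssembler step can then be applied to df_pos exactly as above.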

Related

How to combine 2 rows that meet specific conditions from a dataframe into another dataframe?

I am new to pandas. I can successfully retrieve the two rows with separate 'return' statements as below:
df = pd.read_csv('all_time_olympic_medals.csv')
df2 = df.iloc[:-1]
return df2[df2['no_summer_golds']==df2['no_summer_golds'].max()]
return df2[df2['no_winter_golds']==df2['no_winter_golds'].max()]
The question is how to make it to dataframe shape (2, 17) as below:
>>> the_king_of_summer_winter_olympics.shape
(2, 17)
Use boolean indexing and chain the conditions with | (bitwise OR):
return df2[(df2['no_summer_golds'] == df2['no_summer_golds'].max()) |
           (df2['no_winter_golds'] == df2['no_winter_golds'].max())]
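For illustration only, the same pattern on a tiny made-up dataframe with the same column names (not the Olympics CSV) looks like this:
import pandas as pd

df2 = pd.DataFrame({'no_summer_golds': [1, 5, 2], 'no_winter_golds': [4, 0, 3]})
# Keep the rows that hold the maximum in either column
mask = (df2['no_summer_golds'] == df2['no_summer_golds'].max()) | \
       (df2['no_winter_golds'] == df2['no_winter_golds'].max())
print(df2[mask])  # two rows: the summer-gold leader and the winter-gold leader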

How to replicate the between_time function of Pandas in PySpark

I want to replicate the between_time function of Pandas in PySpark.
Is it possible since in Spark the dataframe is distributed and there is no indexing based on datetime?
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts.between_time('0:45', '0:15')
Is something similar possible in PySpark?
pandas.between_time - API
If you have a timestamp column, say ts, in a Spark dataframe, then for your case above, you can just use
import pyspark.sql.functions as F
df2 = df.filter(F.hour(F.col('ts')).between(0,0) & F.minute(F.col('ts')).between(15,45))
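As a rough end-to-end sketch (assuming an existing SparkSession named spark, and rebuilding the pandas example above as a Spark dataframe with an explicit ts column):
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same data as the pandas example, but with the datetime index as a 'ts' column
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
pdf = pd.DataFrame({'ts': i, 'A': [1, 2, 3, 4]})
df = spark.createDataFrame(pdf)

# Keep rows whose time of day falls between 00:15 and 00:45 (inclusive)
df2 = df.filter(F.hour(F.col('ts')).between(0, 0) & F.minute(F.col('ts')).between(15, 45))
df2.show()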

pandas groupby returns MultiIndex with two or more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
whereas I would have expected the index to be 0, 1 with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that does not have the group by columns as the index, nor a Multi Index in the column.
The comment from @Trenton did the trick:
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
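As a sketch of another option (pandas 0.25+ named aggregation, not part of the original comment), which keeps as_index=False meaningful and avoids the MultiIndex columns entirely:
summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),
    mean=('a', 'mean'),
)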

How to select some rows from a Pyspark dataframe column and add it to a new dataframe?

I have 10 dataframes, df1...df10, each with 2 columns:
df1
id | 2011_result
df2
id | 2012_result
...
...
df3
id | 2018_result
I want to select some ids with 2011_result values less than a threshold:
sample_ids = df1[df1['2011_result'] < threshold].sample(10)['id'].values
After this, I need to select the values of the other columns from all the other dataframes for this list of ids.
Something like this:
df2[df2['id'].isin(sample_ids)]['2012_result']
df3[df3['id'].isin(sample_ids)]['2013_result']
Could you please help out?
First, you can filter with:
import pyspark.sql.functions as F
sample_ids = df1.filter(F.col("2011_result") < threshold)
Then you can use a left_semi join to keep only those sampled ids in df2, df3, etc.:
df2 = df2.join(sample_ids.select("id"), on="id", how="left_semi")
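If you also want the pandas-style .sample(10) of the qualifying ids, here is a hedged sketch (the orderBy(rand()) trick and the yearly column names are assumptions based on the question, not part of the answer above):
import pyspark.sql.functions as F

# Randomly pick 10 ids that pass the threshold (rand() shuffles, seed makes it repeatable)
ids = (df1.filter(F.col("2011_result") < threshold)
          .select("id")
          .orderBy(F.rand(seed=42))
          .limit(10))

# Keep only those ids (and their yearly results) in each of the other dataframes
df2_sel = df2.join(ids, on="id", how="left_semi").select("id", "2012_result")
df3_sel = df3.join(ids, on="id", how="left_semi").select("id", "2013_result")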

Pyspark add sequential and deterministic index to dataframe

I need to add an index column to a dataframe with three very simple constraints:
start from 0
be sequential
be deterministic
I'm sure I'm missing something obvious, because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, monotonically increasing ids. I don't want to zip with index and then have to separate the previously separated columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple 0 to df.count sequence of integers. What am I missing here?
1, 2, 3, 4, 5
What I mean is: how can I add a column with an ordered, monotonically increasing by 1 sequence 0:df.count? (from comments)
You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
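For a quick sanity check, a minimal sketch (assuming an existing SparkSession named spark and the imports above):
tiny = spark.createDataFrame([("a",), ("b",), ("c",)], ["val"])
tiny = tiny.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
tiny.show()  # the index column contains 0, 1, 2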
I don't want to zip with index and then have to separate the previously separated columns that are now in a single column
You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:
cols = df.columns
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
Not sure about the performance, but here is a trick.
Note: toPandas will collect all the data to the driver.
from pyspark.sql import SparkSession

# speed up toPandas using arrow
spark = SparkSession.builder.appName('seq-no') \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .getOrCreate()

df = spark.createDataFrame([
    ('id1', "a"),
    ('id2', "b"),
    ('id2', "c"),
], ["ID", "Text"])

df1 = spark.createDataFrame(df.toPandas().reset_index()).withColumnRenamed("index", "seq_no")
df1.show()
df1.show()
+------+---+----+
|seq_no| ID|Text|
+------+---+----+
| 0|id1| a|
| 1|id2| b|
| 2|id2| c|
+------+---+----+