Remove rows in a PySpark dataframe that contain fewer than n words

I have a **Pyspark dataframe** consisting of about 6 million rows. The dataset has the following structure:
root
|-- content: string (nullable = true)
|-- score: string (nullable = true)
+--------------------+-----+
| text |score|
+--------------------+-----+
|word word hello d...| 5|
|hi man how are yo...| 5|
|come on guys let ...| 5|
|do you like some ...| 1|
|accept | 1|
+--------------------+-----+
Is there a way to keep only the rows whose text contains at least 4 words, i.e. to delete all the rows with fewer words?
I did it this way, but it takes a long time:
pandasDF = df.toPandas()
cnt = 0
ind = []
for index, row in pandasDF.iterrows():
    txt = row["text"]
    spl = txt.split()
    if len(spl) < 4:
        ind.append(index)
        cnt += 1
pandasDF = pandasDF.drop(labels=ind, axis=0)
Is there a way to do this faster and without turning my Pyspark dataframe into a Pandas data frame?

Each text can be split into single words with split and the number of words can then be counted with size:
from pyspark.sql import functions as F
df.filter( F.size(F.split('text', ' ')) >= 4).show()
This statement keeps only the rows whose text contains at least 4 words.
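If the text can contain runs of multiple spaces (an assumption about your data, not something shown in the question), splitting on a whitespace regex avoids counting empty tokens:
from pyspark.sql import functions as F

# Split on runs of whitespace so repeated spaces don't produce empty "words",
# then keep only rows with at least 4 words.
df.filter(F.size(F.split(F.col('text'), r'\s+')) >= 4).show()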

Related

How to filter and select columns and merge streaming dataframes in spark?

I have a streaming dataframe and I am not sure what the best way is to solve this issue.

ID | latitude | longitude
A  | 28       | 30
B  | 40       | 52

Transform to:

A       | B       | distance
(28,30) | (40,52) | calculated distance
I need to transform it to this and add a distance column in which I pass the coordinates.
I am thinking about producing 2 data streams that are filtered with all the A coordinates and B coordinates. I would then A.join(B).withColumn(distance) and stream the output. Is this the way to go about solving this problem?
Is there a way I could pivot the readstream data into the needed format without aggregation, which could be faster than making 2 filtered streaming dataframes and merging them?
Can I add an array column of coordinates in a streaming dataset?
I am not sure how performant this will be, but you can use pivot to force rows of the ID column to become new columns and sum the individual latitude and longitude as a way to obtain the value itself (since there is no F.identity). This will get you the following result:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
)
+----------+-----------+----------+-----------+
|A_latitude|A_longitude|B_latitude|B_longitude|
+----------+-----------+----------+-----------+
| 28| 30| 40| 52|
+----------+-----------+----------+-----------+
Then you can use F.struct to create columns A and B using the latitude and longitude columns:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
)
+----------+-----------+----------+-----------+--------+--------+
|A_latitude|A_longitude|B_latitude|B_longitude| A| B|
+----------+-----------+----------+-----------+--------+--------+
| 28| 30| 40| 52|{28, 30}|{40, 52}|
+----------+-----------+----------+-----------+--------+--------+
The last step is to use a udf to calculate geographic distance, which has been answered here. Putting this all together:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    return geodesic(a, b).m

streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).withColumn(
    'distance', geodesic_udf(F.array('B.B_longitude', 'B.B_latitude'), F.array('A.A_longitude', 'A.A_latitude'))
).select(
    'A', 'B', 'distance'
)
+--------+--------+---------+
| A| B| distance|
+--------+--------+---------+
|{28, 30}|{40, 52}|2635478.5|
+--------+--------+---------+
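If you would rather avoid a Python UDF, a spherical haversine approximation (not the geodesic distance used above, and purely a sketch) can be built from built-in column functions. `pivoted` below is a hypothetical name for the result of the groupby().pivot('ID').agg(...) step, so the column names are the ones from the first output:
from pyspark.sql import functions as F

R = 6371000.0  # mean Earth radius in metres (spherical-Earth assumption)

# 'pivoted' is assumed to hold the pivoted dataframe from the agg step above.
haversine_df = pivoted.withColumn(
    'dlat', F.radians(F.col('B_latitude') - F.col('A_latitude'))
).withColumn(
    'dlon', F.radians(F.col('B_longitude') - F.col('A_longitude'))
).withColumn(
    'a',
    F.pow(F.sin(F.col('dlat') / 2), 2)
    + F.cos(F.radians('A_latitude')) * F.cos(F.radians('B_latitude')) * F.pow(F.sin(F.col('dlon') / 2), 2)
).withColumn(
    'distance', F.lit(2 * R) * F.asin(F.sqrt('a'))  # haversine distance in metres
).select('A_latitude', 'A_longitude', 'B_latitude', 'B_longitude', 'distance')
As a side note, geopy's geodesic expects (latitude, longitude) ordering, so it is worth double-checking the argument order of the arrays passed to the UDF above.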
EDIT: When I answered your question, I let PySpark infer the datatype of each column, but I also tried to reproduce the schema of your streaming dataframe more closely by specifying the column types:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

streaming_df = spark.createDataFrame(
    [
        ("A", 28., 30.),
        ("B", 40., 52.),
    ],
    StructType([
        StructField("ID", StringType(), True),
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
)
streaming_df.printSchema()
root
|-- ID: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
The end result is still the same:
+------------+------------+---------+
| A| B| distance|
+------------+------------+---------+
|{28.0, 30.0}|{40.0, 52.0}|2635478.5|
+------------+------------+---------+

Extract key value from dataframe in PySpark

I have the below dataframe which I have read from a JSON file.
1                             | 2                         | 3                          | 4
{"todo":["wakeup", "shower"]} | {"todo":["brush", "eat"]} | {"todo":["read", "write"]} | {"todo":["sleep", "snooze"]}
I need my output to be in the Key and Value format below. How do I do this? Do I need to create a schema?
ID | todo
1  | wakeup, shower
2  | brush, eat
3  | read, write
4  | sleep, snooze
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
    [((['wakeup', 'shower'],), (['brush', 'eat'],), (['read', 'write'],), (['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+
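If you want todo as a plain comma-separated string, as in the desired output, the array can be joined afterwards (a small follow-up sketch):
from pyspark.sql import functions as F

# Turn the array column into "wakeup, shower"-style strings.
df = df.withColumn('todo', F.array_join('todo', ', '))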
Use from_json to convert each string into a map, then explode to put every element on its own row.
data
df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'), ('{"todo":"[brush, eat]"}'), ('{"todo":"[read, write]"}'), ('{"todo":"[sleep, snooze]"}'))],
    ('value1', 'values2', 'value3', 'value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to map values to individual rows
         .withColumn('todo', translate('todo', "[]", '')))  # remove the square brackets
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
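If you also need the ID column from your desired output, posexplode keeps the position of each element while exploding; a sketch on the same flattened array (positions start at 0, hence the +1):
from pyspark.sql.functions import array, col, flatten, from_json, map_values, posexplode, translate

# Flatten the map values of all columns into one array, explode with positions,
# and derive the ID from the position of each element.
flat = flatten(array(*[map_values(from_json(col(c), "MAP<STRING,STRING>")) for c in df.columns]))

(df.select(posexplode(flat).alias("pos", "todo"))
   .withColumn("ID", col("pos") + 1)
   .withColumn("todo", translate("todo", "[]", ""))
   .select("ID", "todo")
   .show(truncate=False))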

How to properly import CSV files with PySpark

I know, that one can load files with PySpark for RDD's using the following commands:
sc = spark.sparkContext
someRDD = sc.textFile("some.csv")
or for dataframes:
spark.read.options(delimiter=',') \
.csv("some.csv")
My file is a .csv with 10 columns, separated by ','. However, the very last column contains some text that also has a lot of ",". Splitting by "," will result in different column counts per row and, moreover, I do not get the whole text in one column.
I am just looking for a good way to load a .csv file into a dataframe that has multiple "," in the very last column.
Maybe there is a way to only split on the first n columns? It is guaranteed that all columns before the text column are separated by only one ','. Interestingly, using pd.read_csv does not cause this issue! So far my workaround has been to load the file with
csv = pd.read_csv("some.csv", delimiter=",")
csv_to_array = csv.values.tolist()
df = spark.createDataFrame(csv_to_array)
which is not a pretty solution. Moreover, it did not allow me to apply a schema to my dataframe.
If you can't correct the input file, then you can try to load it as text then split the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F
nb_cols = 5
df = spark.read.text("file.csv")
df = df.withColumn(
    "values",
    F.split("value", ",")
).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)
df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4| col_5|
#+-----+-----+-----+-----+-----+-------------------+
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+
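Since every column produced by the split is a string, you can still impose a schema afterwards by casting the columns you need; a small sketch building on the dataframe above (the integer types are an assumption about your data):
from pyspark.sql import functions as F

# Cast the split columns to the types you expect; the free-text column stays a string.
typed_df = df.select(
    *[F.col(f"col_{i}").cast("int") for i in range(nb_cols)],
    F.col(f"col_{nb_cols}")
)
typed_df.printSchema()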

Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
| 4|0.0|
| 5|0.6|
| 6|0.0|
+---+---+
I would like to take the first rows of the data as long as the x column is nonzero (note that the dataframe is sorted by id so talking about the first rows is relevant). For this given dataframe, it would give something like that:
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
+---+---+
I only kept the first 3 rows, as the 4th row was zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods allow getting the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?
Can you guarantee that ID's will be in ascending order? New data is not necessarily guaranteed to be added in a specific order. If you can guarantee the order then you can use this query to achieve what you want. It's not going to perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark all 0's as '1' and everything else as '0'. We'll then compute a running total over the entire data set. Since the running total only increases at a zero, it partitions the dataset into the sections between zeros.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy().orderBy("id")

df.select(
    col("id"),
    col("x"),
    sum( // running total: stays 0 for every row before the first 0 in x
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  ).where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id| x|partition|
+---+---+---------+
| 1|0.5| 0|
| 2|0.3| 0|
| 3|0.9| 0|
+---+---+---------+
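For completeness, the same running-total idea in PySpark (a sketch with identical logic, assuming the dataframe has id and x columns as above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running count of zeros ordered by id; rows before the first zero keep a count of 0.
w = Window.partitionBy().orderBy("id")

(df.withColumn("partition", F.sum(F.when(F.col("x") == 0, 1).otherwise(0)).over(w))
   .where(F.col("partition") == 0)
   .drop("partition")
   .show())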

Pyspark number of unique values in dataframe is different compared with Pandas result

I have large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas with df['name'].nunique() I get a different answer than from PySpark's df.select("name").distinct().show() (around 1800 in Pandas versus 350 in PySpark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, it's most likely caused by non-numeric chars (e.g. a SPACE) in the name column. Pandas will force the type conversion, while with Spark you get NULL. See the example below:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0     1
#1    22
#2     3
#Name: name, dtype: int64
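If the mismatch really is caused by stray characters, extracting the digits explicitly (a sketch) makes the Spark result line up with Pandas while tolerating the extra whitespace:
import pyspark.sql.functions as f

# Pull out the digits with a regex so leading/trailing spaces no longer produce NULLs.
df.withColumn("name1", f.regexp_extract("name", r"(\d+)", 1).cast("int")).show()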