How to filter and select columns and merge streaming dataframes in spark? - dataframe

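I have a streaming dataframe and I am not sure what the best way is to solve this issue:
+---+--------+---------+
| ID|latitude|longitude|
+---+--------+---------+
|  A|      28|       30|
|  B|      40|       52|
+---+--------+---------+
Transform to:
+-------+-------+------------------+
|      A|      B|          Distance|
+-------+-------+------------------+
|(28,30)|(40,52)|calculate distance|
+-------+-------+------------------+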
I need to transform it to this and add a distance column in which I pass the coordinates.
I am thinking about producing 2 data streams that are filtered with all the A coordinates and B coordinates. I would then A.join(B).withColumn(distance) and stream the output. Is this the way to go about solving this problem?
Is there a way I could pivot without aggregation to readstream data into the format needed which could be faster than making 2 streaming dataframes filtered and merging them?
Can I add an array column of coordinates in a streaming dataset?

I am not sure how performant this will be, but you can use pivot to force rows of the ID column to become new columns and sum the individual latitude and longitude as a way to obtain the value itself (since there is no F.identity). This will get you the following result:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
)
+----------+-----------+----------+-----------+
|A_latitude|A_longitude|B_latitude|B_longitude|
+----------+-----------+----------+-----------+
| 28| 30| 40| 52|
+----------+-----------+----------+-----------+
Then you can use F.struct to create columns A and B using the latitude and longitude columns:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
)
+----------+-----------+----------+-----------+--------+--------+
|A_latitude|A_longitude|B_latitude|B_longitude| A| B|
+----------+-----------+----------+-----------+--------+--------+
| 28| 30| 40| 52|{28, 30}|{40, 52}|
+----------+-----------+----------+-----------+--------+--------+
The last step is to use a udf to calculate geographic distance, which has been answered here. Putting this all together:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

# Register the function as a UDF that returns a float
@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    # .m gives the geodesic distance in metres
    return geodesic(a, b).m

streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).withColumn(
    'distance', geodesic_udf(F.array('B.B_longitude','B.B_latitude'), F.array('A.A_longitude','A.A_latitude'))
).select(
    'A','B','distance'
)
+--------+--------+---------+
| A| B| distance|
+--------+--------+---------+
|{28, 30}|{40, 52}|2635478.5|
+--------+--------+---------+
EDIT: When I answered your question, I let pyspark infer the datatype of each column, but I also tried to more closely reproduce the schema for your streaming dataframe by specifying the column types:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

streaming_df = spark.createDataFrame(
    [
        ("A", 28., 30.),
        ("B", 40., 52.),
    ],
    StructType([
        StructField("ID", StringType(), True),
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
)
streaming_df.printSchema()
root
|-- ID: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
The end result is still the same:
+------------+------------+---------+
| A| B| distance|
+------------+------------+---------+
|{28.0, 30.0}|{40.0, 52.0}|2635478.5|
+------------+------------+---------+
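If geopy cannot be installed on the executors, a dependency-free alternative is the haversine (great-circle) formula. This is a different technique from the geodesic calculation used above: it assumes a spherical Earth, so its results will differ slightly from geopy's ellipsoidal distance. The sketch below is only an illustration; the name haversine_m and the radius constant are assumptions, not part of the original answer, and the function expects each point as (latitude, longitude) in degrees.

```python
import math

# Assumption: mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = (math.radians(v) for v in a)
    lat2, lon2 = (math.radians(v) for v in b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, h)))


print(haversine_m((28.0, 30.0), (40.0, 52.0)))
```

A function like this could be wrapped with F.udf and a DoubleType return type in place of geodesic_udf, since a struct column reaches a Python UDF as a tuple-like Row.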

Related

Remove row in Pyspark data frame that contains less than n word

I have a **Pyspark dataframe** consisting of about 6 Million lines. The dataset has the following structure:
root
|-- content: string (nullable = true)
|-- score: string (nullable = true)
+--------------------+-----+
| text |score|
+--------------------+-----+
|word word hello d...| 5|
|hi man how are yo...| 5|
|come on guys let ...| 5|
|do you like some ...| 1|
|accept | 1|
+--------------------+-----+
Is there a way to remove all lines whose text contains fewer than 4 words, so that only lines with at least 4 words are kept?
I did it this way, but it takes a long time:
pandasDF = df.toPandas()
cnt = 0
ind = []
for index, row in pandasDF.iterrows():
    txt = row["text"]
    spl = txt.split()
    if((len(spl)) < 4):
        ind.append(index)
        cnt += 1
pandasDF = pandasDF.drop(labels=ind, axis=0)
Is there a way to do this faster and without turning my Pyspark dataframe into a Pandas data frame?
Each text can be split into single words with split and the number of words can then be counted with size:
from pyspark.sql import functions as F
df.filter( F.size(F.split('text', ' ')) >= 4).show()
This statement keeps only the rows that contain at least 4 words.

pyspark.sql SparkSession load() with schema : Non-StringType fields in schema make all values null

Hi,
I am having trouble using non-StringType as a part of the schema that I use in loading a csv file to create a dataframe.
I was expecting the given schema to be used to convert each field of each record to corresponding data type on the fly while loading.
Instead, all I get is null values.
Here is a simplified way of how to reproduce my problem. In this example, there is a small csv file with four columns that I want to treat, correspondingly, as str, date, int, and bool:
python
Python 3.6.5 (default, Jun 17 2018, 12:13:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import *
>>>
>>> data_flnm = 'four_cols.csv'
>>> lines = [ln.rstrip() for ln in open(data_flnm).readlines()[:3]]
>>> lines
['zzzc7c09:66d7:47d6:9415:87e5010fe282|2019-04-08|0|f', 'zzz304fa:6fc0:4337:91d0:05ef4657a6db|2019-07-08|1|f', 'yy251cf0:aa11:44e9:88f4:f6f9c1899cee|2019-05-13|0|t']
>>> parts = [ln.split("|") for ln in lines]
>>> parts
[['zzzc7c09:66d7:47d6:9415:87e5010fe282', '2019-04-08', '0', 'f'], ['zzz304fa:6fc0:4337:91d0:05ef4657a6db', '2019-07-08', '1', 'f'], ['yy251cf0:aa11:44e9:88f4:f6f9c1899cee', '2019-05-13', '0', 't']]
>>> cols1 = [StructField('u_id', StringType(), True), StructField('week', StringType(), True), StructField('flag_0_1', StringType(), True), StructField('flag_t_f', StringType(), True)]
>>> cols2 = [StructField('u_id', StringType(), True), StructField('week', DateType(), True), StructField('flag_0_1', IntegerType(), True), StructField('flag_t_f', BooleanType(), True)]
>>> sch1 = StructType(cols1)
>>> sch2 = StructType(cols2)
>>> sch1
StructType(List(StructField(u_id,StringType,true),StructField(week,StringType,true),StructField(flag_0_1,StringType,true),StructField(flag_t_f,StringType,true)))
>>> sch2
StructType(List(StructField(u_id,StringType,true),StructField(week,DateType,true),StructField(flag_0_1,IntegerType,true),StructField(flag_t_f,BooleanType,true)))
>>> spark_sess = SparkSession.builder.appName("xyz").getOrCreate()
19/09/10 19:32:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> df1 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch1)
>>> df2 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch2)
>>> df1.show(5)
+--------------------+----------+--------+--------+
| u_id| week|flag_0_1|flag_t_f|
+--------------------+----------+--------+--------+
|zzzc7c09:66d7:47d...|2019-04-08| 0| f|
|zzz304fa:6fc0:433...|2019-07-08| 1| f|
|yy251cf0:aa11:44e...|2019-05-13| 0| t|
|yy1d2f8e:d8f0:4db...|2019-07-08| 1| f|
|zzz5ccad:2cf6:44e...|2019-05-20| 1| f|
+--------------------+----------+--------+--------+
only showing top 5 rows
>>> df2.show(5)
+----+----+--------+--------+
|u_id|week|flag_0_1|flag_t_f|
+----+----+--------+--------+
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
+----+----+--------+--------+
only showing top 5 rows
>>>
I tried a few different versions of .read(...)....load(...) code.
None produce the expected result.
Please advise. Thank you!
While parsing, you need to read the flag_t_f column as a string. The following schema will work:
StructType(List(StructField(u_id,StringType,true),StructField(week,DateType,true),StructField(flag_0_1,IntegerType,true),StructField(flag_t_f,StringType,true)))
After that you can convert the flag values if required (note that this yields the strings 'True' and 'False' rather than a BooleanType column):
import pyspark.sql.functions as f
df = df.withColumn("flag_t_f",
    f.when(f.col("flag_t_f") == 'f', 'False')
     .when(f.col("flag_t_f") == 't', 'True')
)
If you have more than one flag column with the values 'f' and 't', you can convert all of them by iterating over the columns:
cols = df.columns
for col in cols:
    df = df.withColumn(col,
        f.when(f.col(col) == 'f', 'False')
         .when(f.col(col) == 't', 'True')
         .otherwise(f.col(col))
    )

Statistics of Columns computed parallely

Best way to get the max value in a Spark dataframe column
This post shows how to run an aggregation (distinct, min, max) on a table something like:
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can be done more easily by computing statistics. The stats from Hive and Spark are different:
Hive gives - distinct, max, min, nulls, length, version
Spark gives - count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How can I get all of them for all columns using one command?
However, I have 1000s of columns and doing this serially is very slow. Suppose I want to compute some other function, say standard deviation, on each of the columns. How can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
[(1, "a"),(2, "b"), (3, "a"), (4, None), (None, "c")],["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g., 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
*(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Note that this method returns a single row, rather than one row per statistic as describe() and summary() do.
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
    min(df.col1).alias("minCol1"),
    max(df.col1).alias("maxCol1"),
    (max(df.col1) - min(df.col1)).alias("diffMinMax"),
    countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.

convert pyspark groupedData object to spark Dataframe

I have to do a 2 levels grouping on a pyspark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object into a pySpark DF. But I cannot do that.
Any suggestion?
I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
The function DataFrame.groupBy(cols) returns a GroupedData object. To convert a GroupedData object back to a DataFrame, you need to use one of the GroupedData functions, such as mean(cols), avg(cols), or count(). An example based on yours:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| a| b| c|
| a| b| c|
| a| b| c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
| C|count|
+---+-----+
| c| 3|
+---+-----+
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy().
You may use an aggregation function such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Be careful with first: DataFrame.first() is an action (unlike the aggregate function first used inside agg), so it can make your script slower if you misuse it.
If you have a numeric column you can use an aggregation function such as min, max, mean, etc., but if you have a string column you may want to use:
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list("VAL")[0])
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))

Why the types are all string while load csv to pyspark dataframe?

I have a csv file which contains numbers (no string in it).
It has int and float type. But when I read it in pyspark in this way:
df = spark.read.csv("s3://s3-cdp-prod-hive/novaya/instacart/data.csv",header=False)
all the columns of the dataframe have type string.
How can I read the values in as int and float automatically?
Some columns contain NaN values, which are represented in the file as nan:
0.18277,-0.188931,0.0893389,0.119931,0.318853,-0.132933,-0.0288816,0.136137,0.12939,-0.245342,0.0608182,0.0802028,-0.00625962,0.271222,0.187855,0.132606,-0.0451533,0.140501,0.0704631,0.0229986,-0.0533376,-0.319643,-0.029321,-0.160937,0.608359,0.0513554,-0.246744,0.0817331,-0.410682,0.210652,0.375154,0.021617,0.119288,0.0674939,0.190642,0.161885,0.0385196,-0.341168,0.138659,-0.236908,0.230963,0.23714,-0.277465,0.242136,0.0165013,0.0462388,0.259744,-0.397228,-0.0143719,0.0891644,0.222225,0.0987765,0.24049,0.357596,-0.106266,-0.216665,0.191123,-0.0164234,0.370766,0.279462,0.46796,-0.0835098,0.112693,0.231951,-0.0942302,-0.178815,0.259096,-0.129323,1165491,175882,16.5708805975,6,0,2.80890261184,4.42114773551,0,23,0,13.4645462866,18.0359037455,11,30.0,0.0,11.4435397208,84.7504967125,30.0,5370,136.0,1.0,9.61508192633,62.2006926209,1,0,0,22340,9676,322.71241867,17.7282900627,1,100,4.24701125287,2.72260519248,0,6,17.9743048247,13.3241271262,0,23,82.4988407009,11.4021333588,0.0,30.0,45.1319021862,7.76284691137,1.0,66.0,9.40127026245,2.30880529144,1,73,0.113021725659,0.264843289305,0.0,0.986301369863,1,30450,0
As you can see here:
inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.
For NaN values, refer to the same docs above:
nanValue – sets the string representation of a non-number value. If None is set, it uses the default value, NaN
By setting inferSchema to True, you will obtain a dataframe with inferred types.
Here is an example:
CSV file:
12,5,8,9
1.0,3,46,NaN
By default, inferSchema is False and all values are String:
>>> from pyspark.sql.types import *
>>> df = spark.read.csv("prova.csv",header=False)
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]
>>> df.show()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| 12| 5| 8| 9|
|1.0| 3| 46|NaN|
+---+---+---+---+
If you set inferSchema as True:
>>> df = spark.read.csv("prova.csv",inferSchema =True,header=False)
>>> df.dtypes
[('_c0', 'double'), ('_c1', 'int'), ('_c2', 'int'), ('_c3', 'double')]
>>> df.show()
+----+---+---+---+
| _c0|_c1|_c2|_c3|
+----+---+---+---+
|12.0| 5| 8|9.0|
| 1.0| 3| 46|NaN|
+----+---+---+---+