I use PySpark and I have created two DataFrames (from txt files):
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
+---+--------------------+------------------+-------------------+
| id|                name|               lat|                lon|
+---+--------------------+------------------+-------------------+
...
+---+-------------------+------------------+-------------------+
| id|               name|               lat|                lon|
+---+-------------------+------------------+-------------------+
...
What I want, using Spark techniques, is to get every pair of items (one from each DataFrame) whose Euclidean distance is below a certain value (let's say "0.5"), like:
record1, record2
or in any similar form; the exact format doesn't matter.
Any help will be appreciated, thank you.
Since Spark does not include any provisions for geospatial computations, you need a user-defined function that computes the geospatial distance between two points, for example by using the haversine formula (from here):
from math import radians, cos, sin, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
@udf(returnType=FloatType())
def haversine(lat1, lon1, lat2, lon2):
    R = 6372.8  # Earth radius in km

    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))
    return R * c
Then you simply perform a cross join conditioned on the result from calling haversine():
df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
.select(df1.name, df2.name)
You need a cross join since Spark cannot embed the Python UDF in the join condition itself. That's expensive, but it is something PySpark users have to live with.
Here is an example:
>>> df1.show()
+---------+-------------------+--------------------+
| lat| lon| name|
+---------+-------------------+--------------------+
|37.776181|-122.41341399999999|AAE SSFF European...|
|38.959716| -119.945595|Ambassador Motor ...|
| 37.66169| -121.887367|Alameda County Fa...|
+---------+-------------------+--------------------+
>>> df2.show()
+------------------+-------------------+-------------------+
| lat| lon| name|
+------------------+-------------------+-------------------+
| 34.19198813|-118.93756299999998|Daphnes Greek Cafe1|
| 37.755557|-122.25036084651899|Daphnes Greek Cafe2|
|38.423435999999995| -121.41361| Laguna Pizza|
+------------------+-------------------+-------------------+
>>> df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
.select(df1.name.alias("name1"), df2.name.alias("name2")).show()
+--------------------+-------------------+
| name1| name2|
+--------------------+-------------------+
|AAE SSFF European...|Daphnes Greek Cafe2|
|Alameda County Fa...|Daphnes Greek Cafe2|
|Alameda County Fa...| Laguna Pizza|
+--------------------+-------------------+
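If the cross join becomes too expensive, a common mitigation (not part of the original answer) is to pre-filter candidate pairs with a cheap built-in condition before applying the UDF, for example a rough bounding box on latitude and longitude. Below is a sketch; the 1.5-degree margin is an illustrative assumption and would need to be sized conservatively for your distance threshold:
from pyspark.sql import functions as F

# Coarse pre-filter: only pairs whose coordinates differ by less than ~1.5 degrees
# reach the (slow) Python UDF. The margin is an illustrative assumption.
candidates = df1.crossJoin(
    df2.select(
        F.col("lat").alias("lat2"),
        F.col("lon").alias("lon2"),
        F.col("name").alias("name2"),
    )
).where(
    (F.abs(F.col("lat") - F.col("lat2")) < 1.5) &
    (F.abs(F.col("lon") - F.col("lon2")) < 1.5)
)

result = candidates.where(
    haversine(F.col("lat"), F.col("lon"), F.col("lat2"), F.col("lon2")) < 100
).select(F.col("name").alias("name1"), "name2")
result.show()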
We have a lot of files in our S3 bucket. The current PySpark code I have reads each file, checks every column of that file for the keywords, and returns a DataFrame with the count of each keyword per column and file.
Here is the code in PySpark (we are using Databricks to write the code, if that helps):
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col
keywords = ['%keyword1%','%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix+'/'+deployment_id+'/'+pull_id)
result = []
errors = []
try:
    for path in paths:
        df = spark.read.parquet('s3://' + path)
        print(path)
        for keyword in keywords:
            for col in df.columns:  # note: this shadows the imported pyspark `col` function
                filtered_df = df.filter(lower(df[col]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0:
                    # print(col + ' has ' + str(filtered_count) + ' appearances')
                    result.append({'keyword': keyword, 'column': col, 'count': filtered_count, 'table': path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg': e})

try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')

try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks, and Spark. The code here runs very slowly; I know that because we have local Python code that is faster than this one. We wanted to use PySpark and Databricks because we thought they would be faster, and because with the local code we need to enter AWS access keys every day, and sometimes, if a file is huge, it gives a memory error.
NOTE - The above code reads the data faster, but the search functionality seems to be slower compared to the local Python code.
here is the python code in our local system
def search_df(self, keyword, df, regex=False):
    start = time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword, x) is not None if isinstance(x, str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x, str) else False).to_numpy()
I was hoping for some code changes to the PySpark version so that it runs faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to help.
Check out the code below. It defines a function that uses a list comprehension to search each column of the DataFrame for a keyword, and then calls that function for each keyword. A new DataFrame is returned for each keyword; these are then unioned with the reduce function.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k), c)).alias(c) for c in df.columns]).withColumn("keyword", F.lit(k))

lst = [calc(k) for k in keywords]
fDf = reduce(DataFrame.unionByName, lst)
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to keep only the rows with a count greater than 0:
where("Count > 0")
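For example, appending the filter to the final select shown above:
fDf.select("keyword", F.expr(stExpr)).where("Count > 0").show()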
How to implement a User Defined Aggregate Function (UDAF) in PySpark SQL?
pyspark version = 3.0.2
python version = 3.7.10
As a minimal example, I'd like to replace the AVG aggregate function with a UDAF:
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql = SQLContext(sc)
df = sql.createDataFrame(
    pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')
rv = sql.sql('SELECT id, AVG(value) FROM df GROUP BY id').toPandas()
where rv will be:
In [2]: rv
Out[2]:
   id  avg(value)
0   1         1.5
1   2         3.5
How can a UDAF replace AVG in the query?
For example, this does not work:
import numpy as np

def udf_avg(x):
    return np.mean(x)

sql.udf.register('udf_avg', udf_avg)
rv = sql.sql('SELECT id, udf_avg(value) FROM df GROUP BY id').toPandas()
The idea is to implement a UDAF in pure Python for processing not supported by SQL aggregate functions (e.g. a low-pass filter).
A Pandas UDF can be used; the type-hinted definition used below is supported from Spark 3.0 and Python 3.6+. See the issue and documentation for details.
Full implementation in Spark SQL:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')

@pandas_udf(DoubleType())
def avg_udf(s: pd.Series) -> float:
    return s.mean()

spark.udf.register('avg_udf', avg_udf)
rv = spark.sql('SELECT id, avg_udf(value) FROM df GROUP BY id').toPandas()
with the return value:
In [2]: rv
Out[2]:
   id  avg_udf(value)
0   1             1.5
1   2             3.5
You can use a Pandas UDF with GROUPED_AGG type. It receives columns from Spark as Pandas Series, so that you can call Series.mean on the column.
import pyspark.sql.functions as F

@F.pandas_udf('float', F.PandasUDFType.GROUPED_AGG)
def avg_udf(s):
    return s.mean()

df2 = df.groupBy('id').agg(avg_udf('value'))
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
Registering it for use in SQL is also possible:
df.createTempView('df')
spark.udf.register('avg_udf', avg_udf)
df2 = spark.sql("select id, avg_udf(value) from df group by id")
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
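If the eventual goal is processing that returns a value per row within each group (such as the low-pass filter mentioned in the question) rather than a single scalar per group, a grouped-agg Pandas UDF won't fit. Here is a minimal sketch using applyInPandas (Spark 3.0+); the moving-average "filter" and window size are illustrative assumptions, not the actual processing:
import pandas as pd

def lowpass(pdf: pd.DataFrame) -> pd.DataFrame:
    # Illustrative "low-pass filter": a 2-point moving average over each group's
    # values. Note that row order within a group is not guaranteed by Spark.
    pdf = pdf.copy()
    pdf['value'] = pdf['value'].rolling(window=2, min_periods=1).mean()
    return pdf

df2 = df.groupBy('id').applyInPandas(lowpass, schema='id long, value double')
df2.show()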
I have a CSV like this:
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
The problem I'm facing is that whenever I load it, the numbers are shown in scientific notation, and I cannot persist it back without having to specify the precision and scale of my data (I want to use whatever precision is already in the file; I can't infer it).
Here's what I have tried:
Loading it with DoubleType() gives me scientific notation:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])

csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show()
+-----+--------------------+
| COL| VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2| 2.000000001234E8|
|TEST3| 9999.1234679123|
+-----+--------------------+
Loading it with DecimalType() I'm required to specify precision and scale; otherwise, I lose the decimals after the dot. And when I do specify them, besides the risk of not getting the correct value (as my data might be rounded), I get trailing zeros after the dot:
For example, using: StructField('VAL', DecimalType(38, 18)) I get:
[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]
Notice that in this case I have zeros on the right side that I don't want in my new file.
The only way I found to address it was using a UDF, where I first use float() to remove the scientific notation and then convert the value to a string to make sure it will be persisted the way I want:
to_decimal = udf(lambda n: str(float(n)))
df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))
Is there any way to achieve the same result without using the UDF trick?
Thank you!
The best way I found to address it is below. It still uses a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the correct answer yet, because I still expect someone to come up with a solution without a UDF (or a good explanation of why it's not possible without one).
The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
Load the CSV with a generous DecimalType precision and scale (38, 18):
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show(truncate=False)
output:
+-----+----------------------------+
|COL |VAL |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000 |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
When you are ready to report it (print it or save it to a new file), you strip the trailing zeros:
import decimal
import pyspark.sql.functions as F

normalize_decimals = F.udf(lambda dec: dec.normalize())

(df2
 .withColumn('VAL', normalize_decimals(F.col('VAL')))
 .show(truncate=False))
output:
+-----+------------------+
|COL |VAL |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234 |
|TEST3|9999.1234679123 |
|TEST4|123456789.01234567|
+-----+------------------+
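For reference, one UDF-free variant (a sketch, not verified against every edge case) is to cast the decimal to a string and strip the trailing zeros with the built-in regexp_replace; the output column is then a string, which is what the CSV ultimately stores anyway:
import pyspark.sql.functions as F

# Cast the DecimalType column to its full string form, then drop trailing
# zeros (and the trailing dot, if everything after it was zeros).
df3 = df2.withColumn(
    'VAL',
    F.regexp_replace(F.col('VAL').cast('string'), r'\.?0+$', '')
)
df3.show(truncate=False)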
You can use Spark SQL to do that with a query:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkConf: SparkConf = new SparkConf(true)
  .setAppName(this.getClass.getName)
  .setMaster("local[*]")
implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")
val query = "SELECT CAST(VAL AS DECIMAL(38, 18)) AS VAL, COL FROM table"
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)
I am developing a small script in PySpark that generates a date sequence (36 months before today's date), while truncating each date to the first day of the month. Overall I succeeded in this task, but only with the help of the pandas Timedelta to calculate the time delta.
Is there a way to replace this pandas Timedelta with a pure PySpark function?
import pandas as pd
from datetime import date, timedelta, datetime
from pyspark.sql.functions import col, date_trunc
today = datetime.today()
data = [((date(today.year, today.month, 1) - pd.Timedelta(36,'M')),date(today.year, today.month, 1))] # I want to replace this Pandas function
df = spark.createDataFrame(data, ["minDate", "maxDate"])
+----------+----------+
| minDate| maxDate|
+----------+----------+
|2016-10-01|2019-10-01|
+----------+----------+
import pyspark.sql.functions as f

df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
+----------+
| date|
+----------+
|2016-10-01|
|2016-11-01|
|2016-12-01|
|2017-01-01|
|2017-02-01|
|2017-03-01|
etc...
+----------+
You can use PySpark's built-in trunc function.
pyspark.sql.functions.trunc(date, format)
Returns date truncated to the unit specified by the format.
Parameters:
format – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Imagine I have the DataFrame below:
data = [(1,)]
df = spark.createDataFrame(data, ['id'])

import pyspark.sql.functions as f
df = df.withColumn("start_date", f.add_months(f.trunc(f.current_date(), "month"), -36))
df = df.withColumn("max_date", f.trunc(f.current_date(), "month"))
>>> df.show()
+---+----------+----------+
| id|start_date| max_date|
+---+----------+----------+
| 1|2016-10-01|2019-10-01|
+---+----------+----------+
Here's a link with more details on Spark date functions.
Pyspark date Functions
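For the monthly sequence itself (not part of the original answer), the built-in sequence function combined with explode can also replace the repeat/posexplode approach from the question. A sketch, assuming the start_date and max_date columns built above (requires Spark 2.4+):
import pyspark.sql.functions as f

# Expand [start_date, max_date] into one row per month, inclusive of both ends.
dates = (df
         .withColumn("date", f.explode(f.expr("sequence(start_date, max_date, interval 1 month)")))
         .select("date"))
dates.show(50)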
I am using Python in a Spark environment and want to convert a DataFrame column from TIMESTAMP datatype to bigint (UNIX timestamp). The column has this format ("yyyy-MM-dd hh:mm:ss.SSSSSS"):
timestamp_col
2014-06-04 10:09:13.334422
2015-06-03 10:09:13.443322
2015-08-03 10:09:13.232431
I have read around and tried this among others:
from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType
df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))
but the output gives NULL values instead:
+-------------+
|unix_time_col|
+-------------+
|         null|
|         null|
|         null|
+-------------+
I am using Python 3.7 on a Spark-on-Hadoop environment (spark-2.3.1-bin-hadoop2.7) on Google Colaboratory.
I must be missing something. Please, any help?
Please remove ".SSSSSS" from your format string; then the conversion to a unix timestamp will work, i.e. instead of "yyyy-MM-dd hh:mm:ss.SSSSSS" write it as below:
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('abc').getOrCreate()
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]

data_frame = spark.createDataFrame(data, schema=column_schema)
# Parse the string column with the default format and get the UNIX timestamp (seconds).
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
output
+-------------+
|timestamp_col|
+-------------+
| 1401894553|
| 1433344153|
| 1438614553|
+-------------+
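For reference, a format-free alternative is to cast the string column to a timestamp and then to a long, which yields epoch seconds directly. A sketch, assuming the original string column timestamp_col from the question's df:
from pyspark.sql import functions as F

# Casting string -> timestamp handles the fractional seconds, and
# timestamp -> long yields epoch seconds (the fraction is dropped).
df1 = df.select(
    F.col('timestamp_col').cast('timestamp').cast('long').alias('unix_time_col')
)
df1.show()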