Spark Scala, write data with SaveMode.Append while overwriting some existing partition - dataframe

I have the following command in my Spark program:
df.write
.mode(SaveMode.Append)
.partitionBy("year","month","day")
.format(format)
.option("path",path)
.saveAsTable(table_name)
When I run it twice on the same date, I get duplicates in my data. So I want it to append the data, but when some partitions already exist, it should overwrite them.

Here is a complete example using the Hive integration (it also works with the Spark-only catalog):
1) Set up the table:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
import org.apache.spark.sql.types._
val df = spark.range(9).map(x => (x, (x + 100) % 3)).toDF("c1", "c2")
df.repartition($"c2")
.write
.partitionBy("c2")
.mode("overwrite").saveAsTable("tabX")
2) Update a partition (a contrived example) after the setup:
val df2 = spark.range(1).map(x => (x, (x + 100) % 3)).toDF("c1", "c2")
df2.repartition($"c2")
.write
.mode("overwrite").insertInto("tabX")
3) Look at the effects:
// the table goes from 9 to 7 entries
val df3 = spark.table("tabX")
df3.show(false)
returns:
+---+---+
|c1 |c2 |
+---+---+
|2 |0 |
|5 |0 |
|8 |0 |
|1 |2 |
|4 |2 |
|7 |2 |
|0 |1 |
+---+---+
This is the evidence of the dynamic partition overwrite: only partition c2 = 1 was replaced, the other partitions were left untouched.
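Applied back to the question's own table, the same approach would look roughly like the sketch below (assuming Spark 2.3+, that the table was already created once with saveAsTable, and that year, month and day are the last columns of df, since insertInto matches columns by position):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write
  .mode(SaveMode.Overwrite)  // with "dynamic", only the partitions present in df are replaced
  .insertInto(table_name)    // new partitions are added, existing ones are overwritten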

This can be achieved in two steps.
First, add the following Spark conf:
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
Then I used the following function to deal with the cases where I should overwrite a partition or just append:
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.hadoop.fs.{FileSystem, Path}

def writePartitionOverriding(df: DataFrame, tableName: String, path: String, partition: String, spark: SparkSession): Unit = {
  val fullPath = path + partition
  // if the partition directory already exists, overwrite it; otherwise append via saveAsTable
  if (FileSystem.get(spark.sparkContext.hadoopConfiguration).exists(new Path(fullPath)))
    df.write
      .mode(SaveMode.Overwrite)
      .format("orc")
      .partitionBy("year", "month", "day")
      .save(path)
  else
    writePartitionBySaveAsTable(df, tableName, path)
}
def writePartitionBySaveAsTable(df: DataFrame, tableName: String, path: String): Unit =
  df.write
    .mode(SaveMode.Append)
    .format("orc")
    .partitionBy("year", "month", "day")
    .option("path", path)
    .saveAsTable(tableName)
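A hypothetical call of the function above (the table name, base path, and partition sub-path are made up for illustration; the partition string is assumed to be the relative path of the partition under the table location):
val partition = "/year=2021/month=07/day=15"
writePartitionOverriding(df, "my_db.my_table", "/data/my_table", partition, spark)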

Related

How to keep data unique in a certain range in a pyspark dataframe?

Companies can select a section of a road. Sections are denoted by a start and an end.
The pyspark dataframe is below:
+--------------------+----------+--------+
|Road company |start(km) |end(km) |
+--------------------+----------+--------+
|classA |1 |3 |
|classA |4 |7 |
|classA |10 |15 |
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
+--------------------+----------+--------+
The classB company picks its sections of the road first. classA entries must not overlap with classB: that is, a classA company cannot select a section of the road that has already been chosen by a classB company. The result should be as below:
+--------------------+----------+--------+
|Road company |start(km) |end(km) |
+--------------------+----------+--------+
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
+--------------------+----------+--------+
The distinct() function does not support separating the frame into several parts to apply the distinct operation. What should I do to implement that?
If you could partially allocate a section of road, here's a different (very similar) strategy:
start="start(km)"
end="end(km)"
def emptyDFr():
schema = StructType([
StructField(start,IntegerType(),True),
StructField(end,IntegerType(),True),
StructField("Road company",StringType(),True),
StructField("ranged",IntegerType(),True)
])
return spark.createDataFrame(sc.emptyRDD(), schema)
def dummyData():
return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],["classB",1,3],["classB",4,7],["classB",8,17]]).toDF(['Road company','start(km)','end(km)'])
df = dummyData()
df.cache()
df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
.when(col("Road company") == "classA", 2)
.when(col("Road company") == "classC", 3)
).select("Road company").distinct()
# create the sequence of kilometers that cover the 'start' to 'end'
ranged = df.withColumn("range", explode(sequence( col(start), col(end) )) )
whatsLeft = ranged.select( col("range") ).distinct()
result = emptyDFr()
#Only use collect() on small countable sets of data.
for company in df_ordered.collect():
taken = ranged.where(col("Road company") == lit(company[0]))\
.join(whatsLeft, ["range"])
whatsLeft = whatsLeft.subtract( taken.select( col("range") ) )
result = result.union( taken.select( col("range") ,col(start), col(end),col("Road company") ) )
#convert our result back to the 'original style' of records with starts and ends.
result.groupBy( start, end, "Road company").agg(count("ranged").alias("count") )\
#figure out math to see if you got everything you asked for.
.withColumn("Partial", ((col(end)+lit(1)) - col(start)) != col("count"))\
.withColumn("Maths", ((col(end)+lit(1)) - col(start))).show() #helps show why this works not requried.
If you can rely on the fact that sections will never overlap, you can solve this with the logic below. You could likely optimize it to key only on "start(km)", but anything more involved than that gets more complicated.
from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def emptyDF():
    schema = StructType([
        StructField("start(km)", IntegerType(), True),
        StructField("end(km)", IntegerType(), True),
        StructField("Road company", StringType(), True)
    ])
    return spark.createDataFrame(sc.emptyRDD(), schema)

def dummyData():
    return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],["classB",1,3],["classB",4,7],["classB",8,15]]).toDF(['Road company','start(km)','end(km)'])

df = dummyData()
df.cache()
df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
                        .when(col("Road company") == "classA", 2)
                        .when(col("Road company") == "classC", 3)
                       ).select("Road company").distinct()
whatsLeft = df.select(col("start(km)"), col("end(km)")).distinct()
result = emptyDF()
# Only use collect() on small, countable sets of data.
for company in df_ordered.collect():
    taken = df.where(col("Road company") == lit(company[0]))\
              .join(whatsLeft, ["start(km)", "end(km)"])
    whatsLeft = whatsLeft.subtract(taken.drop(col("Road company")))
    result = result.union(taken)
result.show()
+---------+-------+------------+
|start(km)|end(km)|Road company|
+---------+-------+------------+
| 1| 3| classB|
| 4| 7| classB|
| 8| 15| classB|
| 16| 20| classA|
+---------+-------+------------+

How can I update PySpark DataFrame column values under two column conditions using a bitwise AND function?

I need to update a column (Flag, which holds many flags; each flag is a 2^n integer and they are added together) in a PySpark dataframe under two conditions: the Age column value is >= 65, and the Flag column does not already contain the new flag value, which is checked with a bitwise AND: (Flag & newFlag) == 0.
I have demonstrated my work using a sample dataframe and Python script (please see below) but encountered an error message.
The error message is: AnalysisException: cannot resolve '(Flag AND 2)' due to data type mismatch: '(Flag AND 2)' requires boolean type, not int;
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *
# create a data frame with two columns: Age and Flag and three rows
data = [
(61,0),
(65,1),
(66,10) #previously inserted flags 2 and 8 add up to 10; each flag is 2^n
]
schema = StructType([ \
StructField("Age",IntegerType(), True), \
StructField("Flag",IntegerType(), True) \
])
df = spark.createDataFrame(data=data,schema=schema)
#df.printSchema()
df.show(truncate=False)
N_FLAG_AGE65=2
new_column = when(
(col("Age") >= 65) & ((col("Flag") & lit(N_FLAG_AGE65) == 0)),
col("Flag")+N_FLAG_AGE65
).otherwise(col("Flag"))
df = df.withColumn("Flag", new_column)
df.show(truncate=False)
After the input df is constructed, the first df.show(truncate=False) should display:
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |1 |
|66 |10 |
+---+----+
My update logic checks both columns (Age and Flag): if Age >= 65 and the Flag bitmask does not already contain N_FLAG_AGE65, we update the Flag field with Flag = Flag + N_FLAG_AGE65. Thus, the expected result should be:
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |3 |
|66 |10 |
+---+----+
I think the original "new_column" conditional expression won't work with df = df.withColumn("Flag", new_column) because PySpark's & operator on columns is a logical AND, not a bitwise AND (hence the "requires boolean type" error).
I changed the syntax and it works now in the code below: I added a constant column Flag65_exp (lit(N_FLAG_AGE65)) and used expr("case when Age >= 65 and Flag & Flag65_exp = 0 then Flag + Flag65_exp else Flag end") in df.withColumn("Flag", expr(...)), since in Spark SQL the & operator is a bitwise AND.
%python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *

# create a data frame with two columns: Age and Flag, and three rows
data = [
    (61, 0),
    (65, 1),
    (66, 10)  # previously inserted flags 2 and 8 add up to 10; each flag is 2^n
]
schema = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Flag", IntegerType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
#df.printSchema()
df.show(truncate=False)

N_FLAG_AGE65 = 2
df = df.withColumn('Flag65_exp', lit(N_FLAG_AGE65))
# in Spark SQL the & operator is a bitwise AND, so the condition works inside expr()
df = df.withColumn("Flag", expr("case when Age >= 65 and Flag & Flag65_exp = 0 then Flag + Flag65_exp else Flag end"))
df.show(truncate=False)
#source df
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |1 |
|66 |10 |
+---+----+
#updated df
+---+----+----------+
|Age|Flag|Flag65_exp|
+---+----+----------+
|61 |0 |2 |
|65 |3 |2 |
|66 |10 |2 |
+---+----+----------+

Getting start and end indices of string in Pandas

I have a df that looks like this:
|Index|Value|Anomaly|
---------------------
|0 |4 | |
|1 |2 |Anomaly|
|2 |1 |Anomaly|
|3 |2 | |
|4 |6 |Anomaly|
I want to get the start and end indices of the consecutive anomaly runs, so in this case it will be [[1,2],[4]].
I understand I have to use .shift and .cumsum but I am lost and I hope someone would be able to enlighten me.
Get consecutive groups by taking the cumsum of the Boolean Series that checks where the value is not 'Anomaly'. Use where so that we only take the 'Anomaly' rows. Then we can loop over the groups and grab the indices.
m = df['Anomaly'].ne('Anomaly')
[[idx[0], idx[-1]] if len(idx) > 1 else [idx[0]]
for idx in df.groupby(m.cumsum().where(~m)).groups.values()]
#[[1, 2], [4]]
Or, if you want to use a much longer groupby, you can get the first and last index, then drop duplicates (to deal with streaks of only 1) and get it into a list of lists. This is much slower, though:
(df.reset_index().groupby(m.cumsum().where(~m))['index'].agg(['first', 'last'])
.stack()
.drop_duplicates()
.groupby(level=0).agg(list)
.tolist())
#[[1, 2], [4]]

How to group multiple arrays into one, then flatten and find distinct items

Given a dataframe like the one below:
val df = Seq(
(1, Seq("USD", "CAD")),
(2, Seq("AUD", "YEN", "USD")),
(2, Seq("GBP", "AUD", "YEN")),
(3, Seq("BRL", "AUS", "BND","BOB","BWP")),
(3, Seq("XAF", "CLP", "BRL")),
(3, Seq("XAF", "CNY", "KMF","CSK","EGP")
)
).toDF("ACC", "CCY")
+---+-------------------------+
|ACC|CCY |
+---+-------------------------+
|1 |[USD, CAD] |
|2 |[AUD, YEN, USD] |
|2 |[GBP, AUD, YEN] |
|3 |[BRL, AUS, BND, BOB, BWP]|
|3 |[XAF, CLP, BRL] |
|3 |[XAF, CNY, KMF, CSK, EGP]|
+---+-------------------------+
This has to be transformed as below by removing the duplicates too.
Spark Version = 2.0
Scala Version = 2.10
+---+-------------------------------------------------------+
|ACC|CCY |
+---+-------------------------------------------------------+
|1 |[USD,CAD] |
|2 |[AUD,YEN,USD,GBP] |
|3 |[BRL,AUS,BND,BOB,BWP,XAF,CLP,CNY,KMF,CSK,EGP] |
+---+-------------------------------------------------------+
I tried grouping by ACC column and aggregating the CCY but not sure where to go from there.
Can this be done without using UDF? If NO, then how would I go about this using UDF?
Please advise.
The following code should return the expected results:
import scala.collection.mutable.WrappedArray
val df = Seq(
(1, Seq("USD", "CAD")),
(2, Seq("AUD", "YEN", "USD")),
(2, Seq("GBP", "AUD", "YEN")),
(3, Seq("BRL", "AUS", "BND", "BOB", "BWP")),
(3, Seq("XAF", "CLP", "BRL")),
(3, Seq("XAF", "CNY", "KMF", "CSK", "EGP")
)
).toDF("ACC", "CCY")
val castToArray = udf((ccy: WrappedArray[WrappedArray[String]]) => ccy.flatten.distinct.toArray)
val df2 = df.groupBy($"ACC")
  .agg(collect_list($"CCY").as("CCY"))
  .withColumn("CCY", castToArray($"CCY"))
df2.show(false)
First I use groupBy("ACC"); then, with the collect_list aggregate, all the arrays are collected into one nested array per ACC. Next, inside the udf, the nested CCY values are flattened and deduplicated.
Output:
+---+-------------------------------------------------------+
|ACC|CCY |
+---+-------------------------------------------------------+
|1 |[USD, CAD] |
|3 |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2 |[AUD, YEN, USD, GBP] |
+---+-------------------------------------------------------+
Good luck
UPDATE:
In Spark >= 2.4 you can use the built-in flatten and array_distinct functions and avoid using a udf:
df.groupBy($"ACC")
.agg(collect_list($"CCY").as("CCY"))
.select($"ACC", array_distinct(flatten($"CCY")).as("CCY"))
.show(false)
//Output
+---+-------------------------------------------------------+
|ACC|CCY |
+---+-------------------------------------------------------+
|1 |[USD, CAD] |
|3 |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2 |[AUD, YEN, USD, GBP] |
+---+-------------------------------------------------------+
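For Spark 2.0, where flatten and array_distinct are not available, another way to avoid a udf (a sketch on the same df; note that collect_set does not guarantee the order of the elements) is to explode the arrays and re-collect the distinct values:
import org.apache.spark.sql.functions.{explode, collect_set}
df.select($"ACC", explode($"CCY").as("ccy"))
  .groupBy($"ACC")
  .agg(collect_set($"ccy").as("CCY"))
  .show(false)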

find the closest time between two tables in spark

I am using pyspark and I have two dataframes like this:
user time bus
A 2016/07/18 12:00:00 1
B 2016/07/19 12:00:00 2
C 2016/07/20 12:00:00 3
bus time stop
1 2016/07/18 11:59:40 sA
1 2016/07/18 11:59:50 sB
1 2016/07/18 12:00:05 sC
2 2016/07/19 11:59:40 sB
2 2016/07/19 12:00:10 sC
3 2016/07/20 11:59:55 sD
3 2016/07/20 12:00:10 sE
Now I want to know at which stop the user reports according to the bus number and the closest time in the second table.
For example, in table 1, user A reports at 2016/07/18 12:00:00 and he is on bus No. 1. According to the second table, there are three records for bus No. 1, but the closest time is 2016/07/18 12:00:05 (the third record), so the user is at sC now.
The desired output should be like this:
user time bus stop
A 2016/07/18 12:00:00 1 sC
B 2016/07/19 12:00:00 2 sC
C 2016/07/20 12:00:00 3 sD
I have converted the times into timestamps, so the only problem is to find the closest timestamp where the bus number is equal.
Because I'm not familiar with SQL right now, I tried to use a map function to find the closest time and its stop, which means I have to use sqlContext.sql inside the map function, and Spark doesn't seem to allow this:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
So how can I write a sql query to get the right output?
This can be done using window functions.
from datetime import datetime
from pyspark.sql.window import Window
from pyspark.sql import Row, functions as W
from pyspark.sql.functions import col, abs, unix_timestamp

def tm(s):
    return datetime.strptime(s, "%Y/%m/%d %H:%M:%S")
#setup data
userTime = [ Row(user="A",time=tm("2016/07/18 12:00:00"),bus = 1) ]
userTime.append(Row(user="B",time=tm("2016/07/19 12:00:00"),bus = 2))
userTime.append(Row(user="C",time=tm("2016/07/20 12:00:00"),bus = 3))
busTime = [ Row(bus=1,time=tm("2016/07/18 11:59:40"),stop = "sA") ]
busTime.append(Row(bus=1,time=tm("2016/07/18 11:59:50"),stop = "sB"))
busTime.append(Row(bus=1,time=tm("2016/07/18 12:00:05"),stop = "sC"))
busTime.append(Row(bus=2,time=tm("2016/07/19 11:59:40"),stop = "sB"))
busTime.append(Row(bus=2,time=tm("2016/07/19 12:00:10"),stop = "sC"))
busTime.append(Row(bus=3,time=tm("2016/07/20 11:59:55"),stop = "sD"))
busTime.append(Row(bus=3,time=tm("2016/07/20 12:00:10"),stop = "sE"))
#create DataFrames
userDf = sc.parallelize(userTime).toDF().alias("usertime")
busDf = sc.parallelize(busTime).toDF().alias("bustime")
joinedDF = userDf.join(busDf,col("usertime.bus") == col("bustime.bus"),"inner").select(
userDf.user,
userDf.time.alias("user_time"),
busDf.bus,
busDf.time.alias("bus_time"),
busDf.stop)
additional_cols = joinedDF.withColumn("bus_time_diff", abs(unix_timestamp(col("bus_time")) - unix_timestamp(col("user_time"))))
partDf = additional_cols.select("user","user_time","bus","bus_time","stop","bus_time_diff", W.row_number().over(Window.partitionBy("user","bus").orderBy("bus_time_diff")).alias("rank")).filter(col("rank") == 1)
additional_cols.show(20,False)
partDf.show(20,False)
Output:
+----+---------------------+---+---------------------+----+-------------+
|user|user_time |bus|bus_time |stop|bus_time_diff|
+----+---------------------+---+---------------------+----+-------------+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:40.0|sA |20 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:50.0|sB |10 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 11:59:40.0|sB |20 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 12:00:10.0|sE |10 |
+----+---------------------+---+---------------------+----+-------------+
+----+---------------------+---+---------------------+----+-------------+----+
|user|user_time |bus|bus_time |stop|bus_time_diff|rank|
+----+---------------------+---+---------------------+----+-------------+----+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |1 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |1 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |1 |
+----+---------------------+---+---------------------+----+-------------+----+