PySpark dataframe data insert into Hive table - hive

I have a requirement to create a final dataframe after applying mapping conditions.
For example, if I have a final dataframe like the one below:
df.show()
+---+---+
|a |b |
+---+---+
|c |2 |
+---+---+
printSchema:
a: string (nullable = true)
b: integer (nullable = true)
I have to load this final dataframe into a Hive table that has the same columns but a different schema, where some of the columns do not accept null values.
For example, if column 'a' in the dataframe above contains a null value, that particular row should not be written to the Hive table.
I'm using the command below to write into the table:
df.write.mode("append").format("parquet").saveAsTable(table_name)
So, should I change the schema before proceeding with the table append?
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("a", StringType(), False), StructField("b", IntegerType(), True)])
df_updated = spark.createDataFrame(df.rdd, schema)
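Note that changing nullability in the schema does not by itself drop the rows that contain nulls. A minimal sketch of keeping only non-null rows before the append, assuming the same df and table_name as above (the df_non_null name is just illustrative):
# drop rows where the non-nullable column 'a' is null, then append as before
df_non_null = df.filter(df["a"].isNotNull())
df_non_null.write.mode("append").format("parquet").saveAsTable(table_name)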

Related

How to filter and select columns and merge streaming dataframes in spark?

I have a streaming dataframe and I am not sure of the best way to solve this issue. The input looks like this:
|ID |latitude|longitude|
|---|--------|---------|
|A  |28      |30       |
|B  |40      |52       |
Transform to:
|A      |B      |Distance          |
|-------|-------|------------------|
|(28,30)|(40,52)|calculate distance|
I need to transform it to this and add a distance column to which I pass the coordinates.
I am thinking about producing 2 data streams that are filtered on the A coordinates and the B coordinates respectively. I would then A.join(B).withColumn(distance) and stream the output. Is this the way to go about solving this problem?
Is there a way to pivot the readStream data into the needed format without aggregation, which could be faster than making 2 filtered streaming dataframes and merging them?
Can I add an array column of coordinates in a streaming dataset?
I am not sure how performant this will be, but you can use pivot to force rows of the ID column to become new columns and sum the individual latitude and longitude as a way to obtain the value itself (since there is no F.identity). This will get you the following result:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
)
+----------+-----------+----------+-----------+
|A_latitude|A_longitude|B_latitude|B_longitude|
+----------+-----------+----------+-----------+
| 28| 30| 40| 52|
+----------+-----------+----------+-----------+
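Since each ID occurs in exactly one row here, F.first should work just as well as F.sum as the "identity" aggregate; a hedged variant of the snippet above (not tested against a streaming query):
streaming_df.groupby().pivot('ID').agg(
    F.first('latitude').alias('latitude'),
    F.first('longitude').alias('longitude')
)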
Then you can use F.struct to create columns A and B using the latitude and longitude columns:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
)
+----------+-----------+----------+-----------+--------+--------+
|A_latitude|A_longitude|B_latitude|B_longitude| A| B|
+----------+-----------+----------+-----------+--------+--------+
| 28| 30| 40| 52|{28, 30}|{40, 52}|
+----------+-----------+----------+-----------+--------+--------+
The last step is to use a udf to calculate geographic distance, which has been answered here. Putting this all together:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    return geodesic(a, b).m
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).withColumn(
    'distance', geodesic_udf(F.array('B.B_longitude', 'B.B_latitude'), F.array('A.A_longitude', 'A.A_latitude'))
).select(
    'A', 'B', 'distance'
)
+--------+--------+---------+
| A| B| distance|
+--------+--------+---------+
|{28, 30}|{40, 52}|2635478.5|
+--------+--------+---------+
EDIT: When I answered your question, I let pyspark infer the datatype of each column, but I also tried to more closely reproduce the schema for your streaming dataframe by specifying the column types:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

streaming_df = spark.createDataFrame(
    [
        ("A", 28., 30.),
        ("B", 40., 52.),
    ],
    StructType([
        StructField("ID", StringType(), True),
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
)
streaming_df.printSchema()
root
 |-- ID: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
The end result is still the same:
+------------+------------+---------+
| A| B| distance|
+------------+------------+---------+
|{28.0, 30.0}|{40.0, 52.0}|2635478.5|
+------------+------------+---------+
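As for the last sub-question, adding an array column of coordinates to a streaming dataframe is an ordinary row-wise transformation, so a sketch like the one below should work (column names assumed from the example data):
# pack the two coordinate columns into a single array column
streaming_df.withColumn('coordinates', F.array('latitude', 'longitude'))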

How to add multiple columns dynamically based on filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below.
source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter(source_df[column] != dest_df[column])\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is:
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
Dataframes are immutable objects. Having said that, in your loop report is rebuilt from joined_df on every iteration, so only the last column survives; you need to create the next dataframe from the one generated in the previous iteration. Something like below:
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    if column != columns[-1]:
        column_name = "difference_in_" + str(column)
        report = joined_df\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        column_name = "difference_in_" + str(column)
        report1 = report\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc| 1.1| john| 2.1| jack| [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
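A sketch of a more general version of the same idea, assuming the same source_df/dest_df and that every compared column should be treated the same way: keep reassigning one dataframe inside the loop, so it works for any number of columns. Note that, as in the code above, a column whose values match on both sides will filter rows out.
from pyspark.sql import functions as F

report = source_df.join(dest_df, 'key', 'full')
for column in source_df.columns[1:]:
    report = report\
        .filter(source_df[column] != dest_df[column])\
        .withColumn("difference_in_" + column,
                    F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
report.show()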
You could also do this with a union of both dataframes, collecting the list only when the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:"))).groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1, F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols]).dropna(subset=cols, how='all')
          )
output.show()
+---+------------------+--------------------+
|key| val11| val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
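If you want the column names from the expected output, you could rename them after the aggregation; a small sketch, assuming the same output and cols as above:
output = output.select("key", *[F.col(c).alias("difference_in_" + c) for c in cols])
output.show()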

Get column from another map type column spark scala dataframe

I have a dataframe with one column whose format corresponds to this case class:
case class FeaturizedDataset(
  indices: Array[String],
  values: Array[Float]
)
The table is something like this
|sourceId|scoreMapping |
|--------|-----------------------|
|3 |{[1,3,4],[0.1,0.2,0.3]}|
|4 |{[1,3,4],[0.1,0.2,0.3]}|
|1 |{[1,3,4],[0.1,0.2,0.3]}|
|4 |{[1,3,4],[0.1,0.2,0.3]}|
I would like to get the score corresponding to the sourceId in Scala. How can I do this?
For example, if sourceId = 3 => score = 0.2; if sourceId = 1 => score = 0.1 ...
Using arrays_zip + filter functions:
val df1 = df.withColumn(
  "tmp",
  arrays_zip(col("scoreMapping.indices"), col("scoreMapping.values"))
).withColumn(
  "score",
  filter(col("tmp"), x => x("0") === col("sourceId"))(0)("1")
).drop("tmp")
You zip the scoreMapping indices and values, then filter where the index equals sourceId.

Merge multiple spark rows to one

I have a dataframe which looks like the one given below. All the values for a given id are the same except for the mappingcol field.
+----+------+---------------------+---+
|misc|fruit |mappingcol           |id |
+----+------+---------------------+---+
|ddd |apple |Map("name"->"Sameer")|1  |
|ref |banana|Map("name"->"Riyazi")|2  |
|ref |banana|Map("lname"->"Nikki")|2  |
|ddd |apple |Map("lname"->"tenka")|1  |
+----+------+---------------------+---+
I want to merge the rows with the same id in such a way that I get exactly one row per id, with the values of mappingcol merged. The output should look like:
+----+------+---------------------+---+
|misc|fruit |mappingcol           |id |
+----+------+---------------------+---+
|ddd |apple |Map("name"->"Sameer")|1  |
|ref |banana|Map("name"->"Riyazi")|2  |
+----+------+---------------------+---+
the value for mappingcol for id = 1 would be :
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know that maps can be merged using the ++ operator, so that's not what I'm worried about. I just can't figure out how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows on.
You can use groupBy and then manage the map a little:
df.groupBy("id", "fruit", "misc").agg(collect_list("mappingcol"))
  .as[(Int, String, String, Seq[Map[String, String]])]
  .map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
  .toDF("id", "fruit", "misc", "mappingColumn")
With the first line, you group by your desired columns and aggregate the map pairs into a single element (an array).
With the second line (as), you convert your structure to a Dataset of a Tuple4, with the last element being a sequence of maps.
With the third line (map), you merge all the maps into a single map.
With the last line (toDF), you give the columns their original names.
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
You can definitely do the above with a Window function!
This is in PySpark, not Scala, but there's almost no difference when only using native Spark functions.
The code below only works on a map column that has one key-value pair per row, as in your example data, but it can be made to work with map columns that have multiple entries.
from pyspark.sql import Window
import pyspark.sql.functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']

# or, a lazier way if you have a lot of columns to group on
group_cols_2 = df.columns       # save as a list
group_cols_2.remove(map_col)    # remove what you're not grouping by (list.remove mutates in place)

w = Window.partitionBy(group_cols)

# unpack the map's key and value into a (key, value) struct column
df1 = df.withColumn(map_col, F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))

# collect all key-value structs into an array; after this, each row
# contains the map entries for all rows in the group/window
df1 = df1.withColumn(map_col, F.collect_list(map_col).over(w))

# drop duplicate rows, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)

# turn the array of structs back into a map column
df1 = df1.withColumn(map_col, F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import Window
import pyspark.sql.functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
w = Window.partitionBy(group_cols)

df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))

df1.show(truncate=False)
df1.printSchema()

df1 = df1.dropDuplicates(group_cols)

Adding a column from one dataframe (df1) to another dataframe (df2)

I need some help with this Apache Spark (pyspark) issue.
I have a dataframe (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I've another DataFrame, which contains 2 Columns - EmpId & Timestamp
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts |
+----+-------------------+
|1 |1999-10-24 21:18:23|
|1 |2019-10-24 21:18:26|
|2 |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the 2nd dataframe based on the condition txnTime_ts < max_timestamp.
What I'm trying to do -> add the column 'max_timestamp' to the 2nd dataframe and filter records by comparing the 2 values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
PySpark does not let me add the column from maxTS2 to the dataframe df_masterdata.
Error:
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DF with a single row/column, the most efficient way to accomplish this would be to extract the value from the dataframe and then filter df_masterdata against it. If you nevertheless need to do this within the context of a dataframe, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
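For completeness, a minimal sketch of the "extract the value first" option mentioned above, assuming df1 really has exactly one row and that F is pyspark.sql.functions:
# pull the scalar out of the single-row dataframe, then filter against it
max_ts = df1.collect()[0]['max_timestamp']
df_masterdata1 = df_masterdata.filter(F.col('txnTime_ts') < F.lit(max_ts))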