How to separate a collection inside a column in a Spark dataframe and pass the values to another column?

Suppose this is a dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
Row("Ramesh",List("English","German"),Map("hair"->"black","eye"->"brown")),
Row("Vijay",List("Spark","French",null),Map("hair"->"brown","eye"->null)),
Row("Yann",List("Mandrin",""),Map("hair"->"red","eye"->"")),
Row("Ram",null,null),
Row("Jefferson",List(),Map())
)
val schema = new StructType()
.add("name",StringType)
.add("languages", ArrayType(StringType))
.add("properties", MapType(StringType,StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
OUTPUT :
+----------+--------------+
|name      |languages     |
+----------+--------------+
I want the output to list each name with atomic values; nulls should also be covered.

Use explode_outer from org.apache.spark.sql.functions. This flattens the collection column in the dataframe, and the _outer variant also produces a row when the collection is null.
df.select($"name",explode_outer($"languages"))
.show(false)
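For comparison, a minimal PySpark sketch of the same idea (assuming the same df is in scope):

from pyspark.sql import functions as F

# explode_outer emits one row per array element, and keeps a row with null
# when the array is null or empty
df.select("name", F.explode_outer("languages").alias("language")).show(truncate=False)

# the map column can be flattened the same way, producing key/value columns
df.select("name", F.explode_outer("properties")).show(truncate=False)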

Related

How to add multiple columns dynamically based on a filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below:
source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')

for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter(source_df[column] != dest_df[column])\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                          F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
DataFrames are immutable objects. That said, you need to create another dataframe using the one generated in the first iteration. Something like below:
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')

for column in columns:
    if column != columns[-1]:
        column_name = "difference_in_" + str(column)
        report = joined_df\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        column_name = "difference_in_" + str(column)
        report1 = report\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc|  1.1| john|  2.1| jack|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
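Another way to express this, sketched here with when() (assuming source_df, dest_df and the F import from the question), is to build every difference column in a single select, which sidesteps the loop-overwriting problem entirely:

diff_cols = [
    F.when(source_df[c] != dest_df[c],
           F.concat(F.lit('[src:'), source_df[c], F.lit(',dst:'), dest_df[c], F.lit(']'))
          ).alias("difference_in_" + c)
    for c in source_df.columns[1:]
]

# keep only rows where at least one column actually differs
report = (source_df.join(dest_df, 'key', 'full')
          .select('key', *diff_cols)
          .dropna(how='all', subset=["difference_in_" + c for c in source_df.columns[1:]]))
report.show(truncate=False)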
You could also do this with a union of both dataframes and then collect the list only if the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
          .groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1,
                        F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols])
          .dropna(subset=cols, how='all'))
output.show()
+---+------------------+--------------------+
|key|             val11|               val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
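If you also want the difference_in_* column names and the [src:…,dst:…] formatting from the expected output, a small follow-up sketch on top of output (still using cols and F from above):

report = output.select(
    "key",
    *[F.when(F.col(c).isNotNull(),
             F.concat(F.lit('['), F.concat_ws(',', c), F.lit(']'))
            ).alias("difference_in_" + c)
      for c in cols]
)
report.show(truncate=False)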

Merge multiple spark rows to one

I have a dataframe which looks like the one given below. All the values for a given id are the same except for the mappingcol field.
+----+------+---------------------+---+
|misc|fruit |mappingcol           |id |
+----+------+---------------------+---+
|ddd |apple |Map("name"->"Sameer")|1  |
|ref |banana|Map("name"->"Riyazi")|2  |
|ref |banana|Map("lname"->"Nikki")|2  |
|ddd |apple |Map("lname"->"tenka")|1  |
+----+------+---------------------+---+
I want to merge the rows with the same id in such a way that I get exactly one row per id, with the values of mappingcol merged. The output should look like:
+----+------+---------------------+---+
|misc|fruit |mappingcol           |id |
+----+------+---------------------+---+
|ddd |apple |Map("name"->"Sameer")|1  |
|ref |banana|Map("name"->"Riyazi")|2  |
+----+------+---------------------+---+
The value of mappingcol for id = 1 would be:
Map(
  "name" -> "Sameer",
  "lname" -> "tenka"
)
I know that maps can be merged using the ++ operator, so that's not what I'm worried about. I just can't work out how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows on.
You can use groupBy and then do a little work on the resulting map column:
df.groupBy("id", "fruit", "misc").agg(collect_list("mappingcol"))
.as[(Int, String, String, Seq[Map[String, String]])]
.map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
.toDF("id", "fruit", "misc", "mappingColumn")
With the first line, you group by your desired columns and aggregate the map entries into a single element (an array).
With the second line (as), you convert the structure to a Dataset of a Tuple4 whose last element is a sequence of maps.
With the third line (map), you merge all the maps into a single one.
With the last line (toDF), you give the columns their original names.
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1  |apple |ddd |[name -> Sameer, lname -> tenka]|
|2  |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
You can definitely do the above with a Window function!
This is in PySpark not Scala but there's almost no difference when only using native Spark functions.
The code below only works on a map column that has one key/value pair per row, as in your example data, but it can be made to work with map columns that have multiple entries.
from pyspark.sql import Window
from pyspark.sql import functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']

# or, a lazier way if you have a lot of columns to group on
cols = df.columns        # save as a list
cols.remove(map_col)     # list.remove() mutates in place and returns None
group_cols_2 = cols      # everything except the map column

w = Window.partitionBy(group_cols)

# unpack the map's key and value into a (key, value) struct column
df1 = df.withColumn(map_col, F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
# collect all key/value structs into an array; each row now contains
# the map entries for all rows in its group/window
df1 = df1.withColumn(map_col, F.collect_list(map_col).over(w))
# drop duplicate rows, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)
# turn the array of entries back into a map
df1 = df1.withColumn(map_col, F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import Window
from pyspark.sql import functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
w = Window.partitionBy(group_cols)

df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))
df1.show(truncate=False)
df1.printSchema()
df1 = df1.dropDuplicates(group_cols)
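If you would rather skip the Window and dropDuplicates steps, a groupBy-based sketch of the same merge (assuming the question's original column name mappingcol and Spark 2.4+ for map_entries/flatten/map_from_entries):

from pyspark.sql import functions as F

merged = (df.groupBy("id", "fruit", "misc")
            .agg(F.map_from_entries(
                     F.flatten(F.collect_list(F.map_entries("mappingcol")))
                 ).alias("mappingcol")))
merged.show(truncate=False)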

Finding Unique Image name from Image column using Pyspark or SQL

I have a data set which looks like this:
key|StateName_13|lon|lat|col5_13|col6_13|col7_13|ImageName|elevation_13|Counter_13
P00005K9XESU|FL|-80.854196|26.712385|128402000128038||183.30198669433594|USGS_NED_13_n27w081_IMG.img|3.7742109298706055|1
P00005KC31Y7|FL|-80.854196|26.712385|128402000128038||174.34959411621094|USGS_NED_13_n27w082_IMG.img|3.553356885910034|1
P00005KC320M|FL|-80.846966|26.713182|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|2.2236201763153076|1
P00005KC320M|FL|-80.84617434521485|26.713200344482424|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|2.7960102558135986|2
P00005KC320M|FL|-80.84538|26.713219|128402000100953||520.3673706054688|USGS_NED_13_n27w081_IMG.img|1.7564013004302979|3
P00005KC31Y6|FL|-80.854155|26.712083|128402000128038||169.80172729492188|USGS_NED_13_n27w081_IMG.img|3.2237753868103027|1
P00005KATEL2|FL|-80.861664|26.703649|128402000122910||38.789894104003906|USGS_NED_13_n27w081_IMG.img|3.235154628753662|1
In this dataset, I want to find duplicate lon/lat pairs and get the names of the images corresponding to those lon and lat values.
Output should look like this:
lon|lat|ImageName
-80.854196|26.712385|USGS_NED_13_n27w081_IMG.img,USGS_NED_13_n27w082_IMG.img
Rows 1 and 2 have the same lon and lat values but different image names.
Any pyspark code or sql query works.
Using @giser_yugang's comment, we can do something like this:
from pyspark.sql import functions as F
df = df.groupby(
    'lon',
    'lat'
).agg(
    F.collect_set('ImageName').alias("ImageNames")
).where(
    F.size("ImageNames") > 1
)
df.show(truncate=False)
+----------+---------+----------------------------------------------------------+
|lon |lat |ImageNames |
+----------+---------+----------------------------------------------------------+
|-80.854196|26.712385|[USGS_NED_13_n27w081_IMG.img, USGS_NED_13_n27w082_IMG.img]|
+----------+---------+----------------------------------------------------------+
If you need to write it to a CSV, as that format does not support ArrayType, you can use concat_ws:
df = df.withColumn(
    "ImageNames",
    F.concat_ws(", ", "ImageNames")
)
df.show(truncate=False)
+----------+---------+--------------------------------------------------------+
|lon |lat |ImageNames |
+----------+---------+--------------------------------------------------------+
|-80.854196|26.712385|USGS_NED_13_n27w081_IMG.img, USGS_NED_13_n27w082_IMG.img|
+----------+---------+--------------------------------------------------------+
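Since the question says a SQL query is also fine, here is an equivalent Spark SQL sketch (the temp view name images is just an assumption):

df.createOrReplaceTempView("images")
spark.sql("""
    SELECT lon, lat, concat_ws(', ', collect_set(ImageName)) AS ImageNames
    FROM images
    GROUP BY lon, lat
    HAVING size(collect_set(ImageName)) > 1
""").show(truncate=False)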

Spark SQL: How to apply specific functions to all specified columns

Is there an easy way to run SQL functions on multiple columns in Spark SQL?
For example, let's say I have a query that should be applied to most columns:
select
  min(c1) as min,
  max(c1) as max,
  max(c1) - min(c1) as range
from tb1
If there are multiple columns, is there a way to execute the query for all the columns and get the result in one go, similar to how df.describe does?
Use the metadata (the columns, in this case) included in your dataframe (which you can get via spark.table("<table_name>") if you don't already have it in scope) to get the column names, then apply the functions you want and pass them to df.select (or df.selectExpr).
Build some test data:
scala> import org.apache.spark.sql.functions.expr
scala> val r = new scala.util.Random
scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> (1 to 1000).foreach(n => { seq = seq :+ (n, r.nextInt, r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Set up the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and functions to apply to populate the select_columns Seq:
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => {select_columns = select_columns :+ expr(s"$f($col)")})})
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
|      1|   1000|  -2143898568|   2147289642|   1.8781424E-4|     0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
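The same pattern translates directly to PySpark; a sketch, assuming a DataFrame df is already in scope:

from pyspark.sql import functions as F

functions_to_apply = ["min", "max"]
select_columns = [F.expr(f"{f}({c})") for c in df.columns for f in functions_to_apply]
df.select(*select_columns).show()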

Fetching distinct values on a column using Spark DataFrame

Using Spark version 1.6.1, I need to fetch distinct values of a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?
import sqlContext.implicits._

preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
  val applicationId = x.getAs[String](ApplicationId)
  val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
  // DO SOME TASK PER applicationId
})

preProcessedData.unpersist()
Well, to obtain all the different values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.
For example:
import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// I obtain all the different values. If you show them you should see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. Here it is a simple function, but it can get as complicated as you need.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In PySpark, try this:
df.select('col_name').distinct().show()
This solution demonstrates how to transform data with Spark native functions, which are preferable to UDFs. It also demonstrates dropDuplicates, which is more suitable than distinct for certain queries.
Suppose you have this DataFrame:
+-------+-------------+
|country|    continent|
+-------+-------------+
|  china|         asia|
| brazil|south america|
| france|       europe|
|  china|         asia|
+-------+-------------+
Here's how to take all the distinct countries and run a transformation:
df
.select("country")
.distinct
.withColumn("country", concat(col("country"), lit(" is fun!")))
.show()
+--------------+
|       country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+
You can use dropDuplicates instead of distinct if you don't want to lose the continent information:
df
.dropDuplicates("country")
.withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
.show(false)
+-------+-------------+------------------------------------+
|country|continent    |description                         |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe       |france is a country in europe       |
|china  |asia         |china is a country in asia          |
+-------+-------------+------------------------------------+
See here for more information about filtering DataFrames and here for more information on dropping duplicates.
Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
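In PySpark the analogous hook is DataFrame.transform (available since Spark 3.0); a minimal sketch with an illustrative function name:

from pyspark.sql import functions as F

def with_fun_suffix(df):
    # illustrative custom transformation: append " is fun!" to the country column
    return df.withColumn("country", F.concat(F.col("country"), F.lit(" is fun!")))

df.select("country").distinct().transform(with_fun_suffix).show()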
df = df.select("column1", "column2", ..., "column N").distinct().collect()
Between distinct() and collect() you can chain additional calls, for example toJSON(), if you want the result in JSON format.
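For instance, a minimal PySpark sketch of that chain (the column names are placeholders):

distinct_json = df.select("column1", "column2").distinct().toJSON().collect()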