Spark dataframes groupby into list - dataframe

I am trying to do some analysis on sets. I have a sample data set that looks like this:
orders.json
{"items":[1,2,3,4,5]}
{"items":[1,2,5]}
{"items":[1,3,5]}
{"items":[3,4,5]}
All it is, is a single field that is a list of numbers that represent IDs.
Here is the Spark script I am trying to run:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("Dataframe Test")
val sc = new SparkContext(sparkConf)
val sql = new SQLContext(sc)
val dataframe = sql.read.json("orders.json")
val expanded = dataframe
  .explode[::[Long], Long]("items", "item1")(row => row)
  .explode[::[Long], Long]("items", "item2")(row => row)
val grouped = expanded
  .where(expanded("item1") !== expanded("item2"))
  .groupBy("item1", "item2")
  .count()
val recs = grouped
  .groupBy("item1")
Creating expanded and grouped works fine. In a nutshell, expanded is a list of all the possible pairs of IDs where the two IDs appeared in the same original set. grouped filters out IDs that were paired with themselves, then groups all the unique pairs of IDs and produces a count for each. The schema and a data sample of grouped are:
root
|-- item1: long (nullable = true)
|-- item2: long (nullable = true)
|-- count: long (nullable = false)
[1,2,2]
[1,3,2]
[1,4,1]
[1,5,3]
[2,1,2]
[2,3,1]
[2,4,1]
[2,5,2]
...
So, my question is: how do I now group on the first item in each result so that I have a list of tuples? For the example data above, I would expect something similar to this:
[1, [(2, 2), (3, 2), (4, 1), (5, 3)]]
[2, [(1, 2), (3, 1), (4, 1), (5, 2)]]
As you can see in my script with recs, I thought you would start by doing a groupBy on 'item1', which is the first item in each row. But after that you are left with a GroupedData object that has very limited actions on it. Really, you are only left with doing aggregations like sum, avg, etc. I just want to list the tuples from each result.
I could easily use RDD functions at this point, but that departs from using Dataframes. Is there a way to do this with the dataframe functions?

You can build that with org.apache.spark.sql.functions (collect_list and struct), available since 1.6:
val recs = grouped.groupBy('item1).agg(collect_list(struct('item2, 'count)).as("set"))
+-----+----------------------------+
|item1|set |
+-----+----------------------------+
|1 |[[5,3], [4,1], [3,2], [2,2]]|
|2 |[[4,1], [1,2], [5,2], [3,1]]|
+-----+----------------------------+
You can also use collect_set.
Edit: for information, tuples don't exist in dataframes. The closest structure is struct, since structs are the equivalent of case classes in the untyped dataset API.
Edit 2: Also be warned that collect_set comes with the caveat that the result is actually not a set (there is no datatype with set properties in the SQL types). That means that you can end up with distinct "sets" which differ by their order (in version 2.1.0 at least). Sorting them with sort_array is then necessary.
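For example, a minimal sketch of normalizing the collected arrays with sort_array so that order no longer matters (reusing the grouped dataframe from the question; col is just the explicit form of the 'item1 syntax above):
import org.apache.spark.sql.functions.{col, collect_set, sort_array, struct}
// Sort the collected array so that two results containing the same
// pairs in a different order compare equal.
val recsSorted = grouped
  .groupBy(col("item1"))
  .agg(sort_array(collect_set(struct(col("item2"), col("count")))).as("set"))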

Related

Why does Spark use an ordered schema for dataframes?

I wondered why Spark uses an ordered schema for dataframes rather than a name-based schema, where two schemas are considered the same if, for each column name, they have the same type.
My first question is: what is the advantage of ordering the columns in the schema? Does it make some operations on dataframes faster?
And my second question is whether I can tell Spark that the order of columns does not matter to me, and to consider two schemas the same if the unordered set of columns and their types is the same.
Spark dataframes are not relational databases. The ordering saves time for certain types of processing; e.g. union, which resolves columns by position and keeps the column names of the dataframe it is called on. So it's an implementation detail.
You therefore cannot state that ordering does not matter to Spark. See the union of the dataframes below:
val df2 = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")
val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "talk", "animal")
val df3 = df.union(df2)
Note that with JSON schema inference everything is alphabetical. That to me is very handy.
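If you do want name-based resolution for a union, unionByName (available since Spark 2.3) matches columns by name rather than by position; a minimal sketch with the df and df2 defined above:
// Unlike union, unionByName lines up df2's "talk" values with df's "talk"
// column even though the two dataframes declare their columns in a different order.
val byName = df.unionByName(df2)
byName.show()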

How to separate a collection inside a column in a spark-dataframe and pass the values to another column?

Suppose this is a dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

val data = Seq(
  Row("Ramesh", List("English", "German"), Map("hair" -> "black", "eye" -> "brown")),
  Row("Vijay", List("Spark", "French", null), Map("hair" -> "brown", "eye" -> null)),
  Row("Yann", List("Mandrin", ""), Map("hair" -> "red", "eye" -> "")),
  Row("Ram", null, null),
  Row("Jefferson", List(), Map())
)
val schema = new StructType()
  .add("name", StringType)
  .add("languages", ArrayType(StringType))
  .add("properties", MapType(StringType, StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
OUTPUT :
+----------+--------------+
|name |languages |
+----------+--------------+
I want all the output rows with names and atomic values. Nulls should also be covered.
Use explode_outer from org.apache.spark.sql.functions. It flattens the collection in the dataframe, and the outer variant keeps rows with null values:
df.select($"name",explode_outer($"languages"))
.show(false)
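explode_outer works the same way on the map column; a short sketch for the properties column from this example:
import org.apache.spark.sql.functions.explode_outer
// For a MapType column, explode_outer produces "key" and "value" columns;
// rows whose map is null or empty (Ram, Jefferson) are kept with null key/value.
df.select($"name", explode_outer($"properties"))
  .show(false)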

Spark Scala Compare Row and Row of 2 Data frames and get differences

I have Dataframe 1 (Df1) and Dataframe 2 (Df2) with the same schema.
I have Row 1 from Df1 (Dfw1) and Row 1 from Df2 (Dfw2).
I need to compare them to get the differences between Dfw1 and Dfw2, and get the differences out as a collection (a Map or something).
A simple solution would be to transform the Row objects to Maps and then compare the values of the two Maps.
Something like this in Scala:
val m1 = Dfw1.getValuesMap[AnyVal](Dfw1.schema.fieldNames)
val m2 = Dfw2.getValuesMap[AnyVal](Dfw2.schema.fieldNames)
val differences = for {
  field <- m1.keySet
  if !m1.get(field).equals(m2.get(field))
} yield (field, m1(field), m2(field))
This returns a Seq of tuples (field, value in Dfw1, value in Dfw2) for the fields whose values differ.
You may also use pattern matching on the Row object to compare:
Dfw1 match {
  case Row(id: String, desc: String, ....) => // assuming you know the schema
    // compare each value with Dfw2 and return differences
}
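For instance, a small self-contained usage of the map-based approach (the dataframes, column names, and values here are made up for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("row-diff").getOrCreate()
import spark.implicits._

// Two dataframes with the same schema; only "age" differs in the first row.
val df1 = Seq(("a", 30), ("b", 25)).toDF("id", "age")
val df2 = Seq(("a", 31), ("b", 25)).toDF("id", "age")

val dfw1 = df1.head()
val dfw2 = df2.head()

// Turn each Row into a Map keyed by field name, then keep the fields whose values differ.
val m1 = dfw1.getValuesMap[Any](dfw1.schema.fieldNames)
val m2 = dfw2.getValuesMap[Any](dfw2.schema.fieldNames)

val differences = for {
  field <- m1.keySet
  if m1.get(field) != m2.get(field)
} yield (field, m1(field), m2(field))

// Expected: a single entry ("age", 30, 31)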

Pyspark add sequential and deterministic index to dataframe

I need to add an index column to a dataframe with three very simple constraints:
start from 0
be sequential
be deterministic
I'm sure I'm missing something obvious, because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic monotonically increasing IDs. I don't want to zip with index and then have to separate the previously separate columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple sequence of integers from 0 to df.count. What am I missing here?
1, 2, 3, 4, 5
What I mean is: how can I add a column with an ordered sequence, monotonically increasing by 1, from 0 to df.count? (from comments)
You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just order by monotonically_increasing_id():
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window
df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
I don't want to zip with index and then have to separate the previously separated columns that are now in a single column
You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:
cols = df.columns
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
Not sure about the performance, but here is a trick.
Note: toPandas will collect all the data to the driver.
from pyspark.sql import SparkSession
# speed up toPandas using arrow
spark = SparkSession.builder.appName('seq-no') \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .getOrCreate()
df = spark.createDataFrame([
    ('id1', "a"),
    ('id2', "b"),
    ('id2', "c"),
], ["ID", "Text"])
df1 = spark.createDataFrame(df.toPandas().reset_index()).withColumnRenamed("index","seq_no")
df1.show()
+------+---+----+
|seq_no| ID|Text|
+------+---+----+
| 0|id1| a|
| 1|id2| b|
| 2|id2| c|
+------+---+----+

How to pass an array into a Spark UDF

I have a problem: 1) I don't really know how to call a registered UDF. I found some answers saying to use callUDF, so this is how I call the function in my code. 2) I don't really know how to pass in arrays as parameters.
Here is my code:
val df = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val newdf = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val cols = df.columns
val temp = Array(df.select($"A"),df.select($"B"),df.select($"C"),df.select($"D"),df.select($"E"),df.select($"F"))
val temp2 = Array(newdf.select($"A"),newdf.select($"B"),newdf.select($"C"),newdf.select($"D"),newdf.select($"E"),newdf.select($"F"))
sparkSession.udf.register ( "myfunc" , ((A:Array[String],B:Array[String]) => {for(i <- 0 to 5)yield( if (A(i)==B(i)) "U" else "N")} ) )
val a = df.withColumn("A",callUDF("myfunc",(temp,temp2)))
Thanks in advance!
You are trying to use columns from two different dataframes, which is illegal in a UDF. A Spark UDF can only work on a per-row basis; you can't combine rows from different dataframes. To do that you need to perform a join between the two.
In your case you have just one row, but in a realistic case you would have multiple rows, so you need to make sure you have some unique key to join by, such as a unique id.
If you don't and both dataframes have the same number of rows and the same number of partitions you can easily create an id for both dataframes like this:
df.withColumn("id",monotonicallyIncreasingId)
You should probably also rename the columns to have different names.
Look at the different options for join (see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) to see what best matches your need.
As for registering and calling a udf you can do:
def myFunc(s1: Seq[String], s2: Seq[String]) = {
  for (i <- 0 to 5) yield {
    if (s1(i) == s2(i)) "U" else "N"
  }
}
val u = udf(myFunc _)
val a = df.withColumn("A", u(temp, temp2))
Note that temp and temp2 should each be a column representing an array in the same dataframe, i.e. you should define them after joining on the relevant columns.
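To make that concrete, here is a minimal sketch of the whole flow under the assumptions discussed above (same row counts and partitioning so monotonically_increasing_id lines up; the "_2" suffix for the renamed columns is just an illustrative choice):
import org.apache.spark.sql.functions.{array, col, monotonically_increasing_id, udf}

val compare = udf((a: Seq[String], b: Seq[String]) =>
  a.zip(b).map { case (x, y) => if (x == y) "U" else "N" })

// Give both dataframes a join key; rename the second dataframe's columns
// so the names don't clash after the join.
val left  = df.withColumn("id", monotonically_increasing_id())
val right = newdf.toDF(newdf.columns.map(_ + "_2"): _*)
  .withColumn("id", monotonically_increasing_id())

val joined = left.join(right, "id")

// Pack each side's columns into a single array column and compare them row by row.
val result = joined.withColumn(
  "cmp",
  compare(array(cols.map(c => col(c)): _*), array(cols.map(c => col(c + "_2")): _*))
)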