how to use window function in sqlContext in spark scala [duplicate] - sql

Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!

You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is a default SQLContext available in spark-shell and pyspark shell (as for now sparkR seems to use plain SQLContext) and its parser is recommended by Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Hive Context")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(
("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
).toDF("k", "v")
val w = Window.partitionBy($"k").orderBy($"v")
df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
}
}

You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using FlatMap, but if you need to perform anything on the 'group' that won't be a great idea.
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.

I totally agree that Window functions for DataFrames are the way to go if you have Spark version (>=)1.5. But if you are really stuck with an older version(e.g 1.4.1), here is a hacky way to solve this
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (1, "2015-09-04") :: Nil)
.toDF("id", "date")
val dfDuplicate = df.selecExpr("id as idDup", "date as dateDup")
val dfWithCounter = df.join(dfDuplicate,$"id"===$"idDup")
.where($"date"<=$"dateDup")
.groupBy($"id", $"date")
.agg($"id", $"date", count($"idDup").as("counter"))
.select($"id",$"date",$"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. Also you can change the ordering of the counter by changing the <= to >= in the where statement.

Related

PySpark - Get top 5 most frequent values for every column (without UDF)

Currently I'm gathering the top 5 most frequent values with a UDF.
The goal is to achieve the same result without using UDF and have the most efficient solution (avoid groupBy in loops).
Here's the code I'm using to have the result :
from pyspark.sql import functions as F
df = df.select('A', 'B', ...)
#F.udf
def get_top_5_udf(x)
from collections import Counter
return [elem[0] for elem in Counter(x).most_common(5)]
agg_expr = [get_top_5_udf(F.collect_list(col)).alias(col) for col in df.columns]
df_top5 = df.agg(*agg_expr)
The result looks like the following :
# result
#+-----------------+--------------+---------------+
#| A | B | ... |
#+-----------------+--------------+---------------+
#| [1, 2, 3, 4, 5] | [...] | ... |
#+-----------------+--------------+---------------+
You can try using count over window partitioned by each column before aggregating:
from pyspark.sql import functions as F, Window
result = df.select(*[
F.struct(
F.count(c).over(Window.partitionBy(c)).alias("cnt"),
F.col(c).alias("val")
).alias(c) for c in df.columns
]).agg(*[
F.slice(
F.expr(f"transform(sort_array(collect_set({c}), false), x -> x.val)"),
1, 5
).alias(c) for c in df.columns
])
result.show()

Spark Dataframe - How to keep only latest record for each group based on ID and Date? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 1 year ago.
I have a dataframe:
DF:
1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
How to keep only latest record for each group? (there are 3 groups above (1,2,3)).
Result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
Trying also to make it efficient (e.g. to finish within few short minutes on a moderate cluster with 100 million records), so sorting/ordering should be done (if they are required) in most efficient and correct manner..
You have to use the window function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
you have to partition the window by the group and OrderBy time, below pyspark script do the work
from pyspark.sql.functions import *
from pyspark.sql.window import Window
schema = "Group int,time timestamp "
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank',dense_rank().over(w))
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group| time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+ ```
You can use window functions as described here for cases like this:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("id").orderBy("ts desc")
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+

Merge multiple spark rows to one

I have a dataframe which looks like one given below. All the values for a corresponding id is the same except for the mappingcol field.
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
|ref |banana |Map("lname"->"Nikki"| 2 |
|ddd |apple |Map("lname"->"tenka"| 1 |
+--------------------+----------------+--------------------+-------+
I want to merge the rows with same row in such a way that I get exactly one row for one id and the value of mappingcol needs to be merged. The output should look like :
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
+--------------------+----------------+--------------------+-------+
the value for mappingcol for id = 1 would be :
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know that maps can be merged using ++ operator, so thats not what im worried about. I just cant understand how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows on.
You can use by groupBy and then managing a little the map
df.groupBy("id", "fruit", "misc").agg(collect_list("mappingcol"))
.as[(Int, String, String, Seq[Map[String, String]])]
.map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
.toDF("id", "fruit", "misc", "mappingColumn")
With the first line, tou group by your desired columns and aggregate the map pairs in the same element (an array)
With the second line (as), you convert your structure to a Dataset of a Tuple4 with the last element being a sequence of maps
With the third line (map), you merge all the elements to a single map
With the last line (toDF) to give the columns the original names
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
You can definitely do the above with a Window function!
This is in PySpark not Scala but there's almost no difference when only using native Spark functions.
The below code only works on a map column that 1 one key, value pair per row, as it how your example data is, but it can be made to work with map columns with multiple entries.
from pyspark.sql import Window
map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
# or, a lazier way if you have a lot of columns to group on
cols = df.columns # save as list
group_cols_2 = cols.remove('mappingCol') # remove what you're not grouping by
w = Window.partitionBy(group_cols)
# unpack map value and key into a pair struct column
df1 = df.withColumn(map_col , F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
# Collect all key values into an array of structs, here each row
# contains the map entries for all rows in the group/window
df1 = df1.withColumn(map_col , F.collect_list(map_col).over(w))
# drop duplicate values, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)
# return the values for map type
df1 = df1.withColumn(map_col , F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import Window
map_col = 'mappingColumn'
group_cols = list('id', 'fruit', 'misc')
w = Window.partitionBy(group_cols)
df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))
df1.show(truncate=False)
df1.printSchema()
df1 = df1.dropDuplicates(group_cols)

Fetching distinct values on a column using Spark DataFrame

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?
import sqlContext.implicits._
preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
val applicationId = x.getAs[String](ApplicationId)
val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
// DO SOME TASK PER applicationId
})
preProcessedData.unpersist()
Well to obtain all different values in a Dataframe you can use distinct. As you can see in the documentation that method returns another DataFrame. After that you can create a UDF in order to transform each record.
For example:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")
// I obtain all different values. If you show you must see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct
// Define your udf. In this case I defined a simple function, but they can get complicated.
val myTransformationUDF = udf(value => value / 10)
// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In Pyspark try this,
df.select('col_name').distinct().show()
This solution demonstrates how to transform data with Spark native functions which are better than UDFs. It also demonstrates how dropDuplicates which is more suitable than distinct for certain queries.
Suppose you have this DataFrame:
+-------+-------------+
|country| continent|
+-------+-------------+
| china| asia|
| brazil|south america|
| france| europe|
| china| asia|
+-------+-------------+
Here's how to take all the distinct countries and run a transformation:
df
.select("country")
.distinct
.withColumn("country", concat(col("country"), lit(" is fun!")))
.show()
+--------------+
| country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+
You can use dropDuplicates instead of distinct if you don't want to lose the continent information:
df
.dropDuplicates("country")
.withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
.show(false)
+-------+-------------+------------------------------------+
|country|continent |description |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe |france is a country in europe |
|china |asia |china is a country in asia |
+-------+-------------+------------------------------------+
See here for more information about filtering DataFrames and here for more information on dropping duplicates.
Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
df = df.select("column1", "column2",....,..,"column N").distinct.[].collect()
in the empty list, you can insert values like [ to_JSON()] if you want the df in a JSON format.

Spark DataFrame equivalent to Pandas Dataframe `.iloc()` method?

Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:0] # Give me all the rows at column position 0
The equivalent of Python df.iloc is collect
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1] #row 0 col 1
Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # I assume [:1] is what you really want
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala
val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
Note:
Spark SQL doesn't support and it is unlikely to ever support row indexing so it is not possible to index across row dimension.
You can use like this in spark-shell.
scala>: df.columns
Array[String] = Array(age, name)
scala>: df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
As of Spark 3.1.1 on Databricks, it's a matter of selecting the column of interest, and applying limit:
%python
retDF = (inputDF
.select(col(inputDF
.columns[0]))
.limit(100)
)