aggregate of array values - sql

Given a table for a specific day with different hex_ids, I would like to aggregate the data such that the total distinct users for hex_id A is the sum of distinct users in hex_ids [A, B, C].
+----------+-------+------+---------+
|   date_id|user_id|hex_id|  hex_map|
+----------+-------+------+---------+
|2016-11-01|    100|     A|[A, B, C]|
|2016-11-01|    300|     B|      [B]|
|2016-11-01|    400|     B|      [B]|
|2016-11-01|    100|     C|   [B, C]|
|2016-11-01|    200|     C|   [B, C]|
|2016-11-01|    300|     C|   [B, C]|
+----------+-------+------+---------+
I would like to aggregate the table on hex_id such that the value
+------+---------+---+
|hex_id|  hex_map|cnt|
+------+---------+---+
|     A|[A, B, C]|  1|
|     B|      [B]|  2|
|     C|   [B, C]|  3|
+------+---------+---+
becomes the following, with hex_map replaced by the sum of each letter's position in the alphabet (A=1, B=2, C=3):
+------+---------+---+
|hex_id|  hex_map|cnt|
+------+---------+---+
|     A|        6|  1|
|     B|        2|  2|
|     C|        5|  3|
+------+---------+---+
This is run on Spark SQL 2.4.0, and I am stumped on how to achieve this.
The value of 6 comes from A + B + C = 1 + 2 + 3.
My best attempt is:
query = """
with cte as (select hex_id, hex_map, count(distinct user_id) cnt from tab group by hex_id, hex_map),
     subq as (select hex_id as hex, cnt as cnts, explode(hex_map) xxt from cte),
     sss as (select * from subq a left join cte b on a.xxt = b.hex_id)
select hex, sum(cnt) from sss group by hex
"""
spark.sql(query).show()

Since you did not specify the aggregation behavior for hex_map, I decided to use first, but you can adapt it to your needs.
The idea is to convert each character to its ASCII representation; you can do that with the code below:
val df1 = spark.sql("select hex_id, first(hex_map) as first_hex_map from test group by hex_id")
df1.createOrReplaceTempView("df1")
val df2 = spark.sql("select hex_id, transform(first_hex_map, a -> ascii(a) - 64) as aggr from df1")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("select hex_id, aggr, aggregate(aggr, 0, (acc, x) -> acc + x) as final from df2")
df3.show(false)
Final result:
+------+---------+-----+
|hex_id|aggr     |final|
+------+---------+-----+
|A     |[1, 2, 3]|6    |
|B     |[2]      |2    |
|C     |[2, 3]   |5    |
+------+---------+-----+
Or using the Dataset API:
df.groupBy("hex_id").agg(first("hex_map").as("first_hex_map"))
.withColumn("transformed", transform(col("first_hex_map"), a => ascii(a).minus(64)))
.withColumn("hex_map", aggregate(col("transformed"), lit(0), (acc, x) => acc.plus(x)))
Good luck!

Related

PySpark pivot as SQL query

I'm looking to write the full-SQL equivalent of a pivot implemented in PySpark. The code below creates a pandas DataFrame.
import pandas as pd
df = pd.DataFrame({
    'id': ['a','a','a','b','b','b','b','c','c'],
    'name': ['up','down','left','up','down','left','right','up','down'],
    'count': [6,7,5,3,4,2,9,12,4]})
# id name count
# 0 a up 6
# 1 a down 7
# 2 a left 5
# 3 b up 3
# 4 b down 4
# 5 b left 2
# 6 b right 9
# 7 c up 12
# 8 c down 4
Code below then converts to a pyspark DataFrame and implements a pivot on the name column.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ds = spark.createDataFrame(df)
dp = ds.groupBy('id').pivot('name').max().toPandas()
# id down left right up
# 0 c 4 NaN NaN 12
# 1 b 4 2.0 9.0 3
# 2 a 7 5.0 NaN 6
Trying to do the equivalent of ds.groupBy('id').pivot('name').max() in full SQL, i.e. something like
ds.createOrReplaceTempView('ds')
dp = spark.sql(f"""
SELECT * FROM ds
PIVOT
(MAX(count)
FOR
...)""").toPandas()
Taking reference from SparkSQL Pivot -
Pivot
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
FOR name in ('up','down','left','right')
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
Dynamic Approach
You can dynamically create the PIVOT clause; I tried to create a general wrapper around it below.
def pivot_by(inp_df, by):
    distinct_by = inp_df[by].unique()
    distinct_name_str = ''
    for i, name in enumerate(distinct_by):
        if i == 0:
            distinct_name_str += f'\'{name}\''
        else:
            distinct_name_str += f',\'{name}\''
    final_str = f'FOR {by} in ({distinct_name_str})'
    return final_str
pivot_clause_str = pivot_by(df,'name')
### O/p - FOR name in ('up','down','left','right')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
Dynamic Approach Usage
pivot_clause_str = pivot_by(df,'id')
##O/p - FOR id in ('a','b','c')
ds.createOrReplaceTempView('ds')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+-----+----+---+----+
| name| a| b| c|
+-----+----+---+----+
| down| 7| 4| 4|
| left| 5| 2|null|
| up| 6| 3| 12|
|right|null| 9|null|
+-----+----+---+----+
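As a side note, if you only have the Spark DataFrame and not the pandas one, a similar FOR clause can be built directly from it; a minimal sketch (the helper name pivot_by_spark is just illustrative):
def pivot_by_spark(inp_sdf, by):
    # collect the distinct values of the pivot column from the Spark DataFrame
    distinct_by = [row[by] for row in inp_sdf.select(by).distinct().collect()]
    quoted = ','.join(f"'{v}'" for v in distinct_by)
    return f'FOR {by} in ({quoted})'

pivot_clause_str = pivot_by_spark(ds, 'name')
spark.sql(f"""
    SELECT * FROM ds
    PIVOT (
        MAX(count)
        {pivot_clause_str}
    )""").show()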

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the problem below using Spark SQL / DataFrames:
For each postcode find the customer that has had the most number of previous accidents. In the case of a tie, meaning more than one customer have the same highest number of accidents, just return any one of them. For each of these selected customers output the following columns: postcode, customer id, number of previous accidents.
I think you forgot to provide the data mentioned in the image link. I have created my own data set using your problem as a reference. You can use the code snippet below as a reference, and you can replace the df DataFrame with your own data set to add any required columns such as id, etc.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()
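As an alternative sketch (not from the original answers), you can avoid the window entirely by taking the max of a struct, which keeps the customer that goes with the highest accident count (ties are broken arbitrarily, which the problem allows):
# reuses schemaPeople and F from the snippet above
alt = (schemaPeople
       .groupBy("postcode")
       .agg(F.max(F.struct("accident", "c_id")).alias("top"))  # struct compares by accident first
       .select("postcode", "top.c_id", "top.accident"))
alt.show()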

how to update a row based on another row with same id

With a Spark DataFrame, I want to update a row's value based on other rows with the same id.
For example, I have the records below:
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize, the value column is null in some rows, and I want to update them if there is another row with the same id that has a valid value.
In SQL, I can simply write an UPDATE statement with an inner join, but I didn't find the same way in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in SQL)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. When we aggregate the values, we have to use an aggregation function, and I have used sum. In case there are 2 non-null values for any id, they will be summed up. If an id could have multiple non-null values, then it's better to use min/max, so that we get one of the values rather than their sum.
df1=sqlContext.sql(
'select id, max(value) over (partition by id) as value from table_view'
)
You can use a window to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
[1,10],
[1,None],
[1,None],
[2,20],
[2,None],
[2,None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
.withColumn('value', F.first('value').over(window)) \
.show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in scala.
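A small variation on the window approach (a sketch, assuming at most one non-null value per id as in the caveat above) is to tell first to skip nulls explicitly, which avoids relying on the descending sort:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# with no orderBy, the default frame is the whole partition
w = Window.partitionBy('id')
df.withColumn('value', F.first('value', ignorenulls=True).over(w)).show()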

How to get unique values for each column in HIVE/PySpark table?

I have a table in HIVE/PySpark with A, B and C columns.
I want to get unique values for each of the column like
{A: [1, 2, 3], B:[a, b], C:[10, 20]}
in any format (dataframe, table, etc.)
How to do this efficiently (in parallel for each column) in HIVE or PySpark?
The current approach I have does this for each column separately and thus takes a lot of time.
We can use collect_set() from the pyspark.sql.functions module,
>>> df = spark.createDataFrame([(1,'a',10),(2,'a',20),(3,'b',10)],['A','B','C'])
>>> df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| a| 10|
| 2| a| 20|
| 3| b| 10|
+---+---+---+
>>> from pyspark.sql import functions as F
>>> df.select([F.collect_set(x).alias(x) for x in df.columns]).show()
+---------+------+--------+
| A| B| C|
+---------+------+--------+
|[1, 2, 3]|[b, a]|[20, 10]|
+---------+------+--------+
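If you then want the dictionary shape from the question, one more step collapses that single row into a dict (a minimal sketch; note that collect_set does not guarantee element order):
>>> row = df.select([F.collect_set(x).alias(x) for x in df.columns]).first()
>>> {c: row[c] for c in df.columns}
{'A': [1, 2, 3], 'B': ['b', 'a'], 'C': [20, 10]}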

What is the difference between cube, rollup and groupBy operators?

I can't find any detailed documentation regarding the differences.
I do notice a difference, because when interchanging cube and groupBy function calls, I get different results. For the result using cube, I got a lot of null values in the columns I was grouping by.
These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+
and you compute cube(x, y) with count as an aggregation:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
A similar function to cube is rollup which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+
Just for comparison, let's see the result of plain groupBy:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+
To summarize:
When using plain GROUP BY every row is included only once in its corresponding summary.
With GROUP BY CUBE(..) every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the output shown above is equivalent to something like this (assuming we could use NULL placeholders):
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling columns from left to right:
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
ROLLUP and CUBE come from data warehousing extensions, so if you want to get a better understanding of how this works you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5 and they are relatively well documented.
There's one more member in the "family" which can explain it all - GROUPING SETS. We don't have it in PySpark/Scala, but it exists in SQL API.
GROUPING SETS is used to design whatever combination of groupings is required. Others (cube, rollup, groupBy) return predefined existent combinations:
cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y).
(All the possible existent combinations.)
rollup("id", "x", "y") will only return (), (id), (id, x), (id, x, y).
(Combinations which include the beginning of the provided sequence.)
groupBy("id", "x", "y") will only return (id, x, y) combination.
Examples
Input df:
df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")
cube
df.cube("id", "x", "y").count()
is the same as...
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
(x),
(y),
(id, x),
(id, y),
(x, y),
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null| 2| 3|
|null|null|null| 4|
| a|null| 2| 3|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null| 1| 1|
|null| foo|null| 2|
| a|null|null| 4|
|null|null| 1| 1|
|null| foo| 2| 1|
|null| foo| 1| 1|
| a| foo| 2| 1|
|null| bar|null| 2|
|null| bar| 2| 2|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
rollup
df.rollup("id", "x", "y").count()
is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
--(x), <- (not used)
--(y), <- (not used)
(id, x),
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null|null| 4|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null|null| 4|
| a| foo| 2| 1|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
groupBy
df.groupBy("id", "x", "y").count()
is the same as... GROUPING SETS ((id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
--(), <- (not used)
--(id), <- (not used)
--(x), <- (not used)
--(y), <- (not used)
--(id, x), <- (not used)
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+---+---+---+-----+
| id| x| y|count|
+---+---+---+-----+
| a|foo| 2| 1|
| a|foo| 1| 1|
| a|bar| 2| 2|
+---+---+---+-----+
Note. All the above return existent combinations. In the example dataframe, there is no row for "id":"a", "x":"bar", "y":1. Even cube does not return it. In order to get all the possible combinations (existent or not) we should do something like the following (crossJoin):
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols:=["id", "x", "y"]):
df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (df_cartesian.drop("_tmp")
.join(df.cube(*cols).count(), cols, 'full')
)
df_final.show()
# +----+----+----+-----+
# | id| x| y|count|
# +----+----+----+-----+
# |null|null|null| 4|
# |null|null| 1| 1|
# |null|null| 2| 3|
# |null| bar|null| 2|
# |null| bar| 2| 2|
# |null| foo|null| 2|
# |null| foo| 1| 1|
# |null| foo| 2| 1|
# | a|null|null| 4|
# | a|null| 1| 1|
# | a|null| 2| 3|
# | a| bar|null| 2|
# | a| bar| 1| null|
# | a| bar| 2| 2|
# | a| foo|null| 2|
# | a| foo| 1| 1|
# | a| foo| 2| 1|
# +----+----+----+-----+
1. If you do not want nulls, first remove them, for example:
val dfWithoutNull = df.na.drop("all", Seq("colName1", "colName2"))
The expression above drops the null rows from the original dataframe.
2. groupBy you already know, I guess.
3. rollup and cube are GROUPING SETS operators. Rollup is a multidimensional aggregation that treats elements in a hierarchical manner, while cube, rather than treating elements hierarchically, does the same thing across all dimensions.
You can try grouping_id to understand the level of abstraction.
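To illustrate that last point, here is a small sketch (using the PySpark df from the Examples section above) showing how grouping_id() labels which grouping columns were rolled up in each output row:
import pyspark.sql.functions as F

(df.cube("x", "y")
   .agg(F.grouping_id().alias("gid"), F.count(F.lit(1)).alias("count"))
   .orderBy("gid")
   .show())
# gid is a bit mask over the grouping columns: 0 = grouped by both x and y,
# 1 = y rolled up, 2 = x rolled up, 3 = grand total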