I am working with Spark SQL and doing some SQL operations on a Hive table.
My table is like this:
```
ID COST CODE
1 100 AB1
5 200 BC3
1 400 FD3
6 600 HJ2
1 900 432
3 800 DS2
2 500 JT4
```
I want to create another table out of this, which would have the total cost and the top 5 CODEs chained in another column, like this:
```
ID TOTAL_COST CODE CODE_CHAIN
1 1400 432 432, FD3, AB1
```
Total cost is easy, but how do I concatenate the values from the CODE column to form another column?
I have tried the collect_set function, but the values cannot be limited and are also not properly sorted, probably due to distributed processing.
Is this possible with any SQL logic?
EDIT:
I need the data sorted so that I get the top 5 values.
Use slice, sort_array, and collect_list
import org.apache.spark.sql.functions._

df
  .groupBy("id")
  .agg(
    sum("cost") as "total_cost",
    slice(sort_array(collect_list(struct($"cost", $"code")), false), 1, 5)("code") as "codes")
In Spark 2.3 you'll have to replace slice with manual indexing of the sorted array:
val sorted = sort_array(collect_list(struct($"cost", $"code")), false)("code")
val codes = array((0 until 5).map(i => sorted.getItem(i)): _*) as "codes"
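For anyone doing this in PySpark, here is a rough equivalent sketch of the same idea. It assumes Spark 2.4+ (for slice), that `df` is the Hive table loaded as a DataFrame, and the column names from the sample table:

```
from pyspark.sql import functions as F

# Collect (COST, CODE) structs per ID, sort them by cost descending,
# keep at most the first 5, then pull out just the CODE field.
result = (
    df.groupBy("ID")
      .agg(
          F.sum("COST").alias("TOTAL_COST"),
          F.slice(
              F.sort_array(F.collect_list(F.struct("COST", "CODE")), asc=False),
              1, 5
          )["CODE"].alias("CODE_CHAIN"),
      )
)
result.show(truncate=False)
```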
Use a window function and a WITH clause (CTE) to filter on the first row_number. Check this out:
scala> val df = Seq((1,100,"AB1"),(5,200,"BC3"),(1,400,"FD3"),(6,600,"HJ2"),(1,900,"432"),(3,800,"DS2"),(2,500,"JT4")).toDF("ID","COST","CODE")
df: org.apache.spark.sql.DataFrame = [ID: int, COST: int ... 1 more field]
scala> df.show()
+---+----+----+
| ID|COST|CODE|
+---+----+----+
| 1| 100| AB1|
| 5| 200| BC3|
| 1| 400| FD3|
| 6| 600| HJ2|
| 1| 900| 432|
| 3| 800| DS2|
| 2| 500| JT4|
+---+----+----+
scala> df.createOrReplaceTempView("course")
scala> spark.sql(""" with tab1(select id,cost,code,collect_list(code) over(partition by id order by cost desc rows between current row and 5 following ) cc, row_number() over(partition by id order by cost desc) rc,sum(cost) over(partition by id order by cost desc rows between current row and 5 following) total from course) select id, total, cc from tab1 where rc=1 """).show(false)
+---+-----+---------------+
|id |total|cc |
+---+-----+---------------+
|1 |1400 |[432, FD3, AB1]|
|6 |600 |[HJ2] |
|3 |800 |[DS2] |
|5 |200 |[BC3] |
|2 |500 |[JT4] |
+---+-----+---------------+
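If you prefer the DataFrame API over SQL, a rough PySpark equivalent of the CTE above might look like the sketch below. It mirrors the same window frames and assumes `df` is the same DataFrame as in the transcript:

```
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame mirroring "rows between current row and 5 following" in the SQL above
frame = (Window.partitionBy("ID")
               .orderBy(F.col("COST").desc())
               .rowsBetween(Window.currentRow, 5))
order_only = Window.partitionBy("ID").orderBy(F.col("COST").desc())

result = (df.withColumn("cc", F.collect_list("CODE").over(frame))
            .withColumn("total", F.sum("COST").over(frame))
            .withColumn("rc", F.row_number().over(order_only))
            .where("rc = 1")
            .select("ID", "total", "cc"))
result.show(truncate=False)
```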
Related
I have the two data frames below, with a hash added as an additional column to identify differences for the same id between the two data frames.
df1 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34
df2 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# the above lists every row from both data frames, since all hashes for the same id are different
Now I am trying to find the difference in row values for the same id from the 'df1_un_match_indf2' and 'df2_un_match_indf1' data frames, so that it shows the differences row by row.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
But the result shows the difference like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I have tried different ways, but couldn't find the right solution to produce this expected format.
Can anyone give a solution or idea to this?
Thanks
What you want to use is likely collect_list, or maybe collect_set.
This is really well described here:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
 .groupby("id")
 .agg(F.collect_set("code"),
      F.collect_list("name"))
 .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3=df1_un_match_indf1
df4=df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
 .groupby("id")
 .agg(F.collect_set("name"),
      F.collect_list("department"))
 .show())
If you can't do a union, just use an array:
from pyspark.sql.functions import array
common_diff.select(
    df.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer.
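To get the exact paired-array layout from the question, a minimal sketch (assuming `df3` and `df4` are the leftanti results from above) is to union the two frames and collect each column per id:

```
from pyspark.sql import functions as F

# Union the two unmatched frames, then gather each column's values per id into arrays.
# Note: collect_list drops nulls (e.g. James's null state from df2) and does not
# guarantee ordering; coalesce nulls to a placeholder first if you need to keep them.
common_diff = df3.unionByName(df4)

paired = common_diff.groupBy("id").agg(
    F.collect_list("name").alias("name"),
    F.collect_list("department").alias("department"),
    F.collect_list("state").alias("state"),
    F.collect_list("hash").alias("hash"),
)
paired.show(truncate=False)
```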
I have data like below
+---+-----------------------------+--------+
|Id |DateTime                     |products|
+---+-----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |1       |
+---+-----------------------------+--------+
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below, getting the second date from the second row, which is the same as the first row:
+---+-----------------------------+-----------------------------+--------+
|Id |DateTime                     |secondtime                   |Products|
+---+-----------------------------+-----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
+---+-----------------------------+-----------------------------+--------+
Please help me find the second latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
"second_recent",
F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way, using an unordered window and sorting the array before taking second_recent:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
"second_recent",
F.sort_array(
F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
False
)[1]
)
Given a table like the following:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1| 2.80033330216659| 315.320000|
| 1| 2.8222365762732| 315.320000|
| 1| 5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2| 5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3| 5.77993740687426| 315.320000|
+--+------------------+-----------+
I'm trying to reduce the records per id to just one by taking the most frequent diagnosis for that id.
If it were an RDD, something like this would do it:
rdd.map(lambda x: (x["id"], [(x["diagnosis_age"], x["diagnosis"])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda x: [i[1] for i in x[1]])\
    .map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
In SQL:
select id, diagnosis, diagnosis_age
from (select id, diagnosis, diagnosis_age, count(*) as cnt,
row_number() over (partition by id order by count(*) desc) as seqnum
from t
      group by id, diagnosis, diagnosis_age
) da
where seqnum = 1;
Desired output:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 2| 5.70572315231258| 315.320000|
| 3| 5.744053881894209| 315.320000|
+--+------------------+-----------+
How can I achieve the same using only Spark DataFrame operations, if possible? Specifically, without using any RDD actions/SQL.
Thanks
You can use count, max, and first with window functions and filter on count = max.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("id", "diagnosis").orderBy("diagnosis_age")
w2 = Window().partitionBy("id")

df.withColumn("count", F.count("diagnosis").over(w))\
    .withColumn("max", F.max("count").over(w2))\
    .filter("count=max")\
    .groupBy("id").agg(F.first("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
    .orderBy("id").show()
+---+------------------+---------+
| id| diagnosis_age|diagnosis|
+---+------------------+---------+
| 1|2.1843037179180302| 315.32|
| 2| 5.70572315231258| 315.32|
| 3| 5.744053881894209| 315.32|
+---+------------------+---------+
Python: here is the conversion of my Scala code.
from pyspark.sql.functions import col, first, count, desc, row_number
from pyspark.sql import Window
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).alias("diagnosis_age"), count(col("diagnosis_age")).alias("cnt")) \
.withColumn("seqnum", row_number().over(Window.partitionBy("id").orderBy(col("cnt").desc()))) \
.where("seqnum = 1") \
.select("id", "diagnosis_age", "diagnosis", "cnt") \
.orderBy("id") \
.show(10, False)
Scala: your query does not make sense to me. The groupBy condition leads to the count for each record always being 1. I have modified it a bit in the DataFrame expression, such as:
import org.apache.spark.sql.expressions.Window
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).as("diagnosis_age"), count(col("diagnosis_age")).as("cnt"))
.withColumn("seqnum", row_number.over(Window.partitionBy("id").orderBy(col("cnt").desc)))
.where("seqnum = 1")
.select("id", "diagnosis_age", "diagnosis", "cnt")
.orderBy("id")
.show(false)
where the result is:
+---+------------------+---------+---+
|id |diagnosis_age |diagnosis|cnt|
+---+------------------+---------+---+
|1 |2.1843037179180302|315.32 |3 |
|2 |5.70572315231258 |315.32 |2 |
|3 |5.744053881894209 |315.32 |3 |
+---+------------------+---------+---+
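A different option, not from the answers above but offered as a hedged alternative sketch: aggregate the counts per (id, diagnosis) and then take the max of a struct per id, which avoids the second window. The names `most_frequent` and `best` are mine, and ties are broken by the struct's field order rather than explicitly:

```
from pyspark.sql import functions as F

most_frequent = (
    df.groupBy("id", "diagnosis")
      .agg(F.count(F.lit(1)).alias("cnt"),
           F.first("diagnosis_age").alias("diagnosis_age"))
      .groupBy("id")
      .agg(F.max(F.struct("cnt", "diagnosis_age", "diagnosis")).alias("best"))
      .select("id", "best.diagnosis_age", "best.diagnosis")
      .orderBy("id")
)
most_frequent.show()
```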
The table looks like this:
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
The CITY column contains 1000+ values, which are | delimited.
I want to create a flag column for each city of interest, indicating whether the person visited that city.
city_of_interest=['Paris','Seattle','Tokyo']
There are 20 such values in the list.
Output should look like this:
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
The solution can either be in pandas or pyspark.
For pyspark, use split + array_contains:
from pyspark.sql.functions import split, array_contains

df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
For Pandas, use Series.str.get_dummies:
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
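A quick self-contained check of those two lines, building the sample frame by hand; Series.str.get_dummies splits on '|' by default, which is why no separator argument is needed here:

```
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": ["London|Paris|Tokyo",
             "Tokyo|Barcelona|Mumbai|London",
             "Vienna|Paris|Seattle"],
})
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# One-hot encode the delimited CITY column, keep only the cities of interest
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
print(df)
# Expected (roughly): ID, Paris, Seattle, Tokyo columns with the 1/0 flags shown above
```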
Pandas Solution
First transform to list to use DataFrame.explode:
new_df=df.copy()
new_df['CITY']=new_df['CITY'].str.lstrip('|').str.split('|')
#print(new_df)
# ID CITY
#0 1 [London, Paris, Tokyo]
#1 2 [Tokyo, Barcelona, Mumbai, London]
#2 3 [Vienna, Paris, Seattle]
Then we can use:
Method 1: DataFrame.pivot_table
new_df=( new_df.explode('CITY')
.pivot_table(columns='CITY',index='ID',aggfunc='size',fill_value=0)
[city_of_interest]
.reset_index()
.rename_axis(columns=None)
)
print(new_df)
Method 2: DataFrame.groupby + DataFrame.unstack
new_df=( new_df.explode('CITY')
.groupby(['ID'])
.CITY
.value_counts()
.unstack('CITY',fill_value=0)[city_of_interest]
.reset_index()
.rename_axis(columns=None)
)
print(new_df)
Output new_df:
ID Paris Seattle Tokyo
0 1 1 0 1
1 2 0 0 1
2 3 1 1 0
Using a UDF to check if the city of interest value is in the delimited column.
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType

# Input list
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# UDF definition: returns 1 if the city is present in the |-delimited string, else 0
def city_present(city_name, city_list):
    return len(set([city_name]) & set(city_list.split('|')))

city_present_udf = udf(city_present, IntegerType())

# Converting the cities list to a column of array type for adding columns to the dataframe
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)
col_names = df.columns + [city for city in city_of_interest]

result = df.select(df.columns + [city_present_udf(city_array[i], df.CITY) for i in range(l)])
result = result.toDF(*col_names)
result.show()
Suppose I have a SQL table looking something like this:
--------------------
| id| name|
--------------------
| 1| Alice|
| 2| Bob|
| 3| Alice|
| 4| Alice|
| 5| Jeff|
| ...| ...|
--------------------
Is it possible to formulate a query which returns a list of names and the number of times they occur? I've made a solution to this by querying all the rows, removing duplicates, counting, and then ordering; this works, but it just looks messy. Can this be neatened up in a SQL query?
This is standard SQL and should deliver your expected result:
select name, count(*)
from tblName
group by name
order by name
If you want to order by the count in descending order, you can use:
select name, count(*)
from tblName
group by name
order by 2 DESC
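Since the rest of this page is Spark-focused, the same aggregation with the DataFrame API is a one-liner; this assumes the table is already loaded as a DataFrame named df:

```
from pyspark.sql import functions as F

# Count rows per name and sort by the count, highest first
df.groupBy("name").count().orderBy(F.desc("count")).show()
```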