I am working with Spark SQL and doing some SQL operations on a Hive table.
My table is like this:
```
ID COST CODE
1 100 AB1
5 200 BC3
1 400 FD3
6 600 HJ2
1 900 432
3 800 DS2
2 500 JT4
```
I want to create another table out of this, which would have the total cost and the top 5 CODEs chained in another column, like this:
```
ID TOTAL_COST CODE CODE_CHAIN
1 1400 432 432, FD3, AB1
```
Total cost is easy, but how do I concatenate the values from the CODE column to form another column?
I have tried the collect_set function, but the values cannot be limited and are also not properly sorted, probably due to distributed processing.
Is this possible with any SQL logic?
EDIT:
I need the data sorted so that I get the top 5 values.
Use slice, sort_array, and collect_list
import org.apache.spark.sql.functions._

df
  .groupBy("id")
  .agg(
    sum("cost") as "total_cost",
    slice(sort_array(collect_list(struct($"cost", $"code")), false), 1, 5)("code") as "codes")
In Spark 2.3 you'll have to replace slice with manual indexing of the sorted array:
val sorted = sort_array(collect_list(struct($"cost", $"code")), false)("code")
val codes = array((0 until 5).map(i => sorted.getItem(i)): _*) as "codes"
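For anyone doing this in PySpark, here is a rough equivalent sketch of the same idea. It assumes Spark 2.4+ (for slice), that `df` is the Hive table loaded as a DataFrame, and the column names from the sample table:

```
from pyspark.sql import functions as F

# Collect (COST, CODE) structs per ID, sort them by cost descending,
# keep at most the first 5, then pull out just the CODE field.
result = (
    df.groupBy("ID")
      .agg(
          F.sum("COST").alias("TOTAL_COST"),
          F.slice(
              F.sort_array(F.collect_list(F.struct("COST", "CODE")), asc=False),
              1, 5
          )["CODE"].alias("CODE_CHAIN"),
      )
)
result.show(truncate=False)
```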
Use a window function and a WITH clause (CTE) to filter on the first row_number. Check this out:
scala> val df = Seq((1,100,"AB1"),(5,200,"BC3"),(1,400,"FD3"),(6,600,"HJ2"),(1,900,"432"),(3,800,"DS2"),(2,500,"JT4")).toDF("ID","COST","CODE")
df: org.apache.spark.sql.DataFrame = [ID: int, COST: int ... 1 more field]
scala> df.show()
+---+----+----+
| ID|COST|CODE|
+---+----+----+
| 1| 100| AB1|
| 5| 200| BC3|
| 1| 400| FD3|
| 6| 600| HJ2|
| 1| 900| 432|
| 3| 800| DS2|
| 2| 500| JT4|
+---+----+----+
scala> df.createOrReplaceTempView("course")
scala> spark.sql(""" with tab1(select id,cost,code,collect_list(code) over(partition by id order by cost desc rows between current row and 5 following ) cc, row_number() over(partition by id order by cost desc) rc,sum(cost) over(partition by id order by cost desc rows between current row and 5 following) total from course) select id, total, cc from tab1 where rc=1 """).show(false)
+---+-----+---------------+
|id |total|cc |
+---+-----+---------------+
|1 |1400 |[432, FD3, AB1]|
|6 |600 |[HJ2] |
|3 |800 |[DS2] |
|5 |200 |[BC3] |
|2 |500 |[JT4] |
+---+-----+---------------+
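If you prefer the DataFrame API over SQL, a rough PySpark equivalent of the CTE above might look like the sketch below. It mirrors the same window frames and assumes `df` is the same DataFrame as in the transcript:

```
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame mirroring "rows between current row and 5 following" in the SQL above
frame = (Window.partitionBy("ID")
               .orderBy(F.col("COST").desc())
               .rowsBetween(Window.currentRow, 5))
order_only = Window.partitionBy("ID").orderBy(F.col("COST").desc())

result = (df.withColumn("cc", F.collect_list("CODE").over(frame))
            .withColumn("total", F.sum("COST").over(frame))
            .withColumn("rc", F.row_number().over(order_only))
            .where("rc = 1")
            .select("ID", "total", "cc"))
result.show(truncate=False)
```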
Related
I have the two data frames below, with a hash added as an additional column to identify differences for the same id between the two data frames.
df1 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34
df2 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# the above lists every row from both data frames, since all hashes for the same id are different
Now I am trying to find the difference in row values for the same id from the 'df1_un_match_indf2' and 'df2_un_match_indf1' data frames, so that it shows the differences row by row.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
But the result shows the difference like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I have tried different ways, but couldn't find the right solution to produce this expected format.
Can anyone give a solution or idea to this?
Thanks
What you want to use is likely collect_list, or maybe collect_set.
This is really well described here:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
 .groupby("id")
 .agg(F.collect_set("code"),
      F.collect_list("name"))
 .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3=df1_un_match_indf1
df4=df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
 .groupby("id")
 .agg(F.collect_set("name"),
      F.collect_list("department"))
 .show())
If you can't do a union, just use an array:
from pyspark.sql.functions import array
common_diff.select(
    df.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer.
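To get the exact paired-array layout from the question, a minimal sketch (assuming `df3` and `df4` are the leftanti results from above) is to union the two frames and collect each column per id:

```
from pyspark.sql import functions as F

# Union the two unmatched frames, then gather each column's values per id into arrays.
# Note: collect_list drops nulls (e.g. James's null state from df2) and does not
# guarantee ordering; coalesce nulls to a placeholder first if you need to keep them.
common_diff = df3.unionByName(df4)

paired = common_diff.groupBy("id").agg(
    F.collect_list("name").alias("name"),
    F.collect_list("department").alias("department"),
    F.collect_list("state").alias("state"),
    F.collect_list("hash").alias("hash"),
)
paired.show(truncate=False)
```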
I have data like below
+---+-----------------------------+--------+
|Id |DateTime                     |products|
+---+-----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |1       |
+---+-----------------------------+--------+
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below, getting the second date from the second row, which is the same as the first row:
+---+-----------------------------+-----------------------------+--------+
|Id |DateTime                     |secondtime                   |Products|
+---+-----------------------------+-----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
+---+-----------------------------+-----------------------------+--------+
Please help me find the second latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
"second_recent",
F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way, using an unordered window and sorting the array before taking second_recent:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
"second_recent",
F.sort_array(
F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
False
)[1]
)
Given a table like the following:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1| 2.80033330216659| 315.320000|
| 1| 2.8222365762732| 315.320000|
| 1| 5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2| 5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3| 5.77993740687426| 315.320000|
+--+------------------+-----------+
I'm trying to reduce the records per id to just one by taking the most frequent diagnosis for that id.
If it were an RDD, something like this would do it:
rdd.map(lambda x: (x["id"], [(x["diagnosis_age"], x["diagnosis"])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda x: [i[1] for i in x[1]])\
    .map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
In SQL:
select id, diagnosis, diagnosis_age
from (select id, diagnosis, diagnosis_age, count(*) as cnt,
row_number() over (partition by id order by count(*) desc) as seqnum
from t
      group by id, diagnosis, diagnosis_age
) da
where seqnum = 1;
Desired output:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 2| 5.70572315231258| 315.320000|
| 3| 5.744053881894209| 315.320000|
+--+------------------+-----------+
How can I achieve the same using only Spark DataFrame operations, if possible? Specifically, without using any RDD actions/SQL.
Thanks
You can use count, max, and first with window functions and filter on count = max.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("id", "diagnosis").orderBy("diagnosis_age")
w2 = Window().partitionBy("id")

df.withColumn("count", F.count("diagnosis").over(w))\
    .withColumn("max", F.max("count").over(w2))\
    .filter("count=max")\
    .groupBy("id").agg(F.first("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
    .orderBy("id").show()
+---+------------------+---------+
| id| diagnosis_age|diagnosis|
+---+------------------+---------+
| 1|2.1843037179180302| 315.32|
| 2| 5.70572315231258| 315.32|
| 3| 5.744053881894209| 315.32|
+---+------------------+---------+
Python: here is the conversion of my Scala code.
from pyspark.sql.functions import col, first, count, desc, row_number
from pyspark.sql import Window
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).alias("diagnosis_age"), count(col("diagnosis_age")).alias("cnt")) \
.withColumn("seqnum", row_number().over(Window.partitionBy("id").orderBy(col("cnt").desc()))) \
.where("seqnum = 1") \
.select("id", "diagnosis_age", "diagnosis", "cnt") \
.orderBy("id") \
.show(10, False)
Scala: your query does not make sense to me. The groupBy condition leads to the count for each record always being 1. I have modified it a bit in the DataFrame expression, such as:
import org.apache.spark.sql.expressions.Window
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).as("diagnosis_age"), count(col("diagnosis_age")).as("cnt"))
.withColumn("seqnum", row_number.over(Window.partitionBy("id").orderBy(col("cnt").desc)))
.where("seqnum = 1")
.select("id", "diagnosis_age", "diagnosis", "cnt")
.orderBy("id")
.show(false)
where the result is:
+---+------------------+---------+---+
|id |diagnosis_age |diagnosis|cnt|
+---+------------------+---------+---+
|1 |2.1843037179180302|315.32 |3 |
|2 |5.70572315231258 |315.32 |2 |
|3 |5.744053881894209 |315.32 |3 |
+---+------------------+---------+---+
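A different option, not from the answers above but offered as a hedged alternative sketch: aggregate the counts per (id, diagnosis) and then take the max of a struct per id, which avoids the second window. The names `most_frequent` and `best` are mine, and ties are broken by the struct's field order rather than explicitly:

```
from pyspark.sql import functions as F

most_frequent = (
    df.groupBy("id", "diagnosis")
      .agg(F.count(F.lit(1)).alias("cnt"),
           F.first("diagnosis_age").alias("diagnosis_age"))
      .groupBy("id")
      .agg(F.max(F.struct("cnt", "diagnosis_age", "diagnosis")).alias("best"))
      .select("id", "best.diagnosis_age", "best.diagnosis")
      .orderBy("id")
)
most_frequent.show()
```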
The table looks like this:
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
The CITY column contains 1000+ values, which are | delimited.
I want to create a flag column for each city of interest, indicating whether the person visited that city.
city_of_interest=['Paris','Seattle','Tokyo']
There are 20 such values in the list.
Output should look like this:
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
The solution can either be in pandas or pyspark.
For pyspark, use split + array_contains:
from pyspark.sql.functions import split, array_contains

df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
For Pandas, use Series.str.get_dummies:
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
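A quick self-contained check of those two lines, building the sample frame by hand; Series.str.get_dummies splits on '|' by default, which is why no separator argument is needed here:

```
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": ["London|Paris|Tokyo",
             "Tokyo|Barcelona|Mumbai|London",
             "Vienna|Paris|Seattle"],
})
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# One-hot encode the delimited CITY column, keep only the cities of interest
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
print(df)
# Expected (roughly): ID, Paris, Seattle, Tokyo columns with the 1/0 flags shown above
```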
Pandas Solution
First transform to list to use DataFrame.explode:
new_df=df.copy()
new_df['CITY']=new_df['CITY'].str.lstrip('|').str.split('|')
#print(new_df)
# ID CITY
#0 1 [London, Paris, Tokyo]
#1 2 [Tokyo, Barcelona, Mumbai, London]
#2 3 [Vienna, Paris, Seattle]
Then we can use:
Method 1: DataFrame.pivot_table
new_df=( new_df.explode('CITY')
.pivot_table(columns='CITY',index='ID',aggfunc='size',fill_value=0)
[city_of_interest]
.reset_index()
.rename_axis(columns=None)
)
print(new_df)
Method 2: DataFrame.groupby + DataFrame.unstack
new_df=( new_df.explode('CITY')
.groupby(['ID'])
.CITY
.value_counts()
.unstack('CITY',fill_value=0)[city_of_interest]
.reset_index()
.rename_axis(columns=None)
)
print(new_df)
Output new_df:
ID Paris Seattle Tokyo
0 1 1 0 1
1 2 0 0 1
2 3 1 1 0
Using a UDF to check if the city of interest value is in the delimited column.
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType

# Input list
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# UDF definition: returns 1 if the city is present in the |-delimited string, else 0
def city_present(city_name, city_list):
    return len(set([city_name]) & set(city_list.split('|')))

city_present_udf = udf(city_present, IntegerType())

# Converting the cities list to a column of array type for adding columns to the dataframe
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)
col_names = df.columns + [city for city in city_of_interest]

result = df.select(df.columns + [city_present_udf(city_array[i], df.CITY) for i in range(l)])
result = result.toDF(*col_names)
result.show()
Suppose I have a SQL table looking something like this:
--------------------
| id| name|
--------------------
| 1| Alice|
| 2| Bob|
| 3| Alice|
| 4| Alice|
| 5| Jeff|
| ...| ...|
--------------------
Is it possible to formulate a query which returns a list of names and the number of times they occur? I've made a solution to this by querying all the rows, removing duplicates, counting, and then ordering; this works, but it just looks messy. Can this be neatened up in a SQL query?
This is standard SQL and should deliver your expected result:
select name, count(*)
from tblName
group by name
order by name
If you want to order by the count in descending order, you can use:
select name, count(*)
from tblName
group by name
order by 2 DESC
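Since the rest of this page is Spark-focused, the same aggregation with the DataFrame API is a one-liner; this assumes the table is already loaded as a DataFrame named df:

```
from pyspark.sql import functions as F

# Count rows per name and sort by the count, highest first
df.groupBy("name").count().orderBy(F.desc("count")).show()
```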