SQL Server: create UNPIVOT table with WHERE condition

Here is my sample table
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Id|  CompanyName|part1_sales_amount|part2_sales_amount|part3_sales_amount|part1_sales_quantity|part2_sales_quantity|part3_sales_quantity|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
|  1|   FastCarsCo|                 1|                 2|                 3|                   4|                   5|                   6|
|  2|TastyCakeShop|                 4|                 5|                 6|                   4|                   5|                   6|
|  3|     KidsToys|                 7|                 8|                 9|                   7|                   8|                   9|
|  4|   FruitStall|                10|                11|                12|                  10|                  11|                  12|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
Here is the output table that I want (quantities taken from the matching part*_sales_quantity columns):
+---+-------------+------------+------+--------+
| Id|  CompanyName|     Account|amount|quantity|
+---+-------------+------------+------+--------+
|  1|   FastCarsCo| part1_sales|     1|       4|
|  1|   FastCarsCo| part2_sales|     2|       5|
|  1|   FastCarsCo| part3_sales|     3|       6|
|  2|TastyCakeShop| part1_sales|     4|       4|
|  2|TastyCakeShop| part2_sales|     5|       5|
|  2|TastyCakeShop| part3_sales|     6|       6|
|  3|     KidsToys| part1_sales|     7|       7|
|  3|     KidsToys| part2_sales|     8|       8|
|  3|     KidsToys| part3_sales|     9|       9|
|  4|   FruitStall| part1_sales|    10|      10|
|  4|   FruitStall| part2_sales|    11|      11|
|  4|   FruitStall| part3_sales|    12|      12|
+---+-------------+------------+------+--------+
Here is what I have already tried:
SELECT
    Id,
    CompanyName,
    REPLACE(acc, '_amount', '') AS Account,
    amount,
    quantity
FROM
(
    SELECT Id, CompanyName,
           part1_sales_amount, part2_sales_amount, part3_sales_amount,
           part1_sales_quantity, part2_sales_quantity, part3_sales_quantity
    FROM privot
) src
UNPIVOT
(
    amount FOR acc IN (part1_sales_amount, part2_sales_amount, part3_sales_amount)
) pvt1
UNPIVOT
(
    quantity FOR acc1 IN (part1_sales_quantity, part2_sales_quantity, part3_sales_quantity)
) pvt2
It gives results, but it also produces unexpected records (like a cross join). So my final step is the WHERE clause: what should I write in it? I have tried many things, but none of them is correct.
Note: in my real database there are almost 200 columns like part1_sales_amount and part1_sales_quantity.
Any help is appreciated.

You can use apply:
select t.id, t.companyname, tt.amount, tt.qty
from table t cross apply
     (values (t.part1_sales_amount, t.part1_sales_quantity),
             (t.part2_sales_amount, t.part2_sales_quantity),
             (t.part3_sales_amount, t.part3_sales_quantity),
             . . .
     ) tt(amount, qty);
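The ". . ." elides the remaining value rows. Note that the desired output also carries an Account label; a hedged sketch of the same APPLY approach with that column added, assuming the three-part sample table (named privot in the question):
select t.id, t.companyname, tt.account, tt.amount, tt.qty
from privot t cross apply
     (values ('part1_sales', t.part1_sales_amount, t.part1_sales_quantity),
             ('part2_sales', t.part2_sales_amount, t.part2_sales_quantity),
             ('part3_sales', t.part3_sales_amount, t.part3_sales_quantity)
     ) tt(account, amount, qty);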

SELECT Id, CompanyName, account, amount, quantity
FROM MyTable
CROSS APPLY (
    SELECT account = 'part1_sales_amount', amount = part1_sales_amount, quantity = part1_sales_quantity
    UNION ALL
    SELECT account = 'part2_sales_amount', amount = part2_sales_amount, quantity = part2_sales_quantity
    UNION ALL
    SELECT account = 'part3_sales_amount', amount = part3_sales_amount, quantity = part3_sales_quantity
) AS AnotherData

Use a single UNPIVOT, then pick the corresponding quantity column with CHOOSE:
declare @privot table
(
    id int,
    CompanyName varchar(20),
    part1_sales_amount money,
    part2_sales_amount money,
    part3_sales_amount money,
    part4_sales_amount money,
    part5_sales_amount money,
    part1_sales_quantity int,
    part2_sales_quantity int,
    part3_sales_quantity int,
    part4_sales_quantity int,
    part5_sales_quantity int
);
insert into @privot
(
    Id, CompanyName,
    part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount,
    part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity
)
values
    (1, 'FastCarsCo', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    (2, 'TastyCakeShop', 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
    (3, 'KidsToys', 11, 21, 31, 41, 51, 61, 71, 81, 91, 101),
    (4, 'FruitStall', 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000);
select
    Id, CompanyName, replace(acc, '_amount', '') as acc, amount,
    quantity = choose(
        /*try_cast ??*/ replace(left(acc, charindex('_', acc) - 1), 'part', ''),
        /*quantity columns*/
        part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity)
FROM
(
    SELECT *
    --Id, CompanyName,
    --part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount,
    --part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity
    FROM @privot
) src
UNPIVOT
(
    amount FOR acc IN (/*amount columns*/ part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount)
) pvt1;
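For completeness, the question's own double-UNPIVOT approach can also be finished with the WHERE clause it asks about: keep only the rows where the unpivoted amount and quantity columns name the same part. A hedged sketch against the question's privot table:
SELECT Id, CompanyName,
       REPLACE(acc, '_amount', '') AS Account,
       amount, quantity
FROM privot
UNPIVOT
(
    amount FOR acc IN (part1_sales_amount, part2_sales_amount, part3_sales_amount)
) pvt1
UNPIVOT
(
    quantity FOR acc1 IN (part1_sales_quantity, part2_sales_quantity, part3_sales_quantity)
) pvt2
-- of the 3x3 = 9 combinations per Id, keep the 3 where the part numbers match
WHERE REPLACE(acc, '_amount', '') = REPLACE(acc1, '_quantity', '');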

Related

Sum multiple columns with PARTITION from a single table

I have a question; it seems simple, but I can't figure it out.
I have a sample table like this:
Overtime Table (OT)
+----------+------------+----------+-------------+
|EmployeeId|OvertimeDate|HourMargin|OvertimePoint|
+----------+------------+----------+-------------+
|         1|  2020-07-01|     05:00|           15|
|         1|  2020-07-02|     03:00|            9|
|         2|  2020-07-01|     01:00|            3|
|         2|  2020-07-03|     03:00|            9|
|         3|  2020-07-06|     03:00|            9|
|         3|  2020-07-07|     01:00|            3|
+----------+------------+----------+-------------+
OLC Table (OLC)
+----------+------------+-----+------+
|EmployeeId|   OLCDate  | OLC | Trip |
+----------+------------+-----+------+
|         1|  2020-07-01|    2|     0|
|         3|  2020-07-13|    3|     6|
+----------+------------+-----+------+
So, based on those tables, I want to calculate totals for OT.HourMargin, OT.OvertimePoint, OLC.OLC, and OLC.Trip, with the final result like this:
Result
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
|         1|      08:00|        24|       2|         0|
|         2|      04:00|        12|       0|         0|
|         3|      04:00|        24|       3|         6|
+----------+-----------+----------+--------+----------+
Here is the query I wrote to try to achieve the result:
DECLARE #Overtime TABLE (
EmployeeId INT,
OvertimeDate DATE,
HourMargin TIME,
OvertimePoint INT
)
DECLARE #OLC TABLE (
EmployeeId INT,
OLCDate DATE,
OLC INT,
Trip INT
)
INSERT INTO #Overtime VALUES (1, '2020-07-01', '05:00:00', 15)
INSERT INTO #Overtime VALUES (1, '2020-07-02', '03:00:00', 9)
INSERT INTO #Overtime VALUES (2, '2020-07-01', '01:00:00', 3)
INSERT INTO #Overtime VALUES (2, '2020-07-03', '03:00:00', 9)
INSERT INTO #Overtime VALUES (3, '2020-07-06', '03:00:00', 9)
INSERT INTO #Overtime VALUES (3, '2020-07-07', '01:00:00', 3)
INSERT INTO #OLC VALUES (1, '2020-07-01', 2, 0)
INSERT INTO #OLC VALUES (3, '2020-07-13', 3, 6)
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)) OVER (PARTITION BY OT.EmployeeId)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) OVER (PARTITION BY OT.EmployeeId) AS TotalPoint,
SUM(OLC.OLC) OVER (PARTITION BY OLC.EmployeeId) AS TotalOLC,
SUM(OLC.Trip) OVER (PARTITION BY OLC.EmployeeId) AS TotalTrip
FROM
#Overtime OT
LEFT JOIN #OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
ORDER BY
EmployeeId
Here is the result from my query:
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
|         1|      08:00|        24|    NULL|      NULL|
|         1|      08:00|        24|       2|         0|
|         2|      04:00|        12|    NULL|      NULL|
|         2|      04:00|        12|    NULL|      NULL|
|         3|      04:00|        12|    NULL|      NULL|
|         3|      04:00|        12|    NULL|      NULL|
+----------+-----------+----------+--------+----------+
It seems that when I try to SUM multiple columns from a single table, it creates multiple rows in the final result. Right now, what comes to mind is using CTEs: separating the multiple columns into multiple CTEs and querying from all of them. Or even creating a temp table/table variable, querying the sums of each column, and storing/updating them.
So, any idea how to achieve my result without using multiple CTEs or temp tables?
Thank You
You want to group together rows that belong to the same EmployeeID, so this implies aggregation rather than window functions:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
COALESCE(SUM(OLC.OLC), 0) AS TotalOLC,
COALESCE(SUM(OLC.Trip), 0) AS TotalTrip
FROM #Overtime OT
LEFT JOIN #OLC OLC ON OLC.EmployeeId = OT.EmployeeId
GROUP BY OT.EmployeeId
I also don't see the point of the join condition on the dates, so I removed it. Finally, you can use coalesce() to return 0 for rows that have no OLC.
Demo on DB Fiddle:
EmployeeId | TotalMargin | TotalPoint | TotalOLC | TotalTrip
---------: | :---------- | ---------: | -------: | --------:
1 | 08:00:00 | 24 | 4 | 0
2 | 04:00:00 | 12 | 0 | 0
3 | 04:00:00 | 12 | 6 | 12
You've decided to use SUM OVER, but you're experiencing the "problem" of multiple rows... that's what a SUM OVER does. You can think of OVER(PARTITION BY ...) as doing a GROUP BY that is automatically joined back to the driving table, so you end up with all the rows from the driving table together with the repeated results of the summation.
Here is a simple data set:
ProductID, Price
1, 100
1, 200
2, 300
2, 400
Here are some queries and results:
--perform a basic group and sum
SELECT ProductID, SUM(Price) S FROM x GROUP BY ProductID
1, 300
2, 700
--perform basic group/sum and join it back to the main table
SELECT ProductID, Price, S
FROM
x
INNER JOIN
(SELECT ProductID, SUM(Price) s FROM x GROUP BY ProductID) y
ON x.ProductID = y.ProductID
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
--perform a sum over, the partition here being the same as the earlier group
SELECT ProductID, Price, SUM(Price) OVER(PARTITION BY ProductID) FROM x
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
You can see the latter two produce the same result: extra rows with the total appended. It may help you understand simple window functions if you imagine that this is what the db does internally: it takes the PARTITION BY, does a subquery GROUP BY with it, and joins the results back on whatever columns were in the partition.
It looks like what you really want is a simple group:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin))), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
SUM(OLC.OLC) AS TotalOLC,
SUM(OLC.Trip) AS TotalTrip
FROM #Overtime OT
LEFT JOIN #OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
GROUP BY OT.EmployeeID
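Depending on the data, the date-equality join can drop OLC rows whose dates never match an overtime date (employee 3 here), while joining on employee alone multiplies the OLC values by the number of overtime rows. A hedged sketch, not taken from either answer, that avoids both by pre-aggregating the OLC side before joining:
SELECT
    OT.EmployeeId,
    CONVERT(TIME, DATEADD(MS, SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)), '00:00:00.000')) AS TotalMargin,
    SUM(OT.OvertimePoint) AS TotalPoint,
    -- O has at most one row per employee, so MAX just picks that row's total
    COALESCE(MAX(O.TotalOLC), 0) AS TotalOLC,
    COALESCE(MAX(O.TotalTrip), 0) AS TotalTrip
FROM #Overtime OT
LEFT JOIN
(
    SELECT EmployeeId, SUM(OLC) AS TotalOLC, SUM(Trip) AS TotalTrip
    FROM #OLC
    GROUP BY EmployeeId
) O ON O.EmployeeId = OT.EmployeeId
GROUP BY OT.EmployeeId;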

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say I have chosen a as my preferred Type. In the aforementioned example, I have three unique values of Day (viz. 1, 2, 3). For each unique value of Day which has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe which has all rows with the chosen Type and Day. In the example mentioned above, I will have two dataframes, which will look as below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of rows. So, I want to know the most efficient way in which I can realize the above-mentioned aim.
You can use the code below to generate the example given above.
from pyspark.sql import *

Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
# assemble the rows into the example dataframe
# (assumes an active SparkSession named spark)
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6,
                            stat7, stat8, stat9, stat10, stat11, stat12])
You can just use df.repartition("Type", "Day")
Docs for the same.
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
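# hypothetical usage sketch: run the validator over each partition
# (executor-side prints are only visible like this in local mode)
# df.repartition("user_id").rdd.foreachPartition(validate)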
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the problem below using Spark SQL / DataFrames:
For each postcode, find the customer that has had the highest number of previous accidents. In the case of a tie, meaning more than one customer has the same highest number of accidents, just return any one of them. For each of these selected customers, output the following columns: postcode, customer id, number of previous accidents.
I think you forgot to provide the data you mentioned in the image link, so I have created my own data set, taking your problem as a reference. You can use the below code snippet for reference, and you can replace the df data frame with your data set to add required columns such as id, etc.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()

Break down a table to pivot in columns (SQL, PySpark)

I'm working in a pyspark environment with python3.6 in AWS Glue. I have this table:
+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012|    1|   20|loop1|
|2012|    2|   30|loop1|
|2012|    1|   10|loop2|
|2012|    2|    5|loop2|
|2012|    1|   50|loop3|
|2012|    2|   60|loop3|
+----+-----+-----+-----+
And I need to get an output like:
year month total_loop1 total_loop2 total_loop3
2012 1 20 10 50
2012 2 30 5 60
The closest I have gotten is with this SQL code:
select a.year,a.month, a.total,b.total from test a
left join test b
on a.loop <> b.loop
and a.year = b.year and a.month=b.month
The output so far:
+----+-----+-----+-----+
|year|month|total|total|
+----+-----+-----+-----+
|2012|    1|   20|   10|
|2012|    1|   20|   50|
|2012|    1|   10|   20|
|2012|    1|   10|   50|
|2012|    1|   50|   20|
|2012|    1|   50|   10|
|2012|    2|   30|    5|
|2012|    2|   30|   60|
|2012|    2|    5|   30|
|2012|    2|    5|   60|
|2012|    2|   60|   30|
|2012|    2|   60|    5|
+----+-----+-----+-----+
How could I do it? Thanks so much.
Table Script and Sample data
CREATE TABLE [TableName](
[year] [nvarchar](50) NULL,
[month] [int] NULL,
[total] [int] NULL,
[loop] [nvarchar](50) NULL
)
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 20, N'loop1')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 30, N'loop1')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 10, N'loop2')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 5, N'loop2')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 50, N'loop3')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 60, N'loop3')
Using Pivot function...
SELECT *
FROM TableName
PIVOT(Max([total])
FOR [loop] IN ([loop1], [loop2], [loop3]) ) pvt
Online Demo: http://www.sqlfiddle.com/#!18/164a4/1/0
If you are looking for a dynamic solution, then try this... (Dynamic Pivot)
DECLARE #cols AS NVARCHAR(max) = Stuff((SELECT DISTINCT ',' + Quotename([loop])
FROM TableName
FOR xml path(''), type).value('.', 'NVARCHAR(MAX)'), 1, 1, '');
DECLARE #query AS NVARCHAR(max) = 'SELECT *
FROM TableName
PIVOT(Max([total])
FOR [loop] IN ('+ #cols +') ) pvt';
EXECUTE(#query)
Online Demo: http://www.sqlfiddle.com/#!18/164a4/3/0
Output
+------+-------+-------+-------+-------+
| year | month | loop1 | loop2 | loop3 |
+------+-------+-------+-------+-------+
| 2012 | 1 | 20 | 10 | 50 |
| 2012 | 2 | 30 | 5 | 60 |
+------+-------+-------+-------+-------+
You don't need to use a join; you can do conditional aggregation:
select year, month,
max(case when loop = 'loop1' then total end) loop1,
max(case when loop = 'loop2' then total end) loop2,
max(case when loop = 'loop3' then total end) loop3
from test a
group by year, month;
You can use PIVOT() to convert rows to columns:
SELECT
year,
MONTH,
p.loop1 AS 'total_loop1',
p.loop2 AS 'total_loop2',
p.loop3 AS 'total_loop3'
FROM
tablename
PIVOT
(MAX(total)
FOR loop IN ([loop1], [loop2], [loop3])
) AS p;

What's the default window frame for window functions?

Running the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sales = Seq(
    (0, 0, 0, 5),
    (1, 0, 1, 3),
    (2, 0, 2, 1),
    (3, 1, 0, 2),
    (4, 2, 0, 8),
    (5, 2, 2, 8))
  .toDF("id", "orderID", "prodID", "orderQty")
val orderedByID = Window.orderBy("id")
val totalQty = sum("orderQty").over(orderedByID).as("running_total")
val salesTotalQty = sales.select(col("*"), totalQty).orderBy("id")
salesTotalQty.show()
The result is:
+---+-------+------+--------+-------------+
| id|orderID|prodID|orderQty|running_total|
+---+-------+------+--------+-------------+
|  0|      0|     0|       5|            5|
|  1|      0|     1|       3|            8|
|  2|      0|     2|       1|            9|
|  3|      1|     0|       2|           11|
|  4|      2|     0|       8|           19|
|  5|      2|     2|       8|           27|
+---+-------+------+--------+-------------+
There is no window frame defined in the above code; it looks like the default window frame is rowsBetween(Window.unboundedPreceding, Window.currentRow).
I am not sure whether my understanding of the default window frame is correct.
From Spark Gotchas
The default frame specification depends on other aspects of a given window definition:
- if the ORDER BY clause is specified and the function accepts the frame specification, then the frame specification is defined by RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
- otherwise the frame specification is defined by ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
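So the running total above actually comes from a RANGE frame, not a ROWS one; over a unique id column the two happen to agree. A hedged Spark SQL sketch that makes both frames explicit (assuming the dataframe has been registered as a temp view named sales):
SELECT id, orderID, prodID, orderQty,
       -- default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       SUM(orderQty) OVER (ORDER BY id) AS running_total_default,
       -- the same total with the frame spelled out as a ROWS frame
       SUM(orderQty) OVER (ORDER BY id
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_rows
FROM sales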