Break down a table to pivot in columns (SQL,PYSPARK) - sql

I'm working in a PySpark environment with Python 3.6 in AWS Glue. I have this table:
+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012| 1| 20|loop1|
|2012| 2| 30|loop1|
|2012| 1| 10|loop2|
|2012| 2| 5|loop2|
|2012| 1| 50|loop3|
|2012| 2| 60|loop3|
+----+-----+-----+-----+
And I need to get an output like:
year month total_loop1 total_loop2 total_loop3
2012 1 20 10 50
2012 2 30 5 60
The closest I have gotten is with this SQL code:
select a.year,a.month, a.total,b.total from test a
left join test b
on a.loop <> b.loop
and a.year = b.year and a.month=b.month
The output so far:
+----+-----+-----+-----+
|year|month|total|total|
+----+-----+-----+-----+
|2012| 1| 20| 10|
|2012| 1| 20| 50|
|2012| 1| 10| 20|
|2012| 1| 10| 50|
|2012| 1| 50| 20|
|2012| 1| 50| 10|
|2012| 2| 30| 5|
|2012| 2| 30| 60|
|2012| 2| 5| 30|
|2012| 2| 5| 60|
|2012| 2| 60| 30|
|2012| 2| 60| 5|
+----+-----+-----+-----+
How can I do this? Thanks so much.

Table Script and Sample data
CREATE TABLE [TableName](
[year] [nvarchar](50) NULL,
[month] [int] NULL,
[total] [int] NULL,
[loop] [nvarchar](50) NULL
)
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 20, N'loop1')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 30, N'loop1')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 10, N'loop2')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 5, N'loop2')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 1, 50, N'loop3')
INSERT [TableName] ([year], [month], [total], [loop]) VALUES (N'2012', 2, 60, N'loop3')
Using Pivot function...
SELECT *
FROM TableName
PIVOT(Max([total])
FOR [loop] IN ([loop1], [loop2], [loop3]) ) pvt
Online Demo: http://www.sqlfiddle.com/#!18/164a4/1/0
If you are looking for a dynamic solution, then try this... (Dynamic Pivot)
DECLARE @cols AS NVARCHAR(max) = Stuff((SELECT DISTINCT ',' + Quotename([loop])
FROM TableName
FOR xml path(''), type).value('.', 'NVARCHAR(MAX)'), 1, 1, '');
DECLARE @query AS NVARCHAR(max) = 'SELECT *
FROM TableName
PIVOT(Max([total])
FOR [loop] IN ('+ @cols +') ) pvt';
EXECUTE(@query)
Online Demo: http://www.sqlfiddle.com/#!18/164a4/3/0
Output
+------+-------+-------+-------+-------+
| year | month | loop1 | loop2 | loop3 |
+------+-------+-------+-------+-------+
| 2012 | 1 | 20 | 10 | 50 |
| 2012 | 2 | 30 | 5 | 60 |
+------+-------+-------+-------+-------+

You don't need a join; you can use conditional aggregation:
select year, month,
max(case when loop = 'loop1' then total end) loop1,
max(case when loop = 'loop2' then total end) loop2,
max(case when loop = 'loop3' then total end) loop3
from test a
group by year, month;
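The conditional aggregation above can also be tried outside SQL Server. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine and the table name test from the question. The total_loopN aliases follow the desired output, and "loop" is quoted only as a precaution:

```python
import sqlite3

# In-memory database as a stand-in engine (an assumption, not the asker's Glue setup).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test (year INTEGER, month INTEGER, total INTEGER, "loop" TEXT)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [
        (2012, 1, 20, "loop1"), (2012, 2, 30, "loop1"),
        (2012, 1, 10, "loop2"), (2012, 2, 5, "loop2"),
        (2012, 1, 50, "loop3"), (2012, 2, 60, "loop3"),
    ],
)

# One CASE per target column; MAX picks the single non-NULL value in each group.
rows = conn.execute("""
    SELECT year, month,
           MAX(CASE WHEN "loop" = 'loop1' THEN total END) AS total_loop1,
           MAX(CASE WHEN "loop" = 'loop2' THEN total END) AS total_loop2,
           MAX(CASE WHEN "loop" = 'loop3' THEN total END) AS total_loop3
    FROM test
    GROUP BY year, month
    ORDER BY year, month
""").fetchall()

for row in rows:
    print(row)
```

Because the query relies only on CASE and GROUP BY, the same SELECT should in principle also be accepted by Spark SQL, although that has not been verified here.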

You can use PIVOT() to convert rows to columns:
SELECT
year,
MONTH,
p.loop1 AS 'total_loop1',
p.loop2 AS 'total_loop2',
p.loop3 AS 'total_loop3'
FROM
tablename
PIVOT
(MAX(total)
FOR loop IN ([loop1], [loop2], [loop3])
) AS p;

Related

SQL/Hive MERGE INTO -- Why won't this work?

I am trying to follow an example posed by databricks but I am unable to understand why the smallest example won't work as I am expecting it to.
The ultimate goal of this process is for an idempotent merge; run it many times and the operation will only proceed once, the first time.
Some data
-- 1 row is new, 1 row is what we have already
create table #baseset
(
Date varchar(30),
ID varchar(30),
State varchar(30),
Count varchar(30)
)
insert into #baseset values('2/7/2023', 'A', 'A', null)
insert into #baseset values('2/6/2023', 'A', 'A', null)
create table #changeset
(
Date varchar(30),
ID varchar(30),
State varchar(30),
Count varchar(30)
)
insert into #changeset values('2/8/2023', 'A', 'A', null)
insert into #changeset values('2/7/2023', 'A', 'A', null)
My commands and results
I am expecting this to just return two rows:
SELECT
-- UPDATE
cs.ID as MERGEKEY,
cs.*
FROM
changeset cs
UNION ALL
SELECT
-- INSERT
NULL as MERGEKEY,
cs.*
FROM
changeset cs
JOIN baseset c ON c.ID = cs.ID
WHERE
NOT (
c.Date = cs.Date
AND c.State = cs.State
AND c.Count = cs.Count
)
+--------+--------+---+-----+-----+
|MERGEKEY| Date| ID|State|Count|
+--------+--------+---+-----+-----+
| A|2/8/2023| A| A| null|
| null|2/8/2023| A| A| null|
+--------+--------+---+-----+-----+
but instead I am returning:
+--------+--------+---+-----+-----+
|MERGEKEY| Date| ID|State|Count|
+--------+--------+---+-----+-----+
| A|2/8/2023| A| A| null|
| A|2/7/2023| A| A| null|
| null|2/7/2023| A| A| null|
| null|2/8/2023| A| A| null|
| null|2/8/2023| A| A| null|
+--------+--------+---+-----+-----+
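A small reproduction can show where the five rows come from. This is a sketch that assumes Python's built-in sqlite3 module as a stand-in for the Spark/Hive engine, and that the tables are named baseset and changeset as in the query. The first SELECT has no filter, so it returns both changeset rows. In the second, the join is on ID only, so each changeset row pairs with both baseset rows. A pair is dropped only when the dates match, because then c.Count = cs.Count compares NULL with NULL and the whole NOT (...) evaluates to NULL rather than TRUE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("baseset", "changeset"):
    conn.execute(f'CREATE TABLE {name} ("Date" TEXT, ID TEXT, State TEXT, "Count" TEXT)')
conn.executemany("INSERT INTO baseset VALUES (?, ?, ?, ?)",
                 [("2/7/2023", "A", "A", None), ("2/6/2023", "A", "A", None)])
conn.executemany("INSERT INTO changeset VALUES (?, ?, ?, ?)",
                 [("2/8/2023", "A", "A", None), ("2/7/2023", "A", "A", None)])

# First branch of the UNION ALL: unfiltered, so every changeset row appears.
update_rows = conn.execute('SELECT cs.ID, cs."Date" FROM changeset cs').fetchall()

# Second branch: show which (changeset, baseset) pairs survive the WHERE clause.
insert_rows = conn.execute("""
    SELECT cs."Date" AS cs_date, c."Date" AS base_date
    FROM changeset cs
    JOIN baseset c ON c.ID = cs.ID
    WHERE NOT (c."Date" = cs."Date" AND c.State = cs.State AND c."Count" = cs."Count")
    ORDER BY cs_date, base_date
""").fetchall()

print(len(update_rows), "rows from the UPDATE branch")
for pair in insert_rows:
    print("INSERT branch keeps changeset", pair[0], "paired with baseset", pair[1])
```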

Sum multiple column with PARTITION from single table

I have a question, it seems simple but I can't figure it out.
I have a sample table like this:
Overtime Table (OT)
+----------+------------+----------+-------------+
|EmployeeId|OvertimeDate|HourMargin|OvertimePoint|
+----------+------------+----------+-------------+
| 1| 2020-07-01| 05:00| 15|
| 1| 2020-07-02| 03:00| 9|
| 2| 2020-07-01| 01:00| 3|
| 2| 2020-07-03| 03:00| 9|
| 3| 2020-07-06| 03:00| 9|
| 3| 2020-07-07| 01:00| 3|
+----------+------------+----------+-------------+
OLC Table (OLC)
+----------+------------+-----+------+
|EmployeeId| OLCDate | OLC | Trip |
+----------+------------+-----+------+
| 1| 2020-07-01| 2| 0|
| 3| 2020-07-13| 3| 6|
+----------+------------+-----+------+
Based on those tables, I want to calculate the totals of OT.HourMargin, OT.OvertimePoint, OLC.OLC, and OLC.Trip, with a final result like this:
Result
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
| 1| 08:00| 24| 2| 0|
| 2| 04:00| 12| 0| 0|
| 3| 04:00| 12| 3| 6|
+----------+-----------+----------+--------+----------+
Here is the query I tried in order to get that result:
DECLARE @Overtime TABLE (
EmployeeId INT,
OvertimeDate DATE,
HourMargin TIME,
OvertimePoint INT
)
DECLARE @OLC TABLE (
EmployeeId INT,
OLCDate DATE,
OLC INT,
Trip INT
)
INSERT INTO @Overtime VALUES (1, '2020-07-01', '05:00:00', 15)
INSERT INTO @Overtime VALUES (1, '2020-07-02', '03:00:00', 9)
INSERT INTO @Overtime VALUES (2, '2020-07-01', '01:00:00', 3)
INSERT INTO @Overtime VALUES (2, '2020-07-03', '03:00:00', 9)
INSERT INTO @Overtime VALUES (3, '2020-07-06', '03:00:00', 9)
INSERT INTO @Overtime VALUES (3, '2020-07-07', '01:00:00', 3)
INSERT INTO @OLC VALUES (1, '2020-07-01', 2, 0)
INSERT INTO @OLC VALUES (3, '2020-07-13', 3, 6)
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)) OVER (PARTITION BY OT.EmployeeId)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) OVER (PARTITION BY OT.EmployeeId) AS TotalPoint,
SUM(OLC.OLC) OVER (PARTITION BY OLC.EmployeeId) AS TotalOLC,
SUM(OLC.Trip) OVER (PARTITION BY OLC.EmployeeId) AS TotalTrip
FROM
@Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
ORDER BY
EmployeeId
Here is the result from my query:
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
| 1| 08:00| 24| NULL| NULL|
| 1| 08:00| 24| 2| 0|
| 2| 04:00| 12| NULL| NULL|
| 2| 04:00| 12| NULL| NULL|
| 3| 04:00| 12| NULL| NULL|
| 3| 04:00| 12| NULL| NULL|
+----------+-----------+----------+--------+----------+
It seems that when I try to SUM multiple columns from a single table, I get multiple rows in the final result. What comes to mind right now is splitting the columns across several CTEs and querying from all of them, or creating a temp table/table variable, computing each sum separately, and storing/updating it there.
So, any idea how to achieve my result without using multiple CTEs or temp tables?
Thank you.
You want to group together rows that belong to the same EmployeeID, so this implies aggregation rather than window functions:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
COALESCE(SUM(OLC.OLC), 0) AS TotalOLC,
COALESCE(SUM(OLC.Trip), 0) AS TotalTrip
FROM @Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
GROUP BY OT.EmployeeId
I also don't see the point for the join condition on the dates, so I removed it. Finally, you can use coalesce() to return 0 for rows that have no OLC.
Demo on DB Fiddle:
EmployeeId | TotalMargin | TotalPoint | TotalOLC | TotalTrip
---------: | :---------- | ---------: | -------: | --------:
1 | 08:00:00 | 24 | 4 | 0
2 | 04:00:00 | 12 | 0 | 0
3 | 04:00:00 | 12 | 6 | 12
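Note that the demo output above shows TotalOLC = 4 for employee 1 and TotalOLC = 6, TotalTrip = 12 for employee 3, not the 2 and 3 / 6 asked for in the question. Without the date condition, each OLC row joins to every overtime row of that employee and is summed more than once. One way around this is to aggregate each table on its own before joining. This is a minimal sketch, assuming Python's sqlite3 as a stand-in engine and plain table names Overtime and OLC in place of the table variables; HourMargin is left out to keep the example short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Overtime (EmployeeId INTEGER, OvertimeDate TEXT, OvertimePoint INTEGER)")
conn.execute("CREATE TABLE OLC (EmployeeId INTEGER, OLCDate TEXT, OLC INTEGER, Trip INTEGER)")
conn.executemany("INSERT INTO Overtime VALUES (?, ?, ?)", [
    (1, "2020-07-01", 15), (1, "2020-07-02", 9),
    (2, "2020-07-01", 3), (2, "2020-07-03", 9),
    (3, "2020-07-06", 9), (3, "2020-07-07", 3),
])
conn.executemany("INSERT INTO OLC VALUES (?, ?, ?, ?)", [
    (1, "2020-07-01", 2, 0), (3, "2020-07-13", 3, 6),
])

# Each side is reduced to one row per employee first, so the join cannot multiply rows.
totals = conn.execute("""
    SELECT ot.EmployeeId,
           ot.TotalPoint,
           COALESCE(o.TotalOLC, 0)  AS TotalOLC,
           COALESCE(o.TotalTrip, 0) AS TotalTrip
    FROM (SELECT EmployeeId, SUM(OvertimePoint) AS TotalPoint
          FROM Overtime
          GROUP BY EmployeeId) ot
    LEFT JOIN (SELECT EmployeeId, SUM(OLC) AS TotalOLC, SUM(Trip) AS TotalTrip
               FROM OLC
               GROUP BY EmployeeId) o
           ON o.EmployeeId = ot.EmployeeId
    ORDER BY ot.EmployeeId
""").fetchall()

for row in totals:
    print(row)
```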
You've decided to use SUM OVER, but you're experiencing the "problem" of multiple rows. That is what a sum over does: you can think of OVER(PARTITION BY ...) as a GROUP BY whose result is automatically joined back to the driving table, so you end up with all the rows from the driving table together with repeated results of the summation.
Here is a simple data set:
ProductID, Price
1, 100
1, 200
2, 300
2, 400
Here are some queries and results:
--perform a basic group and sum
SELECT ProductID, SUM(Price) S FROM x GROUP BY ProductID
1, 300
2, 700
--perform basic group/sum and join it back to the main table
SELECT x.ProductID, x.Price, y.S
FROM
x
INNER JOIN
(SELECT ProductID, SUM(Price) S FROM x GROUP BY ProductID) y
ON x.ProductID = y.ProductID
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
--perform a sum over, the partition here being the same as the earlier group
SELECT ProductID, Price, SUM(Price) OVER(PARTITION BY ProductID) FROM x
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
You can see the latter two produce the same result: every original row, with the total appended. It may help you understand simple window functions if you imagine that this is what the db does internally: it takes the "partition by", runs a grouped subquery on it, and joins the results back on whatever columns were in the partition.
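That equivalence can be checked directly. This is a minimal sketch, assuming Python's sqlite3 module with an SQLite build that supports window functions (3.25 or later), and the four-row table x from the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (ProductID INTEGER, Price INTEGER)")
conn.executemany("INSERT INTO x VALUES (?, ?)",
                 [(1, 100), (1, 200), (2, 300), (2, 400)])

# Group-and-sum, joined back to the detail rows.
joined = conn.execute("""
    SELECT x.ProductID, x.Price, y.S
    FROM x
    INNER JOIN (SELECT ProductID, SUM(Price) AS S FROM x GROUP BY ProductID) y
            ON x.ProductID = y.ProductID
    ORDER BY x.ProductID, x.Price
""").fetchall()

# The same totals expressed as a window function over the same partition.
windowed = conn.execute("""
    SELECT ProductID, Price, SUM(Price) OVER (PARTITION BY ProductID) AS S
    FROM x
    ORDER BY ProductID, Price
""").fetchall()

print(joined == windowed)
```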
It looks like what you really want is a simple group:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin))), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
SUM(OLC.OLC) AS TotalOLC,
SUM(OLC.Trip) AS TotalTrip
FROM @Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
GROUP BY OT.EmployeeID

SQL Server create Unpivot table with where condition

Here is my sample table
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Id| CompanyName|part1_sales_amount|part2_sales_amount|part3_sales_amount|part1_sales_quantity|part2_sales_quantity|part3_sales_quantity|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| 1| FastCarsCo| 1| 2| 3| 4| 5| 6|
| 2|TastyCakeShop| 4| 5| 6| 4| 5| 6|
| 3| KidsToys| 7| 8| 9| 7| 8| 9|
| 4| FruitStall| 10| 11| 12| 10| 11| 12|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
Here is the output table that I want:
+---+-------------+------------------+------------------+------------------+
| Id| CompanyName|Account |amount |quantity |
+---+-------------+------------------+------------------+------------------+
| 1| FastCarsCo| part1_sales| 1| 1|
| 1| FastCarsCo| part2_sales| 2| 2|
| 1| FastCarsCo| part3_sales| 3| 3|
| 2|TastyCakeShop| part1_sales| 4| 4|
| 2|TastyCakeShop| part2_sales| 5| 5|
| 2|TastyCakeShop| part3_sales| 6| 6|
| 3| KidsToys| part1_sales| 7| 7|
| 3| KidsToys| part2_sales| 8| 8|
| 3| KidsToys| part3_sales| 9| 9|
| 4| FruitStall| part1_sales| 10| 10|
| 4| FruitStall| part2_sales| 11| 11|
| 4| FruitStall| part3_sales| 12| 12|
+---+-------------+------------------+------------------+------------------+
What I have tried so far:
SELECT
Id,
CompanyName,
REPLACE ( acc , '_amount' , '' ) AS Account,
amount,
quantity
FROM
(
SELECT Id, CompanyName, part1_sales_amount ,part2_sales_amount ,part3_sales_amount ,part1_sales_quantity ,part2_sales_quantity ,part3_sales_quantity
FROM privot
) src
UNPIVOT
(
amount FOR acc IN (part1_sales_amount ,part2_sales_amount ,part3_sales_amount )
) pvt1
UNPIVOT
(
quantity FOR acc1 IN (part1_sales_quantity, part2_sales_quantity, part3_sales_quantity )
) pvt2
It gives some results, but there also seem to be unexpected records (like a cross join). So my final step is the WHERE clause: what should I write in it? I tried many things, but none is correct.
Note: In my real database there are almost 200 columns like part1_sales_amount and part1_sales_quantity.
Any help is appreciated.
You can use apply :
select t.id, t.companyname, tt.amount, tt.qty
from table t cross apply
( values (t.part1_sales_amount, t.part1_sales_quantity),
(t.part2_sales_amount, t.part2_sales_quantity),
(t.part3_sales_amount, t.part3_sales_quantity),
. . .
) tt(amount, qty);
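With around 200 such columns, typing every VALUES row by hand is tedious, so one option is to generate the statement text. This is a minimal sketch in Python, where build_unpivot_sql is a hypothetical helper, privot is the table name from the question, and the partN_sales_amount / partN_sales_quantity naming pattern is assumed to hold for every part. It also emits an Account label per row, which the hand-written query above leaves out:

```python
def build_unpivot_sql(table_name: str, part_count: int) -> str:
    """Return a CROSS APPLY (VALUES ...) statement with one row per part.

    Hypothetical helper: table_name is interpolated as-is, so it should
    only ever be a trusted, hard-coded identifier.
    """
    value_rows = [
        f"        ('part{n}_sales', t.part{n}_sales_amount, t.part{n}_sales_quantity)"
        for n in range(1, part_count + 1)
    ]
    return (
        "select t.Id, t.CompanyName, tt.Account, tt.amount, tt.quantity\n"
        f"from {table_name} t\n"
        "cross apply (\n"
        "    values\n"
        + ",\n".join(value_rows)
        + "\n) tt(Account, amount, quantity);"
    )


# Three parts, as in the sample table; the real table would use a larger count.
sql = build_unpivot_sql("privot", 3)
print(sql)
```

The generated text can then be pasted into a query window or executed as dynamic SQL.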
SELECT Id, CompanyName, account, amount, quantity
FROM MyTable
CROSS APPLY (
SELECT account = 'part1_sales_amount', amount = part1_sales_amount, quantity = part1_sales_quantity
UNION ALL
SELECT account = 'part2_sales_amount', amount = part2_sales_amount, quantity = part2_sales_quantity
UNION ALL
SELECT account = 'part3_sales_amount', amount = part3_sales_amount, quantity = part3_sales_quantity
) AS AnotherData
Single unpivot and choose the corresponding quantity column:
declare @privot table
(
id int,
CompanyName varchar(20),
part1_sales_amount money,
part2_sales_amount money,
part3_sales_amount money,
part4_sales_amount money,
part5_sales_amount money,
part1_sales_quantity int,
part2_sales_quantity int,
part3_sales_quantity int,
part4_sales_quantity int,
part5_sales_quantity int
);
insert into @privot
(
Id, CompanyName,
part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount,
part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity
)
values
(1, 'FastCarsCo', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
(2, 'TastyCakeShop', 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
(3, 'KidsToys', 11, 21, 31, 41, 51, 61, 71, 81, 91, 101),
(4, 'FruitStall', 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000);
select
Id, CompanyName, replace(acc, '_amount', '') as acc, amount,
quantity=choose(/*try_cast ??*/replace(left(acc, charindex('_', acc)-1), 'part', ''), /*quantity columns*/part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity)
FROM
(
SELECT *
--Id, CompanyName,
--part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount,
--part1_sales_quantity ,part2_sales_quantity ,part3_sales_quantity , part4_sales_quantity, part5_sales_quantity
FROM @privot
) src
UNPIVOT
(
amount FOR acc IN (/*amount columns*/part1_sales_amount ,part2_sales_amount ,part3_sales_amount, part4_sales_amount, part5_sales_amount )
) pvt1;

Vertica: repeat category from previous period if it not listed in current

I'm trying to build a sort of running total over a table with gaps. I have a period, a category and a value, and I want to list all categories used in the current and previous periods for a given storage_id, even if there is no value for that category in the current period.
My data:
period|storage_id|category|value|
------|----------|--------|-----|
1| 1|a |foo |
2| 1|b |bar |
3| 1|a |bar |
3| 1|b |foo |
1| 2|a |foo |
2| 2|b |bar |
4| 2|c |foo |
My goal:
period|storage_id|category|value|
------|----------|--------|-----|
1| 1|a |foo |
2| 1|a |NULL |
2| 1|b |bar |
3| 1|a |bar |
3| 1|b |foo |
1| 2|a |foo |
2| 2|a |NULL |
2| 2|b |bar |
4| 2|a |NULL |
4| 2|b |NULL |
4| 2|c |foo |
I managed to do it using a temporary table and 2 self-joins. Is there a more efficient way to do that, e.g., using window functions?
Reproducible example:
CREATE LOCAL TEMPORARY TABLE tt (
storage_id int
, category varchar(255)
, value varchar(255)
, period int
) ON COMMIT PRESERVE ROWS;
INSERT INTO tt
SELECT 1, 'a', 'foo', 1 UNION ALL
SELECT 1, 'b', 'bar', 2 UNION ALL
SELECT 1, 'a', 'bar', 3 UNION ALL
SELECT 1, 'b', 'foo', 3 UNION ALL
SELECT 2, 'a', 'foo', 1 UNION ALL
SELECT 2, 'b', 'bar', 2 UNION ALL
SELECT 2, 'c', 'foo', 4
;
My imperfect solution:
WITH
cat as (
SELECT
t1.category
, t1.storage_id
, t2.period
FROM
tt as t1 join tt as t2
on t1.storage_id = t2.storage_id
and t1.period <= t2.period
GROUP BY
t1.category
, t1.storage_id
, t2.period
)
SELECT
cat.period
, cat.storage_id
, cat.category
, tt.value
FROM cat
LEFT JOIN tt
ON tt.category = cat.category
and tt.storage_id = cat.storage_id
and tt.period = cat.period
ORDER BY
storage_id, period;
11 rows, 178 ms
I want to list all categories used in current and previous periods even if there is no value of that category in current period.
I don't see how your result set illustrates this, because you have not carried all results to the end.
For the problem you describe, the following should do what you want:
select p.period, sc.storage_id, sc.category, tt.value
from (select distinct period from tt) p join
(select storage_id, category, min(period) as first_period
from tt
group by 1, 2
) sc
on p.period >= sc.first_period left join
tt
on tt.period = p.period and
tt.storage_id = sc.storage_id and
tt.category = sc.category
order by p.period, sc.storage_id, sc.category;
Here is a db<>fiddle.
I can't figure out the actual logic that produces the result set you want.
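One reading that does reproduce the goal table is this: for each storage_id, take only the periods that occur for that storage_id, and list every category whose first period is not later. Below is a sketch of that variant of the query above, run with Python's sqlite3 as a stand-in for Vertica (an assumption; the SQL itself uses nothing engine-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tt (storage_id INTEGER, category TEXT, value TEXT, period INTEGER)")
conn.executemany("INSERT INTO tt VALUES (?, ?, ?, ?)", [
    (1, "a", "foo", 1), (1, "b", "bar", 2), (1, "a", "bar", 3), (1, "b", "foo", 3),
    (2, "a", "foo", 1), (2, "b", "bar", 2), (2, "c", "foo", 4),
])

# p:  the periods that exist per storage_id (instead of all periods globally)
# sc: the first period in which each (storage_id, category) appears
rows = conn.execute("""
    SELECT p.period, sc.storage_id, sc.category, tt.value
    FROM (SELECT DISTINCT storage_id, period FROM tt) p
    JOIN (SELECT storage_id, category, MIN(period) AS first_period
          FROM tt
          GROUP BY storage_id, category) sc
      ON sc.storage_id = p.storage_id
     AND p.period >= sc.first_period
    LEFT JOIN tt
      ON tt.period = p.period
     AND tt.storage_id = sc.storage_id
     AND tt.category = sc.category
    ORDER BY sc.storage_id, p.period, sc.category
""").fetchall()

for row in rows:
    print(row)
```

On the sample data this returns the 11 rows of the goal table, with NULL where a category has no value in a period.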

SQL - Pivot or Unpivot?

Another time, another problem. I have the following table:
|assemb.|Repl_1|Repl_2|Repl_3|Repl_4|Repl_5|Amount_1|Amount_2|Amount_3|Amount_4|Amount_5|
|---------------------------------------------------------------------------------------|
|4711001|111000|222000|333000|444000|555000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
|4711002|222000|333000|444000|555000|666000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
And here what I need:
|Article|Amount|
|--------------|
| 111000| 1|
|--------------|
| 222000| 2|
|--------------|
| 333000| 2|
|--------------|
| 444000| 2|
|--------------|
| 555000| 2|
|--------------|
| 666000| 1|
|--------------|
Repl_1 to Repl_10 are replacement articles of the assembly. I can have n assemblies with up to 10 replacement articles each. In the end I need an overview of all articles with their amounts across all assemblies.
THX.
Best greetz
Vegeta
This is probably the quickest way of achieving it using UNION ALL. However, I'd recommend normalising your table
SELECT Article, SUM(Amount) AS Amount FROM (
SELECT Repl_1 AS Article, SUM(Amount_1) AS Amount FROM #Test GROUP BY Repl_1
UNION ALL
SELECT Repl_2 AS Article, SUM(Amount_2) AS Amount FROM #Test GROUP BY Repl_2
UNION ALL
SELECT Repl_3 AS Article, SUM(Amount_3) AS Amount FROM #Test GROUP BY Repl_3
UNION ALL
SELECT Repl_4 AS Article, SUM(Amount_4) AS Amount FROM #Test GROUP BY Repl_4
UNION ALL
SELECT Repl_5 AS Article, SUM(Amount_5) AS Amount FROM #Test GROUP BY Repl_5
) tbl GROUP BY Article
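The UNION ALL approach can be tried on the sample rows. This is a minimal sketch, assuming Python's sqlite3 as a stand-in engine and a plain table named Test in place of the temp table. The branches are generated in a loop and simplified to plain selects, with the single outer GROUP BY doing the summing, so extending to Repl_10 only means changing the range:

```python
import sqlite3

PAIRS = range(1, 6)  # Repl_1..Repl_5 / Amount_1..Amount_5, as in the sample table

conn = sqlite3.connect(":memory:")
columns = (["assembly INTEGER"]
           + [f"Repl_{i} INTEGER" for i in PAIRS]
           + [f"Amount_{i} INTEGER" for i in PAIRS])
conn.execute(f"CREATE TABLE Test ({', '.join(columns)})")

sample = [
    (4711001, 111000, 222000, 333000, 444000, 555000, 1, 1, 1, 1, 1),
    (4711002, 222000, 333000, 444000, 555000, 666000, 1, 1, 1, 1, 1),
]
placeholders = ", ".join("?" * len(columns))
conn.executemany(f"INSERT INTO Test VALUES ({placeholders})", sample)

# One SELECT per Repl_n/Amount_n pair, stacked with UNION ALL, then summed per article.
branches = " UNION ALL ".join(
    f"SELECT Repl_{i} AS Article, Amount_{i} AS Amount FROM Test" for i in PAIRS
)
overview = conn.execute(
    f"SELECT Article, SUM(Amount) AS Amount FROM ({branches}) tbl "
    "GROUP BY Article ORDER BY Article"
).fetchall()

for article, amount in overview:
    print(article, amount)
```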