SQL/Hive MERGE INTO -- Why won't this work?

I am trying to follow an example posted by Databricks, but I can't understand why even the smallest example doesn't work the way I expect.
The ultimate goal is an idempotent merge: run it many times, and the operation only takes effect the first time.
Some data
-- 1 row is new, 1 row is what we have already
create table #baseset
(
Date varchar(30),
ID varchar(30),
State varchar(30),
Count varchar(30)
)
insert into #baseset values('2/7/2023', 'A', 'A', null)
insert into #baseset values('2/6/2023', 'A', 'A', null)
create table #changeset
(
Date varchar(30),
ID varchar(30),
State varchar(30),
Count varchar(30)
)
insert into #changeset values('2/8/2023', 'A', 'A', null)
insert into #changeset values('2/7/2023', 'A', 'A', null)
My commands and results
I am expecting this to just return two rows:
SELECT
-- UPDATE
cs.ID as MERGEKEY,
cs.*
FROM
changeset cs
UNION ALL
SELECT
-- INSERT
NULL as MERGEKEY,
cs.*
FROM
changeset cs
JOIN baseset c ON c.ID = cs.ID
WHERE
NOT (
c.Date = cs.Date
AND c.State = cs.State
AND c.Count = cs.Count
)
+--------+--------+---+-----+-----+
|MERGEKEY| Date| ID|State|Count|
+--------+--------+---+-----+-----+
| A|2/8/2023| A| A| null|
| null|2/8/2023| A| A| null|
+--------+--------+---+-----+-----+
but instead I am getting:
+--------+--------+---+-----+-----+
|MERGEKEY| Date| ID|State|Count|
+--------+--------+---+-----+-----+
| A|2/8/2023| A| A| null|
| A|2/7/2023| A| A| null|
| null|2/7/2023| A| A| null|
| null|2/8/2023| A| A| null|
| null|2/8/2023| A| A| null|
+--------+--------+---+-----+-----+
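One way to see where the five rows come from is to reproduce the INSERT branch on its own. The following is only a sketch using Python's built-in sqlite3 module, on the assumption that SQLite's NULL handling (three-valued logic) matches the engine in question; the columns Date and Count are renamed to the hypothetical dt and cnt to avoid keyword-like names. The join on ID alone pairs every changeset row with every baseset row, and NULL = NULL is unknown rather than true, so the identical 2/7 pair is dropped while the three mismatched pairs are kept. The UPDATE branch has no filter at all and contributes both changeset rows, which gives 2 + 3 = 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baseset   (dt TEXT, ID TEXT, State TEXT, cnt TEXT);
CREATE TABLE changeset (dt TEXT, ID TEXT, State TEXT, cnt TEXT);
INSERT INTO baseset   VALUES ('2/7/2023', 'A', 'A', NULL), ('2/6/2023', 'A', 'A', NULL);
INSERT INTO changeset VALUES ('2/8/2023', 'A', 'A', NULL), ('2/7/2023', 'A', 'A', NULL);
""")

# Every (changeset, baseset) pair produced by joining on ID only,
# together with the value of the "all columns equal" predicate.
pairs = conn.execute("""
SELECT cs.dt, c.dt,
       (c.dt = cs.dt AND c.State = cs.State AND c.cnt = cs.cnt) AS all_equal
FROM changeset cs
JOIN baseset c ON c.ID = cs.ID
ORDER BY cs.dt, c.dt
""").fetchall()

# Rows that survive WHERE NOT (...): a NULL predicate is filtered out,
# a false predicate is kept.
kept = conn.execute("""
SELECT cs.dt
FROM changeset cs
JOIN baseset c ON c.ID = cs.ID
WHERE NOT (c.dt = cs.dt AND c.State = cs.State AND c.cnt = cs.cnt)
ORDER BY cs.dt
""").fetchall()

for row in pairs:
    print(row)
print(kept)
```

If the intent is "changeset rows that do not already exist in baseset", the comparison probably needs a null-safe equality and a join on the full key rather than on ID alone; that part is left open here because the question does not say what the key is.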

Related

How to use window function in Redshift?

I have 2 tables:
Product (product_id, source_id)
Source (source_id, priority)
Sometimes one product_id has several sources, and my task is to select the row with the minimum priority. For example, from:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 2| 9|
| 10| 4| 2|
| 20| 2| 9|
| 20| 4| 2|
| 30| 2| 9|
| 30| 4| 2|
correct result should be like:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 4| 2|
| 20| 4| 2|
| 30| 4| 2|
I am using query:
SELECT p.product_id, p.source_id, s.priority FROM Product p
INNER JOIN Source s on s.source_id = p.source_id
WHERE s.priority = (SELECT Min(s1.priority) OVER (PARTITION BY p.product_id) FROM Source s1)
but it returns the error "this type of correlated subquery pattern is not supported yet". As I understand it, I can't use this variant in Redshift. How should this be solved? Are there other ways?

You just need to move the window function out of the WHERE clause and into a derived table; the easiest flag for the minimum priority is the ROW_NUMBER() window function. As written, you're asking Redshift to rerun the window function for every row the filter tests, which is very inefficient on a clustered database. Try the following (untested):
SELECT product_id, source_id, priority
FROM (
SELECT p.product_id,
p.source_id,
s.priority,
ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY s.priority) AS row_num
FROM Product p
INNER JOIN Source s
ON s.source_id = p.source_id
) ranked
WHERE row_num = 1
Now the window function only runs once. You can also move the subquery to a CTE if that improves readability for your full case.
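To check the ranking idea against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module rather than Redshift (an assumption: SQLite 3.25 or later, which is when window functions were added). It joins first and ranks afterwards, since Source on its own has no product_id to partition by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (product_id INTEGER, source_id INTEGER);
CREATE TABLE Source  (source_id INTEGER, priority INTEGER);
INSERT INTO Product VALUES (10, 2), (10, 4), (20, 2), (20, 4), (30, 2), (30, 4);
INSERT INTO Source  VALUES (2, 9), (4, 2);
""")

# Rank the joined rows per product by priority, then keep rank 1.
rows = conn.execute("""
SELECT product_id, source_id, priority
FROM (
    SELECT p.product_id, p.source_id, s.priority,
           ROW_NUMBER() OVER (PARTITION BY p.product_id
                              ORDER BY s.priority) AS row_num
    FROM Product p
    JOIN Source s ON s.source_id = p.source_id
) ranked
WHERE row_num = 1
ORDER BY product_id
""").fetchall()

print(rows)
```

ROW_NUMBER() returns exactly one row per product even when two sources tie on priority; if ties should all be returned, RANK() would be the usual substitute.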

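I already found the best solution for my case. The window function has to be computed in a subquery, because its alias cannot be referenced in the WHERE clause of the same SELECT:
SELECT product_id, source_id, priority
FROM (
SELECT
p.product_id
, p.source_id
, s.priority
, Min(s.priority) OVER (PARTITION BY p.product_id) as min_priority
FROM Product p
INNER JOIN Source s
ON s.source_id = p.source_id
) t
WHERE priority = min_priority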

Sum multiple column with PARTITION from single table

I have a question, it seems simple but I can't figure it out.
I have a sample table like this:
Overtime Table (OT)
+----------+------------+----------+-------------+
|EmployeeId|OvertimeDate|HourMargin|OvertimePoint|
+----------+------------+----------+-------------+
| 1| 2020-07-01| 05:00| 15|
| 1| 2020-07-02| 03:00| 9|
| 2| 2020-07-01| 01:00| 3|
| 2| 2020-07-03| 03:00| 9|
| 3| 2020-07-06| 03:00| 9|
| 3| 2020-07-07| 01:00| 3|
+----------+------------+----------+-------------+
OLC Table (OLC)
+----------+------------+-----+------+
|EmployeeId| OLCDate | OLC | Trip |
+----------+------------+-----+------+
| 1| 2020-07-01| 2| 0|
| 3| 2020-07-13| 3| 6|
+----------+------------+-----+------+
So, based on those tables, I want to calculate the totals of OT.HourMargin, OT.OvertimePoint, OLC.OLC, and OLC.Trip, with a final result like this:
Result
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
| 1| 08:00| 24| 2| 0|
| 2| 04:00| 12| 0| 0|
| 3| 04:00| 12| 3| 6|
+----------+-----------+----------+--------+----------+
Here is the query I tried in order to get that result:
DECLARE @Overtime TABLE (
EmployeeId INT,
OvertimeDate DATE,
HourMargin TIME,
OvertimePoint INT
)
DECLARE @OLC TABLE (
EmployeeId INT,
OLCDate DATE,
OLC INT,
Trip INT
)
INSERT INTO @Overtime VALUES (1, '2020-07-01', '05:00:00', 15)
INSERT INTO @Overtime VALUES (1, '2020-07-02', '03:00:00', 9)
INSERT INTO @Overtime VALUES (2, '2020-07-01', '01:00:00', 3)
INSERT INTO @Overtime VALUES (2, '2020-07-03', '03:00:00', 9)
INSERT INTO @Overtime VALUES (3, '2020-07-06', '03:00:00', 9)
INSERT INTO @Overtime VALUES (3, '2020-07-07', '01:00:00', 3)
INSERT INTO @OLC VALUES (1, '2020-07-01', 2, 0)
INSERT INTO @OLC VALUES (3, '2020-07-13', 3, 6)
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)) OVER (PARTITION BY OT.EmployeeId)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) OVER (PARTITION BY OT.EmployeeId) AS TotalPoint,
SUM(OLC.OLC) OVER (PARTITION BY OLC.EmployeeId) AS TotalOLC,
SUM(OLC.Trip) OVER (PARTITION BY OLC.EmployeeId) AS TotalTrip
FROM
@Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
ORDER BY
OT.EmployeeId
Here is the result from my query:
+----------+-----------+----------+--------+----------+
|EmployeeId|TotalMargin|TotalPoint|TotalOLC| TotalTrip|
+----------+-----------+----------+--------+----------+
| 1| 08:00| 24| NULL| NULL|
| 1| 08:00| 24| 2| 0|
| 2| 04:00| 12| NULL| NULL|
| 2| 04:00| 12| NULL| NULL|
| 3| 04:00| 12| NULL| NULL|
| 3| 04:00| 12| NULL| NULL|
+----------+-----------+----------+--------+----------+
It seems that when I SUM multiple columns from a single table this way, I get multiple rows per employee in the final result. The only approaches that come to mind are to split the columns across several CTEs and query all of them, or to create a temp table or table variable and store/update the sum of each column in it.
So, any idea how to achieve my result without using multiple CTEs or temp tables?
Thank You

You want to group together rows that belong to the same EmployeeID, so this implies aggregation rather than window functions:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin)), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
COALESCE(SUM(OLC.OLC), 0) AS TotalOLC,
COALESCE(SUM(OLC.Trip), 0) AS TotalTrip
FROM @Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
GROUP BY OT.EmployeeId
I also don't see the point for the join condition on the dates, so I removed it. Finally, you can use coalesce() to return 0 for rows that have no OLC.
Demo on DB Fiddle:
EmployeeId | TotalMargin | TotalPoint | TotalOLC | TotalTrip
---------: | :---------- | ---------: | -------: | --------:
1 | 08:00:00 | 24 | 4 | 0
2 | 04:00:00 | 12 | 0 | 0
3 | 04:00:00 | 12 | 6 | 12
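Note that in the demo output the OLC totals are doubled for employees 1 and 3 (4 instead of 2, 6 and 12 instead of 3 and 6): once the date condition is removed, each OLC row is matched to both overtime rows of that employee before being summed. If the totals in the question's expected result are what is wanted, one option is to aggregate each table on its own and join the two one-row-per-employee results. Below is a sketch of that idea using Python's built-in sqlite3 module instead of SQL Server; the table name OlcLog and the MarginMinutes column (HourMargin stored as whole minutes) are stand-ins chosen for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Overtime (EmployeeId INTEGER, OvertimeDate TEXT,
                       MarginMinutes INTEGER, OvertimePoint INTEGER);
CREATE TABLE OlcLog   (EmployeeId INTEGER, OLCDate TEXT, OLC INTEGER, Trip INTEGER);
INSERT INTO Overtime VALUES
    (1, '2020-07-01', 300, 15), (1, '2020-07-02', 180, 9),
    (2, '2020-07-01',  60,  3), (2, '2020-07-03', 180, 9),
    (3, '2020-07-06', 180,  9), (3, '2020-07-07',  60, 3);
INSERT INTO OlcLog VALUES (1, '2020-07-01', 2, 0), (3, '2020-07-13', 3, 6);
""")

# Aggregate each table to one row per employee BEFORE joining,
# so neither side can multiply the rows of the other.
rows = conn.execute("""
SELECT ot.EmployeeId,
       ot.TotalMinutes,
       ot.TotalPoint,
       COALESCE(o.TotalOLC, 0)  AS TotalOLC,
       COALESCE(o.TotalTrip, 0) AS TotalTrip
FROM (SELECT EmployeeId,
             SUM(MarginMinutes) AS TotalMinutes,
             SUM(OvertimePoint) AS TotalPoint
      FROM Overtime
      GROUP BY EmployeeId) ot
LEFT JOIN (SELECT EmployeeId,
                  SUM(OLC)  AS TotalOLC,
                  SUM(Trip) AS TotalTrip
           FROM OlcLog
           GROUP BY EmployeeId) o
       ON o.EmployeeId = ot.EmployeeId
ORDER BY ot.EmployeeId
""").fetchall()

for row in rows:
    print(row)
```

The two derived tables are plain subqueries, so this stays a single statement without CTEs or temp tables.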

You've decided to use SUM OVER, but you're experiencing the "problem" of multiple rows. That's what a SUM OVER does: you can think of OVER(PARTITION BY ...) as a GROUP BY that is automatically joined back to the driving table, so you end up with all the rows from the driving table together with repeated results of the summation.
Here is a simple data set:
ProductID, Price
1, 100
1, 200
2, 300
2, 400
Here are some queries and results:
--perform a basic group and sum
SELECT ProductID, SUM(Price) S FROM x GROUP BY ProductID
1, 300
2, 700
--perform basic group/sum and join it back to the main table
SELECT x.ProductID, x.Price, y.S
FROM
x
INNER JOIN
(SELECT ProductID, SUM(Price) s FROM x GROUP BY ProductID) y
ON x.ProductID = y.ProductID
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
--perform a sum over, the partition here being the same as the earlier group
SELECT ProductID, Price, SUM(Price) OVER(PARTITION BY ProductID) FROM x
1, 100, 300
1, 200, 300
2, 300, 700
2, 400, 700
You can see the latter two produce the same result: every original row, with the total appended. It may help you understand simple window functions if you imagine that this is what the db does internally: it takes the "partition by", does a subquery group by with it, and joins the results back on whatever columns were in the partition.
It looks like what you really want is a simple group:
SELECT
OT.EmployeeId,
CONVERT(TIME, DATEADD(MS, (SUM(DATEDIFF(MS, '00:00:00.000', OT.HourMargin))), '00:00:00.000')) AS TotalMargin,
SUM(OT.OvertimePoint) AS TotalPoint,
SUM(OLC.OLC) AS TotalOLC,
SUM(OLC.Trip) AS TotalTrip
FROM @Overtime OT
LEFT JOIN @OLC OLC ON OLC.EmployeeId = OT.EmployeeId
AND OLC.OLCDate = OT.OvertimeDate
GROUP BY OT.EmployeeID

SQL Server create Unpivot table with where condition

Here is my sample table
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Id| CompanyName|part1_sales_amount|part2_sales_amount|part3_sales_amount|part1_sales_quantity|part2_sales_quantity|part3_sales_quantity|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| 1| FastCarsCo| 1| 2| 3| 4| 5| 6|
| 2|TastyCakeShop| 4| 5| 6| 4| 5| 6|
| 3| KidsToys| 7| 8| 9| 7| 8| 9|
| 4| FruitStall| 10| 11| 12| 10| 11| 12|
+---+-------------+------------------+------------------+------------------+--------------------+--------------------+--------------------+
Here is the output table that I want
+---+-------------+------------------+------------------+------------------+
| Id| CompanyName|Account |amount |quantity |
+---+-------------+------------------+------------------+------------------+
| 1| FastCarsCo| part1_sales| 1| 4|
| 1| FastCarsCo| part2_sales| 2| 5|
| 1| FastCarsCo| part3_sales| 3| 6|
| 2|TastyCakeShop| part1_sales| 4| 4|
| 2|TastyCakeShop| part2_sales| 5| 5|
| 2|TastyCakeShop| part3_sales| 6| 6|
| 3| KidsToys| part1_sales| 7| 7|
| 3| KidsToys| part2_sales| 8| 8|
| 3| KidsToys| part3_sales| 9| 9|
| 4| FruitStall| part1_sales| 10| 10|
| 4| FruitStall| part2_sales| 11| 11|
| 4| FruitStall| part3_sales| 12| 12|
+---+-------------+------------------+------------------+------------------+
Things I already did
SELECT
Id,
CompanyName,
REPLACE ( acc , '_amount' , '' ) AS Account,
amount,
quantity
FROM
(
SELECT Id, CompanyName, part1_sales_amount ,part2_sales_amount ,part3_sales_amount ,part1_sales_quantity ,part2_sales_quantity ,part3_sales_quantity
FROM privot
) src
UNPIVOT
(
amount FOR acc IN (part1_sales_amount ,part2_sales_amount ,part3_sales_amount )
) pvt1
UNPIVOT
(
quantity FOR acc1 IN (part1_sales_quantity, part2_sales_quantity, part3_sales_quantity )
) pvt2
It returns a result, but there are also unexpected records (it looks like a cross join). So my final step is the WHERE clause: what should I write in it? I tried many things, but none of them is correct.
Note: in my real database there are almost 200 columns like part1_sales_amount and part1_sales_quantity.
Any help is appreciated.

You can use APPLY:
select t.id, t.companyname, tt.amount, tt.qty
from table t cross apply
( values (t.part1_sales_amount, t.part1_sales_quantity),
(t.part2_sales_amount, t.part2_sales_quantity),
(t.part3_sales_amount, t.part3_sales_quantity),
. . .
) tt(amount, qty);

SELECT Id, CompanyName, account, amount, quantity
FROM MyTable
CROSS APPLY (
SELECT account = 'part1_sales', amount = part1_sales_amount, quantity = part1_sales_quantity
UNION ALL
SELECT account = 'part2_sales', amount = part2_sales_amount, quantity = part2_sales_quantity
UNION ALL
SELECT account = 'part3_sales', amount = part3_sales_amount, quantity = part3_sales_quantity
) AS AnotherData
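With roughly 200 column pairs, writing one branch per part by hand gets tedious, so the statement can be generated instead. The sketch below uses Python's built-in sqlite3 module; SQLite has no CROSS APPLY, so it swaps in a plain UNION ALL with one SELECT per part, which pairs each amount column with its quantity column in the same way. The helper name build_unpivot_sql is made up for this example, and the table name and part numbers are interpolated directly into the SQL, so they are assumed to be trusted identifiers.

```python
import sqlite3

def build_unpivot_sql(table, parts):
    """Build one SELECT per part, pairing its amount and quantity columns."""
    branches = [
        f"SELECT Id, CompanyName, 'part{n}_sales' AS Account, "
        f"part{n}_sales_amount AS amount, part{n}_sales_quantity AS quantity "
        f"FROM {table}"
        for n in parts
    ]
    return "\nUNION ALL\n".join(branches) + "\nORDER BY Id, Account"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE privot (
    Id INTEGER, CompanyName TEXT,
    part1_sales_amount INTEGER, part2_sales_amount INTEGER, part3_sales_amount INTEGER,
    part1_sales_quantity INTEGER, part2_sales_quantity INTEGER, part3_sales_quantity INTEGER
);
INSERT INTO privot VALUES
    (1, 'FastCarsCo',    1, 2, 3, 4, 5, 6),
    (2, 'TastyCakeShop', 4, 5, 6, 4, 5, 6);
""")

sql = build_unpivot_sql("privot", range(1, 4))
rows = conn.execute(sql).fetchall()

print(sql)
for row in rows:
    print(row)
```

The same generated text could be adapted to the CROSS APPLY (VALUES ...) form for SQL Server by emitting one VALUES tuple per part instead of one SELECT.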

Use a single UNPIVOT on the amount columns, then CHOOSE the matching quantity column by part number:
declare @privot table
(
id int,
CompanyName varchar(20),
part1_sales_amount money,
part2_sales_amount money,
part3_sales_amount money,
part4_sales_amount money,
part5_sales_amount money,
part1_sales_quantity int,
part2_sales_quantity int,
part3_sales_quantity int,
part4_sales_quantity int,
part5_sales_quantity int
);
insert into @privot
(
Id, CompanyName,
part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount,
part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity
)
values
(1, 'FastCarsCo', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
(2, 'TastyCakeShop', 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
(3, 'KidsToys', 11, 21, 31, 41, 51, 61, 71, 81, 91, 101),
(4, 'FruitStall', 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000);
select
Id, CompanyName, replace(acc, '_amount', '') as acc, amount,
-- extract the part number from the unpivoted column name ('part3_sales_amount' -> 3)
-- and use it as the 1-based index into the list of quantity columns
quantity = choose(
cast(replace(left(acc, charindex('_', acc) - 1), 'part', '') as int),
/*quantity columns*/ part1_sales_quantity, part2_sales_quantity, part3_sales_quantity, part4_sales_quantity, part5_sales_quantity
)
FROM
(
SELECT *
FROM @privot
) src
UNPIVOT
(
amount FOR acc IN (/*amount columns*/ part1_sales_amount, part2_sales_amount, part3_sales_amount, part4_sales_amount, part5_sales_amount)
) pvt1;

Vertica: repeat category from previous period if it not listed in current

I'm trying to build a kind of running total over a table with gaps. I have a period, a category and a value, and I want to list all categories used in the current and previous periods for a given storage_id, even if there is no value for that category in the current period.
My data:
period|storage_id|category|value|
------|----------|--------|-----|
1| 1|a |foo |
2| 1|b |bar |
3| 1|a |bar |
3| 1|b |foo |
1| 2|a |foo |
2| 2|b |bar |
4| 2|c |foo |
My goal:
period|storage_id|category|value|
------|----------|--------|-----|
1| 1|a |foo |
2| 1|a |NULL |
2| 1|b |bar |
3| 1|a |bar |
3| 1|b |foo |
1| 2|a |foo |
2| 2|a |NULL |
2| 2|b |bar |
4| 2|a |NULL |
4| 2|b |NULL |
4| 2|c |foo |
I managed to do it using a temporary table and 2 self-joins. Is there a more efficient way to do that, e.g., using window functions?
Reproducible example:
CREATE LOCAL TEMPORARY TABLE tt (
storage_id int
, category varchar(255)
, value varchar(255)
, period int
) ON COMMIT PRESERVE ROWS;
INSERT INTO tt
SELECT 1, 'a', 'foo', 1 UNION ALL
SELECT 1, 'b', 'bar', 2 UNION ALL
SELECT 1, 'a', 'bar', 3 UNION ALL
SELECT 1, 'b', 'foo', 3 UNION ALL
SELECT 2, 'a', 'foo', 1 UNION ALL
SELECT 2, 'b', 'bar', 2 UNION ALL
SELECT 2, 'c', 'foo', 4
;
My imperfect solution:
WITH
cat as (
SELECT
t1.category
, t1.storage_id
, t2.period
FROM
tt as t1 join tt as t2
on t1.storage_id = t2.storage_id
and t1.period <= t2.period
GROUP BY
t1.category
, t1.storage_id
, t2.period
)
SELECT
cat.period
, cat.storage_id
, cat.category
, tt.value
FROM cat
LEFT JOIN tt
ON tt.category = cat.category
and tt.storage_id = cat.storage_id
and tt.period = cat.period
ORDER BY
storage_id, period;
11 rows, 178 ms

I want to list all categories used in current and previous periods even if there is no value of that category in current period.
I don't see how your result set illustrates this, because you have not carried all results to the end.
For the problem you describe, the following should do what you want:
select p.period, sc.storage_id, sc.category, tt.value
from (select distinct period from tt) p join
(select storage_id, category, min(period) as first_period
from tt
group by 1, 2
) sc
on p.period >= sc.first_period left join
tt
on tt.period = p.period and
tt.storage_id = sc.storage_id and
tt.category = sc.category
order by p.period, sc.storage_id, sc.category;
Here is a db<>fiddle.
I can't figure out the actual logic that produces the result set you want.
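For what it's worth, the goal table in the question appears to be reproducible with a small variation of the query above: take the periods per storage_id instead of globally, and join on storage_id as well. This is only a sketch of that reading, run with Python's built-in sqlite3 module rather than Vertica, and it has not been compared for speed against the self-join version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tt (storage_id INTEGER, category TEXT, value TEXT, period INTEGER);
INSERT INTO tt VALUES
    (1, 'a', 'foo', 1), (1, 'b', 'bar', 2), (1, 'a', 'bar', 3), (1, 'b', 'foo', 3),
    (2, 'a', 'foo', 1), (2, 'b', 'bar', 2), (2, 'c', 'foo', 4);
""")

# p:  the periods that actually occur for each storage_id
# sc: the first period in which each (storage_id, category) appears
# A category is listed for every period of its storage from first_period on.
rows = conn.execute("""
SELECT p.period, sc.storage_id, sc.category, tt.value
FROM (SELECT DISTINCT storage_id, period FROM tt) p
JOIN (SELECT storage_id, category, MIN(period) AS first_period
      FROM tt
      GROUP BY storage_id, category) sc
  ON sc.storage_id = p.storage_id
 AND p.period >= sc.first_period
LEFT JOIN tt
  ON tt.period = p.period
 AND tt.storage_id = sc.storage_id
 AND tt.category = sc.category
ORDER BY sc.storage_id, p.period, sc.category
""").fetchall()

for row in rows:
    print(row)
```

On the sample data this yields the same 11 rows as the goal table, with NULL where a category has no value in that period.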

SQL - Pivot or Unpivot?

Another time, another problem. I have the following table:
|assemb.|Repl_1|Repl_2|Repl_3|Repl_4|Repl_5|Amount_1|Amount_2|Amount_3|Amount_4|Amount_5|
|---------------------------------------------------------------------------------------|
|4711001|111000|222000|333000|444000|555000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
|4711002|222000|333000|444000|555000|666000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
And here what I need:
|Article|Amount|
|--------------|
| 111000| 1|
|--------------|
| 222000| 2|
|--------------|
| 333000| 2|
|--------------|
| 444000| 2|
|--------------|
| 555000| 2|
|--------------|
| 666000| 1|
|--------------|
Repl_1 to Repl_10 are the replacement articles of an assembly. I can have n assemblies with up to 10 replacement articles each. In the end I need an overview of all articles with their total amounts across all assemblies.
THX.
Best greetz
Vegeta

This is probably the quickest way of achieving it, using UNION ALL. However, I'd recommend normalising your table.
SELECT Article, SUM(Amount) AS Amount FROM (
SELECT Repl_1 AS Article, SUM(Amount_1) AS Amount FROM #Test GROUP BY Repl_1
UNION ALL
SELECT Repl_2 AS Article, SUM(Amount_2) AS Amount FROM #Test GROUP BY Repl_2
UNION ALL
SELECT Repl_3 AS Article, SUM(Amount_3) AS Amount FROM #Test GROUP BY Repl_3
UNION ALL
SELECT Repl_4 AS Article, SUM(Amount_4) AS Amount FROM #Test GROUP BY Repl_4
UNION ALL
SELECT Repl_5 AS Article, SUM(Amount_5) AS Amount FROM #Test GROUP BY Repl_5
) tbl GROUP BY Article
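Since the real table goes up to Repl_10, the branches can be generated rather than typed out. Here is a sketch using Python's built-in sqlite3 module, with an ordinary table named Test standing in for #Test and the first column named assembly (both are assumptions made for this example). It drops the inner GROUP BY and sums once in the outer query, which gives the same totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Test (
    assembly INTEGER,
    Repl_1 INTEGER, Repl_2 INTEGER, Repl_3 INTEGER, Repl_4 INTEGER, Repl_5 INTEGER,
    Amount_1 INTEGER, Amount_2 INTEGER, Amount_3 INTEGER, Amount_4 INTEGER, Amount_5 INTEGER
);
INSERT INTO Test VALUES
    (4711001, 111000, 222000, 333000, 444000, 555000, 1, 1, 1, 1, 1),
    (4711002, 222000, 333000, 444000, 555000, 666000, 1, 1, 1, 1, 1);
""")

# One branch per Repl_n / Amount_n pair turns the wide row into (Article, Amount) rows.
branches = [
    f"SELECT Repl_{n} AS Article, Amount_{n} AS Amount FROM Test"
    for n in range(1, 6)
]
sql = (
    "SELECT Article, SUM(Amount) AS Amount FROM (\n"
    + "\nUNION ALL\n".join(branches)
    + "\n) tbl GROUP BY Article ORDER BY Article"
)
totals = conn.execute(sql).fetchall()

for row in totals:
    print(row)
```

If some assemblies use fewer than the maximum number of slots, the unused Repl_n columns would presumably be NULL and show up as a NULL article; adding WHERE Repl_n IS NOT NULL to each branch would exclude them.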
