Merge rows based on a condition - apache-spark-sql

Is it possible to merge a collection of rows based on a condition in Spark SQL using a SQL query?
If the difference between the purch_dt values of two consecutive rows (ordered by line_num) is less than 5 days, combine them into one row; the merged row should carry the max purch_dt of that group. I tried the LEAD function, but I can't get it to reset after each false condition and treat the following rows as a new group, and I am unable to get the max purch_dt for each such group.
Input:
orderid | line_num | purch_dt
1 | 1 | 10-02-2020
1 | 2 | 12-02-2020
1 | 3 | 14-02-2020
1 | 4 | 21-03-2020
1 | 5 | 23-03-2020
Output:
orderid | purch_dt
1 | 14-02-2020 -- lines 1 - 3 combined into 1 row because the difference between each is <5 days
1 | 23-03-2020 -- lines 4 - 5 combined into 1 row because the difference between each is <5 days
Total Output rows = 2 because we have 2 groups.
Please note that line_num 4 starts a new group, since its purch_dt differs from that of line_num 3 by more than 5 days. Hence it begins its own merged record set.
I have the SQL below so far, but I can't get it to break out and create the groups.
create temporary view next_dt as
select
orderid,
LEAD(purch_dt) over (partition by orderid order by line_num asc) AS next_purch_dt,
purch_dt
from orders;
select *
from (
select
orderid,
CASE WHEN datediff(next_purch_dt, purch_dt) < 5 OR next_purch_dt IS NULL THEN 'Y'
ELSE 'N'
END AS flg
from
next_dt) tab
WHERE flg = 'Y';
Any help is appreciated.
UPDATE:
Slight change in the requirements:-
The comparison now has to be made between two different fields in consecutive records: the purch_dt of the current record and the return_dt of the next record.
Also, each merged output record should have purch_dt taken from the record with the lowest line_num in its group, and return_dt taken from the record with the highest line_num in that group.
Input:
orderid | line_num | purch_dt | return_dt
1 | 1 | 10-02-2020 | 10-02-2020
1 | 2 | 12-02-2020 | 13-02-2020
1 | 3 | 14-02-2020 | 14-02-2020
1 | 4 | 21-03-2020 | 23-02-2020
1 | 5 | 23-03-2020 | 24-02-2020
Output:
orderid | purch_dt | return_dt
1 | 10-02-2020 | 14-02-2020
1 | 21-03-2020 | 24-02-2020
Total Output rows = 2 because we have 2 groups.
Note that each output record contains the purch_dt of the record with min line_num in that group. And contains return_dt populated as per the record with max line_num in that group.

You almost had it. The query below worked for me:
sql("""create temporary view next_dt_orders as
select *
from (
select
orderid,line_num,purch_dt,
case when datediff(
(lead(purch_dt) over (partition by orderid order by line_num asc)),
purch_dt) < 5
then "N"
else "Y"
end as flag
from
orders) tab
where
flag='Y'""")
sql("select * from next_dt_orders").show()
+-------+--------+----------+----+
|orderid|line_num| purch_dt|flag|
+-------+--------+----------+----+
| 1| 3|2020-02-14| Y|
| 1| 5|2020-03-23| Y|
+-------+--------+----------+----+
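The accepted query returns only the last row of each group; to also meet the updated requirement (purch_dt from the first row, return_dt from the last), the usual technique is gaps-and-islands: flag each group start, take a running sum of the flags as a group id, then aggregate per group. A sketch of that idea in SQLite via Python (any engine with window functions works the same way; MAX(return_dt) stands in for "return_dt of the highest line_num", which coincides for this data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (orderid INT, line_num INT, purch_dt TEXT, return_dt TEXT);
INSERT INTO orders VALUES
  (1, 1, '2020-02-10', '2020-02-10'),
  (1, 2, '2020-02-12', '2020-02-13'),
  (1, 3, '2020-02-14', '2020-02-14'),
  (1, 4, '2020-03-21', '2020-02-23'),
  (1, 5, '2020-03-23', '2020-02-24');
""")

rows = con.execute("""
WITH flagged AS (            -- 1 marks the start of a new group
  SELECT *,
         CASE WHEN julianday(return_dt)
                   - julianday(LAG(purch_dt) OVER w) < 5
              THEN 0 ELSE 1 END AS new_grp
  FROM orders
  WINDOW w AS (PARTITION BY orderid ORDER BY line_num)
),
grouped AS (                 -- running sum of flags = group id
  SELECT *,
         SUM(new_grp) OVER (PARTITION BY orderid
                            ORDER BY line_num) AS grp
  FROM flagged
)
SELECT orderid, MIN(purch_dt) AS purch_dt, MAX(return_dt) AS return_dt
FROM grouped
GROUP BY orderid, grp
ORDER BY grp
""").fetchall()
print(rows)
# [(1, '2020-02-10', '2020-02-14'), (1, '2020-03-21', '2020-02-24')]
```

The same shape works in Spark SQL with datediff in place of the julianday subtraction.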

Related

How to pivot column data into a row where a maximum qty total cannot be exceeded?

Introduction:
I have come across an unexpected challenge. I'm hoping someone can help; I am interested in the best method for manipulating the data to solve this problem.
Scenario:
I need to combine column data associated to two different ID columns. Each row that I have associates an item_id and the quantity for this item_id. Please see below for an example.
+-------+-------+-------+---+
|cust_id|pack_id|item_id|qty|
+-------+-------+-------+---+
| 1 | A | 1 | 1 |
| 1 | A | 2 | 1 |
| 1 | A | 3 | 4 |
| 1 | A | 4 | 0 |
| 1 | A | 5 | 0 |
+-------+-------+-------+---+
I need to manipulate the data shown above so that 24 rows (for 24 item_ids) are combined into a single row. In the example above I have chosen 5 items to keep things simple. The target format, assuming 5 item_ids, can be seen below.
+---------+---------+---+---+---+---+---+
| cust_id | pack_id | 1 | 2 | 3 | 4 | 5 |
+---------+---------+---+---+---+---+---+
| 1 | A | 1 | 1 | 4 | 0 | 0 |
+---------+---------+---+---+---+---+---+
However, here's the condition that is making this troublesome. The maximum total quantity for each row must not exceed 5. If the total quantity exceeds 5 a new row associated to the cust_id and pack_id must be created for the rest of the item_id quantities. Please see below for the desired output.
+---------+---------+---+---+---+---+---+
| cust_id | pack_id | 1 | 2 | 3 | 4 | 5 |
+---------+---------+---+---+---+---+---+
| 1 | A | 1 | 1 | 3 | 0 | 0 |
| 1 | A | 0 | 0 | 1 | 0 | 0 |
+---------+---------+---+---+---+---+---+
Notice how the quantities of item_ids 1, 2 and 3 summed together equal 6, which exceeds the maximum total quantity of 5 per row. The second row is created for the difference; in this case only item_id 3 has a single quantity remaining.
Note, if a 2nd row needs to be created that total quantity displayed in that row also cannot exceed 5. There is a known item_id limit of 24. But, there is no known limit of the quantity associated for each item_id.
Here's an approach which comes a bit out of left field.
One approach would have been to do a recursive CTE, building the rows one-by-one.
Instead, I've taken an approach where I
Create a new (virtual) table with 1 row per item (so if there are 6 items, there will be 6 rows)
Group those items into groups of 5 (I've called these rn_batches)
Pivot those (based on counts per item per rn_batch)
For these, processing is relatively simple
Creating one row per item is done using INNER JOIN to a numbers table with n <= the relevant quantity.
The grouping then just assigns rn_batch = 1 for the first 5 items, rn_batch = 2 for the next 5 items, etc - until there are no more items left for that order (based on cust_id/pack_id).
Here is the code
/* Data setup */
CREATE TABLE #Order (cust_id int, pack_id varchar(1), item_id int, qty int, PRIMARY KEY (cust_id, pack_id, item_id))
INSERT INTO #Order (cust_id, pack_id, item_id, qty) VALUES
(1, 'A', 1, 1),
(1, 'A', 2, 1),
(1, 'A', 3, 4),
(1, 'A', 4, 0),
(1, 'A', 5, 0);
/* Pivot results */
WITH Nums(n) AS
(SELECT (c * 100) + (b * 10) + (a) + 1 AS n
FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) A(a)
CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) B(b)
CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) C(c)
),
ItemBatches AS
(SELECT cust_id, pack_id, item_id,
FLOOR((ROW_NUMBER() OVER (PARTITION BY cust_id, pack_id ORDER BY item_id, N.n)-1) / 5) + 1 AS rn_batch
FROM #Order O
INNER JOIN Nums N ON N.n <= O.qty
)
SELECT *
FROM (SELECT cust_id, pack_id, rn_batch, 'Item_' + LTRIM(STR(item_id)) AS item_desc
FROM ItemBatches
) src
PIVOT
(COUNT(item_desc) FOR item_desc IN ([Item_1], [Item_2], [Item_3], [Item_4], [Item_5])) pvt
ORDER BY cust_id, pack_id, rn_batch;
And here are the results:
cust_id pack_id rn_batch Item_1 Item_2 Item_3 Item_4 Item_5
1 A 1 1 1 3 0 0
1 A 2 0 0 1 0 0
Here's a db<>fiddle with additional data in the #Order table, the answer above, and also the processing with each step separated.
Notes
This approach (with the virtual numbers table) assumes a maximum of 1,000 for a given item in an order. If you need more, you can easily extend that numbers table by adding additional CROSS JOINs.
While I am in awe of the coders who made SQL Server and how it determines execution plans in milliseconds, for larger datasets SQL Server has little chance of accurately predicting how many rows each step will produce. As such, for performance it may work better to split the code into parts (including temp tables), similar to the db<>fiddle example.
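The expand-then-batch steps above are engine-agnostic; here is a plain-Python sketch of the same pipeline (one virtual row per unit of qty, chunks of 5, counts per chunk). The helper name batch_rows and the hard-coded limits are illustrative, not part of the answer's code:

```python
from itertools import groupby

# One row per (cust_id, pack_id, item_id, qty), as in the question.
orders = [(1, 'A', 1, 1), (1, 'A', 2, 1), (1, 'A', 3, 4),
          (1, 'A', 4, 0), (1, 'A', 5, 0)]

BATCH = 5      # maximum total quantity per output row
N_ITEMS = 5    # known item_id limit (24 in the real data)

def batch_rows(rows):
    out = []
    # Group by customer/pack, mirroring PARTITION BY cust_id, pack_id.
    for (cust, pack), grp in groupby(rows, key=lambda r: (r[0], r[1])):
        # Step 1: one virtual row per single unit of an item.
        units = [item for _, _, item, qty in grp for _ in range(qty)]
        # Steps 2 and 3: slice into batches of BATCH, count items per batch.
        for i in range(0, len(units), BATCH):
            counts = [0] * N_ITEMS
            for item in units[i:i + BATCH]:
                counts[item - 1] += 1
            out.append((cust, pack, i // BATCH + 1, *counts))
    return out

print(batch_rows(orders))
# [(1, 'A', 1, 1, 1, 3, 0, 0), (1, 'A', 2, 0, 0, 1, 0, 0)]
```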

Compare dates between rows from the same input file based on ID and replicate rows by increment date using SQL Server

I am trying to duplicate rows by comparing the date of the current row with the date of the next row for a user ID; the row should be duplicated, incrementing its date while it is < the date of the next row.
To explain in detail
input:-
Compare the Start_DateMonth of the first row with that of the second row, and replicate the first row, incrementing Start_DateMonth until it reaches the Start_DateMonth of the second row of the input.
Expected Output:-
Please suggest if this logic can be implemented using SQL Server.
One way to do it is to use a recursive query:
with cte (user_id, start_datemonth, start_dateday, lead_start_datemonth) as (
select
user_id,
start_datemonth,
start_dateday,
lead(start_datemonth) over(partition by user_id order by start_datemonth) lead_start_datemonth
from mytable
union all
select
user_id,
start_datemonth + 1,
start_dateday,
lead_start_datemonth
from cte
where start_datemonth + 1 < lead_start_datemonth
)
select user_id, start_datemonth, start_dateday from cte
Demo on DB Fiddle:
user_id | start_datemonth | start_dateday
-------: | --------------: | ------------:
11110002 | 210601 | 1
11110002 | 210602 | 1
11110002 | 210603 | 1
11110002 | 210604 | 2
11110002 | 210605 | 2
11110002 | 210606 | 2
11110002 | 210607 | 4
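The recursive pattern ports directly to other engines. As a sanity check, here is the same query in SQLite (via Python's sqlite3), with the LEAD computed in a plain CTE before the recursion and the sample input inferred from the demo output above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (user_id INT, start_datemonth INT, start_dateday INT);
INSERT INTO mytable VALUES
  (11110002, 210601, 1),
  (11110002, 210604, 2),
  (11110002, 210607, 4);
""")

rows = con.execute("""
WITH RECURSIVE base AS (
  -- next row's start_datemonth, per user
  SELECT user_id, start_datemonth, start_dateday,
         LEAD(start_datemonth) OVER (PARTITION BY user_id
                                     ORDER BY start_datemonth) AS lead_sdm
  FROM mytable
),
cte AS (
  SELECT * FROM base
  UNION ALL
  -- replicate the row, incrementing the date until just before the next row
  SELECT user_id, start_datemonth + 1, start_dateday, lead_sdm
  FROM cte
  WHERE start_datemonth + 1 < lead_sdm
)
SELECT user_id, start_datemonth, start_dateday
FROM cte
ORDER BY start_datemonth
""").fetchall()
for r in rows:
    print(r)
```

This reproduces the seven rows shown in the DB Fiddle demo.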

Find records which have multiple occurrences in another table array (postgres)

I have a table which stores records in an array column. There is also another table with single string records. I want to find the records that have multiple occurrences in the other table. The tables are as follows:
Vehicle
veh_id | vehicle_types
-------+---------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
vehicle_names
id | vehicle_name
-----+-----------------------
1 | byd_tang
2 | volt
3 | viper
4 | laferrari
5 | sonata
6 | jaguarxf
7 | swift
8 | teslax
9 | mirai
10 | ferrariff
11 | bmwi8
I have a query which gives the output I expect, but it is not optimal and may be expensive.
This is the query:
select vehicle_name
from vehicle_names dsb
where (select count(*) from vehicle dsd
where dsb.vehicle_name = ANY (dsd.vehicle_types)) > 1
The output should be:
byd_tang
volt
viper
One option would be an aggregation query:
SELECT
vn.id,
vn.vehicle_name
FROM vehicle_names vn
INNER JOIN vehicle v
ON vn.vehicle_name = ANY (v.vehicle_types)
GROUP BY
vn.id,
vn.vehicle_name
HAVING
COUNT(*) > 1;
This only counts a vehicle name which appears in two or more records in the other table. It would not pick up, for example, a single vehicle record with the same name appearing two or more times.
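That distinction matters for rows 6 and 7 above, where "viper" appears twice inside a single array. Note also that "ferrariff" appears in two records (6 and 7), so it qualifies under either rule even though the question's expected output omits it. A small Python sketch of both counting rules, with lists standing in for the Postgres arrays:

```python
from collections import Counter

vehicles = {
    1: ["byd_tang", "volt", "viper", "laferrari"],
    2: ["volt", "viper"],
    3: ["byd_tang", "sonata", "jaguarxf"],
    4: ["swift", "teslax", "mirai"],
    5: ["volt", "viper"],
    6: ["viper", "ferrariff", "bmwi8", "viper"],
    7: ["ferrariff", "viper", "viper", "volt"],
}

# Rule used by the JOIN/GROUP BY query: count *records* containing the name
# (set() collapses duplicates within one array).
record_counts = Counter(name for types in vehicles.values()
                             for name in set(types))

# Alternative rule: count every *occurrence*, duplicates included.
occurrence_counts = Counter(name for types in vehicles.values()
                                 for name in types)

multi_record = sorted(n for n, c in record_counts.items() if c > 1)
print(multi_record)
# ['byd_tang', 'ferrariff', 'viper', 'volt']
print(record_counts["viper"], occurrence_counts["viper"])
# 5 7  -- "viper" is in 5 records but appears 7 times
```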

In MS Access, how do I update a table record to its current value plus the count of records in a different table?

I have two tables.
**tblMonthlyData**
ReportMonth | TotalItems | TotalVariances
Jan | 5 | 0
Feb | 1 | 1
Mar | 2 | 0
Apr | 8 | 4
May | 4 | 0
Jun | 5 | 0
Jul | 3 | 0
Aug | 5 | 0
Sep | 9 | 3
Oct | 1 | 0
Nov | 7 | 0
Dec | 6 | 0
and
**tblDailyData**
ID | ItemNum | CountedQty | SystemQty | Variance
1 | Item1 | 4 | 4 | 0
2 | Item2 | 8 | 5 | -3
3 | Item3 | 1 | 2 | 1
4 | Item4 | 6 | 4 | -2
For the sake of clarity, we'll say the above tblDailyData is from a count done today, 01/27/2017. Variance is a calculated field based on the data in both quantity fields.
I'm trying to add the count of records in tblDailyData to TotalItems in tblMonthlyData based on the date of the count (counts are done daily, and each count's data needs to be added to the appropriate month in tblMonthlyData). So for the above example I'd add 4 (the number of records) to TotalItems for the Jan record, making it 9, and add 3 (the number of variances) to TotalVariances, making it 3.
So far, I've tried using a Make Table Query for both total items counted and total number of variances, then using an Update Query that looks like this:
UPDATE tblMonthlyData
SET TotalItems = TotalItems + tblTempTotalItems.CountOfItems,
TotalVariances = TotalVariances + tblTempTotalVariances.CountOfVariances
WHERE Format$([ReportMonth],"mmm")=Format$(Now(),"mmm");
I've also tried a similar method using select queries to count the records and variances (without creating the temporary tables) and running the update query based on those. Both methods result in Access prompting for the CountOfItems and CountOfVariances parameters when the update query is run, instead of taking the values from the temporary table or select query.
This seemed like it'd be such a simple operation (query the count of records and variances, add them to the appropriate monthly record in separate table), but it turns out I can't figure out how to make it work. Thanks for any help!
This does not seem to be a situation for a table, but rather for some views/queries, which will always be up to date.
Use a GROUP BY FORMAT([date_field],"mm/dd/yyyy") clause in your query for the daily item count (if you want to add that to a monthly count, we will do that in ANOTHER query).
SELECT FORMAT([date_field],"mm/dd/yyyy") AS [Date], COUNT(ID) AS TotalItems
FROM tblDailyData
GROUP BY FORMAT([date_field],"mm/dd/yyyy")
Call this query dailyTotalItems.
SELECT FORMAT([date_field],"mm/dd/yyyy") AS [Date], COUNT(ID) AS TotalItemsWithVariance
FROM tblDailyData
WHERE NOT (Variance = 0)
GROUP BY FORMAT([date_field],"mm/dd/yyyy")
Call this query dailyTotalItemsWithVariance.
SELECT MONTH(CDate([Date])) AS MonthDate, SUM(TotalItems) AS TotalMonthlyItems
FROM dailyTotalItems
GROUP BY MONTH(CDate([Date]))
Call this query monthlyTotalItems.
SELECT MONTH(CDate([Date])) AS MonthDate, SUM(TotalItemsWithVariance) AS TotalMonthlyItemsWithVariance
FROM dailyTotalItemsWithVariance
GROUP BY MONTH(CDate([Date]))
Call this query monthlyTotalItemsWithVariance.
Then LEFT JOIN both on MonthDate.
SELECT * FROM monthlyTotalItems
LEFT JOIN monthlyTotalItemsWithVariance ON monthlyTotalItems.MonthDate = monthlyTotalItemsWithVariance.MonthDate
NOTE: TotalItems will always be >= TotalItemsWithVariance, and every date with a variance must have had a count. So take ALL dates from monthlyTotalItems and left join to pick up the matching monthlyTotalItemsWithVariance rows (which must be included, as shown above).

SQL SELECT only rows where a max value is present, and the corresponding ID from another linked table

I have a simple Parts database which I'd like to use for calculating costs of assemblies, and I need to keep a cost history, so that I can update the costs for parts without the update affecting historic data.
So far I have the info stored in 2 tables:
tblPart:
PartID | PartName
1 | Foo
2 | Bar
3 | Foobar
tblPartCostHistory
PartCostHistoryID | PartID | Revision | Cost
1 | 1 | 1 | £1.00
2 | 1 | 2 | £1.20
3 | 2 | 1 | £3.00
4 | 3 | 1 | £2.20
5 | 3 | 2 | £2.05
What I want to end up with is just the PartID for each part, and the PartCostHistoryID where the revision number is highest, so this:
PartID | PartCostHistoryID
1 | 2
2 | 3
3 | 5
I've had a look at some of the other threads on here and I can't quite get it. I can manage to get the PartID along with the highest Revision number, but if I try to then do anything with the PartCostHistoryID I end up with multiple PartCostHistoryIDs per part.
I'm using MS Access 2007.
Many thanks.
Mihai's (very concise) answer will work assuming that [PartCostHistoryID] and [Revision] are always in the same ascending order for each [PartID].
A solution that does not rely on that assumption would be
SELECT
tblPartCostHistory.PartID,
tblPartCostHistory.PartCostHistoryID
FROM
tblPartCostHistory
INNER JOIN
(
SELECT
PartID,
MAX(Revision) AS MaxOfRevision
FROM tblPartCostHistory
GROUP BY PartID
) AS max
ON max.PartID = tblPartCostHistory.PartID
AND max.MaxOfRevision = tblPartCostHistory.Revision
SELECT PartID, MAX(PartCostHistoryID) FROM tblPartCostHistory GROUP BY PartID
Here is a query:
select PartCostHistoryId, PartId from tblCost
where PartCostHistoryId in
(select PartCostHistoryId from
(select * from tblCost as tbl order by Revision desc) as tbl1
group by PartId
)
Here is SQL Fiddle http://sqlfiddle.com/#!2/19c2d/12
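For a quick check, the join-on-max-revision query above runs as-is on most engines; here it is against the question's data in SQLite via Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tblPartCostHistory (
  PartCostHistoryID INT, PartID INT, Revision INT, Cost REAL);
INSERT INTO tblPartCostHistory VALUES
  (1, 1, 1, 1.00),
  (2, 1, 2, 1.20),
  (3, 2, 1, 3.00),
  (4, 3, 1, 2.20),
  (5, 3, 2, 2.05);
""")

rows = con.execute("""
SELECT h.PartID, h.PartCostHistoryID
FROM tblPartCostHistory h
INNER JOIN (
  -- highest revision per part
  SELECT PartID, MAX(Revision) AS MaxOfRevision
  FROM tblPartCostHistory
  GROUP BY PartID
) m ON m.PartID = h.PartID
   AND m.MaxOfRevision = h.Revision
ORDER BY h.PartID
""").fetchall()
print(rows)
# [(1, 2), (2, 3), (3, 5)] -- matches the desired output in the question
```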