Update column value to be minimum of all column values before it - sql

I have a table which, when sorted by week number, gives the units left of a product at a store. The units left should always be decreasing. However, there are some garbage values because of which the units left in a store increase for a few weeks and then decrease again. I only have these four columns to work with, and I want to replace the garbage values with correct ones. I am looking for the SQL for the following replacement logic: the units left for each week should be the minimum of the units-left values of all rows up to and including it, sorted by week number ascending.
e.g. here it goes to 12 for weeks 4 and 5 and then back to 9, which is incorrect; the two 12s should each be replaced by 9.
INPUT:
+-------+------------+-------------+------------+
| Store | Product ID | Week Number | Units left |
+-------+------------+-------------+------------+
| XXX   | A1         | 1           | 10.0       |
| XXX   | A1         | 2           | 9          |
| XXX   | A1         | 3           | 9          |
| XXX   | A1         | 4           | 12         |
| XXX   | A1         | 5           | 12         |
| XXX   | A1         | 6           | 9          |
| XXX   | A1         | 7           | 8          |
+-------+------------+-------------+------------+
OUTPUT:
+-------+------------+-------------+------------+
| Store | Product ID | Week Number | Units left |
+-------+------------+-------------+------------+
| XXX   | A1         | 1           | 10.0       |
| XXX   | A1         | 2           | 9          |
| XXX   | A1         | 3           | 9          |
| XXX   | A1         | 4           | 9          |
| XXX   | A1         | 5           | 9          |
| XXX   | A1         | 6           | 9          |
| XXX   | A1         | 7           | 8          |
+-------+------------+-------------+------------+
The DB is Teradata.

You could try the cumulative minimum function in Teradata:
SELECT Store, Product_ID, Week_Number, Units,
       MIN(Units) OVER (PARTITION BY Store, Product_ID
                        ORDER BY Week_Number
                        ROWS UNBOUNDED PRECEDING) AS Corrected_units
FROM TABLE_NAME;

You can use a cumulative minimum:
select t.*,
       min(units_left) over (partition by store, product_id
                             order by week_number
                             rows between unbounded preceding and current row
                            ) as imputed_units_left
from t;
This is standard SQL syntax and should work in all the databases you originally tagged.
If you want to actually change the data -- well, the syntax varies by database.
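If you do need to write the corrected values back in Teradata, one common route is a MERGE against a derived table that carries the cumulative minimum. The following is only a sketch under assumptions not in the question: the table is called inventory, the column is Units_Left, (Store, Product_ID, Week_Number) identifies a row, and Teradata additionally wants the target's primary index covered by the ON condition.
-- Hedged sketch (Teradata): write the running minimum back to the table.
-- "inventory" and "Units_Left" are assumed names, not from the question.
MERGE INTO inventory tgt
USING (
    SELECT Store, Product_ID, Week_Number,
           MIN(Units_Left) OVER (PARTITION BY Store, Product_ID
                                 ORDER BY Week_Number
                                 ROWS UNBOUNDED PRECEDING) AS corrected_units
    FROM inventory
) src
ON  tgt.Store = src.Store
AND tgt.Product_ID = src.Product_ID
AND tgt.Week_Number = src.Week_Number
WHEN MATCHED THEN UPDATE
    SET Units_Left = src.corrected_units;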

Related

output difference of two values same column to another column

Can anyone help me out or point me in the right direction? What is the simplest way to get from the current table to the output table?
Current Table
ID | type | amount |
2 | A | 19 |
2 | B | 6 |
3 | A | 5 |
3 | B | 11 |
4 | A | 1 |
4 | B | 23 |
Desired output
ID | type | amount | change |
2 | A | 19 | 13 |
2 | B | 6 | -6 |
3 | A | 5 | -22 |
3 | B | 11 | |
4 | A | 1 | |
4 | B | 23 | |
I don't quite get how the values are assigned to rows. You can, however, subtract the "B" value from the "A" value for any given id. For instance:
select t.*,
       (case when type = 'A'
             then amount - max(amount) filter (where type = 'B') over (partition by id)
        end) as diff_a_b
from t;
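The FILTER clause is mainly a Postgres feature; as a hedged, more portable sketch of the same idea, a conditional aggregate inside the window works on most databases that support window functions:
-- Portable variant: conditional MAX inside the window instead of FILTER.
select t.*,
       (case when type = 'A'
             then amount - max(case when type = 'B' then amount end)
                               over (partition by id)
        end) as diff_a_b
from t;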

SQL generate unique ID from rolling ID

I've been trying to find an answer to this for the better part of a day with no luck.
I have a SQL table with measurement data for samples and I need a way to assign a unique ID to each sample. Right now each sample has an ID number that rolls over frequently. What I need is a unique ID for each sample. Below is a table with a simplified dataset, as well as an example of a possible UID that would do what I need.
| Row | Time | Meas# | Sample# | UID (Desired) |
| 1 | 09:00 | 1 | 1 | 1 |
| 2 | 09:01 | 2 | 1 | 1 |
| 3 | 09:02 | 3 | 1 | 1 |
| 4 | 09:07 | 1 | 2 | 2 |
| 5 | 09:08 | 2 | 2 | 2 |
| 6 | 09:09 | 3 | 2 | 2 |
| 7 | 09:24 | 1 | 3 | 3 |
| 8 | 09:25 | 2 | 3 | 3 |
| 9 | 09:25 | 3 | 3 | 3 |
| 10 | 09:47 | 1 | 1 | 4 |
| 11 | 09:47 | 2 | 1 | 4 |
| 12 | 09:49 | 3 | 1 | 4 |
My problem is that rows 10-12 have the same Sample# as rows 1-3. I need a way to uniquely identify and group each sample. Having the row number or time of the first measurement on the sample would be good.
One other complication is that the measurement number doesn't always start with 1. It's based on measurement locations, and sometimes it skips location 1 and only has locations 2 and 3.
I am going to speculate that you want a unique number assigned to each sample, where now you have repeats.
If so, you can use lag() and a cumulative sum:
select t.*,
       sum(case when prev_sample = sample then 0 else 1 end)
           over (order by row) as new_sample_number
from (select t.*,
             lag(sample) over (order by row) as prev_sample
      from t
     ) t;
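The question also asks for the row number or time of the first measurement in each sample; a hedged sketch building on the query above (same assumed table t with columns row, time, and sample) attaches that with one more window function:
-- Sketch: carry the earliest time of each derived sample group onto every row.
select x.*,
       min(x.time) over (partition by x.new_sample_number) as sample_start_time
from (select t.*,
             sum(case when prev_sample = sample then 0 else 1 end)
                 over (order by row) as new_sample_number
      from (select t.*,
                   lag(sample) over (order by row) as prev_sample
            from t
           ) t
     ) x;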

SQL Statement to show columns multiple times

I have a table containing an integer column that represents a work place, an integer column that represents the number of workpieces finished at that workplace and a date column.
I want to create a query that creates rows of the following type
location int | date of Max(workpiece) | max workpieces | Min(Date) | workpieces (Min(Date)) | max(Date) | workpieces (Max(Date))
So I want a row for each location containing the date of the day on which the most pieces were finished plus that amount, the oldest date and the pieces finished on that day, and the newest date plus the number of pieces finished on that day.
Do I have to join the table with itself three times, once for each criterion, and then join on location? Is the GROUP BY operator involved? I don't quite get the hang of it.
EDIT: Here's some sample data
+-------+-----------+-----------+-------------------+
| id | location | amount | date |
+-------+-----------+-----------+-------------------+
| 1 | 1 | 10 | 01.01.2016 |
| 2 | 2 | 5 | 01.01.2016 |
| 3 | 1 | 6 | 02.01.2016 |
| 4 | 2 | 35 | 02.01.2016 |
| 5 | 1 | 50 | 03.01.2016 |
| 6 | 2 | 20 | 03.01.2016 |
+-------+-----------+-----------+-------------------+
I want my output to look like this:
loc | dateMaxAmount | MaxAmount | MinDate    | AmountMinDate | MaxDate    | MaxDateAmount
1   | 03.01.2016    | 50        | 01.01.2016 | 10            | 03.01.2016 | 50
2   | 02.01.2016    | 35        | 01.01.2016 | 5             | 03.01.2016 | 20
I am using MS Access.
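Access has no window functions, so one route, in the spirit of the self-join idea above, is to aggregate per location first and then join that aggregate back to the base table three times, once per criterion. This is only a sketch under assumptions: the table is assumed to be named production, date is bracketed because it is a reserved word, Access may want the join parentheses arranged slightly differently, and ties on the maximum amount would still need a tie-breaker.
-- Hedged sketch for Access SQL; "production" is an assumed table name.
SELECT agg.location AS loc,
       tmax.[date]  AS dateMaxAmount,
       agg.MaxAmount,
       agg.MinDate,
       tmin.amount  AS AmountMinDate,
       agg.MaxDate,
       tlast.amount AS MaxDateAmount
FROM (((SELECT location,
               MAX(amount) AS MaxAmount,
               MIN([date]) AS MinDate,
               MAX([date]) AS MaxDate
        FROM production
        GROUP BY location) AS agg
       INNER JOIN production AS tmax
               ON tmax.location = agg.location AND tmax.amount = agg.MaxAmount)
      INNER JOIN production AS tmin
              ON tmin.location = agg.location AND tmin.[date] = agg.MinDate)
     INNER JOIN production AS tlast
             ON tlast.location = agg.location AND tlast.[date] = agg.MaxDate;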

Spark dataframe add Missing Values

I have a dataframe of the following format. I want to add empty rows for missing time stamps for each customer.
+-------------+----------+------+----+----+
| Customer_ID | TimeSlot | A1 | A2 | An |
+-------------+----------+------+----+----+
| c1 | 1 | 10.0 | 2 | 3 |
| c1 | 2 | 11 | 2 | 4 |
| c1 | 4 | 12 | 3 | 5 |
| c2 | 2 | 13 | 2 | 7 |
| c2 | 3 | 11 | 2 | 2 |
+-------------+----------+------+----+----+
The resulting table should be of the format
+-------------+----------+------+------+------+
| Customer_ID | TimeSlot | A1 | A2 | An |
+-------------+----------+------+------+------+
| c1 | 1 | 10.0 | 2 | 3 |
| c1 | 2 | 11 | 2 | 4 |
| c1 | 3 | null | null | null |
| c1 | 4 | 12 | 3 | 5 |
| c2 | 1 | null | null | null |
| c2 | 2 | 13 | 2 | 7 |
| c2 | 3 | 11 | 2 | 2 |
| c2 | 4 | null | null | null |
+-------------+----------+------+------+------+
I have 1 million customers and 360 time slots (only 4 are depicted in the example above).
I figured out a way to create a dataframe with two columns (Customer_ID, TimeSlot) containing 1M x 360 rows and do a left outer join with the original dataframe.
Is there a better way to do this?
You can express this as a SQL query:
select c.customerid, t.timeslot,
       df.A1, df.A2, df.An
from (select distinct customerid from df) c cross join
     (select distinct timeslot from df) t left join
     df
     on df.customerid = c.customerid and df.timeslot = t.timeslot;
Notes:
You should probably put this into another dataframe.
You might have tables with the available customers and/or timeslots. Use those instead of the subqueries.
I think you can use Gordon Linoff's answer, but since you have millions of customers and you are joining on them, you could add the following:
Use a tally table for TimeSlot, because it might give better performance. For more on tally tables, refer to the following link:
http://www.sqlservercentral.com/articles/T-SQL/62867/
You could also use a partition or row-number function to split the customerid column and process the customers in batches, for example selecting a range of row numbers and then cross joining that with the tally table; it can improve performance.
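As a hedged sketch of the tally idea in Spark SQL (2.4 or later), the 360 time slots can be generated inline with sequence/explode instead of a DISTINCT scan of the large dataframe; this assumes the dataframe is registered as a temporary view named df with the column names from the example:
-- Generate the time slots inline (tally-style) rather than select distinct.
select c.Customer_ID, s.TimeSlot, df.A1, df.A2, df.An
from (select distinct Customer_ID from df) c
cross join (select explode(sequence(1, 360)) as TimeSlot) s
left join df
  on df.Customer_ID = c.Customer_ID
 and df.TimeSlot = s.TimeSlot;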

SQL Inner Join based on MAX of timestamp

Amended Once
Amended Twice: in the remaining 9 tables (all except reports), the value column is always called "what".
I have about 10 tables with the following structure:
reports (165k rows)
+-----------+-----------+
| identifier| category |
+-----------+-----------+
| 1 | fixed |
| 2 | wontfix |
| 3 | fixed |
| 4 | invalid |
| 5 | later |
| 6 | wontfix |
| 7 | duplicate |
| 8 | later |
| 9 | wontfix |
+-----------+-----------+
status (300k rows, all identifiers from reports come up at least once)
+-----------+-----------+----------+
| identifier| time | what |
+-----------+-----------+----------+
| 1 | 12 | RESOLVED |
| 1 | 9 | NEW |
| 2 | 7 | ASSIGNED |
| 3 | 10 | RESOLVED |
| 5 | 4 | REOPEN |
| 7 | 9 | ASSIGNED |
| 4 | 9 | ASSIGNED |
| 7 | 11 | RESOLVED |
| 8 | 3 | NEW |
| 4 | 3 | NEW |
| 7 | 6 | NEW |
+-----------+-----------+----------+
priority (300k rows, all identifiers from reports come up at least once)
+-----------+-----------+----------+
| identifier| time | what |
+-----------+-----------+----------+
| 3 | 12 | LOW |
| 1 | 9 | LOW |
| 9 | 2 | HIGH |
| 8 | 7 | HIGH |
| 3 | 10 | HIGH |
| 5 | 4 | MEDIUM |
| 4 | 9 | MEDIUM |
| 4 | 3 | LOW |
| 7 | 9 | LOW |
| 7 | 11 | HIGH |
| 8 | 3 | LOW |
| 6 | 12 | MEDIUM |
| 7 | 6 | LOW |
| 6 | 9 | HIGH |
| 2 | 6 | HIGH |
| 2 | 1 | LOW |
+-----------+-----------+----------+
What I need is:
reportsfinal (165k rows)
+-----------+-----------+--------------+------------+
| identifier| category | what11 | what22 |
+-----------+-----------+--------------+------------+
| 1 | fixed | RESOLVED | LOW |
| 2 | wontfix | ASSIGNED | HIGH |
| 3 | fixed | RESOLVED | LOW |
| 4 | invalid | ASSIGNED | MEDIUM |
| 5 | later | REOPEN | MEDIUM |
| 6 | wontfix | | MEDIUM |
| 7 | duplicate | RESOLVED | HIGH |
| 8 | later | NEW | HIGH |
| 9 | wontfix | | HIGH |
+-----------+-----------+--------------+------------+
That is, reports (after query = reportsfinal) serves as the basis table and I have to add one or two columns from 9 other tables. The identifier is the key, but in some tables, the identifier comes up multiple times. In these cases I want to use the entry with the highest time only.
I tried several queries, but none of them worked. If possible, I want to run one query to get different columns from the 9 other tables with this approach.
What I tried based on the answer below:
select T.identifier,
T.category,
t.what AS what11,
t.what AS what22 from (
select R.identifier,
R.category,
COALESCE(S.what,'NA')what,
COALESCE(P.what,'NA')what,
ROW_NUMBER()OVER(partition by R.identifier,R.category ORDER by (select null))RN
from reports R
LEFT JOIN bugstatus S
ON S.identifier = R.identifier
LEFT JOIN priority P
ON P.identifier = s.identifier
GROUP BY R.identifier,R.category,S.what,P.what)T
Where T.RN = 1
ORDER BY T.identifier;
This gives the error:
Error: near "(": syntax error.
Basically you need correlated subqueries in the select list.
From the hip, something like:
Select a.Identifier,
       a.Category,
       (select process
        from status
        where status.identifier = a.Identifier
        order by time desc limit 1) Process,
       (select prio
        from priority
        where priority.identifier = a.Identifier
        order by time desc limit 1) Prio
From Reports a
For each associated table just use a predicate based on a subquery to identify the specific timestamp...
The single-letter tokens r, s, and p below are aliases for the tables reports, status, and priority respectively:
Select r.Identifier, r.category,
coalesce(s.what, 'NA') status,
coalesce(p.what, 'NA') priority
From reports r
left join status s
on s.identifier = r.identifier
and s.time =
(Select max(time) from status
where identifier = r.identifier)
left join priority p
on p.identifier = r.identifier
and p.time =
(Select max(time) from priority
where identifier = r.identifier);
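With roughly 300k rows per table, it may also be worth precomputing the latest time per identifier once and joining back to it, instead of evaluating a correlated MAX for every joined row. A hedged sketch of that equivalent formulation, using the question's column names and the what11/what22 aliases from the desired output:
-- Groupwise-max variant: compute max(time) per identifier once, then join back.
select r.identifier, r.category,
       coalesce(s.what, 'NA') as what11,
       coalesce(p.what, 'NA') as what22
from reports r
left join (select identifier, max(time) as max_time
           from status
           group by identifier) sm on sm.identifier = r.identifier
left join status s on s.identifier = sm.identifier and s.time = sm.max_time
left join (select identifier, max(time) as max_time
           from priority
           group by identifier) pm on pm.identifier = r.identifier
left join priority p on p.identifier = pm.identifier and p.time = pm.max_time;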
QUESTION: Why did you rename the columns from status and priority to what? You might as well name them something, or data, or information. At least the original names (status and prio) communicated something; the word what is meaningless.
NOTE: I reversed (undid) the edit for the what11 and what22 aliases, as these names are meaningless.
Using ROW_NUMBER works based on your sample data:
select T.identifier,
       T.category,
       T.s_what AS what11,
       T.p_what AS what22
from (select R.identifier,
             R.category,
             COALESCE(S.what, 'NA') AS s_what,
             COALESCE(P.what, 'NA') AS p_what,
             ROW_NUMBER() OVER (PARTITION BY R.identifier, R.category
                                ORDER BY S.time DESC, P.time DESC) AS RN
      from reports R
      left join status S
        on S.identifier = R.identifier
      left join priority P
        on P.identifier = R.identifier
     ) T
where T.RN = 1
order by T.identifier;