Database Design: Historical Data Model (SQL)

I am thinking about a good design to capture the history of product changes. Suppose a user can have a different set of products to trade each day.
User Product Day
1 A 1
1 B 1
1 A 2
1 B 2
1 C 3
As we can see above, on day 3 product C is added and products A and B are removed.
I am thinking of the two designs below:
#1: Capture the product changes and store them as start and end dates
User Product Start End
1 A 1 3
1 B 1 3
1 C 3 -
#2: Capture each product change as one record
User Product Action Day
1 A Added 1
1 B Added 1
1 C Added 3
1 A Removed 3
1 B Removed 3
My question is: can these two models be converted into each other? For example, we can use LEAD/LAG to convert #2 into #1.
Which design is better? Our system is using #2 to store the historical data.
Update:
The intention is to use the data to show the product change history.
For example, for a given date range, what are the product changes for a particular user?

The second model seems better, at least if your main interest is in queries like "find all changes for all users and products, which occurred between DATE_1 and DATE_2".
With the second model, the query is trivial:
select * from (table) where (date) between DATE_1 and DATE_2;
How would you write the query for the first model?
Moreover, with the second model you could create an index on (user, date) - or even just on (date) - which will make quick work of the query. Even if you had indexes on the table in the first model, they wouldn't be used due to the complicated nature of the query.
While integrity constraints would be relatively difficult in both cases (as they are cross-row), they would be much easier to implement - either with materialized views or with triggers - with the second model. In the first model you have to make sure there are no overlaps between the intervals. With the second model, if you partition by user and product and order by date, the condition is simply that the action alternates from row to row. Still not trivial to implement, but much simpler than the "non-overlapping intervals" condition for the first model.
To your other question: It is, indeed, trivial to go from either model to the other, using PIVOT and UNPIVOT. You do need an analytic function (ROW_NUMBER) before you PIVOT to go from model #2 to #1. You don't need any preparation to go from #1 to #2.
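As a portable sketch of that 2-to-1 conversion (not the original Oracle syntax; the table and column names here are illustrative), the LEAD pairing works in any engine with window functions, e.g. SQLite via Python:

```python
import sqlite3

# Model #2: one row per change event.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE changes (usr INTEGER, product TEXT, action TEXT, day INTEGER);
INSERT INTO changes VALUES
  (1, 'A', 'Added',   1),
  (1, 'B', 'Added',   1),
  (1, 'C', 'Added',   3),
  (1, 'A', 'Removed', 3),
  (1, 'B', 'Removed', 3);
""")

# Pair each 'Added' row with the next event for the same user/product;
# an open interval has no following 'Removed' row, so end_day stays NULL.
intervals = con.execute("""
SELECT usr, product, start_day, end_day
FROM (
    SELECT usr, product, action, day AS start_day,
           LEAD(day) OVER (PARTITION BY usr, product ORDER BY day) AS end_day
    FROM changes
)
WHERE action = 'Added'
ORDER BY usr, product, start_day
""").fetchall()

print(intervals)  # [(1, 'A', 1, 3), (1, 'B', 1, 3), (1, 'C', 3, None)]
```

Note this relies on the integrity condition discussed above: within each user/product, the actions must strictly alternate Added/Removed.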

Personally, I think the first option is better. I'm assuming you have so many rows that the raw structure of a row per user, product and date is too heavy? Because for visualisations I think the raw table would work fine as is.
However, if you have to aggregate due to size, and do not need to know the amounts of the products nor how many users are selling them on any given day, then the first option would be easier to work with, in my opinion, purely in terms of SQL. On the other hand, you will have a problem if a product can have several start and end dates, since I am assuming a new entry would just overwrite the previous date stamp.
So, with that in mind, I would personally create a table with a row per day (or per month, if you want to minimise the size of the table and monthly granularity is enough for your use case). Then add a column for each product and whether or not it was sold that day. You could even do it with a count of the number of users selling that product, which would give you a little more detail. The only problem with this model is that I would only use it for truly static, historical data with no need to add new products.

You can convert from any one format to the other formats.
Data in the first format:
CREATE TABLE table1 ( Usr, Product, Day ) AS
SELECT 1, 'A', 1 FROM DUAL UNION ALL
SELECT 1, 'B', 1 FROM DUAL UNION ALL
SELECT 1, 'A', 2 FROM DUAL UNION ALL
SELECT 1, 'B', 2 FROM DUAL UNION ALL
SELECT 1, 'C', 3 FROM DUAL;
Then:
SELECT usr,
product,
day + DECODE( action, 'Removed', 1, 0) AS day,
action
FROM (
SELECT Usr,
Product,
Day,
CASE
WHEN LAG( Day ) OVER ( PARTITION BY Usr, Product ORDER BY Day ) = Day - 1
THEN NULL
ELSE 'Added'
END AS Added,
CASE
WHEN LEAD( Day ) OVER ( PARTITION BY Usr, Product ORDER BY Day ) = Day + 1
THEN NULL
WHEN Day = MAX( Day ) OVER ()
THEN NULL
ELSE 'Removed'
END AS Removed
FROM table1
)
UNPIVOT ( action FOR value IN ( Added, Removed ) )
outputs the data in the second format:
USR | PRODUCT | DAY | ACTION
--: | :------ | --: | :------
1 | A | 1 | Added
1 | A | 3 | Removed
1 | B | 1 | Added
1 | B | 3 | Removed
1 | C | 3 | Added
and:
SELECT Usr,
Product,
MIN( Day ) AS "Start",
CASE MAX( Day )
WHEN Last_Day
THEN NULL
ELSE MAX( Day ) + 1
END AS "End"
FROM (
SELECT Usr,
Product,
Day,
Day - ROW_NUMBER() OVER ( PARTITION BY Usr, Product ORDER BY Day ) AS grp,
MAX( Day ) OVER () AS last_day
FROM table1
)
GROUP BY Usr, Product, Grp, Last_Day
ORDER BY Usr, Product, "Start"
Outputs the data in the third format:
USR | PRODUCT | Start | End
--: | :------ | :---- | ---:
1 | A | 1 | 3
1 | B | 1 | 3
1 | C | 3 | null
Data in the second format:
CREATE TABLE table2 ( Usr, Product, Day, Action ) AS
SELECT 1, 'A', 1, 'Added' FROM DUAL UNION ALL
SELECT 1, 'A', 3, 'Removed' FROM DUAL UNION ALL
SELECT 1, 'B', 1, 'Added' FROM DUAL UNION ALL
SELECT 1, 'B', 3, 'Removed' FROM DUAL UNION ALL
SELECT 1, 'C', 3, 'Added' FROM DUAL;
Then you can convert it to the third format using:
SELECT Usr,
Product,
"Start",
"End"
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( PARTITION BY Usr, Product, Action ORDER BY Day ) AS rn
FROM table2 t
)
PIVOT (
MAX( Day )
FOR Action IN (
'Added' AS "Start",
'Removed' AS "End"
)
)
Which outputs:
USR | PRODUCT | Start | End
--: | :------ | ----: | ---:
1 | A | 1 | 3
1 | B | 1 | 3
1 | C | 3 | null
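Where PIVOT isn't available, the same 2-to-3 conversion can be written with conditional aggregation. A sketch in Python + SQLite (names are illustrative); it assumes each user/product has at most one Added/Removed pair — with repeated add/remove cycles you would also group on the ROW_NUMBER value, as in the Oracle query above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE changes (usr INTEGER, product TEXT, action TEXT, day INTEGER);
INSERT INTO changes VALUES
  (1, 'A', 'Added', 1), (1, 'A', 'Removed', 3),
  (1, 'B', 'Added', 1), (1, 'B', 'Removed', 3),
  (1, 'C', 'Added', 3);
""")

# Conditional aggregation: each action lands in its own column.
rows = con.execute("""
SELECT usr, product,
       MAX(CASE WHEN action = 'Added'   THEN day END) AS start_day,
       MAX(CASE WHEN action = 'Removed' THEN day END) AS end_day
FROM changes
GROUP BY usr, product
ORDER BY usr, product
""").fetchall()

print(rows)  # [(1, 'A', 1, 3), (1, 'B', 1, 3), (1, 'C', 3, None)]
```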
Data in the third format:
CREATE TABLE table3 ( Usr, Product, "Start", "End" ) AS
SELECT 1, 'A', 1, 3 FROM DUAL UNION ALL
SELECT 1, 'B', 1, 3 FROM DUAL UNION ALL
SELECT 1, 'C', 3, NULL FROM DUAL;
Then to get the data in the first format you can use:
WITH unrolled_data ( Usr, Product, Day, "End" ) AS (
SELECT Usr, Product, "Start", "End"
FROM table3
UNION ALL
SELECT Usr, Product, Day + 1, "End"
FROM unrolled_data
WHERE Day + 1 < COALESCE( "End", 4 /* The last day + 1 */ )
)
SELECT Usr, Product, Day
FROM unrolled_data
ORDER BY Usr, Day, Product
Outputs:
USR | PRODUCT | DAY
--: | :------ | --:
1 | A | 1
1 | B | 1
1 | A | 2
1 | B | 2
1 | C | 3
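The recursive unrolling above is portable to any engine with recursive CTEs; a minimal sketch in Python + SQLite (illustrative names, same hard-coded "last day + 1" assumption as the Oracle query):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE intervals (usr INTEGER, product TEXT, start_day INTEGER, end_day INTEGER);
INSERT INTO intervals VALUES (1, 'A', 1, 3), (1, 'B', 1, 3), (1, 'C', 3, NULL);
""")

# Unroll each half-open [start_day, end_day) interval into one row per day.
# The literal 4 is the "last day + 1" assumption from the Oracle version.
rows = con.execute("""
WITH RECURSIVE unrolled(usr, product, day, end_day) AS (
    SELECT usr, product, start_day, end_day FROM intervals
    UNION ALL
    SELECT usr, product, day + 1, end_day FROM unrolled
    WHERE day + 1 < COALESCE(end_day, 4)
)
SELECT usr, product, day FROM unrolled
ORDER BY usr, day, product
""").fetchall()

print(rows)  # [(1, 'A', 1), (1, 'B', 1), (1, 'A', 2), (1, 'B', 2), (1, 'C', 3)]
```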
And you can convert to the second format using:
SELECT *
FROM table3
UNPIVOT ( Day FOR Action IN ( "Start" AS 'Added', "End" AS 'Removed' ) )
Which outputs:
USR | PRODUCT | ACTION | DAY
--: | :------ | :------ | --:
1 | A | Added | 1
1 | A | Removed | 3
1 | B | Added | 1
1 | B | Removed | 3
1 | C | Added | 3
(and you can combine queries to convert from 2-to-1.)

Related

Counting current items by month

I'm trying to build a monthly tally of active equipment from a database log table, grouped by service area. I think I'm 90% of the way there; I have a list of months along with the total number of items that existed, grouped by region.
However, I also need to know the state of each item as it was on the first of each month, and this is the part I'm stuck on. For instance, Item 1 is in region A in January but moves to region B in February. Item 2 is marked as 'inactive' in February, so it shouldn't be counted. My existing query will always count Item 1 in region A, and Item 2 as 'active'.
I can correctly show that Item 3 is deleted in March, and Item 4 doesn't show up until the April count. I realize that I'm getting the first values because my query is specifying the min date; I'm just not sure how I need to change it to get what I want.
I think I'm looking for a way to group by Max(OperationDate) for each Month.
The Table looks like this:
| EQUIPID | EQUIPNAME | EQUIPACTIVE | DISTRICT | REGION | OPERATIONDATE | OPERATION |
|---------|-----------|-------------|----------|--------|----------------------|-----------|
| 1 | Item 1 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 3 | Item 3 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 0 | 1 | A | 2015-02-10T00:00:00Z | UPD |
| 1 | Item 1 | 1 | 1 | B | 2015-02-15T00:00:00Z | UPD |
| 3 | (null) | (null) | (null) | (null) | 2015-02-21T00:00:00Z | DEL |
| 1 | Item 1 | 1 | 1 | A | 2015-03-01T00:00:00Z | UPD |
| 4 | Item 4 | 1 | 1 | B | 2015-03-10T00:00:00Z | INS |
There is also a subtable that holds attributes that I care about. Its structure is similar. Unfortunately, due to previous design decisions, there is no correlation of operations between the two tables. Any joins will need to be done using the EquipmentID, with the overlapping states matched up for each date.
Current query:
--cte to build date list
WITH calendar (dt) AS
(SELECT &fromdate from dual
UNION ALL
SELECT Add_Months(dt,1)
FROM calendar
WHERE dt < &todate)
SELECT dt, a.district, a.region, count(*)
FROM
(SELECT EQUIPID, DISTRICT, REGION, OPERATION, MIN(OPERATIONDATE ) AS FirstOp, deleted.deldate
FROM Equipment_Log
LEFT JOIN
(SELECT EQUIPID,MAX(OPERATIONDATE) as DelDate
FROM Equipment_Log
WHERE OPERATION = 'DEL'
GROUP BY EQUIPID
) Deleted
ON Equipment_Log.EQUIPID = Deleted.EQUIPID
WHERE OPERATION <> 'DEL' --AND additional unimportant filters
GROUP BY EQUIPID,DISTRICT, REGION , OPERATION, deldate
) a
INNER JOIN calendar
ON (calendar.dt >= FirstOp AND calendar.dt < deldate)
OR (calendar.dt >= FirstOp AND deldate is null)
LEFT JOIN
( SELECT EQUIPID, MAX(OPERATIONDATE) as latestop
FROM SpecialEquip_Table_Log
--where SpecialEquip filters
group by EQUIPID
) SpecialEquip
ON a.EQUIPID = SpecialEquip.EQUIPID and calendar.dt >= SpecialEquip.latestop
GROUP BY dt, district, region
ORDER BY dt, district, region
Take only the last operation for each id in each month; this is what row_number() and where rn = 1 do.
We have a calendar and the data. Make a partitioned join.
I assumed that you need to fill in values for months where entries for an id are missing. So nvl(lag() ignore nulls) is needed, because if something appeared in January it still exists in February and March, and we need the district and region values from the last non-empty row.
Now you have everything to make the count. The part where you mentioned SpecialEquip_Table_Log is up to you: you left-joined that table but never used it afterwards, so what is it for? Join it if you need it; you have the id.
with
calendar(mth) as (
select date '2015-01-01' from dual union all
select add_months(mth, 1) from calendar where mth < date '2015-05-01'),
data as (
select id, dis, reg, dt, op, act
from (
select equipid id, district dis, region reg,
to_char(operationdate, 'yyyy-mm') dt,
row_number()
over (partition by equipid, trunc(operationdate, 'month')
order by operationdate desc) rn,
operation op, nvl(equipactive, 0) act
from t)
where rn = 1 )
select mth, dis, reg, sum(act) cnt
from (
select id, mth,
nvl(dis, lag(dis) ignore nulls over (partition by id order by mth)) dis,
nvl(reg, lag(reg) ignore nulls over (partition by id order by mth)) reg,
nvl(act, lag(act) ignore nulls over (partition by id order by mth)) act
from calendar
left join data partition by (id) on dt = to_char(mth, 'yyyy-mm') )
group by mth, dis, reg
having sum(act) > 0
order by mth, dis, reg
It may seem complicated, so please run subqueries separately at first to see what is going on. And test :) Hope this helps.
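Oracle's `partition by (id)` outer join has no direct equivalent in SQLite or MySQL. A hedged sketch of the same densification via CROSS JOIN, on toy data rather than the asker's schema, with the LAG ... IGNORE NULLS forward-fill done in Python since SQLite lacks IGNORE NULLS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE months (mth INTEGER);
INSERT INTO months VALUES (1), (2), (3);
CREATE TABLE log (id INTEGER, mth INTEGER, region TEXT);
INSERT INTO log VALUES (1, 1, 'A'), (1, 2, 'B'), (2, 1, 'A');
""")

# Densify: every (month, id) combination, whether or not a log row exists.
rows = con.execute("""
SELECT m.mth, i.id, l.region
FROM months m
CROSS JOIN (SELECT DISTINCT id FROM log) i
LEFT JOIN log l ON l.id = i.id AND l.mth = m.mth
ORDER BY i.id, m.mth
""").fetchall()

# Forward-fill per id (the lag() ignore nulls step from the Oracle answer).
filled, last = [], {}
for mth, id_, region in rows:
    if region is not None:
        last[id_] = region
    filled.append((mth, id_, last.get(id_)))

print(filled)
# [(1, 1, 'A'), (2, 1, 'B'), (3, 1, 'B'), (1, 2, 'A'), (2, 2, 'A'), (3, 2, 'A')]
```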

How to return BigQuery table rows with a max value

I have a simple BigQuery table with 3 columns (and some example data) below:
| Name | Time | Value |
|------|------|-------|
| a    | 1    | x     |
| a    | 2    | y     |
| a    | 3    | z     |
| b    | 1    | x     |
| b    | 4    | y     |
For each name, I'd like to return the value with the max time.
For the above table, the 3rd and 5th rows should be returned, e.g.,
| Name | Time | Value |
|------|------|-------|
| a    | 3    | z     |
| b    | 4    | y     |
It is roughly like: (1) first group by Name, (2) find the max time in each group, (3) identify the row with that max time.
It seems for (1) and (2) we can use GROUP BY + MAX(), but I'm not sure how to achieve step (3).
Does anyone have ideas on the best query to achieve this?
Thanks a lot.
ROW_NUMBER is one way to go here:
SELECT Name, Time, Value
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Time DESC) rn
FROM yourTable
) t
WHERE rn = 1;
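The ROW_NUMBER pattern is portable; a quick check in Python + SQLite against the question's sample data (`yourTable` is the placeholder name from the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE yourTable (Name TEXT, Time INTEGER, Value TEXT);
INSERT INTO yourTable VALUES
  ('a', 1, 'x'), ('a', 2, 'y'), ('a', 3, 'z'),
  ('b', 1, 'x'), ('b', 4, 'y');
""")

# rn = 1 marks the row with the latest Time within each Name.
rows = con.execute("""
SELECT Name, Time, Value
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Time DESC) rn
    FROM yourTable
) t
WHERE rn = 1
ORDER BY Name
""").fetchall()

print(rows)  # [('a', 3, 'z'), ('b', 4, 'y')]
```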
Using QUALIFY we can try:
SELECT Name, Time, Value
FROM yourTable
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Time DESC) = 1;
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY time DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY name
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' name, 1 time, 'x' value UNION ALL
SELECT 'a', 2, 'y' UNION ALL
SELECT 'a', 3, 'z' UNION ALL
SELECT 'b', 1, 'x' UNION ALL
SELECT 'b', 4, 'y'
)
SELECT AS VALUE ARRAY_AGG(t ORDER BY time DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY name
the result is
ROW | NAME | TIME | VALUE
--: | :--- | ---: | :----
  1 | a    |    3 | z
  2 | b    |    4 | y

Oracle Sql: Obtain a Sum of a Group, if Subgroup condition met

I have a dataset from which I am trying to obtain a summed value for each group, if a subgroup within that group meets a certain condition. I am not sure if this is possible, or if I am approaching this problem incorrectly.
My data is structured as following:
+----+-------------+---------+-------+
| ID | Transaction | Product | Value |
+----+-------------+---------+-------+
| 1 | A | 0 | 10 |
| 1 | A | 1 | 15 |
| 1 | A | 2 | 20 |
| 1 | B | 1 | 5 |
| 1 | B | 2 | 10 |
+----+-------------+---------+-------+
In this example I want to obtain the sum of values by the ID column, if a transaction does not contain any products labeled 0. In the scenario described above, all values related to transaction A would be excluded because product 0 was purchased. The outcome would be:
+----+-------------+
| ID | Sum of Value|
+----+-------------+
| 1 | 15 |
+----+-------------+
This process would repeat for multiple IDs with each ID only containing the sum of values if the transaction does not contain product 0.
Hmmm . . . one method is to use not exists for the filtering:
select id, sum(value)
from t
where not exists (select 1
from t t2
where t2.id = t.id and t2.transaction = t.transaction and
t2.product = 0
)
group by id;
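A quick sanity check of the NOT EXISTS approach in Python + SQLite; the column is renamed `txn` here because TRANSACTION is a keyword in most engines:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, txn TEXT, product INTEGER, value INTEGER);
INSERT INTO t VALUES
  (1, 'A', 0, 10), (1, 'A', 1, 15), (1, 'A', 2, 20),
  (1, 'B', 1, 5),  (1, 'B', 2, 10);
""")

# Keep only rows whose transaction never contains product 0, then sum per id.
rows = con.execute("""
SELECT id, SUM(value)
FROM t
WHERE NOT EXISTS (
    SELECT 1 FROM t t2
    WHERE t2.id = t.id AND t2.txn = t.txn AND t2.product = 0
)
GROUP BY id
""").fetchall()

print(rows)  # [(1, 15)]
```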
You do not need a correlated subquery with not exists.
Just use group by with a having clause.
with s (id, transaction, product, value) as (
select 1, 'A', 0, 10 from dual union all
select 1, 'A', 1, 15 from dual union all
select 1, 'A', 2, 20 from dual union all
select 1, 'B', 1, 5 from dual union all
select 1, 'B', 2, 10 from dual)
select id, sum(sum_value) as sum_value
from
(select id, transaction,
sum(value) as sum_value
from s
group by id, transaction
having count(decode(product, 0, 1)) = 0
)
group by id;
ID SUM_VALUE
---------- ----------
1 15

To count a column based on another column's repeating(same) entry

I want to create a report of the calls last made, grouped by call group and by the number of weeks since the last call.
The actual data is like below, with call id, date of call and call group:
callid | Date | Group
----------------------------
1 | 6-1-18 | a1
2 | 6-1-18 | a2
3 | 7-1-18 | a3
4 | 8-1-18 | a1
5 | 9-1-18 | a2
6 | 9-1-18 | a4
The expected output displays the number of calls for each call group against the number of weeks from the last call:
weeks from | Group | Group
last call  |  a1   |  a2
--------------------------
     1     |   2   |   2    -> number of calls
     2     |   -   |   -
     3     |   1   |   -
     4     |   2   |   -
     5     |   -   |   3
     6     |   -   |   -
Can anyone please suggest a solution for this?
Although the data you provided is a very small set and not rich enough to cover all cases, here is a SQL query that calculates the number of weeks between each call and the last call within its group, and counts the number of calls per group for each week difference.
with your_table as (
select 1 as "callid", to_date('6-1-18','mm-dd-rr') as "date", 'a1' as "group" from dual
union select 2, to_date('6-1-18','mm-dd-rr'), 'a2' from dual
union select 3, to_date('7-1-18','mm-dd-rr'), 'a3' from dual
union select 4, to_date('8-1-18','mm-dd-rr'), 'a1' from dual
union select 5, to_date('9-1-18','mm-dd-rr'), 'a2' from dual
union select 6, to_date('9-1-18','mm-dd-rr'), 'a4' from dual
),
data1 as (
select t.*, max(t."date") over (partition by t."group") last_call_dt from your_table t
),
data2 as (select t.*, round((last_call_dt-t."date")/7,0) as weeks_diff from data1 t)
select * from (
select t.weeks_diff, t."callid", t."group" from data2 t
)
PIVOT
(
COUNT("callid")
FOR "group" IN ('a1', 'a2', 'a3','a4')
)
order by weeks_diff
to try it out with your table just make the following change:
with your_table as (select * from my_table), ....
let me know how it goes :)
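Engines without PIVOT can get the same shape with conditional aggregation. A hedged sketch in Python + SQLite, assuming the dates are month-day-year (June through September 2018) as in the format mask above; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE calls (callid INTEGER, dt TEXT, grp TEXT);
INSERT INTO calls VALUES
  (1, '2018-06-01', 'a1'), (2, '2018-06-01', 'a2'),
  (3, '2018-07-01', 'a3'), (4, '2018-08-01', 'a1'),
  (5, '2018-09-01', 'a2'), (6, '2018-09-01', 'a4');
""")

# Weeks between each call and the group's last call, then one column per group.
rows = con.execute("""
WITH d AS (
    SELECT grp, dt, MAX(dt) OVER (PARTITION BY grp) AS last_dt
    FROM calls
),
w AS (
    SELECT grp,
           CAST(ROUND((julianday(last_dt) - julianday(dt)) / 7) AS INTEGER) AS weeks_diff
    FROM d
)
SELECT weeks_diff,
       COUNT(CASE WHEN grp = 'a1' THEN 1 END) AS a1,
       COUNT(CASE WHEN grp = 'a2' THEN 1 END) AS a2
FROM w
GROUP BY weeks_diff
ORDER BY weeks_diff
""").fetchall()

print(rows)  # [(0, 1, 1), (9, 1, 0), (13, 0, 1)]
```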

sql group by personalised condition

Hi, I have a table as below:
+--------+--------+
| day | amount|
+--------+---------
| 2 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
+--------+--------+
Now I want something like this: the sum for days 1-2 as row one, the sum for days 1-3 as row two, and so on.
+--------+--------+
| day | amount|
+--------+---------
| 1-2 | 11 |
| 1-3 | 14 |
| 1-4 | 17 |
+--------+--------+
Could anyone offer any help? Thanks!
with data as(
select 2 day, 2 amount from dual union all
select 1 day, 3 amount from dual union all
select 1 day, 4 amount from dual union all
select 2 day, 2 amount from dual union all
select 3 day, 3 amount from dual union all
select 4 day, 3 amount from dual)
select distinct day, sum(amount) over (order by day range unbounded preceding) cume_amount
from data
order by 1;
DAY CUME_AMOUNT
---------- -----------
1 7
2 11
3 14
4 17
If you are using Oracle, you can do something like the above.
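The same cumulative window sum runs unchanged in most modern engines; a quick check in Python + SQLite with the question's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (day INTEGER, amount INTEGER);
INSERT INTO data VALUES (2, 2), (1, 3), (1, 4), (2, 2), (3, 3), (4, 3);
""")

# RANGE frames treat equal days as peers, so both day-1 rows get the full 7.
rows = con.execute("""
SELECT DISTINCT day,
       SUM(amount) OVER (ORDER BY day RANGE UNBOUNDED PRECEDING) AS cume_amount
FROM data
ORDER BY day
""").fetchall()

print(rows)  # [(1, 7), (2, 11), (3, 14), (4, 17)]
```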
Assuming the day range in the left column always starts from "1-", what you need is a query doing a cumulative sum on the grouped table (dayWiseSum below). Since it needs to be accessed twice, I'd put it into a temporary table.
CREATE TEMPORARY TABLE dayWiseSum AS
(SELECT day,SUM(amount) AS amount FROM table1 GROUP BY day ORDER BY day);
SELECT CONCAT('1-', t1.day) AS day, SUM(t2.amount) AS amount
FROM dayWiseSum t1
INNER JOIN dayWiseSum t2 ON t1.day >= t2.day
-- >= so each range includes its own day; add HAVING t1.day > 1 to drop the "1-1" row
GROUP BY t1.day ORDER BY t1.day;
DROP TABLE dayWiseSum;
Here's a fiddle to test with:
http://sqlfiddle.com/#!9/c1656/1/0
Note: since SQL Fiddle isn't allowing CREATE statements, I've replaced dayWiseSum with its query there. Also, I used the "Text to DDL" option to paste the exact text of the table from your question to generate the CREATE TABLE query :)