Comparing two sets of data - sql

Very sorry if this has been answered in some way. I have checked all over and can't figure it out.
I need to find a way in postgresql to compare data from week to week. All data exists in the same table, and has a Week number column. Data will not always completely overlap but I need to compare data within groups when they do.
Say these are the data sets:
Week 2
+--------+--------+------+---------+-------+
| group | num | color| ID | week #|
+--------+--------+------+---------+-------+
| a | 1 | red | a1red | 2 |
| a | 2 | blue | a2blue | 2 |
| b | 3 | blue | b3blue | 2 |
| c | 7 | black| c7black | 2 |
| d | 8 | black| d8black | 2 |
| d | 9 | red | d9red | 2 |
| d | 10 | gray | d10gray | 2 |
+--------+--------+------+---------+-------+
Week 3
+--------+--------+------+---------+-------+
| group | num | color| ID | week #|
+--------+--------+------+---------+-------+
| a | 1 | red | a1red | 3 |
| a | 2 | green| a2green | 3 |
| b | 3 | blue | b3blue | 3 |
| b | 5 | green| b5green | 3 |
| c | 7 | black| c7black | 3 |
| e | 11 | blue | d11blue | 3 |
| e | 12 | other| d12other| 3 |
| e | 14 | brown| d14brown| 3 |
+--------+--------+------+---------+-------+
Each row has an ID made out of the group, number, and color values.
I need the query to grab all groups from Week 3, then for any groups in Week 3 that exist in Week 2:
flag IDs within the group that have changed, like in group A.
flag if any IDs were added to or removed from the group, like in group B.
One function that would be nice to have, but is not essential, would be to have Week 3 compare against Week 1 for groups that do not exist in Week 2.
I have thought about splitting the two weeks apart and using intersect/except to get results, but I can't quite wrap my head around how to make that work correctly. Any tips would be much appreciated.

For just two (known) weeks you can do something like this:
select coalesce(w1.group_nr, w2.group_nr) as group_nr,
coalesce(w1.num, w2.num) as num,
case
when w1.group_nr is null then 'missing in first week'
when w2.group_nr is null then 'missing in second week'
when (w1.color, w1.id) is distinct from (w2.color, w2.id) then 'data has changed'
else 'no change'
end as status,
case
when
w1.group_nr is not null
and w2.group_nr is not null
and w1.color is distinct from w2.color then 'color is different'
end as color_change,
case
when
w1.group_nr is not null
and w2.group_nr is not null
and w1.id is distinct from w2.id then 'id is different'
end as id_change
from (
select group_nr, num, color, id
from data
where week = 2
) as w1
full outer join (
select group_nr, num, color, id
from data
where week = 3
) w2 on (w1.group_nr, w1.num) = (w2.group_nr, w2.num)
Getting the attributes that have changed is a bit clumsy. If you can live with a textual representation, you could use the hstore extension to display the differences:
select coalesce(w1.group_nr, w2.group_nr) as group_nr,
coalesce(w1.num, w2.num) as num,
case
when w1.group_nr is null then 'missing in first week'
when w2.group_nr is null then 'missing in second week'
when (w1.color, w1.id) is distinct from (w2.color, w2.id) then 'data has changed'
else 'no change'
end as status,
w2.attributes - w1.attributes as changed_attributes
from (
select group_nr, num, color, id, hstore(data) - 'week'::text as attributes
from data
where week = 2
) as w1
full outer join (
select group_nr, num, color, id, hstore(data) - 'week'::text as attributes
from data
where week = 3
) w2 on (w1.group_nr, w1.num) = (w2.group_nr, w2.num);
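The status logic of the first query can be tried out without a PostgreSQL server. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in, and because older SQLite versions lack FULL OUTER JOIN it emulates that join with a LEFT JOIN plus an anti-join. The table name data and column group_nr follow the answer; the reduced sample rows and the helper compare_weeks are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table data (group_nr text, num integer, color text, id text, week integer)"
)
# Reduced sample taken from the question's two weeks (not the full data set).
conn.executemany(
    "insert into data values (?, ?, ?, ?, ?)",
    [
        ("a", 1, "red", "a1red", 2),
        ("a", 2, "blue", "a2blue", 2),
        ("b", 3, "blue", "b3blue", 2),
        ("d", 8, "black", "d8black", 2),
        ("a", 1, "red", "a1red", 3),
        ("a", 2, "green", "a2green", 3),
        ("b", 3, "blue", "b3blue", 3),
        ("b", 5, "green", "b5green", 3),
    ],
)


def compare_weeks(conn, first, second):
    """Hypothetical helper: classify every (group_nr, num) seen in either week."""
    sql = """
        -- rows present in the second week, matched against the first week
        select w2.group_nr, w2.num,
               case
                   when w1.group_nr is null then 'missing in first week'
                   when w1.color <> w2.color or w1.id <> w2.id then 'data has changed'
                   else 'no change'
               end as status
        from data w2
        left join data w1
               on w1.week = :first
              and w1.group_nr = w2.group_nr
              and w1.num = w2.num
        where w2.week = :second
        union all
        -- rows that exist only in the first week (the other half of the outer join)
        select w1.group_nr, w1.num, 'missing in second week'
        from data w1
        where w1.week = :first
          and not exists (
              select 1 from data w2
              where w2.week = :second
                and w2.group_nr = w1.group_nr
                and w2.num = w1.num
          )
        order by 1, 2
    """
    return conn.execute(sql, {"first": first, "second": second}).fetchall()


for row in compare_weeks(conn, 2, 3):
    print(row)
```

To restrict the result to groups that exist in week 3, as the question asks, the second branch of the union could be filtered further; that is left out here to keep the sketch close to the answer's query.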

Related

How to return the same period last year data with SQL?

I am trying to create a view in PostgreSQL with the requirements below:
The view needs to show the value from the same period last year for every record.
Sample data:
date_sk | location_sk | division_sk | employee_type_sk | value
20180202 | 6 | 8 | 4 | 1
20180202 | 7 | 2 | 4 | 2
20190202 | 6 | 8 | 4 | 1
20190202 | 7 | 2 | 4 | 1
20200202 | 6 | 8 | 4 | 1
20200202 | 7 | 2 | 4 | 3
In the table, date_sk, location_sk, division_sk and employee_type_sk together form a composite key that uniquely identifies a record.
You can check the required output as below:
date_sk | location_sk | division_sk | employee_type_sk | value | value_last_year
20180202 | 6 | 8 | 4 | 1 | NULL
20180203 | 7 | 2 | 4 | 2 | NULL
20190202 | 6 | 8 | 4 | 1 | 1
20190203 | 7 | 3 | 4 | 1 | NULL
20200202 | 6 | 8 | 4 | 1 | 1
20200203 | 7 | 3 | 4 | 3 | 1
The records start on 20180202, so no data exists for the same period in the previous year. In the 4th record, division_sk differs from the same period last year, hence value_last_year is NULL.
My current solution is to create a view from the sample data with an addition column as same_date_last_year then LEFT JOIN the same table. The SQL queries are below:
CREATE VIEW test_view AS
SELECT *,
CONCAT(LEFT(date_sk, 4) - 1, RIGHT(date_sk, 4)) AS same_date_last_year
FROM test_table;
SELECT
test_view.date_sk,
test_view.location_sk,
test_view.division_sk,
test_view.employee_type_sk,
test_view.value,
test_table.value AS value_last_year
FROM test_view
LEFT JOIN test_table ON (test_view.same_date_last_year = test_table.date_sk)
We have a lot of data in the table. My solution above is unacceptable in terms of performance.
Is there a different query that yields the same result with better performance?
You could simply use a correlated subquery here which is likely best for performance:
select *,
(
select value from t t2
where t2.date_sk=t.date_sk - interval '1' year and
t2.location_sk=t.location_sk and
t2.division_sk=t.division_sk and
t2.employee_type_sk=t.employee_type_sk
) as value_last_year
from t
WITH CTE(DATE_SK,LOCATION_SK,DIVISION_SK,EMPLOYEE_TYPE_SK,VALUE)AS
(
SELECT CAST('20180202' AS DATE),6,8,4,1 UNION ALL
SELECT CAST('20180203'AS DATE),7,2,4,2 UNION ALL
SELECT CAST('20190202'AS DATE),6,8,4,1 UNION ALL
SELECT CAST('20190203'AS DATE),7,2,4,1 UNION ALL
SELECT CAST('20200202'AS DATE),6,8,4,1 UNION ALL
SELECT CAST('20200203'AS DATE),7,2,4,3
)
SELECT C.DATE_SK,C.LOCATION_SK,C.DIVISION_SK,C.EMPLOYEE_TYPE_SK,C.VALUE,
LAG(C.VALUE)OVER(PARTITION BY C.LOCATION_SK,C.DIVISION_SK,C.EMPLOYEE_TYPE_SK ORDER BY C.DATE_SK ASC)LAGG
FROM CTE AS C
ORDER BY C.DATE_SK ASC;
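Please check whether the above works for you. I assume DATE_SK is a date column or can be cast to a date. Note that LAG returns the previous row within each partition, so it matches the same period last year only when every key combination has exactly one row per year.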
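If date_sk is stored as an integer in yyyymmdd form, as the sample data suggests, the same period last year is simply date_sk - 10000, so a self join on all four key columns may be enough. The sketch below assumes that integer encoding (it is not confirmed by the question) and uses Python's sqlite3 module as a stand-in for PostgreSQL; the aliases cur and prev are made up for illustration. A key of 0229 would find no partner in a non-leap year and yield NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """create table test_table (
           date_sk integer, location_sk integer, division_sk integer,
           employee_type_sk integer, value integer)"""
)
# Sample data from the question, with date_sk assumed to be an integer key.
conn.executemany(
    "insert into test_table values (?, ?, ?, ?, ?)",
    [
        (20180202, 6, 8, 4, 1),
        (20180202, 7, 2, 4, 2),
        (20190202, 6, 8, 4, 1),
        (20190202, 7, 2, 4, 1),
        (20200202, 6, 8, 4, 1),
        (20200202, 7, 2, 4, 3),
    ],
)

sql = """
    select cur.date_sk, cur.location_sk, cur.division_sk,
           cur.employee_type_sk, cur.value,
           prev.value as value_last_year
    from test_table cur
    left join test_table prev
           on prev.date_sk = cur.date_sk - 10000   -- same day, one year earlier
          and prev.location_sk = cur.location_sk
          and prev.division_sk = cur.division_sk
          and prev.employee_type_sk = cur.employee_type_sk
    order by cur.date_sk, cur.location_sk
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

Joining on the full key, rather than on the date alone as in the question's query, avoids multiplying rows, and an index or unique constraint over the four key columns would let the database look up each previous-year row directly.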

SQL- Invalid query because of aggregate function with simple calculations for each date and ID

I am very new to SQL (less than 100 hours). The problem is described below. Every time I try a query I either get incorrect output or the error "not contained in either an aggregate function or the GROUP BY clause".
I have searched for similar questions and examples but found nothing, and I am lost now.
Please help.
I have three tables
Table Calc,
Source_id | date(yyyymmdd) | metric1 | metric 2
-------------------------------------------------
1 | 20201010 | 2 | 3
2 | 20201010 | 4 | 5
3 | 20201010 | 6 | 7
1 | 20201011 | 8 | 9
2 | 20201011 | 10 | 11
3 | 20201011 | 12 | 13
1 | 20201012 | 14 | 15
2 | 20201012 | 16 | 17
3 | 20201012 | 18 | 19
Table Source
Source_id | Description
------------------------
1 | ABC
2 | DEF
3 | XYZ
Table Factor
Date | Factor
-----------------
20201010 | .3
20201011 | .5
20201012 | .7
If the dates selected by the user are 20201010 to 20201012, then the result will be
Required result
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | (((2x3)x.3 + (8x9)x.5 + (14x15)x.7))/(No of dates selected in this case =3)
DEF | (((4x5)x.3+ (10x11)x.5 + (16x17)x.7))/(No of dates selected in this case =3)
XYZ | (((6x7)x.3+ (12x13)x.5 + (18x19)x.7))/(No of dates selected in this case =3)
Dates will be user defined input so the calculated value should be average of that many dates. Selected dates will always be defined in range rather than random multiple selection.
In table calc, source_id and date together will be unique.
Each date has factor which is to be multiplied with all source_id for that date.
If the dates selected by the user are 20201010 to 20201011, then the result will be
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | ((2x3)x.3+(8x9)x.5)/2
DEF | ((4x5)x.3+(10x11)x.5)/2
XYZ | ((6x7)x.3+(12x13)x.5)/2
If the only date selected by the user is 20201012, then the result will be
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | (14x15)x.7
DEF | (16x17)x.7
XYZ | (18x19)x.7
Create a CTE to store the starting and ending dates and cross join it to the join of the tables, group by source and aggregate:
WITH cte AS (SELECT '20201010' min_date, '20201012' max_date)
SELECT s.Description,
ROUND(SUM(c.metric1 * c.metric2 * f.factor / (DATEDIFF(day, t.min_date, t.max_date) + 1)), 2) calculated_value
FROM cte t CROSS JOIN Source s
LEFT JOIN Calc c ON c.Source_id = s.Source_id AND c.date BETWEEN t.min_date AND t.max_date
LEFT JOIN Factor f ON f.date = c.date
GROUP BY s.Source_id, s.Description
Results:
> Description | calculated_value
> :---------- | ---------------:
> ABC | 61.60
> DEF | 83.80
> XYZ | 110.00
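The arithmetic behind these results can be reproduced in a small self-contained sketch. It assumes Python's sqlite3 module instead of SQL Server, stores the dates as yyyymmdd text, and computes the number of selected days in Python because SQLite has no DATEDIFF; the helper name calculated_values is hypothetical.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table calc (source_id integer, date text, metric1 integer, metric2 integer);
    create table source (source_id integer, description text);
    create table factor (date text, factor real);
    """
)
conn.executemany(
    "insert into calc values (?, ?, ?, ?)",
    [
        (1, "20201010", 2, 3), (2, "20201010", 4, 5), (3, "20201010", 6, 7),
        (1, "20201011", 8, 9), (2, "20201011", 10, 11), (3, "20201011", 12, 13),
        (1, "20201012", 14, 15), (2, "20201012", 16, 17), (3, "20201012", 18, 19),
    ],
)
conn.executemany("insert into source values (?, ?)", [(1, "ABC"), (2, "DEF"), (3, "XYZ")])
conn.executemany(
    "insert into factor values (?, ?)",
    [("20201010", 0.3), ("20201011", 0.5), ("20201012", 0.7)],
)


def calculated_values(conn, min_date, max_date):
    """Hypothetical helper: weighted sum per source divided by the days in the range."""
    fmt = "%Y%m%d"
    n_days = (datetime.strptime(max_date, fmt) - datetime.strptime(min_date, fmt)).days + 1
    sql = """
        select s.description,
               round(sum(c.metric1 * c.metric2 * f.factor) / :n_days, 2)
        from source s
        join calc c on c.source_id = s.source_id
                   and c.date between :min_date and :max_date
        join factor f on f.date = c.date
        group by s.source_id, s.description
        order by s.source_id
    """
    params = {"min_date": min_date, "max_date": max_date, "n_days": n_days}
    return dict(conn.execute(sql, params).fetchall())


print(calculated_values(conn, "20201010", "20201012"))
```

Every selected column is either in the GROUP BY list or wrapped in an aggregate, which is what avoids the "not contained in either an aggregate function or the GROUP BY clause" error from the question.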

CASE-Statement in WHERE-Clause | SQL

Hi, I have the following table 'Month', which holds the current month:
+---------------+
| current_Month |
+---------------+
| 12 |
+---------------+
And I have another Table with workers 'Workers'
+--------+--------------------------+
| Name | Month_joined_the_company |
+--------+--------------------------+
| Peter | 12 |
| Paul | 9 |
| Sarah | 5 |
| Donald | 12 |
+--------+--------------------------+
Based on my Month table, I now want to display all workers who joined the company up to the previous month. For example, if the current month is 10, I would like this output:
+--------+--------------------------+
| Name | Month_joined_the_company |
+--------+--------------------------+
| Paul | 9 |
| Sarah | 5 |
+--------+--------------------------+
But at the end of the year, I would like to include all workers, even those whose month is equal to the current month:
+--------+--------------------------+
| Name | Month_joined_the_company |
+--------+--------------------------+
| Peter | 12 |
| Paul | 9 |
| Sarah | 5 |
| Donald | 12 |
+--------+--------------------------+
I now have this Statement, but it does not work...
SELECT *
FROM workers
WHERE
CASE
WHEN (SELECT TOP (1) Current_Month FROM Month) = 12
THEN (Month_joined_the_company <= (SELECT TOP (1) Current_Month FROM Month))
ELSE (Month_joined_the_company < (SELECT TOP (1) Current_Month FROM Month))
END
This fails with an error. Can someone show me how to use CASE in a WHERE clause?
Is this what you want?
select w.*
from workers w
inner join month m
on m.current_month = 12
or w.month_joined_the_company < m.current_month
This reads as: if current_month = 12 then return all workers, else return only those whose month_joined_the_company is strictly smaller than current_month.
NB: you should probably consider using date datatypes to store these values, otherwise what happens when a new year begins?
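The join condition can be checked against both cases from the question with a short sketch. It assumes Python's sqlite3 module as a stand-in for SQL Server; the helper workers_up_to is a hypothetical name that rebuilds the two tables for a given current month.

```python
import sqlite3


def workers_up_to(current_month):
    """Hypothetical helper: names returned by the join for a given current month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table month (current_month integer)")
    conn.execute("create table workers (name text, month_joined_the_company integer)")
    conn.execute("insert into month values (?)", (current_month,))
    conn.executemany(
        "insert into workers values (?, ?)",
        [("Peter", 12), ("Paul", 9), ("Sarah", 5), ("Donald", 12)],
    )
    sql = """
        select w.name
        from workers w
        join month m
          on m.current_month = 12                              -- year end: keep everyone
          or w.month_joined_the_company < m.current_month      -- otherwise: strictly earlier
        order by w.name
    """
    return [name for (name,) in conn.execute(sql)]


print(workers_up_to(10))
print(workers_up_to(12))
```

The original attempt fails because a CASE expression has to return a value, not a predicate; rewriting the branches as conditions combined with OR sidesteps that restriction.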

Union in outer query

I'm attempting to combine multiple rows using a UNION but I need to pull in additional data as well. My thought was to use a UNION in the outer query but I can't seem to make it work. Or am I going about this all wrong?
The data I have is like this:
+------+------+-------+---------+---------+
| ID | Time | Total | Weekday | Weekend |
+------+------+-------+---------+---------+
| 1001 | AM | 5 | 5 | 0 |
| 1001 | AM | 2 | 0 | 2 |
| 1001 | AM | 4 | 1 | 3 |
| 1001 | AM | 5 | 3 | 2 |
| 1001 | PM | 5 | 3 | 2 |
| 1001 | PM | 5 | 5 | 0 |
| 1002 | PM | 4 | 2 | 2 |
| 1002 | PM | 3 | 3 | 0 |
| 1002 | PM | 1 | 0 | 1 |
+------+------+-------+---------+---------+
What I want to see is like this:
+------+---------+------+-------+
| ID | DayType | Time | Tasks |
+------+---------+------+-------+
| 1001 | Weekday | AM | 9 |
| 1001 | Weekend | AM | 7 |
| 1001 | Weekday | PM | 8 |
| 1001 | Weekend | PM | 2 |
| 1002 | Weekday | PM | 5 |
| 1002 | Weekend | PM | 3 |
+------+---------+------+-------+
The closest I've come so far is using UNION statement like the following:
SELECT * FROM
(
SELECT Weekday, 'Weekday' as 'DayType' FROM t1
UNION
SELECT Weekend, 'Weekend' as 'DayType' FROM t1
) AS X
Which results in something like the following:
+---------+---------+
| Weekday | DayType |
+---------+---------+
| 2 | Weekend |
| 0 | Weekday |
| 2 | Weekday |
| 0 | Weekend |
| 10 | Weekday |
+---------+---------+
I don't see any rhyme or reason as to what the numbers are under the 'Weekday' column, I suspect they're being grouped somehow. And of course there are several other columns missing, but since I can't put a large scope in the outer query with this as inner one, I can't figure out how to pull those in. Help is greatly appreciated.
It looks like you want to union all a pair of aggregation queries that use sum() and group by id, time, one for Weekday and one for Weekend:
select Id, DayType = 'Weekend', [time], Tasks=sum(Weekend)
from t
group by id, [time]
union all
select Id, DayType = 'Weekday', [time], Tasks=sum(Weekday)
from t
group by id, [time]
Try with this
select ID, 'Weekday' as DayType, Time, sum(Weekday)
from t1
group by ID, Time
union all
select ID, 'Weekend', Time, sum(Weekend)
from t1
group by ID, Time
order by 1, 3, 2
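As a quick check of the union all approach against the sample data, here is a minimal sketch that runs it in Python's sqlite3 module (an assumption; the question appears to target SQL Server). The column time is quoted only as a precaution, and the aliases day_type and tasks are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table t1 (id integer, "time" text, total integer, weekday integer, weekend integer)'
)
conn.executemany(
    "insert into t1 values (?, ?, ?, ?, ?)",
    [
        (1001, "AM", 5, 5, 0),
        (1001, "AM", 2, 0, 2),
        (1001, "AM", 4, 1, 3),
        (1001, "AM", 5, 3, 2),
        (1001, "PM", 5, 3, 2),
        (1001, "PM", 5, 5, 0),
        (1002, "PM", 4, 2, 2),
        (1002, "PM", 3, 3, 0),
        (1002, "PM", 1, 0, 1),
    ],
)

sql = """
    -- one aggregation per day type, stacked with union all
    select id, 'Weekday' as day_type, "time", sum(weekday) as tasks
    from t1
    group by id, "time"
    union all
    select id, 'Weekend', "time", sum(weekend)
    from t1
    group by id, "time"
    order by 1, 3, 2   -- id, then time, then day type
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

UNION ALL is used rather than UNION because UNION removes duplicate rows, which is also why the question's attempt returned fewer rows than the table holds.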
Not tested, but it should do the trick. It may require 2 proc sql steps for the calculation, one for summing and one for the case when statements. If you have extra lines, just use a max statement and group by ID, Time, type_day.
Proc sql; create table want as select ID, Time,
sum(weekday) as weekdayTask,
sum(weekend) as weekendTask,
case when calculated weekdaytask>0 then weekdaytask
when calculated weekendtask>0 then weekendtask else .
end as Task,
case when calculated weekdaytask>0 then "Weekday"
when calculated weekendtask>0 then "Weekend"
end as Day_Type
from have
group by ID, Time
;quit;
Proc sql; create table want2 as select ID, Time, Day_Type, Task
from want
;quit;

Select dynamic couples of lines in SQL (PostgreSQL)

My objective is to build dynamic groups of lines (in fact, groups of products by TYPE and COLOR).
I don't know if it's possible with just one SELECT query.
A product is a TYPE and a COLOR. I want to group lines of the same product according to the NB_PER_GROUP column, forming the groups in date order (ORDER BY DATE).
A leftover product that cannot fill a group (for example, a single product with NB_PER_GROUP = 2) is excluded from the final result.
Table :
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results :
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) group_number, ip.num
from products ip
join (
select date, type, color, row_number() over (order by date) - 1 group_number
from (
select op.num, op.type, op.color, op.nb_per_group, op.date, (row_number() over (partition by op.type, op.color order by op.date) - 1) % nb_per_group group_order
from products op
) sq
where sq.group_order = 0
) gn
on ip.type = gn.type
and ip.color = gn.color
and ip.date >= gn.date
group by ip.num
order by group_number, ip.num
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required.
The innermost subquery partitions the rows by type and color, orders them by date, then calculates the row numbers modulo nb_per_group; this forms a 0-based count for the group that resets to 0 each time nb_per_group is exceeded.
The next-level subquery finds all of the 0 values we mapped in the lower subquery and assigns group numbers to them.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number that split off before this product's date.
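For comparison, the same chunking idea can be written procedurally. This is a different technique from the SQL above, sketched in plain Python under the same assumption that nb_per_group is constant for each combination of type and color. Unlike the query, it also drops leftover rows that cannot fill a group, as the question requests. Because the dates are elided in the question, the rows are assumed to be already sorted by date (here, in NUM order); assign_groups is a hypothetical helper.

```python
from collections import defaultdict


def assign_groups(rows):
    """rows: (num, type, color, nb_per_group) tuples, assumed already sorted by date.

    Returns (group_number, num) pairs for complete groups only.
    """
    # Collect rows per product (type, color), remembering their date-order position.
    buckets = defaultdict(list)
    for position, (num, type_, color, nb_per_group) in enumerate(rows):
        buckets[(type_, color)].append((position, num, nb_per_group))

    complete = []
    for members in buckets.values():
        size = members[0][2]  # assumes nb_per_group is constant per product
        # Step through the product's rows in chunks of `size`, skipping a short tail.
        for start in range(0, len(members) - size + 1, size):
            complete.append(members[start:start + size])

    # Number the groups by the date-order position of their first row.
    complete.sort(key=lambda chunk: chunk[0][0])
    return [
        (group_number, num)
        for group_number, chunk in enumerate(complete)
        for _, num, _ in chunk
    ]


# Sample rows from the question: (NUM, TYPE, COLOR, NB_PER_GROUP).
sample = [
    (0, 1, 1, 2), (1, 1, 1, 2), (2, 1, 2, 2), (3, 1, 2, 2),
    (4, 1, 1, 2), (5, 1, 1, 2), (6, 4, 1, 3), (7, 1, 1, 2),
    (8, 4, 1, 3), (9, 4, 1, 3), (10, 5, 1, 2),
]
for pair in assign_groups(sample):
    print(pair)
```

Rows 7 and 10 are left out because their products have no further rows to complete a group.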