SQL in R - Transform a column into rows - sql

I have the following query:
sqldf("
SELECT YEAR, SUM(SUB1) AS 'SUB1', SUM(SUB2) AS 'SUB2',
SUM(SUB3) AS 'SUB3', SUM(SUB4) AS 'SUB4', SUM(SUB5) AS 'SUB5'
FROM table1
GROUP BY YEAR
"
)
I got this table:
Is it possible to change SUB1 TO SUB5 from columns to rows?

Two methods:
Pivot post-query:
sqldf::sqldf("
SELECT YEAR, SUM(SUB1) AS 'SUB1', SUM(SUB2) AS 'SUB2',
SUM(SUB3) AS 'SUB3', SUM(SUB4) AS 'SUB4', SUM(SUB5) AS 'SUB5'
FROM table1
GROUP BY YEAR
"
) |>
reshape2::melt("YEAR")
#    YEAR variable value
# 1  2019     SUB1 19638
# 2  2020     SUB1  3223
# 3  2021     SUB1  8184
# 4  2022     SUB1 18017
# 5  2019     SUB2 16854
# 6  2020     SUB2  2731
# 7  2021     SUB2  7034
# 8  2022     SUB2 15487
# 9  2019     SUB3  1087
# 10 2020     SUB3  2278
# 11 2021     SUB3  5922
# 12 2022     SUB3 12989
# 13 2019     SUB4  8598
# 14 2020     SUB4  1385
# 15 2021     SUB4  3629
# 16 2022     SUB4  7798
# 17 2019     SUB5   177
# 18 2020     SUB5    45
# 19 2021     SUB5    72
# 20 2022     SUB5   181
This can also be done with data.table::melt (same syntax) or tidyr::pivot_longer just as well.
In SQL, a bit more work (and less-flexible):
sqldf::sqldf("
select year, 'SUB1' as variable, sum(SUB1) as value from table1 group by year
union all
select year, 'SUB2' as variable, sum(SUB2) as value from table1 group by year
union all
select year, 'SUB3' as variable, sum(SUB3) as value from table1 group by year
union all
select year, 'SUB4' as variable, sum(SUB4) as value from table1 group by year
union all
select year, 'SUB5' as variable, sum(SUB5) as value from table1 group by year
")
Data
table1 <- structure(list(YEAR = 2019:2022, SUB1 = c(19638L, 3223L, 8184L, 18017L), SUB2 = c(16854L, 2731L, 7034L, 15487L), SUB3 = c(1087L, 2278L, 5922L, 12989L), SUB4 = c(8598L, 1385L, 3629L, 7798L), SUB5 = c(177L, 45L, 72L, 181L)), class = "data.frame", row.names = c(NA, -4L))
(It just-so-happens that this sample data is the same as the output in the question's image ... the sum will be the same, I did not try to mimic an un-aggregated frame.)
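The UNION ALL approach is less flexible mainly because every column needs its own hand-written branch. As a rough sketch of one way around that, the branches can be generated from a list of column names. This uses Python's built-in sqlite3 module purely as a stand-in for sqldf (which runs on SQLite by default); the names sub_columns, branches and long_rows are made up for the illustration.

```python
import sqlite3

# Sample data copied from the answer's table1 (already aggregated per year).
rows = [
    (2019, 19638, 16854, 1087, 8598, 177),
    (2020, 3223, 2731, 2278, 1385, 45),
    (2021, 8184, 7034, 5922, 3629, 72),
    (2022, 18017, 15487, 12989, 7798, 181),
]
sub_columns = ["SUB1", "SUB2", "SUB3", "SUB4", "SUB5"]  # hypothetical helper list

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (YEAR INTEGER, SUB1 INTEGER, SUB2 INTEGER, "
    "SUB3 INTEGER, SUB4 INTEGER, SUB5 INTEGER)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?)", rows)

# One SELECT per column, glued together with UNION ALL.
# The column names come from a fixed list here, never from user input.
branches = [
    f"SELECT YEAR, '{col}' AS variable, SUM({col}) AS value "
    f"FROM table1 GROUP BY YEAR"
    for col in sub_columns
]
query = "\nUNION ALL\n".join(branches) + "\nORDER BY variable, YEAR"

long_rows = conn.execute(query).fetchall()
for row in long_rows:
    print(row)
```

The result has the same 20 rows as the melt output above, one per (YEAR, variable) pair.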

Related

Calculate running sum of previous 3 months from monthly aggregated data

I have a dataset that I have aggregated at monthly level. The next part needs me to take, for every block of 3 months, the sum of the data at monthly level.
So essentially my input data (after aggregated to monthly level) looks like:
month  year  status  count_id
08     2021  stat_1  1
09     2021  stat_1  3
10     2021  stat_1  5
11     2021  stat_1  10
12     2021  stat_1  10
01     2022  stat_1  5
02     2022  stat_1  20
and then my output data to look like:
month  year  status  count_id  3m_sum
08     2021  stat_1  1         1
09     2021  stat_1  3         4
10     2021  stat_1  5         8
11     2021  stat_1  10        18
12     2021  stat_1  10        25
01     2022  stat_1  5         25
02     2022  stat_1  20        35
i.e. 3m_sum for Feb = Feb + Jan + Dec. I tried to do this using a self join and wrote a query along the lines of:
WITH CTE AS(
SELECT date_part('month',date_col) as month
,date_part('year',date_col) as year
,status
,count(distinct id) as count_id
FROM (date_col, status, transaction_id) as a
)
SELECT a.month, a.year, a.status, sum(b.count_id) as 3m_sum
from cte as a
left join cte as b on a.status = b.status
and b.month >= a.month - 2 and b.month <= a.month
group by 1,2,3
This query NEARLY works. Where it falls apart is in Jan and Feb. My data runs from Aug 2021 to Apr 2022. That means the value for Jan should be Nov + Dec + Jan. Similarly, for Feb it should be Dec + Jan + Feb.
As I am doing a join on the MONTH, all the months of Aug - Nov are treated as being values > month of jan/feb and so the query isn't doing the correct sum.
How can I adjust this bit to give the correct sum?
I did think of using a LAG function, but (even though I'm 99% sure a month won't ever be missed), I can't guarantee we will never have a month with 0 values, and therefore my LAG function will be summing the wrong rows.
I also tried doing the same join at individual date level (without aggregating in my nested query), but this gave vastly different numbers. I want the sum of the monthly aggregates, and I think summing from the individual rows re-counted a lot of the duplicates that my COUNT DISTINCT is there to remove.
You can use a SUM with a window frame of 2 PRECEDING. To ensure you don't miss rows, use a calendar table and left-join all the results to it.
SELECT *,
SUM(a.count_id) OVER (ORDER BY c.year, c.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Calendar c
LEFT JOIN a ON a.year = c.year AND a.month = c.month
WHERE c.year >= 2021 AND c.year <= 2022;
db<>fiddle
You could also use LAG but you would need it twice.
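To illustrate why the calendar join matters, here is a small sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). It is only an approximation of the answer's query: the table names calendar and monthly, the columns yy and mm, and the deliberately missing October are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (yy INTEGER, mm INTEGER);
CREATE TABLE monthly (yy INTEGER, mm INTEGER, status TEXT, count_id INTEGER);
""")

# Calendar covering Aug 2021 .. Feb 2022, one row per month.
calendar = [(2021, m) for m in range(8, 13)] + [(2022, m) for m in range(1, 3)]
conn.executemany("INSERT INTO calendar VALUES (?, ?)", calendar)

# The question's monthly counts, with Oct 2021 removed to simulate a gap.
monthly = [
    (2021, 8, "stat_1", 1),
    (2021, 9, "stat_1", 3),
    (2021, 11, "stat_1", 10),
    (2021, 12, "stat_1", 10),
    (2022, 1, "stat_1", 5),
    (2022, 2, "stat_1", 20),
]
conn.executemany("INSERT INTO monthly VALUES (?, ?, ?, ?)", monthly)

query = """
SELECT c.yy, c.mm,
       COALESCE(m.count_id, 0) AS count_id,
       SUM(COALESCE(m.count_id, 0)) OVER (
           ORDER BY c.yy, c.mm
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS sum_3m
FROM calendar AS c
LEFT JOIN monthly AS m ON m.yy = c.yy AND m.mm = c.mm
ORDER BY c.yy, c.mm
"""
rolling = conn.execute(query).fetchall()
for row in rolling:
    print(row)
```

Because the calendar supplies a row for the missing October, the frame for November covers Sep + Oct + Nov (3 + 0 + 10) instead of reaching back to August. With more than one status, the window would presumably also need PARTITION BY status, and the calendar would have to be cross-joined with the list of statuses.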
It should be @Charlieface's answer. The only difference is that I get one result that differs from the expected result table you posted (October: 9, not 8):
WITH
-- your input - and I avoid keywords like "MONTH" or "YEAR"
-- and also identifiers starting with digits are forbidden -
indata(mm,yy,status,count_id,sum_3m) AS (
SELECT 08,2021,'stat_1',1,1
UNION ALL SELECT 09,2021,'stat_1',3,4
UNION ALL SELECT 10,2021,'stat_1',5,8
UNION ALL SELECT 11,2021,'stat_1',10,18
UNION ALL SELECT 12,2021,'stat_1',10,25
UNION ALL SELECT 01,2022,'stat_1',5,25
UNION ALL SELECT 02,2022,'stat_1',20,35
)
SELECT
*
, SUM(count_id) OVER(
ORDER BY yy,mm
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS sum_3m_calc
FROM indata;
-- out  mm |  yy  | status | count_id | sum_3m | sum_3m_calc
-- out ----+------+--------+----------+--------+-------------
-- out   8 | 2021 | stat_1 |        1 |      1 |           1
-- out   9 | 2021 | stat_1 |        3 |      4 |           4
-- out  10 | 2021 | stat_1 |        5 |      8 |           9
-- out  11 | 2021 | stat_1 |       10 |     18 |          18
-- out  12 | 2021 | stat_1 |       10 |     25 |          25
-- out   1 | 2022 | stat_1 |        5 |     25 |          25
-- out   2 | 2022 | stat_1 |       20 |     35 |          35

SQL Method to report next period value for this period

This may already be answered, but I can't figure out the correct search terms for what I need. We store values by Year / Period for the Beginning of Month (BOM). The BOM for one month is the same value as End of Month (EOM) for the previous month. I need a way to report this as such.
So 2018-02 BOM = 2018-01 EOM.
I thought I might be able to use something simple, but it does not account for the month/year wrap at 12 months as those fields are numerical.
select yr as YEAR, (pd-1) as PERIOD, sum(BOM) as EOM
from Table1
where type = '3'
group by yr, pd
order by yr desc, pd desc
This works for the middle months, but not for January, which becomes 2018-0 instead of 2017-12.
Example Data
Yr Pd Type BOM
18 02 3 100
18 02 3 100
18 02 2 200
18 02 2 100
18 01 3 100
18 01 3 100
18 01 2 200
18 01 2 100
18 01 3 100
18 01 2 300
17 12 3 100
17 12 3 200
17 12 2 300
17 12 3 200
17 12 2 100
17 11 3 300
17 11 2 400
17 11 3 400
17 11 2 100
So the results I am looking for would be:
Yr Pd EOM
18 01 200
17 12 300
17 11 500
17 10 700
I'm working in System iNavigator currently, but hoping to move this into an externally connected Excel query at some point.
Your DB2 database should support CASE WHEN, which can be used to derive both the year and the period from the period value: period 1 wraps back to period 12 of the previous year, and every other period just moves back by one.
For example:
select
CASE WHEN pd = 1 THEN yr - 1 ELSE yr END as Yr,
CASE WHEN pd = 1 THEN 12 ELSE pd - 1 END as Pd,
SUM(BOM) as EOM
from Table1
where type = '3'
group by yr, pd
order by yr desc, pd desc
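As a quick check of the wrap-around logic, here is a sketch that runs the same CASE expressions against the question's example data. It uses Python's built-in sqlite3 module rather than DB2, so it only approximates the real environment; the output aliases eom_yr and eom_pd are made up so that they cannot be confused with the source columns yr and pd.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (yr INTEGER, pd INTEGER, type TEXT, bom INTEGER)")

# Example data from the question: (yr, pd, type, BOM).
data = [
    (18, 2, "3", 100), (18, 2, "3", 100), (18, 2, "2", 200), (18, 2, "2", 100),
    (18, 1, "3", 100), (18, 1, "3", 100), (18, 1, "2", 200), (18, 1, "2", 100),
    (18, 1, "3", 100), (18, 1, "2", 300),
    (17, 12, "3", 100), (17, 12, "3", 200), (17, 12, "2", 300),
    (17, 12, "3", 200), (17, 12, "2", 100),
    (17, 11, "3", 300), (17, 11, "2", 400), (17, 11, "3", 400), (17, 11, "2", 100),
]
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?)", data)

query = """
SELECT CASE WHEN pd = 1 THEN yr - 1 ELSE yr END     AS eom_yr,  -- January wraps to the previous year
       CASE WHEN pd = 1 THEN 12     ELSE pd - 1 END AS eom_pd,  -- ... and to period 12
       SUM(bom) AS eom
FROM Table1
WHERE type = '3'
GROUP BY yr, pd
ORDER BY yr DESC, pd DESC
"""
eom_rows = conn.execute(query).fetchall()
for row in eom_rows:
    print(row)
```

On this data the rows come out as (18, 1, 200), (17, 12, 300), (17, 11, 500) and (17, 10, 700), which matches the result table in the question.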

SQL: Sum Only Certain Rows Depending on Start and End Date

I have two tables. The 1st table contains a unique identifier (UI). Each unique identifier has a column containing a start date (yyyy-mm-dd) and a column containing an end date (yyyy-mm-dd). The 2nd table contains the temperature for each day, with separate columns for the month, day, year and temperature. I would like to join those tables and get the compiled temperature for each unique identifier; however, I would like the compiled temperature to include only the days from the second table that fall between the start and end dates from the 1st table.
For example, if one record has a start date of 12/10/15 and an end date of 12/31/15, I would like to have a column containing the compiled temperature for the 10th-31st. If the next record runs from 12/3/15 to 12/17/15, I'd like the column next to it to show the compiled temperature for the 3rd-17th. I'll include the query I have so far, but it is not too helpful because I have not really gotten very far:
; with Temps as (
select MONTH, DAY, YEAR, Temp
from Temperatures
where MONTH = 12
and YEAR = 2016
)
Select UI, start_date, end_date, location, SUM(temp)
from Table1 t1
Inner join Temps
on temps.month = month(t1.start_date)
I appreciate any help you might be able to give. Let me know if I need to elaborate on anything.
Table 1
UI Start_Date End_Date
2080 12/5/2015 12/31/2015
1266 12/1/2015 12/31/2015
1787 12/17/2015 12/28/2015
1621 12/3/2015 12/20/2015
1974 12/10/2015 12/12/2015
1731 12/25/2015 12/31/2015
Table 2
Month Day Year Temp
12 1 2016 34
12 2 2016 32
12 3 2016 35
12 4 2016 37
12 5 2016 32
12 6 2016 30
12 7 2016 31
12 8 2016 36
12 9 2016 48
12 10 2016 42
12 11 2016 33
12 12 2016 41
12 13 2016 31
12 14 2016 29
12 15 2016 46
12 16 2016 48
12 17 2016 38
12 18 2016 29
12 19 2016 45
12 20 2016 37
12 21 2016 48
12 22 2016 46
12 23 2016 44
12 24 2016 45
12 25 2016 35
12 26 2016 44
12 27 2016 29
12 28 2016 38
12 29 2016 29
12 30 2016 35
12 31 2016 40
Table 3 (Expected Result)
UI Start_Date End_Date Compiled Temp
2080 12/5/2015 12/31/2015 1101
1266 12/1/2015 12/31/2015 1167
1787 12/17/2015 12/28/2015 478
1621 12/3/2015 12/20/2015 668
1974 12/10/2015 12/12/2015 126
1731 12/25/2015 12/31/2015 250
You could do something like this:
; WITH temps AS (
SELECT CONVERT(DATE, CONVERT(CHAR(4), [YEAR]) + '-' + CONVERT(CHAR(2), [MONTH]) + '-' + CONVERT(VARCHAR(2), [DAY])) [TDate], [Temp]
FROM Temperatures
WHERE [MONTH] = 12
AND [YEAR] = 2015
)
SELECT [UI], [start_date], [end_date]
, (SELECT SUM([temp])
FROM temps
WHERE [TDate] BETWEEN T1.[start_date] AND T1.[end_date]) [Compiled Temp]
FROM Table1 T1
No need for a join.
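The correlated-subquery idea can be tried out with Python's built-in sqlite3 module. This is only a sketch under a few assumptions: SQLite has no DATE type, so the dates are stored as ISO text (which compares correctly with BETWEEN), the day/month/year parts are glued together with printf instead of CONVERT, and the temperatures are stored under 2015 as in the answer above. Only four of the question's six ranges are loaded to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (UI INTEGER, start_date TEXT, end_date TEXT);
CREATE TABLE Temperatures (month INTEGER, day INTEGER, year INTEGER, temp INTEGER);
""")

# Date ranges from the question, rewritten as ISO strings.
ranges = [
    (1787, "2015-12-17", "2015-12-28"),
    (1621, "2015-12-03", "2015-12-20"),
    (1974, "2015-12-10", "2015-12-12"),
    (1731, "2015-12-25", "2015-12-31"),
]
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", ranges)

# Daily temperatures for 1-31 December, taken from the question's Table 2.
temps = [34, 32, 35, 37, 32, 30, 31, 36, 48, 42, 33, 41, 31, 29, 46, 48,
         38, 29, 45, 37, 48, 46, 44, 45, 35, 44, 29, 38, 29, 35, 40]
conn.executemany(
    "INSERT INTO Temperatures VALUES (12, ?, 2015, ?)",
    [(day, temp) for day, temp in enumerate(temps, start=1)],
)

# For each UI, sum the temperatures whose date falls inside its range.
query = """
SELECT t1.UI,
       (SELECT SUM(t.temp)
        FROM Temperatures AS t
        WHERE printf('%04d-%02d-%02d', t.year, t.month, t.day)
              BETWEEN t1.start_date AND t1.end_date) AS compiled_temp
FROM Table1 AS t1
"""
compiled = dict(conn.execute(query).fetchall())
print(compiled)
```

Note that on this data UI 1974 comes out as 116 (42 + 33 + 41), not the 126 listed in the question's expected table, so that expected value may contain a typo.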
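You can do a simple join of the two tables as well. You don't need to use a CTE. Note that matching on the separate year, month and day parts, as below, only works when each range starts and ends within the same month, as it does in this sample data.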
--TEST DATA
if object_id('Table1','U') is not null
drop table Table1
create table Table1 (UI int, Start_Date date, End_Date date)
insert Table1
values
(2080,'12/05/2015','12/31/2015'),
(1266,'12/01/2015','12/31/2015'),
(1787,'12/17/2015','12/28/2015'),
(1621,'12/03/2015','12/20/2015'),
(1974,'12/10/2015','12/12/2015'),
(1731,'12/25/2015','12/31/2015')
if object_id('Table2','U') is not null
drop table Table2
create table Table2 (Month int, Day int, Year int, Temp int)
insert Table2
values
(12,1, 2015,34),
(12,2, 2015,32),
(12,3, 2015,35),
(12,4, 2015,37),
(12,5, 2015,32),
(12,6, 2015,30),
(12,7, 2015,31),
(12,8, 2015,36),
(12,9, 2015,48),
(12,10,2015,42),
(12,11,2015,33),
(12,12,2015,41),
(12,13,2015,31),
(12,14,2015,29),
(12,15,2015,46),
(12,16,2015,48),
(12,17,2015,38),
(12,18,2015,29),
(12,19,2015,45),
(12,20,2015,37),
(12,21,2015,48),
(12,22,2015,46),
(12,23,2015,44),
(12,24,2015,45),
(12,25,2015,35),
(12,26,2015,44)
--AGGREGATE TEMPS
select t1.Start_Date, t1.End_Date, avg(t2.temp) AvgTemp, sum(t2.temp) CompiledTemps
from table1 t1
join table2 t2 ON t2.Year between datepart(year, t1.Start_Date) and datepart(year, t1.End_Date)
and t2.Month between datepart(month,t1.Start_Date) and datepart(month,t1.End_Date)
and t2.Day between datepart(day, t1.Start_Date) and datepart(day, t1.End_Date)
group by t1.Start_Date, t1.End_Date

return the last row that meets a condition in sql

I have two tables:
Meter
ID SerialNumber
=======================
1 ABC1
2 ABC2
3 ABC3
4 ABC4
5 ABC5
6 ABC6
RegisterLevelInformation
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
4 2 350 350 1 jan 2015 EST
5 2 850 500 1 feb 2015 ACT
6 2 1000 150 1 apr 2015 ACT
7 3 1500 1500 1 jan 2015 EST
8 3 2500 1000 1 mar 2015 EST
9 3 5000 2500 4 apr 2015 EST
10 4 250 250 1 jan 2015 EST
11 4 550 300 1 feb 2015 ACT
12 4 1000 450 1 apr 2015 EST
13 5 350 350 1 jan 2015 ACT
14 5 850 500 1 feb 2015 ACT
15 5 1000 150 1 apr 2015 ACT
16 6 1500 1500 1 jan 2015 EST
17 6 2500 1000 1 mar 2015 EST
18 6 5000 2500 4 apr 2015 EST
I am trying to group by meter serial and return the last actual read date for each of the meters but I am unsure as to how to accomplish this. Here is the sql I have thus far:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode, PreviousReadDate
order by a.SerialNumber
I can't seem to get the MAX function to take effect in returning only the latest actual reading row and it returns all dates and the same meter serial is displayed several times.
If I use the following sql:
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
then each serial is shown only once. Any help would be greatly appreciated.
As @PaulGriffin said in his comment, you need to remove the PreviousReadDate column from your GROUP BY clause.
Why are you experiencing this behaviour?
Your GROUP BY lists (SerialNumber, ReadTypeCode, PreviousReadDate), so the query returns one row for each distinct combination of those three values. Because PreviousReadDate is itself part of the grouping, every group contains only a single date, and MAX() of a single value is just that value. The output is therefore the same as it would be without MAX().
What you wanted to achieve
Get MAX value of PreviousReadDate for every pair of (SerialNumber,ReadTypeCode). So this is what your GROUP BY clause should include.
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
Is the correct SQL query for what you want.
Difference example
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
Here if you apply the query with grouping by 3 columns you would get result:
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 jan 2015 -- which is MAX of 1 value (1 jan 2015)
ABC1 | ACT | 1 feb 2015
ABC1 | EST | 1 apr 2015
But instead when you only group by SerialNumber,ReadTypeCode it would yield result (considering the sample data that I posted):
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 apr 2015 -- which is MAX of 2 values (1 jan 2015, 1 apr 2015)
ABC1 | ACT | 1 feb 2015 -- which is MAX of 1 value (because ReadTypeCode is different from the row above)
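The difference between the two groupings can be reproduced with a small sketch in Python's built-in sqlite3 module. A few things here are assumptions rather than the asker's actual schema: the dates are stored as ISO text so that MAX() orders them correctly, the join uses Meter.ID (as in the sample table) rather than a.MeterID, and only the first three meters are loaded. An inner JOIN is used because the WHERE filter on ReadType discards unmatched meters anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Meter (ID INTEGER, SerialNumber TEXT);
CREATE TABLE RegisterLevelInformation (
    ID INTEGER, MeterID INTEGER, PreviousReadDate TEXT, ReadType TEXT);
INSERT INTO Meter VALUES (1, 'ABC1'), (2, 'ABC2'), (3, 'ABC3');
INSERT INTO RegisterLevelInformation VALUES
    (1, 1, '2015-01-01', 'EST'), (2, 1, '2015-02-01', 'ACT'), (3, 1, '2015-04-01', 'EST'),
    (4, 2, '2015-01-01', 'EST'), (5, 2, '2015-02-01', 'ACT'), (6, 2, '2015-04-01', 'ACT'),
    (7, 3, '2015-01-01', 'EST'), (8, 3, '2015-03-01', 'EST'), (9, 3, '2015-04-04', 'EST');
""")

# Same query twice; only the GROUP BY column list changes.
template = """
SELECT a.SerialNumber, MAX(b.PreviousReadDate) AS last_actual
FROM Meter AS a
JOIN RegisterLevelInformation AS b ON a.ID = b.MeterID
WHERE b.ReadType = 'ACT'
GROUP BY {group_cols}
ORDER BY a.SerialNumber, last_actual
"""

# Grouping by the date as well: every ACT reading forms its own group.
too_fine = conn.execute(
    template.format(group_cols="a.SerialNumber, b.PreviousReadDate")
).fetchall()

# Grouping by the serial only: one row per meter with its latest ACT date.
per_meter = conn.execute(template.format(group_cols="a.SerialNumber")).fetchall()

print(too_fine)
print(per_meter)
```

With the date in the GROUP BY, ABC2 appears twice (once per ACT reading); without it, each meter that has an actual read appears once with its latest date. ABC3 has no ACT readings in this sample, so it is absent from both results.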
Explanation of your second query
In this query - you are right indeed - each serial is shown only once.
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
But this query would give you results you don't expect if you added more columns to the GROUP BY, which is what you did in your first query (try it yourself).
You need to remove PreviousReadDate from your Group By clause.
This is what your query should look like:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
To understand how the group by clause works when you mention multiple columns, follow this link: Using group by on multiple columns
You will understand what was wrong with your query and why it returns all dates and the same meter serial is displayed several times.
Good luck!
Kudos! :)

Oracle 11g - Unpivot

I have a table like this
Date Year Month Day Turn_1 Turn_2 Turn_3
28/08/2014 2014 08 28 Foo Bar Xab
And I would like to "rotate" it into something like this:
Date Year Month Day Turn Source
28/08/2014 2014 08 28 Foo Turn_1
28/08/2014 2014 08 28 Bar Turn_2
28/08/2014 2014 08 28 Xab Turn_3
I need the "Source" column because I need to join these results to another table that says:
Source Interval
Turn_1 08 - 18
Turn_2 11 - 20
Turn_3 18 - 24
For now I have used unpivot to rotate the table, but I don't know how to display the "Source" column (or whether it is possible):
select dt_date, df_year, df_month, df_turn
from my_rotatation_table
unpivot( df_turn
for x in(turn_1,
turn_2,
turn_3
))
SOLVED:
select dt_date, df_year, df_month, df_turn, df_source
from my_rotatation_table
unpivot( df_turn
for df_source in(turn_1 as 'Turn_1',
turn_2 as 'Turn_2',
turn_3 as 'Turn_3'
))
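For comparison, on an engine without UNPIVOT the same shape can be built with UNION ALL, and the resulting source column can then be joined to the interval table. The sketch below is not Oracle syntax: it runs on SQLite through Python's built-in sqlite3 module, and the names my_rotation_table, turn_intervals and turn_interval are hypothetical stand-ins for the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_rotation_table (dt_date TEXT, turn_1 TEXT, turn_2 TEXT, turn_3 TEXT);
INSERT INTO my_rotation_table VALUES ('2014-08-28', 'Foo', 'Bar', 'Xab');

CREATE TABLE turn_intervals (source TEXT, turn_interval TEXT);
INSERT INTO turn_intervals VALUES
    ('Turn_1', '08 - 18'), ('Turn_2', '11 - 20'), ('Turn_3', '18 - 24');
""")

query = """
WITH unpivoted AS (
    -- One branch per turn column; the literal records which column it came from.
    SELECT dt_date, turn_1 AS df_turn, 'Turn_1' AS df_source FROM my_rotation_table
    UNION ALL
    SELECT dt_date, turn_2, 'Turn_2' FROM my_rotation_table
    UNION ALL
    SELECT dt_date, turn_3, 'Turn_3' FROM my_rotation_table
)
SELECT u.dt_date, u.df_turn, u.df_source, i.turn_interval
FROM unpivoted AS u
JOIN turn_intervals AS i ON i.source = u.df_source
ORDER BY u.df_source
"""
joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

One behavioural difference to keep in mind: Oracle's UNPIVOT excludes rows with NULL values by default, whereas these UNION ALL branches keep them unless each branch filters them out.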
Use this query:
with t (Dat, Year, Month, Day, Turn_1, Turn_2, Turn_3) as (
select sysdate, 2014, 08, 28, 'Foo', 'Bar', 'Xab' from dual
)
select dat, year, month, day, turn, source from t
unpivot (
source for turn in (Turn_1, Turn_2, Turn_3)
)
DAT         YEAR  MONTH  DAY  TURN    SOURCE
----------------------------------------------
08/01/2014  2014  8      28   TURN_1  Foo
08/01/2014  2014  8      28   TURN_2  Bar
08/01/2014  2014  8      28   TURN_3  Xab