Best way to store aggregated values - sql

We need to store aggregated values for different accounts which summarise various numbers on a month/year basis. These numbers would be refreshed each time the underlying data is updated (usually once or twice every 24 hours).
I'm expecting the data to be the results of PIVOT functions e.g.:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
Each account will need different aggregates e.g. "Count Of Customers", "Count Of Orders" and "Value Of Sales" and I'm not sure whether it would be best to add a key to the data or use separate tables e.g.:
Year Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 CntOrders 0 0 0 0 0 0 95 33 34 24 36 52
2011 CntCust 0 0 0 0 0 0 95 33 34 24 36 52
2011 ValOrders 0 0 0 0 0 0 95 33 34 24 36 52
Or
dbo.CountOfOrders
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
dbo.ValueOfOrders
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
I've read a number of posts suggesting both NoSQL and SQL Server so I'm not sure which way we should go or how to decide.
We can't justify a dedicated cube at the moment but I'm wondering if it would be better to store the values in a NoSQL database or whether we should stick with SQL Server?

I'd stick with SQL. If you are worried about the time it takes to rebuild such a PIVOT table, don't: you don't necessarily have to build the table with a unique key.
Build it with key + process datetime and just append each run to the main pivot table. Each incremental load is then bounded by your transaction timestamps (begin and end). There shouldn't be much bloat. If there is, you can collapse the process dates in a weekend job.

Set up a job to run stored procedures that insert data into tables.
Store the data like Account,Year,Month,Value
Use views of these tables for reporting multiple aggregates.
Definitely stick with SQL. There is no reason to add technical overhead for such a simple task.
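A minimal sketch of the Account, Year, Month, Value layout with a pivoting view on top, using Python's built-in sqlite3 so it runs anywhere. The table name MonthlyAggregate, the view name YearlyPivot, the account ACC1 and the extra Metric column are assumptions for illustration, not something from the answers above; on SQL Server the refresh would more likely be a MERGE or a delete-and-insert inside the stored procedure.

```python
import sqlite3

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_store():
    """Create the narrow table plus a pivoting view (all names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE MonthlyAggregate (
            Account TEXT    NOT NULL,
            Metric  TEXT    NOT NULL,  -- assumed extra key, e.g. CntOrders
            Year    INTEGER NOT NULL,
            Month   INTEGER NOT NULL CHECK (Month BETWEEN 1 AND 12),
            Value   REAL    NOT NULL,
            PRIMARY KEY (Account, Metric, Year, Month)
        )""")
    # One conditional SUM per month turns the narrow rows into Jan..Dec columns.
    pivot_cols = ", ".join(
        f"SUM(CASE WHEN Month = {i} THEN Value ELSE 0 END) AS {name}"
        for i, name in enumerate(MONTHS, start=1))
    conn.execute(f"""
        CREATE VIEW YearlyPivot AS
        SELECT Account, Metric, Year, {pivot_cols}
        FROM MonthlyAggregate
        GROUP BY Account, Metric, Year""")
    return conn


def upsert(conn, rows):
    """Refresh aggregates: rows with an existing key are replaced, not duplicated."""
    conn.executemany(
        "INSERT OR REPLACE INTO MonthlyAggregate "
        "(Account, Metric, Year, Month, Value) VALUES (?, ?, ?, ?, ?)", rows)


conn = build_store()
orders_2011 = [("ACC1", "CntOrders", 2011, month, value)
               for month, value in zip(range(7, 13), [95, 33, 34, 24, 36, 52])]
upsert(conn, orders_2011)
upsert(conn, orders_2011)  # a second refresh leaves the row count unchanged
pivot = conn.execute("SELECT * FROM YearlyPivot").fetchall()
print(pivot)
```

Adding a new aggregate such as "Value Of Sales" is then just another Metric value rather than another table, and the view keeps the Jan to Dec shape for reporting.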

Related

Identify if date is the last date for any given group?

I have a table that is structured like the below - this contains details about all customer subscriptions and when they start/end.
SubKey  CustomerID  Status  StartDate    EndDate
29333   102         7       01 jan 2013  1 Jan 2014
29334   102         6       7 Jun 2013   15 Jun 2022
29335   144         6       10 jun 2021  17 jun 2022
29336   144         2       8 oct 2023   10 oct 2025
I am trying to add an indicator flag to this table (either "Yes" or "No") which shows, for each row, whether the [EndDate] of that SubKey is the last one for that CustomerID. So for the above example:
SubKey  CustomerID  Status  StartDate    EndDate      IsLast
29333   102         7       01 jan 2013  1 Jan 2014   No
29334   102         6       7 Jun 2013   15 Jun 2022  Yes
29335   144         6       10 jun 2021  17 jun 2022  Yes
29336   144         2       8 oct 2023   10 oct 2025  Yes
The flag is set to No for the first row, because on 1 Jan 2014, customerID 102 had another SubKey (29334) still active at the time (which didn't end until 15 jun 2022)
The rest of the rows are set to "Yes" because these were the last active subscriptions per CustomerID.
I have been reading about the LAG function which may be able to help. I am just not sure how to make it fit in this scenario.
Probably the easiest method would be to use EXISTS with a correlated subquery. Can you try the following? It flags a row as 'No' only when another row for the same customer overlaps it and ends later, so rows without such an overlap come out as 'Yes':
select *,
  case when exists (
    select * from t t2
    where t2.customerId = t.customerId
      and t2.enddate > t.enddate
      and t2.startDate < t.Enddate
  ) then 'No' else 'Yes' end as IsLast
from t;
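To check the query against the sample rows, here is a small sketch that runs the same EXISTS logic through Python's sqlite3 module. It assumes the dates are stored as ISO strings (so plain comparison orders them correctly) and uses the hypothetical aliases cur and other; on SQL Server the columns would be real date types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (SubKey INTEGER, CustomerID INTEGER, Status INTEGER, "
    "StartDate TEXT, EndDate TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [
    (29333, 102, 7, "2013-01-01", "2014-01-01"),
    (29334, 102, 6, "2013-06-07", "2022-06-15"),
    (29335, 144, 6, "2021-06-10", "2022-06-17"),
    (29336, 144, 2, "2023-10-08", "2025-10-10"),
])

# A row is "No" when the same customer has another subscription that was
# already running at this row's EndDate and that ends later.
flags = conn.execute("""
    SELECT cur.SubKey,
           CASE WHEN EXISTS (
               SELECT 1 FROM t AS other
               WHERE other.CustomerID = cur.CustomerID
                 AND other.EndDate    > cur.EndDate
                 AND other.StartDate  < cur.EndDate
           ) THEN 'No' ELSE 'Yes' END AS IsLast
    FROM t AS cur
    ORDER BY cur.SubKey""").fetchall()
print(flags)
```

Subscription 29335 stays 'Yes' even though 29336 ends later, because 29336 had not started yet when 29335 ended.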

How to prepare month by month time series data in R from aggregated data containing part wise sales vs month?

I have a data frame which consists of month-wise sales data for many parts.
For example:
Partno Month Qty
Part 1 June 2019 20
Part 1 July 2019 25
Part 1 Sep 2019 30
Part 2 Mar 2019 45
Part 3 Aug 2019 40
Part 3 Nov 2019 21
I want to convert this data into a month-by-month time series, which makes time series forecasting easier once I turn it into a ts object:
Month Part1 Part 2 Part 3
Jan 0 0 0
Feb 0 0 0
Mar 0 45 0
Apr 0 0 0
May 0 0 0
June 20 0 0
July 25 0 0
Aug 0 0 0
Sept 0 30 0
Oct 0 0 20
Nov 0 0 21
Dec 0 0 0
I am quite baffled as to how this can be carried out in R. Any solutions would be very useful, as I plan to build some forecasting models in R.
Looking forward to hearing from you all!
Assume the data DF shown reproducibly in the Note at the end.
First convert DF to zoo splitting it by the first column and converting the Month column to yearmon class. Then convert that to ts class, extend it to Jan to Dec, and set any NAs to 0. (If you don't need the 0 months at the beginning and end omit the yrs and window lines.)
library(zoo)
z <- read.zoo(DF, split = 1, index = 2, FUN = as.yearmon, format = "%b %Y")
tt <- as.ts(z)
yrs <- as.integer(range(time(tt))) # start and end years
tt <- window(tt, start = yrs[1], end = yrs[2] + 11/12, extend = TRUE)
tt[is.na(tt)] <- 0
tt
giving:
Part 1 Part 2 Part 3
Jan 2019 0 0 0
Feb 2019 0 0 0
Mar 2019 0 45 0
Apr 2019 0 20 0
May 2019 0 0 0
Jun 2019 20 0 0
Jul 2019 25 0 0
Aug 2019 0 0 0
Sep 2019 30 0 0
Oct 2019 0 0 20
Nov 2019 0 0 21
Dec 2019 0 0 0
Note
Lines <- "Partno, Month, Qty
Part 1, Jun 2019, 20
Part 1, Jul 2019, 25
Part 1, Sep 2019, 30
Part 2, Mar 2019, 45
Part 2, Apr 2019, 20
Part 3, Oct 2019, 20
Part 3, Nov 2019, 21"
DF <- read.csv(text = Lines, strip.white = TRUE)
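For readers who want to see the reshaping step spelled out without zoo, the same idea (one row per calendar month, one column per part, missing months filled with 0) can be sketched in plain Python. This is only an illustration of the logic, not a replacement for the R code above; the function name monthly_grid is hypothetical, and parsing "%b %Y" assumes English month abbreviations in the current locale.

```python
from datetime import datetime

rows = [
    ("Part 1", "Jun 2019", 20), ("Part 1", "Jul 2019", 25),
    ("Part 1", "Sep 2019", 30), ("Part 2", "Mar 2019", 45),
    ("Part 2", "Apr 2019", 20), ("Part 3", "Oct 2019", 20),
    ("Part 3", "Nov 2019", 21),
]


def monthly_grid(rows):
    """Return (parts, grid) where grid[(year, month)][part] is the quantity."""
    parsed = [(part, datetime.strptime(month, "%b %Y"), qty)
              for part, month, qty in rows]
    parts = sorted({part for part, _, _ in parsed})
    first = min(d.year for _, d, _ in parsed)
    last = max(d.year for _, d, _ in parsed)
    # Start from a full Jan..Dec grid of zeros for every year in the data.
    grid = {(year, month): dict.fromkeys(parts, 0)
            for year in range(first, last + 1)
            for month in range(1, 13)}
    for part, d, qty in parsed:
        grid[(d.year, d.month)][part] += qty
    return parts, grid


parts, grid = monthly_grid(rows)
for (year, month), values in sorted(grid.items()):
    print(year, month, [values[p] for p in parts])
```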

Calculation of values that rely on a date variable

I am trying to calculate the value of the last measurement taken (according to the date column) divided by the lowest value recorded (according to the measurement column) if two values in the “SUBJECT” column match and two values in the “PROCEDURE” column match. The calculation would be produced in a new column. I am having trouble with this and I would appreciate a solution.
data Have;
input Subject Type :$12. Date &:anydtdte. Procedure :$12. Measurement;
format date yymmdd10.;
datalines;
500 Initial 15 AUG 2017 Invasive 20
500 Initial 15 AUG 2017 Surface 35
500 Followup 15 AUG 2018 Invasive 54
428 Followup 15 AUG 2018 Outer 29
765 Seventh 3 AUG 2018 Other 13
500 Followup 3 JUL 2018 Surface 98
428 Initial 3 JUL 2017 Outer 10
765 Initial 20 JUL 2019 Other 19
610 Third 20 AUG 2019 Invasive 66
610 Initial 17 Mar 2018 Invasive 17
;
*Intended output table
Subject Type Date Procedure Measurement Output
500 Initial 15 AUG 2017 Invasive 20 20/20
500 Initial 15 AUG 2017 Surface 35 35/35
500 Followup 15 AUG 2018 Invasive 54 54/20
428 Followup 15 AUG 2018 Outer 29 29/10
765 Seventh 3 AUG 2018 Other 13 13/19
500 Followup 3 JUL 2018 surface 98 98/35
428 Initial 3 JUL 2017 Outer 10 10/10
765 Initial 20 JUL 2019 Other 19 19/19
610 Third 20 AUG 2019 Invasive 66 66/17
610 Initial 17 Mar 2018 Invasive 17 17/17 ;
*Attempt;
PROC SQL;
create table want as
select a.*,
(select measurement as measurement_last_date
from have
where subject = a.subject and type = a.type
having date = max(date)) / min(a.measurement) as ratio
from have as a
group by subject, type
order by subject, type, date;
QUIT;
I think you need to use the RETAIN statement in a DATA step.
RETAIN keeps a variable's value from one row to the next, so you can compare the previous row with the row currently being processed.
Tutorials on the RETAIN statement and the SAS documentation explain how to use it.

group by a substring in field

I have a table which looks like this:
column 1 = timestamp : string, column 2 = numOfentites : int
Please note I am using HiveQL.
Fri, 10 Aug 2001 274
Fri, 10 Dec 1999 39
Fri, 10 Mar 2000 107
Fri, 10 May 2002 26
Fri, 10 Nov 2000 351
Fri, 10 Sep 1999 22
Fri, 11 Aug 2000 189
Fri, 11 Dec 1998 1
Fri, 11 Feb 2000 84
Fri, 11 Jan 2002 580
Fri, 11 Jun 1999 12
Fri, 11 May 2001 571
Fri, 12 Apr 2002 41
Now, I retrieved the frequency per year from this table and found out some year XXXX had the most number of entities.
My aim now is to go one level deep and extract the frequency per month for the year XXXX.
I tried using the GROUP BY clause on the substring indicating the month, but it doesn’t work.
Can you please give me a direction on how to proceed?
I just need a hint, not the answer :P I'm trying to learn HiveQL here.
EDIT
Here is the query that I used to extract the frequency of entities on a yearly basis.
Note that timestamp is the first column of the input.
select dates , count(dates) as numEmails
from (select split(timestamp," ")[3] as dates , count(timestamp)
from dataset
group by timestamp
) mailfreq
group by dates
order by numEmails desc;
I know that HiveQL has strange limitations, but won't this work?
select split(timestamp," ")[3] as yr, split(timestamp," ")[2] as mon, count(timestamp)
from dataset
group by split(timestamp," ")[3], split(timestamp," ")[2];
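The grouping logic can be checked outside Hive with a few lines of Python. This sketch mirrors the split indexes used above (element 2 is the month, element 3 is the year) on a handful of the sample timestamps and counts rows per month for one chosen year; the function name month_counts is hypothetical.

```python
from collections import Counter

timestamps = [
    "Fri, 10 Aug 2001", "Fri, 10 Dec 1999", "Fri, 10 Mar 2000",
    "Fri, 10 Nov 2000", "Fri, 11 Aug 2000", "Fri, 11 Feb 2000",
]


def month_counts(timestamps, year):
    """Count rows per month name for one year, like GROUP BY on the split parts."""
    counts = Counter()
    for ts in timestamps:
        parts = ts.split(" ")  # ["Fri,", "10", "Mar", "2000"]
        if parts[3] == year:
            counts[parts[2]] += 1
    return counts


per_month = month_counts(timestamps, "2000")
print(per_month)
```

In HiveQL the year filter would be a WHERE clause on the same split expression, applied before the GROUP BY.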

Trying to pull the required rows from the single table with applying conditional statements on columns in sql server?

I have tried any number of ways to solve this problem, but unfortunately I got stuck every time.
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume that we are currently in March 2014 and we need the previous 12 months of data (i.e., from Mar 2013 to Feb 2014); every other month value needs to be zero, while year and id are kept as they are.
Solution:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very happy if anybody can solve this.
Note:
I got stuck trying to pull the column names dynamically.
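You can try this. Each month column gets its own CASE; the column's month is passed as the start date so that DATEDIFF counts how many months back it lies from today, and the yyyymmdd string assumes year is stored as an integer:
select id, year,
  case when datediff(month, convert(datetime, cast(year as varchar(4)) + '0101'), getdate()) between 1 and 12 then jan else 0 end as jan,
  case when datediff(month, convert(datetime, cast(year as varchar(4)) + '0201'), getdate()) between 1 and 12 then feb else 0 end as feb ....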
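The window arithmetic is easy to get off by one, so here is a small Python sketch of the same rule applied to the sample rows with "today" fixed in March 2014. The function name mask_to_trailing_year and the dict-based row shape are assumptions for illustration; the month-difference calculation plays the role of DATEDIFF(month, ...).

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def mask_to_trailing_year(rows, today):
    """Zero every month value outside the 12 full months before today's month."""
    current = today.year * 12 + today.month
    masked_rows = []
    for row in rows:
        masked = dict(row)
        for number, name in enumerate(MONTHS, start=1):
            months_back = current - (row["year"] * 12 + number)
            if not 1 <= months_back <= 12:
                masked[name] = 0
        masked_rows.append(masked)
    return masked_rows


values = [5, 6, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23]
source = [dict(id=1234, year=year, **dict(zip(MONTHS, values)))
          for year in (2014, 2013)]
result = mask_to_trailing_year(source, date(2014, 3, 15))
for row in result:
    print(row["id"], row["year"], [row[m] for m in MONTHS])
```

With the March 2014 date, the 2014 row keeps only jan and feb and the 2013 row keeps mar through dec, which matches the expected solution in the question.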