I am doing a cohort analysis for customers, where a cohort is a group of people who started using the product in the same month. I then keep track of each cohort's total use for every subsequent month up to the present.
For example, the first "cohort month" is January 2012, and for it I have "use months" Jan 12, Feb 12, Mar 12, ..., Mar 17 (the current month). One column is "cohort month", and another is "use month". This repeats for every subsequent cohort month. The table looks like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
...
Feb 17 | Feb 17
Feb 17 | Mar 17
Mar 17 | Mar 17
The problem arises because I want to forecast one year out for both existing and future cohorts.
That means for the Jan 12 cohort, I want predictions for Apr 17 to Mar 18.
I also want predictions for the Apr 17 cohort (which doesn't exist yet) from Apr 17 to Mar 18, and so on up to the Mar 18 cohort, which only has Mar 18.
I can handle the predictions, don't worry about that.
My issue is that I cannot figure out how to add this list of months (Apr 17 ... Mar 18) to the "use month" column at the end of every existing cohort, before the next cohort starts.
I also need to add the cohorts Apr 17 to Mar 18, each with the applicable part of this list (Apr 17 ... Mar 18).
So I want the table to look like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Jan 12 | Apr 17
...
Jan 12 | Mar 18
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
Feb 12 | Apr 17
...
Feb 12 | Mar 18
...
...
Feb 17 | Feb 17
Feb 17 | Mar 17
...
Feb 17 | Mar 18
Mar 17 | Mar 17
...
Mar 17 | Mar 18
I know the first solution to come to mind is to create a list of all months from Jan 12 to Mar 18, cross join it to itself, and then left outer join it to my current table (where cohort / use months range from Jan 12 to Mar 17). However, this is not scalable.
Is there a way I can just iteratively add in this list of the months of the next year?
I am using HP Vertica, but could use Presto or Hive if absolutely necessary.
I think you should use the query below to create a temporary table out of nothing, and join it with the rest of your query. SQL is set-based, so I'm afraid you can't add the rows iteratively in a procedural manner, and you won't be able to get away without a CROSS JOIN. But here, you limit the CROSS JOIN to the generation of the first-of-month pairs that you need.
Here goes:
WITH
-- create a list of integers from 0 to 100 using the TIMESERIES clause;
-- 100 months after Jan-2012 is May-2020, so raise the 100 for later dates
i(i) AS (
SELECT dt::DATE - '2000-01-01'::DATE
FROM (
SELECT '2000-01-01'::DATE + 0
UNION ALL SELECT '2000-01-01'::DATE + 100
) d(d)
TIMESERIES dt AS '1 day' OVER(ORDER BY d::TIMESTAMP)
)
,
-- limits are Jan-2012 to the first of the current month plus one year
month_limits(month_limit) AS (
SELECT '2012-01-01'::DATE
UNION ALL SELECT ADD_MONTHS(TRUNC(CURRENT_DATE,'MONTH'),12)
)
-- create the list of possible months as a CROSS JOIN of the i table
-- containing the integers and the month_limits table, using ADD_MONTHS()
-- and the smallest and greatest month of the month limits
,month_list AS (
SELECT
ADD_MONTHS(MIN(month_limit),i) AS month_first
FROM month_limits CROSS JOIN i
GROUP BY i
HAVING ADD_MONTHS(MIN(month_limit),i) <= (
SELECT MAX(month_limit) FROM month_limits
)
)
-- finally, CROSS JOIN the obtained month list with itself with the
-- filters needed.
SELECT
cohort.month_first AS cohort_month
, use.month_first AS use_month
FROM month_list AS cohort
CROSS JOIN month_list AS use
WHERE use.month_first >= cohort.month_first
ORDER BY 1,2
;
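To sanity-check the shape of the result outside the database, the same month pairs can be generated in a few lines of Python. This is only a sketch of the logic, not a replacement for the query: the helper names `add_months` and `cohort_use_pairs` are made up for illustration, and "today" is passed in explicitly (mid-March 2017, as in the question) instead of being read from the clock.

```python
from datetime import date

def add_months(d, n):
    """Return the first day of the month that is n months after d's month."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def cohort_use_pairs(first_cohort, today, horizon_months=12):
    """List (cohort_month, use_month) pairs with use_month >= cohort_month."""
    # Upper limit: first of the current month plus the forecast horizon.
    last = add_months(today, horizon_months)
    months = []
    m = add_months(first_cohort, 0)
    while m <= last:
        months.append(m)
        m = add_months(m, 1)
    # Same filter as the WHERE clause of the final self cross join.
    return [(c, u) for c in months for u in months if u >= c]

# Assumed inputs: first cohort Jan 2012, "today" somewhere in March 2017.
pairs = cohort_use_pairs(date(2012, 1, 1), today=date(2017, 3, 15))
print(pairs[0], pairs[-1], len(pairs))
```

With n months in the list, the filtered cross join yields n * (n + 1) / 2 rows, which is a quick way to check that the SQL result has the expected size.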
I'm trying to run the following query, where I build a table grouped by year, month and site, and then pivot the sites into columns:
SELECT * FROM
(
SELECT
DECODE(GROUPING(TO_CHAR(TM.TIMESTAMP,'YYYY'))
,0, TO_CHAR(TM.TIMESTAMP,'YYYY')
,1, 'TOTAL') AS "YEAR",
DECODE(GROUPING(TO_CHAR(TM.TIMESTAMP,'MM'))
,0, TO_CHAR(TM.TIMESTAMP,'MM')
,1, 'TOTAL') AS "MONTH",
DECODE(GROUPING(TS.CODIGO5)
,0, TS.CODIGO5
,1, 'TOTAL') AS BU,
SUM(TM.KWHGEN) AS GEN
FROM T_MEDIDAS_CO TM
JOIN T_Sede TS ON TM.id_sede=TS.id_sede
WHERE TO_CHAR(TM.TIMESTAMP,'YYYY') IN (2015,2014)
AND TS.CODIGO5 IN ('FINSI', 'FINOC')
GROUP BY CUBE (TO_CHAR(TM.TIMESTAMP,'YYYY'), TO_CHAR(TM.TIMESTAMP,'MM'), TS.CODIGO5)
ORDER BY TO_CHAR(TM.TIMESTAMP,'YYYY') DESC, TO_CHAR(TM.TIMESTAMP,'MM') DESC, 3
)
PIVOT
(
SUM(GEN)
FOR BU IN ('FINCI' AS FINCI,'FINSI' AS FINSI, 'FINOC' AS FINOC, 'TOTAL' AS TOTAL)
)
ORDER BY "YEAR" DESC, "MONTH" DESC
to obtain this result
YEAR MONTH FINCI FINOC TOTAL
2015 12 110376,17 109991,55 220367,72
2015 11 92032,56 97938,09 189970,65
2015 10 77668,67 79273,98 156942,65
2015 09 87079,46 91203,73 178283,19
2015 08 99992,38 100220,24 200212,62
2015 07 142430 133979,74 276409,74
2015 06 107006,73 104320,96 211327,69
2015 05 86264 90985,62 177249,62
2015 04 85838,41 87147,74 172986,15
2015 03 106178,39 106342,4 212520,79
2015 02 125007,65 122790,76 247798,41
2015 01 134934,67 135897,7 270832,37
2015 TOTAL 1254809,09 1260092,51 2514901,6
2014 12 121185,25 122014,9 243200,15
2014 11 94682,9 94221,47 188904,37
2014 10 87212,59 92222,92 179435,51
2014 09 97306,19 100701,93 198008,12
2014 08 97738,26 101901,88 199640,14
2014 07 113242,07 117496,84 230738,91
2014 06 98234,69 98092,2 196326,89
2014 05 91202,74 102214,94 193417,68
2014 04 88517,65 103756,83 192274,48
2014 03 107541,53 119236,48 226778,01
2014 02 127880,75 131451,38 259332,13
2014 01 141381,35 143836,44 285217,79
2014 TOTAL 1266125,97 1327148,21 2593274,18
TOTAL 12 231561,42 232006,45 463567,87
TOTAL 11 186715,46 192159,56 378875,02
TOTAL 10 164881,26 171496,9 336378,16
TOTAL 09 184385,65 191905,66 376291,31
TOTAL 08 197730,64 202122,12 399852,76
TOTAL 07 255672,07 251476,58 507148,65
TOTAL 06 205241,42 202413,16 407654,58
TOTAL 05 177466,74 193200,56 370667,3
TOTAL 04 174356,06 190904,57 365260,63
TOTAL 03 213719,92 225578,88 439298,8
TOTAL 02 252888,4 254242,14 507130,54
TOTAL 01 276316,02 279734,14 556050,16
TOTAL TOTAL 2520935,06 2587240,72 5108175,78
But I don't need the TOTAL | MONTH rows (the rows where YEAR is TOTAL and MONTH is a specific month). How can I remove them?
Thanks a lot
I have a table that looks like this:
column 1 = timestamp : string, column 2 = numOfentites : int
Please note I am using HiveQL.
Fri, 10 Aug 2001 274
Fri, 10 Dec 1999 39
Fri, 10 Mar 2000 107
Fri, 10 May 2002 26
Fri, 10 Nov 2000 351
Fri, 10 Sep 1999 22
Fri, 11 Aug 2000 189
Fri, 11 Dec 1998 1
Fri, 11 Feb 2000 84
Fri, 11 Jan 2002 580
Fri, 11 Jun 1999 12
Fri, 11 May 2001 571
Fri, 12 Apr 2002 41
Now, I retrieved the frequency per year from this table and found that some year XXXX had the largest number of entities.
My aim now is to go one level deeper and extract the frequency per month for the year XXXX.
I tried using the GROUP BY clause on the substring indicating the month, but it doesn't work.
Can you please give me a direction on how to proceed?
I just need a hint, not the answer :P I'm trying to learn HiveQL here.
EDIT
Here is the query that I used to extract the frequency of entities on a yearly basis.
Note that timestamp is the first column of the input.
select dates , count(dates) as numEmails
from (select split(timestamp," ")[3] as dates , count(timestamp)
from dataset
group by timestamp
) mailfreq
group by dates
order by numEmails desc;
I know that HiveQL has strange limitations, but won't this work?
select split(timestamp," ")[3] as yr, split(timestamp," ")[2] as mon, count(timestamp)
from dataset
group by split(timestamp," ")[3], split(timestamp," ")[2];
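To see why indexes 3 and 2 pick out the year and the month, it can help to replay the split on a few of the sample rows. The Python below is only an illustration of that logic, assuming the timestamp strings are separated by single spaces exactly as shown in the question; `year_month` and `target_year` are names invented for this sketch.

```python
from collections import Counter

# A few (timestamp, numOfentites) rows copied from the question.
rows = [
    ("Fri, 10 Aug 2001", 274),
    ("Fri, 10 Dec 1999", 39),
    ("Fri, 10 Mar 2000", 107),
    ("Fri, 10 Nov 2000", 351),
    ("Fri, 11 Aug 2000", 189),
    ("Fri, 11 Feb 2000", 84),
]

def year_month(ts):
    """Split on single spaces: ['Fri,', '10', 'Aug', '2001'] -> (year, month)."""
    parts = ts.split(" ")
    return parts[3], parts[2]

# Equivalent of GROUP BY year, month with COUNT(timestamp).
counts = Counter(year_month(ts) for ts, _ in rows)

# Restrict to one year, here an arbitrary example value.
target_year = "2000"
per_month = {mon: n for (yr, mon), n in counts.items() if yr == target_year}
print(per_month)
```

Note that this counts rows, as the query above does. If the goal is the total of the second column per month, summing that column would be the analogous aggregate.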
I have tried any number of ways to solve this problem, but unfortunately I got stuck every time.
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume the current month is March 2014 and we need the previous 12 months of data (i.e., from Mar 2013 to Feb 2014). All remaining month values need to be zero; id and year stay unchanged.
Expected result:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very happy if anybody could solve this.
Note:
I got stuck trying to pull the column names dynamically.
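You can try this (the CAST lets it work whether year is stored as a number or as text):
select id, year,
       case when DATEDIFF(month, CONVERT(datetime, CAST(year AS varchar(4)) + '0101'), GETDATE()) between 1 and 12 then jan else 0 end as jan,
       case when DATEDIFF(month, CONVERT(datetime, CAST(year AS varchar(4)) + '0201'), GETDATE()) between 1 and 12 then feb else 0 end as feb, ....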
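The rule behind the task, keep a month's value only if that month lies 1 to 12 months before the current month, can be checked against the sample rows outside the database. The Python below is a minimal sketch of that rule only; `months_between` mimics a month-boundary difference, and `mask_row` and the fixed `today` value are assumptions made for this example.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def months_between(start, end):
    """Number of month boundaries crossed going from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def mask_row(row, today):
    """Zero every month column that is not 1..12 months before today's month."""
    out = {"id": row["id"], "year": row["year"]}
    for idx, name in enumerate(MONTHS, start=1):
        age = months_between(date(row["year"], idx, 1), today)
        out[name] = row[name] if 1 <= age <= 12 else 0
    return out

# The two source rows from the question (same month values for both years).
values = [5, 6, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23]
source = [
    {"id": 1234, "year": 2014, **dict(zip(MONTHS, values))},
    {"id": 1234, "year": 2013, **dict(zip(MONTHS, values))},
]

# "Today" is pinned to March 2014, as in the task description.
masked = [mask_row(r, today=date(2014, 3, 1)) for r in source]
for r in masked:
    print(r)
```

For the 2014 row only jan and feb survive, and for the 2013 row everything from mar onwards survives, which matches the expected table in the question.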
The title may be confusing, but what I am trying to do is easy to understand with an example. I have a table like this:
Code Date_ Ratio
73245 Jan 1 1975 12:00AM 10
73245 Apr 18 2006 12:00AM 4
73245 Dec 26 2007 12:00AM 10
73245 Jan 30 2009 12:00AM 4
73245 Apr 21 2011 12:00AM 2
Basically, for each security it gives a ratio together with the date on which that ratio becomes effective. The table would be much easier to use if, instead of only a start date, each row had a start date and an end date, like the following:
Code StartDate_ EndDate_ Ratio
73245 Jan 1 1975 12:00AM Apr 18 2006 12:00AM 10
73245 Apr 18 2006 12:00AM Dec 26 2007 12:00AM 4
73245 Dec 26 2007 12:00AM Jan 30 2009 12:00AM 10
73245 Jan 30 2009 12:00AM Apr 21 2011 12:00AM 4
73245 Apr 21 2011 12:00AM Dec 31 2049 12:00AM (or some arbitrary date in the far future) 2
How do I transform the original table into the table I want using SQL statements? I have little experience with SQL and could not figure out how.
Please help! Thanks!
In SQL Server 2012:
SELECT code,
date_ AS startDate,
LEAD(date_) OVER (PARTITION BY code ORDER BY date_) AS endDate,
ratio
FROM mytable
In SQL Server 2005 and 2008:
WITH q AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY code ORDER BY date_) AS rn
FROM mytable
)
SELECT q1.code, q1.date_ AS startDate, q2.date_ AS endDate, q1.ratio
FROM q q1
LEFT JOIN
q q2
ON q2.code = q1.code
AND q2.rn = q1.rn + 1
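Both queries pair each row with the next row of the same code, ordered by date, and leave the end date NULL for the last row of each code. As a rough illustration of that pairing, here is a small Python sketch run on the sample data from the question. The function name `add_end_dates` and the `FAR_FUTURE` default are assumptions for this example; in SQL the same default could be supplied by wrapping the end date in ISNULL.

```python
from datetime import date

# Stand-in for "some date in the far future", as suggested in the question.
FAR_FUTURE = date(2049, 12, 31)

def add_end_dates(rows, default_end=FAR_FUTURE):
    """Turn (code, start, ratio) rows into (code, start, end, ratio) rows."""
    # Same ordering as PARTITION BY code ORDER BY date_.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    result = []
    for i, (code, start, ratio) in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        # The next row only counts if it belongs to the same code.
        end = nxt[1] if nxt is not None and nxt[0] == code else default_end
        result.append((code, start, end, ratio))
    return result

rows = [
    (73245, date(1975, 1, 1), 10),
    (73245, date(2006, 4, 18), 4),
    (73245, date(2007, 12, 26), 10),
    (73245, date(2009, 1, 30), 4),
    (73245, date(2011, 4, 21), 2),
]

ranges = add_end_dates(rows)
for r in ranges:
    print(r)
```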
Maybe it would also be possible to use OUTER APPLY, something like:
SELECT t1o.Code, t1o.Date_ AS StartDate_, ISNULL(t2.EndDate_, CAST('20491231' AS DATETIME)) AS EndDate_, t1o.Ratio
FROM t1 AS t1o
OUTER APPLY
(
SELECT TOP 1 Date_ AS EndDate_
FROM t1
WHERE t1.Code = t1o.Code AND t1.Date_ > t1o.Date_
ORDER BY t1.Date_ ASC
) AS t2