How to add in missing dates as rows in table - sql

I have the following query, which returns how many rows were written on each day that had any activity.
SELECT
ingestion_time,
COUNT(ingestion_time) AS Rows_Written,
FROM
`workday.ingestions`
GROUP BY
ingestion_time
ORDER BY
ingestion_time
Which will give me something that looks like the following:
Ingestion_Time | Rows_Written
Jan 2, 2021 | 8
Jan 5, 2021 | 5
Jan 8, 2021 | 9
Jan 9, 2021 | 2
However, I want to add in the missing dates so that the table looks like this instead:
Ingestion_Time | Rows_Written
Jan 2, 2021 | 8
Jan 3, 2021 | 0
Jan 4, 2021 | 0
Jan 5, 2021 | 5
Jan 6, 2021 | 0
Jan 7, 2021 | 0
Jan 8, 2021 | 9
Jan 9, 2021 | 2
How can I go about doing this? Do I need to create a whole table with all dates and join it somehow, or is there another way? Thanks in advance.

Consider the approach below:
select date(Ingestion_Time) Ingestion_Time, Rows_Written
from your_current_query union all
select day, 0 from (
select *, lead(Ingestion_Time) over(order by Ingestion_Time) next_time
from your_current_query
), unnest(generate_date_array(date(Ingestion_Time) + 1, date(next_time) - 1)) day
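Applied to the sample data in your question, this returns the existing rows plus a row with 0 for each missing date between them. Add an ORDER BY on Ingestion_Time if you need the rows in date order.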
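To make the gap-filling idea easier to check outside BigQuery, here is a minimal sketch in plain Python (not SQL). It assumes the per-day counts have already been computed; the function name fill_missing_dates and the dict input are hypothetical and only used for illustration.

```python
from datetime import date, timedelta


def fill_missing_dates(counts):
    """Return (day, rows_written) pairs for every day between the first and
    last date in `counts`, using 0 for days that have no entry."""
    if not counts:
        return []
    first, last = min(counts), max(counts)
    filled = []
    day = first
    while day <= last:
        filled.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return filled


# Sample data from the question (an in-memory stand-in for the query result).
sample = {
    date(2021, 1, 2): 8,
    date(2021, 1, 5): 5,
    date(2021, 1, 8): 9,
    date(2021, 1, 9): 2,
}

for day, rows_written in fill_missing_dates(sample):
    print(day.isoformat(), rows_written)
```

The SQL above generates only the dates between consecutive rows, while this sketch walks the whole range from the first to the last date; for this data both give the same set of rows.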

Related

big query SQL - repeatedly/recursively change a row's column in the select statement based on the values in previous row

I have a table like the one below:
customer | date | end date
1 | jan 1 2021 | jan 30 2021
1 | jan 2 2021 | jan 31 2021
1 | jan 3 2021 | feb 1 2021
1 | jan 27 2021 | feb 26 2021
1 | feb 3 2021 | mar 5 2021
2 | jan 2 2021 | jan 31 2021
2 | jan 10 2021 | feb 9 2021
2 | feb 10 2021 | mar 12 2021
Now I want to update the value in the 'end date' column of a row based on the previous row's 'end date' and the current row's 'date'.
If the date in the current row < the end date of the previous row, the end date of the current row should be set to the end date of the previous row.
I want this applied repeatedly to all rows (grouped by customer).
I want the output shown below. I only need it in a select statement, not as an update or insert into a table.
Note: in the output below, the second row's end date is updated with the value from the first row (jan 30 2021), so the third row's date (jan 3 2021) is evaluated against the updated value in the second row (jan 30 2021), not against the second row's value before the update (jan 31 2021).
customer | date | end date
1 | jan 1 2021 | jan 30 2021
1 | jan 2 2021 | jan 30 2021 [updated because current date < previous end date]
1 | jan 3 2021 | jan 30 2021 [updated because current date < previous end date]
1 | jan 27 2021 | jan 30 2021 [updated because current date < previous end date]
1 | feb 3 2021 | mar 5 2021
2 | jan 2 2021 | jan 31 2021
2 | jan 10 2021 | jan 31 2021 [updated because current date < previous end date]
2 | feb 10 2021 | mar 12 2021
I think I would go this way. I read the data source twice so that the operation can be performed without updating or inserting into the table.
input table:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-31
1|2021-01-03|2021-02-01
1|2021-01-27|2021-02-26
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-02-09
2|2021-02-10|2021-03-12
code:
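with num_raw_data as (
  -- number each customer's rows in date order; without ORDER BY the numbering is not deterministic
  select row_number() over(partition by customer order by date_init) as num, customer, date_init, date_end
  from `project-id.data-set.table`
), analyzed_data as (
  -- validation = 1 when a row starts before the previous row's end date, within the same month
  select r.num,
    r.customer,
    r.date_init,
    r.date_end,
    case when r.date_init<(select date_end from num_raw_data where num=r.num-1 and customer=r.customer and EXTRACT(month FROM r.date_init)=EXTRACT(month FROM date_init)) then 1 else 0 end validation
  from num_raw_data r
)
-- flagged rows take the earliest end date among the customer's unflagged rows that start before the flagged row ends
select customer,
  date_init,
  case when validation !=0 then (select MIN(date_end) from analyzed_data where validation=0 and customer=ad.customer and date_init<ad.date_end) else date_end end as date_end
from analyzed_data ad
order by customer,num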
output:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-30
1|2021-01-03|2021-01-30
1|2021-01-27|2021-01-30
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-01-31
2|2021-02-10|2021-03-12
The validation column from analyzed_data tells me where I should be looking for changes. I'm not sure if it's fast (probably not), but it works for the scenario in your question.
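For comparison, the rule stated in the question (each row is checked against the previous row's already-updated end date) can be written as a single ordered pass. The following is a minimal sketch in plain Python, not BigQuery SQL and not the query above; the name propagate_end_dates and the tuple layout are assumptions made for illustration.

```python
from datetime import date


def propagate_end_dates(rows):
    """rows: list of (customer, start, end) tuples.
    Returns a new list, sorted by customer and start date, in which a row
    inherits the previous row's (already adjusted) end date whenever it
    starts before that end date."""
    result = []
    prev_customer = None
    prev_end = None
    for customer, start, end in sorted(rows):
        if customer == prev_customer and start < prev_end:
            end = prev_end  # carry the previous end date forward
        result.append((customer, start, end))
        prev_customer, prev_end = customer, end
    return result


# Sample data from the question.
rows = [
    (1, date(2021, 1, 1), date(2021, 1, 30)),
    (1, date(2021, 1, 2), date(2021, 1, 31)),
    (1, date(2021, 1, 3), date(2021, 2, 1)),
    (1, date(2021, 1, 27), date(2021, 2, 26)),
    (1, date(2021, 2, 3), date(2021, 3, 5)),
    (2, date(2021, 1, 2), date(2021, 1, 31)),
    (2, date(2021, 1, 10), date(2021, 2, 9)),
    (2, date(2021, 2, 10), date(2021, 3, 12)),
]

for customer, start, end in propagate_end_dates(rows):
    print(customer, start.isoformat(), end.isoformat(), sep="|")
```

Because each step depends on the adjusted value from the step before, this is a running computation rather than a plain lag(), which is why the SQL version has to work around it.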

Oracle SQL: Counting consecutive site visits based on a sub-string and previous (lag) row

Using Oracle SQL, I’m trying to calculate total unique visits to a website. The table I’m querying does not have a timestamp that includes minutes and seconds, just DDMMYY, and every row in the table represents a customer click on the page. The table designates a new “session” every hour, regardless of whether that actually reflects a new visit from the customer’s POV.
What I must do is use non-consecutive sessions as a proxy for unique visits: if there is an hour break between visits, the previous consecutive grouping is one visit. I define a visit as a unique combination of customer ID + session day + session hour. If there are consecutive session hours within a customer + day combination, I count that as a single session. The HOUR field contains string values that concatenate the date with the hour. To calculate the visit count, I need to parse out the hour and subtract it from the previous (lag) row's hour to determine whether there is a break of more than an hour.
Example of Raw Data:
TRANS_TO_DATE CUSTOMER_ID HOUR
10/21/17 1007589445 October 21, 2017, Hour 1
10/21/17 1007589445 October 21, 2017, Hour 2
10/21/17 1007589445 October 21, 2017, Hour 2
10/21/17 1007589445 October 21, 2017, Hour 2
10/21/17 1007589445 October 21, 2017, Hour 3
10/21/17 1007589445 October 21, 2017, Hour 5
10/21/17 1007589445 October 21, 2017, Hour 6
10/21/17 1007589445 October 21, 2017, Hour 23
10/21/17 1007589445 October 21, 2017, Hour 23
10/21/17 1007589445 October 21, 2017, Hour 23
11/1/17 1007589445 November 1, 2017, Hour 10
1/1/18 1007589445 January 1, 2018, Hour 10
1/1/18 1007589445 January 1, 2018, Hour 10
1/1/18 1007589445 January 1, 2018, Hour 11
1/1/18 1007589445 January 1, 2018, Hour 14
1/1/18 1007589445 January 1, 2018, Hour 20
1/1/18 1007589445 January 1, 2018, Hour 22
The visit count is actually this:
Customer_id Day Hour Visit Grouping
1007589445 October 21, 2017 1 Visit 1
1007589445 October 21, 2017 2 Visit 1
1007589445 October 21, 2017 3 Visit 1
1007589445 October 21, 2017 5 Visit 2
1007589445 October 21, 2017 6 Visit 2
1007589445 October 21, 2017 23 Visit 3
1007589445 November 1, 2017 10 Visit 1
1007589445 January 1, 2018 10 Visit 1
1007589445 January 1, 2018 11 Visit 1
1007589445 January 1, 2018 14 Visit 2
1007589445 January 1, 2018 20 Visit 3
1007589445 January 1, 2018 21 Visit 4
Customer 1007589445 had
- 3 visits on October 21, 2017
- 1 visit on November 1, 2017
- 4 visits on January 1, 2018
Total visits: 8
Below is the SQL code I have so far, which needs to be modified to satisfy the criteria above.
select
CUSTOMER_ID,
TRANS_TO_DATE,
HOUR,
count (HOUR) as visits
from mstr_clickstream_vw
where trans_to_date between start_date and end_date
and web_store_ind='US'
group by CUSTOMER_ID, TRANS_TO_DATE,HOUR
You can get the hour with:
cast(trim(substr(hour, -2)) as int)
Then use this to assign sessions, using lag() and a cumulative conditional sum:
select cs.*,
sum(case when trans_to_date = prev_ttd and prev_hh = hh then 0
when trans_to_date = prev_ttd and prev_hh = hh - 1 then 0
when hh = 0 and prev_hh = 23 and trans_to_date = prev_ttd + interval '1' day then 0
else 1
end) over (partition by customer_id order by trans_to_date, hh) as grouping
from (select cs.*,
lag(trans_to_date) over (partition by customer_id order by trans_to_date, hh) as prev_ttd,
lag(hh) over (partition by customer_id order by trans_to_date, hh) as prev_hh
from (select cs.*,
cast(trim(substr(hour, -2)) as int) as hh
from mstr_clickstream_vw cs
) cs
) cs;
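The same lag-and-running-sum logic can be traced in a minimal plain Python sketch (not Oracle SQL). It assumes the clicks are available as (customer_id, day, hour) tuples; parse_hour and assign_visits are hypothetical helper names. Like the query above, the counter runs per customer rather than restarting each day.

```python
from datetime import date, timedelta


def parse_hour(label):
    """Extract the trailing hour number from a label such as
    'October 21, 2017, Hour 23'."""
    return int(label.rsplit(" ", 1)[-1])


def assign_visits(clicks):
    """clicks: list of (customer_id, day, hour) tuples, with day as a date.
    Returns (customer_id, day, hour, visit_no) tuples, where visit_no is a
    running counter per customer."""
    result = []
    prev = None  # (customer_id, day, hour) of the previous click
    visit_no = 0
    for customer_id, day, hour in sorted(clicks):
        if prev is None or prev[0] != customer_id:
            visit_no = 1  # first click of a new customer
        else:
            _, prev_day, prev_hour = prev
            same_or_next_hour = day == prev_day and hour - prev_hour in (0, 1)
            over_midnight = (
                hour == 0
                and prev_hour == 23
                and day == prev_day + timedelta(days=1)
            )
            if not (same_or_next_hour or over_midnight):
                visit_no += 1  # a gap of more than one hour starts a new visit
        result.append((customer_id, day, hour, visit_no))
        prev = (customer_id, day, hour)
    return result


# Hours per day, taken from the raw data in the question.
hours_by_day = {
    date(2017, 10, 21): [1, 2, 2, 2, 3, 5, 6, 23, 23, 23],
    date(2017, 11, 1): [10],
    date(2018, 1, 1): [10, 10, 11, 14, 20, 22],
}
clicks = [
    (1007589445, day, hour)
    for day, hours in hours_by_day.items()
    for hour in hours
]
visits = assign_visits(clicks)
print(visits[-1][3])  # highest visit number = total visits for this customer
```

For the sample rows the last visit number is 8, which matches the total stated in the question.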

Finding repeat rate using sql

I have the table shown below.
If X customers made a purchase in the month of Jan, how many of them also made one in Feb, i.e. Y? (Repeat Rate: Y/X*100)
customer_no month
---------------------
1 jan
2 jan
3 jan
4 jan
11 jan
1 feb
2 feb
3 feb
9 feb
10 feb
Output:
Repeat_Rate
60%
I would do it like this:
SELECT CAST(COUNT(yourtable_feb.customer_no) as FLOAT)
/ CAST(COUNT(yourtable_jan.customer_no) AS FLOAT) AS Repeating_Rate
FROM yourtable yourtable_jan
LEFT JOIN yourtable yourtable_feb
ON yourtable_jan.customer_no = yourtable_feb.customer_no
AND yourtable_feb.mymonth = 'feb'
WHERE yourtable_jan.mymonth = 'jan'
Here is a Rextester link, if you'd like to test my query:
http://rextester.com/ESO11614
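As a quick cross-check of the expected 60%, here is a minimal plain Python sketch of the same calculation (not SQL). The function name repeat_rate and the list-of-pairs input are assumptions for illustration; it counts distinct customers, so duplicate purchase rows would not inflate the rate.

```python
def repeat_rate(purchases, first_month, second_month):
    """purchases: list of (customer_no, month) pairs.
    Returns the share (0..1) of customers who bought in first_month
    and also bought in second_month."""
    first = {c for c, m in purchases if m == first_month}
    second = {c for c, m in purchases if m == second_month}
    if not first:
        return 0.0  # avoid dividing by zero when nobody bought in first_month
    return len(first & second) / len(first)


# Sample data from the question.
purchases = [
    (1, "jan"), (2, "jan"), (3, "jan"), (4, "jan"), (11, "jan"),
    (1, "feb"), (2, "feb"), (3, "feb"), (9, "feb"), (10, "feb"),
]

rate = repeat_rate(purchases, "jan", "feb")
print(f"{rate:.0%}")
```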

Adding set lists of future dates to rows in a SQL query

So I am doing a cohort analysis for customers, where a cohort is a group of people who started using the product in the same month. I then keep track of each cohort's total use for every subsequent month up till present time.
For example, the first "cohort month" is January 2012, then I have "use months" January 12, Feb 12, March 12, ..., March 17 (current month). One column is "cohort month", and another is "use month". This process repeats for every subsequent cohort month. The table looks like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
...
Feb 17 | Feb 17
Feb 17 | Mar 17
Mar 17 | Mar 17
The problem arises because I want to do forecasting for one year out for both existing and future cohorts.
That means for the Jan 12 cohort, I want to do prediction for April 17 to Mar 18.
I also want to do predictions for the April 17 cohort (which doesn't exist yet) from April 17 to Mar 18. And so on till predictions for the Mar 18 cohort in Mar 18.
I can handle the predictions, don't worry about that.
My issue is that I cannot figure out how to add in this list of (April 17 ... Mar 18) in the "use month" column before every cohort switches.
I also need to add in cohorts April 17 to Mar 18, and have the applicable parts of this list of (April 17 ... Mar 18) for each of these future cohorts.
So I want the table to look like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Jan 12 | Apr 17
..
Jan 12 | Mar 18
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
Feb 12 | Apr 17
...
Feb 12 | Mar 18
...
...
Feb 17 | Feb 17
Feb 17 | Mar 17
...
Feb 17 | Mar 18
Mar 17 | Mar 17
...
Mar 17 | Mar 18
I know the first solution that comes to mind is to create a list of all dates from Jan 12 to Mar 18, cross join it to itself, and then left outer join to the current table I have (where cohort / use months range from Jan 12 to Mar 17). However, this is not scalable.
Is there a way I can just iteratively add in this list of the months of the next year?
I am using HP Vertica, but could use Presto or Hive if absolutely necessary.
I think you should use the query below to create a temporary table out of nothing, and join it with the rest of your query. You can't do anything in a procedural manner in SQL, I'm afraid. You won't be able to get away without a CROSS JOIN. But here, you limit the CROSS JOIN to the generation of the first-of-month pairs that you need.
Here goes:
WITH
-- create a list of integers from 0 to 100 using the TIMESERIES clause
i(i) AS (
SELECT dt::DATE - '2000-01-01'::DATE
FROM (
SELECT '2000-01-01'::DATE + 0
UNION ALL SELECT '2000-01-01'::DATE + 100
) d(d)
TIMESERIES dt AS '1 day' OVER(ORDER BY d::TIMESTAMP)
)
,
-- limits are Jan-2012 to the first of the current month plus one year
month_limits(month_limit) AS (
SELECT '2012-01-01'::DATE
UNION ALL SELECT ADD_MONTHS(TRUNC(CURRENT_DATE,'MONTH'),12)
)
-- create the list of possible months as a CROSS JOIN of the i table
-- containing the integers and the month_limits table, using ADD_MONTHS()
-- and the smallest and greatest month of the month limits
,month_list AS (
SELECT
ADD_MONTHS(MIN(month_limit),i) AS month_first
FROM month_limits CROSS JOIN i
GROUP BY i
HAVING ADD_MONTHS(MIN(month_limit),i) <= (
SELECT MAX(month_limit) FROM month_limits
)
)
-- finally, CROSS JOIN the obtained month list with itself with the
-- filters needed.
SELECT
cohort.month_first AS cohort_month
, use.month_first AS use_month
FROM month_list AS cohort
CROSS JOIN month_list AS use
WHERE use.month_first >= cohort.month_first
ORDER BY 1,2
;
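To see what rows the query is meant to produce, here is a minimal plain Python sketch (not Vertica SQL) that builds the same cohort month / use month pairs. The names month_range and cohort_use_pairs are hypothetical, and the end month is fixed at Mar 2018 to match the example in the question rather than derived from the current date.

```python
from datetime import date


def month_range(first, last):
    """Yield the first day of every month from `first` to `last`, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def cohort_use_pairs(first, last):
    """Return every (cohort_month, use_month) pair with use_month >= cohort_month,
    ordered by cohort month and then use month."""
    months = list(month_range(first, last))
    return [(cohort, use) for cohort in months for use in months if use >= cohort]


pairs = cohort_use_pairs(date(2012, 1, 1), date(2018, 3, 1))
print(len(pairs))
print(pairs[0], pairs[-1])
```

The filter on use >= cohort plays the same role as the WHERE clause on the self CROSS JOIN in the query above.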

Sum of Previous Yr

I have a simple query which does the below:
SELECT
B.WEEK_DT WEEK_DT,
SUM(A.PROFIT) PROFIT
FROM
CUSTOMERS A
INNER JOIN WEEK_TABLE B
ON A.WEEK_ID = B.WEEK_ID
Now I want to extend this query to also get the sum of profit for all of 2013. That is, the query above gives me values at the weekly level, and I also want a separate column, 2013_Profit, that sums up all weeks of the previous year.
week_dt is in the format mm-dd-yyyy.
Also, we have an offset in the week table, if that helps:
WK_OFFSET WK_DT
-13 February 22, 2014
-12 March 1, 2014
-11 March 8, 2014
-10 March 15, 2014
-9 March 22, 2014
-8 March 29, 2014
-7 April 5, 2014
-6 April 12, 2014
-5 April 19, 2014
-4 April 26, 2014
-3 May 3, 2014
-2 May 10, 2014
-1 May 17, 2014
Please let me know how I can get another column for each customer that gives the sum of the previous year's profits.
Something like the below:
Customer Curr_WK_Profit Prev_YR_Profit
AAA 10 520
BBB 20 1040
CCC 30 1560