Using Impala get the count of consecutive trips - sql

Sample Data
touristid|day
ABC|1
ABC|1
ABC|2
ABC|4
ABC|5
ABC|6
ABC|8
ABC|10
The output should be
touristid|trip
ABC|4
The logic behind 4: it is the count of distinct runs of consecutive days. The sequence 1,1,2 is the 1st trip, 4,5,6 is the 2nd, 8 is the 3rd, and 10 is the 4th.
I want to produce this output with an Impala query.

Get the previous day using the lag() function, set new_trip_flag when day - prev_day > 1 (or when there is no previous day), then count(new_trip_flag).
Demo:
with table1 as (
select 'ABC' as touristid, 1 as day union all
select 'ABC' as touristid, 1 as day union all
select 'ABC' as touristid, 2 as day union all
select 'ABC' as touristid, 4 as day union all
select 'ABC' as touristid, 5 as day union all
select 'ABC' as touristid, 6 as day union all
select 'ABC' as touristid, 8 as day union all
select 'ABC' as touristid, 10 as day
)
select touristid, count(new_trip_flag) trip_cnt
from
( -- calculate new_trip_flag
  select touristid,
         case when (day-prev_day) > 1 or prev_day is NULL then true end new_trip_flag
  from
  ( -- get prev_day
    select touristid, day,
           lag(day) over(partition by touristid order by day) prev_day
    from table1
  )s
)s
group by touristid;
Result:
touristid trip_cnt
ABC 4
The same will work in Hive also.
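For anyone who wants to sanity-check the lag() logic outside the database, here is a minimal Python sketch of the same rule. The function name count_trips is made up for illustration and is not part of Impala or Hive; it just mirrors the new_trip_flag condition from the query above.

```python
def count_trips(days):
    """Count runs of consecutive days; duplicate days stay in the same run."""
    trips = 0
    prev_day = None  # stands in for lag(day): no previous row yet
    for day in sorted(days):
        # Same rule as new_trip_flag: first row, or a gap of more than one day.
        if prev_day is None or day - prev_day > 1:
            trips += 1
        prev_day = day
    return trips


sample_days = [1, 1, 2, 4, 5, 6, 8, 10]  # the days for touristid ABC
print(count_trips(sample_days))  # 4
```

The sort corresponds to the ORDER BY day inside the window, and the running prev_day corresponds to the lag() column.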

Related

How can i do a rolling 12 month sum when some year month values are missing?

I am calculating rolling sum as such:
select
city,
month_year,
person,
sum(total) over (partition by person,city order by month_year rows between 11 preceding and current row) rolling_one_year
from
(select
city,
month_year,
person,
sum(amount_dollar) as total
from db1 d
group by 1,2,3) ;
However, not every person has a row for every month_year. The rolling 12-month sum above is correct only IF the month values are consecutive.
But what if a month is missing for a person, e.g. 202208? With the logic above, the window would cover 202201 to 202301, which is 13 months.
How can I adapt my code above to ensure that the range of months selected stays within 1 year?
A possible solution is to LEFT JOIN your data to the calendar table.
Here is a guide on how to create the calendar table if you don't have one:
"Create a date table in hive"
You should use a logical window frame, RANGE, instead of ROWS. Consider the query below.
WITH monthly_total AS (
SELECT '201911' year_month, 4 total UNION ALL
SELECT '201912' year_month, 10 total UNION ALL
SELECT '202201' year_month, 1 total UNION ALL
SELECT '202202' year_month, 3 total UNION ALL
SELECT '202203' year_month, 9 total UNION ALL
SELECT '202204' year_month, 4 total UNION ALL
SELECT '202205' year_month, 2 total UNION ALL
SELECT '202206' year_month, 8 total UNION ALL
SELECT '202207' year_month, 6 total UNION ALL
SELECT '202209' year_month, 3 total UNION ALL
SELECT '202210' year_month, 10 total UNION ALL
SELECT '202211' year_month, 1 total UNION ALL
SELECT '202212' year_month, 3 total UNION ALL
SELECT '202301' year_month, 50 total
)
SELECT *, SUM(total) OVER w AS rolling_12m_sum
FROM monthly_total
WINDOW w AS (
ORDER BY CAST(SUBSTR(year_month, 1, 4) AS INTEGER) * 12 + CAST(SUBSTR(year_month, 5, 2) AS INTEGER)
RANGE BETWEEN 11 PRECEDING AND CURRENT ROW
) ORDER BY year_month;
I've ignored partition by person,city for simplicity.
The link below may help if you're not familiar with RANGE:
https://learnsql.com/blog/difference-between-rows-range-window-functions/
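To see why RANGE over a month number handles the gap, here is a rough Python sketch of the same window. The helper names month_index and rolling_12m_sum are hypothetical; the data is the sample from the query above.

```python
def month_index(year_month):
    """Map 'YYYYMM' to a running month number (year * 12 + month), as in the ORDER BY."""
    return int(year_month[:4]) * 12 + int(year_month[4:6])


def rolling_12m_sum(monthly_total):
    """Sum totals whose month number lies within 11 months before the current row."""
    result = {}
    for ym in sorted(monthly_total):
        current = month_index(ym)
        result[ym] = sum(
            total
            for other, total in monthly_total.items()
            if current - 11 <= month_index(other) <= current
        )
    return result


monthly_total = {
    "201911": 4, "201912": 10, "202201": 1, "202202": 3, "202203": 9,
    "202204": 4, "202205": 2, "202206": 8, "202207": 6, "202209": 3,
    "202210": 10, "202211": 1, "202212": 3, "202301": 50,
}
rolling = rolling_12m_sum(monthly_total)
print(rolling["202301"])  # 99: covers 202202..202301, so 202201 is excluded
```

Because the filter is on the month number rather than on a row count, the missing 202208 simply contributes nothing instead of pulling in a 13th month.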

Incremental business day column that resets each month

I need to create a table that contains records with 1) all 365 days of the year and 2) a counter representing which business day of the month the day is. Non-business days should be represented with a 0. For example:
Date | Business Day
2019-10-01 1
2019-10-02 2
2019-10-03 3
2019-10-04 4
2019-10-05 0 // Saturday
2019-10-06 0 // Sunday
2019-10-07 5
....
2019-11-01 1
2019-11-02 0 // Saturday
2019-11-03 0 // Sunday
2019-11-04 2
So far, I've been able to create a table that contains all dates of the year.
CREATE TABLE ${TMPID}_days_of_the_year
(
`theDate` STRING
);
INSERT OVERWRITE TABLE ${TMPID}_days_of_the_year
select
dt_set.theDate
from
(
-- offsets of 0 to 399 days back from 2019-12-31
select date_sub('2019-12-31', a.s + 10*b.s + 100*c.s) as theDate
from
(
select 0 as s union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9
) a
cross join
(
select 0 as s union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9
) b
cross join
(
select 0 as s union all select 1 union all select 2 union all select 3
) c
) dt_set
where dt_set.theDate between '2019-01-01' and '2019-12-31'
order by dt_set.theDate DESC;
And I also have a table that contains all of the weekend days and holidays (this data is loaded from a file, and the date format is YYYY-MM-DD)
CREATE TABLE ${TMPID}_company_holiday
(
`holidayDate` STRING
)
;
LOAD DATA LOCAL INPATH '${FILE}' INTO TABLE ${TMPID}_company_holiday;
My question is: how do I join these tables together while creating the business day counter column as shown in the sample data above?
You can use row_number() for the enumeration. This is a little tricky, because it needs to be conditional, but the information you need is provided by a left join (shown here with the table and column names from your question):
select dy.*,
       (case when ch.holidayDate is null
             then row_number() over (partition by trunc(dy.theDate, 'MONTH'), ch.holidayDate
                                     order by dy.theDate
                                    )
             else 0
        end) as business_day
from ${TMPID}_days_of_the_year dy left join
     ${TMPID}_company_holiday ch
     on dy.theDate = ch.holidayDate;
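If it helps to check the expected numbers, the same counter can be sketched in plain Python. This is only an illustration under assumptions: business_day_counter is a made-up name, and the sample non-business days are just the weekends in the range, standing in for the contents of the holiday table.

```python
from datetime import date, timedelta


def business_day_counter(first, last, non_business_days):
    """Return {date: counter}; the counter restarts each month, 0 on non-business days."""
    counter = {}
    month_key = None
    n = 0
    day = first
    while day <= last:
        if (day.year, day.month) != month_key:
            month_key = (day.year, day.month)  # new month: restart the count
            n = 0
        if day in non_business_days:
            counter[day] = 0
        else:
            n += 1
            counter[day] = n
        day += timedelta(days=1)
    return counter


first, last = date(2019, 10, 1), date(2019, 11, 4)
all_days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
# Assumption for the sample: only Saturdays and Sundays are non-business days.
non_business = {d for d in all_days if d.weekday() >= 5}
counter = business_day_counter(first, last, non_business)
print(counter[date(2019, 10, 7)])  # 5
```

The month reset plays the role of the partition by month, and skipping the increment on non-business days is what partitioning by the holiday column achieves in the query.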

SQL: How to create a weekly user count summary by month

I'm trying to create a week-over-week active user count summary report/table aggregated by month. I have one table for June 2017 and one table for May 2017 which I need to join together. The timestamp column is created_utc, a UNIX timestamp; I can work out how to convert it into a human-readable format and from there extract the week-of-year value (1 through 52). The questions I have are:
Numbering the weeks with values 1 through 4 only: week 1 for June, week 1 for May, week 2 for June, week 2 for May, and so on.
Joining the tables on those week values (1 through 4).
Pivoting the table and adding a WOW Change variable.
I'd like the final table to look like this:
| Week | June_count | May_count |WOW_Change |
|:-----------|:-----------:|:------------:|:----------:|
| Week_1 | 5 | 8 | 0.6 |
| Week_2 | 2 | 1 | -0.5 |
| Week_3 | 10 | 5 | -0.5 |
| Week_4 | 30 | 6 | 1 |
Below is some sample data as well as the code I've started.
CREATE TABLE June
(created_utc int, userid varchar(6))
;
INSERT INTO June
(created_utc, userid)
VALUES
(1496354167, '6eq4xf'),
(1496362973, '6eqzz3'),
(1496431934, '6ewlm8'),
(1496870877, '6fwied'),
(1496778080, '6fo79k'),
(1496933893, '6g1gcg'),
(1497154559, '6gjkid'),
(1497618561, '6hmeud'),
(1497377349, '6h1osm'),
(1497221017, '6god73'),
(1497731470, '6hvmic'),
(1497273130, '6gs4ay'),
(1498080798, '6ioz8q'),
(1497769316, '6hyer4'),
(1497415729, '6h5cgu'),
(1497978764, '6iffwq')
;
CREATE TABLE May
(created_utc int, userid varchar(6))
;
INSERT INTO May
(created_utc, userid)
VALUES
(1493729491, '68sx7k'),
(1493646801, '68m2s2'),
(1493747285, '68uohf'),
(1493664087, '68ntss'),
(1493690759, '68qe5k'),
(1493829196, '691fy9'),
(1493646344, '68m1dv'),
(1494166859, '69rhkl'),
(1493883023, '6963qb'),
(1494362328, '6a83wv'),
(1494525998, '6alv6c'),
(1493945230, '69bkhb'),
(1494050355, '69jqtz'),
(1494418011, '6accd0'),
(1494425781, '6ad0xm'),
(1494024697, '69hx2z'),
(1494586576, '6aql9y')
;
#standardSQL
SELECT created_utc,
  DATE(TIMESTAMP_SECONDS(created_utc)) as event_date,
  CAST(EXTRACT(WEEK FROM TIMESTAMP_SECONDS(created_utc)) AS STRING) AS week_number,
  COUNT(distinct userid) as user_count
FROM June
GROUP BY created_utc, event_date, week_number;
SELECT created_utc,
  DATE(TIMESTAMP_SECONDS(created_utc)) as event_date,
  CAST(EXTRACT(WEEK FROM TIMESTAMP_SECONDS(created_utc)) AS STRING) AS week_number,
  COUNT(distinct userid) as user_count
FROM May
GROUP BY created_utc, event_date, week_number;
Below is for BigQuery Standard SQL
#standardSQL
SELECT
CONCAT('Week_', CAST(week AS STRING)) Week,
June.user_count AS June_count,
May.user_count AS May_count,
ROUND((May.user_count - June.user_count) / June.user_count, 2) AS WOW_Change
FROM (
SELECT COUNT(DISTINCT userid) user_count,
DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) + 1 week
FROM `project.dataset.June`
GROUP BY week
) June
JOIN (
SELECT COUNT(DISTINCT userid) user_count,
DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) + 1 week
FROM `project.dataset.May`
GROUP BY week
) May
USING(week)
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.June` AS (
SELECT 1496354167 created_utc, '6eq4xf' userid UNION ALL
SELECT 1496362973, '6eqzz3' UNION ALL
SELECT 1496431934, '6ewlm8' UNION ALL
SELECT 1496870877, '6fwied' UNION ALL
SELECT 1496778080, '6fo79k' UNION ALL
SELECT 1496933893, '6g1gcg' UNION ALL
SELECT 1497154559, '6gjkid' UNION ALL
SELECT 1497618561, '6hmeud' UNION ALL
SELECT 1497377349, '6h1osm' UNION ALL
SELECT 1497221017, '6god73' UNION ALL
SELECT 1497731470, '6hvmic' UNION ALL
SELECT 1497273130, '6gs4ay' UNION ALL
SELECT 1498080798, '6ioz8q' UNION ALL
SELECT 1497769316, '6hyer4' UNION ALL
SELECT 1497415729, '6h5cgu' UNION ALL
SELECT 1497978764, '6iffwq'
), `project.dataset.May` AS (
SELECT 1493729491 created_utc, '68sx7k' userid UNION ALL
SELECT 1493646801, '68m2s2' UNION ALL
SELECT 1493747285, '68uohf' UNION ALL
SELECT 1493664087, '68ntss' UNION ALL
SELECT 1493690759, '68qe5k' UNION ALL
SELECT 1493829196, '691fy9' UNION ALL
SELECT 1493646344, '68m1dv' UNION ALL
SELECT 1494166859, '69rhkl' UNION ALL
SELECT 1493883023, '6963qb' UNION ALL
SELECT 1494362328, '6a83wv' UNION ALL
SELECT 1494525998, '6alv6c' UNION ALL
SELECT 1493945230, '69bkhb' UNION ALL
SELECT 1494050355, '69jqtz' UNION ALL
SELECT 1494418011, '6accd0' UNION ALL
SELECT 1494425781, '6ad0xm' UNION ALL
SELECT 1494024697, '69hx2z' UNION ALL
SELECT 1494586576, '6aql9y'
)
SELECT
CONCAT('Week_', CAST(week AS STRING)) Week,
June.user_count AS June_count,
May.user_count AS May_count,
ROUND((May.user_count - June.user_count) / June.user_count, 2) AS WOW_Change
FROM (
SELECT COUNT(DISTINCT userid) user_count,
DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) + 1 week
FROM `project.dataset.June`
GROUP BY week
) June
JOIN (
SELECT COUNT(DISTINCT userid) user_count,
DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) + 1 week
FROM `project.dataset.May`
GROUP BY week
) May
USING(week)
-- ORDER BY week
with this result (the sample data for May covers only the first two weeks, so the joined result also shows only two weeks; this should not be an issue when you apply it to real data):
Row Week June_count May_count WOW_Change
1 Week_1 5 12 1.4
2 Week_2 6 5 -0.17
Use arithmetic on the day of the month to get the week:
SELECT j.week_number, j.user_count as june_user_count,
       m.user_count as may_user_count
FROM (-- DIV is integer division; "/" would return a FLOAT64 in BigQuery. Weeks are 0-based here.
      SELECT DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) as week_number,
             COUNT(distinct userid) as user_count
      FROM June
      GROUP BY week_number
     ) j JOIN
     (SELECT DIV(EXTRACT(DAY FROM DATE(TIMESTAMP_SECONDS(created_utc))) - 1, 7) as week_number,
             COUNT(distinct userid) as user_count
      FROM May
      GROUP BY week_number
     ) m
     ON m.week_number = j.week_number;
Note that splitting data into different tables just based on the date is a bad idea. The data should all go into one table, perhaps partitioned if data volume is an issue.
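As a quick cross-check of the day-of-month arithmetic, here is a small Python sketch. The names week_of_month and weekly_user_counts are hypothetical; the timestamps are taken from the June sample data and interpreted as UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone


def week_of_month(created_utc):
    """1-based week of the month from a UNIX timestamp: days 1-7 -> 1, 8-14 -> 2, ..."""
    day = datetime.fromtimestamp(created_utc, tz=timezone.utc).day
    return (day - 1) // 7 + 1


def weekly_user_counts(rows):
    """Count distinct userids per week of the month for (created_utc, userid) rows."""
    users = defaultdict(set)
    for created_utc, userid in rows:
        users[week_of_month(created_utc)].add(userid)
    return {week: len(ids) for week, ids in sorted(users.items())}


june_rows = [
    (1496354167, "6eq4xf"),
    (1496362973, "6eqzz3"),
    (1496933893, "6g1gcg"),
]
print(weekly_user_counts(june_rows))  # {1: 2, 2: 1}
```

Note that days 29 to 31 land in a fifth week with this arithmetic, so a month can produce up to five buckets rather than four.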

SQL Server: Select and group by month

I want to write a SQL query to total the sales for my last six months, as in the code below.
SELECT
MONTH (pc.createTime) AS MONTH,
SUM (partsModelSum) AS totalSum
FROM
partscontractlinkmodel AS pl
RIGHT JOIN partscontract pc ON pl.partsContractID = pc.partsContractID
AND pc.companyID = 8
AND pc.createTime BETWEEN '2013/11/01 00:00:00'
AND '2014/04/30 23:59:59'
WHERE
pl.partsModelID = 21028
GROUP BY
MONTH (pc.createTime)
ORDER BY
totalSum DESC
And the result is:
month totalSum
4 24
But a problem arises: months with no sales records do not appear in the query results. I want those months to appear in the results with a value of 0,
like this:
month totalSum
4 24
3 0
2 0
1 0
12 0
11 0
So, how can I modify the SQL to solve my problem? ;)
Thanks.
If you have some data every month, you can use conditional aggregation:
SELECT MONTH(pc.createTime) AS MONTH,
       -- ELSE 0 so that months without this model show 0 rather than NULL
       SUM(CASE WHEN pl.partsModelID = 21028 THEN partsModelSum ELSE 0 END) AS totalSum
FROM partscontract pc LEFT JOIN
     partscontractlinkmodel pl
     ON pl.partsContractID = pc.partsContractID
-- filters on pc belong in WHERE; in the ON clause of a LEFT JOIN they would not restrict pc
WHERE pc.companyID = 8 AND
      pc.createTime BETWEEN '2013/11/01 00:00:00' AND '2014/04/30 23:59:59'
GROUP BY MONTH(pc.createTime)
ORDER BY totalSum DESC;
If this doesn't work, you need to generate the list of months using a subquery or CTE.
Get a list of months from a table or subquery. Left join the month list to partscontract on MONTH(createTime), then left join partscontract to partscontractlinkmodel as you did. Keep the filters in the ON clauses rather than in WHERE, so that months without sales are not dropped. See the sample below:
;WITH CTE_Month
AS
(
    SELECT 1 AS MonthN UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL
    SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL
    SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12
)
SELECT
    M.MonthN AS MONTH,
    SUM(ISNULL(partsModelSum, 0)) AS totalSum
FROM
    CTE_Month M
    LEFT JOIN partscontract pc
        ON MONTH(pc.createTime) = M.MonthN
        AND pc.companyID = 8
        AND pc.createTime BETWEEN '2013/11/01 00:00:00'
                              AND '2014/04/30 23:59:59'
    LEFT JOIN partscontractlinkmodel AS pl
        ON pl.partsContractID = pc.partsContractID
        AND pl.partsModelID = 21028
GROUP BY
    M.MonthN
ORDER BY
    totalSum DESC
You can create a temporary list of months and use it in the join, maybe something like this (the month comes from the list, and the filters sit in the ON clauses so that empty months survive the left joins):
SELECT
    MonthList.monthNum AS MONTH,
    SUM(ISNULL(partsModelSum, 0)) AS totalSum
FROM
    (select 1 monthNum union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10 union select 11 union select 12 ) MonthList
    left join partscontract pc ON MonthList.monthNum = MONTH(pc.createTime)
        AND pc.companyID = 8
        AND pc.createTime BETWEEN '2013/11/01 00:00:00'
                              AND '2014/04/30 23:59:59'
    left join partscontractlinkmodel AS pl ON pc.partsContractID = pl.partsContractID
        AND pl.partsModelID = 21028
GROUP BY
    MonthList.monthNum
ORDER BY
    totalSum DESC
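The idea shared by these answers (start from the full list of months, then add whatever sales exist) can be sketched in a few lines of Python. The function name and the sample rows are made up for illustration; the two rows are hypothetical sales that add up to the 24 shown for month 4 in the question.

```python
def monthly_totals_with_zeros(sales, months):
    """Start every requested month at 0, then add the sales that fall in it."""
    totals = {month: 0 for month in months}  # the "month list" side of the left join
    for month, amount in sales:
        if month in totals:
            totals[month] += amount
    return totals


last_six_months = [11, 12, 1, 2, 3, 4]
sales = [(4, 10), (4, 14)]  # hypothetical (month, partsModelSum) rows
print(monthly_totals_with_zeros(sales, last_six_months))
# {11: 0, 12: 0, 1: 0, 2: 0, 3: 0, 4: 24}
```

Restricting the month list to the six months of interest, as done here, is one way to avoid returning months outside the reporting period.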

How to make a time dependent distribution in SQL?

I have a SQL table in which I keep project information coming from Primavera.
Suppose that I have columns for Start Date, End Date, Duration, and Total Qty, as shown below.
How can I distribute Total Qty over months using this information? What additional columns or SQL queries do I need in order to get the correct monthly distribution?
Thanks in advance.
Columns in order:
itemname,quantity,startdate,duration,enddate
item1 -- 108 -- 2013-03-25 -- 720 -- 2013-07-26
item2 -- 640 -- 2013-03-25 -- 720 -- 2013-07-26
.
.
I think the key is to break the records apart by month. Here is an example of how to do it:
with months as (
      select 1 as mon union all select 2 union all select 3 union all
      select 4 as mon union all select 5 union all select 6 union all
      select 7 as mon union all select 8 union all select 9 union all
      select 10 as mon union all select 11 union all select 12
     )
-- multiply by 1.0 to avoid integer division on integer quantities
select t.itemname, m.mon, t.quantity * 1.0 / t.nummonths as monthly_quantity
from (select t.*, (month(enddate) - month(startdate) + 1) as nummonths
      from t
     ) t join
     months m
     on month(t.startdate) <= m.mon and
        month(t.enddate) >= m.mon;
This works because all the months are within the same year -- as in your example. You are quite vague on how the split should be calculated. So, I assumed that every month from the start to the end gets an equal amount.
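Here is the same equal-split rule as a short Python sketch, using item1 from the question. The function name distribute_by_month is hypothetical, and like the query it assumes the start and end dates fall in the same year.

```python
from datetime import date


def distribute_by_month(quantity, start, end):
    """Split quantity equally across the calendar months from start to end (same year)."""
    nummonths = end.month - start.month + 1
    share = quantity / nummonths
    return {mon: share for mon in range(start.month, end.month + 1)}


item1 = distribute_by_month(108, date(2013, 3, 25), date(2013, 7, 26))
print(sorted(item1))  # [3, 4, 5, 6, 7]
print(item1[3])       # 21.6
```

A split weighted by the number of days in each month would need the day counts per month instead of a flat share; that is a different rule from the one assumed here.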