Hive percentile group by two variables - hive

I have a Hive table where I want to find the 10th percentile, median, and 90th percentile of a value on a per-location, per-weekday basis. A mockup of the table is below. How can I write a query so that the output table columns are location, weekday, 10th percentile, median, and 90th percentile of MyValue? (Assume that the actual table has many different locations and multiple entries per location/weekday combination.)
I have tried:
create table myschema.my_output_table as
select location, weekday,
percentile(MyValue,0.1) over location,weekday as Weekday10pctile
from myschema.my_input_table
Sample Data:
Location Weekday MyValue
Location_A Monday 2.844958857
Location_A Monday 1.22455235
Location_A Monday 2.415189236
Location_A Monday 2.162431558
Location_A Tuesday 2.200264375
Location_A Tuesday 1.218341845
Location_A Tuesday 1.668882003
Location_A Tuesday 0.077343061
Location_A Wednesday 2.977162672
Location_A Wednesday 2.059018125
Location_A Wednesday 2.309147998
Location_A Wednesday 1.241566476

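Use the percentile_approx function, since the values in the column are DOUBLE (percentile does not work with floating point types). Note that the values returned might not be actual values from the dataset. Because this query uses a window, every input row is returned with its group's percentiles; use select distinct, or group by location, weekday with the plain aggregates, to get one row per combination.
select location, weekday,
percentile_approx(MyValue,0.1) over w as Weekday10pctile,
percentile_approx(MyValue,0.5) over w as WeekdayMedian,
percentile_approx(MyValue,0.9) over w as Weekday90pctile
from myschema.my_input_table
window w as (partition by location,weekday)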
percentile_approx description from the documentation
percentile_approx(DOUBLE col, p [, B])
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
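To make the grouping logic concrete, here is a minimal Python sketch (not Hive) that computes the 10th percentile, median, and 90th percentile per location/weekday over the sample rows from the question. The `percentile` helper is a hypothetical name and uses plain linear interpolation between sorted values, which is an assumption; the numbers Hive returns from percentile_approx may differ slightly.

```python
from collections import defaultdict

def percentile(values, p):
    """Percentile by linear interpolation between the two nearest sorted values.

    Assumption: this is one common definition, not necessarily Hive's.
    """
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Sample rows from the question: (location, weekday, MyValue)
rows = [
    ("Location_A", "Monday", 2.844958857),
    ("Location_A", "Monday", 1.22455235),
    ("Location_A", "Monday", 2.415189236),
    ("Location_A", "Monday", 2.162431558),
    ("Location_A", "Tuesday", 2.200264375),
    ("Location_A", "Tuesday", 1.218341845),
    ("Location_A", "Tuesday", 1.668882003),
    ("Location_A", "Tuesday", 0.077343061),
    ("Location_A", "Wednesday", 2.977162672),
    ("Location_A", "Wednesday", 2.059018125),
    ("Location_A", "Wednesday", 2.309147998),
    ("Location_A", "Wednesday", 1.241566476),
]

# Equivalent of "group by location, weekday"
groups = defaultdict(list)
for location, weekday, value in rows:
    groups[(location, weekday)].append(value)

# One output row per group: (10th percentile, median, 90th percentile)
summary = {
    key: (percentile(vals, 0.1), percentile(vals, 0.5), percentile(vals, 0.9))
    for key, vals in groups.items()
}

for (location, weekday), (p10, p50, p90) in summary.items():
    print(location, weekday, round(p10, 4), round(p50, 4), round(p90, 4))
```

The dictionary keyed by (location, weekday) plays the role of the partition or group, so each combination yields exactly one set of percentiles.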

Related

SQL week number for the whole table

How to create a new column which calculates week number but for the whole table ignoring year?
Desired output is as follows:
Appreciate any help :)
You can do this by calculating the first day of the week of the oldest row, then taking the day difference between the first day of the week of the current row and that of the oldest row. Dividing that difference by 7 and adding 1 gives the desired week number across the full table.
Assuming you are using MySQL and the first day of the week is Sunday:
WITH min_week_start AS (
SELECT
SUBDATE(MIN(record_date), dayofweek(MIN(record_date)) - 1) as week_start_date
FROM
record_table
),
record_week_start AS (
SELECT
record_date,
SUBDATE(record_date, dayofweek(record_date) - 1) as week_start_date
FROM
record_table
)
SELECT
record_week_start.record_date,
DATEDIFF(record_week_start.week_start_date, min_week_start.week_start_date) DIV 7 + 1 as week_num
FROM
record_week_start
CROSS JOIN
min_week_start
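The same arithmetic can be sketched outside SQL. The Python below is only an illustration of the steps described above, assuming weeks start on Sunday; `week_start`, `week_numbers`, and the sample dates are hypothetical and not taken from the original table.

```python
from datetime import date, timedelta

def week_start(d):
    """Return the Sunday on or before d (weeks are assumed to start on Sunday)."""
    # date.weekday() gives Monday=0 .. Sunday=6, so (weekday + 1) % 7
    # is the number of days elapsed since the most recent Sunday.
    return d - timedelta(days=(d.weekday() + 1) % 7)

def week_numbers(dates):
    """Map each date to its week number counted from the oldest date's week."""
    first = week_start(min(dates))
    return {d: (week_start(d) - first).days // 7 + 1 for d in dates}

# Hypothetical sample dates spanning a year boundary
record_dates = [
    date(2020, 12, 30),
    date(2021, 1, 2),
    date(2021, 1, 3),
    date(2021, 1, 12),
]
result = week_numbers(record_dates)
for d in sorted(result):
    print(d, result[d])
```

Because every week start is a Sunday, the day difference is always a multiple of 7, so integer division is exact and the numbering keeps increasing across the year boundary.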

Calculate Week Numbers based on the initial given date to end date

I have a scenario where the business wants to calculate week numbers from a given start date to an end date.
For example: Start Date = 8/24/2020, End Date = 12/31/2020 (the start and end dates are not constant; they may change from year to year).
Expected Output below:
Date 1 Date 2 Week Number
8/24/2020 8/30/2020 week1
8/31/2020 9/6/2020 week2
9/7/2020 9/14/2020 week3
9/15/2020 9/21/2020 week4
9/22/2020 9/28/2020 week5
9/29/2020 10/5/2020 week6
10/6/2020 10/12/2020 week7
10/13/2020 10/19/2020 week8
10/20/2020 10/26/2020 week9
10/27/2020 11/02/2020 week10
11/03/2020 11/09/2020 week11
11/10/2020 11/16/2020 week12
11/17/2020 11/23/2020 week13
11/24/2020 11/30/2020 week14
I need an Oracle query to calculate week numbers like the above: each week number covers 7 days counted from the start date. Keep in mind that weeks cross month boundaries, and some months have 30 days and some 31. How can I calculate this? Appreciate your help!
It seems you're looking for a custom week definition rather than the built-ins, but that is not overly difficult. The first step is to convert from strings to dates (if the columns come off a table as dates, this conversion is not required). From there, let Oracle do all the calculations, since you can apply arithmetic operations to dates (except adding two dates). Oracle automatically handles the differing number of days per month correctly.
Two methods for this request:
Use a recursive CTE (with)
with dates(start_date,end_date) as
( select date '2020-08-24' start_date
, date '2020-12-31' end_date
from dual
)
, weeks (wk, wk_start, wk_end, e_date) as
( select 1, start_date, start_date+6 ld, end_date from dates
union all
select wk+1, wk_end+1, wk_end+7, e_date
from weeks
where wk_end<e_date
)
select wk, wk_start, wk_end from weeks;
Use Oracle connect by
with dates(start_date,end_date) as
( select date '2020-08-24' start_date
, date '2020-12-31' end_date
from dual
)
select level wk
, start_date+7*(level-1) wk_start
, start_date+6+7*(level-1) wk_end
from dates
connect by level <= ceil( (end_date-start_date)/7.0);
Depending on how strictly you need to honor the specified end date, you may need to adjust the last row returned. Neither query adjusts for that; they only ensure that no week begins after that date. The last week still contains the full 7 days, so it may end after the specified end date.
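As a rough illustration of the week generation and of the last-row adjustment mentioned above, here is a small Python sketch (not Oracle). The `week_ranges` function and its `clip_last` flag are hypothetical names introduced for this example only.

```python
from datetime import date, timedelta

def week_ranges(start, end, clip_last=False):
    """Generate (week number, week start, week end) in 7-day steps from start.

    No week begins after `end`. With clip_last=True the final week is
    truncated so that it does not run past `end`.
    """
    weeks = []
    wk = 1
    wk_start = start
    while wk_start <= end:
        wk_end = wk_start + timedelta(days=6)
        if clip_last and wk_end > end:
            wk_end = end
        weeks.append((wk, wk_start, wk_end))
        wk_start += timedelta(days=7)
        wk += 1
    return weeks

full = week_ranges(date(2020, 8, 24), date(2020, 12, 31))
clipped = week_ranges(date(2020, 8, 24), date(2020, 12, 31), clip_last=True)
for wk, wk_start, wk_end in clipped:
    print(f"week{wk}", wk_start, wk_end)
```

Plain date arithmetic handles 30-day and 31-day months without special cases, which is the same reason the Oracle queries need no month logic.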
If your date column is stored as varchar, first convert it to a date and then format it back to varchar:
to_char(to_date('8/24/2020','MM/DD/YYYY'),'WW')
If you want to keep the week as a number, you can do something like this:
to_number(to_char(to_date('8/24/2020','MM/DD/YYYY'),'WW'))
There are a few format options, depending on your need:
WW Week of year (1-53) where week 1 starts on the first day of the year and continues to the seventh day of the year.
W Week of month (1-5) where week 1 starts on the first day of the month and ends on the seventh.
IW Week of year (1-52 or 1-53) based on the ISO standard.

Difference of timestamp event rows using WHERE clause

I have two event tables with timestamped data: Registered, Signed_In.
Both have columns such as: original_timestamp, user_id
I am trying to find users who haven't signed in within 30 days after registering.
I tried the following query, but I am getting the difference in hours, whereas I wanted the difference in days, which seems to be unsupported in BigQuery. I also cannot add a WHERE clause to it.
SELECT Signed_In.user_id, TIMESTAMP_DIFF(Registered.original_timestamp, Signed_In.original_timestamp, HOUR) AS days_difference
FROM `test_db.Signed_In` signed_in
JOIN `test_db.Registered` registered
ON Signed_In.user_id = Registered.user_id
GROUP BY 1,2
ORDER BY 2 DESC
WHERE days_difference > '30'
I am getting two columns: user_id, days_difference but the days_difference shows hours and my WHERE clause is rejected when I use it.
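You can try the code below. TIMESTAMP_DIFF(a, b, part) returns a minus b, so the sign-in timestamp goes first to get a positive number of days since registration. The filter uses TIMESTAMP_DIFF with DAY, which is the function intended for TIMESTAMP values, and compares against the number 30 rather than the string '30'.
Note: Using ordinal positions in GROUP BY and ORDER BY is not good practice. It's always safer to use the column names directly.
SELECT Signed_In.user_id,
TIMESTAMP_DIFF(Signed_In.original_timestamp, Registered.original_timestamp, DAY) AS days_difference
FROM `test_db.Signed_In` signed_in
JOIN `test_db.Registered` registered
ON Signed_In.user_id = Registered.user_id
WHERE TIMESTAMP_DIFF(Signed_In.original_timestamp, Registered.original_timestamp, DAY) > 30
GROUP BY 1,2
ORDER BY 2 DESC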
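Just replace HOUR with DAY in your query, and put the later timestamp first, since TIMESTAMP_DIFF returns the first argument minus the second:
SELECT Signed_In.user_id, TIMESTAMP_DIFF(Signed_In.original_timestamp, Registered.original_timestamp, DAY) AS days_difference
TIMESTAMP_DIFF itself accepts MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, and DAY. The full list of BigQuery date parts, as documented for EXTRACT, is: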
MICROSECOND
MILLISECOND
SECOND
MINUTE
HOUR
DAYOFWEEK
DAY
DAYOFYEAR
WEEK: Returns the week number of the date in the range [0, 53]. Weeks begin with Sunday, and dates prior to the first Sunday of the year are in week 0.
WEEK(<WEEKDAY>): Returns the week number of timestamp_expression in the range [0, 53]. Weeks begin on WEEKDAY. datetimes prior to the first WEEKDAY of the year are in week 0. Valid values for WEEKDAY are SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, and SATURDAY.
ISOWEEK: Returns the ISO 8601 week number of the datetime_expression. ISOWEEKs begin on Monday. Return values are in the range [1, 53]. The first ISOWEEK of each ISO year begins on the Monday before the first Thursday of the Gregorian calendar year.
MONTH
QUARTER
YEAR
ISOYEAR: Returns the ISO 8601 week-numbering year, which is the Gregorian calendar year containing the Thursday of the week to which date_expression belongs.
DATE
DATETIME
TIME
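For reference, the day-difference filter discussed in this thread can be sketched in Python. This is only an illustration with made-up sample data (`registered`, `signed_in`, and the user ids are hypothetical), and it additionally separates out users who never signed in, which an inner join would drop.

```python
from datetime import datetime

# Hypothetical sample data: registration time per user, and sign-in events
registered = {
    "u1": datetime(2020, 1, 1, 9, 0),
    "u2": datetime(2020, 1, 1, 9, 0),
    "u3": datetime(2020, 1, 10, 9, 0),
}
signed_in = [
    ("u1", datetime(2020, 1, 5, 9, 0)),
    ("u2", datetime(2020, 3, 1, 9, 0)),
    ("u2", datetime(2020, 3, 2, 9, 0)),
]

# Earliest sign-in per user
first_sign_in = {}
for user_id, ts in signed_in:
    if user_id not in first_sign_in or ts < first_sign_in[user_id]:
        first_sign_in[user_id] = ts

# Whole days between registration and first sign-in (sign-in minus registration)
days_difference = {
    user_id: (first_sign_in[user_id] - reg_ts).days
    for user_id, reg_ts in registered.items()
    if user_id in first_sign_in
}

late_users = sorted(u for u, days in days_difference.items() if days > 30)
never_signed_in = sorted(u for u in registered if u not in first_sign_in)

print(days_difference)
print(late_users, never_signed_in)
```

The subtraction order matters in the same way as in the SQL: subtracting the registration time from the sign-in time gives a positive number of days for sign-ins that happen after registration.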

Pass column value as Date Part argument

I am trying to generate a string array of weekdays and use it to find how many times each day appears in a month.
I am using standard sql on BigQuery
My query would look like
with weeks as (select array['SUNDAY','MONDAY','TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY'] as wk)
select DATE_DIFF('2019-01-31','2019-01-01',WEEK(wk)) AS week_weekday_diff
from weeks, unnest(wk) as wk
However, the query fails with the error "A valid date part argument for WEEK is required, but found wk". wk is a column value holding the days of the week, while WEEK is a function that expects a literal weekday name. Is there a way I can pass the column value as an argument?
Below is for BigQuery Standard SQL.
Regarding the error "A valid date part argument for WEEK is required, but found wk": in WEEK(<WEEKDAY>), the valid values for WEEKDAY are the literals SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, and SATURDAY.
Regarding "Is there a way i pass the column value as arguments?": if you wish, you can submit a feature request at https://issuetracker.google.com/issues/new?component=187149&template=0
Regarding "find how many times each day appears in a month": to get your expected result and work around the above limitation, you can approach the task from the opposite angle. Extract the weekday positions and then compute the needed stats, as in the example below.
#standardSQL
WITH weekdays AS (SELECT ['SUNDAY','MONDAY','TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY'] AS wk)
SELECT wk[ORDINAL(pos)] weekday, COUNT(1) cnt
FROM weekdays,
UNNEST(GENERATE_DATE_ARRAY('2019-01-01','2019-01-31')) day,
UNNEST([EXTRACT(DAYOFWEEK FROM day)]) pos
GROUP BY pos, weekday
ORDER BY pos
with result
Row weekday cnt
1 SUNDAY 4
2 MONDAY 4
3 TUESDAY 5
4 WEDNESDAY 5
5 THURSDAY 5
6 FRIDAY 4
7 SATURDAY 4
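The same "opposite angle" idea, enumerating every date and counting by weekday, can be checked with a short Python sketch. The `weekday_counts` function is a hypothetical helper written for this illustration; it is not part of BigQuery.

```python
from collections import Counter
from datetime import date, timedelta

def weekday_counts(start, end):
    """Count how many times each weekday occurs between start and end, inclusive."""
    # Names ordered to match date.weekday(), where Monday is 0 and Sunday is 6
    names = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
             "FRIDAY", "SATURDAY", "SUNDAY"]
    counts = Counter()
    day = start
    while day <= end:
        counts[names[day.weekday()]] += 1
        day += timedelta(days=1)
    return counts

counts = weekday_counts(date(2019, 1, 1), date(2019, 1, 31))
for name in ["SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
             "THURSDAY", "FRIDAY", "SATURDAY"]:
    print(name, counts[name])
```

For January 2019 this agrees with the result table above: Tuesday, Wednesday, and Thursday occur five times and the other days four times.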
Trying your query, I noticed that the following returns an error:
select DATE_DIFF('2019-01-31','2019-01-01',WEEK('WEDNESDAY')) AS week_weekday_diff;
as the function WEEK(< WEEKDAY >) is expecting something like:
select DATE_DIFF('2019-01-31','2019-01-01',WEEK(`WEDNESDAY`)) AS week_weekday_diff;
OR
select DATE_DIFF('2019-01-31','2019-01-01',WEEK(WEDNESDAY)) AS week_weekday_diff;
I think that WEEK(< WEEKDAY >) only accepts the weekdays in the unquoted format shown here, so strings are not valid.

How to retrieve first row of the week for each week

I have a table that looks like that
foo_date | bar
---------------
2018-01-01 | bar
2018-01-09 | bar
2018-01-10 | bar
2018-01-20 | bar
I would like to build a query that retrieves, for each week, the row that occurs first in that week.
Cheers
You can simply do:
select distinct on (date_trunc('week', foo_date)) t.*
from t
order by date_trunc('week', foo_date), foo_date;
SELECT DISTINCT ON (year, week)
foodate, bar
FROM (
SELECT
foodate,
bar,
EXTRACT('isoyear' FROM foodate) as year,
EXTRACT('week' FROM foodate) as week
FROM dates
)s
ORDER BY year, week, foodate
EXTRACT('week'...) gives the week number, so two dates in the same week produce the same value in this column.
DISTINCT ON (year, week) keeps the first row for each week, according to the ordering.
Postgres Date functions
Notice the definition of the week:
The number of the ISO 8601 week-numbering week of the year. By
definition, ISO weeks start on Mondays and the first week of a year
contains January 4 of that year. In other words, the first Thursday of
a year is in week 1 of that year.
Edit: If you have data spanning more than one year, you should of course add the year as well. Otherwise you get, for example, only the first row across the first weeks of all years combined.
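The "first row per ISO week" logic can also be sketched in Python, using the sample rows from the question. This is an illustration of the grouping key (ISO year plus ISO week), not the Postgres query itself; the variable names are hypothetical.

```python
from datetime import date

# Sample rows from the question: (foo_date, bar)
rows = [
    (date(2018, 1, 1), "bar"),
    (date(2018, 1, 9), "bar"),
    (date(2018, 1, 10), "bar"),
    (date(2018, 1, 20), "bar"),
]

first_per_week = {}
# Sorting by date first means the first row seen for a week is its earliest row
for foo_date, bar in sorted(rows):
    iso = foo_date.isocalendar()
    key = (iso[0], iso[1])  # (ISO year, ISO week number)
    first_per_week.setdefault(key, (foo_date, bar))

for key in sorted(first_per_week):
    print(key, first_per_week[key])
```

Keying on both the ISO year and the ISO week keeps weeks from different years apart, which is the point of the edit note above.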