I'm trying to get a 7 day average for each user in Hive, wondering why this isn't working:

SELECT
date,
user_id,
AVG(value) OVER (
PARTITION BY user_id
ORDER BY unix_timestamp(ftime,'yyyyMMddHH')
RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS value
FROM TABLE

It fails with:

Error in semantic analysis: line 1:205 Invalid Function 'yyyyMMddHH'
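Hive does not accept arbitrary expressions in the ORDER BY of a windowing clause with a RANGE frame, which is likely why the format string gets misparsed as a function name. A possible workaround (a sketch; `my_table` is a placeholder for the real table, and it assumes `ftime` holds strings in yyyyMMddHH format) is to compute the epoch seconds in a subquery first, so the window ORDER BY only references a plain column:

```sql
-- Precompute unix_timestamp in a subquery; the windowing ORDER BY
-- then references the plain column ts instead of a function call.
SELECT
  date,
  user_id,
  AVG(value) OVER (
    PARTITION BY user_id
    ORDER BY ts
    RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS value
FROM (
  SELECT date, user_id, value,
         unix_timestamp(ftime, 'yyyyMMddHH') AS ts
  FROM my_table
) t;
```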
I have a table like this:
I want to list the rows per day between their Start Date and End Date, with Total Payment divided by the number of days (I assume I would need a window function partitioned by name here). But my main concern is how to generate the series of dates for each name based on their Start Date and End Date.
Using the table above I would like the output to look like this:
Consider a range join with a COUNT window function to spread the total across days:
SELECT t."Name",
t."Total Payment" / COUNT(dates) OVER(PARTITION BY t."Name") AS Payment,
t."Start Date",
t."End Date",
dates AS "Date of"
FROM generate_series(
timestamp without time zone '2022-01-01',
timestamp without time zone '2022-12-31',
'1 day'
) AS dates
INNER JOIN my_table t
ON dates BETWEEN t."Start Date" AND t."End Date"
You can get what you're after in a single query by using generate_series to get each day, and by just subtracting the two dates. (Since you seem to want both dates included in the day count, an additional 1 needs to be added.)
select name, (total_payment/( (end_date-start_date) +1))::numeric(6,2), start_date, end_date, d::date date_of
from test t
cross join generate_series(t.start_date
,t.end_date
, interval '1 day'
) gs(d)
order by name desc, date_of;
See demo. I leave for you what to do when the total_payment is not a multiple of the number of days. The demo just ignores it.
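One way to handle a total_payment that is not a multiple of the number of days (a sketch, not part of the original demo; it assumes total_payment is a numeric column) is to round each day's share and add the leftover to the first day's row, so the shares still sum to the total:

```sql
-- Round each day's share to 2 decimals, then add the rounding
-- remainder to the start_date row so the rows sum to total_payment.
select name,
       round(total_payment / days, 2)
         + case when d::date = start_date
                then total_payment - round(total_payment / days, 2) * days
                else 0 end as payment,
       start_date, end_date, d::date as date_of
from (select t.*, (end_date - start_date) + 1 as days from test t) t
cross join generate_series(t.start_date, t.end_date, interval '1 day') gs(d)
order by name desc, date_of;
```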
BigQuery doesn't recognize filter over column timestamp and outputs this:
Cannot query over table 'xxxxxx' without a filter over column(s) 'timestamp' that can be used for partition elimination
Query code that produced this message is:
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (DATETIME(timestamp) BETWEEN DATETIME_ADD(DATETIME('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND DATETIME_SUB(DATE_TRUNC(CURRENT_DATETIME(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2
Everything works if we replace every DATETIME conversion and DATETIME operation with the TIMESTAMP type and the corresponding TIMESTAMP operations:
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (timestamp BETWEEN TIMESTAMP_ADD(TIMESTAMP('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2
The table was created with require_partition_filter set to true, so any query on it must include a filter on the timestamp column.
Refer: Cannot query over table without a filter that can be used for partition elimination
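For reference, that flag is set at table creation time (or later via ALTER TABLE). A minimal sketch of such a DDL (table and column names are placeholders):

```sql
-- With require_partition_filter = TRUE, BigQuery rejects any query on
-- this table that lacks a filter on the partitioning column `timestamp`.
CREATE TABLE `project.dataset.events` (
  project STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp)
OPTIONS (require_partition_filter = TRUE);
```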
I have a table in a PostgreSQL database called orders, where all the order-related information is stored. Now, if an order gets rejected, that order's row gets moved from the orders table into the rejected_orders table. As a result, the count function does not provide the correct number of orders.
Now, if I want to get the number of order requests in a certain day, I have to subtract the id numbers between the last order of the day and the first order of the day. Below is the query for the total number of requests for March 1st, 2022. Sadly, the previous employee forgot to save the timezone correctly in the database: data is saved in the DB at UTC+00, but fetched data needs to be in GMT+06.
select
(select id from orders
where created_at<'2022-03-02 00:00:00+06'
order by created_at desc limit 1
)
-
(select id from orders
where created_at>='2022-03-01 00:00:00+06'
order by created_at limit 1
) as march_1st;
march_1st
-----------
185
Now,
If I want to get the total requests per day for a certain time period (let's say March 2022), how can I do that in one SQL query, without having to write one query per day?
To wrap up:
total_request_per_day = id of last order of the day - id of first order of the day.
How do I write a query based on that logic that would give me total_request_per_day for every day in a certain month, like this:
|Date | total requests|
|01-03-2022 | 187 |
|02-03-2022 | 202 |
|03-03-2022 | 227 |
................
................
With respect, using id numbers to determine numbers of rows in a time period is incorrect. DELETEing rows leaves gaps in id number sequences; they are not designed for this purpose.
This is a job for date_trunc(), COUNT(*), and GROUP BY.
The date_trunc('day', created_at) function turns an arbitrary timestamp into midnight on its day. For example, it turns `2022-03-02 16:41:00` into `2022-03-02 00:00:00`. Using that we can write the query this way.
SELECT COUNT(*) order_count,
date_trunc('day', created_at) day
FROM orders
WHERE created_at >= date_trunc('day', NOW()) - INTERVAL '7 day'
AND created_at < date_trunc('day', NOW())
GROUP BY date_trunc('day', created_at)
This query gives the number of orders on each day in the last 7 days.
Every minute you spend learning how to use SQL data arithmetic like this will pay off in hours saved in your work.
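The question also asks for results in GMT+06 while the data is stored at UTC+00. A sketch of the same approach adjusted for that (assuming created_at is a timestamptz column; the fixed offset is written as an interval to avoid naming a region):

```sql
-- Counts orders per GMT+06 calendar day for March 2022.
-- AT TIME ZONE shifts each timestamp to GMT+06 before truncating,
-- so day boundaries fall at local midnight rather than UTC midnight.
SELECT date_trunc('day', created_at AT TIME ZONE INTERVAL '+06:00') AS day,
       COUNT(*) AS order_count
FROM orders
WHERE created_at >= '2022-03-01 00:00:00+06'
  AND created_at <  '2022-04-01 00:00:00+06'
GROUP BY 1
ORDER BY 1;
```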
Try this:
SELECT d.ref_date :: date AS "date"
, count(created_at) AS "total requests"
FROM generate_series('20220301' :: timestamp, '20220331' :: timestamp, '1 day') AS d(ref_date)
LEFT JOIN orders
ON date_trunc('day', d.ref_date) = date_trunc('day', created_at)
GROUP BY d.ref_date
generate_series() generates the list of reference days where you want to count the number of orders.
Then you join with the orders table by comparing the reference date with the created_at date on year/month/day only. LEFT JOIN allows you to select reference days with no existing order; counting created_at rather than * makes those days show 0 instead of 1, since count() ignores the NULLs produced by the LEFT JOIN.
Finally you count the number of orders per day by grouping by reference day.
I have the following table, which contains two columns: date and total_visits (website visits). I need to compute two new variables (rolling sums).
Column_A: A rolling sum for each day. For each day, I have to present the sum of the total visits from the last 14 days (without considering the current day). Of course, the first 14 days in this table can not have this value due to the fact there are not enough previous days to compute this value.
Column_B: A rolling sum for each day. For each day, I have to present the sum of the total visits considering the days between 4 weeks and 2 weeks before the current day. This means, for example, for 2021-01-29, the value we should be seeing is the sum of the total visits between 2021-01-01 and 2021-01-14. Of course, the first 28 days in the table won't have values for this column due to the fact there is not enough data to compute the value.
The next table is an example:
I currently have a solution in SQL (Workbench), but I need to apply this to a database stored in GCP, and there are syntax differences that I have not been able to resolve. Any hint? Thanks in advance.
Consider below approach
select *,
if(dense_rank() over win <= 14, null, sum(total_visits) over rolling_last_14_day) as total_last_14_day,
if(dense_rank() over win <= 28, null, sum(total_visits) over rolling_between_4_and_2_weeks_ago) as total_between_4_and_2_weeks_ago
from `project.dataset.table`
window win as (order by unix_date(date)),
rolling_last_14_day as (win range between 14 preceding and 1 preceding),
rolling_between_4_and_2_weeks_ago as (win range between 28 preceding and 15 preceding)
If applied to the sample data in your question, the output is:
If you have data on every day, just use a window frame:
select t.*,
(case when row_number() over (order by date) >= 14
then sum(total_visits) over (order by date rows between 13 preceding and current row)
end) as total_14_day,
(case when row_number() over (order by date) >= 28
then sum(total_visits) over (order by date rows between 27 preceding and 14 preceding)
end) as total_between_4_and_2_weeks_ago
from t;
I have a data import which happens every week and when it starts, lasts a couple of days. As a result, in the date column, I have multiple dates for each data import. I would like to get the min date of each import. Is this possible in SQL? Specifically, in Google BigQuery. Example:
date desired_output
4/25/17 4/25/17
4/26/17 4/25/17
4/27/17 4/25/17
5/2/17 5/2/17
5/3/17 5/2/17
5/10/17 5/10/17
5/16/17 5/16/17
5/17/17 5/16/17
5/23/17 5/23/17
5/24/17 5/23/17
5/30/17 5/30/17
5/31/17 5/30/17
6/5/17 6/5/17
6/6/17 6/6/17
You can identify groups of dates that are in order sequentially -- this is a gaps and islands problem. Perhaps this will do what you want:
select date,
min(date) over (partition by date_add(date, interval - seqnum_d day)) as desired_output
from (select t.*,
dense_rank() over (order by date) as seqnum_d
from t
) t
The date arithmetic identifies sequences of dates by subtracting a sequence -- voila! The result is a constant.
Note: This assumes that separate imports are separated by gaps between their dates.
Also, I used dense_rank() so it can handle multiple entries on a single date.
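To see why subtracting the sequence number yields a constant per island, here is the same trick on the first few dates from the question (an illustrative sketch; the annotated output values are worked by hand from those dates):

```sql
-- Subtracting seqnum_d days from each date maps every date in a
-- consecutive run to the same anchor date, which becomes the group key.
SELECT date, seqnum_d,
       date_add(date, interval -seqnum_d day) AS grp
FROM (SELECT date, dense_rank() OVER (ORDER BY date) AS seqnum_d FROM t) t;
-- date        seqnum_d  grp
-- 2017-04-25     1      2017-04-24
-- 2017-04-26     2      2017-04-24
-- 2017-04-27     3      2017-04-24
-- 2017-05-02     4      2017-04-28   <- gap starts a new constant
-- 2017-05-03     5      2017-04-28
```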