Hive SQL query: get last week, last month and last three months in one query

I have the following Hive SQL query.
I would like to use the same query to get:
the last week of data
the last month
the last three months
Ideally I would like it done in one query. Is that possible?
SELECT
'global_authenticated' AS experiment_type,
--experiment data
experiment_name,
variant_name,
MIN(first_date) AS min_date,
COUNT(DISTINCT visid)
FROM dwo_analysis.spark_global_authenticated_experiment_dashboard_report_activity c
GROUP BY experiment_name, variant_name;

You just need to read from the table three times, label each result set as "last week", "last month" or "last 3 months", and then UNION ALL the three result sets. See:
SELECT *
FROM (
SELECT 'global_authenticated' AS experiment_type
,
--experiment data
experiment_name
,variant_name
,MIN(first_date) AS min_date
,COUNT(DISTINCT visid)
,"last week" AS date_extract
FROM dwo_analysis.spark_global_authenticated_experiment_dashboard_report_activity c
WHERE first_date >= date_sub(current_date, 7) -- assumes first_date is the date column to filter on
GROUP BY experiment_name
,variant_name
UNION ALL
SELECT 'global_authenticated' AS experiment_type
,
--experiment data
experiment_name
,variant_name
,MIN(first_date) AS min_date
,COUNT(DISTINCT visid)
,"last month" AS date_extract
FROM dwo_analysis.spark_global_authenticated_experiment_dashboard_report_activity c
WHERE first_date >= add_months(current_date, -1) -- assumes first_date is the date column to filter on
GROUP BY experiment_name
,variant_name
) first_union
UNION ALL
SELECT 'global_authenticated' AS experiment_type
,
--experiment data
experiment_name
,variant_name
,MIN(first_date) AS min_date
,COUNT(DISTINCT visid)
,"last 3 months" AS date_extract
FROM dwo_analysis.spark_global_authenticated_experiment_dashboard_report_activity c
WHERE first_date >= add_months(current_date, -3) -- assumes first_date is the date column to filter on
GROUP BY experiment_name
,variant_name
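For comparison, the three windows can also be computed in a single pass over the table with conditional aggregation instead of three UNION ALL branches. The sketch below is only an illustration: it runs against an in-memory SQLite database through Python's sqlite3 module rather than Hive, and the table name activity, the column activity_date and the sample rows are assumptions, not part of the original schema. In Hive the cutoffs would be written with functions such as date_sub and add_months.

```python
import sqlite3


def window_counts(rows, today):
    """rows: (experiment_name, variant_name, visid, activity_date) tuples.

    Returns one row per experiment/variant with distinct visitor counts
    for the last week, last month and last three months before `today`.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE activity ("
        "experiment_name TEXT, variant_name TEXT, visid TEXT, activity_date TEXT)"
    )
    con.executemany("INSERT INTO activity VALUES (?, ?, ?, ?)", rows)
    sql = """
        SELECT experiment_name,
               variant_name,
               -- CASE yields NULL outside the window, and COUNT ignores NULLs
               COUNT(DISTINCT CASE WHEN activity_date >= date(:today, '-7 days')
                                   THEN visid END) AS last_week,
               COUNT(DISTINCT CASE WHEN activity_date >= date(:today, '-1 month')
                                   THEN visid END) AS last_month,
               COUNT(DISTINCT CASE WHEN activity_date >= date(:today, '-3 months')
                                   THEN visid END) AS last_3_months
        FROM activity
        -- the widest window bounds the scan
        WHERE activity_date >= date(:today, '-3 months')
        GROUP BY experiment_name, variant_name
        ORDER BY experiment_name, variant_name
    """
    return con.execute(sql, {"today": today}).fetchall()


# Hypothetical sample data: ISO date strings compare correctly as text.
sample = [
    ("exp", "A", "v1", "2021-06-29"),
    ("exp", "A", "v2", "2021-06-10"),
    ("exp", "A", "v1", "2021-04-15"),
    ("exp", "A", "v3", "2021-04-01"),
    ("exp", "A", "v4", "2021-01-01"),
]
result = window_counts(sample, "2021-06-30")
print(result)
```

This returns the three counts as columns of one row per group rather than as three labelled rows, which may or may not suit a dashboard better than the UNION ALL layout.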

Related

Cumulative count over weeks in SQL

I have a table of items with owner ids that reference users in a users table.
I want to show, for each week (group by week), how many items each user created that week plus all the items created before it, i.e. a cumulative count.
For this table:
id  owner  created
==  =====  ============
1   xxxxx  '2021-01-01'
2   xxxxx  '2021-01-01'
3   xxxxx  '2021-01-09'
I want to get:
count  owner  week
=====  =====  ===========================
2      xxxxx  '2021-01-01' - '2021-01-07'
3      xxxxx  '2021-01-08' - '2021-01-14'
This is code for non-cumulative count. How can I change it to be cumulative?
select
count(*),
uu.id,
date_trunc('week', CAST(it.created AS timestamp)) as week
from items it
left join users uu on uu.id = it.owner_id
group by uu.id, week
I'm a little confused by your query:
You have a left join from items to users as if you expect some items with no valid user id.
You are using uu.id in the select, but that would be NULL with no match.
I would suggest:
select it.owner_id,
date_trunc('week', it.created::timestamp) as week_start,
date_trunc('week', it.created::timestamp) + interval '6 day' as week_end,
count(*) as this_week,
sum(count(*)) over (partition by it.owner_id order by min(it.created::timestamp)) as running_count
from items it
group by it.owner_id, week_start;
This uses Postgres syntax because your code looks like Postgres.
Remove user id from the GROUP BY clause and from SELECT list:
select
count(*),
date_trunc('week', CAST(it.created AS timestamp)) as week
from items it
left join users uu on uu.id = it.owner_id
group by week
Here's a small runnable sample (SQL Server); maybe it will help:
create table #temp (week int, cnt int)
select * from #temp
insert into #temp select 1,2
insert into #temp select 1,1
insert into #temp select 2,3
insert into #temp select 3,3
select
week,
sum(count(*)) over (order by week) as runningCnt
from #temp
group by week
The output is:
week - runningCnt
1 - 2
2 - 3
3 - 4
So the 1st week had 2 rows, and each of the following two weeks added one more.
You could also do a cumulative sum of the values in the cnt column with sum(sum(cnt)) over (order by week), which gives 3, 6, 9 for this data.
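To check the numbers, here is a minimal sketch that loads the same four rows into an in-memory SQLite database via Python's sqlite3 module (an assumption made for runnability; the sample above is SQL Server). It aggregates per week in a CTE first and then applies the window function, which is equivalent to nesting the aggregate inside the window function. The table name temp stands in for #temp.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temp (week INTEGER, cnt INTEGER)")
con.executemany(
    "INSERT INTO temp VALUES (?, ?)",
    [(1, 2), (1, 1), (2, 3), (3, 3)],
)

running = con.execute("""
    WITH weekly AS (
        -- one row per week: number of rows and sum of the cnt column
        SELECT week, COUNT(*) AS n_rows, SUM(cnt) AS total_cnt
        FROM temp
        GROUP BY week
    )
    SELECT week,
           SUM(n_rows)    OVER (ORDER BY week) AS running_rows,
           SUM(total_cnt) OVER (ORDER BY week) AS running_cnt
    FROM weekly
    ORDER BY week
""").fetchall()

for week, running_rows, running_cnt in running:
    print(week, running_rows, running_cnt)
```

For a per-owner running count, adding PARTITION BY owner inside each OVER clause restarts the total for every owner.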

Running total with Over

I'm trying to create a running total of the number of files opened per day so I can use the data for a graph showing cumulative results.
The data is basically the file opening date, a calculated field showing 'This month' or 'Last Month' depending on the date and the running total field that I'm trying to figure out.
Date Month Count
==== ===== =====
2019-08-01 Last Month 6
2019-08-02 Last Month 2
2019-08-03 Last Month 5
I want to have a running total...so 6, 8, 13 etc
But all I'm getting is a row count (1,2,3 etc) for my count field.
select
FileDate,
Month,
sum(Count) OVER(PARTITION BY month order by Filedate) as 'Count'
from (
select
1 as 'Count',
Case
When month(cast(concat(right(d.var_val,4),substring(d.var_val,4,2),left(d.var_val,2)) as DATE) ) = Month(getdate()) then 'This Month'
else 'Last Month'
end as 'Month'
FROM data d
left join otherdata m on d.VAR_FileID = m.MAT_FileID
left join otherdata u on m.MAT_Fee_Earner = u.User_ID
left join otherdata br on m.MAT_BranchID = br.BR_ID
WHERE d.var_no IN ( '1628' )
and Len(var_val) = 10
)files
where Month(FileDate) in (MONTH(FileDate()),MONTH(getDate())-1)
and Year(Filedate) = Year(Getdate())
and Dept = 'Peterborough Property'
group by Month, FileDate, count
GO
I'm assuming I've not quite grasped the proper usage of 'OVER' - any pointers would be great!
The PARTITION BY clause determines when the running total resets, so partitioning by month gives you a separate running total for each discrete month. To get a running total over the whole dataset, drop the PARTITION BY clause and keep only the ORDER BY clause.
Hopefully the OVER clause is clear now (see Sentinel's answer). Replace the column as follows so that the count keeps increasing across all rows from the subquery, in the order given by the ORDER BY clause:
sum(Count) OVER (ORDER BY FileDate) as [Count]
-- or
sum(Count) OVER (ORDER BY FileDate DESC) as [Count]
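The difference between the two forms is easy to see side by side. This is a rough sketch on made-up rows, run in SQLite through Python's sqlite3 module rather than SQL Server; the table daily and its columns are hypothetical names chosen for the illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (file_date TEXT, month_label TEXT, n INTEGER)")
con.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [
        ("2019-08-30", "Last Month", 6),
        ("2019-08-31", "Last Month", 2),
        ("2019-09-01", "This Month", 5),
        ("2019-09-02", "This Month", 4),
    ],
)

totals = con.execute("""
    SELECT file_date,
           -- restarts at the first row of every month_label
           SUM(n) OVER (PARTITION BY month_label ORDER BY file_date) AS per_month,
           -- never restarts: running total over the whole result set
           SUM(n) OVER (ORDER BY file_date) AS overall
    FROM daily
    ORDER BY file_date
""").fetchall()

for row in totals:
    print(row)
```

The per_month column drops back to 5 on the first September row, while overall keeps climbing.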

Group By - select by a criteria that is met every month

The below query returns all USERS that have SUM(AMOUNT) > 10 in a given month. It includes Users in a month even if they don't meet the criteria in other months.
But I'd like to transform this query to return all USERS who must meet the criteria SUM(AMOUNT) > 10 every single month (i.e., from the first month in the table to the last one) across the entire data.
Put another way, exclude users who don't meet SUM(AMOUNT) > 10 every single month.
select USERS, to_char(transaction_date, 'YYYY-MM') as month
from Table
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10;
One approach uses a generated calendar table representing all months in your data set. We can left join this calendar table to your current query, and then aggregate over all months by user:
WITH months AS (
SELECT DISTINCT TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
),
cte AS (
SELECT USERS, TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10
)
SELECT
t.USERS
FROM months m
LEFT JOIN cte t
ON m.month = t.month
GROUP BY
t.USERS
HAVING
COUNT(t.USERS) = (SELECT COUNT(*) FROM months);
The HAVING clause above asserts that the number of months to which a user matches is in fact the total number of months. This would imply that the user meets the sum criteria for every month.
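The same counting idea can be tried out on a small data set. The sketch below is an assumption-laden illustration, not the query above verbatim: it uses SQLite through Python's sqlite3 module, strftime in place of TO_CHAR, a hypothetical table tx, and invented sample rows. It keeps a user only when the number of months with a sum above 10 equals the number of distinct months in the table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tx (users INTEGER, transaction_date TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO tx VALUES (?, ?, ?)",
    [
        (1, "2018-10-05", 8), (1, "2018-10-20", 5), (1, "2018-11-03", 12),
        (2, "2018-10-10", 20), (2, "2018-11-10", 4),   # November sum too low
        (3, "2018-10-15", 30),                          # no rows in November
    ],
)

every_month = con.execute("""
    WITH monthly AS (
        SELECT users,
               strftime('%Y-%m', transaction_date) AS month,
               SUM(amount) AS total
        FROM tx
        GROUP BY users, strftime('%Y-%m', transaction_date)
    )
    SELECT users
    FROM monthly
    WHERE total > 10
    GROUP BY users
    -- qualifying months must cover every month present in the table
    HAVING COUNT(*) = (SELECT COUNT(DISTINCT strftime('%Y-%m', transaction_date))
                       FROM tx)
    ORDER BY users
""").fetchall()

print(every_month)
```

Note that a user with no transactions at all in some month is excluded here, just as a missing month counts against the user in the calendar-table version.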
Perhaps you could use a correlated subquery, such as:
select t.*
from (select distinct table.users from table) t
where not exists
(
select to_char(u.transaction_date, 'YYYY-MM') as month
from table u
where u.users = t.users
group by month
having sum(u.amount) <= 10
)
One option would be to use sign(amount-10) vs. sign(amount) logic (note that this compares each individual amount with 10 rather than the monthly sum):
SELECT q.users
FROM
(
with tab(users, transaction_date,amount) as
(
select 1,date'2018-11-24',8 union all
select 1,date'2018-11-24',18 union all
select 2,date'2018-10-24',13 union all
select 3,date'2018-11-24',18 union all
select 3,date'2018-10-24',28 union all
select 3,date'2018-09-24', 3 union all
select 4,date'2018-10-24',28
)
SELECT users, to_char(transaction_date, 'YYYY-MM') as month,
sum(sign(amount-10)) as cnt1,
sum(sign(amount)) as cnt2
FROM tab t
GROUP BY users, month
) q
GROUP BY q.users
HAVING sum(q.cnt1) = sum(q.cnt2);
users
-----
2
4
You need to compare the number of months > 10 to the number of months between the min and the max date:
SELECT users, Count(flag) AS months, Min(mth), Max(mth)
FROM
(
SELECT users, date_trunc('month',transaction_date) AS mth,
CASE WHEN Sum(amount) > 10 THEN 1 end AS flag
FROM tab t
GROUP BY users, mth
) AS dt
GROUP BY users
HAVING -- adding the number of months > 10 to the min date and compare to max
Min(mth) + (INTERVAL '1' MONTH * (Count(flag)-1)) = Max(mth)
If missing months don't count it would be a simple count(flag) = count(*)

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the range of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' are the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the top 10 for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, more of them will be categorized as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
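A compact variation on the ranking steps above, as a hedged sketch only: it swaps dense_rank for ROW_NUMBER so that ties on invoice_date cannot push a group past N rows, flags "first" and "last" independently so a row in a short month can appear in both lists, and assembles the arrays in Python instead of ARRAY_AGG. It runs on SQLite through Python's sqlite3 module; the function name first_and_last and the sample invoices are hypothetical.

```python
import sqlite3


def first_and_last(invoices, n):
    """invoices: (id, invoice_date) tuples.

    Returns {month: {"first": [ids], "last": [ids]}} with at most n ids each,
    both lists in chronological order.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, invoice_date TEXT)")
    con.executemany("INSERT INTO invoice VALUES (?, ?)", invoices)
    sql = """
        WITH ranked AS (
            SELECT id,
                   strftime('%Y-%m', invoice_date) AS invoice_month,
                   ROW_NUMBER() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                      ORDER BY invoice_date, id) AS rn_first,
                   ROW_NUMBER() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                      ORDER BY invoice_date DESC, id DESC) AS rn_last
            FROM invoice
        )
        SELECT invoice_month, id, rn_first <= :n AS is_first, rn_last <= :n AS is_last
        FROM ranked
        WHERE rn_first <= :n OR rn_last <= :n
        ORDER BY invoice_month, rn_first
    """
    result = {}
    for month, inv_id, is_first, is_last in con.execute(sql, {"n": n}):
        bucket = result.setdefault(month, {"first": [], "last": []})
        if is_first:
            bucket["first"].append(inv_id)
        if is_last:
            bucket["last"].append(inv_id)
    return result


# Hypothetical data: five invoices in January, three in February.
sample = [(i, f"2018-01-0{i}") for i in range(1, 6)] + [
    (6, "2018-02-10"), (7, "2018-02-11"), (8, "2018-02-12"),
]
picked = first_and_last(sample, 2)
print(picked)
```

With N = 2, February has only three invoices, so invoice 7 lands in both its "first" and its "last" list.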

Multiple SELECT on the same field in one statement

I got the following table:
**stats**
id INT FK
day INT
value INT
I would like to create an SQL query that sums the value column over the last day, last week and last month, in one statement.
So far I have this:
select sum(value) from stats as A where A.day > now() - 1
union
select sum(value) from stats as B where B.day > now() - 7
union
select sum(value) from stats as C where C.day > now() - 30
This returns just the first sum(value); I was expecting three values.
Running select sum(value) from stats as A where A.day > now() - X (where X = 1, 7 or 30) as separate queries works as it should.
What's wrong with the query? Thanks!
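UNION implicitly applies DISTINCT, so rows with identical values are merged into one. If the three sums happen to be equal, only a single row comes back. Use UNION ALL instead, like so: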
SELECT 'last day' ItemType, sum(value) FROM stats as A WHERE A.day > now() - 1
UNION ALL
SELECT 'last week', SUM(value) FROM stats as B WHERE B.day > now() - 7
UNION ALL
SELECT 'last month', SUM(value) FROM stats as C WHERE C.day > now() - 30
Note that I added a new column, ItemType, to indicate whether each sum is for the last day, last week or last month.
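The collapsing behaviour can be reproduced with a tiny data set. This is only a sketch under assumptions: it uses SQLite via Python's sqlite3 module, treats day as a plain integer day number, and passes a hypothetical today value as a parameter instead of calling now(). Every row falls inside all three windows, so the three sums are equal.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stats (id INTEGER, day INTEGER, value INTEGER)")
con.executemany("INSERT INTO stats VALUES (?, ?, ?)", [(1, 10, 5), (2, 10, 7)])
params = {"today": 10}  # hypothetical "current day" number

# UNION removes duplicate rows, so three equal sums become one row.
union_rows = con.execute("""
    SELECT SUM(value) FROM stats WHERE day > :today - 1
    UNION
    SELECT SUM(value) FROM stats WHERE day > :today - 7
    UNION
    SELECT SUM(value) FROM stats WHERE day > :today - 30
""", params).fetchall()

# UNION ALL keeps every branch; the label column tells the rows apart.
union_all_rows = con.execute("""
    SELECT 'last day' AS item_type, SUM(value) FROM stats WHERE day > :today - 1
    UNION ALL
    SELECT 'last week', SUM(value) FROM stats WHERE day > :today - 7
    UNION ALL
    SELECT 'last month', SUM(value) FROM stats WHERE day > :today - 30
""", params).fetchall()

print(union_rows)
print(union_all_rows)
```

Even with plain UNION, the label column alone would keep the three rows distinct; UNION ALL additionally skips the duplicate-elimination step.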