Hive SQL: merge multiple aggregate queries into one

I have a series of queries like:
select count(distinct userId) from table where hour >= 0 and hour <= 0;
select count(distinct userId) from table where hour >= 0 and hour <= 1;
select count(distinct userId) from table where hour >= 0 and hour <= 2;
...
select count(distinct userId) from table where hour >= 0 and hour <= 14;
Is there a way to merge them into one query?

It looks like you are trying to keep a cumulative distinct count, bracketed by hour. You can do that with a window function, like this (name here stands in for your userId column):
SELECT DISTINCT
A.hour AS hour,
SUM(COALESCE(M.include, 0)) OVER (ORDER BY A.hour) AS cumulative_count
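FROM ( -- get each distinct (name, hour) pair, with 0 for include;
-- DISTINCT keeps a name with several records in its first hour from being counted more than once
SELECT DISTINCT
name,
hour,
0 AS include
FROM
table
) A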
LEFT JOIN
( -- get the record with lowest `hour` for each `name`, and 1 for include
SELECT
name,
MIN(hour) AS hour,
1 AS include
FROM
table
GROUP BY
name
) M
ON M.name = A.name
AND M.hour = A.hour
;
There might be a simpler way, but this should yield the correct answer in general.
Explanation:
This uses 2 subqueries against the same input table, with a derived field called include to keep track of which records should contribute to the final total for each bucket. The first subquery takes the (name, hour) records in the table and assigns 0 AS include. The second subquery finds each unique name and the lowest hour slot in which that name appears, and assigns 1 AS include. The enclosing query LEFT JOINs the 2 subqueries.
The outermost query does a COALESCE(M.include, 0) to fill in any NULLs produced by the LEFT JOIN, and those 1s and 0s are SUM'ed over a window ordered by hour. This needs to be a SELECT DISTINCT rather than a GROUP BY because a GROUP BY would want both hour and include listed, and it would end up collapsing every record in a given hour group into a single row (still with include=1). The DISTINCT is applied after the SUM, so it removes duplicate output rows without discarding any input rows.
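To make the "count each user once, in their first hour, then take a running sum" idea concrete, here is a minimal sketch run through Python's sqlite3 module. It is an illustration under assumptions, not the exact query above: the events table and its rows are hypothetical, SQLite stands in for Hive (window functions need SQLite 3.25 or later), and the (userId, hour) pairs are deduplicated first so a user with several records in their first hour is counted only once.

```python
import sqlite3

# Hypothetical sample data: (userId, hour). User "a" has two records in hour 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userId TEXT, hour INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 0), ("a", 0), ("b", 1), ("a", 2), ("c", 2), ("b", 3)],
)

# Flag each (userId, hour) pair with 1 if it is the user's first hour,
# then take a running SUM of the flags ordered by hour.
cumulative = conn.execute(
    """
    SELECT DISTINCT
        hour,
        SUM(is_first) OVER (ORDER BY hour) AS cumulative_count
    FROM (
        SELECT
            e.hour,
            CASE WHEN e.hour = f.first_hour THEN 1 ELSE 0 END AS is_first
        FROM (SELECT DISTINCT userId, hour FROM events) AS e
        JOIN (
            SELECT userId, MIN(hour) AS first_hour
            FROM events
            GROUP BY userId
        ) AS f ON f.userId = e.userId
    ) AS flagged
    ORDER BY hour
    """
).fetchall()

# Cross-check against the original one-query-per-hour formulation.
expected = [
    (
        h,
        conn.execute(
            "SELECT COUNT(DISTINCT userId) FROM events WHERE hour >= 0 AND hour <= ?",
            (h,),
        ).fetchone()[0],
    )
    for h in (0, 1, 2, 3)
]
print(cumulative)
```

Only hours that actually occur in the table produce a row here; an hour with no records at all would need a separate list of hours to join against.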

Related

Hive SQL nested query use similar column

I have a query that includes two subqueries that share the same column, day. I would like to show the values in the following way:
day cnt1 cnt_total
But my query does not recognize that the day column is the same in both subqueries, so it pairs every row of the first nested statement with every row of the second.
Is there a way to make it match the rows on day?
The query looks as follows:
SELECT p1.day, p1.count AS cnt1, p2.count AS cnt_total
FROM
(
SELECT day, COUNT(DISTINCT id) AS count FROM table
WHERE 1=1
AND service="service"
AND action="action"
AND path LIKE "%search%"
AND year="2021"
GROUP BY day
) p1,
(
SELECT day, COUNT(DISTINCT id) AS count FROM table
WHERE 1=1
AND service="service"
AND action="action"
AND year="2021"
GROUP BY day
) p2;
You should be able to do this with conditional aggregation, so only one SELECT is needed:
SELECT day,
COUNT(DISTINCT CASE WHEN action = 'mousedown' AND data["path"] LIKE '%go-to-latest-search%' THEN gsid END) AS cnt1,
COUNT(DISTINCT CASE WHEN action = 'impress' THEN gsid END) as cnt_total
FROM hit
WHERE service = 'sauto' AND
year = '2021' AND
month = '07'
GROUP BY day
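As a rough illustration of conditional aggregation, the sketch below runs a simplified version through Python's sqlite3 module. The hit table, its columns (day, action, path, id) and the sample rows are assumptions made for the example; they follow the shape of the question's query rather than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit (day TEXT, action TEXT, path TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO hit VALUES (?, ?, ?, ?)",
    [
        ("01", "action", "/search/results", 1),
        ("01", "action", "/search/results", 1),  # same id again, counted once
        ("01", "action", "/home", 2),
        ("02", "action", "/home", 3),
    ],
)

# One pass over the table: the CASE yields NULL for rows that do not match,
# and COUNT(DISTINCT ...) ignores NULLs, so each column has its own filter.
per_day = conn.execute(
    """
    SELECT day,
           COUNT(DISTINCT CASE WHEN path LIKE '%search%' THEN id END) AS cnt1,
           COUNT(DISTINCT id) AS cnt_total
    FROM hit
    WHERE action = 'action'
    GROUP BY day
    ORDER BY day
    """
).fetchall()
print(per_day)
```

Because both counts come from the same GROUP BY, there is exactly one row per day and no join between subqueries to get wrong.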

PostgreSQL count how many transactions in the last 7 days

Hi I'm a SQL noobie and have been working on this problem for hours on end.
I have a table of transactions and the field txnDate is of date data type. I've altered the table to add a column called txnLast7days which should count how many transactions exist in the table within the last 7 days of txnDate.
What statement can I use to update all the table records at once, so that it counts the number of transactions within a 7-day period based on txnDate and stores the result in the txnLast7days column for each row?
This is the statement I'm currently using based on a suggestion, but I'm still not getting the right result.
UPDATE temp2
SET txnLast7Days = subquery.txnLast7Days
FROM
(
SELECT txnDate, sum(dateCounts.transactionCount) OVER (ORDER BY txnDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) as txnLast7Days
FROM (SELECT count(*) transactionCount, txnDate FROM temp2 GROUP BY txnDate) as dateCounts
) subquery
WHERE temp2.txnDate = subquery.txnDate
My current query is not updating txnLast7Days with the right count.
What you need to do is get a count for each txnDate and then get the rolling 7 day count for each txnDate.
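The former is done with a simple COUNT(*) and GROUP BY on your table. The latter is done with a window function that looks back over the 7 preceding rows plus the current one, ordered by txnDate, and sums those counts up. Note that ROWS counts rows (one per distinct txnDate here), not calendar days, so the frame covers up to 8 distinct dates, and those dates span more than a week whenever some days have no transactions.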
You can then use those results in an UPDATE query to populate your new column.
UPDATE yourtable
SET txnLast7Days = subquery.txnLast7Days
FROM
(
SELECT txnDate, sum(dateCounts.transactionCount) OVER (ORDER BY txnDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) as txnLast7Days
FROM (SELECT count(*) transactionCount, txnDate FROM yourtable GROUP BY txnDate) as dateCounts
) subquery
WHERE yourtable.txnDate = subquery.txnDate
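Since a ROWS frame counts rows rather than calendar days, a date-based variant may be closer to what was asked. The sketch below is one possible approach, not the query above: it uses a correlated subquery in SQLite via Python's sqlite3 module, assumes "last 7 days" means the row's txnDate plus the 6 days before it, and uses made-up sample dates stored as ISO strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp2 (txnDate TEXT, txnLast7Days INTEGER)")
conn.executemany(
    "INSERT INTO temp2 (txnDate) VALUES (?)",
    [("2021-01-01",), ("2021-01-01",), ("2021-01-03",), ("2021-01-10",)],
)

# For every row, count the transactions whose date falls in the
# 7-day calendar window ending on that row's txnDate (assumed inclusive).
conn.execute(
    """
    UPDATE temp2
    SET txnLast7Days = (
        SELECT COUNT(*)
        FROM temp2 AS t2
        WHERE t2.txnDate BETWEEN date(temp2.txnDate, '-6 days') AND temp2.txnDate
    )
    """
)

result = conn.execute(
    "SELECT txnDate, txnLast7Days FROM temp2 ORDER BY txnDate, rowid"
).fetchall()
print(result)
```

The row for 2021-01-10 gets 1 because no other transaction falls between 2021-01-04 and 2021-01-10, whereas a frame based on preceding rows would also pull in the January 1 and January 3 counts. In PostgreSQL 11 or later, a RANGE frame with an interval offset is another way to express a calendar window.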

How to filter records by their count per date?

I have a table 'A' that has a date column, and the same date can appear in several records. I'm trying to filter the records where the number of records per day is less than 5, while still keeping all the fields of the table.
I mean that if I have only 4 records on 11/10/2017, I need to filter all 4 of these records.
You can select them with a subquery. In the subquery, group by the date column and use HAVING with an aggregated count to find how many records each date group has, then select all rows whose date has a count lower than 5:
SELECT *
FROM A
WHERE A.date in (SELECT subA.date
FROM A subA
GROUP BY subA.date
HAVING COUNT(*) < 5 );
Take Care's answer is good. Alternatively, you can use an analytic/windowing function. I'd benchmark both and see which one works better.
with cte as (
select *, count(1) over (partition by date) as cnt
from table_a
)
select *
from cte
where cnt < 5
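To compare the two approaches side by side, here is a small sketch using Python's sqlite3 module. The table layout (id, txn_date) and the sample rows are hypothetical, and SQLite is only a stand-in for whichever database is actually in use; the point is that both queries keep the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER, txn_date TEXT)")
# Five records on one date, two on another (made-up data).
rows = [(i, "2017-10-10") for i in range(1, 6)] + [(6, "2017-10-11"), (7, "2017-10-11")]
conn.executemany("INSERT INTO A VALUES (?, ?)", rows)

# Approach 1: subquery with GROUP BY / HAVING.
via_subquery = conn.execute(
    """
    SELECT id, txn_date
    FROM A
    WHERE txn_date IN (
        SELECT subA.txn_date
        FROM A AS subA
        GROUP BY subA.txn_date
        HAVING COUNT(*) < 5
    )
    ORDER BY id
    """
).fetchall()

# Approach 2: window function that attaches the per-date count to every row.
via_window = conn.execute(
    """
    WITH cte AS (
        SELECT id, txn_date, COUNT(1) OVER (PARTITION BY txn_date) AS cnt
        FROM A
    )
    SELECT id, txn_date
    FROM cte
    WHERE cnt < 5
    ORDER BY id
    """
).fetchall()
print(via_subquery, via_window)
```

Which one is faster depends on the engine and the data, so benchmarking on the real table is still the way to decide.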

How to count rows in SQL Server 2012?

I am trying to find whether a person (id = A3) is continuously active in a program for at least five months in a given year (2013). Any suggestions would be appreciated. My data looks as follows:
You simply use group by and a conditional expression:
select id,
(case when count(ActiveMonthYear) >= 5 then 'YES!' else 'NAW' end)
from table t
where ListOfTheMonths between '201301' and '201312'
group by id;
EDIT:
I suppose "continuously" doesn't just mean any five months. For that, there are various ways. I like the difference of row numbers approach:
select distinct id
from (select t.*,
(row_number() over (partition by id order by ListOfTheMonths) -
count(ActiveMonthYear) over (partition by id order by ListOfTheMonths)
) as grp
from table t
where ListOfTheMonths between '201301' and '201312'
) t
where ActiveMonthYear is not null
group by id, grp
having count(*) >= 5;
The difference in the subquery is constant for groups of consecutive active months. This is then used for grouping. The result is a list of all ids that meet these criteria. You can add a where for a particular id (do it in the subquery).
By the way, this is written using select distinct and group by. This is one of the rare cases where these two are appropriately used together. A single id could have two periods of five months in the same year. There is no reason to include that person twice in the result set.
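The grouping trick is easier to see on data. The sketch below runs the same idea in SQLite through Python's sqlite3 module. It rests on an assumption about the (missing) sample data: one row per id and month, with ActiveMonthYear set to NULL for inactive months. The activity table name and both people's histories are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (id TEXT, ListOfTheMonths TEXT, ActiveMonthYear TEXT)"
)

months = [f"2013{m:02d}" for m in range(1, 13)]
a3_active = set(months[1:6])                       # 201302..201306: 5 in a row
b1_active = set(months[0:3]) | set(months[4:8])    # runs of 3 and 4, 7 in total
rows = []
for person, active in (("A3", a3_active), ("B1", b1_active)):
    for m in months:
        rows.append((person, m, m if m in active else None))
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

# row_number() grows on every row; the running COUNT grows only on active rows.
# Their difference therefore stays constant across a run of consecutive active months.
continuous = conn.execute(
    """
    SELECT DISTINCT id
    FROM (
        SELECT id,
               ActiveMonthYear,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY ListOfTheMonths)
               - COUNT(ActiveMonthYear) OVER (PARTITION BY id ORDER BY ListOfTheMonths)
                 AS grp
        FROM activity
        WHERE ListOfTheMonths BETWEEN '201301' AND '201312'
    ) AS t
    WHERE ActiveMonthYear IS NOT NULL
    GROUP BY id, grp
    HAVING COUNT(*) >= 5
    """
).fetchall()
print(continuous)
```

B1 is active in seven months overall but never five in a row, so only A3 is returned; the simple count-based query would have accepted both.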

SQL GROUP BY ( DATEPART(), field1 ) result set to zero nulls

I want to aggregate counts, grouped by a datepart and column.
For example, a table with 3 columns with each row representing a unique event: id, name, date
I want to select total counts grouped by name and hour, with zeros when there are no events. If I'm only grouping by name, I can join it with a table of every name. With an hour I could do something similar.
How would I handle the case of grouping by both without having a table with a row for every name+hour combination?
The following is the MySQL solution:
create table hours (hour int);
insert into hours (hour) values (0), (1) .... (23);
select hour, name, sum(case when name is null then 0 else 1 end)
from hours left outer join
event on (hour(event.date) = hours.hour)
group by hour, name;
The sum(case when name is null then 0 else 1 end) handles the case when there are no events for a particular hour and name: the count will show as 0. For the others, each matching row contributes 1 to the sum.
For SQL Server, use datepart(hour, event.date) instead. The rest should be similar.
You can use cross join to generate all the rows and then other logic to fill in the values:
select h.hour, n.name, count(a.name) as cnt
from (select distinct hour(date) as hour from atable) h cross join
(select distinct name from atable) n left join
atable a
on hour(a.date) = h.hour and a.name = n.name
group by h.hour, n.name;
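Here is a small runnable sketch of the cross join idea, using Python's sqlite3 module. SQLite has no hour() function, so strftime('%H', ...) is used instead; that substitution, the atable layout, and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (id INTEGER, name TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO atable VALUES (?, ?, ?)",
    [
        (1, "alice", "2021-07-01 10:05:00"),
        (2, "alice", "2021-07-01 10:40:00"),
        (3, "bob", "2021-07-01 11:15:00"),
    ],
)

# Build every (hour, name) combination first, then LEFT JOIN the events onto it.
# COUNT(a.name) skips the NULLs from unmatched combinations, which gives the zeros.
counts = conn.execute(
    """
    SELECT h.hour, n.name, COUNT(a.name) AS cnt
    FROM (
        SELECT DISTINCT CAST(strftime('%H', atable.date) AS INTEGER) AS hour
        FROM atable
    ) AS h
    CROSS JOIN (SELECT DISTINCT name FROM atable) AS n
    LEFT JOIN atable AS a
        ON CAST(strftime('%H', a.date) AS INTEGER) = h.hour
       AND a.name = n.name
    GROUP BY h.hour, n.name
    ORDER BY h.hour, n.name
    """
).fetchall()
print(counts)
```

Only hours that appear somewhere in the data are listed. To report all 24 hours, the first derived table could be swapped for an hours table like the one in the MySQL answer.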