Hive SQL nested query use similar column - sql

I have a query that includes two subqueries that share the same column, 'day'. I would like to show the values in the following way:
day cnt1 cnt_total
But the query I have does not recognize that the day column is the same in both subqueries, so it pairs every row of the first nested statement with every row of the second (a cross join).
Is there a way to make it recognize that the day column is the same?
The query looks as follows:
SELECT p1.day, p1.count AS cnt1, p2.count AS cnt_total
FROM
(
SELECT day, COUNT(DISTINCT id) AS count FROM table
WHERE 1=1
AND service="service"
AND action="action"
AND path LIKE "%search%"
AND year="2021"
GROUP BY day
) p1,
(
SELECT day, COUNT(DISTINCT id) AS count FROM table
WHERE 1=1
AND service="service"
AND action="action"
AND year="2021"
GROUP BY day
) p2;

You should be able to do this with conditional aggregation, so only one SELECT is needed:
SELECT day,
COUNT(DISTINCT CASE WHEN action = 'mousedown' AND data["path"] LIKE '%go-to-latest-search%' THEN gsid END) AS count,
COUNT(DISTINCT CASE WHEN action = 'impress' THEN gsid END) as cnt_total
FROM hit
WHERE service = 'sauto' AND
year = '2021' AND
month = '07'
GROUP BY day
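As a rough, runnable illustration of the single-pass idea, the sketch below loads a few made-up rows into an in-memory SQLite database from Python and applies conditional aggregation using the column names from the question (day, id, service, action, path, year). The table name events and all sample rows are assumptions for illustration only, and Hive syntax may differ slightly from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(day TEXT, id TEXT, service TEXT, action TEXT, path TEXT, year TEXT)"
)
# Hypothetical sample data: u3 appears twice on day 02, u9 uses another service.
sample_rows = [
    ("01", "u1", "service", "action", "/search/latest", "2021"),
    ("01", "u2", "service", "action", "/home", "2021"),
    ("01", "u3", "service", "action", "/home", "2021"),
    ("02", "u1", "service", "action", "/home", "2021"),
    ("02", "u3", "service", "action", "/search", "2021"),
    ("02", "u3", "service", "action", "/search", "2021"),
    ("02", "u4", "service", "action", "/search/latest", "2021"),
    ("02", "u9", "other", "action", "/search", "2021"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# One scan: the shared filters go in WHERE, the extra filter goes in CASE.
query = """
SELECT day,
       COUNT(DISTINCT CASE WHEN path LIKE '%search%' THEN id END) AS cnt1,
       COUNT(DISTINCT id) AS cnt_total
FROM events
WHERE service = 'service'
  AND action = 'action'
  AND year = '2021'
GROUP BY day
ORDER BY day
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because both counts come from the same grouped scan, each day produces exactly one row and no join condition on day is needed.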

Related

How to write SQL query without join?

Recently during an interview I was asked a question: if I have a table like the one below:
The requirement is: how many orders and how many shipments per day (based on date column) - output needs to be like this:
I have written the following code, but the interviewer asked me to write a SQL query without JOIN and UNION that achieves the same output.
SELECT
COALESCE(a.order_date, b.ship_date), orders, shipments
FROM
(SELECT
order_date, COUNT(1) AS orders
FROM
table
GROUP BY 1) a
FULL JOIN
(SELECT
ship_date, COUNT(1) AS shipments
FROM table
GROUP BY 1) b ON a.order_date = b.ship_date
Is this possible? Could you guys please advise?
You can use UNION ALL and GROUP BY with conditional aggregation as follows:
SELECT DATE_,
COUNT(CASE WHEN FLAG = 'ORDER' THEN 1 END) AS ORDERS,
COUNT(CASE WHEN FLAG = 'SHIP' THEN 1 END) AS SHIPMENTS
FROM (SELECT ORDER_DATE AS DATE_, 'ORDER' AS FLAG FROM YOUR_TABLE
UNION ALL
SELECT SHIP_DATE AS DATE_, 'SHIP' AS FLAG FROM YOUR_TABLE) T
GROUP BY DATE_
In BigQuery, I would express this as:
select date, countif(n = 0) as orders, countif(n = 1) as numships
from t cross join
unnest(array[order_date, ship_date]) date with offset n
group by 1
order by date;
The advantage of this approach (over union all) is two-fold. First, it only scans the table once. More importantly, the unnest() is all on the same node where the data resides -- so data does not need to be moved for the unpivot.
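To make the UNION ALL unpivot from the earlier answer concrete, here is a small sketch that runs it against an in-memory SQLite database from Python. The table name order_log and the three sample rows are assumptions made up for the example; the BigQuery unnest() variant is not reproduced here because SQLite has no equivalent syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_log (order_date TEXT, ship_date TEXT)")
# Hypothetical rows: one row per order, with its order date and ship date.
conn.executemany(
    "INSERT INTO order_log VALUES (?, ?)",
    [
        ("2021-01-01", "2021-01-02"),
        ("2021-01-01", "2021-01-03"),
        ("2021-01-02", "2021-01-03"),
    ],
)

# Unpivot both date columns into one column plus a flag, then count per flag.
query = """
SELECT date_,
       COUNT(CASE WHEN flag = 'ORDER' THEN 1 END) AS orders,
       COUNT(CASE WHEN flag = 'SHIP' THEN 1 END) AS shipments
FROM (SELECT order_date AS date_, 'ORDER' AS flag FROM order_log
      UNION ALL
      SELECT ship_date AS date_, 'SHIP' AS flag FROM order_log) t
GROUP BY date_
ORDER BY date_
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Dates that only appear as a ship date still get a row, with an orders count of 0, which is the behaviour the FULL JOIN in the question was trying to achieve.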

Sum of multiple select count distinct with case function

I am trying to sum the results of multiple select count distinct expressions that use a case function. For example:
SELECT id_dept,
count(DISTINCT case when e.statut='pub' then id_patients end) AS nb_patients_pub,
count(DISTINCT case when e.statut='priv' then id_patients end) AS nb_patients_priv
FROM venues
I would like to combine these two results into a single column.
Is it possible?
I think that you want IN:
SELECT
id_dept,
COUNT(DISTINCT CASE WHEN e.statut IN ('pub', 'priv') THEN id_patients END) AS nb_patients_pub_and_venues
FROM venues
GROUP BY id_dept
Note that I added a GROUP BY clause to the query, which was initially missing (this is a syntax error in almost all databases).
Depending on your data, this might not do exactly what you want; if a given id_patients value has both statuses, then it will be counted only once, whereas your code counted it once in each count(distinct ...). If so, then you can just keep the two separate counts, and sum them:
SELECT
id_dept,
COUNT(DISTINCT CASE WHEN e.statut = 'pub' THEN id_patients END)
+ COUNT(DISTINCT CASE WHEN e.statut = 'priv' THEN id_patients END)
AS nb_patients_pub_and_venues
FROM venues
GROUP BY id_dept
If you're happy with the current code, then either sum (using +) those counts, or use that query as a CTE (or an inline view) and add the two columns in the outer query:
with test as
(SELECT id_dept,
count(DISTINCT case when e.statut='pub' then id_patients end)
AS nb_patients_pub,
count(DISTINCT case when e.statut='priv' then id_patients end)
AS nb_patients_priv
FROM venues
GROUP BY id_dept
)
select id_dept, nb_patients_pub + nb_patients_priv as result
from test;
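The difference between the IN version and the sum of two separate counts only shows up when one patient has both statuses. The sketch below, using an in-memory SQLite database from Python, puts the two expressions side by side. The sample rows and the output aliases distinct_either and sum_of_counts are made up for illustration, and the e. alias is dropped because the example queries the table directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE venues (id_dept INTEGER, id_patients TEXT, statut TEXT)"
)
# Hypothetical data: p1 has both statuses, p3 has a duplicate 'priv' row,
# and p5 has a status that neither expression should count.
conn.executemany(
    "INSERT INTO venues VALUES (?, ?, ?)",
    [
        (75, "p1", "pub"),
        (75, "p1", "priv"),
        (75, "p2", "pub"),
        (75, "p3", "priv"),
        (75, "p3", "priv"),
        (13, "p4", "pub"),
        (13, "p5", "autre"),
    ],
)

query = """
SELECT id_dept,
       COUNT(DISTINCT CASE WHEN statut IN ('pub', 'priv')
                           THEN id_patients END) AS distinct_either,
       COUNT(DISTINCT CASE WHEN statut = 'pub' THEN id_patients END)
     + COUNT(DISTINCT CASE WHEN statut = 'priv' THEN id_patients END)
         AS sum_of_counts
FROM venues
GROUP BY id_dept
ORDER BY id_dept
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

For department 75 the two columns differ (3 versus 4) because p1 is counted once by the IN version and twice by the sum; which one is right depends on what the report is meant to show.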

SQL (BigQuery): How do i use a single value, derived with another query?

This is my query:
WITH last_transaction AS (
SELECT
month
FROM db.transactions
ORDER BY date DESC
LIMIT 1
)
SELECT
*
FROM db.transactions
-- WHERE month = last_transaction.month
WHERE month = 11
GROUP BY
id
The commented-out line doesn't work, but the intention is clear, I assume: I need to select transactions for the latest month. The business logic might not make sense, because I've extracted it from a bigger query. The main question is: how do I use a single value, derived with another query?
You have only one row, so you can use a scalar subquery:
SELECT t.*
FROM db.transactions t
WHERE month = (SELECT last_transaction.month FROM last_transaction);
I removed the GROUP BY id because it would be a syntax error in BigQuery and it logically does not make sense. Why would a column called id be duplicated in the table?
However, this query would often be written as:
SELECT t.*
FROM (SELECT t.*, MAX(month) OVER () as max_month
FROM db.transactions t
) t
WHERE month = max_month;
Try to JOIN the last_transaction on the month column (the CTE only exposes month, so that is the only column available to join on).
A bit like this:
SELECT t.*
FROM db.transactions t
JOIN last_transaction lt
ON t.month = lt.month
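The scalar-subquery and window-function answers above can be tried side by side with the sketch below, which uses an in-memory SQLite database from Python rather than BigQuery. The table is named transactions without the db. prefix, and the four sample rows are invented for the example. Note that the window version picks the largest month number, so it only agrees with the CTE when all rows fall in the same year, as they do here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, date TEXT, month INTEGER)")
# Hypothetical rows, all within one year.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2021-10-05", 10),
        (2, "2021-11-01", 11),
        (3, "2021-11-20", 11),
        (4, "2021-09-30", 9),
    ],
)

# Variant 1: the CTE returns a single row, used as a scalar subquery.
scalar_query = """
WITH last_transaction AS (
    SELECT month FROM transactions ORDER BY date DESC LIMIT 1
)
SELECT t.id
FROM transactions t
WHERE t.month = (SELECT month FROM last_transaction)
ORDER BY t.id
"""

# Variant 2: attach the maximum month to every row with a window function.
window_query = """
SELECT id
FROM (SELECT t.*, MAX(month) OVER () AS max_month
      FROM transactions t) t
WHERE month = max_month
ORDER BY id
"""

scalar_ids = [row[0] for row in conn.execute(scalar_query)]
window_ids = [row[0] for row in conn.execute(window_query)]
print(scalar_ids, window_ids)
```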

How to count rows in SQL Server 2012?

I am trying to find whether a person (id = A3) is continuously active in a program for at least five months in a given year (2013). Any suggestion would be appreciated. My data look as follows:
You simply use group by and a conditional expression:
select id,
(case when count(ActiveMonthYear) >= 5 then 'YES!' else 'NAW' end)
from table t
where ListOfTheMonths between '201301' and '201312'
group by id;
EDIT:
I suppose "continuously" doesn't just mean any five months. For that, there are various ways. I like the difference of row numbers approach:
select distinct id
from (select t.*,
(row_number() over (partition by id order by ListOfTheMonths) -
count(ActiveMonthYear) over (partition by id order by ListOfTheMonths)
) as grp
from table t
where ListOfTheMonths between '201301' and '201312'
) t
where ActiveMonthYear is not null
group by id, grp
having count(*) >= 5;
The difference in the subquery is constant for groups of consecutive active months. This is then used for grouping. The result is a list of all ids that meet this criterion. You can add a where for a particular id (do it in the subquery).
By the way, this is written using select distinct and group by. This is one of the rare cases where these two are appropriately used together. A single id could have two periods of five months in the same year. There is no reason to include that person twice in the result set.
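To see why the plain count and the difference-of-row-numbers query can disagree, the sketch below runs both against an in-memory SQLite database from Python (window functions need SQLite 3.25 or newer). It assumes, as the query above does, that every month has a row and that ActiveMonthYear is NULL for inactive months. The table name activity, the second person B1, and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (id TEXT, ListOfTheMonths TEXT, ActiveMonthYear TEXT)"
)
months = [f"2013{m:02d}" for m in range(1, 13)]
# A3: five consecutive active months. B1: six active months, longest run is 3.
active = {
    "A3": {"201302", "201303", "201304", "201305", "201306"},
    "B1": {"201301", "201302", "201304", "201305", "201306", "201308"},
}
rows = [
    (pid, m, m if m in act else None)
    for pid, act in active.items()
    for m in months
]
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

# Any five active months in the year, consecutive or not.
any_five_query = """
SELECT id
FROM activity
WHERE ListOfTheMonths BETWEEN '201301' AND '201312'
GROUP BY id
HAVING COUNT(ActiveMonthYear) >= 5
ORDER BY id
"""

# Five or more consecutive active months: row number minus running count
# of active months stays constant inside each unbroken run.
consecutive_query = """
SELECT DISTINCT id
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY ListOfTheMonths)
           - COUNT(ActiveMonthYear) OVER (PARTITION BY id ORDER BY ListOfTheMonths)
               AS grp
      FROM activity t
      WHERE ListOfTheMonths BETWEEN '201301' AND '201312') t
WHERE ActiveMonthYear IS NOT NULL
GROUP BY id, grp
HAVING COUNT(*) >= 5
ORDER BY id
"""

any_five = [r[0] for r in conn.execute(any_five_query)]
consecutive = [r[0] for r in conn.execute(consecutive_query)]
print(any_five, consecutive)
```

With this sample data, B1 passes the simple count but not the consecutive check, since its active months are split into runs of two, three, and one.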

Hive SQL aggregate merge multiple sqls into one

I have a series of SQL statements like:
select count(distinct userId) from table where hour >= 0 and hour <= 0;
select count(distinct userId) from table where hour >= 0 and hour <= 1;
select count(distinct userId) from table where hour >= 0 and hour <= 2;
...
select count(distinct userId) from table where hour >= 0 and hour <= 14;
Is there a way to merge them into one SQL statement?
It looks like you are trying to keep a cumulative count, bracketed by the hour. To do that, you can use a window function, like this:
SELECT DISTINCT
A.hour AS hour,
SUM(COALESCE(M.include, 0)) OVER (ORDER BY A.hour) AS cumulative_count
FROM ( -- get all records, with 0 for include
SELECT
name,
hour,
0 AS include
FROM
table
) A
LEFT JOIN
( -- get the record with lowest `hour` for each `name`, and 1 for include
SELECT
name,
MIN(hour) AS hour,
1 AS include
FROM
table
GROUP BY
name
) M
ON M.name = A.name
AND M.hour = A.hour
;
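There might be a simpler way, but this should yield the correct answer as long as each name appears at most once per hour; if a name has several rows in its first hour, each of those rows is joined to include=1 and is counted.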
Explanation:
This uses 2 subqueries against the same input table, with a derived field called include to keep track of which records should contribute to the final total for each bucket. The first subquery simply takes all records in the table and assigns 0 AS include. The second subquery finds all unique names and the lowest hour slot in which that name appears, and assigns them 1 AS include. The 2 subqueries are LEFT JOIN'ed by the enclosing query.
The outermost query does a COALESCE(M.include, 0) to fill in any NULL's produced by the LEFT JOIN, and those 1's and 0's are SUM'ed and windowed by hour. This needs to be a SELECT DISTINCT rather than using a GROUP BY because a GROUP BY will want both hour and include listed, but it ends up collapsing every record in a given hour group into a single row (still with include=1). The DISTINCT is applied after the SUM so it will remove duplicates without discarding any input rows.
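As a cross-check of the "first hour per user, then running total" idea, here is a sketch using an in-memory SQLite database from Python (window functions need SQLite 3.25 or newer). It is a variant of the query above rather than a copy: it de-duplicates the hours and pre-counts new users per first hour before joining, so repeated rows for the same user in one hour cannot inflate the total. The table name visits and the sample rows are assumptions; the column names userId and hour come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (userId TEXT, hour INTEGER)")
# Hypothetical data: u1 has two rows in hour 0, hour 1 has no new users.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("u1", 0), ("u1", 0), ("u2", 0),
        ("u1", 1),
        ("u3", 2), ("u2", 2), ("u4", 2),
        ("u1", 3),
    ],
)

# Count each user once, in the first hour they appear, then accumulate.
cumulative_query = """
SELECT h.hour,
       SUM(COALESCE(f.new_users, 0)) OVER (ORDER BY h.hour) AS cumulative_count
FROM (SELECT DISTINCT hour FROM visits) h
LEFT JOIN (SELECT first_hour, COUNT(*) AS new_users
           FROM (SELECT userId, MIN(hour) AS first_hour
                 FROM visits
                 GROUP BY userId) first_seen
           GROUP BY first_hour) f
       ON f.first_hour = h.hour
ORDER BY h.hour
"""
cumulative = conn.execute(cumulative_query).fetchall()

# Brute force: run the original one-query-per-hour statements for comparison.
brute_force = [
    (
        hour,
        conn.execute(
            "SELECT COUNT(DISTINCT userId) FROM visits WHERE hour >= 0 AND hour <= ?",
            (hour,),
        ).fetchone()[0],
    )
    for (hour, _) in cumulative
]
print(cumulative)
print(brute_force)
```

The brute-force list reruns the separate statements from the question, so comparing the two lists is a quick way to confirm the merged query on real data before relying on it.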