SQL using the result of a UNION query in another query

How can I complete this query?
Right now, the query I have working is this, but it is not producing the right data.
SELECT date, coalesce(count, 0) AS count
FROM generate_series(
       '2014-12-13 12:00:00'::timestamp,
       '2015-01-06 11:00:00'::timestamp,
       '1 hour'::interval
     ) AS date
LEFT OUTER JOIN (
    SELECT
        date_trunc('day', TABLE1.created_at) as day,
        count(DISTINCT TABLE1.user) as count
    FROM TABLE1
    WHERE org_id = 1
    GROUP BY day
) results ON (date = results.day);
Instead of TABLE1, I need to feed the query with data from another query which looks like this:
SELECT TABLE2.user_a as userid, TABLE2.created_at as createdat from TABLE2
UNION ALL
SELECT TABLE3.user_b as userid, TABLE3.created_at as createdat from TABLE3
UNION ALL
SELECT TABLE4.sender as userid, TABLE4.created_at as createdat from TABLE4;
How do I do this?

Any part of a SELECT query that accepts a table (e.g., a FROM clause, a JOIN clause, etc.) can instead take a query wrapped in parentheses - this is called a subquery. Note that in Postgres this subquery must be given an alias (i.e., a name that it can be referenced by). So in your case:
SELECT date, coalesce(count, 0) AS count
FROM generate_series(
       '2014-12-13 12:00:00'::timestamp,
       '2015-01-06 11:00:00'::timestamp,
       '1 hour'::interval
     ) AS date
LEFT OUTER JOIN (
    SELECT
        date_trunc('day', subquery.createdat) as day,
        count(DISTINCT subquery.userid) as count
    -- Subquery here; note that the outer query must reference the column
    -- aliases it exposes (userid, createdat):
    FROM (
        SELECT TABLE2.user_a as userid, TABLE2.created_at as createdat, TABLE2.org_id FROM TABLE2
        UNION ALL
        SELECT TABLE3.user_b as userid, TABLE3.created_at as createdat, TABLE3.org_id FROM TABLE3
        UNION ALL
        SELECT TABLE4.sender as userid, TABLE4.created_at as createdat, TABLE4.org_id FROM TABLE4
    ) subquery
    -- org_id is assumed to exist on TABLE2/3/4; if it does not, apply the
    -- filter inside the relevant branches instead
    WHERE org_id = 1
    GROUP BY day
) results ON (date = results.day);
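Equivalently, a minimal sketch of the same idea using a CTE (WITH clause) instead of an inline subquery, which can be easier to read; this assumes, as above, that each table exposes an org_id column:
WITH combined AS (
    SELECT TABLE2.user_a as userid, TABLE2.created_at as createdat, TABLE2.org_id FROM TABLE2
    UNION ALL
    SELECT TABLE3.user_b as userid, TABLE3.created_at as createdat, TABLE3.org_id FROM TABLE3
    UNION ALL
    SELECT TABLE4.sender as userid, TABLE4.created_at as createdat, TABLE4.org_id FROM TABLE4
)
SELECT date, coalesce(count, 0) AS count
FROM generate_series(
       '2014-12-13 12:00:00'::timestamp,
       '2015-01-06 11:00:00'::timestamp,
       '1 hour'::interval
     ) AS date
LEFT OUTER JOIN (
    SELECT date_trunc('day', createdat) as day,
           count(DISTINCT userid) as count
    FROM combined
    WHERE org_id = 1
    GROUP BY day
) results ON (date = results.day);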

Related

BigQuery: 'join lateral' alternative for referencing value in subquery

I have a BigQuery table that holds append-only data - each time an entity is updated a new version of it is inserted. Each entity has its unique ID and each entry has a timestamp of when it was inserted.
When querying for the latest version of an entity, I rank the rows partitioned by id and ordered by the insertion timestamp, and select the most recent version.
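(For reference, a minimal sketch of that latest-version query, using the id and createdOn columns mentioned here:)
-- Sketch: keep only the most recent row per id
select *
from `table`
where true  -- some BigQuery versions require a WHERE/GROUP BY/HAVING clause alongside QUALIFY
qualify row_number() over (partition by id order by createdOn desc) = 1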
I want to take advantage of this and chart the progression of these entities over time. For example, I would like to generate a row for each day since Jan. 1st, with a summary of the entities as they were on that day. In postgres, I would do:
select
    ...
from generate_series('2022-01-01'::timestamp, '2022-09-01'::timestamp, '1 day'::interval) query_date
left join lateral (
    with snapshot as (
        select distinct on (id) *
        from table
        where "createdOn" <= query_date
        order by id, "createdOn" desc
    )
    select *  -- summarize the snapshot here
    from snapshot
) summary on true
This basically behaves like a for-each: the lateral subquery runs once for each query_date (a day, in this instance), which I can reference in its WHERE clause. Each run then filters the data so that it only uses rows up to that point in time.
I know that I can create a saved query for the "subquery" logic and then schedule a prefill to run once for each day over the timeline, but I would like to understand how to write an exploratory query.
EDIT 1
Using a correlated subquery is a step in the right direction, but does not work when the subquery needs to join with another table (another append-only table holding a related entity).
So this works:
select
day
, (
select count(*)
from `table` t
where date(createdOn) < day
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
But if I need the subquery to join with another table, like in:
select
day
, (
select as struct *
from (
select
id
, status
, rank() over (partition by id order by createdOn desc) as rank
from `table1`
where date(createdOn) < day
qualify rank = 1
) t1
left join (
select
id
, other
, rank() over (partition by id order by createdOn desc) as rank
from `table2`
where date(createdOn) < day
qualify rank = 1
) t2 on t2.other = t1.id
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
I get an error saying Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. Another SO question about that error (Avoid correlated subqueries error in BigQuery) solves the issue by moving the correlated query to a join in the top query - which misses what I am trying to achieve.
Took me a while, but I figured out a way to do this using the answer in Bigquery: WHERE clause using column from outside the subquery.
Basically, it requires flipping the order of the queries; here's how it's done:
select *
from (
    select *
    from `table1` t1
    join (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
      on t1.createdOn < day.day
    qualify row_number() over (partition by day.day, t1.id order by t1.createdOn desc) = 1
) t1
left join (
    select
        day,
        other  -- aggregate the snapshot columns here as needed
    from (
        select t2.id, t2.other, t2.createdOn, day.day
        from `table2` t2
        join (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
          on t2.createdOn < day.day
        qualify row_number() over (partition by day.day, t2.id order by t2.createdOn desc) = 1
    ) snapshot
    group by snapshot.other, day
) t2 on t2.other = t1.id and t2.day = t1.day
-- group/aggregate by t1.day here if a per-day rollup is needed

Window functions ordered by date when some dates don't exist

Suppose this example query:
select
id
, date
, sum(var) over (partition by id order by date rows 30 preceding) as roll_sum
from tab
When some dates are not present in the date column, the window will not consider the nonexistent dates. How could I make this window aggregation include these nonexistent dates?
Many thanks!
You can join a sequence containing all dates from a desired interval.
select *
from (
    select
        d.date,
        q.id,
        q.roll_sum
    from unnest(sequence(date '2000-01-01', date '2030-12-31')) as d(date)
    left join ( your_query ) q on q.date = d.date
) v
where v.date > (select min(my_date) from tab2)
  and v.date < (select max(my_date) from tab2)
In standard SQL, you would typically use a window range specification, like:
select
id,
date,
sum(var) over (
partition by id
order by date
range interval '30' day preceding
) as roll_sum
from tab
However, I am unsure that Presto supports this syntax. You can resort to a correlated subquery instead:
select
id,
date,
(
select sum(var)
from tab t1
where
t1.id = t.id
and t1.date >= t.date - interval '30' day
and t1.date <= t.date
) roll_sum
from tab t
I don't think Presto supports window functions with interval ranges. Alas. There is an old-fashioned way of doing this, by counting "ins" and "outs" of values: each value is added to the running sum on its own date and subtracted back out 30 days later.
with inout as (
      select id, date, var, 1 as is_orig
      from tab
      union all
      select id, date + interval '30' day, -var, 0 as is_orig
      from tab
     )
select t.*
from (select id, date,
             sum(sum(var)) over (partition by id order by date) as running_30,
             sum(is_orig) as is_orig
      from inout
      group by id, date
     ) t
where is_orig > 0

How to show a zero row for the dates not present in a table's records

I am trying to show a count of zero for the dates that are not found.
Below is my basic query:
Select date_col, count(distinct file_col), count(*) from tab1
where date_col between 'date1' and 'date2'
group by date_col;
The output only shows one date.
I want all the dates to be shown in the result.
The general way to deal with this type of problem is to use something called a calendar table. This calendar table contains all the dates which you want to appear in your report. We can create a crude one by using a subquery:
SELECT
t1.date,
COUNT(DISTINCT t2.file_col) AS d_cnt,
COUNT(t2.file_col) AS cnt
FROM
(
SELECT '2018-06-01' AS date UNION ALL
SELECT '2018-06-02' UNION ALL
...
) t1
LEFT JOIN tab1 t2
ON t1.date = t2.date_col
WHERE
t1.date BETWEEN 'date1' and 'date2'
GROUP BY
t1.date;
The critical point here is that we left join the calendar table to the table containing the actual data, but we count a column from your data table. This means that zero is reported for any day with no matching data.
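To illustrate why the counted column matters, a hedged sketch with a hypothetical empty day: COUNT(*) counts the NULL-extended row produced by the left join, while COUNT(t2.file_col) ignores NULLs and therefore returns 0.
-- Hypothetical example: assume tab1 has rows for 2018-06-01 but none for 2018-06-02
SELECT
    t1.date,
    COUNT(*)           AS star_cnt,  -- 1 for 2018-06-02 (counts the NULL-extended row)
    COUNT(t2.file_col) AS cnt        -- 0 for 2018-06-02
FROM
(
    SELECT '2018-06-01' AS date UNION ALL
    SELECT '2018-06-02'
) t1
LEFT JOIN tab1 t2
    ON t1.date = t2.date_col
GROUP BY
    t1.date;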
If you are using PostgreSQL, you could generate a series covering the necessary date period.
SELECT
    t1.month,
    COUNT(DISTINCT t2.file_col) AS d_cnt,
    COUNT(t2.file_col) AS cnt
FROM
(
    SELECT to_char('?'::DATE + (interval '1' month * generate_series(0, 11)), 'yyyy-mm') AS month
) t1
LEFT JOIN tab1 t2
    ON t1.month = to_char(t2.date_col, 'yyyy-mm')
WHERE
    t1.month BETWEEN 'date1' AND 'date2'
GROUP BY
    t1.month;
This example shows how to generate a sequence for a monthly period.
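For a daily breakdown, closer to the original query, a similar sketch (with hypothetical start and end dates) can use generate_series over days directly:
-- Hypothetical daily calendar joined to the data table
SELECT
    d.date,
    COUNT(DISTINCT t2.file_col) AS d_cnt,
    COUNT(t2.file_col) AS cnt
FROM generate_series('2018-06-01'::date, '2018-06-30'::date, interval '1 day') AS d(date)
LEFT JOIN tab1 t2
    ON t2.date_col = d.date::date
GROUP BY
    d.date
ORDER BY
    d.date;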

SQL Server: Attempting to output a count with a date

I am trying to write a statement and am a bit puzzled about the best way to put it together. I am doing a UNION on a number of tables, and from there I want to produce as the output a count for the UserID within that day.
So I will have numerous tables union such as:
Order ID, USERID, DATE, Task Completed.
UNION
Order ID, USERID, DATE, Task Completed
etc
Above is the layout of the tables; four tables with the same column names will be unioned together.
The output I want from the statement is a count of USERID occurrences within the last 24 hours.
So output should be:
USERID--- COUNT OUTPUT-- DATE
I was attempting a WHERE clause, but I think the output is not exactly what I am after. Can anyone point me in the right direction, and is there an alternative approach to the union? Maybe a join could be a better alternative; any help would be appreciated.
I will eventually put this into an SSRS report, so it gets updated daily.
You can try this:
select USERID, count(*) as [COUNT], cast(DATE as date) as [DATE]
from
(select USERID, DATE From SomeTable1
union all
select USERID, DATE From SomeTable2
....
) t
where DATE <= GETDATE() AND DATE >= DATEADD(hh, -24, GETDATE())
group by USERID, cast(DATE as date)
First, you should use union all rather than union. Second, you need to aggregate and use count distinct to get what you want:
So, the query you want is something like:
select count(distinct userid)
from ((select date, userid
from table1
where date >= '2015-05-26'
) union all
(select date, userid
from table2
where date >= '2015-05-26'
) union all
(select date, userid
from table3
where date >= '2015-05-26'
)
) du
Note that this hardcodes the date. In SQL Server, you would do something like:
date >= cast(getdate() - 1 as date)
And in MySQL
date >= date_sub(curdate(), interval 1 day)
EDIT:
I read the question as wanting a single day. It is easy enough to extend to all days:
select cast(date as date) as dte, count(distinct userid)
from ((select date, userid
from table1
) union all
(select date, userid
from table2
) union all
(select date, userid
from table3
)
) du
group by cast(date as date)
order by dte;
For even more readability, you could use a CTE:
;WITH cte_CTEName AS(
SELECT UserID, Date, [Task Completed] FROM Table1
UNION
SELECT UserID, Date, [Task Completed] FROM Table2
etc
)
SELECT COUNT(UserID) AS [Count] FROM cte_CTEName
WHERE Date <= GETDATE() AND Date >= DATEADD(hh, -24, GETDATE())
I think this is what you are trying to achieve...
Select
UserID,
Date,
Count(1)
from
(Select *
from table1
Union All
Select *
from table2
Union All
Select *
from table3
Union All
Select *
from table4
) a
Group by
Userid,
Date

Hits per day in Google Big Query

I am using Google Big Query to find hits per day. Here is my query,
SELECT COUNT(*) AS Key,
DATE(EventDateUtc) AS Value
FROM [myDataSet.myTable]
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
This is working fine, but it ignores dates with 0 hits. I want to include those. I cannot create a temp table in Google BigQuery. How can I fix this?
I tested this but am getting the error Field 'day' not found.
SELECT COUNT(*) AS Key,
DATE(t.day) AS Value from (
select date(date_add(day, i, "DAY")) day
from (select '2015-05-01 00:00' day) a
cross join
(select
position(
split(
rpad('', datediff(CURRENT_TIMESTAMP(),'2015-05-01 00:00')*2, 'a,'))) i
from (select NULL)) b
) d
left join [sample_data.requests] t on d.day = t.day
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
You can only query data that exists in your tables; the query cannot guess which dates are missing from your table. You need to handle this either in your programming language, or you can join with a numbers table that generates the dates on the fly.
If you know the date range you have in your query, you can generate the days:
select date(date_add(day, i, "DAY")) day
from (select '2015-01-01' day) a
cross join
(select
position(
split(
rpad('', datediff('2015-01-15','2015-01-01')*2, 'a,'))) i
from (select NULL)) b;
Then you can join this result with your query table:
SELECT COUNT(*) AS Key,
DATE(t.day) AS Value from (...the.above.query.pasted.here...) d
left join [myDataSet.myTable] t on d.day = t.day
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;