Sampling issue with query in BigQuery (Standard SQL)

I have been running a query of the format below:
SELECT b.date as Date,COUNT(DISTINCT a.user_id) AS NewUsers FROM (
SELECT user_id,MIN(date) as min_date
FROM tableA
WHERE date >= '2018-10-10'
AND filter1 = "XYZ"
GROUP BY user_id ) a
CROSS JOIN (
SELECT date FROM tableB
WHERE date >= '2018-10-19' AND date <= CURRENT_DATE()
GROUP BY 1) b
WHERE a.min_date >= DATE_SUB(b.date, INTERVAL 6 DAY) AND a.min_date <= b.date
GROUP BY 1
Let's say the above is result1.
SELECT b.date as Date,COUNT(DISTINCT a.user_id) AS NewUsers FROM (
SELECT user_id,MIN(date) as min_date
FROM tableA
WHERE date >= '2018-07-10'
AND filter1 = "XYZ"
GROUP BY user_id ) a
CROSS JOIN (
SELECT date FROM tableB
WHERE date >= '2018-07-19' AND date <= CURRENT_DATE()
GROUP BY 1) b
WHERE a.min_date >= DATE_SUB(b.date, INTERVAL 6 DAY) AND a.min_date <= b.date
GROUP BY 1
The above is result2.
Here 2018-07-19 is the launch date.
Since I already have the data up to 2018-10-19, I want to run the query from the later date to reduce the cost and the amount of data scanned, but somehow I am getting incorrect data.
If I run the same query from the launch date, I get the correct results.
Specifically, the NewUsers values in result1 for the corresponding dates (date >= 2018-10-19) are higher than the NewUsers values in result2.
Not sure what I am missing.
Any help would be greatly appreciated.
Thanks

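I think it is because of the use of MIN(date). The counts shift because you restricted the date range: for users who were first seen on earlier dates, MIN(date) is now computed only over the restricted window, so those same "old" users get a recent first-seen date and are counted as new on recent days. That is why NewUsers in result1 is higher.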
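
To make the effect concrete, here is a minimal sketch using Python's built-in sqlite3 rather than BigQuery. The table contents are invented for illustration; only the table and column names come from the question. It shows how the same user gets a different "first seen" date depending on the lower bound of the scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (user_id TEXT, date TEXT);
    -- Hypothetical data: u1 was first seen right after launch and is
    -- still active in October; u2 is genuinely new in October.
    INSERT INTO tableA VALUES
        ('u1', '2018-07-20'),
        ('u1', '2018-10-15'),
        ('u2', '2018-10-16');
""")

FIRST_SEEN = """
    SELECT user_id, MIN(date) AS min_date
    FROM tableA
    WHERE date >= ?
    GROUP BY user_id
    ORDER BY user_id
"""

# Scan from the launch date vs. scan from the later date only
from_launch = conn.execute(FIRST_SEEN, ("2018-07-10",)).fetchall()
from_october = conn.execute(FIRST_SEEN, ("2018-10-10",)).fetchall()

print(from_launch)   # u1 is first seen in July
print(from_october)  # u1 now looks first seen in October
```

With the October-only scan, u1's MIN(date) becomes 2018-10-15, so u1 falls inside the 7-day windows around that date and is counted as a new user there.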

Related

Rolling 12 month Calculation SQL

I am trying to do a 12 month rolling calculation, but I get a syntax error at "rows", here is what I have so far:
(SUM(YTDValue) OVER (ORDER BY PerformanceDate ROWS BETWEEN 11 PRECEDING AND CURRENT ROW)) / 12 AS Yearly_YTDValue

It might be that your RDBMS doesn't support that syntax, because at first glance it looks correct to me.
Note that the ROWS BETWEEN 11 PRECEDING approach only works if you are guaranteed to have exactly one row per month (12 PerformanceDates per year). For that reason I sometimes prefer the self-join below, because it does not require aggregating the data to monthly level first.
WITH BASIC_OFFSET_7DAY AS (
SELECT
A.DATE,
SUM(B.YTDValue) as Yearly_YTDValue
FROM
UserActivity A
JOIN UserActivity B
ON
-- 12 months back, lower bound exclusive so the window covers 12 months, not 12 months plus a day
B.DATE > DATEADD(month, -12, A.DATE)
AND B.DATE <= A.DATE
GROUP BY A.DATE
)
SELECT
src.*,
BASIC_OFFSET_7DAY.Yearly_YTDValue
FROM
UserActivity src
LEFT OUTER JOIN BASIC_OFFSET_7DAY ON BASIC_OFFSET_7DAY.DATE = src.DATE
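
As a rough check of the self-join idea, here is a small sketch in Python's sqlite3. SQLite has no DATEADD, so `date(x, '-12 months')` stands in for it; the 14 monthly rows, each worth 1.0, are made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserActivity (date TEXT, YTDValue REAL)")

# Hypothetical data: one row per month from 2020-01 through 2021-02
months = [f"2020-{m:02d}-01" for m in range(1, 13)] + ["2021-01-01", "2021-02-01"]
conn.executemany("INSERT INTO UserActivity VALUES (?, 1.0)", [(d,) for d in months])

# For every row a, sum all rows b whose date falls in the 12 months ending at a.date
rolling = conn.execute("""
    SELECT a.date, SUM(b.YTDValue) AS rolling_12m
    FROM UserActivity a
    JOIN UserActivity b
      ON b.date >  date(a.date, '-12 months')
     AND b.date <= a.date
    GROUP BY a.date
    ORDER BY a.date
""").fetchall()

print(rolling[0])   # first month only has itself in the window
print(rolling[-1])  # later months see a full 12-month window
```

Because the window is defined by dates rather than by row counts, it keeps working when a month has several rows or none at all.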

How to define the filter in dates?

With the query, I basically want to compare avg_clicks at different time periods and set a filter according to the avg_clicks.
The below query gives us avg_clicks for each shop in January 2020. But I want to see the avg_clicks that is higher than 0 in January 2020.
Question 1: When I add the where avg_clicks > 0 in the query, I am getting the following error: Column 'avg_clicks' cannot be resolved. Where to put the filter?
SELECT AVG(a.clicks) AS avg_clicks,
a.shop_id,
b.shop_name
FROM
(SELECT SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= CAST('2020-01-01' AS date)
AND date <= CAST('2020-01-31' AS date)
GROUP BY shop_id, date) as a
JOIN Y as b
ON a.shop_id = b.shop_id
GROUP BY a.shop_id, b.shop_name
Question 2: As I wrote, I want to compare two different times. And now, I want to see avg_clicks that is 0 in February 2020.
As a result, the desired output will show me the list of shops that had more than 0 clicks in January, but 0 clicks in February.
Hope I could explain my question. Thanks in advance.

For Question 1, use a HAVING clause. WHERE is evaluated before GROUP BY and before the SELECT list, so the alias avg_clicks does not exist yet when WHERE runs, which is why it "cannot be resolved". HAVING filters after the aggregation:
SELECT AVG(a.clicks) AS avg_clicks,
a.shop_id,
b.shop_name
FROM
(SELECT SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-01-01'
AND date <= '2020-01-31'
GROUP BY shop_id, date) as a
JOIN Y as b
ON a.shop_id = b.shop_id
GROUP BY a.shop_id, b.shop_name
HAVING AVG(a.clicks) > 0
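
The HAVING filter above can be tried on a tiny in-memory table. This is a minimal sketch with Python's sqlite3; the rows are made up, and the site filter and the join to Y are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (shop_id INTEGER, date TEXT, clicks_on INTEGER);
    -- Hypothetical data: shop 1 has clicks, shop 2 has none
    INSERT INTO X VALUES
        (1, '2020-01-05', 4), (1, '2020-01-06', 2),
        (2, '2020-01-05', 0), (2, '2020-01-06', 0);
""")

# Daily totals first, then the per-shop average, filtered after aggregation
rows = conn.execute("""
    SELECT shop_id, AVG(clicks) AS avg_clicks
    FROM (SELECT shop_id, date, SUM(clicks_on) AS clicks
          FROM X
          GROUP BY shop_id, date) AS a
    GROUP BY shop_id
    HAVING AVG(clicks) > 0
    ORDER BY shop_id
""").fetchall()

print(rows)  # only shop 1 survives the HAVING filter
```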
For your Question 2, you can do something like this
SELECT
jan.shop_id,
b.shop_name,
jan_avg_clicks,
feb_avg_clicks
FROM
(
SELECT
AVG(clicks) AS jan_avg_clicks,
shop_id
FROM
(
SELECT
SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-01-01'
AND date <= '2020-01-31'
GROUP BY
shop_id,
date
) as a
GROUP BY
shop_id
HAVING AVG(clicks) > 0
) jan
join
(
SELECT
AVG(clicks) AS feb_avg_clicks,
shop_id
FROM
(
SELECT
SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-02-01'
AND date < '2020-03-01'
GROUP BY
shop_id,
date
) as a
GROUP BY
shop_id
HAVING AVG(clicks) = 0
) feb
on jan.shop_id = feb.shop_id
join Y as b
on jan.shop_id = b.shop_id

Start with conditional aggregation:
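SELECT shop_id,
-- divide each month's total by that month's own number of distinct days; 1.0 * avoids integer division
1.0 * SUM(CASE WHEN DATE_TRUNC('month', date) = '2020-01-01' THEN clicks_on END) / COUNT(DISTINCT CASE WHEN DATE_TRUNC('month', date) = '2020-01-01' THEN date END) as avg_clicks_jan,
1.0 * SUM(CASE WHEN DATE_TRUNC('month', date) = '2020-02-01' THEN clicks_on END) / COUNT(DISTINCT CASE WHEN DATE_TRUNC('month', date) = '2020-02-01' THEN date END) as avg_clicks_feb
FROM X
WHERE site = 'com' AND
date >= '2020-01-01' AND
date < '2020-03-01'
GROUP BY shop_id;
I'm not sure what comparison you want to make. But if you want to filter on the aggregated values, use a HAVING clause (or wrap the query in a subquery and filter with WHERE outside it).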
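
For the comparison asked about in Question 2 (more than 0 clicks in January, 0 in February), a sketch along these lines might work. It uses Python's sqlite3 with invented rows, plain date comparisons instead of DATE_TRUNC (which SQLite lacks), and divides each month's total by that month's own distinct day count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (shop_id INTEGER, site TEXT, date TEXT, clicks_on INTEGER);
    -- Hypothetical data: shop 1 goes quiet in February, shop 2 does not
    INSERT INTO X VALUES
        (1, 'com', '2020-01-10', 6), (1, 'com', '2020-01-11', 2),
        (1, 'com', '2020-02-03', 0),
        (2, 'com', '2020-01-10', 5), (2, 'com', '2020-02-03', 7);
""")

rows = conn.execute("""
    SELECT shop_id, avg_clicks_jan, avg_clicks_feb
    FROM (
        SELECT shop_id,
               1.0 * SUM(CASE WHEN date < '2020-02-01' THEN clicks_on END)
                   / COUNT(DISTINCT CASE WHEN date < '2020-02-01' THEN date END)
                   AS avg_clicks_jan,
               1.0 * SUM(CASE WHEN date >= '2020-02-01' THEN clicks_on END)
                   / COUNT(DISTINCT CASE WHEN date >= '2020-02-01' THEN date END)
                   AS avg_clicks_feb
        FROM X
        WHERE site = 'com' AND date >= '2020-01-01' AND date < '2020-03-01'
        GROUP BY shop_id
    ) AS per_shop
    -- filter on the aggregated values in the outer query
    WHERE avg_clicks_jan > 0 AND avg_clicks_feb = 0
    ORDER BY shop_id
""").fetchall()

print(rows)
```

One caveat under these assumptions: a shop with no February rows at all gets a NULL February average and is filtered out, so whether "0 clicks" means "rows with 0" or "no rows" needs to be decided for the real data.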

Records 1 Hour Before and 1 Hour After

I have a table with a set of trouble ticket data. I would like to write a query that selects all the records in this table that have occurred from 1 hour before to 1 hour after a particular record was inserted.
Example:
Error "xyz" occurred at 2018-01-03 15:30:06.000
I would like to return EVERY trouble ticket that was created between 14:30:06.000 and 16:30:06.000 on 2018-01-03. I would like this to happen for all occurrences of that error since the beginning of the year.
Is this possible?
This is what I have, considering the example provided below. I'm still only returning the results in the temp table, and not the +1h and -1h of records from the original table.
select * into #temp
from Incident
where INCIDENT_REPORTED_DATE_TIME > '01/01/2018'
and SUMMARY like '%error%'
select i.*
from Incident i
join #temp t on i.INCIDENT_ID = t.INCIDENT_ID
where i.INCIDENT_REPORTED_DATE_TIME >= DATEADD(HH, -1, t.INCIDENT_REPORTED_DATE_TIME)
and i.INCIDENT_REPORTED_DATE_TIME < DATEADD(HH, 1, t.INCIDENT_REPORTED_DATE_TIME)
order by i.INCIDENT_REPORTED_DATE_TIME

Here is an ANSI SQL approach:
select t.*
from t join
t t2
on t2.col = 'xyz' and
t.created >= t2.created - interval '1 hour' and
t.created <= t2.created + interval '1 hour'
order by t.created;
Note that the exact syntax varies by database (which isn't specified as I write this). But this idea should work in almost any database, with the right date/time functions.
EDIT:
In SQL Server, this looks like:
select t.*
from t join
t t2
on t2.col = 'xyz' and
t.created >= dateadd(hour, -1, t2.created) and
t.created <= dateadd(hour, 1, t2.created)
order by t.created;
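
Here is a small runnable sketch of the same idea using Python's sqlite3, where `datetime(x, '-1 hours')` replaces dateadd. The incident rows and the lower-case column names are made up. The key point is that the join condition is the time range only; joining on the incident id as well (as in the question's attempt) would match each error row only to itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Incident (incident_id INTEGER, summary TEXT, reported TEXT);
    INSERT INTO Incident VALUES
        (1, 'disk full',   '2018-01-03 14:00:00'),  -- too early
        (2, 'login slow',  '2018-01-03 14:45:00'),  -- within the hour before
        (3, 'error xyz',   '2018-01-03 15:30:06'),  -- the anchor record
        (4, 'vpn down',    '2018-01-03 16:10:00'),  -- within the hour after
        (5, 'printer jam', '2018-01-03 17:00:00');  -- too late
""")

# i = every ticket, e = only the "xyz" error tickets used as anchors.
# DISTINCT avoids duplicates when two anchors have overlapping windows.
nearby = conn.execute("""
    SELECT DISTINCT i.incident_id
    FROM Incident i
    JOIN Incident e
      ON e.summary LIKE '%xyz%'
     AND i.reported >= datetime(e.reported, '-1 hours')
     AND i.reported <= datetime(e.reported, '+1 hours')
    ORDER BY i.incident_id
""").fetchall()

print(nearby)
```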

Select records that don't exist in a union between a table and a subset of that table

I have a table with appointments, past, present and future. I would like to be able to run a single query that would give me a list of appointments from a given date, with a status of "no show" that DO NOT have an appointment in the table with a date in the future.
So, what I have so far is (pseudocode-ish):
SELECT *
FROM (SELECT *
FROM Appointments
WHERE Appointments.Date >= Today's Date)
WHERE NOT EXISTS
(SELECT *
FROM Appointments
WHERE Appointments.PatID = SUBQUERYRESULTS.PatID)
The subquery would be
SELECT *
FROM Appointments
WHERE (Appointments.Status = "NoShow" AND (Appointment.Date is >= Start_date and <= End_date))
I'm not sure how to include the subquery to get it to work. I'm new to this, so please excuse the idiocy. Thank you.

You seem to want not exists as a where condition. Based on your description, this seems to be:
select a.*
from appointments a
where a.status = 'no show' and
a.date = #date and -- #date is a placeholder for the given date
not exists (select 1
from appointments a2
where a2.patid = a.patid and a2.date > current_date
);
If the date column has a time component, then the date comparison needs to take this into account.

appointments ... with a status of "no show" that DO NOT have an appointment in the table with a date in the future
This seems to work (tested with Access 2010), and includes "Start_date" and "End_date" comparisons to limit the 'NoShow' appointments to a date range:
SELECT a1.*
FROM Appointments a1
WHERE a1.Status='NoShow'
AND a1.Date >= Start_date AND a1.Date <= End_date
AND NOT EXISTS
(
SELECT *
FROM Appointments a2
WHERE a2.PatID = a1.PatID
AND a2.Date > a1.Date
)

Here's another option (albeit untested) which uses a LEFT JOIN in place of the subquery:
SELECT t.*
FROM
Appointments t LEFT JOIN Appointments u
ON t.PatID = u.PatID AND t.Date < u.Date
WHERE
t.Status = "NoShow" AND
t.Date >= Start_date AND
t.Date <= End_date AND
u.PatID IS NULL
The line u.PatID IS NULL essentially performs the selection of those records with no future appointment.
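
To check that the NOT EXISTS and LEFT JOIN variants agree, here is a minimal sketch with Python's sqlite3. The appointment rows and the fixed date range are invented, and the parameters stand in for Start_date and End_date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Appointments (PatID INTEGER, Date TEXT, Status TEXT);
    INSERT INTO Appointments VALUES
        (1, '2024-01-10', 'NoShow'),
        (1, '2024-02-01', 'Booked'),   -- patient 1 has a later appointment
        (2, '2024-01-12', 'NoShow');   -- patient 2 has none
""")

date_range = ("2024-01-01", "2024-01-31")  # hypothetical Start_date, End_date

# Variant 1: correlated NOT EXISTS
not_exists = conn.execute("""
    SELECT a1.PatID, a1.Date
    FROM Appointments a1
    WHERE a1.Status = 'NoShow'
      AND a1.Date BETWEEN ? AND ?
      AND NOT EXISTS (SELECT 1
                      FROM Appointments a2
                      WHERE a2.PatID = a1.PatID
                        AND a2.Date > a1.Date)
""", date_range).fetchall()

# Variant 2: LEFT JOIN and keep only rows with no later match
left_join = conn.execute("""
    SELECT t.PatID, t.Date
    FROM Appointments t
    LEFT JOIN Appointments u
      ON t.PatID = u.PatID AND t.Date < u.Date
    WHERE t.Status = 'NoShow'
      AND t.Date BETWEEN ? AND ?
      AND u.PatID IS NULL
""", date_range).fetchall()

print(not_exists, left_join)
```

Both variants return only patient 2's no-show, since patient 1 has a later appointment on file.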

sql count statement with multiple date ranges

I have two tables with different appointment dates.
Table 1
id start date
1 5/1/14
2 3/2/14
3 4/5/14
4 9/6/14
5 10/7/14
Table 2
id start date
1 4/7/14
1 4/10/14
1 7/11/13
2 2/6/14
2 2/7/14
3 1/1/14
3 1/2/14
3 1/3/14
If I had fixed date ranges I could count each appointment date just fine, but the date range needs to change per record.
For each id in table 1, I need to count the distinct appointment dates from table 2, BUT only those within the 6 months prior to the start date from table 1.
Example: count all distinct appointment dates for id 1 (in table 2) between 11/1/13 and 5/1/14 (6 months prior). So the result is 2: 4/7/14 and 4/10/14 are within the range and 7/11/13 is outside of it.
So my issue is that the range changes for each record and I cannot figure out how to code this. For id 2 the date range will be 9/2/13-3/2/14, and so on.
Thanks everyone in advance!

Try this out:
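SELECT id,
(
SELECT COUNT(DISTINCT table2.start_date)
FROM table2
WHERE table2.id = table1.id
AND table2.start_date >= DATEADD(MM,-6,table1.start_date)
AND table2.start_date <= table1.start_date
) AS table2records
FROM table1
The DATEADD subtracts 6 months from the date in table1, and the subquery returns the number of distinct appointment dates in table2 that fall between that date and the table1 start date.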

I think what you want is a type of join.
select t1.id, count(distinct t2.startdate) as numt2dates
from table1 t1 left outer join
table2 t2
on t1.id = t2.id and
t2.startdate between dateadd(month, -6, t1.startdate) and t1.startdate
group by t1.id;
The exact syntax for the date arithmetic depends on the database.
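
As one concrete instance of that date arithmetic, here is a sketch using Python's sqlite3, where `date(x, '-6 months')` plays the role of dateadd. It loads part of the sample data from the question (dates rewritten as ISO strings) and counts distinct dates in table 2 per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, start_date TEXT);
    CREATE TABLE table2 (id INTEGER, start_date TEXT);
    INSERT INTO table1 VALUES
        (1, '2014-05-01'), (2, '2014-03-02'), (4, '2014-09-06');
    INSERT INTO table2 VALUES
        (1, '2014-04-07'), (1, '2014-04-10'), (1, '2013-07-11'),
        (2, '2014-02-06'), (2, '2014-02-07');
""")

# The date window is part of the join condition, so it is evaluated per
# table1 row; LEFT JOIN keeps ids with no matching appointments (count 0).
counts = conn.execute("""
    SELECT t1.id, COUNT(DISTINCT t2.start_date) AS n_dates
    FROM table1 t1
    LEFT JOIN table2 t2
      ON t2.id = t1.id
     AND t2.start_date BETWEEN date(t1.start_date, '-6 months') AND t1.start_date
    GROUP BY t1.id
    ORDER BY t1.id
""").fetchall()

print(counts)
```

For id 1 this gives 2, matching the example in the question: 2013-07-11 falls outside the six-month window.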

Thank you, this solved my issue. It may not help you directly, since you are not attempting to group by date, but the answer gave me the insight I needed to resolve the issue I was facing.
I was trying to count the total users matching a date criterion that had to be evaluated against multiple fields.
WITH data AS (
SELECT generate_series(
(date '2020-01-01')::timestamp,
NOW(),
INTERVAL '1 week'
) AS date
)
SELECT d.date, (SELECT COUNT(DISTINCT h.id)
FROM history h WHERE h.startDate < d.date AND h.endDate > d.date) AS total_records
FROM data d ORDER BY d.date DESC
2022-05-16, 15
2022-05-09, 13
2022-05-02, 13
...
