Creating a new view in BigQuery, with case statements - take 2

I had asked this question before, but my original code snippet was wrong and I didn't see it until an answer came in.
The original question:
I am trying to create a new view in BigQuery using some of the Google hosted data. The data is for traffic collisions in New York.
For each unique day in the dataset, I want to find the borough and sum up some fields (people injured, killed, etc.)
Now, the dataset does have a borough field, but it is incomplete. I have seen that there are also latitude and longitude fields; however, these are also incomplete. So I see 3 scenarios:
1. The borough is set - use that.
2. No borough, but there are lat and long - use those in a subquery.
3. There is no lat, long or borough - just enter "Unknown".
To find the borough from lat and long, there is another public dataset, and I used this with a lat and long to check:
SELECT UPPER(tz_loc.borough) FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` tz_loc
WHERE (ST_DWithin(tz_loc.zone_geom, ST_GeogPoint(-73.94398, 40.680088),0))
Originally, I tried this:
CREATE VIEW `your-project-id.your_dataset_id.collisions_data_bourgh` AS
SELECT CAST(timestamp as DATE) as collision_date,
COUNT(CAST(timestamp as DATE)) as NUM_COLLISIONS,
CASE
WHEN ds.borough IS NOT NULL THEN CAST(borough as STRING) -- when the borough is set
WHEN ((ds.latitude IS NOT NULL or ds.longitude IS NOT NULL) AND ds.borough IS NULL) THEN (SELECT CAST(UPPER(tz_loc.borough)as STRING) FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` tz_loc WHERE (ST_DWithin(tz_loc.zone_geom, ST_GeogPoint(CAST(ds.longitude AS FLOAT64), CAST(ds.latitude AS FLOAT64)),0))) -- when the borough is null and either lat or long is not null
WHEN (ds.latitude IS NULL OR ds.longitude IS NULL OR ds.borough IS NULL) THEN "Unknown"
END AS NEIGHBORHOOD,
SUM(CAST(number_of_cyclist_killed as INT64)) as CYCLISTS_KILLED,
SUM(CAST(number_of_cyclist_injured as INT64)) as CYCLISTS_INJURED,
SUM(CAST(number_of_motorist_killed as INT64)) as MOTORISTS_KILLED,
SUM(CAST(number_of_motorist_injured as INT64)) as MOTORISTS_INJURED,
SUM(CAST(number_of_pedestrians_killed as INT64)) as PEDS_KILLED,
SUM(CAST(number_of_pedestrians_injured as INT64)) as PEDS_INJURED,
SUM(CAST(number_of_persons_killed as INT64)) as PERSONS_KILLED,
SUM(CAST(number_of_persons_injured as INT64)) as PERSONS_INJURED,
FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions` ds
GROUP BY collision_date, NEIGHBORHOOD;
but was corrected by another user to the code below:
CREATE VIEW `uhi-assignment-1.assignment.collisions_data_bourgh22` AS
SELECT
CAST(timestamp AS DATE) AS collision_date,
COUNT(CAST(timestamp AS DATE)) AS NUM_COLLISIONS,
CASE
WHEN ds.borough IS NOT NULL THEN CONCAT('AA ', CAST(borough AS STRING)) -- when the borough is set
WHEN ds.borough IS NULL AND ds.location IS NOT NULL
THEN (
SELECT CAST(UPPER(tz_loc.borough) AS STRING)
FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` tz_loc
WHERE
ST_DWithin(tz_loc.zone_geom,
ST_GeogPoint(CAST(ds.longitude AS FLOAT64),
CAST(ds.latitude AS FLOAT64)),0)
AND tz_loc.borough = ds.borough
)
WHEN (ds.latitude IS NULL AND ds.longitude IS NULL AND ds.borough IS NULL) THEN "CC Unknown"
END
AS NEIGHBORHOOD,
SUM(CAST(number_of_cyclist_killed AS INT64)) AS CYCLISTS_KILLED,
SUM(CAST(number_of_cyclist_injured AS INT64)) AS CYCLISTS_INJURED,
SUM(CAST(number_of_motorist_killed AS INT64)) AS MOTORISTS_KILLED,
SUM(CAST(number_of_motorist_injured AS INT64)) AS MOTORISTS_INJURED,
SUM(CAST(number_of_pedestrians_killed AS INT64)) AS PEDS_KILLED,
SUM(CAST(number_of_pedestrians_injured AS INT64)) AS PEDS_INJURED,
SUM(CAST(number_of_persons_killed AS INT64)) AS PERSONS_KILLED,
SUM(CAST(number_of_persons_injured AS INT64)) AS PERSONS_INJURED,
FROM
`bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions` ds
GROUP BY
collision_date,
NEIGHBORHOOD
But this doesn't do what I need. After some experimenting, I have found that I get no records back from the separate subquery.
From looking at the query, I think the line:
AND tz_loc.borough = ds.borough
is the issue, as ds.borough will be null (which is the reason it falls through to the taxi geo query in the first place). But this line is needed to prevent a LEFT JOIN issue.
Anyone have any ideas?

The right way to do things in SQL is to think about sets, not per-row operations. The CASE in the suggested expression is rather an anti-pattern.
Thinking in terms of set operations, we need to join the rows with an empty borough to the zones table to get the borough information that way, and then use this data with the original table:
1. Join the collisions-table rows where we don't have a borough to the zones table - this will drop some rows, but that's OK; the join is only computed where it is needed, i.e. where borough is not filled out.
2. LEFT JOIN the collisions table with the result of step 1 - this adds another column with the neighborhood computed from the zones table, without dropping anything.
3. Use COALESCE(borough, tz_borough, 'UNKNOWN') to choose the first available value: first the original borough field; if that's not available, the borough from the taxi zones; otherwise 'UNKNOWN'.
4. Finally, group and count.
I dropped the extra aggregations, but the overall idea should be clear:
WITH taxi_zones_join AS (
SELECT ds.unique_key, tz_borough
FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions` ds
JOIN (
SELECT borough AS tz_borough, zone_geom AS tz_geom
FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` tz_loc
)
ON ST_INTERSECTS(tz_geom, ST_GeogPoint(CAST(longitude AS FLOAT64), CAST(latitude AS FLOAT64)))
WHERE ds.borough IS NULL
),
collisions_plus AS (
SELECT
ds.* EXCEPT(borough),
UPPER(COALESCE(borough, tz_borough, 'UNKNOWN')) AS NEIGHBORHOOD
FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions` ds
LEFT JOIN taxi_zones_join bj
USING(unique_key)
)
SELECT
CAST(timestamp AS DATE) AS collision_date,
COUNT(*) AS NUM_COLLISIONS,
NEIGHBORHOOD
FROM collisions_plus
GROUP BY collision_date, NEIGHBORHOOD

Related

SQL BigQuery - Error that variable is not grouped by even though it is

SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused on what could be going wrong here, as the error says that timestamp is not grouped, even though I have grouped it at the bottom. I tried including just timestamp rather than Date(timestamp), but that ruins the table data that I am trying to create, where I find the number of users on a single day. Does anyone have any other ideas? My goal is for a single row, get the previous row's data, but because I am grouping by specific metrics, I need to make sure they are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER part as:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function within OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
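For illustration, here is a minimal sketch of that workaround, with placeholder table names and a simplified join (the original query masks its sources with ***): DATE(timestamp) is aliased once inside a subquery, and only the alias appears in the outer SELECT, GROUP BY, and OVER.
-- Sketch only: table names and the join are placeholders, not the poster's real schema.
SELECT
  community_name,
  community_id,
  event_date,
  COUNT(DISTINCT user_id) AS num_opened_DAU,
  LAG(COUNT(DISTINCT user_id)) OVER (
    ORDER BY community_name, community_id, event_date   -- alias, not DATE(timestamp)
  ) AS pre_Value
FROM (
  SELECT
    c.name AS community_name,
    c.id AS community_id,
    DATE(a.timestamp) AS event_date,   -- DATE() computed once, down here
    a.user_id
  FROM `project.dataset.app_opened` AS a        -- placeholder table
  LEFT JOIN `project.dataset.communities` AS c  -- placeholder table
    ON c.id = a.community_id
  WHERE a.user_id IS NOT NULL
)
GROUP BY community_name, community_id, event_date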

I am stuck on getting a previous value

I have been working on this SQL code for a bit and I cannot get it to display like I want. I have an operation where we send parts outside of our business, but there is no timestamp for when that operation was sent out.
I am taking the previous operation's last labor date and the purchase order creation date to try and find out how long it takes that department to issue a purchase order.
I have tried adding LAST_VALUE to my query. I have even played with LAG and couldn't get anything but errors.
SELECT
JobOpDtl.JobNum,
JobOpDtl.OprSeq,
JobOpDtl.OpDtlDesc,
LastValue.ClockInDate,
LastValue.LastValue
FROM Erp.JobOpDtl
LEFT OUTER JOIN Erp.LaborDtl ON
LaborDtl.JobNum = JobOpDtl.JobNum
and LaborDtl.OprSeq = JobOpDtl.OprSeq
LEFT OUTER JOIN (
Select
LaborDtl.JobNum,
LaborDtl.OprSeq,
MAX(LaborDtl.ClockInDate) as ClockInDate,
LAST_VALUE (LaborDtl.ClockInDate) OVER (PARTITION BY OprSeq ORDER BY JobNum) as LastValue
FROM Erp.LaborDtl
GROUP BY
LaborDtl.JobNum,
LaborDtl.OprSeq,
LaborDtl.ClockInDate
) as LastValue ON
JobOpDtl.JobNum = LastValue.JobNum
and JobOpDtl.OprSeq = LastValue.OprSeq
WHERE JobOpDtl.JobNum = 'PA8906'
GROUP BY
JobOpDtl.JobNum,
LastValue.OprSeq,
JobOpDtl.OpDtlDesc,
JobOpDtl.OprSeq,
LastValue.ClockInDate,
LastValue.LastValue
No errors, it's just not displaying how I want it to.
I would like it to display each OprSeq with the previous OprSeq's last transaction date.
The basic function you want is LAG (as you suggested), but you need to wrap it in a COALESCE. Here is some sample code that illustrates the concept:
SELECT * INTO #Jobs
FROM (VALUES ('P1','Step1', '2019-04-01'), ('P1','Step2', '2019-04-02')
, ('P1','Step3', '2019-04-03'), ('P1','Step4', NULL),
('P2','Step1', '2019-04-01'), ('P2','Step2', '2019-04-03')
, ('P2','Step3', '2019-04-06'), ('P2','Step4', NULL)
) as JobDet(JobNum, Descript, LastDate)
SELECT *
, COALESCE( LastDate, LAG(LastDate,1)
OVER(PARTITION BY JobNum
ORDER BY COALESCE(LastDate,GETDATE()))) as LastValue
FROM #Jobs
ORDER BY JobNum, Descript
DROP TABLE #Jobs
To apply it to your specific problem, I'd suggest using a COMMON TABLE EXPRESSION that replaces LastValue and using that instead of the raw table for your queries.
Your example picture doesn't match any tables you reference in your code (it would help us significantly if you included code that created temp tables matching those referenced in your code) so this is a guess, but it will be something like this:
;WITH cteJob as (
SELECT JobNum, OprSeq, OpDtlDesc, ClockInDate
, COALESCE( LastValue, LAG(LastValue,1)
OVER(PARTITION BY JobNum
ORDER BY COALESCE(LastValue,GETDATE()))) as LastValue
FROM Erp.JobOpDtl
) SELECT *
FROM cteJob as J
LEFT OUTER JOIN LaborDtl as L
on J.JobNum = L.JobNum
AND J.OprSeq = L.OprSeq
BTW, if you clean up your question to provide a better example of your data (i.e. SELECT INTO statements like the ones at the start of my answer, producing tables that correspond to the tables in your code, instead of an image of an Excel file), I might be able to get you closer to what you need. But hopefully this is enough to get you on the right track, and it's the best I can do with what you've provided so far.

Access query inconsistently treats empty string as null

I have an application that is grabbing data from an Access database. I am seeking the minimum value of a column and the results I am getting back are inconsistent.
Have I run into a feature where Access inconsistently treats an empty string as a null depending on whether I add a filter or not, or is there something wrong with the way I am querying the data?
The column contains one blank value (not null) and several non-blank values that are all identical (about 30 instances of 'QLD'). The query I am using has a filter that involves multiple other tables, so that only the blank value and about half of the 'QLD' values are eligible.
It's probably easier to show the code and the effects rather than describe it. I have created a series of unioned queries which 'should' bring back identical results but do not.
Query:
SELECT 'min(LOC_STATE)' as Category
, min(LOC_STATE) as Result
FROM pay_run, pay_run_employee, employee, department, location
WHERE pr_id = pre_prid
AND em_location = loc_id
AND pre_empnum = em_empnum
AND em_department = dm_id
AND pr_date >= #2/24/2015#
AND pr_date <= #2/24/2016#
UNION ALL
(SELECT TOP 1 'top 1 LOC_STATE'
, LOC_STATE
FROM pay_run, pay_run_employee, employee, department, location
WHERE pr_id = pre_prid
AND em_location = loc_id
AND pre_empnum = em_empnum
AND em_department = dm_id
AND pr_date >= #2/24/2015#
AND pr_date <= #2/24/2016#
ORDER BY LOC_STATE)
UNION ALL
SELECT 'min unfiltered', min(loc_state)
FROM location
UNION ALL
(SELECT TOP 1 'iif is null', iif(loc_state is null, 'a', loc_state)
FROM location
ORDER BY loc_state)
Results:
Category            Result
min(LOC_STATE)      'QLD'
top 1 LOC_STATE     ''
min unfiltered      ''
iif is null         ''
If I do a minimum with the filter it brings back 'QLD' and not the empty string. At this stage it is possible that the empty string is not being included because it is treated as a null or the filter removes it.
The second query, which brings back the top 1 state using the filter, shows that the empty string is not filtered out, which means that the Min function is ignoring the empty string.
The third query, which gets the minimum of the unfiltered table, brings back the empty string - so the minimum function does not exclude empty strings / treat them as null.
The fourth query ensures that the blank value really is an empty string and not a null.
My conclusion is that perhaps the inclusion of other tables and filter criteria is causing the empty string value to be treated as a null, but I feel that I must be missing something.
NB: I have a very similar query (date literals altered) that executes against the same data imported into a SQL Server database. It is correctly returning '' for all 4 queries.
Does anyone know why the empty string is ignored by the Min function in the first query?
PS: for those who prefer a query with joins
SELECT 'min(LOC_STATE)' as Category
, min(LOC_STATE) as Result
FROM (((pay_run
INNER JOIN pay_run_employee ON pay_run.pr_id = pay_run_employee.pre_prid)
INNER JOIN employee ON pay_run_employee.pre_empnum = employee.em_empnum)
INNER JOIN department ON employee.em_department = department.dm_id)
INNER JOIN location on employee.em_location = location.loc_id
WHERE
PR_DATE >= #2/24/2015# and
PR_DATE <= #2/24/2016#
union all
(SELECT TOP 1 'TOP 1 LOC_STATE'
, LOC_STATE
FROM (((pay_run
INNER JOIN pay_run_employee ON pay_run.pr_id = pay_run_employee.pre_prid)
INNER JOIN employee ON pay_run_employee.pre_empnum = employee.em_empnum)
INNER JOIN department ON employee.em_department = department.dm_id)
INNER JOIN location on employee.em_location = location.loc_id
WHERE
PR_DATE >= #2/24/2015# and
PR_DATE <= #2/24/2016#
order by LOC_STATE)
union all
select 'min unfiltered', min(loc_state)
from location
This has got nothing to do with corrupt data or unions or joins. The problem can easily be made visible by executing the following queries in Access:
create table testbug (Field1 varchar (255) NULL)
insert into testbug (Field1) values ('a')
insert into testbug (Field1) values ('')
insert into testbug (Field1) values ('c')
select min(field1) from testbug
In my opinion this is a bug in MS Access. When the MIN function in Access comes across an empty string (''), it forgets all the values it has seen so far and returns the minimum of only the values that come after the empty string (in my simple example, only the value 'c').
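If you need a query that doesn't trip over this quirk, one possible workaround (just a sketch against the testbug table above, not verified on the original schema) is to skip MIN and take the first row in sorted order - which is essentially what the TOP 1 query in the question already does:
-- Workaround sketch: avoid the MIN aggregate and let ORDER BY pick the smallest value,
-- so the empty string is returned if it is present.
SELECT TOP 1 Field1 AS MinField1
FROM testbug
ORDER BY Field1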

Bigquery: "Not enough memory"

BigQuery started to give me the error "not enough memory" when I ran this query this morning. The two tables involved contain no more than 5 GB of data. Plus, I'm using table decorators; 1407249067530 equals around 10:30am today (20140805). I wonder what the problem is.
Job ID: red-road-574:job_x8flLfo4QwA1gQ_FCrNWbKY-bZM
select * from
(
select t_connection.row_id AS debug_row_id,
t_connection.hardware_id AS hardware_id,
t_connection.debug_data AS debug_data,
t_connection.connection_status AS connection_status,
t_connection.date_time AS debug_date_time,
t_gps.hardware_id AS hardware_id2,
t_gps.latitude AS latitude,
t_gps.longitude AS longitude,
t_gps.date_time AS gps_date_time,
t_gps.zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id,
gg.hardware_id as hardware_id,
gg.latitude as latitude,
gg.longitude as longitude,
gg.date_time as date_time,
gg.zip_code as zip_code
from [my data set.table1_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
dd.date_time as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [my data set.table2_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id
)
) WHERE row_num=1
You're hitting an odd corner case. When you use allowLargeResults with results that are nested or repeated and you don't use flattenResults=false, the query goes into a special mode. (when you use timestamps, you're really using a nested data structure, which was a design decision that spawned 1000 bugs and is hopefully changing soon). This special query mode has some limitations, which are what you're hitting.
In general, we want this to be seamless, which is why it isn't documented. However, since you're running into a problem here, I'll explain a little about how to avoid it.
You have a couple of options to get around this:
If you're using nested or repeated results (it looks like you're not, which is good):
rename your results without dots in the name.
set the flattenResults field on the query to 'false'. This means that nested and repeated fields will be actually nested and repeated in the results.
If you're using timestamps in the results:
Convert your timestamps to strings or numeric values. Sorry.
If you don't really need large results:
unset the allowLargeResults flag.
I realize that all of these options are deeply unsatisfying. This is an area we're actively working to improve.
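For the timestamp option, the conversion is just a matter of wrapping the column; a small sketch in legacy SQL using the same functions that appear in the queries here (table name as written in the question):
-- Sketch: emit timestamps as a number or a string so the result schema has no
-- TIMESTAMP (nested) fields when allowLargeResults is on.
SELECT
  TIMESTAMP_TO_MSEC(gg.date_time) AS date_time_msec,  -- numeric milliseconds
  STRING(gg.date_time) AS date_time_str               -- string form
FROM [my data set.table1_20140805] gg
LIMIT 10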
Now with allowLargeResults=true and flattenResults=false, and converting timestamps to numeric values in the first step:
select * from
(
select row_id AS debug_row_id,
hardware_id AS hardware_id,
debug_data AS debug_data,
connection_status AS connection_status,
date_time AS debug_date_time,
hardware_id2 AS hardware_id2,
latitude AS latitude,
longitude AS longitude,
date_time2 AS gps_date_time,
zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time2-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id_gps,
gg.hardware_id as hardware_id2,
gg.latitude as latitude,
gg.longitude as longitude,
TIMESTAMP_TO_MSEC(gg.date_time) as date_time2,
gg.zip_code as zip_code
from [test.gps32_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
TIMESTAMP_TO_MSEC(dd.date_time) as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [test.debug_data_developer_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id2
)
) WHERE row_num=1
it gives me
Query Failed
Error: Resources exceeded during query execution.
Job ID: red-road-574:job_ikWQvffmPEUP6DtTvJaYpXHFJ2M
This is the functioning SQL with allowLargeResults=true, flattenResults=true. I don't know what I did to make this work - maybe only adding a HAVING clause? But in the JOIN, I changed one side to be a whole table instead of the one with the decorator as above, so the amount of data involved actually increased. I'm not sure whether it will keep succeeding or whether it's just temporary luck.

How to do multiple join / group by selects using sqlite3?

I have a sqlite3 database with one table called orig:
CREATE TABLE orig (sdate date, stime integer, orbnum integer);
What I want to do is select the first date/time for each orbnum. The only problem is that stime holds the time as a very awkward integer.
Assuming a six-digit number, the first two digits show the hour, the third and fourth show the minutes, and the last two digits show the seconds. So a value of 12345 is 1:23:45, whereas a value of 123456 is 12:34:56.
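For reference, that packed integer can be unpacked into a zero-padded, sortable HH:MM:SS string with integer division and modulo; a sketch using SQLite's printf() (column names as in the orig table above):
-- Sketch: 12345 -> '01:23:45', 123456 -> '12:34:56'
SELECT
  orbnum,
  sdate,
  stime,
  printf('%02d:%02d:%02d',
         stime / 10000,        -- hours
         (stime / 100) % 100,  -- minutes
         stime % 100) AS stime_hms
FROM orig
LIMIT 10;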
I figured I'd do this using two nested join/group statements, but somehow I cannot get it to work properly. Here's what I've got so far:
select s.orbnum, s.sdate, s.stime
from (
select t.orbnum, t.sdate, t.stime, min(t.sdate) as minsdate
from (
select orbnum, sdate, stime, min(stime) as minstime
from scia group by orbnum, sdate
) as t inner join orig as s on s.stime = t.minstime and s.sdate = t.sdate and s.orbnum = t.orbnum
) as d inner join scia as s on s.stime = d.stime and s.sdate = minsdate and s.orbnum = d.orbnum
where s.sdate >= '2002-08-01' limit 0,200;
This is the error I get:
Error: no such column: t.orbnum
I'm sure it's just some stupid mistake, but actually, I'm quite new to SQL ...
Any help is greatly appreciated :)
Edit:
After fixing the obvious typo, the query runs -- but returns an empty result set. However, the table holds ~10yrs of data, with about 12 orbnums per day and about 4-5 different times per orbnum. So I guess there's some mistake in the logic of the query ...
In your last join, you have d, which is the result of your double nested select, and you join s on it. From there, t is not visible. That’s why you get the “no such column: t.orbnum” error. Maybe you meant s.orbnum = d.orbnum?
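As a side note, SQLite's documented bare-column behaviour can avoid the double join altogether: when a query contains a single MIN() aggregate, the other selected columns are taken from the row where that minimum was found. A sketch (assuming the table is orig, and zero-padding stime so the combined date+time key sorts correctly):
-- Sketch: for each orbnum, sdate and stime come from the row with the earliest date+time.
SELECT
  orbnum,
  sdate,
  stime,
  MIN(sdate || printf('%06d', stime)) AS first_key
FROM orig
WHERE sdate >= '2002-08-01'
GROUP BY orbnum
LIMIT 200;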