BigQuery query fails the first time and successfully completes the second time - google-bigquery

I'm executing the following query.
SELECT properties.os, boundary, user, td,
  SUM(boundary) OVER (ORDER BY rows) AS session
FROM (
  SELECT properties.os, ROW_NUMBER() OVER() AS rows, user, td,
    CASE WHEN td > 1800 THEN 1 ELSE 0 END AS boundary
  FROM (
    SELECT properties.os, t1.properties.distinct_id AS user,
      (t2.properties.time - t1.properties.time) AS td
    FROM (
      SELECT properties.os, properties.distinct_id, properties.time, srlno,
        srlno - 1 AS prev_srlno
      FROM (
        SELECT properties.os, properties.distinct_id, properties.time,
          ROW_NUMBER() OVER (PARTITION BY properties.distinct_id
                             ORDER BY properties.time) AS srlno
        FROM [ziptrips.ziptrips_events]
        WHERE properties.time > 1367916800
          AND properties.time < 1380003200)) AS t1
    JOIN (
      SELECT properties.distinct_id, properties.time, srlno,
        srlno - 1 AS prev_srlno
      FROM (
        SELECT properties.distinct_id, properties.time,
          ROW_NUMBER() OVER (PARTITION BY properties.distinct_id
                             ORDER BY properties.time) AS srlno
        FROM [ziptrips.ziptrips_events]
        WHERE properties.time > 1367916800
          AND properties.time < 1380003200)) AS t2
    ON t1.srlno = t2.prev_srlno
      AND t1.properties.distinct_id = t2.properties.distinct_id
    WHERE (t2.properties.time - t1.properties.time) > 0))
It fails the first time with the following error; however, on the second run it completes without any issue. I'd appreciate any pointers on what might be causing this.
The error message is:
Query Failed
Error: Field 'properties.os' not found in table '__R2'.
Job ID: job_VWunPesUJVLxWGZsMgpoti14BM4
Thanks,
Navneet

We (the BigQuery team) are in the process of rolling out a new version of the query engine that fixes a number of issues like this one. You likely hit the old version of the query engine first and then, when you retried, hit the new one. It may take us a day or so, with only a portion of traffic pointing at the updated version, to verify there aren't any regressions. Please let us know if you hit this again after 24 hours or so.

Related

SQL BigQuery - Error that variable is not grouped by even though it is

SQL Code:
SELECT community_table.community_name,
       community_table.id,
       DATE(timestamp) as date,
       ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
       lag(COUNT(distinct app_opened.user_id)) OVER
         (ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
  SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
  FROM (
    SELECT *
    FROM ***,
    UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
  )
  GROUP BY community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
  SELECT DISTINCT id, name as community_name
  FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
      EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
      community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused about what could be going wrong here, as the error says that timestamp is not grouped, even though I have grouped it at the bottom. I tried including just timestamp rather than DATE(timestamp), but that ruins the table I am trying to create, where I find the number of users on a single day. Does anyone have any other ideas? My goal is, for a single row, to get the previous row's data, but because I am grouping by specific metrics, I need to make sure the rows are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER part as:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function inside OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
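A minimal sketch of that workaround, with the joins elided and the column names taken from the question (an illustration of the pattern, not the asker's final query):
SELECT community_name,
       id,
       date,
       IFNULL(COUNT(DISTINCT user_id), 0) AS num_opened_DAU,
       LAG(COUNT(DISTINCT user_id)) OVER
         (ORDER BY community_name, id, date) AS pre_Value  -- references the alias; no DATE() call inside OVER
FROM (
  SELECT community_table.community_name AS community_name,
         community_table.id AS id,
         DATE(timestamp) AS date,                          -- DATE() computed once here, in the subquery
         app_opened.user_id AS user_id
  FROM *** app_opened
  -- ...same joins and WHERE clause as in the original query...
)
GROUP BY community_name, id, date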

SQL return only the next 1 row of same type after a certain row

I have 4 'Operations' called Start, Finish, Available, and Unavailable. Every time I see a row where 'Operation' = Available, I want to return only the next 1 row where the operation = 'Start' (while keeping the 'Finish' row for that same ID), until the next row where 'Operation' = Available (at which point I again want to return only the next 1 row where Operation = Start).
So starting with this dataset
Time ID Operation
6:34:50 AM 2016544 Finish
6:33:09 AM 2016544 Start
6:32:12 AM 2015289 Finish
6:32:07 AM 2015268 Finish
6:31:53 AM 2015834 Finish
6:31:39 AM 2015539 Finish
6:31:14 AM Available Available
6:31:12 AM Unavailable Unavailable
6:31:02 AM 2015289 Start
6:30:57 AM 2015268 Start
6:30:42 AM 2015834 Start
6:30:28 AM 2015539 Start
6:30:22 AM Available Available
I would like to get to this
Time ID Operation
6:34:50 AM 2016544 Finish
6:33:09 AM 2016544 Start
6:31:39 AM 2015539 Finish
6:31:14 AM Available Available
6:31:12 AM Unavailable Unavailable
6:30:28 AM 2015539 Start
6:30:22 AM Available Available
I don't fully follow the explanation, but your sample data and results suggest that you want the first row where a sequence of operations of the same type appears:
select t.*
from (select t.*, lag(operation) over (order by time) as prev_operation
      from t
     ) t
where prev_operation is null or prev_operation <> operation;
From your desired result I guess you want the last operation from each sequence of the same operations. The inner query uses the difference of two ROW_NUMBER() sequences as a group key for runs of the same operation, then keeps one row per group. Try:
select Time, ID, Operation
from (
  select Time, ID, Operation,
         row_number() over (partition by grp order by Time desc) rn
  from (
    select *,
           row_number() over (order by time) -
           row_number() over (partition by Operation order by time) grp
    from MyTable
  ) a
) a
where rn = 1

Netezza box reboots when the following query is executed

When I run the following query, my Netezza NPS reboots. Would someone please let me know what is causing this behaviour?
select avg(bse.WEEKS_BETWEEN_RESPONSES_HR) as g_AVG
     , sqlext.median(bse.WEEKS_BETWEEN_RESPONSES_HR) as g_med
from (
  select WEEKS_BETWEEN_RESPONSES_HR
  from (
    select distinct LOYALTY_ACCOUNT_CARD_ID
         , BONUS_END_DATE
         , LAG(BONUS_END_DATE, 1) OVER (partition by LOYALTY_ACCOUNT_CARD_ID order by BONUS_END_DATE) as PRIOR_BONUS_END_DATE
         , ((BONUS_END_DATE - PRIOR_BONUS_END_DATE) / 7) as WEEKS_BETWEEN_RESPONSES_HR
    from JO_ACT_PTD_STEP_1 bse
    where upper(bonus_desc) like '%SPEND%'
      and redemption = 1
  ) BSE
  where WEEKS_BETWEEN_RESPONSES_HR is not null and WEEKS_BETWEEN_RESPONSES_HR > 0
) bse
limit 500
You need to call the support people at IBM. There is probably a stack trace or a dump file somewhere that will tell them what happened.
If I were experiencing your problem, I would remove each of the function calls one by one and make the SQL simpler and simpler until the error disappeared.
But of course you will need to do that in the middle of the night, or at a time when nobody else is being bothered by the constant reboots.
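For example, a first simplification step might be to drop the sqlext.median() call and keep only the built-in avg(), to see whether the SQL extensions function is involved (an illustrative sketch of the debugging approach, not a diagnosis):
select avg(bse.WEEKS_BETWEEN_RESPONSES_HR) as g_AVG   -- sqlext.median() removed
from (
  select WEEKS_BETWEEN_RESPONSES_HR
  from (
    select distinct LOYALTY_ACCOUNT_CARD_ID
         , BONUS_END_DATE
         , LAG(BONUS_END_DATE, 1) OVER (partition by LOYALTY_ACCOUNT_CARD_ID order by BONUS_END_DATE) as PRIOR_BONUS_END_DATE
         , ((BONUS_END_DATE - PRIOR_BONUS_END_DATE) / 7) as WEEKS_BETWEEN_RESPONSES_HR
    from JO_ACT_PTD_STEP_1 bse
    where upper(bonus_desc) like '%SPEND%'
      and redemption = 1
  ) BSE
  where WEEKS_BETWEEN_RESPONSES_HR is not null and WEEKS_BETWEEN_RESPONSES_HR > 0
) bse
limit 500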

Processing of late events in SA

I was doing a test in which I generated data that was 30 days old.
When it was sent to the SA job, all of that input was dropped, but per the settings in the event ordering blade I was expecting that all of it would be passed through.
Part of the job query contains:
---------------all incoming events storage query
SELECT stream.*
INTO [iot-predict-SA2-ColdStorage]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
so my expectation is to have everything that was pushed to the SA job end up in blob storage.
When I sent events that were only 5 hours old, the input was marked as late (expected) and processed.
Per the screenshot, the first marked area shows outdated event input but no output (red); the second part shows the late events being processed.
Full query:
WITH AlertsBasedOnMin
AS (
SELECT stream.SensorGuid
,stream.Value
,stream.SensorName
,ref.AggregationTypeFlag
,ref.MinThreshold AS threshold
,ref.Count
,CASE
WHEN (ref.MinThreshold > stream.Value)
THEN 1
ELSE 0
END AS isAlert
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
WHERE ref.AggregationTypeFlag = 8
)
,AlertsBasedOnMax
AS (
SELECT stream.SensorGuid
,stream.Value
,stream.SensorName
,ref.AggregationTypeFlag
,ref.MaxThreshold AS threshold
,ref.Count
,CASE
WHEN (ref.MaxThreshold < stream.Value)
THEN 1
ELSE 0
END AS isAlert
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
WHERE ref.AggregationTypeFlag = 16
)
,alertMinMaxUnion
AS (
SELECT *
FROM AlertsBasedOnMin
UNION ALL
SELECT *
FROM AlertsBasedOnMax
)
,alertMimMaxComputed
AS (
SELECT SUM(alertMinMaxUnion.isAlert) AS EventCount
,alertMinMaxUnion.SensorGuid AS SensorGuid
,alertMinMaxUnion.SensorName
FROM alertMinMaxUnion
GROUP BY HoppingWindow(Duration(minute, 1), Hop(second, 30))
,alertMinMaxUnion.SensorGuid
,alertMinMaxUnion.Count
,alertMinMaxUnion.AggregationTypeFlag
,alertMinMaxUnion.SensorName
HAVING SUM(alertMinMaxUnion.isAlert) > alertMinMaxUnion.Count
)
,alertsMimMaxComputedMergedWithReference
AS (
SELECT System.TIMESTAMP [TimeStampUtc]
,computed.EventCount
,0 AS SumValue
,0 AS AvgValue
,0 AS StdDevValue
,computed.SensorGuid
,computed.SensorName
,ref.MinThreshold
,ref.MaxThreshold
,ref.TimeFrameInSeconds
,ref.Count
,ref.GatewayGuid
,ref.SensorType
,ref.AggregationType
,ref.AggregationTypeFlag
,ref.EmailList
,ref.PhoneNumberList
FROM alertMimMaxComputed computed
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = computed.SensorGuid
)
,alertsAggregatedByFunction
AS (
SELECT Count(1) AS eventCount
,stream.SensorGuid AS SensorGuid
,stream.SensorName
,ref.[Count] AS TriggerThreshold
,SUM(stream.Value) AS SumValue
,AVG(stream.Value) AS AvgValue
,STDEV(stream.Value) AS StdDevValue
,ref.AggregationTypeFlag AS flag
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
GROUP BY HoppingWindow(Duration(minute, 1), Hop(second, 30))
,ref.AggregationTypeFlag
,ref.[Count]
,ref.MaxThreshold
,ref.MinThreshold
,stream.SensorGuid
,stream.SensorName
HAVING
--as this is alert then this factor will be relevant to all of the aggregated queries
Count(1) >= ref.[Count]
AND (
--average
(
ref.AggregationTypeFlag = 1
AND (
AVG(stream.Value) >= ref.MaxThreshold
OR AVG(stream.Value) <= ref.MinThreshold
)
)
--sum
OR (
ref.AggregationTypeFlag = 2
AND (
SUM(stream.Value) >= ref.MaxThreshold
OR Sum(stream.Value) <= ref.MinThreshold
)
)
--stdev
OR (
ref.AggregationTypeFlag = 4
AND (
STDEV(stream.Value) >= ref.MaxThreshold
OR STDEV(stream.Value) <= ref.MinThreshold
)
)
)
)
,alertsAggregatedByFunctionMergedWithReference
AS (
SELECT System.TIMESTAMP [TimeStampUtc]
,0 AS EventCount
,computed.SumValue
,computed.AvgValue
,computed.StdDevValue
,computed.SensorGuid
,computed.SensorName
,ref.MinThreshold
,ref.MaxThreshold
,ref.TimeFrameInSeconds
,ref.Count
,ref.GatewayGuid
,ref.SensorType
,ref.AggregationType
,ref.AggregationTypeFlag
,ref.EmailList
,ref.PhoneNumberList
FROM alertsAggregatedByFunction computed
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = computed.SensorGuid
)
,allAlertsUnioned
AS (
SELECT *
FROM alertsAggregatedByFunctionMergedWithReference
UNION ALL
SELECT *
FROM alertsMimMaxComputedMergedWithReference
)
---------------alerts storage query
SELECT *
INTO [iot-predict-SA2-Alerts-ColdStorage]
FROM allAlertsUnioned
---------------alerts to alert events query
SELECT *
INTO [iot-predict-SA2-Alerts-EventStream]
FROM allAlertsUnioned
---------------alerts to stream query
SELECT *
INTO [iot-predict-SA2-TSI-EventStream]
FROM allAlertsUnioned
---------------all incoming events storage query
SELECT stream.*
INTO [iot-predict-SA2-ColdStorage]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
---------------all incoming events to time insights query
SELECT stream.*
INTO [iot-predict-SA2-TSI-AlertStream]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
Since you are using "TIMESTAMP BY", the Stream Analytics job's event ordering settings take effect. Please check your job's "event ordering" settings, specifically the two below:
Events that arrive late -- the late arrival limit, between 0 seconds and 21 days.
Handling other events -- the error handling policy: drop the event, or adjust the application time to the system clock time.
I guess that, most likely, your late arrival limit was more than 5 hours, so those 5-hour-old events could be processed.
You may already have figured out from the above that a Stream Analytics job can only process "old" events up to 21 days late. To work around this limitation, you can consider one of the options below (a sketch of option 3 follows the list):
Remove TIMESTAMP BY; then all your windowed aggregates will use the enqueue time. This might generate incorrect results according to your query logic.
Select "adjust" as the error handling policy. Again, this might generate incorrect results according to your query logic.
Shift the application time (stream.UtcTime) to a more recent time by using the DATEADD() function, for example TIMESTAMP BY DATEADD(day, 10, UtcTime). This works well when this is a one-time task and you know the time range of your events.
Use a batch job (outside Stream Analytics) to process the data that is 30 days old.
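For illustration, option 3 applied to the cold-storage query from the question might look like this (the 30-day shift is an assumption matching the age of the test data, not something from the original post):
---------------all incoming events storage query, application time shifted forward by 30 days
SELECT stream.*
INTO [iot-predict-SA2-ColdStorage]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY DATEADD(day, 30, stream.UtcTime)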
After a chat with the folks from MS, it emerged that my test needed an extra step.
To have late events processed regardless of the late event settings, we need to start the job in such a way that the late events count as having been sent after the job start time; in this particular case, we have to start the SA job using a custom start date and set it to 30 days ago.

Bigquery: "Not enough memory"

BigQuery started to give me the error "not enough memory" when I ran this query this morning. The two tables involved contain no more than 5 GB of data. Plus, I'm using table decorators; 1407249067530 equals around 10:30 am today (20140805). I wonder what the problem is.
Job ID: red-road-574:job_x8flLfo4QwA1gQ_FCrNWbKY-bZM
select * from
(
select t_connection.row_id AS debug_row_id,
t_connection.hardware_id AS hardware_id,
t_connection.debug_data AS debug_data,
t_connection.connection_status AS connection_status,
t_connection.date_time AS debug_date_time,
t_gps.hardware_id AS hardware_id2,
t_gps.latitude AS latitude,
t_gps.longitude AS longitude,
t_gps.date_time AS gps_date_time,
t_gps.zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id,
gg.hardware_id as hardware_id,
gg.latitude as latitude,
gg.longitude as longitude,
gg.date_time as date_time,
gg.zip_code as zip_code
from [my data set.table1_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
dd.date_time as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [my data set.table2_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id
)
) WHERE row_num=1
You're hitting an odd corner case. When you use allowLargeResults with results that are nested or repeated and you don't use flattenResults=false, the query goes into a special mode. (when you use timestamps, you're really using a nested data structure, which was a design decision that spawned 1000 bugs and is hopefully changing soon). This special query mode has some limitations, which are what you're hitting.
In general, we want this to be seamless, which is why it isn't documented. However, since you're running into a problem here, I'll explain a little about how to avoid it.
You have a couple of options to get around this:
If you're using nested or repeated results (it looks like you're not, which is good):
Rename your result fields so the names don't contain dots.
Set the flattenResults field on the query to 'false'. This means that nested and repeated fields will actually be nested and repeated in the results.
If you're using timestamps in the results:
Convert your timestamps to strings or numeric values. Sorry.
If you don't really need large results:
Unset the allowLargeResults flag.
I realize that all of these options are deeply unsatisfying. This is an area we're actively working to improve.
Now, with allowLargeResults=true and flattenResults=false, and with the timestamps converted to numeric values as a first step:
select * from
(
select row_id AS debug_row_id,
hardware_id AS hardware_id,
debug_data AS debug_data,
connection_status AS connection_status,
date_time AS debug_date_time,
hardware_id2 AS hardware_id2,
latitude AS latitude,
longitude AS longitude,
date_time2 AS gps_date_time,
zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time2-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id_gps,
gg.hardware_id as hardware_id2,
gg.latitude as latitude,
gg.longitude as longitude,
TIMESTAMP_TO_MSEC(gg.date_time) as date_time2,
gg.zip_code as zip_code
from [test.gps32_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
TIMESTAMP_TO_MSEC(dd.date_time) as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [test.debug_data_developer_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id2
)
) WHERE row_num=1
It gives me:
Query Failed
Error: Resources exceeded during query execution.
Job ID: red-road-574:job_ikWQvffmPEUP6DtTvJaYpXHFJ2M
This is the functioning SQL with allowLargeResults=true and flattenResults=true. I don't know what I did to make it work; maybe only adding a HAVING clause? But in the JOIN I changed one side to be a whole table instead of the one with the decorator as above, so the data involved actually increased. I'm not sure whether it will keep succeeding or whether it's just temporary luck.