BigQuery: "Not enough memory"

BigQuery started giving me the error "not enough memory" when I ran this query this morning. The two tables involved contain no more than 5 GB of data. Also, I'm using table decorators; 1407249067530 corresponds to around 10:30 AM today (2014-08-05). I wonder what the problem is.
Job ID: red-road-574:job_x8flLfo4QwA1gQ_FCrNWbKY-bZM
select * from
(
select t_connection.row_id AS debug_row_id,
t_connection.hardware_id AS hardware_id,
t_connection.debug_data AS debug_data,
t_connection.connection_status AS connection_status,
t_connection.date_time AS debug_date_time,
t_gps.hardware_id AS hardware_id2,
t_gps.latitude AS latitude,
t_gps.longitude AS longitude,
t_gps.date_time AS gps_date_time,
t_gps.zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from (
select *,
ABS(t_gps.date_time-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id,
gg.hardware_id as hardware_id,
gg.latitude as latitude,
gg.longitude as longitude,
gg.date_time as date_time,
gg.zip_code as zip_code
from [my data set.table1_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
dd.date_time as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [my data set.table2_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id
)
) WHERE row_num=1

You're hitting an odd corner case. When you use allowLargeResults with results that are nested or repeated and you don't use flattenResults=false, the query goes into a special mode. (When you use timestamps, you're really using a nested data structure, which was a design decision that spawned a thousand bugs and is hopefully changing soon.) This special query mode has some limitations, which are what you're hitting.
In general, we want this to be seamless, which is why it isn't documented. However, since you're running into a problem here, I'll explain a little about how to avoid it.
You have a couple of options to get around this:
If you're using nested or repeated results (it looks like you're not, which is good):
rename your result fields so they don't have dots in the name.
set the flattenResults field on the query to 'false'. This means that nested and repeated fields will be actually nested and repeated in the results.
If you're using timestamps in the results:
Convert your timestamps to strings or numeric values. Sorry.
If you don't really need large results:
unset the allowLargeResults flag.
I realize that all of these options are deeply unsatisfying. This is an area we're actively working to improve.

Now, with allowLargeResults=true and flattenResults=false, and converting timestamps to numeric values in the first step:
select * from
(
select row_id AS debug_row_id,
hardware_id AS hardware_id,
debug_data AS debug_data,
connection_status AS connection_status,
date_time AS debug_date_time,
hardware_id2 AS hardware_id2,
latitude AS latitude,
longitude AS longitude,
date_time2 AS gps_date_time,
zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from (
select *,
ABS(t_gps.date_time2-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id_gps,
gg.hardware_id as hardware_id2,
gg.latitude as latitude,
gg.longitude as longitude,
TIMESTAMP_TO_MSEC(gg.date_time) as date_time2,
gg.zip_code as zip_code
from [test.gps32_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
TIMESTAMP_TO_MSEC(dd.date_time) as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [test.debug_data_developer_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id2
)
) WHERE row_num=1
it gives me:
Query Failed
Error: Resources exceeded during query execution.
Job ID: red-road-574:job_ikWQvffmPEUP6DtTvJaYpXHFJ2M

This is the functioning SQL with allowLargeResults=true, flattenResults=true. I don't know what I did to make this work; maybe it was only adding a HAVING clause? But in the JOIN, I changed one side to be a whole table instead of the decorated one above, so the amount of data involved actually increased. I'm not sure whether it will keep succeeding or whether it's just temporary luck.

Related

How to run SQL queries with multiple WITH clauses (sub-query refactoring)?

I have a code block that has 7–8 WITH clauses (sub-query refactoring) in its queries. I'm looking for how to run this query, as I'm getting 'SQL compilation error' messages when running these in Snowflake. For example:
with valid_Cars_Stock as (
select car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
)
, car_sale_hist as (
select vw.issue_id, vw.delivery_effective_ts, bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
)
,
So how do I run these 2 queries, separately or together? I'm new to SQL as well; any help would be ideal.
So let's just reformat that code as it stands:
with valid_Cars_Stock as (
select
car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
), car_sale_hist as (
select
vw.issue_id,
vw.delivery_effective_ts,
bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw
on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b
on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
),
These are clearly part of a larger block of code.
As an aside on CTEs: you should 100% ignore anything anyone (including me) says about their performance. They are two things: syntactic sugar, and a way to avoid repetition, thus the Common Table Expression name. Anyway, they CAN perform better than temp tables, AND they CAN perform worse than just repeating the same SQL many times in the same block. There is no one rule. Testing is the only way to find what is "fastest" for your SQL, and it can and does change as updates/releases are made. So I'm ignoring the performance comments.
If I am trying to run a chain like this to debug it, I alter the point where I would like to stop, normally like so:
with valid_Cars_Stock as (
select
car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
)--, car_sale_hist as (
select
vw.issue_id,
vw.delivery_effective_ts,
bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw
on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b
on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
;--), NEXT_AWESOME_CTE_THAT_TOTALLY_MAKES_SENSE (
-- .....
And now the result of car_sale_hist will be returned, because we "completed" the CTE chain by not "starting another", and the ; stops everything after it from being treated as part of the same SQL block.
Then once you have that step working nicely, remove the semicolon and the end-of-line comments, and get on with the real work.
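For completeness, here is what running the chain to the end would look like: a WITH chain only executes when a final statement consumes it, so the truncated block from the question needs one closing SELECT. This is just a sketch; the question's bm alias is never defined and the later CTEs are cut off, so I've assumed car_id and sprint_id live on agile_car_fact, wired the join to the CTE alias i, and ended the chain at car_sale_hist:
with valid_Cars_Stock as (
    select car_id
    from vw_standard.agile_car_issue_dime
    where car_stock_expiration_ts is null
    and car_type_name in ('hatchback')
    and car_id = 1102423975
), car_sale_hist as (
    select
        vw.issue_id,
        vw.delivery_effective_ts,
        vw.car_id,
        lag(vw.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
    from valid_Cars_Stock i
    join vw_standard.agile_car_fact vw
        on vw.car_id = i.car_id
)
-- the final statement is what actually runs the whole chain
select *
from car_sale_hist;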

Select only the row with the max value, but the column with this info is a SUM()

I have the following query:
SELECT DISTINCT
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI,
SUM(VLRNOTA) AS AMOUNT
FROM TGFCAB CAB, TGFPAR PAR, TSIBAI BAI
WHERE CAB.CODPARC = PAR.CODPARC
AND PAR.CODBAI = BAI.CODBAI
AND CAB.TIPMOV = 'V'
AND STATUSNOTA = 'L'
AND PAR.CODCID = 5358
GROUP BY
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI
The result is this (screenshot omitted; company names and neighborhoods hidden for obvious reasons).
The query at the moment, for those who don't understand Latin languages, is giving me the clients' company name, company neighborhood, and the total value of their movements.
In the WHERE clause it is filtering only sales movements of companies from a given city.
But if you notice, in the SELECT statement the column that is returning the aggregated total value of sales is a SUM().
My goal is to return only the company that has the maximum value in this column; if it's a tie, display both of them.
This is where I'm struggling, because I can't seem to find a simple solution. I tried to use
WHERE AMOUNT = MAX(AMOUNT)
but, as expected, it didn't work.
You tagged the question with a whole bunch of different databases; do you really use all of them?
Because "PL/SQL" reads as "Oracle". If that's so, here's one option.
with temp as
-- this is your current query
(select columns,
sum(vlrnota) as amount
from ...
where ...
)
-- query that returns what you asked for
select *
from temp t
where t.amount = (select max(a.amount)
from temp a
);
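Applied to the query from the question (a sketch, with the table and column names copied straight from there), that pattern becomes:
with temp as
-- this is the current query
(select cab.codparc,
        par.razaosocial,
        bai.nomebai,
        sum(vlrnota) as amount
 from tgfcab cab, tgfpar par, tsibai bai
 where cab.codparc = par.codparc
   and par.codbai = bai.codbai
   and cab.tipmov = 'V'
   and statusnota = 'L'
   and par.codcid = 5358
 group by cab.codparc, par.razaosocial, bai.nomebai
)
-- keep only the group(s) whose total equals the overall maximum; ties survive
select *
from temp t
where t.amount = (select max(a.amount)
                  from temp a
                 );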
You should be able to achieve the same without the need for a subquery, using a window OVER() function:
WITH T AS (
SELECT
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI,
SUM(VLRNOTA) AS AMOUNT,
MAX(SUM(VLRNOTA)) over() AS MAMOUNT
FROM TGFCAB CAB
JOIN TGFPAR PAR ON PAR.CODPARC = CAB.CODPARC
JOIN TSIBAI BAI ON BAI.CODBAI = PAR.CODBAI
WHERE CAB.TIPMOV = 'V'
AND STATUSNOTA = 'L'
AND PAR.CODCID = 5358
GROUP BY CAB.CODPARC, PAR.RAZAOSOCIAL, BAI.NOMEBAI
)
SELECT CODPARC, RAZAOSOCIAL, NOMEBAI, AMOUNT
FROM T
WHERE AMOUNT=MAMOUNT
Note that it's usually (always) beneficial to join tables using clear, explicit JOIN syntax. This should be fine cross-platform between Oracle & SQL Server.
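A third option (not shown in the answers above) is to rank the aggregated sums directly; RANK() gives every tied top group rank 1, which matches the "if it's a tie, display both" requirement:
WITH T AS (
  SELECT
    CAB.CODPARC,
    PAR.RAZAOSOCIAL,
    BAI.NOMEBAI,
    SUM(VLRNOTA) AS AMOUNT,
    -- rank the groups by their total; ties share rank 1
    RANK() OVER (ORDER BY SUM(VLRNOTA) DESC) AS RNK
  FROM TGFCAB CAB
  JOIN TGFPAR PAR ON PAR.CODPARC = CAB.CODPARC
  JOIN TSIBAI BAI ON BAI.CODBAI = PAR.CODBAI
  WHERE CAB.TIPMOV = 'V'
    AND STATUSNOTA = 'L'
    AND PAR.CODCID = 5358
  GROUP BY CAB.CODPARC, PAR.RAZAOSOCIAL, BAI.NOMEBAI
)
SELECT CODPARC, RAZAOSOCIAL, NOMEBAI, AMOUNT
FROM T
WHERE RNK = 1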

Change duplicate value in a column

Can you please tell me what SQL query I can use to change duplicates in one column of my table?
I found these duplicates:
SELECT Model, count(*) FROM Devices GROUP BY model HAVING count(*) > 1;
I was looking for information on exactly how to change one of the duplicate values, but unfortunately I did not find an option that fits my case; the information that is available in abundance is all about deleting the duplicate rows, which I don't need. I'm not strong in SQL at all. I ask for help. Thank you so much.
You can easily use a window function such as ROW_NUMBER() with the partitioning option in order to group by the Model column and spot the duplicates, and then pick the first rows (rn = 1) returned from the subquery, such as:
WITH d AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Model ORDER BY ID) AS rn
FROM Devices
)
SELECT ID, Model -- , and the other columns
FROM d
WHERE rn = 1
Use EXISTS as follows:
update d
set Model = '-'
from Devices d
where exists (select 1 from Devices dd where dd.model = d.model and dd.id > d.id)
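The two approaches can also be combined: number the rows within each Model and rewrite everything past the first one. A sketch assuming SQL Server, where an UPDATE through a CTE is allowed:
WITH d AS
(
  SELECT Model, ROW_NUMBER() OVER (PARTITION BY Model ORDER BY ID) AS rn
  FROM Devices
)
UPDATE d
SET Model = '-'   -- keeps the first row of each Model, rewrites the rest
WHERE rn > 1;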
After the command:
SELECT Model, count(*) FROM Devices GROUP BY model HAVING count(*) > 1;
I get the result:
1895 rows = NULL;
3383 rows with duplicate values;
and together these cover 1243 distinct values.
After applying your command:
update Devices set
Model = '-'
where id not in
(select
min(Devices.id)
from Devices
group by Devices.Model)
I got 4035 rows changed.
If you count, it works out: (3383 + 1895) = 5278, and 5278 - 1243 = 4035.
It seems like everything fits together; the result suits me; it works.

SQL BigQuery - Error that variable is not grouped by even though it is

SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message: (screenshot omitted)
I am quite confused about what could be going wrong here, as the error says that timestamp is not grouped, even though I have grouped it at the bottom. I tried including just timestamp rather than DATE(timestamp), but that ruins the table data I am trying to create, where I find the number of users on a single day. Does anyone have any other ideas? My goal is, for a single row, to get the previous row's data; but because I am grouping by specific metrics, I need to make sure the rows are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER part as:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function within OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
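In other words, something along these lines (a sketch only; the real table names are redacted in the question, so the inner SELECT stands in for the original FROM/JOIN block):
SELECT community_name,
       id,
       date,
       COUNT(DISTINCT user_id) AS num_opened_DAU,
       LAG(COUNT(DISTINCT user_id))
         OVER (ORDER BY community_name, id, date) AS pre_Value
FROM (
  -- compute DATE(timestamp) once here, so the OVER clause only sees the alias
  SELECT community_name, id, user_id, DATE(timestamp) AS date
  FROM ...   -- the original FROM/JOIN block goes here
)
GROUP BY community_name, id, date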

Aggregate values and Pivot

I am partly on my way to solving this, but have hit a stumbling block which I think can be solved with pivot(s).
I have the following SQL query, combining two temporary table variables (I may change these to temporary tables, as I think performance may become a problem since they will be hit a large number of times):
SELECT MeterId, MeterDataOutput.BuildingId, MeterDataOutput.Value,
MeterDataOutput.TimeStamp, UtilityId, SnapshotId
FROM #MeterDataOutput as MeterDataOutput INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
This produces the following table (screenshot omitted):
I have then modified the query to group by BuildingId, SnapshotId, Timestamp and UtilityId, and applied the SUM() function to aggregate the Value field (and dropped MeterId as it's not required), as follows:
SELECT MeterDataOutput.BuildingId, SUM(MeterDataOutput.Value) AS Value, MeterDataOutput.TimeStamp, UtilityId, SnapshotId
FROM #MeterDataOutput as MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
GROUP BY MeterDataOutput.BuildingId, MeterDataOutput.TimeStamp, UtilityId, SnapshotId
This query then provides me with the following table (screenshot omitted):
Now the bit I'm having trouble with is transforming the UtilityId values into columns, and placing the values from the Value field under each column, i.e.:
For reference: BuildingId, Timestamp, SnapshotId and Value are variable. UtilityId value 6 is always 'Electricity', 7 is always 'Gas' and 8 is always 'Water'.
I'm actually starting to get the hang of this SQL lark :)
Maybe something like this:
SELECT
pvt.BuildingId,
pvt.SnapshotId,
pvt.TimeStamp,
pvt.[6] AS Electricity,
pvt.[7] AS Gas,
pvt.[8] AS Water
FROM
(
SELECT
MeterDataOutput.BuildingId,
MeterDataOutput.Value,
MeterDataOutput.TimeStamp,
UtilityId,
SnapshotId
FROM #MeterDataOutput as MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
) AS SourceTable
PIVOT
(
SUM(Value)
FOR UtilityId IN ([6],[7],[8])
) AS pvt
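Should PIVOT ever get in the way (it is T-SQL specific), the same shape can be produced with plain conditional aggregation. A sketch of that more portable equivalent, built on the same temp tables:
SELECT
    MeterDataOutput.BuildingId,
    SnapshotId,
    MeterDataOutput.TimeStamp,
    -- CASE without ELSE yields NULL, which SUM() ignores
    SUM(CASE WHEN UtilityId = 6 THEN MeterDataOutput.Value END) AS Electricity,
    SUM(CASE WHEN UtilityId = 7 THEN MeterDataOutput.Value END) AS Gas,
    SUM(CASE WHEN UtilityId = 8 THEN MeterDataOutput.Value END) AS Water
FROM #MeterDataOutput AS MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
    ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
    AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
GROUP BY MeterDataOutput.BuildingId, SnapshotId, MeterDataOutput.TimeStamp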