Accessing query parameters for BigQuery SQL queries - google-bigquery

We can query BigQuery with parameterized queries, such as this one:
SELECT
T1.product AS product_0,
T1.date AS date_1,
SUM(COALESCE(T1.quantity, @default_value_0)) AS Quantity_QE_SUM_2
FROM `project.dataset.table` AS T1
GROUP BY T1.product, T1.date
Looking at the query history, I can find this SQL, but I have not found a way to see the values that were provided for default_value_0, or for any other parameters in more complex queries.
This is blocking my investigation of wrong results.

Here is my example:
bq query \
--use_legacy_sql=false \
--parameter=corpus::romeoandjuliet \
--parameter=min_word_count:INT64:250 \
'SELECT
word,
word_count
FROM
`bigquery-public-data.samples.shakespeare`
WHERE
corpus = @corpus
AND
word_count >= @min_word_count
ORDER BY
word_count DESC;'
This runs a parameterized query in the bq command-line tool. Two parameters are used:
--parameter=corpus::romeoandjuliet
--parameter=min_word_count:INT64:250
The flag format is name:type:value; when the type is left empty, as in corpus::romeoandjuliet, it defaults to STRING.
If you look at the Personal History page, you can see the query as follows:
SELECT
word,
word_count
FROM
`bigquery-public-data.samples.shakespeare`
WHERE
corpus = @corpus
AND
word_count >= @min_word_count
ORDER BY
word_count DESC;
At the bottom of the Query Job Details window, click Open as new query.
In the opened query window, select your parameters to see their values, like so:
SELECT @corpus, @min_word_count;
It is not the most straightforward option, but it helps for immediate troubleshooting.
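Alternatively, the parameter values are recorded on the job itself. As a sketch (assuming you have copied the job ID from the query history), the bq tool can print the full job configuration; for parameterized queries, the queryParameters array in the output lists each parameter's name, type, and value:
bq show --format=prettyjson -j <job_id>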

Related

SQL GROUP BY taking too long for multiple columns

It takes a reasonable amount of time to run the following query and get the results:
SELECT {col1}, COUNT(col3) FROM {table1} GROUP BY {col1} LIMIT 100
But when I run it for multiple columns (as below), it takes forever!
SELECT {col1}, {col2}, COUNT(col3) FROM {table1} GROUP BY {col1}, {col2} LIMIT 100
I looked for ways to make it more efficient, but I did not find anything that works for me. I am posting to get new thoughts/directions. Thanks!
{table1} is relatively huge (~12B rows).
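A sketch of one direction, which is not from the original thread: if only the most frequent column combinations are needed, approximate aggregation can sidestep the full shuffle that an exact two-column GROUP BY over ~12B rows requires. The table and column names below are the question's placeholders:
#standardSQL
-- A sketch: APPROX_TOP_COUNT returns the N most frequent values without an
-- exact GROUP BY; concatenating the two columns approximates the pair grouping.
SELECT APPROX_TOP_COUNT(CONCAT(CAST(col1 AS STRING), '|', CAST(col2 AS STRING)), 100) AS top_pairs
FROM `project.dataset.table1`;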

Oracle: I can find a query in the shared pool, but I cannot find it in the source code. Where does it come from?

With the SQL below I can find a specific query in the Oracle shared pool that takes 10s and is executed approximately once every 10 minutes, but I cannot find this query in the source code.
SELECT sql_id, hash_value, plan_hash_value, child_number, executions,
round(buffer_gets/executions) AS buffer_per_exec,
round(physical_read_bytes/executions/8192) AS phys_per_exec,
round(elapsed_time/executions/1000) as elapsed_time,
TO_CHAR (last_active_time, 'DD/Mon/YYYY HH24:MI:SS') as last_active_time,
IS_OBSOLETE, IS_BIND_SENSITIVE, IS_BIND_AWARE, IS_SHAREABLE
FROM v$sql
WHERE sql_text NOT LIKE '%v$sql%'
AND round(buffer_gets/executions) > 40000
AND executions > 0
ORDER BY to_date(last_active_time) DESC, elapsed_time, phys_per_exec DESC;
When I execute
select * from table(dbms_xplan.display_cursor(<sql_id>, <child_id>, 'allstats peeked_binds last'));
I get the output:
SELECT "A1"."T1_ID","A1"."T4_ID","A2"."T1_ID","A2"."T3_ID"
FROM "TABLE1" "A1","TABLE2" "A2" WHERE "A2"."T1_ID"="A1"."T1_ID"
What I wonder is that whenever I run display_cursor for any other query, I don't get the result wrapped in double quotes.
If I look it up in v$session with:
select v.* from v$session v
where <sql_id> in (v.prev_sql_id, v.sql_id);
I get the SID, SERIAL# and so on, but I see it as an ORACLE.EXE program.
Who is calling this query? Where can I find it? Is it possible that the query is executed through a database link? Why is the query in the shared pool in double quotes?
This is typical of remote queries executed via a database link, e.g.
select * from myschema.mytable#otherdb
will be rewritten and executed on the remote side by the database gateway as:
select "A1"."COL1",
"A1"."COL2"
from "MYSCHEMA"."MYTABLE" "A1"
You can confirm this by inspecting the machine column of the V$SESSION view.
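As a sketch, the check could look like this (all columns below are standard V$SESSION columns; substitute the SQL_ID you found):
-- Sessions arriving over a database link typically show the remote database
-- server in MACHINE and an Oracle server process in PROGRAM.
select sid, serial#, username, machine, program, module, osuser
from v$session
where <sql_id> in (sql_id, prev_sql_id);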

What is the max limit of group_concat/string_agg output in BigQuery?

I am using group_concat/string_agg on string (varchar-like) data and want to ensure that BigQuery won't drop any of the concatenated data.
BigQuery will not silently drop data: if a particular query runs out of memory, you will get an error instead. You should try to keep your row sizes below ~100MB, since beyond that you'll start getting errors. You can try creating a large string with an example like this:
#standardSQL
SELECT STRING_AGG(word) AS words FROM `bigquery-public-data.samples.shakespeare`;
There are 164,656 rows in this table, and this query creates a string with 1,168,286 characters (around a megabyte in size). You'll start to see an error if you run a query that requires more than something on the order of hundreds of megabytes on a single node of execution, though:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
This results in an error:
Resources exceeded during query execution.
If you click on the "Explanation" tab in the UI, you can see that the failure happened during stage 1 while building the results of STRING_AGG. In this case, the string would have been 3,303,599,000 characters long, or approximately 3.3 GB in size.
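To check how large the aggregated string would be before building it, a sketch against the same public table just sums the string lengths (plus one character per row boundary for STRING_AGG's default ',' separator):
#standardSQL
-- A sketch: estimate the size of the would-be concatenated string.
SELECT SUM(LENGTH(CONCAT(word, corpus))) + COUNT(*) - 1 AS approx_chars
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));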
Adding to Elliot's answer - how to fix:
This query (Elliot's) fails:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
But you can LIMIT the number of strings concatenated to get a working solution:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus) LIMIT 10) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
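As a usage note, without an ORDER BY the 10 surviving strings are arbitrary; STRING_AGG also accepts an ORDER BY clause ahead of the LIMIT to make the result deterministic:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus) ORDER BY word LIMIT 10) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));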

A query job with 228 rows of results writes 0 rows to the table when allow large results is True

I have a SQL query that, when I write the results to a table without 'allow large results' set, writes 228 rows.
When I set allow large results, however, the destination table contains 0 rows. Both attempts use write disposition WRITE_TRUNCATE.
I see this both using the API and in the BigQuery console.
The working no-allow-large-results job:
eagTEiR0wSMK6b5WLSL04vB9RfTUb8bhvEi1YFWjuhfaF_W0zEeLogxUYwOrhGyOheS_CyyaB1dUeafGPdyR592xMcbeEmpJ85_CO29PSbBAnmEBGHJVHWjpH5DvGyVCEjarfJ5XUQ9UmVT_FSHmkcEZktbfln9E_E1jobM65IuQv2sP4_r7eqK60aPaqxD7taEc1bpM2kS6GAtkxqFsUUOv_JXQgTn3ebCodHFKsdquhy3e1mfbu4QhqnoO5QCi
The non-working allow-large-results job:
G40HW4Z5zGTgL1NSCBBy380kY7Gu7WOU7s_zB9F8Kdrtao2gbzRLptWSSi76MC2gHCHPG0srssaGejfCIN4j1upjyh9vQnA3kPmuJcgm5ZgdYd3YwsmGzvcBXiPy9bY0x0GRhJXimHqhKiYbKz7fa3LljOb4kxNvB8wPazqeYj3xAXwbV8G2Sl3L6gmutvvYPalhd1CCtUbLfiw520_I4zKDgn7LYosyFjA0h9TwR8GQ80Scd5n8yKAsIEou7XDG
Query:
SELECT t1.email, MIN(t1.min_created_time), GROUP_CONCAT(t1.id)
FROM (
SELECT email, MIN(created) as min_created_time, id
FROM TABLE_QUERY([xxxxx], 'table_id in ("yyyyyy_201601", "yyyyyy_201602", "yyyyyy_201603", "yyyyyy_201604")')
WHERE created >= "2016-01-11 00:00:00" AND created < "2016-04-01 00:00:00" AND id != "null" AND name LIKE "%trike%"
GROUP BY email, id
) t1
GROUP EACH BY t1.email
IGNORE CASE
Also note that a simpler query works in both cases, such as:
select email from xxxx group by email limit 100
This looks like a problem due to IGNORE CASE. The fix is underway, but in the meantime you can wrap string comparisons in LOWER() calls, i.e.
LOWER(id) != "null"
LOWER(name) LIKE "%trike%"
etc.
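Applied to the original query, the workaround would look roughly like this (a sketch: LOWER() on the grouping key stands in for what IGNORE CASE did for GROUP EACH BY, the IGNORE CASE clause itself is dropped, and the table placeholders are the asker's):
SELECT t1.email, MIN(t1.min_created_time), GROUP_CONCAT(t1.id)
FROM (
SELECT LOWER(email) AS email, MIN(created) AS min_created_time, id
FROM TABLE_QUERY([xxxxx], 'table_id in ("yyyyyy_201601", "yyyyyy_201602", "yyyyyy_201603", "yyyyyy_201604")')
WHERE created >= "2016-01-11 00:00:00" AND created < "2016-04-01 00:00:00"
AND LOWER(id) != "null" AND LOWER(name) LIKE "%trike%"
GROUP BY email, id
) t1
GROUP EACH BY t1.email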

Finding top 10 occurrences in data

I am trying to find the top 10 mentions (@xxxxx) in my Twitter data. I have created the initial table twitter.full_text_ts and loaded it with my data.
create table twitter.full_text_ts as
select id, cast(concat(substr(ts,1,10), ' ', substr(ts,12,8)) as timestamp) as ts, lat, lon, tweet
from full_text;
I've been able to extract the mentions from the tweets by using this query:
select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts
order by patterns desc
limit 50;
Executing this gives me:
USER_a3ed4b5a 2010-03-07 03:46:23 fffed220
USER_dc8cfa6f 2010-03-05 18:28:39 fffdabf9
USER_dc8cfa6f 2010-03-05 18:32:55 fffdabf9
USER_915e3f8c 2010-03-07 03:39:09 fffdabf9
and so on...
You can see that fffed220 etc. are the extracted patterns.
Now what I would like to do is count the number of times each of these mentions (patterns) occurs and output the top 10. For example, fffdabf9 occurs 20 times, fffxxxx occurs 17 times, and so on.
The most readable way to do this would be to save your first query into a temporary table, then do a GROUP BY on the temp table:
create table tmp as
select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts;

select patterns, count(*) as n_mentions
from tmp
group by patterns
order by n_mentions desc
limit 10;
Or you can do it in a single query with a CTE; note that the ORDER BY ... LIMIT 50 from the exploratory query is dropped here, so the counts run over all rows:
with mentions as
(select id, ts,
regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts)
select patterns, count(*) as n
from mentions
group by patterns
order by n desc
limit 10;