Use calculated value in BQ - dynamic-sql

I need to use a calculated value called query_size in a select query in BQ.
It is calculated as a difference between two values.
query_size = 10 - 1
SELECT
store,
count(*) as no_of_purchases
FROM
`project.acq.purchase_table`
group by store
ORDER BY store DESC LIMIT query_size;
I have tried something like this; it has no syntax errors, but BQ cannot estimate its cost: WARNING: Could not compute bytes processed estimate for script.
DECLARE query_size FLOAT64;
SET query_size = SAFE_SUBTRACT(10, 1);
EXECUTE IMMEDIATE format("""SELECT store, count(*) as no_of_purchases FROM `project.acq.purchase_table` group by store ORDER BY store DESC LIMIT %f""", query_size);
Also, query_size should be an integer (I used FLOAT64 because I couldn't find the integer replacement for %f in the format syntax).
Could you please advise?
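One approach that should work, as a minimal sketch against the same table: declare the variable as INT64 (plain subtraction is fine, no SAFE_SUBTRACT needed), and use %d, FORMAT's integer specifier, in place of %f:
DECLARE query_size INT64;
SET query_size = 10 - 1;
EXECUTE IMMEDIATE FORMAT("""
SELECT store, COUNT(*) AS no_of_purchases
FROM `project.acq.purchase_table`
GROUP BY store
ORDER BY store DESC
LIMIT %d""", query_size);
As far as I can tell, the bytes-processed warning is expected for any script: the estimator cannot dry-run a query that is only built at execution time, so the warning is informational rather than an error.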

Athena pagination and performance issue

I have a huge data set in S3, and I am trying to query it using AWS Athena; the 3 parameters below are inputs for my query.
marketplaceId
startIndex
endIndex
But it took 16 seconds to query just 50 records. (I am using Python to query the data from Athena --> S3.)
What am I doing wrong here? And is the way I implemented pagination right or not?
The SQL query I am executing:
SELECT
dataset_date,
marketplace_id,
gl_product_group,
gl_product_group_desc,
browse_node_id,
browse_node_name,
root_browse_node_id,
browse_root_name,
wt_xref_id,
gt_xref_id,
node_path,
total_count_of_asins,
buyable_asin_count,
glance_view_count_t12m,
ord_cnt,
price_p50,
price_p90,
price_p100,
row_num
FROM
(
SELECT
dataset_date,
marketplace_id,
gl_product_group,
gl_product_group_desc,
browse_node_id,
browse_node_name,
root_browse_node_id,
browse_root_name,
wt_xref_id,
gt_xref_id,
node_path,
total_count_of_asins,
buyable_asin_count,
glance_view_count_t12m,
ord_cnt,
price_p50,
price_p90,
price_p100,
row_number() over (
order by
browse_node_id,
gl_product_group,
glance_view_count_t12m desc
) as row_num
from
(
select
*
from
category_info
WHERE
marketplace_id = '<marketplaceId>'
)
)
WHERE
row_num between '<startIndex>'
and '<endIndex>';
Update
After debugging my issue with timestamps, I found it's taking 6 seconds per query, and I am running two queries:
1st - to get the data (the query I mentioned above).
2nd - to get the count of the total number of rows in my table.
That's why it's taking 12-16 sec.
So is there any way to get the total number of rows without the second query (select count(*) from category_info)?
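One possible way, sketched below on the assumption that Athena's Presto engine is being used (it supports unpartitioned window aggregates): compute the total inside the innermost query with count(*) OVER (), so a single scan returns both the page and the total:
select
*,
count(*) over () as total_rows -- total row count for this filter, repeated on every row
from category_info
where marketplace_id = '<marketplaceId>'
The row_number() wrapper and the row_num BETWEEN filter can stay exactly as before; total_rows simply rides along on each returned row, replacing the second query.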

How to convert array to string value

Hello, I am trying to get my query log costs. I get the total amount, but when I try to break it down per dataset I get this error:
'Cannot access field datasetId on a value with type ARRAY<STRUCT<...>> at'
This is the query I am trying to run:
WITH
data AS (
SELECT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent,
(
SELECT
ARRAY_TO_STRING((
SELECT
ARRAY_AGG(datasetId)
FROM
UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables.datasetId) ))) AS datasetIds
FROM
`kkk111.bq_audit_log_export.cloudaudit_googleapis_com_data_access_20190206` )
SELECT
datasetIds,
FORMAT('%9.2f',5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40))) AS Estimated_USD_Cost
FROM
data
WHERE
jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY
datasetIds
ORDER BY
Estimated_USD_Cost DESC
I am using Standard SQL Dialect
How do I cast this field:
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables.datasetId
from an array to a string?
What am I missing?
Thanks.
Below is for BigQuery Standard SQL
#standardSQL
WITH data AS (
SELECT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent,
ref.datasetId AS datasetId
FROM `kkk111.bq_audit_log_export.cloudaudit_googleapis_com_data_access_20190206`,
UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) ref
)
SELECT
datasetId,
FORMAT('%9.2f',5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40))) AS Estimated_USD_Cost
FROM data
WHERE jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY datasetId
ORDER BY Estimated_USD_Cost DESC
As you can see, you obviously need to UNNEST the referencedTables ARRAY, but you also need to make your final cost calculation as close to correct as possible. The same query can reference multiple tables from the same dataset, so you had better have DISTINCT in your CTE. But the same query can also reference tables from multiple datasets; in that case the same billed bytes will be attributed to multiple datasets, so you will overestimate! I don't know your exact intent, but you might want to introduce some logic to distribute the cost among the referenced datasets.
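For illustration, a minimal sketch of that de-duplication. It assumes the export schema exposes a job identifier at job.jobName.jobId (an assumption; check your audit-log schema), so each (job, dataset) pair is counted once even when a job touches several tables in the same dataset:
#standardSQL
WITH data AS (
SELECT DISTINCT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobName.jobId AS jobId, -- assumed field
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes AS totalBilledBytes,
ref.datasetId AS datasetId
FROM `kkk111.bq_audit_log_export.cloudaudit_googleapis_com_data_access_20190206`,
UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) ref
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
)
SELECT
datasetId,
FORMAT('%9.2f', 5.0 * (SUM(totalBilledBytes) / POWER(2, 40))) AS Estimated_USD_Cost
FROM data
GROUP BY datasetId
ORDER BY Estimated_USD_Cost DESC
Note this still attributes a multi-dataset job's full bytes to each of its datasets, so the overestimation caveat above still applies.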
You need to UNNEST the outer array in order to select the dataset ID inside:
SELECT
ARRAY_TO_STRING((
SELECT ARRAY_AGG(datasetId)
FROM UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables)
), ',') AS datasetIds
FROM ...

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin, and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field with its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#standardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so your dev/test incurs less cost. The query below can help with this - you can run it in the BigQuery UI with a destination table, which you will then use for your dev work. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema:
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value 1529518619028 is taken from the query below (at the time of running). The reason I went four days back is that I know the number of rows in this table at that time was just 912, vs. the current 528,858:
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#standardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data, which you might want to flatten to only its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000
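One caveat worth keeping in mind with this pattern: OFFSET(0) raises an error if the array is empty, while SAFE_OFFSET(0) returns NULL instead. If any block could have an empty transactions (or inputs/outputs) array, the safe variant is the defensive choice:
#standardSQL
SELECT timestamp,
block_id,
transactions[SAFE_OFFSET(0)] first_transaction -- NULL instead of an error when the array is empty
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10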

Postgresql Writing max() Window function with multiple partition expressions?

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION BY (e.g. address_token only, nothing after that). Sometimes the query times out (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend). So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window function's partitioning clause should not have a big impact on performance. Do realize that your query returns all the rows in the table, so the result set might be very large.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
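For instance (a sketch; the index name is arbitrary, and the table and column names come from the question):
CREATE INDEX table1_token_date_price_idx
ON table1 (address_token, list_date, original_list_price);
With all three columns in the index, Postgres can often satisfy both the partitioning and the max() from the index alone, without scanning the heap.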
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.

SQL Percentage of Occurrences

I'm working on some SQL code as part of my University work. The data is fictitious, just to be clear. I'm trying to count the occurrences of 1 and 0 in the SQL table Fact_Stream; these are stored in the Free_Stream column/attribute as a Boolean/bit value.
As calculations can't be made on bit values (at least in the way I'm trying), I've converted the value to an integer - just to be clear on that. The table contains information on a streaming company's streams; a 1 indicates the stream was free of charge, a 0 indicates the stream was paid for. My code:
SELECT Fact_Stream.Free_Stream, ((CAST(Free_Stream AS INT)) / COUNT(*) * 100) As 'Percentage of Streams'
FROM Fact_Stream
GROUP BY Free_Stream
The result/output is nearly where I want it to be, but it doesn't display the percentage correctly.
I'm using MS SQL Management Studio | MS SQL Server 2012 (I believe).
The percentage should be based on all rows, so you need to divide the count per 1/0 by a count of all rows. The easiest way to get this is utilizing a Windowed Aggregate Function:
SELECT Fact_Stream.Free_Stream,
100.0 * COUNT(*) -- count per bit
/ SUM(COUNT(*)) OVER () -- sum of those counts = count of all rows
As "Percentage of Streams"
FROM Fact_Stream
GROUP BY Free_Stream
You have INTs as both the divisor and the dividend, so the result is also an INT. Just cast one of them to decimal (notice how I changed 100 to 100.0). Also, you should divide the count of elements in each group by the total count of rows in the table:
select Free_Stream,
-- multiply by 100.0 first so the division is decimal, and count the rows of Fact_Stream (not the Free_Stream column)
count(*) * 100.0 / (select count(*) from Fact_Stream) as 'Percentage of Streams'
from Fact_Stream
group by Free_Stream
Your equation is dividing the identifier (1 or 0) by the number of streams for each one, instead of dividing the count of free or paid by the total count. One way to do this is to get the total count first, then use it in your query:
declare #totalcount real;
select #totalcount = count(*) from Fact_Stream;
SELECT Fact_Stream.Free_Stream,
(Cast(Count(*) as real) / #totalcount)*100 AS 'Percentage of Streams'
FROM Fact_Stream
group by Fact_Stream.Free_Stream