I'm attempting to create percentile scores. My query generates the ranks correctly, but the divide does nothing (the ranks are displayed in the columns rather than the scores).
"/"(RANK() OVER(ORDER BY "Disk IO"),Count(*)) "Disk IO Score"
I've also tried generating the rank then selecting that and dividing, but it has the same result.
SELECT ..."/"("Disk IO Score",Count(*)) "Score"...
FROM(....RANK() OVER(ORDER BY "Disk IO") "Disk IO Score"...)
Thanks,
Buzkie
SELECT "System_Name", "/"(RANK() OVER(ORDER BY "Disk IO"),Count(*)) "Disk IO Score"
FROM (Select...)
GROUP BY "System_Name", "Disk IO"
It seems you are using the aggregate COUNT(*) rather than the analytic one.
Try this:
SELECT RANK() OVER (...) / COUNT(*) OVER (...)
And could you please post the whole query (including GROUP BY clauses)?
I guess it's answered. The count(*) was returning 1 so I was just dividing by 1.
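For completeness, a minimal sketch of what the full fix might look like (the column names and GROUP BY come from the question above; the * 1.0 is my own addition in case your database truncates integer division, and may not be needed):

SELECT "System_Name",
       RANK() OVER (ORDER BY "Disk IO") * 1.0 / COUNT(*) OVER () "Disk IO Score"
FROM (SELECT ...)
GROUP BY "System_Name", "Disk IO"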
For a query in Google BigQuery I want to replace a long hash with a shorter numeric unique identifier to save some memory afterwards, so I do:
SELECT
my_hash
, ROW_NUMBER() OVER (ORDER BY null) AS id_numeric
FROM hash_table_raw
GROUP BY my_hash
I don't even need an order in the id, but ROW_NUMBER() requires an ORDER BY.
When I try this on my dataset (> 1 billion rows) I get a memory error:
400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 126% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 99%
other/unattributed: 1%
Is there another way to replace a hash with a shorter identifier?
Thanks!
You don't really need to populate the OVER clause for this.
e.g. the following will work:
select col, row_number() over() as row_num from (select 'A' as col)
So that will be your first try.
Now, for the billion+ rows that you have, if the above fails, you can do something like this (given that order is not important to you at all), but here you have to do it in parts:
SELECT
my_hash
, ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 0
And in subsequent queries you can take MAX(id_numeric) from the previous run and add it as an offset to the next:
SELECT
my_hash
, previous_max_id_numeric_val + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 1
And keep appending outputs of these mod queries (0-4) to a single new table.
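To tie the pieces together, a rough sketch of one chunk (hash_table_raw and my_hash are from the question; the FARM_FINGERPRINT/ABS wrapper is my own assumption for the case where my_hash is a string - if it is already numeric, MOD(my_hash, 5) as in the answer works directly):

-- first chunk: MOD value 0, offset 0; for later chunks bump the MOD value to 1..4
-- and replace the 0 offset below with MAX(id_numeric) of everything written so far
SELECT
  my_hash,
  0 + ROW_NUMBER() OVER () AS id_numeric
FROM (
  SELECT DISTINCT my_hash
  FROM hash_table_raw
  WHERE MOD(ABS(FARM_FINGERPRINT(my_hash)), 5) = 0
)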
Using the Google Big Query database bigquery-public-data.crypto_ethereum_classic.transactions as reference.
For each transaction row, I want to calculate the count of all transactions to the same address that occurred before that transaction, and the sum of their gas usage. I am sure I can do this with a join, as I have tried it and Google accepts my old query, but since the (inner) join produces so much data, there is almost always a "quota limit exceeded" error. At the same time, I think a subquery solution is inefficient, as it queries almost the same thing in both aggregate functions.
In a perfect world the query would use something like a join to create a temporary table with all columns I need (transaction_hash, receipt_gas_used, to_address, block_timestamp), according to the conditions (where to_address = table_1.to_address and block_timestamp < table_1.block_timestamp), where I can then perform the aggregate functions on the columns of that table.
What I have so far and what I'm looking for is something like...:
SELECT
table_1.*,
COUNT(
DISTINCT IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `hash`, NULL)
) as txn_count,
SUM(
IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `receipt_gas_used`, NULL)
) as total_gas_used
from
`bigquery-public-data.crypto_ethereum_classic.transactions` as table_1
where block_number >= 3000000 and block_number <= 3500000 #just to subset the data a bit
I think you want window functions:
select t.*,
row_number() over (partition by to_address order by block_timestamp) as txn_seqnum,
sum(receipt_gas_used) over (partition by to_address order by block_timestamp) as total_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t
where block_number >= 3000000 and block_number <= 3500000 #just to subset the data a bit
If you really have ties and need the distinct, then use dense_rank() instead of row_number().
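One further note, not from the answer above: since the question asks for transactions strictly before each one, a window frame that excludes the current row might be closer to the goal - a sketch:

select t.*,
       count(*) over (partition by to_address order by block_timestamp
                      rows between unbounded preceding and 1 preceding) as prior_txn_count,
       sum(receipt_gas_used) over (partition by to_address order by block_timestamp
                      rows between unbounded preceding and 1 preceding) as prior_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t
where block_number >= 3000000 and block_number <= 3500000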
When I try to use PERCENT_RANK() over a large dataset, it gives me an error.
SELECT
a2_lngram,
a2_decade,
a2_totalfreq,
a2_totalbooks,
a2_freq, a2_bfreq,
a2_arf,
c_avgarf,
d_arf,
oi,
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
With a destination table and AllowLargeResults set, it returns:
"Resources exceeded during query execution."
When I limit the results to a few hundred rows it runs fine.
JobID: otichyproject1:job_PpTpmMXYETUMiM_2scGgc997JVg
The dataset is public.
This is expected: The input for an analytic/window function needs to fit in one node for it to run successfully.
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
will only run if all the rows fit in one node. If they don't you'll see the "Resources exceeded during query execution" error.
There's a way to scale up with analytic functions: Partition your data.
PERCENT_RANK() OVER (PARTITION BY country ORDER BY d_arf DESC) plarf
... then the function can be run over multiple nodes, as long as the rows for each 'country' fit in one VM.
Not your case though - the fix I would apply here is to calculate the total in a separate subquery, join, and divide.
In summary, analytic functions are cool, but they have scalability issues on the size of each partition - luckily there are other ways to get the same results.
How would you overcome the above restriction?
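For what it's worth, a sketch of the "calculate the total in a separate subquery, join, and divide" idea, written in today's standard SQL against the same table; it reproduces PERCENT_RANK() as (rank - 1) / (total - 1), though the RANK() itself still needs its ORDER BY input to fit:

SELECT
  d_arf,
  (RANK() OVER (ORDER BY d_arf DESC) - 1) / (total - 1) AS plarf
FROM `trigram.trigrams8`
CROSS JOIN (SELECT COUNT(*) AS total FROM `trigram.trigrams8`)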
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK, but when running on a larger data set I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title editing by different contributors. The WHERE condition is just to limit the response size: if we remove the "B" we get results; if we add it we get the "use EACH" recommendation.
select
  title,
  count(case when contributor_id <> LeadContributor then 1 else null end) as different,
  count(case when contributor_id = LeadContributor then 1 else null end) as same,
  count(*) as total
from (
  select title, contributor_id,
    lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[A,B]') = true
)
group by title
Thanks
I guess your particular use case is different from the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as these operations are not parallelizable. As the answer to r'^[A]' reveals, that can be a lot of data - though sometimes it is not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)
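To make that workaround concrete, a sketch (my own, not from the answer): run the query once per leading letter, each time writing or appending to the same destination table, so that no single OVER() has to see all the titles. The destination table name is hypothetical:

-- run 1: r'^[A]' -> write to mydataset.title_edit_stats
-- run 2: r'^[B]' -> append to mydataset.title_edit_stats
-- ...and so on for the other letters
select
  title,
  count(case when contributor_id <> LeadContributor then 1 else null end) as different,
  count(case when contributor_id = LeadContributor then 1 else null end) as same,
  count(*) as total
from (
  select title, contributor_id,
    lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[A]') = true
)
group by title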
Based on the BigQuery query reference, QUANTILES currently does not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example, I run the query below for every distinct value in column-y to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the big query team plan on having some functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
While there are efficient algorithms to compute quantiles, they are somewhat memory intensive - trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
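If that is acceptable, a sketch of the "filter outliers and average the remainder" idea - my own, reusing the sample table from the announcement above; the 5%/95% cutoffs are arbitrary placeholders:

SELECT room, AVG(data) AS trimmed_avg
FROM (
  SELECT room, data,
    PERCENT_RANK() OVER (PARTITION BY room ORDER BY data) AS pr
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype='temperature'
)
WHERE pr BETWEEN 0.05 AND 0.95
GROUP BY room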
If your per-group size is fixed, you may be able to hack it using a combination of ORDER, NEST and NTH. For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1,nth(5,f2) within record from (
select f1,nest(f2) f2 from (
select f1, f2 from table
group by f1,f2
order by f2
) group by f1
);
Not sure if the sorted order in the subquery is guaranteed to survive the second GROUP BY, but it worked in a simple test I tried.