int64 overflow in sampling n number of rows (not %) - sql

The script below is meant to randomly sample an approximate number of rows (50k).
SELECT *
FROM table
qualify rand() <= 50000 / count(*) over()
This has worked a handful of times before, so I was surprised to find this error this morning:
int64 overflow: 8475548256593033885 + 6301395400903259047
I have read this post. But as I am not summing, I don't think it is applicable.
The table in question has 267,606,559 rows.
Looking forward to any ideas. Thank you.

I believe BQ (and other databases) actually compute a count as a sum. You can see this by viewing the Execution Details/Graph in the BQ UI; it is true even for a simple select count(*) from table query.
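As a quick check of the arithmetic in the reported error, the sketch below tests the two operands against the signed 64-bit range. The helper name fits_in_int64 is hypothetical; the two operands and the row count are copied from the question. Each operand fits on its own, while their sum does not, and both are many orders of magnitude larger than the table's row count.

```python
# Signed 64-bit integer bounds, as used by INT64 columns.
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def fits_in_int64(value):
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


# Operands copied from the error message in the question.
a = 8475548256593033885
b = 6301395400903259047
# Row count reported in the question.
row_count = 267_606_559

print("a fits:", fits_in_int64(a))
print("b fits:", fits_in_int64(b))
print("a + b fits:", fits_in_int64(a + b))
print("a / row_count is roughly", a // row_count)
```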
For your problem, consider something simpler like:
select *, rand() as my_rand
from table
order by my_rand
limit 50000
Also, if you know the rough size of your data or don't need exactly 50K, consider using the tablesample method:
select * from table
tablesample system (10 percent)
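To make the difference between the approaches concrete, here is a minimal Python sketch (not BigQuery code) of the two sampling strategies: keeping each row with probability target / total, which yields roughly the target number of rows, and sorting by a random key and cutting off at the target, which yields exactly that many. The function names and the in-memory list of rows are assumptions made for illustration.

```python
import random


def bernoulli_sample(rows, target, rng):
    """Keep each row independently with probability target / len(rows).

    Mirrors the idea of `rand() <= 50000 / count(*) over()`: the result
    size is only approximately `target`. Assumes rows is non-empty.
    """
    p = target / len(rows)
    return [r for r in rows if rng.random() <= p]


def exact_sample(rows, target, rng):
    """Attach a random key to each row, sort by it, keep the first `target`.

    Mirrors the idea of `order by rand() limit 50000`: the result size is
    exactly `target` (or len(rows) if that is smaller).
    """
    keyed = [(rng.random(), r) for r in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [r for _, r in keyed[:target]]


rng = random.Random(0)
rows = list(range(100_000))
print(len(bernoulli_sample(rows, 5_000, rng)))  # close to 5000, varies
print(len(exact_sample(rows, 5_000, rng)))      # exactly 5000
```

The sort-based variant has to order every row, so it trades extra work for an exact sample size.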

Related

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window, from "one day" up to "many years". There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary widely, from a few hundred to several thousand or millions.
Those values are meant to be visualized in a chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database when the time window is large. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate functions. But how can I, for example, aggregate 100k rows from a big time window down to, let's say, 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, truncate it, average the values and group by the truncated date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structure does not seem well optimized for this: the query for the last year took ~75sec (slow test machine) and in the end returned only one value per day. This can be combined with avg(value), but that further increases query time (~82sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially with aggregation in combination with "group by".
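The truncate-and-group idea from update3 can be sketched outside the database as follows. This is a minimal Python illustration, assuming rows are (unix_timestamp, value) pairs already filtered to one entity; the function names bucket_width and downsample_avg are hypothetical. The bucket width is derived from the window size and the number of points the chart can show.

```python
from collections import defaultdict


def bucket_width(start_ts, end_ts, points=800):
    """Smallest whole number of seconds per bucket so that the window
    [start_ts, end_ts) yields at most about `points` buckets."""
    span = end_ts - start_ts
    return max(1, -(-span // points))  # ceiling division


def downsample_avg(rows, bucket_seconds=86400):
    """Average values per time bucket.

    Each timestamp is truncated with `ts - ts % bucket_seconds`, the same
    expression used for aggtimestamp in the query above, then values are
    averaged per bucket (the avg(value) ... group by variant).
    """
    acc = defaultdict(lambda: [0.0, 0])  # bucket -> [sum, count]
    for ts, value in rows:
        bucket = ts - (ts % bucket_seconds)
        acc[bucket][0] += value
        acc[bucket][1] += 1
    return [(b, total / n) for b, (total, n) in sorted(acc.items())]


sample = [(0, 1.0), (60, 3.0), (86400, 10.0)]
print(downsample_avg(sample))  # one averaged row per day
```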
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . . -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
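For readers unfamiliar with ntile, the following Python sketch mimics what the query does on an in-memory list: sort by timestamp, split the rows into equal-sized groups (the first groups get one extra row when the division is uneven, which is how ntile distributes remainders), and keep the first row of each group. The function name and the (timestamp, value) row shape are assumptions for illustration.

```python
def first_row_per_tile(rows, tiles=800):
    """Return the earliest row of each of `tiles` equal-sized groups.

    rows: iterable of (timestamp, value) tuples.
    If there are fewer rows than tiles, every row is returned.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    base, extra = divmod(len(ordered), tiles)
    picked = []
    start = 0
    for t in range(tiles):
        # The first `extra` tiles hold one more row than the rest.
        size = base + (1 if t < extra else 0)
        if size == 0:
            break
        picked.append(ordered[start])
        start += size
    return picked


rows = [(t, float(t)) for t in range(10)]
print(first_row_per_tile(rows, tiles=3))
```

Unlike averaging per bucket, this keeps real measured rows, at the cost of ignoring everything else in each group.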

BigQuery join too slow for a table of small size

I have a table with the following details:
- Table Size 39.6 MB
- Number of Rows 691,562
- 2 columns : contact_guid STRING, program_completed STRING
- column 1 data type is like uuid . around 30 char length
- column 2 data type is string with around 50 char length
I am trying this query:
#standardSQL
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2,
COUNT(*) AS cnt
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed=cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid
GROUP BY
cp1.contact_guid,
cp2.contact_guid having cnt >1 order by cnt desc
Time taken to execute: 1200 secs
I know I am doing a self join and it is mentioned in best practices to avoid self join.
My Questions:
I feel this table size in terms of mb is too small for BigQuery therefore why is it taking so much time? And what does small table mean for BigQuery in context of join in terms of number of rows and size in bytes?
Is the number of rows too large? 700k ^ 2 is 10^11 rows during join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much regarding how big a table can be for joins and how much time can be expected for it to run. How do we estimate rough execution time?
Execution Details:
As shown on the screenshot you provided - you are dealing with an exploding join.
In this case step 3 takes 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query didn't provision enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: Either avoid exploding joins, or assume that exploding joins will take a long time to run. But as explained in the question - you already knew that!
How about if you generate all the extra rows in one op (do the join, materialize) and then run another query to process the 459 million rows? The first query will be slow for the reasons explained, but the second one will run quickly as BigQuery will provision enough resource to deal with that amount of data.
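To get a feel for how quickly such a join explodes, the sketch below estimates the size of a self-join on one key from the group sizes alone: a group of n rows contributes n * n joined rows, and n * (n - 1) / 2 of them survive a strict "less than" filter such as cp1.contact_guid < cp2.contact_guid. This is a back-of-the-envelope Python helper with a hypothetical name, and it assumes contact_guid is unique within each program_completed value.

```python
from collections import Counter


def self_join_sizes(key_values):
    """Return (joined_rows, kept_pairs) for a self-join on `key_values`.

    joined_rows: rows produced by joining the table to itself on the key.
    kept_pairs:  rows left after keeping only pairs with left < right,
                 assuming the compared column is unique within each key.
    """
    group_sizes = Counter(key_values).values()
    joined_rows = sum(n * n for n in group_sizes)
    kept_pairs = sum(n * (n - 1) // 2 for n in group_sizes)
    return joined_rows, kept_pairs


# One popular program dominates the output size.
programs = ["intro"] * 1000 + ["advanced"] * 10 + ["rare"]
print(self_join_sizes(programs))
```

The quadratic term means a handful of very common key values can account for nearly all of the output rows.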
I agree with the suggestions below:
see if you can rephrase your query using analytic functions (by Tim)
Using analytic functions would be a much better idea (by Elliott)
Below is how I would do it:
#standardSQL
SELECT
  p1, p2, COUNT(1) AS cnt
FROM (
  SELECT
    contact_guid AS p1,
    ARRAY_AGG(contact_guid) OVER(my_win) guids
  FROM `data.contact_pairs_program_together`
  WINDOW my_win AS (
    PARTITION BY program_completed
    ORDER BY contact_guid DESC
    RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
  )
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try and let us know if helped
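The result both queries are after (pairs of contacts that completed more than one program together) can be sketched in plain Python, which may help when checking a rewritten query against a small sample. The function name is hypothetical, and the sketch assumes duplicate (contact_guid, program_completed) rows should be counted once, which may differ from the SQL if the table contains duplicates.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_program_counts(rows):
    """Count, per pair of contacts, how many programs both completed.

    rows: iterable of (contact_guid, program_completed) tuples.
    Returns [((p1, p2), cnt), ...] with p1 < p2 and cnt > 1, ordered by
    cnt descending, like HAVING cnt > 1 ORDER BY cnt DESC.
    """
    contacts_by_program = defaultdict(set)
    for guid, program in rows:
        contacts_by_program[program].add(guid)

    counts = Counter()
    for guids in contacts_by_program.values():
        # Sorting first guarantees p1 < p2 in every emitted pair.
        for pair in combinations(sorted(guids), 2):
            counts[pair] += 1

    return [(pair, cnt) for pair, cnt in counts.most_common() if cnt > 1]


rows = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("c", "y")]
print(shared_program_counts(rows))
```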

SQL COUNT - greater than some number without having to get the exact count?

There's a thread at https://github.com/amatsuda/kaminari/issues/545 talking about a problem with a Ruby pagination gem when it encounters large tables.
When the number of records is large, the pagination will display something like:
[1][2][3][4][5][6][7][8][9][10][...][end]
This can incur performance penalties when the number of records is huge, because getting an exact count of, say, 50M+ records will take time. However, all that's needed to know in this case is that the count is greater than the number of pages to show * number of records per page.
Is there a faster SQL operation than getting the exact COUNT, which would merely assert that the COUNT is greater than some value x?
You could try with
SQL Server:
SELECT COUNT(*) FROM (SELECT TOP 1000 * FROM MyTable) X
MySQL:
SELECT COUNT(*) FROM (SELECT * FROM MyTable LIMIT 1000) X
With a little luck, the SQL Server/MySQL will optimize this query. Clearly instead of 1000 you should put the maximum number of pages you want * the number of rows per page.
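The same capped-count idea can be tried out locally. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in for SQL Server or MySQL; the helper names and the table name my_table are hypothetical, and the table name is interpolated into the SQL text, so it must come from trusted code rather than user input.

```python
import sqlite3


def count_up_to(conn, table, cap):
    """Count rows in `table`, but stop after `cap` rows.

    The inner query returns at most `cap` rows, so the outer COUNT(*)
    never has to look at more than that.
    """
    sql = f"SELECT COUNT(*) FROM (SELECT 1 FROM {table} LIMIT ?)"
    return conn.execute(sql, (cap,)).fetchone()[0]


def has_more_than(conn, table, x):
    """True if `table` holds more than `x` rows, reading at most x + 1."""
    return count_up_to(conn, table, x + 1) > x


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(i,) for i in range(2500)])

print(count_up_to(conn, "my_table", 1000))
print(has_more_than(conn, "my_table", 1000))
```

For pagination, x would be the number of pages to show multiplied by the number of records per page.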

BigQuery COUNT(DISTINCT value) vs COUNT(value)

I found a glitch/bug in bigquery.
We got a table based on Bank Statistic data under the
starschema.net:clouddb:bank.Banks_token
If I run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
I get the following result:
Row totalrow bankidcnt
1 9513 9903
My problem is that if the table has 9513 rows, how can the distinct count be 9903, which is 390 more than the row count of the table?
In BigQuery (legacy SQL), COUNT(DISTINCT) is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
UPDATE 2017:
With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.
I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n > numRows)
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions

Vague count in sql select statements

I guess this has been asked in the site before but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it sometimes says +5000 results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approx. X results. Is this just a way to show the user a huge number in an easy-to-understand form, or is it actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment, version 10g]? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
  SELECT * FROM (
    SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
    FROM (
      [... your own, sorted search query ...]
    ) inner
  )
  WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000 results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count acts as a buffer that can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance measure: after displaying the first results, sites like Google keep searching for more.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the result contains 1001 rows, then the total number of records is greater than 1000.
this question gives some pretty good information
When you do an SQL query you can set a
LIMIT 0, 100
for example, and you will only get the first hundred answers. You can then tell the viewer that there are 100+ answers to their request.
For Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they do. There are some standard requests whose results are stored and updated in the background.