BigQuery COUNT(DISTINCT value) vs COUNT(value) - google-bigquery

I found a glitch/bug in bigquery.
We got a table based on Bank Statistic data under the
starschema.net:clouddb:bank.Banks_token
If i run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
And i get the following result:
Row totalrow bankidcnt
1 9513 9903
My problem is that if i have 9513row how could i get 9903row, which is 390row more than the rowcount in the table.

In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
UPDATE 2017:
With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.

I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n > numRows)
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions

Related

int64 overflow in sampling n number of rows (not %)

The below script is to randomly sample an approximate number of rows (50k).
SELECT *
FROM table
qualify rand() <= 50000 / count(*) over()
This has worked a handful of times before, hence, I was shocked to find this error this morning:
int64 overflow: 8475548256593033885 + 6301395400903259047
I have read this post. But as I am not summing, I don't think it is applicable.
The table in question has 267,606,559 rows.
Looking forward to any ideas. Thank you.
I believe counting is actually a sum the way BQ (and other databases) compute counts. You can see this by viewing the Execution Details/Graph (in the BQ UI). This is true even on a simple select count(*) from table query.
For your problem, consider something simpler like:
select *, rand() as my_rand
from table
order by my_rand
limit 50000
Also, if you know the rough size of your data or don't need exactly 50K, consider using the tablesample method:
select * from table
tablesample system (10 percent)

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can be quite different. Say from few hundrets to several thousands or millions.
Those values are ment to be visualiuzed in a graphical chart-diagram on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unixtimestamp but a date and "trunc" it, avergage the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structue seems not really optimized for this: Query for last year took ~75sec (slow test machine) but finally got only a one value per day. This can be combined with avg(value), but this further increases query time... (~82sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;

BigQuery Error – UNIQUE_HEAP requires an int32 argument

Using legacy SQL, I am trying to use COUNT(DISTINCT field, n) in Google BigQuery. But I am get following error:
UNIQUE_HEAP requires an int32 argument which is greater than 0 (error code: invalidQuery)
Here is my query that I have used:
SELECT
hits.page.pagePath AS Page,
COUNT(DISTINCT CONCAT(fullVisitorId, INTEGER(visitId)), 1e6) AS UniquePageviews,
COUNT(DISTINCT fullVisitorId, 1e6) as Users
FROM
[xxxxxxxx.ga_sessions_20170101]
GROUP BY
Page
ORDER BY
UniquePageviews DESC
LIMIT
20
BigQuery is not even showing line number of error therefore I am not sure which line is causing this error.
What could be possible cause of above error?
Don't use 1e6 in your COUNT(DISTINCT). Instead, use an actual INTEGER value for the 2nd parameter 'N' (default is 1000), or use EXACT_COUNT_DISTINCT() instead.
COUNT(DISTINCT) documentation
EXACT_COUNT_DISTINCT() documentation
If you require greater accuracy from COUNT(DISTINCT), you can specify
a second parameter, n, which gives the threshold below which exact
results are guaranteed. By default, n is 1000, but if you give a
larger n, you will get exact results for COUNT(DISTINCT) up to that
value of n. However, giving larger values of n will reduce scalability
of this operator and may substantially increase query execution time
or cause the query to fail.
To compute the exact number of distinct values, use
EXACT_COUNT_DISTINCT. Or, for a more scalable approach, consider using
GROUP EACH BY on the relevant field(s) and then applying COUNT(*). The
GROUP EACH BY approach is more scalable but might incur a slight
up-front performance penalty.

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and than calculate some aggregations (count, sum, etc,) of their attributes.
When i run my queries on a small sample of data, everything is fine and the group by runs OK. but when running on larger data set, i get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY do the work...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
here is a sample query based of wikipedia sample data. it shows the frequency of title editing by different contributors. the where condition is just to limit response size, if you remove the "B" we get results, if we add it we got the "use EACH" recomendation.
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true)
group by title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlight an OVER() limitation: As GROUP BY and ORDER BY, it only works when data fits in one machine, as they are not parallelizable. As the answer to r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

Vague count in sql select statements

I guess this has been asked in the site before but I can't find it.
I've seen in some sites that there is a vague count over the results of a search. For example, here in stackoverflow, when you search a question, it says +5000 results (sometimes), in gmail, when you search by keywords, it says "hundreds" and in google it says aprox X results. Is this just a way to show the user an easy-to-understand-a-huge-number? or this is actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment 10g version]? something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always have an absolute upper limit, such as 5000 to your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
Vague count is a buffer which will be displayed promptly. If user wants to see more results then he can request more.
It's a performance facility, after displaying the results the sites like google keep searching for more results.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If count of rows in result will equals to 1001 then total count of records will > 1000.
this question gives some pretty good information
When you do an SQL query you can set a
LIMIT 0, 100
for example and you will only get the first hundred answers. so you can then print to your viewer that there are 100+ answers to their request.
For google I couldn't say if they really know there is more than 27'000'000'000 answer to a request but I believe they really do know. There are some standard request that have results stored and where the update is done in the background.