Using legacy SQL, I am trying to use COUNT(DISTINCT field, n) in Google BigQuery. But I am get following error:
UNIQUE_HEAP requires an int32 argument which is greater than 0 (error code: invalidQuery)
Here is my query that I have used:
SELECT
hits.page.pagePath AS Page,
COUNT(DISTINCT CONCAT(fullVisitorId, INTEGER(visitId)), 1e6) AS UniquePageviews,
COUNT(DISTINCT fullVisitorId, 1e6) as Users
FROM
[xxxxxxxx.ga_sessions_20170101]
GROUP BY
Page
ORDER BY
UniquePageviews DESC
LIMIT
20
BigQuery is not even showing line number of error therefore I am not sure which line is causing this error.
What could be possible cause of above error?
Don't use 1e6 in your COUNT(DISTINCT). Instead, use an actual INTEGER value for the 2nd parameter 'N' (default is 1000), or use EXACT_COUNT_DISTINCT() instead.
COUNT(DISTINCT) documentation
EXACT_COUNT_DISTINCT() documentation
If you require greater accuracy from COUNT(DISTINCT), you can specify
a second parameter, n, which gives the threshold below which exact
results are guaranteed. By default, n is 1000, but if you give a
larger n, you will get exact results for COUNT(DISTINCT) up to that
value of n. However, giving larger values of n will reduce scalability
of this operator and may substantially increase query execution time
or cause the query to fail.
To compute the exact number of distinct values, use
EXACT_COUNT_DISTINCT. Or, for a more scalable approach, consider using
GROUP EACH BY on the relevant field(s) and then applying COUNT(*). The
GROUP EACH BY approach is more scalable but might incur a slight
up-front performance penalty.
Related
I'm a beginner, studying on my own... please help me to clarify something about a query: I am working with a soccer database and trying to answer this question: list all seasons with an avg goal per Match rate of over 1, in Matchs that didn’t end with a draw;
The right query for it is:
select season,round((sum(home_team_goal+away_team_goal) *1.0) /count(id),3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I *1.0? why is it necessary?
I know that the execution in SQL is by this order:
from
where
group
having
select
So how does this query include: having ratio>1 if the "ratio" is only defined in the "select" which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT because by default sum of ints is int and the division looses decimal places after dividing 2 ints.
HAVING. You can consider HAVING as WHERE but applied to the query results. Imagine the query is executed first without HAVING and then the HAVING condition is applied to result rows leaving only suitable ones.
In you case you first select grouped data and calculate aggregated results and then skip unnecessary results of aggregation.
the *1.0 is used for its ".0" part so that it tells the system to treat the expression as a decimal, and thus not make an integer division which would cut-off the decimal part (eg 1 instead of 1.33).
About the second part: select being at the end just means that the last thing
to be done is showing the data. Hoewever, assigning an alias to a calculated field is being done, you could say, at first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the where/group by/having in, say, sql server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation". This is more simply described by simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
As for the * 1.0, it is because some databases -- such as SQLite -- do integer arithmetic. However, the logic that you want is probably more simply expressed as:
round((avg(home_team_goal + away_team_goal * 1.0), 3)
I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window functions partitioning clause should not have a big impact on performance. Do realize that your query is returning all the rows in the table, so there might be a very large result set.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.
I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?
Thanks
The accuracy of COUNT DISTINCT estimation depends on real number of distict values. If it is small - the algorithm is pretty accurate (for small values it usually returns the exact value), but the bigger number of distinct values is - the less accurate it can become. Note, that COUNT(DISTINCT) takes second argument, which trades memory for accuracy, i.e. it will use more memory, but be more accurate. For example:
SELECT COUNT(DISTINCT x, 100000) FROM T
will return fairly accurate results if total number of distict values is less than 100,000.
The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.
I've saved this query to as a View:
SELECT nth(1,CodAlm) as FirstCode,
nth(1,DesAlm) as FirstDescription,
last(CodAlm) as LastCode,
Last(DesAlm) as LastDescription,
max(DATE(DataTic)) as LastVisit,
min(DATE(DataTic)) as FirstVisit,
DATEDIFF(CURRENT_TIMESTAMP(),TIMESTAMP(max(DATE(DataTic)))) as Diffdays,
count(distinct DATE(DataTic)) as countVisits,
count(distinct CodAlm) as NumberCodes,
sum(subtot) as Totalimport,
TarCli,
Last(nomcli) as Name,
Last(cogcli) as LastName,
Last(emailcli) as email,
Last(sexcli) as gender
FROM (SELECT CodAlm, DesAlm, DataTic,SubTot, TarCli, NomCli,CogCli,EmailCli,SexCli FROM [bime.Sales] where Year(DataTic)>2012 AND IsFirstLine="1" ORDER by TarCli, DataTic)
group each by tarcli
But, When I run any query over this view, bigquery returns me Resources exceeded during query execution. I think that ORDER BY is the cause of my problem, but I need this to show my results correctly. How I could rewrite this query correctly? bime.Sales table have 18 Milion rows.
See this question for more info:
What causes "resources exceeded" in BigQuery?
The error is likely caused by the GROUP EACH BY clause, and the most likely reason is that you have a skewed distribution of keys (i.e., one key with a disproportionate number of records). Can you look at your data distribution, and perhaps filter out any skewed keys?
Also note that the ORDER BY is not guaranteed to be preserved by GROUP EACH BY, so you need to apply the ordering after the GROUP EACH BY. You may find it useful to use analytic functions like FIRST_VALUE, NTH_VALUE, and LAST_VALUE with OVER(PARTITION BY tarcli ORDER BY DataTic) instead of GROUP EACH BY if you want to get reliable ordering.
Some things you should consider and try (if you haven't done that so far):
1) Do you really need the "group each by"? Have you tried with just "group by"?
2) Have you tried to use a table instead of a view? You could try to "materialize" the view to check if the resource consumption decreases.
3) Can you shard the data? Perhaps putting each year or month in a different table (using DataTic). That would decrease the size of each table and, therefore, the resource usage.
Cheers!
I found a glitch/bug in bigquery.
We got a table based on Bank Statistic data under the
starschema.net:clouddb:bank.Banks_token
If i run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
And i get the following result:
Row totalrow bankidcnt
1 9513 9903
My problem is that if i have 9513row how could i get 9903row, which is 390row more than the rowcount in the table.
In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
UPDATE 2017:
With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.
I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n > numRows)
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions