Is "count distinct" exact with BigQuery's new standard SQL syntax?

With the legacy BigQuery syntax, we have to use the exact_count_distinct function if we want to have the exact number of distinct values for a field.
With the Standard SQL 2011 syntax, I wonder if "count(distinct myfield)" will always return the exact number of distinct values if I don't select the 'Use Legacy SQL' option.

COUNT(DISTINCT input) gives an exact count in standard SQL.
One important distinction is that COUNT(DISTINCT input) is more scalable than EXACT_COUNT_DISTINCT(input) in legacy BigQuery SQL, so in general the performance will be better and you are less likely to encounter resource exceeded errors.
You can read about other differences between legacy and standard SQL in the migration guide.

Based on the documentation for APPROX_COUNT_DISTINCT (reading between the lines):
COUNT(DISTINCT input) - exact count
APPROX_COUNT_DISTINCT(input) - approximate result

Related

Big Query - different number of users when using legacy and normal sql

I have written a query in Google BigQuery and want to get the same number of users I see in Google Analytics. I used legacy and normal SQL and got 3 different user numbers while the sessions were the same. What did I do wrong, or does anyone have an explanation/solution for it? Any help is appreciated!
Normal SQL
SELECT COUNT(DISTINCT fullVisitorId) AS users,
  SUM(IF(totals.visits IS NULL, 0, totals.visits)) AS sessions
FROM `XXX.XXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20181120' AND '20181120'
Legacy SQL
SELECT COUNT(DISTINCT fullVisitorId) AS users,
  SUM(IF(totals.visits IS NULL, 0, totals.visits)) AS sessions
FROM TABLE_DATE_RANGE([XXX:XXX.ga_sessions_],
  TIMESTAMP('2018-11-20'), TIMESTAMP('2018-11-20'))
I think this warning from the documentation explains what is happening:
In legacy SQL, COUNT(DISTINCT x) returns an approximate count. In standard SQL, it returns an exact count.
Standard SQL has the correct number. You can test this by using EXACT_COUNT_DISTINCT() in legacy SQL, which should return the same count.

Programmatically determine if a query is legacy or SQL 2011 syntax?

Hi, is there a way to programmatically detect whether a query string is in the legacy or SQL-2011 syntax? I know the former uses [project:dataset.table] for table references while the latter uses `project.dataset.table`, but this doesn't seem very bulletproof.
There's no way to tell just from the query text in all cases, which is why BigQuery has the "Use Legacy SQL" checkbox in the UI and the use_legacy_sql option for the query API. For example, consider this query:
SELECT *
FROM (SELECT 1 AS x), (SELECT 2), (SELECT 3);
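The results are very different despite the query being valid in both dialects: legacy SQL treats the comma between tables as UNION ALL, while standard SQL treats it as a cross join.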
Standard SQL queries can still contain [], too, such as for array literals.
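To make the difference concrete, here is a rough Python model of the two readings of the comma: legacy SQL treats it as UNION ALL, standard SQL as a cross join. It is only a sketch. The names `t1`, `t2`, `t3` are made up for illustration, and the model ignores how each dialect names and aligns the output columns.

```python
from itertools import chain, product

# Three one-row, one-column "tables", standing in for the three subqueries.
t1, t2, t3 = [(1,)], [(2,)], [(3,)]

# Legacy SQL reads the comma as UNION ALL: the rows are stacked.
legacy_rows = list(chain(t1, t2, t3))

# Standard SQL reads the comma as a cross join: the columns are combined.
standard_rows = [a + b + c for a, b, c in product(t1, t2, t3)]

print(len(legacy_rows), "rows under the legacy reading")
print(len(standard_rows), "row(s) under the standard reading")
```

So the same text produces a different shape of result depending on the dialect flag, which is why the text alone cannot settle the question.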
Assuming the query is syntactically correct and expected to actually work, you can do a dry run using both options (legacy and standard) and see which one fails. Based on the result, you can potentially derive the answer.
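A minimal sketch of that dry-run idea in Python. The function name `dialects_accepting` and the `dry_run` callable are hypothetical: `dry_run(sql, use_legacy_sql)` is assumed to raise an exception when the query does not validate under the given dialect. The comment at the end shows one way it might be wired to the google-cloud-bigquery client, which I have not tested here.

```python
from typing import Callable, List

# Hypothetical hook: must raise if `sql` fails validation in that dialect.
DryRun = Callable[[str, bool], None]


def dialects_accepting(sql: str, dry_run: DryRun) -> List[str]:
    """Return the dialects ("legacy", "standard") whose dry run succeeds."""
    accepted = []
    for name, use_legacy in (("legacy", True), ("standard", False)):
        try:
            dry_run(sql, use_legacy)
        except Exception:
            continue  # this dialect rejected the query
        accepted.append(name)
    return accepted


# Possible wiring with google-cloud-bigquery (an assumption, not verified here):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   def dry_run(sql, use_legacy_sql):
#       cfg = bigquery.QueryJobConfig(dry_run=True, use_legacy_sql=use_legacy_sql)
#       client.query(sql, job_config=cfg)
```

If the returned list has exactly one entry you have your answer. If it has both, the query is valid in either dialect and the text alone cannot decide, as the other answer points out.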

TABLE_DATE_RANGE for xxxx_yyyymm format tables

I'm having a problem trying to query for 15 months worth of data.
I know about bigquery's wildcard functions, but I can't seem to get them to work with my tables.
For example, if my tables are called:
xxxx_201501,
xxxx_201502,
xxxx_201503,
...
xxxx_201606
How can I select everything from 201501 until today (current_timestamp)?
It seems that it's necessary to have the tables per day, am I wrong?
I've also read that you can use regex but can't find the way.
With Standard SQL, you can use a WHERE clause on a _TABLE_SUFFIX pseudo column as described here:
Is there an equivalent of table wildcard functions in BigQuery with standard SQL?
In this particular case, it would be:
SELECT ... from `mydataset.xxx_*` WHERE _TABLE_SUFFIX >= '201501';
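The string comparison on the suffix works because fixed-width YYYYMM strings sort in the same order as the months they name. A small Python sketch to illustrate this (the helper `month_suffixes` is made up for the example and is not part of BigQuery):

```python
from datetime import date
from typing import List


def month_suffixes(start: str, end: date) -> List[str]:
    """List YYYYMM suffixes from `start` (e.g. '201501') through `end`'s month."""
    year, month = int(start[:4]), int(start[4:6])
    out = []
    while (year, month) <= (end.year, end.month):
        out.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


suffixes = month_suffixes("201501", date(2016, 6, 30))
# Chronological order and plain string order agree for fixed-width suffixes.
assert suffixes == sorted(suffixes)
assert all(s >= "201501" for s in suffixes)
```

The same helper could also be used to build an explicit list of table names if you prefer not to rely on the wildcard.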
This is a bit long for a comment.
If you are using the standard SQL dialect, then I don't think the functionality is yet implemented.
If you are using the legacy SQL dialect, then you can use a function such as TABLE_DATE_RANGE(). This and other table wildcard functions are well documented.
EDIT:
Oh, I see. The simplest way would be to store the tables as YYYYMM01 so you can use the range query.
But, you can also use table_query():
from table_query(t, 'right(table_id, 6) >= ''201501'' ')

bitand condition in Query of Query (QoQ)

Is it possible to use a bitAnd() condition in ColdFusion QoQ SQL?
I have checked Adobe's documentation on QoQ (http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0e4fd-7ff0.html). It doesn't say anything about bitwise functions, but past experience tells me that the ColdFusion documentation isn't always complete.
Qoq SQL:
SELECT *
FROM srcTable
WHERE bitAnd(member_type_bit,2) = 2
This throws the error:
Query Of Queries syntax error. Encountered "bitAnd ( member_type_bit
,. Incorrect conditional expression, Expected one of
[like|null|between|in|comparison] condition,
Is it just not supported in QoQ or do I need to use a different syntax?
No, there's no bitAnd() function in the SQL dialect that QoQ uses.
You'll need to do it row by row, i.e., loop over the recordset and build a new recordset with only the rows you want. Or push this back to the DB and do it there (if possible).
For future reference, the entirety of what QoQ supports is listed here:
http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0e4fd-7ff0.html
That's all of it.
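The row-by-row approach, sketched in Python rather than CFML to show just the logic. The function name `filter_by_bit` and the sample rows are made up; only the column name `member_type_bit` and the mask 2 come from the question.

```python
from typing import Dict, List

Row = Dict[str, int]


def filter_by_bit(rows: List[Row], column: str, mask: int) -> List[Row]:
    """Build a new recordset keeping rows where all bits of `mask` are set."""
    return [row for row in rows if (row[column] & mask) == mask]


# Hypothetical source recordset.
src_table = [{"member_type_bit": n} for n in (1, 2, 3, 6)]
matches = filter_by_bit(src_table, "member_type_bit", 2)
print(matches)
```

In ColdFusion the same loop would use bitAnd() on each row's value and add matching rows to a new query object.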

LINQ Count .. best method

My company has just started using LINQ, and I am still having a little trouble with the abstractness (if that's a word) of the LINQ command and the SQL. My question is:
Dim query = (From o In data.Addresses _
Select o.Name).Count
In the above, in my mind, the SQL is returning all rows and then does a count on the number of rows in the IQueryable result, so would I be better off with
Dim lstring = Aggregate o In data.Addresses _
Into Count()
Or am I overthinking the way LINQ works? I'm using VB Express at home, so I can't see the actual SQL that is being sent to the database (I think), as I don't have access to SQL Profiler.
As mentioned, these are functionally equivalent; one just uses query syntax.
As mentioned in my comment, if you evaluate the following as VB Statement(s) in LINQPad:
Dim lstring = Aggregate o In Test _
Into Count()
You get this in the generated SQL output window:
SELECT COUNT(*) AS [value]
FROM [Test] AS [t0]
Which is the same as the following VB LINQ expression as evaluated:
(From o In Test _
Select o.Symbol).Count
You get the exact same result.
I'm not familiar with Visual Basic, but based on
http://msdn.microsoft.com/en-us/library/bb546138.aspx
Those two approaches are the same. One uses method syntax and the other uses query syntax.
You can find out for sure by using SQL Profiler as the queries run.
PS - The "point" of LINQ is you can easily do query operations without leaving code/VB-land.
An important thing here is that the code you give will work with a wide variety of data sources. It will hopefully do so in a very efficient way, though that can't be fully guaranteed. It certainly will be efficient with a SQL source (it is converted into a SELECT COUNT(*) SQL query). It will be efficient if the source is an in-memory collection (it is converted to a call to the Count property). It isn't very efficient if the source is an enumerable that is not a collection (in this case it does read everything and count as it goes), but in that case there really isn't a more efficient way of doing this.
In each case it has done the same conceptual operation, in the most efficient manner possible, without you having to worry about the details. No big deal with counting, but a bigger deal in more complex cases.
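As a loose analogy only (this is Python, not how LINQ is implemented), the same "pick the cheapest way to count for this source" idea can be sketched like this. The function name `count` is made up for the example.

```python
from collections.abc import Sized
from typing import Iterable


def count(source: Iterable) -> int:
    """Count items, taking the cheap path when the source knows its size."""
    if isinstance(source, Sized):
        return len(source)         # in-memory collection: no iteration needed
    return sum(1 for _ in source)  # plain iterable: has to walk every item


print(count([10, 20, 30]))            # uses len()
print(count(n for n in range(1000)))  # has to iterate
```

The caller writes the same call in both cases and the implementation decides how to do the work, which is the point being made about Count.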
To a certain extent, you are right when you say "in my mind, the SQL is returning all rows and then does a count on the number of rows". Conceptually that is what is happening in that query, but the implementation may differ. Compare this with how the real execution of a SQL query may not match the literal interpretation of the SQL command, which allows the most efficient approach to be picked.
I think you are missing the point: LINQ to SQL is evaluated lazily, so the search is only done when you need it. When you say "I need the count", that is when a query is created.
Before that, LINQ to SQL creates expression trees that will be "translated" into SQL when you need it.
http://weblogs.asp.net/scottgu/archive/2007/05/19/using-linq-to-sql-part-1.aspx
http://msdn.microsoft.com/en-us/netframework/aa904594.aspx
For how to debug, see Scott's post:
http://weblogs.asp.net/scottgu/archive/2007/07/31/linq-to-sql-debug-visualizer.aspx