CURRENT in BigQuery? - sql

I've noticed that CURRENT is a reserved keyword for BigQuery at: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical.
What exactly does CURRENT do? I've only seen it as a prefix for things such as CURRENT_TIME(), CURRENT_DATE(), and other such stuff but have never seen it by itself. Is this just reserved for future usage or do any SQL statements contain that as a keyword?

Just to add to the comment from @Jaytiger:
The CURRENT keyword seems to be reserved as part of the SQL:2016 spec (en.wikipedia.org/wiki/SQL_reserved_words), and you can check its usage in other DBMS implementations such as Oracle and SQL Server. Hope this is helpful.
stackoverflow.com/questions/49110728/where-current-of-in-pl-sql
In BigQuery, the CURRENT keyword appears in CURRENT ROW, which is used when defining frame_start and frame_end in window functions.
A window function, also known as an analytic function, computes values over a group of rows and returns a single result for each row.
A common usage for this is calculating a cumulative sum for each category in the table. See BigQuery window function examples for reference.
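As a rough illustration of CURRENT ROW in a window frame, here is a minimal sketch of a per-category cumulative sum. It runs against SQLite through Python's sqlite3 module (SQLite 3.25 or later is assumed for window function support) rather than BigQuery, and the sales table with its columns is hypothetical. The frame clause itself uses the same wording as in BigQuery.

```python
import sqlite3

# Hypothetical sample data: (category, day, amount).
rows = [
    ("a", 1, 10),
    ("a", 2, 20),
    ("a", 3, 30),
    ("b", 1, 5),
    ("b", 2, 15),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# The frame runs from the first row of the partition up to the CURRENT ROW,
# which yields a running total per category.
query = """
SELECT category, day,
       SUM(amount) OVER (
           PARTITION BY category
           ORDER BY day
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM sales
ORDER BY category, day
"""
running_totals = conn.execute(query).fetchall()
for row in running_totals:
    print(row)
```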

Related

List of aggregation functions in Spark SQL

I'm looking for a list of pre-defined aggregation functions in Spark SQL. I have in mind something analogous to Presto Aggregate Functions.
I Ctrl+F'd around a little in the SQL API docs to no avail... it's also hard to tell at a glance which functions are for aggregation vs. not. For example, if I didn't know avg is an aggregation function I'd be hard pressed to tell it is one (in a way that's actually scalable to the full set of functions):
avg - avg(expr) - Returns the mean calculated from values of a group.
If such a list doesn't exist, can someone at least confirm to me that there's no pre-defined function like any/bool_or or all/bool_and to determine if any or all of a boolean column in a group are true (or false)?
For now, my workaround is
select grp_col, count(if(bool_col, true, NULL)) > 0 any_agg
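For illustration only, the counting workaround above can be tried out end to end. This sketch uses SQLite through Python's sqlite3 module as a stand-in for Spark SQL, so Spark's if(...) is written as CASE WHEN; the table t and its rows are hypothetical. It also adds an "all" variant built the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp_col TEXT, bool_col INTEGER)")
# Hypothetical rows: group x is mixed, y is all false, z is all true.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", 1), ("x", 0), ("y", 0), ("y", 0), ("z", 1), ("z", 1)],
)

# COUNT ignores NULLs, so counting only the matching rows emulates any/all.
query = """
SELECT grp_col,
       COUNT(CASE WHEN bool_col THEN 1 END) > 0 AS any_agg,
       COUNT(CASE WHEN NOT bool_col THEN 1 END) = 0 AS all_agg
FROM t
GROUP BY grp_col
ORDER BY grp_col
"""
any_all = conn.execute(query).fetchall()
for row in any_all:
    print(row)
```

SQLite reports the boolean results as 1 and 0.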
Take a look at the Aggregate functions section of the Spark docs.
The list of functions is here under RelationalGroupedDataset, specifically the APIs that return DataFrame (not RelationalGroupedDataset):
https://spark.apache.org/docs/latest/api/scala/index.html?org/apache/spark/sql/RelationalGroupedDataset.html#org.apache.spark.sql.RelationalGroupedDataset

Request last hour data from Big Query with Standard SQL

This is my problem.
I would like to request only the last hour of data from BigQuery.
I would like to use Standard SQL.
I would like to pay only for reading the data in this time interval.
Example:
My partition for the day takes 200 GB. I request the last hour of data (40 GB). Is it possible to pay only for 40 GB in Standard SQL?
Thanks!
You can use table decorators (specifically range decorators), but they are supported in BigQuery Legacy SQL only.
To get data from the last hour, you can use the query below:
SELECT <list_of_fields>
FROM [yourproject:yourdataset.yourtable@-3600000-]
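The number in the decorator is a relative offset in milliseconds (3600000 ms = 1 hour), and Legacy SQL range decorators are written as `@-<milliseconds>-`. A small sketch of how such a table reference could be built for other intervals follows; the helper name last_interval_decorator is hypothetical and not part of any BigQuery client library.

```python
def last_interval_decorator(table: str, hours: float = 1.0) -> str:
    """Build a Legacy SQL table reference covering the last `hours` hours.

    `table` is expected in the project:dataset.table form used by Legacy SQL.
    """
    millis = int(hours * 3600 * 1000)  # hours -> milliseconds
    return f"[{table}@-{millis}-]"


table_ref = last_interval_decorator("yourproject:yourdataset.yourtable")
print(table_ref)
```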
Of course, the preferred query syntax for BigQuery is standard SQL. So you can either build your query logic in Legacy SQL syntax and keep the whole logic in one query, or split the logic: first get the last hour's data into a temp table using Legacy SQL decorators, then use standard SQL to apply the needed logic.
In the meantime, see the open issue below on Google's Issue Tracker:
Support an equivalent to table decorators in standard SQL
From that thread, it looks like the closest feature for your case could be hourly partitioning, whenever it becomes available.

TABLE_DATE_RANGE for xxxx_yyyymm format tables

I'm having a problem trying to query for 15 months worth of data.
I know about bigquery's wildcard functions, but I can't seem to get them to work with my tables.
For example, if my tables are called:
xxxx_201501,
xxxx_201502,
xxxx_201503,
...
xxxx_201606
How can I select everything from 201501 until today (current_timestamp)?
It seems that it's necessary to have the tables per day, am I wrong?
I've also read that you can use regex but can't find the way.
With Standard SQL, you can use a WHERE clause on a _TABLE_SUFFIX pseudo column as described here:
Is there an equivalent of table wildcard functions in BigQuery with standard SQL?
In this particular case, it would be:
SELECT ... from `mydataset.xxxx_*` WHERE _TABLE_SUFFIX >= '201501';
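The string comparison on _TABLE_SUFFIX works because fixed-width YYYYMM suffixes sort lexicographically in chronological order. A minimal sketch that enumerates the monthly suffixes in a range and checks that property; the helper month_suffixes and the end date are made up for illustration.

```python
from datetime import date
from typing import List


def month_suffixes(start: str, end: date) -> List[str]:
    """List YYYYMM suffixes from `start` (e.g. '201501') through the month of `end`."""
    year, month = int(start[:4]), int(start[4:6])
    out = []
    while (year, month) <= (end.year, end.month):
        out.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


suffixes = month_suffixes("201501", date(2016, 6, 15))
print(suffixes[0], suffixes[-1], len(suffixes))

# Lexicographic order matches chronological order for fixed-width YYYYMM,
# so a filter like suffix >= '201501' selects exactly these tables.
assert suffixes == sorted(suffixes)
```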
This is a bit long for a comment.
If you are using the standard SQL dialect, then I don't think the functionality is yet implemented.
If you are using the legacy SQL dialect, then you can use a function such as TABLE_DATE_RANGE(). This and other table wildcard functions are well documented.
EDIT:
Oh, I see. The simplest way would be to store the tables as YYYYMM01 so you can use the range query.
But, you can also use table_query():
from table_query(t, 'right(table_id, 6) >= ''201501'' ')

Is "count distinct" exact with BigQuery new standard SQL syntax?

With the legacy BigQuery syntax, we have to use the exact_count_distinct function if we want to have the exact number of distinct values for a field.
With the Standard SQL 2011 syntax, I wonder if "count(distinct myfield)" will always return the exact number of distinct values if I don't select the 'Use Legacy SQL' option.
COUNT(DISTINCT input) gives an exact count in standard SQL.
One important distinction is that COUNT(DISTINCT input) is more scalable than EXACT_COUNT_DISTINCT(input) in legacy BigQuery SQL, so in general the performance will be better and you are less likely to encounter resource exceeded errors.
You can read about other differences between legacy and standard SQL in the migration guide.
Based on the documentation for APPROX_COUNT_DISTINCT (reading between the lines):
COUNT(DISTINCT input) - exact count
APPROX_COUNT_DISTINCT(input) - approximate result

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, and so the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
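For reference, the "case when" guard mentioned above could look roughly like this. The sketch runs against SQLite through Python's sqlite3 module, so it uses substr() and length() instead of SQL Server's substring() and len(). SQLite does not raise SQL Server's error on a negative length, so this only shows the shape of the guard, not the failure itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NAMES (ID INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO NAMES VALUES (?, ?)", [(1, "Peter"), (2, "X")])

# The length check sits in the same expression as the substring call,
# so the guard does not depend on a join or WHERE clause being applied first.
query = """
SELECT ID,
       CASE WHEN length(NAME) > 3
            THEN substr(NAME, 1, length(NAME) - 3)
       END AS trimmed
FROM NAMES
ORDER BY ID
"""
trimmed = conn.execute(query).fetchall()
for row in trimmed:
    print(row)
```

Names that are too short come back as NULL (None in Python) instead of triggering the substring call.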
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it, I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various Usenet newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of the table constructors in the FROM clause.
Remove from the working table those rows that do not satisfy the WHERE clause.
Construct the expressions in the SELECT clause against the working table.
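The three conceptual steps above can be sketched in plain Python, using the NAMES and LONG_NAMES rows from the question. This only models the logical order (FROM, then WHERE, then SELECT); it says nothing about how an engine physically executes a query. The length predicate in step 2 is the hypothetical len(NAME) > 3 filter from the question, not part of the vendor's query.

```python
# Rows taken from the question's example tables.
names = [{"ID": 1, "NAME": "Peter"}, {"ID": 2, "NAME": "X"}]
long_names = [{"ID": 1, "NAME_LENGTH": 5}]

# Step 1: build the working table from the FROM clause (here, the inner join).
working = [
    {**n, **ln}
    for n in names
    for ln in long_names
    if n["ID"] == ln["ID"]
]

# Step 2: remove rows that do not satisfy the WHERE clause.
filtered = [row for row in working if len(row["NAME"]) > 3]

# Step 3: only now evaluate the SELECT expressions, against surviving rows.
result = [row["NAME"][: len(row["NAME"]) - 3] for row in filtered]
print(result)
```

In this model the substring expression never sees "X", because that row is gone before step 3.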
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It is based on query optimization rules, indexes, temporary buffers, and execution-time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can look at the estimated execution plan; it shows how your query will be rearranged to gain some speed. So if you have just used your NAMES table and it is in the buffer, the engine might first process that data and only then join it with the other table.