Google BigQuery now supports UDFs that work like mappers in MapReduce.
BigQuery supports user-defined functions (UDFs) written in JavaScript. A UDF is similar to the "Map" function in a MapReduce: it takes a single row as input and produces zero or more rows as output. The output can potentially have a different schema than the input.
From https://cloud.google.com/bigquery/user-defined-functions
What's the motivation behind implementing UDFs on rows rather than allowing UDFs that work as pure functions on columns/fields, like UDFs in Hive (https://cwiki.apache.org/confluence/display/Hive/HivePlugins)?
I guess you can express any UDF that works on a column (like a Hive UDF) as a UDF that works on rows (a BigQuery UDF), but not vice versa. You could do this by defining a BigQuery UDF with the same input and output schema as the dataset, passing every value through unchanged except the field you want to apply your function to.
This is of course cumbersome if you want to apply the same function to different datasets with different schemas. Please help me understand.
The current implementation of UDFs in BigQuery is just the first step. As you note, it is the most generic approach if you want to be able to deal with nested and repeated structures, but it makes things cumbersome when you just want simple scalar values. Expect future improvements in this area, where simple UDFs will be simple.
Related
I've been digging through the Internet over and over and couldn't find any reasonable answer. What's the difference between inlining and flattening in a SQL query? I actually use both terms interchangeably; in the end they lead to the same result: one big single query rather than many atomic ones.
But maybe there is a difference in definitions? For instance, does inlining refer only to functions, while flattening means converting a subquery to a join, as stated here? But in another source one can find an example of a completely different transformation.
I guess there may be slight differences in how "inlining" and "flattening" are defined by different people, but as these terms are normally understood in the PostgreSQL community, inlining means pulling the definition of a LANGUAGE sql function into the main query, and flattening means transforming a subquery or view into something other than a subquery, for example a join.
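A minimal PostgreSQL sketch of both terms; the table, column, and function names are made up:

    -- Inlining: the planner can pull the body of this LANGUAGE sql function
    -- into the calling query, as if the expression had been written there.
    CREATE FUNCTION discounted(p numeric) RETURNS numeric
        LANGUAGE sql IMMUTABLE
        AS $$ SELECT p * 0.9 $$;

    -- May be planned as if you had written: SELECT price * 0.9 FROM items;
    SELECT discounted(price) FROM items;

    -- Flattening: the subquery is transformed into something other than a
    -- subquery, here (roughly) a semijoin against categories.
    SELECT i.name
    FROM items i
    WHERE i.category_id IN (SELECT c.id FROM categories c WHERE c.active);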
I am developing SSRS reports that require a user selection (via a parameter) to retrieve either live data or historical data.
The sources for live and historical data are separate objects in a SQL Server database (views for live data; table-valued functions accepting a date parameter for historical data), but their schemas - the columns they return - are the same, so other than the dataset definition, the rest of the report doesn't need to know what its source is.
The dataset query draws from several database objects, and it contains joins and case statements in the select.
There are several approaches I can take to surfacing data from different sources based on the parameter selection I've described (some of which I've tested), listed below.
The main goal is to ensure that performance for retrieving the live data (the primary use case) is not unduly affected by the presence of logic and harnessing to support the history use case. In addition, ease of maintenance of the solution (including database objects and rdl) is a secondary, but important, factor.
1. Use an expression in the dataset query text to conditionally return the full SQL query text, with the correct sources included, using string concatenation. Pros: can resolve to a straight query that isn't polluted by the 'other' use case for any given execution. All logic for the report is housed in the report. Cons: awful to work with, and has limitations for lengthy SQL.
2. Use a function in the report's code module to do the same as 1. Pros: as per 1, but a marginally better design-time experience. Cons: as per 1, but also adds another layer of abstraction that reduces ease of maintenance.
3. Implement multi-statement TVFs on the database that process the parameter and retrieve the correct data using logic in T-SQL. Pros: flexibility of T-SQL functionality, no string building/substitution involved. Can SELECT * FROM the results and apply further report parameters in the report's dataset query. Cons: big performance hit compared to in-line queries. Moves some logic outside the rdl.
4. Implement stored procedures to do the same as 3. Pros: as per 3, but without the ease of SELECT *. Cons: as per 3.
5. Implement in-line TVFs that union together live and history data, using a dummy input parameter that adds something resolving to 1=0 in the WHERE clause of the source that isn't relevant (sketched below). Pros: clings on to the in-line query approach; other pros as per 3. Cons: feels like a hack, and adds a performance hit just for a query component that is known to return 0 rows. Adds complexity to the query.
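A minimal T-SQL sketch of option 5, with hypothetical object names (dbo.LiveSales as the live view, dbo.HistoricalSales as the history TVF); the @UseHistory parameter makes the irrelevant branch's WHERE clause resolve to 1=0:

    CREATE FUNCTION dbo.SalesData (@UseHistory bit, @AsOfDate date)
    RETURNS TABLE
    AS
    RETURN
        -- Live branch: contributes rows only when @UseHistory = 0.
        SELECT s.SaleDate, s.Amount
        FROM dbo.LiveSales s
        WHERE @UseHistory = 0
        UNION ALL
        -- History branch: contributes rows only when @UseHistory = 1.
        SELECT h.SaleDate, h.Amount
        FROM dbo.HistoricalSales(@AsOfDate) h
        WHERE @UseHistory = 1;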
I am leaning towards option 3 or 4 at this point, but I'm eager to hear what the preferred approach would be (even if it's not listed here), and why.
What's the difference between live and historical? Is "live" data data that changes, while historical data does not?
Is it not possible to replicate or push live/historical data into a Data Warehouse built specifically for reporting?
To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext or to build queries via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications in every supported language. With HiveContext, they can also be used to expose functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be the same. How you express the query is just a matter of style.
In reality, there is a difference, according to the Hortonworks report (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.
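A minimal Spark SQL sketch of the kind of query the report describes; the table and column names here are made up:

    SELECT name, COUNT(*) AS total
    FROM records
    GROUP BY name
    ORDER BY name DESC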
By using DataFrames, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements, and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the result to a DF brings better understanding.
By splitting a query into multiple DFs, the developer gains the advantage of using cache and repartitioning (to distribute data evenly across the partitions using a unique/close-to-unique key).
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
HashAggregation creates a HashMap, using the grouping columns as the key and the rest of the columns as the values in the map.
Spark SQL uses HashAggregation where possible (i.e., when the data for the values is mutable): O(n).
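One way to see which strategy Spark picked is to inspect the physical plan with EXPLAIN; a hedged sketch with made-up names:

    EXPLAIN
    SELECT name, COUNT(*) AS total
    FROM records
    GROUP BY name;
    -- The physical plan will contain HashAggregate (or SortAggregate) operators.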
I am a newbie in using functions and it appears to me that an inline function is very similar to a view. Am I correct?
Also, can I have UPDATE statements within a function?
After reading many of the answers here, I'd like to note that there is a big difference between an inline table-valued function and any other kind of function (scalar or multi-statement TVF).
An inline TVF is simply a parameterized view. It can be expanded and optimized away just like a view. It is not required to materialize anything before "returning results" or anything like that (although, unfortunately, the syntax does include a RETURN).
A big benefit I've found of an inline TVF over a view is that it forces the required parameterization, whereas with a view you have to assume that the caller will appropriately join to or restrict the usage of the view.
For example, we have many large fact tables in our DW with a typical Kimball star model. I have a view on a fact-table-centered model which, called without any restriction, will return hundreds of millions of rows. By using an inline TVF with appropriate parameterization, users are unable to accidentally ask for all the rows. Performance is largely indistinguishable between the two.
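A minimal T-SQL sketch of that idea; the object and column names are hypothetical:

    -- The inline TVF is just a parameterized view over the fact table;
    -- callers must supply a date range, so they cannot ask for every row by accident.
    CREATE FUNCTION dbo.FactSales_ByDate (@StartDate date, @EndDate date)
    RETURNS TABLE
    AS
    RETURN
        SELECT f.SaleDate, f.CustomerKey, f.Amount
        FROM dbo.FactSales f
        WHERE f.SaleDate >= @StartDate
          AND f.SaleDate <  @EndDate;
    GO
    SELECT * FROM dbo.FactSales_ByDate('2015-01-01', '2015-02-01');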
No difference. They are both expanded/unnested into the containing query.
Note: indexed views are treated differently, but may still be expanded, and multi-statement table-valued functions are black boxes to the containing query.
Tony Rogerson: VIEWS - they offer no optimisation benefits; they are simply inline macros - use sparingly
Adam Machanic: Scalar functions, inlining, and performance: An entertaining title for a boring post
Related SO question: Does query plan optimizer works well with joined/filtered table-valued functions?
Scary DBA (at the end)
Finally, writes to tables are not allowed in functions.
Edit, after Eric Z Beard's comment and downvote...
The question and answers (not just mine) are not about scalar UDFs. "Inline" means "inline table-valued functions". Very different concepts...
Update: Looks like I missed the "inline" part. However, I'm leaving the answer here in case someone wants to read about the difference between VIEWs and regular functions.
If you only have a function that does a SELECT and outputs the data, then they are similar. However, even then they are not the same, because VIEWs can be optimized by the engine. For example, if you run SELECT * FROM view1 WHERE x = 10; and there is an index on the table field that maps to x, it will be used. A function, on the other hand, builds its result set before any outer filtering is applied, so you would have to move the WHERE inside the function; and that is not easy, because you might have many columns and you cannot ORDER BY all of them in the same SELECT statement.
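A hedged T-SQL sketch of that difference; the table and object names are hypothetical:

    -- The view is expanded into the outer query, so an index on t.x can be used:
    CREATE VIEW view1 AS SELECT t.x, t.y FROM t;
    GO
    SELECT * FROM view1 WHERE x = 10;
    GO
    -- A multi-statement function materializes its whole result set first,
    -- and the outer WHERE is applied only afterwards:
    CREATE FUNCTION dbo.f1 ()
    RETURNS @r TABLE (x int, y int)
    AS
    BEGIN
        INSERT @r SELECT t.x, t.y FROM t;  -- every row is built here
        RETURN;
    END;
    GO
    SELECT * FROM dbo.f1() WHERE x = 10;  -- filters the already-built rows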
Therefore, if you compare views and functions for the same task of giving a "view" over data, then VIEWs are a better choice.
BUT, functions can do much more. You can do multiple queries without needing to combine tables with JOINs or UNIONs. You can do some complex calculations with the results, run additional queries, and output the data to the user. Functions are more like stored procedures that are able to return datasets than they are like views.
Nobody seems to have mentioned this aspect.
You can't have UPDATE statements inside an inline function, but you can write UPDATE statements against one, just as though it were an updatable view.
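A hedged T-SQL sketch of updating through an inline TVF (the function and column names are hypothetical):

    DECLARE @CustomerId int = 42;

    -- The inline TVF behaves like an updatable view, so the UPDATE is
    -- written against the function's result rather than the base table.
    UPDATE f
    SET f.Status = 'Closed'
    FROM dbo.OrdersForCustomer(@CustomerId) AS f
    WHERE f.OrderDate < '2015-01-01';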
One big difference is that a function can take parameters whereas a VIEW cannot.
I tend to favour VIEWs, being a Standard and therefore portable implementation. I use functions when the equivalent VIEW would be meaningless without a WHERE clause.
For example, I have a function that queries a relatively large valid-time state table (a 'history' table). If this were a VIEW and you tried to query it without a WHERE clause, you'd get a whole lot of data (eventually!). Using a function establishes a contract: if you want the data, you must supply a customer ID, a start date and an end date, and the function is how I establish this contract. Why not a stored proc? Well, I expect a user to want to JOIN the resultset to further data (tables, VIEWs, functions, etc.), and a function is IMO the best way of doing this, rather than, say, requiring the user to write the resultset to a temporary table.
Answering your question about updates in a function (msdn):
The only changes that can be made by the statements in the function are changes to objects local to the function, such as local cursors or variables. Modifications to database tables, operations on cursors that are not local to the function, sending e-mail, attempting a catalog modification, and generating a result set that is returned to the user are examples of actions that cannot be performed in a function.
A view is a "view" of data that is returned from a query, almost a pseudo-table. A function returns a value/table, usually derived from querying the data. You can run any SQL statement in a function, provided the function eventually returns a value/table.
A function allows you to pass in parameters to create a more specific view. Let's say you wanted to grab customers based on state. A function would allow you to pass in the state you are looking for and give you all the customers from that state. A view can't do that.
A function performs a task, or many tasks. A view retrieves data via a query, and whatever fits in that query is what you are limited to. In a function I can update, select, create table variables, delete some data, send an email, interact with a CLR that I create, etc. Way more powerful than a lowly view!
A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution API to avoid SQL injection, since the query elements will depend on what the user decided to include in the query. I can't see how else to build this query other than by using string appends.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
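A hedged sketch of this rule using SQL Server's sp_executesql; the table, column, and variable names are made up. The code decides which pieces of syntax to append, while the user's values travel only as parameters:

    DECLARE @TagFilter nvarchar(50) = N'sql';   -- user input
    DECLARE @UserFilter int = NULL;             -- user input
    DECLARE @sql nvarchar(max) = N'SELECT p.Id, p.Title FROM dbo.Posts p WHERE 1 = 1';

    -- Syntax is appended by our code, never taken from user input...
    IF @TagFilter IS NOT NULL SET @sql += N' AND p.Tag = @tag';
    IF @UserFilter IS NOT NULL SET @sql += N' AND p.UserId = @userId';

    -- ...while the user-supplied values are bound as parameters.
    EXEC sp_executesql @sql,
        N'@tag nvarchar(50), @userId int',
        @tag = @TagFilter, @userId = @UserFilter;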
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value belongs to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table that you join against (see the sketch after this list). This approach can easily be extended to multiple columns, as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
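A hedged SQL sketch of the filtering-table idea from the first point; all names are hypothetical. The arbitrary set of values the user picked is inserted into a filter table (via parameterized statements), so the main query itself never changes:

    CREATE TABLE #tag_filter (tag varchar(50) PRIMARY KEY);
    -- The user's chosen tags are inserted here, each value bound as a parameter.
    INSERT INTO #tag_filter (tag) VALUES ('sql'), ('performance');

    SELECT DISTINCT p.id, p.title
    FROM posts p
    JOIN post_tags pt ON pt.post_id = p.id
    JOIN #tag_filter f ON f.tag = pt.tag;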
If you are using Python to access your database, I would suggest you use the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name
Model usage (this is effectively an INSERT statement):
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex: you pass around a query object, and you can add filters/sort elements to it. When you are finally ready to use the query, Django creates an SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
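For instance, a chained query like Blog.objects.filter(name__contains='Beatles').order_by('-name') would come out as SQL roughly like the following (a hedged sketch; the exact table name and quoting depend on the app layout and database backend, and in practice the pattern value is bound as a parameter rather than spliced in):

    SELECT "blog_blog"."id", "blog_blog"."name", "blog_blog"."tagline"
    FROM "blog_blog"
    WHERE "blog_blog"."name" LIKE '%Beatles%'
    ORDER BY "blog_blog"."name" DESC;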
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django analyses your models and creates an automatic admin site out of them.
Well, the options have to map to something.
Concatenating a SQL query string isn't a problem if you still use parameters for the option values.