SQL UDF - Struct Diff - google-bigquery

We have a table with 2 top-level columns of type STRUCT - one is a 'before' image and the other an 'after' image. The struct schemas are non-trivial: nested, with arrays, to a variable depth. They are sent to us from replication, so the two schemas are always the same (the schemas can of course be updated at some point, but always together).
The objective is, for the two input structs, to return 2 struct 'diffs' of the before and after containing only the fields that have changed - essentially the 'delta' diff of the changes produced by the replication source. We know something has changed, but not 'what', since we only get the full before and after images. This raw data lands in BQ and is then processed from there, but we need to determine the more granular change for higher-order BQ processing.
The table schema is very wide (thousands of leaf fields), and the data is populated fairly sparsely (so a lot of nulls will be present on both sides of the snapshot) - so it would need to be as performant as possible when executing over tens of millions of rows.
All fields are nullable for maximum flexibility.
So change could look like:
null -> value
value -> null
valueA -> valueB
Arrays:
recursive use of the above for arrays of structs; ordering could be relaxed if that makes it easier?
It might not be possible.
I've not attempted this yet as it seems really difficult, so I am looking to the community boffins for some support. I feel the arrays could be the difficult part. There is probably an easier way I don't know about, perhaps in Python, or even doing some JSON conversion and comparison using JSON tools? It feels like it would be a super cool feature built into BQ as well, so if we can get this to work, I will add a feature request for it.
I'd like to have a SQL UDF for reuse (we have SQL skills, not Python, although if it's easier in Python then that's OK), and now with the new feature of persistent SQL UDFs, this seems the right time to ask and test the feature out!
A pseudo-signature along these lines (but open to suggestions):
def struct_diff(before STRUCT, after STRUCT) -> (beforeChange, afterChange)
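Spelled out concretely, a persistent BigQuery SQL UDF of that shape could look roughly like the sketch below; the dataset name, the simplified two-field STRUCT, and the field names are purely illustrative assumptions, since the real replication schema is far wider:

-- illustrative only: a leaf is kept when it changed (null -> value, value -> null, valueA -> valueB)
CREATE OR REPLACE FUNCTION mydataset.struct_diff(
    before STRUCT<id INT64, name STRING>,   -- hypothetical, heavily simplified schema
    after  STRUCT<id INT64, name STRING>)
AS (
  STRUCT(
    STRUCT(
      IF(before.id = after.id OR (before.id IS NULL AND after.id IS NULL), NULL, before.id) AS id,
      IF(before.name = after.name OR (before.name IS NULL AND after.name IS NULL), NULL, before.name) AS name
    ) AS beforeChange,
    STRUCT(
      IF(before.id = after.id OR (before.id IS NULL AND after.id IS NULL), NULL, after.id) AS id,
      IF(before.name = after.name OR (before.name IS NULL AND after.name IS NULL), NULL, after.name) AS name
    ) AS afterChange
  )
);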

It appears to be really difficult to get a piece of reusable code for this. Since there is currently no support for recursive SQL UDFs, you cannot use a recursive approach for the nested structs.
However, you might be able to write some specific SQL UDFs depending on the structure of your arrays and structs. You can use an approach like the following to compare the structs.
-- wraps the two values side by side so the caller sees both images
CREATE TEMP FUNCTION final_compare(s1 ANY TYPE, s2 ANY TYPE) AS (
  STRUCT(s1 AS prev, s2 AS cur)
);
-- descends one level; repeat the pattern for each nested field (here a field named structA)
CREATE TEMP FUNCTION compare(s1 ANY TYPE, s2 ANY TYPE) AS (
  STRUCT(final_compare(s1.structA, s2.structA))
);
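Since these are temporary functions, they run in the same script as the query that uses them. Against a hypothetical raw table called snapshots, whose two image columns are named before and after, the call would then be:

SELECT compare(before, after) AS diff
FROM snapshots;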
You can use UNNEST to work with arrays, and the final SQL UDF would really depend on your data.
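As a rough, assumption-laden sketch of the UNNEST route for one repeated field (a made-up items array on each image, on the same hypothetical snapshots table), a positional comparison could look like this, serialising each element with TO_JSON_STRING so you don't have to spell out every leaf:

SELECT
  ARRAY(
    SELECT AS STRUCT
      before.items[SAFE_OFFSET(pos)] AS prev,   -- NULL when the element was added
      after.items[SAFE_OFFSET(pos)]  AS cur     -- NULL when the element was removed
    FROM UNNEST(GENERATE_ARRAY(
           0,
           GREATEST(IFNULL(ARRAY_LENGTH(before.items), 0),
                    IFNULL(ARRAY_LENGTH(after.items), 0)) - 1)) AS pos
    WHERE TO_JSON_STRING(before.items[SAFE_OFFSET(pos)])
          != TO_JSON_STRING(after.items[SAFE_OFFSET(pos)])
  ) AS changed_items
FROM snapshots;

This assumes the arrays line up positionally; relaxing the ordering would mean joining on some key field inside the element instead.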
As #rtenha suggested, Python could make this problem a lot easier to handle.
Finally, I did some tests using a JavaScript UDF, and it gave basically the same result, if not worse than the SQL UDF.
The console allows a recursive definition of the function, but it will fail during execution. Also, JavaScript UDFs don't accept the ANY TYPE data type in the signature, so you would have to spell out the whole STRUCT definition or use a workaround like applying TO_JSON_STRING to your struct in order to pass it in as a string.
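As a minimal sketch of that TO_JSON_STRING workaround (table and column names are again made up), serialising both images at least tells you whether anything changed, even though it doesn't localise the change:

CREATE TEMP FUNCTION images_differ(s1 ANY TYPE, s2 ANY TYPE) AS (
  TO_JSON_STRING(s1) != TO_JSON_STRING(s2)
);

SELECT *
FROM snapshots
WHERE images_differ(before, after);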

Related

How to invoke UDF for each element in an array in hive?

I have a hive table with one column being an array of strings. I also have a set of custom UDFs that manipulate individual strings. I would like to make hive execute my custom UDF on each element in an array and then return the result as a modified array.
This seems like a simple requirement, but I wasn't able to find a simple solution for it. I found two possibilities, neither of them really simple:
Do some Hive SQL gymnastics with explode and lateral view, then invoke the UDF, then aggregate back into an array (a sketch of this appears after this list). This seems like overkill, as I don't see it executing in fewer than 2 MapReduce jobs (but I could be wrong here).
Implement each of my UDFs as a GenericUDF that is supplied with an array, processes each element in it, and returns an array again. This requires a lot more development.
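For reference, a rough sketch of option 1, assuming a table t with a key column id, an array column arr, and a UDF my_udf over single strings (all names made up); note that collect_list does not guarantee the original element order:

SELECT t.id,
       collect_list(my_udf(elem)) AS modified_arr
FROM t
LATERAL VIEW explode(t.arr) exploded AS elem
GROUP BY t.id;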
Is there any simple way to do this?
There's no way I know of to do it without either more custom UDF code or, as you say, more MR jobs.
But I would suggest a possible third option - write a GenericUDF that takes two arguments: an array and the class name of another UDF. Instantiate and call the UDF through reflection, pass it everything in the array, and return the resulting array. This might be a bit difficult to write, but at least then you won't have to rewrite all of your existing UDFs, as you mentioned.

Using bind variables in large insert statements

I am inheriting an application which has to read data from various types of files and use the OCI interface to move the data into an Oracle database. Most of the tables in question have about 40-50 columns, so the SQL insert statements become pretty lengthy.
When I inherited this code, it basically built up the insert statements via a series of strcats as a C string, then passed it to the appropriate OCI functions to set up and execute the statement. However, since much of the data is read directly from files into the column values, this leaves the application open to easy SQL injection. So I am trying to use bind variables to solve this problem.
In every example OCI application I can find, each variable is statically allocated and bound individually. This would lead to quite a bit of boilerplate, however, and I'd like to reduce it to some sort of looping construct. So my solution is, for each table, to create a static array of strings containing the names of the table columns:
const char *const TABLE_NAME[N_COLS] = {
"COL_1",
"COL_2",
"COL_3",
...
"COL_N"
};
along with a short function that makes a placeholder out of a column name:
void makePlaceholder(char *buf, const char *col);
// "COLUMN_NAME" -> ":column_name"
So I then loop through each array and bind my values to each column, generating the placeholders as I go. One potential problem here is that, because the types of each column vary, I bind everything as SQLT_STR (strings) and thus expect Oracle to convert to the proper datatype on insertion.
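So, for a hypothetical three-column table, the loop ends up assembling statement text along these lines, with every value supplied through a named bind variable:

INSERT INTO my_table (COL_1, COL_2, COL_3)
VALUES (:col_1, :col_2, :col_3)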
So, my question(s) are:
What is the proper/idiomatic way (if such a thing exists for SQL/OCI) to use bind variables for SQL insert statements with a large number of columns/params? More generally, what is the best way to use OCI for this type of large insert statement?
Do large numbers of bind calls have a significant cost in efficiency compared to building and using vanilla C strings?
Is there any risk in binding all variables as strings and allowing Oracle to make the proper type conversion?
Thanks in advance!
Not sure about the C aspects of this. My answer will be from a DBA perspective.
Question 2:
Always use bind variables. They prevent SQL injection and enhance performance.
The performance aspect is often overlooked by programmers. When Oracle receives a SQL statement, it makes a hash of the entire SQL text and looks in its internal repository of execution plans to see if it already has one. If bind variables are used, the SQL text will be the same each time you run the query, no matter what the value of a variable is. However, if you have concatenated the string yourself, Oracle will hash the SQL text including the content of (what you ought to have put in) variables, getting a unique hash every time. So if you run a query one million times, with bind variables Oracle will make one execution plan, while without them it will make one million execution plans and waste loads of resources doing that.
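To illustrate (with made-up table and column names): the first two statements below hash to different values and each get their own parse, while the third hashes the same no matter what values are bound, so its plan is reused:

-- literal values: a different SQL text (and hash) every time
INSERT INTO employees (emp_id, emp_name) VALUES (1001, 'Smith');
INSERT INTO employees (emp_id, emp_name) VALUES (1002, 'Jones');

-- bind variables: one SQL text, one hash, one reusable execution plan
INSERT INTO employees (emp_id, emp_name) VALUES (:emp_id, :emp_name);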

SELECT FROM (lv_tablename) error: the output table is too small

I have an ABAP class method, say, select_something. select_something has an exporting parameter, say, et_result. et_result is of type standard table because the type of et_result cannot be determined until runtime.
The method sometimes gives a short dump saying "With ABAP/4 Open SQL array select, the output table is too small" at "select * into table et_result from (lv_tablename) where...".
Error analysis:
......in this particular case, the database table is 3806 bytes wide, but the internal table is only 70 bytes wide.
I tried "any table" too and the error is the same.
You could return a data reference. Your query will no longer fail, and you can assign the data to a correctly typed field symbol afterwards.
" Definition
class-methods select_all
importing
!tabname type string
returning
value(results) type ref to data.
...
...
" Implementation
method select_all.
data dref type ref to data.
create data dref type standard table of (tabname).
field-symbols <tab> type any table.
assign dref->* to <tab>.
select * from (tabname) into table <tab>.
get reference of <tab> into results.
endmethod.
Also, I agree with #vwegert that dynamic queries (and programming for that matter) should be avoided when possible.
What you're trying to do looks horribly wrong on many levels. NEVER use SELECT FROM (whatever) unless someone points a gun at your head AND the door is locked tight. You'll lose every kind of static error checking the system might be able to provide you with. For example, the compiler will no longer be able to tell you "Hey, that table you're reading from is 3806 bytes wide." It simply can't tell, even if you use constants. You'll find that out the hard way, producing short dumps, especially when switching between Unicode and non-Unicode (NUC) systems, quite likely some of them in production systems. No fun.
(Actually there are a few - very very VERY few - good uses for dynamic table names in the SELECT statement. I need them about once every two to three years, and I code quite a lot of weird stuff. Just avoid them wherever you can, even at the cost of writing more code. It's just not worth the trouble of fixing broken stuff later.)
Then, changing the generic formal parameter type does not do anything to the type of the actual parameter. If you pass a STANDARD TABLE OF mandt WITH DEFAULT KEY to your method, that table will have lines of 3 characters. It will be a STANDARD TABLE, and as such, it will also be an ANY TABLE, and that's about it. You can twist the generic types any way you like; there's no way to enforce correctness using generic types the way you use them. It's up to the caller to make sure that all the right types are used. That's a bad way to fly.
First off, I agree with vwegert's response: try to avoid dynamic SQL selections if you can.
That said, check the short dump. If the error is an exception class, you can wrap the SELECT statement in a try/catch block and at least stop it from dumping.
You can also try "INTO CORRESPONDING FIELDS OF TABLE et_result". If ET_RESULT is dynamic, you might have to cast it into the proper structure using RTTS. This might give you some ideas...
Couldn't agree more with vwegert, but if there is absolutely no other way (and there usually is) of performing your task than using dynamic select statements and dynamically typed parameters, do some checks on the type of the table and the parameter at runtime.
Use CL_ABAP_TYPEDESCR and its subclasses to do so.
This way, you can handle errors at runtime without your program dumping.
But as vwegert said, this dynamic stuff is pure evil and will most certainly break at some point during runtime. Adding the necessary error handling will most likely be a lot more work and a lot harder than redesigning your code to use non-dynamic SQL and typed parameters.

Dynamic Pivot Query without storing query as String

I am fully familiar with the following method in the link for performing a dynamic pivot query. Is there an alternative method to perform a dynamic pivot without storing the Query as a String and inserting a column string inside it?
http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx
Short answer: no.
Long answer:
Well, that's still no. But I will try to explain why. As of today, when you run a query, the DB engine demands to know the result set structure (number of columns, column names, data types, etc.) that the query will return. Therefore, you have to define the structure of the result set when you ask the DB for data. Think about it: have you ever run a query where you did not know the result set structure beforehand?
That also applies even when you do select *, which is just syntactic sugar. In the end, the returned structure is "all columns in such table(s)".
By assembling a string, you dynamically generate the structure that you desire, before asking for the result set. That's why it works.
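For example (SQL Server syntax, with made-up table and column names), the usual dynamic-pivot pattern builds exactly that missing column list as a string before the engine ever sees the query:

DECLARE @cols nvarchar(max) = N'[2019],[2020],[2021]';   -- normally assembled from the data itself
DECLARE @sql  nvarchar(max) = N'SELECT *
  FROM sales_by_year
  PIVOT (SUM(amount) FOR sales_year IN (' + @cols + N')) AS p;';
EXEC sp_executesql @sql;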
Finally, you should be aware that assembling the string dynamically could, in theory (although it's not probable), give you a result set with an unbounded number of columns. Of course, that's not actually possible and it would fail, but I'm sure you understand the implications.
Update
I found this, which reinforces the reasons why it does not work.
Here:
SSIS relies on knowing the metadata of the dataflow in advance, and a dynamic pivot (which is what you are after) is not compatible with that.
I'll keep looking and adding here.

SQL:1999 Array Type Constructor Usage?

Can anyone confirm whether or not the SQL:1999 Array type constructor provides any operations for searching the Array in a WHERE clause?
As an example, if a table EMPLOYEES had a column
QUALIFICATION VARCHAR(20) ARRAY[10]
containing values such as ARRAY['BSC','MBA']
Does the standard support some way of querying EMPLOYEES to find all Employees with an MBA?
Well, you can always use an element reference (ISO/IEC 9075-2:1999, 6.13):
WHERE QUALIFICATION[1] = 'BSC'
   OR QUALIFICATION[2] = 'BSC'
...
Of course, the problem is that you need to write a comparison for each possible position.
I am not aware of any operators that allow you to compare a scalar with an array, although I would suppose a DBMS that has native support for ARRAY types would let you create a function that does the job.
I must say I never had the need for array types - I would typically build a one-to-many detail table, or in rare cases, add multiple columns (yeah - a repeating group. send the relational police to hunt me if you like :)
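A sketch of that one-to-many alternative, with illustrative table and column names:

-- detail table instead of an ARRAY column
CREATE TABLE employee_qualification (
  emp_id        INTEGER     NOT NULL,
  qualification VARCHAR(20) NOT NULL
);

-- "all employees with an MBA"
SELECT e.*
FROM   employees e
WHERE  EXISTS (SELECT 1
               FROM   employee_qualification q
               WHERE  q.emp_id = e.emp_id
               AND    q.qualification = 'MBA');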
Would you care to explain why you need to know this, or what problem you are trying to solve with an ARRAY?