Dynamic FROM in U-SQL statement - azure-data-lake

I am trying to generate a dynamic FROM clause in U-SQL so that we can extract data from different files based on a previous query's outcome. Something like this:
@filesToExtract = SELECT whatevergeneratesthepaths FROM @foo; // this query generates a rowset with all the files we want to extract, like: [/path/file1.csv, /path/file2.csv]
SELECT * FROM @filesToExtract; // here we want to extract the data from file1 and file2
I'm afraid this kind of dynamic query is not supported yet, but can someone point me toward a way to achieve this? It seems that the only feasible approach is to generate another U-SQL script and execute it afterwards.
Thanks in advance.

It is not fully clear from your question whether you want the file names to be dynamically retrieved and passed to an EXTRACT statement, or the names of tables/rowsets to be passed to a SELECT's FROM clause. Or both.
In general, you cannot dynamically generate source names from your U-SQL expression. You may want to file a feature request at http://aka.ms/adlfeedback for dynamically or statically parameterizable sources.
Having said that, depending on your exact requirements, there may be some ways to achieve your goals without the work-around you describe.
For example, you could write your code as a parameterized table-valued function and then pass it different rowsets from different scripts (see the sketch after the pseudo-code below), or - if you can statically decide which rowset to choose - you can use the IF statement.
Here is a pseudo-code example:
DECLARE EXTERNAL @someconditionparameter bool = true;

IF (@someconditionparameter) THEN
    @data = EXTRACT a int, b string FROM @fileset1 USING Extractors.Csv();
ELSE
    @data = EXTRACT a int, b string FROM @file2 USING Extractors.Csv();
END;

@results = MyTableValuedFunction(@data);
...
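For illustration, here is a rough sketch of what such a table-valued function could look like, reusing the name and the (a int, b string) schema from the pseudo-code above; the body is just a placeholder for your real logic, not exact production syntax:
CREATE FUNCTION IF NOT EXISTS MyTableValuedFunction(@input TABLE(a int, b string))
RETURNS @result TABLE(a int, b string)
AS BEGIN
    // Placeholder logic; apply whatever shared processing you need to the rowset passed in.
    @result = SELECT a, b FROM @input WHERE a > 0;
END;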
If your files are schematized differently, you may be able to use flexible column sets (currently in preview, see release notes) in the TVF to handle the variability of the rowset schema.

Related

Adding today's date in table name when using Create Table function in standard SQL GBQ

I am quite new to GBQ and any help is appreciated.
I have a query below:
#Standard SQL
create or replace table `xxx.xxx.applications`
as select * from `yyy.yyy.applications`
What I need to do is add today's date at the end of the table name, so it is something like xxx.xxx.applications_<today's date>;
basically, keep the table name applications but add the date at the end of it.
I am writing a procedure that creates the table every time it runs, but I need to add the date each time for audit purposes (as a backup).
I searched everywhere and can't find the exact answer. Is this possible in the Query Editor, as I need to store this as a proc?
Thanks in advance
BigQuery doesn't support dynamic SQL at the moment, which means that this kind of construction is not possible.
Currently BigQuery supports parameterized queries, but it's not possible to use parameters to dynamically change the source table's name, as you can see in the provided link:
BigQuery supports query parameters to help prevent SQL injection when queries are constructed using user input. This feature is only available with standard SQL syntax. Query parameters can be used as substitutes for arbitrary expressions. Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
If you need to build a query based on some variable's value, I suggest that you use a script in shell, Python or any other programming language to create the SQL statement and then execute it using the bq command.
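For example, the script would render today's date into the statement before submitting it, so that what actually reaches BigQuery is plain, static SQL along these lines (the date shown is illustrative):
create or replace table `xxx.xxx.applications_20200101`
as select * from `yyy.yyy.applications`;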
Another approach could be using the BigQuery client library in some of the supported languages instead of the bq command.

Use SQL Field in SSIS Variable

Is it possible to reference a SQL field in your SSIS variable?
For instance, I would like to use the field from the "table" below:
Select '999999' AS Physician_Profile_ID
as a dynamic variable (named "CMSPhysProID" in our example).
I plan on concatenating multiple IDs into an IN statement.
Possible by using an Execute SQL Task. In the left pane of the Execute SQL Task, on the General tab:
1. Set ResultSet to Single row.
2. Set the connection type to OLE DB.
3. Set the connection and form the SQL statement, as you mentioned: Select '999999' AS Physician_Profile_ID.
4. Go to Result Set in the left pane.
5. Add your variable where you want to store '999999'.
6. Click OK.
If you are looking to store the value within the variable to be used later, you can simply use an Execute SQL Task with a single row result set. More details in the following article:
SSIS Basics: Using the Execute SQL Task to Generate Result Sets
If you are looking to add a computed column while importing data, you must use a Derived Column Transformation within the data flow task to add a column based on another one; you can refer to the following article for more details about this component:
SSIS Derived Columns with Multiple Expressions vs Multiple Transformations
What are you trying to accomplish by concatenating the IDs into an IN statement? If the idea is to use the values of the IDs to limit the results, as a dynamic WHERE clause, you may have better luck just using a lookup against either a table you maintain with the desired IDs or a static list generated in the package with a script task. (If you can use the lookup-table method, it will be much easier to maintain, as you only have to update a table, not your source code.)
Alternatively, you may be able to accomplish the goal with a join: create a temp table from the profile IDs you want to keep and join to it, or, again, use it as a lookup component. Dynamically creating a WHERE clause using IN will be a lot slower and will be cumbersome to maintain.
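To illustrate the join approach, the source query could look something like this (dbo.PhysicianData and dbo.ProfileIDsToKeep are hypothetical names; only Physician_Profile_ID comes from the question):
-- dbo.ProfileIDsToKeep is the maintained lookup table of IDs to keep.
SELECT d.*
FROM dbo.PhysicianData AS d
INNER JOIN dbo.ProfileIDsToKeep AS k
    ON d.Physician_Profile_ID = k.Physician_Profile_ID;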

How to run a select SQL statement within a field in Pentaho?

I have a table with a 'query' field containing a select SQL statement and another 'parameters' field containing the SQL parameters. I have merged these two fields into a new field containing a complete select statement. Now I need to execute the statement in this new field, get the results (the output fields) and generate an Excel file.
Use Table-Input if you are interested in a query result set. Table-Input supports SQL parameters, so there is no need to build the statement yourself using e.g. Replace-In-String and trip over escapes on the way. Also, there's variable substitution, just in case you can't live with a single template.
Update 21:14 GMT
I'm not very fond of the way you prepare the SELECT statement, but here we go, assuming we have a single statement:
Create a job with a Start entry and 2 Transformation entries (T1, T2). Let T1 produce the field containing your SELECT statement and use a Set-Variables step to make the statement available to T2 as variable SELECT. In T2 use a Table-Input step referencing ${SELECT} in the SQL statement text area. Don't forget to enable option "Replace variables in script".
From now on it's a matter of taste. I would prefer to create a CSV file using Text-File-Output. Using the right field separator Excel will open the file after double-clicking it. The advantage of Text-File-Output is that you don't have to specify the fields you don't know at design-time anyway. An empty field list will just handle all fields coming in. Comparable to the total projection in a Table-Input which will create the necessary fields from the retrieved columns downstream.
If you must produce an Excel workbook, you'll have to learn about metadata injection. That would be a separate project for a beginner, though. There are samples in your Kettle installation folder. And there is a very active community if you find yourself in trouble.

Using dynamic SQL in an OLE DB source in SSIS 2012

I have a stored proc as the SQL command text, which is getting passed a parameter that contains a table name. The proc then returns data from that table. I cannot call the table directly as the OLE DB source because some business logic needs to happen to the result set in the proc. In SQL 2008 this worked fine. In an upgraded 2012 package I get "The metadata could not be determined because ... contains dynamic SQL. Consider using the WITH RESULT SETS clause to explicitly describe the result set."
The problem is I cannot define the field names in the proc because the table name that gets passed as a parameter can be a different value, and the resulting fields can be different every time. Has anybody encountered this problem or have any ideas? I've tried all sorts of things with dynamic SQL using "dm_exec_describe_first_result_set", temp tables and CTEs that contain WITH RESULT SETS, but it doesn't work in SSIS 2012; same error. Context is a problem with a lot of the dynamic SQL approaches.
This is the latest thing I tried, with no luck:
DECLARE @sql VARCHAR(MAX)
SET @sql = 'SELECT * FROM ' + @dataTableName

DECLARE @listStr VARCHAR(MAX)
SELECT @listStr = COALESCE(@listStr + ',', '') + [name] + ' ' + system_type_name
FROM sys.dm_exec_describe_first_result_set(@sql, NULL, 1)

EXEC('EXEC(''SELECT * FROM myDataTable'') WITH RESULT SETS ((' + @listStr + '))')
So I ask out of kindness: why on God's green earth are you using an SSIS Data Flow Task to handle dynamic source data like this?
The reason you're running into trouble is because you're perverting every purpose of an SSIS Data flow task:
to extract a known source with known metadata that can be statically typed and cached at design time
to run through a known process with straightforward (and ideally asynchronous) transformations
to take that transformed data and load it into a known destination also with known metadata
It's fine to have parameterized data sources that bring back different data. But to have them bring back entirely different metadata each time with no congruity between the different sets is, frankly, ridiculous, and I'm not entirely sure I want to know how you handled all your column metadata in the working 2008 package.
This is why it wants you to add a WITH RESULT SETS clause to the SSIS query - so it can generate some metadata. It doesn't do this at runtime - it can't! It has to have a known set of columns (because it aliases them all into compiled variables anyway) to work with. It expects the same columns every time it runs that Data Flow Task - the exact same columns, down to the names, the types, and the constraints.
Which leads to one (terrible, terrible) solution - just stick all the data into a temporary table with Column1, Column2 ... ColumnN and then use the same variable you're using as the table name parameter to conditionally branch your code and do whatever you want with the columns.
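A minimal sketch of that workaround, assuming a fixed maximum of three generic columns and a hypothetical proc name (INSERT ... EXEC requires the proc to return exactly that many columns, so narrower tables would have to be padded with NULLs):
IF OBJECT_ID('tempdb..#Staging') IS NOT NULL DROP TABLE #Staging;
-- Generic, type-agnostic columns; every source table gets shoehorned into these.
CREATE TABLE #Staging (Column1 NVARCHAR(MAX), Column2 NVARCHAR(MAX), Column3 NVARCHAR(MAX));
INSERT INTO #Staging (Column1, Column2, Column3)
EXEC dbo.MyDynamicProc @dataTableName = 'SomeTable';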
Another more sane solution would be to create a data flow task for each of your source tables, and use your parameter in a precedence constraint to just pick which data flow task should run.
For a solution this poorly tailored to an out-of-the-box ETL tool, you should also seriously consider just rolling your own in C# or a script task instead of the Data Flow Task provided by SSIS.
In short, please don't do this. Think of the children (packages)!
I've used CozyRoc Dynamic DataFlow Plus to achieve this.
Using configuration tables to build the SQL Select statements, I have a single SSIS package that loads data from Oracle and Sybase (or any OLEDB source) to MS SQL. Some of the result sets are in the millions of rows and performance is excellent.
Instead of writing a new package every time a new table is needed, this can be configured in minutes and run on the pre-tested and robust existing package.
Without it I would have been up for writing hundreds of packages.

SSIS Variable SQL command for ADO.net source

While this seems like a basic problem, I've been ripping my hair out FOR DAYS trying to get an efficient solution to this.
I have a lookup table of values on a server that I read from and assemble into a string using a C# Script Task. I write this string into a variable that I want to pass in as my WHERE parameters inside a large SQL query against an ADO.NET data source (on a different server, to which I only have read access) in my data flow. For example, this string would just be something like
('Frank', 'John', 'Markus', 'Tom')
and I would append that as my WHERE clause.
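For illustration, after the append the query sent to the source would end up looking something like this (dbo.SomeLargeSource and FirstName are hypothetical names):
SELECT *
FROM dbo.SomeLargeSource
WHERE FirstName IN ('Frank', 'John', 'Markus', 'Tom');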
I can't read from a variable directly for an ADO.NET data source, AND I can't use the 'Expression' property to set my SQL either, as my SQL query is over 4000 characters. I could use an Execute SQL Task to run my query, load the results into a recordset and, I assume, loop through the recordset, but that's extremely inefficient.
What would be the best way to do this? My end goal is to put these results inside a table on the first server.
You could try to set up a Script Component as a source - variables and strings inside scripts can be longer than 4000 characters, so you can fit your query inside.
Set up your component similar to this article: http://beyondrelational.com/modules/2/blogs/106/posts/11119/script-componentsource-part1.aspx
In this one you have an example of how to fetch data using ExecuteReader and put it into the output of a script component: http://beyondrelational.com/modules/2/blogs/106/posts/11124/ssis-script-component-as-source-adonet.aspx
And in this one you have instructions on how to acquire the connection properly: http://www.toadworld.com/platforms/sql-server/b/weblog/archive/2011/05/30/use-connections-properly-in-an-ssis-script-task
By joining these pieces of information you should be able to write your source Script Component, which can fetch data using a dynamically constructed query of any length.
Good luck :)
You can do a simple select statement to return a list of values that will include ('Frank', 'John', 'Markus', 'Tom'), so your select would return:
Name
----------
Frank
John
Markus
Tom
Then, in SSIS, use a Merge Join component (which will act as an INNER JOIN) instead of a WHERE clause in your main query.
This is the cleanest way to achieve what you want.