How to iterate over projects and datasets in BigQuery using a SQL query

Assume I have a list of projects in BigQuery, and each project has several datasets. I'd like to extract data from all these tables into one table, using only SQL.
The query below works on one project (yay!), but how can I iterate it through several projects?
DECLARE schema_list ARRAY<STRING>;
DECLARE iter INT64 DEFAULT 0;

SET schema_list = (
  SELECT ARRAY_AGG(schema_name)
  FROM $project.INFORMATION_SCHEMA.SCHEMATA
);

WHILE iter < ARRAY_LENGTH(schema_list) DO
  EXECUTE IMMEDIATE FORMAT("""
    INSERT `$other_project.$data_set.$table` (col1, col2, something)
    SELECT
      col1,
      col2,
      (really clever calc) AS something
    FROM `$project.%s.198401*`
    GROUP BY
      col1,
      col2
  """, schema_list[OFFSET(iter)]);
  SET iter = iter + 1;
END WHILE;
I don't mind supplying the projects via an array, but if the query could get the list of projects itself it would be a blast!
Thanks a million!
Even just for trying :)

One approach I can think of requires you to write code (Python, Node.js, Java, etc.) that uses the BigQuery API. This approach loops through a list of projects and executes your query on each iteration.
Use the BQ endpoint projects.list to get the list of projects to which the user has been granted any project role, or use the Resource Manager API if necessary.
When you have the list of projects, loop through it and pass each project_id to your query (modify your query to accept query parameters).
Use query parameters to safely pass your project_id to your query to prevent SQL injection.
Execute the query you posted in your question using the BQ API; see the documentation on querying using a programming language.
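If you would rather stay entirely in SQL, below is a rough sketch using BigQuery scripting, assuming the project IDs are supplied as an array; the project, dataset, table and column names are placeholders, and COUNT(*) stands in for the "really clever calc" from the question:

DECLARE project_list ARRAY<STRING> DEFAULT ['project-a', 'project-b'];  -- hypothetical project IDs
DECLARE schema_list ARRAY<STRING>;
DECLARE p INT64 DEFAULT 0;
DECLARE s INT64 DEFAULT 0;

WHILE p < ARRAY_LENGTH(project_list) DO
  -- Fetch the dataset names of the current project.
  EXECUTE IMMEDIATE FORMAT("""
    SELECT ARRAY_AGG(schema_name)
    FROM `%s`.INFORMATION_SCHEMA.SCHEMATA
  """, project_list[OFFSET(p)]) INTO schema_list;

  SET s = 0;
  WHILE s < ARRAY_LENGTH(schema_list) DO
    -- Append the aggregated rows of one dataset to the target table.
    EXECUTE IMMEDIATE FORMAT("""
      INSERT `other_project.data_set.target_table` (col1, col2, something)
      SELECT col1, col2, COUNT(*) AS something
      FROM `%s.%s.198401*`
      GROUP BY col1, col2
    """, project_list[OFFSET(p)], schema_list[OFFSET(s)]);
    SET s = s + 1;
  END WHILE;

  SET p = p + 1;
END WHILE;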

Related

What is the equivalent of SELECT INTO in Google BigQuery

I am trying to write a SQL command in BigQuery that inserts data from one table into a new table without any INSERT statement (something similar to SELECT INTO), but I cannot find a way to do it.
Here is the table:
create table database1.table1
(
pdesc string,
num int64
);
And here is the INSERT statement. I also tried SELECT INTO, but it is not supported in BigQuery.
insert into database1.table1
select column1, count(column2) as num
from database1.table2
group by column1;
The above is a possible way to insert, but I am looking for a way where I do not need to use any INSERT statement; I am looking for something similar to a SELECT INTO statement.
I am thinking of declaring variables and then somehow feeding the data into the table, but I do not know how.
I am not a Google employee. However, I understand the reasoning for not supporting creating a copy of a table (or query) from the console.
The challenge is that each table that is created must have a number of properties defined, such as the associated project and expiry time.
Looking through the documentation (briefly), it is worth exploring the bq utility, specifically the cp command.
Explore the following operations:
cache the query results to a temporary table
get the name of said temporary table
pass to a copy table command perhaps?
Other methods are described in the Google Cloud documentation: https://cloud.google.com/bigquery/docs/managing-tables#copy-table
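For completeness, the closest SQL-level equivalent of SELECT INTO in BigQuery is arguably CREATE TABLE ... AS SELECT, which creates and populates the table in one statement without a separate INSERT. A minimal sketch reusing the table names from the question:

CREATE OR REPLACE TABLE database1.table1 AS
SELECT
  column1 AS pdesc,
  COUNT(column2) AS num
FROM database1.table2
GROUP BY column1;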

Documentum -- Custom queries

I am new to Documentum, and we came across this query being run by the system that we are looking to speed up if possible:
SELECT ALL dm_document.r_object_id
FROM dm_document_sp dm_document
WHERE (
dm_document.object_name = :"SYS_B_0"
AND dm_document.r_object_id IN (
SELECT r_object_id
FROM dm_sysobject_r
WHERE i_folder_id = :"SYS_B_1"
)
)
AND (
dm_document.i_has_folder = :"SYS_B_2"
AND dm_document.i_is_deleted = :"SYS_B_3"
)
We looked at adding an index or using a SQL profile. However, the index would be somewhat large and will continue to grow. The SQL profile also would need to be re-examined periodically.
We thought it would be better to look at re-writing the SQL itself. Is there a way to override the system to use custom SQL (i.e. SQL written by the developers) for specific queries that Documentum auto-generates?
Unfortunately, there is no way to alter the default Documentum behavior of translating DQL into the resulting SQL.
But you can execute SQL directly in your custom applications, jobs, BOFs, components, etc. using JDBC. For queries other than SELECT, you can also use the DQL EXECUTE statement like this:
EXECUTE exec_sql WITH query = 'sql_query'
Another option is to register specific *_s or *_r tables and access them directly in DQL. For example you can register dm_sysobject_s like this:
REGISTER TABLE dm_dbo.dm_sysobject_s ("r_object_id" CHAR(16))
And then you can use it in DQL:
SELECT object_name FROM dm_sysobject_s
And you can also normally join the registered table with Documentum types in DQL, for example:
SELECT object_name FROM dm_sysobject_s s, dmi_queue_item q WHERE s.r_object_id = q.item_id
But keep in mind that directly accessing the internal tables is not an approach recommended by Documentum; still, when you really need to speed up your application, you may have to use such alternative ways.
Anyway, I would recommend trying indexes first, and if that is not sufficient, you can continue with the steps described above.

How to submit multiple queries in Google BigQuery Composer and Cloud Shell

Just a simple question; please don't tell me that submitting multiple queries is not supported in the Query Composer and Google Cloud Shell.
When I submit two statements (for example, DROP TABLE statements delimited by ";"), it tells me that the DROP keyword on the next line is unexpected.
It turns out that there is no way to execute multiple queries in either the BigQuery Composer or the Google Cloud Shell. However, one workaround I have found is to create a local text file in Cloud Shell which stores the queries, delimited by ";", and then set the IFS (Internal Field Separator) to ";" so that a for loop can walk through the file and execute the queries one by one.
Example:
queries.txt
select 1+2;
select 2+3;
select 3+4;
Cloud Shell command
IFS=";"
alias bqn="bq query --nouse_legacy_sql"
for q in $(<"queries.txt"); do bqn $q; done;
BigQuery now has support for multi-statement execution. Check out the scripting documentation. Copying the example:
-- Declare a variable to hold names as an array.
DECLARE top_names ARRAY<STRING>;
-- Build an array of the top 100 names from the year 2017.
SET top_names = (
SELECT ARRAY_AGG(name ORDER BY number DESC LIMIT 100)
FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
);
-- Which names appear as words in Shakespeare's plays?
SELECT
name AS shakespeare_name
FROM UNNEST(top_names) AS name
WHERE name IN (
SELECT word
FROM `bigquery-public-data`.samples.shakespeare
);
Google BigQuery uses a SQL-like language, and not every feature of mainstream SQL dialects is directly compatible with BigQuery.
That being said, there are many ways to work around this. If you are creating a table to materialize data in order to improve query performance and limit the cost of storing data in BigQuery, you can set an expiration date on the temporary table.
This is the command with the expiration date flag:
bq --location=[LOCATION] mk --dataset --default_table_expiration [INTEGER] --description [DESCRIPTION] [PROJECT_ID]:[DATASET]
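As a sketch only, using hypothetical dataset and table names, the expiration can also be set per table directly in SQL DDL when materializing a query result:

CREATE TABLE my_dataset.temp_results
OPTIONS (
  -- the table is automatically deleted three days after creation
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
) AS
SELECT
  col1,
  COUNT(*) AS n
FROM my_dataset.source_table
GROUP BY col1;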

Encapsulating complex code in BigQuery

I recently had to generate a BQ table out of other BQ tables. The logic was rather involved and I ended up writing a complex SQL statement.
In Oracle SQL I would have written a PL/SQL procedure with the logic broken down into separate pieces (most often merge statements). In some cases I would encapsulate some code into functions. The resulting procedure would be a sequence of DML statements, easy to read and maintain.
However, nothing similar exists for BQ. UDFs are only temporary and cannot be stored within, say, a view.
Question: I am looking for ways to make my complex BQ SQL code more modular and readable. Is there any way I could accomplish this?
A currently available option is to use the WITH clause.
The WITH clause contains one or more named subqueries whose output acts as a temporary table which subsequent SELECT statements can reference in any clause or subquery.
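For illustration, a minimal sketch with hypothetical table and column names, where each named subquery holds one step of the logic:

WITH filtered AS (
  -- step 1: keep only the rows we care about
  SELECT col1, col2
  FROM `my_project.my_dataset.source_table`
  WHERE col2 IS NOT NULL
),
aggregated AS (
  -- step 2: aggregate the filtered rows
  SELECT col1, COUNT(*) AS n
  FROM filtered
  GROUP BY col1
)
SELECT * FROM aggregated;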
I would still consider user-defined functions a really good option.
JS and SQL UDFs are available in BigQuery, and from what is known, the BigQuery team is working on introducing permanent UDFs soon.
In the meantime, you can store the body of a JS UDF as a JS library and reference it in your UDF using the OPTIONS section; see "Including external libraries" in the reference above.
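As a sketch, assuming a hypothetical Cloud Storage path and library function name, a temporary JS UDF referencing an external library could look like this:

CREATE TEMP FUNCTION clever_calc(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js
-- hypothetical Cloud Storage path holding the shared JS code
OPTIONS (library = ["gs://my-bucket/lib/clever_calc.js"])
AS """
  // cleverCalc is assumed to be defined in the external library
  return cleverCalc(x);
""";

SELECT clever_calc(value) AS result
FROM UNNEST([1.0, 2.0, 3.0]) AS value;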
October 2019 Update
The ability to use scripting and stored procedures is now in Beta.
So you can send multiple statements to BigQuery in one request, use variables, and use control-flow statements such as IF and WHILE.
You can also use procedures, which are blocks of statements that can be called from other queries.
Note: it is still in Beta.
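For example, here is a minimal sketch of a stored procedure (dataset, table and column names are placeholders) that can be called as a single statement from other scripts:

CREATE OR REPLACE PROCEDURE my_dataset.refresh_summary()
BEGIN
  -- rebuild the summary table from scratch
  DELETE FROM my_dataset.summary WHERE TRUE;

  INSERT INTO my_dataset.summary (col1, n)
  SELECT col1, COUNT(*) AS n
  FROM my_dataset.source_table
  GROUP BY col1;
END;

CALL my_dataset.refresh_summary();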
BigQuery supports persistent user-defined functions. To get started, see the documentation.
For example, here's a CREATE FUNCTION statement that creates a function to compute the median of an array:
CREATE FUNCTION dataset.median(arr ANY TYPE) AS (
(
SELECT
IF(
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
)
);
After executing this statement, you can reference it in a follow-up query:
SELECT dataset.median([7, 1, 2, 10]) AS median;
You can also reference the function inside logical views. Note that you currently need to qualify the reference to the function inside the view using a project, however:
CREATE VIEW dataset.sampleview AS
SELECT x, `project-name`.dataset.median(array_column) AS median
FROM `project-name`.dataset.table

Dynamic FROM in U-SQL statement

I am trying to generate a dynamic FROM clause in U-SQL so that we can extract data from different files based on a previous query outcome. That's something like this:
@filesToExtract = SELECT whatevergeneratesthepaths FROM @foo; <-- this query generates a rowset with all the files we want to extract, like: [/path/file1.csv, /path/file2.csv]
SELECT * FROM @filesToExtract; <-- here we want to extract the data from file1 and file2
I'm afraid this kind of dynamic query is not supported yet, but can someone point me to a way to achieve this? It seems that the only feasible approach is to generate another U-SQL script and execute it afterwards.
Thanks in advance.
It is not fully clear from your question if you want the file names to be dynamically retrieved and passed to an EXTRACT statement, or the name of tables/rowsets and passed to a SELECT's FROM clause. Or both.
In general, you cannot dynamically generate source names from your U-SQL expression. You may want to file a feature request here http://aka.ms/adlfeedback for dynamically or statically parameterizable sources.
Having said that, depending on your exact requirements, there may be some ways to achieve your goals without the work-around you describe.
For example, you could write your code as a parameterized table-valued function and then pass the different rowsets with different scripts, or, if you can statically decide which rowset to choose, you can use the IF statement.
Here is a pseudo-code example:
DECLARE EXTERNAL @someconditionparameter bool = true;
IF (@someconditionparameter) THEN
    @data = EXTRACT a int, b string FROM @fileset1 USING Extractors.Csv();
ELSE
    @data = EXTRACT a int, b string FROM @file2 USING ...;
END;
@results = MyTableValuedFunction(@data);
...
If your files are schematized differently, you may be able to use flexible column sets (currently in preview, see release notes) in the TVF to handle the variability of the rowset schema.