Need help regarding running multiple queries in BigQuery - sql

I have some queries that I want to run in a sequential manner. Is it possible to schedule multiple queries under one scheduled query in BigQuery? Thanks

If you don't need all of the intermediate tables and are just interested in the final output... consider using CTEs.
with first as (
  select *, current_date() as todays_date from <table1>
),
second as (
  select current_date(), concat(field1, field2) as new_field, count(*) as ct
  from first
  group by 1, 2
)
select * from second
You can chain together as many of these as needed.
If you do need all of these intermediate tables materialized, you are venturing into ETL and orchestration tools (dbt, Airflow, etc.), or you will need to write a custom script to execute several commands sequentially.

Not currently, but an alpha program for scripting support in BigQuery was announced at Google Cloud Next in April. You can follow the relevant feature request for updates. In the meantime, you could consider using Cloud Composer to execute multiple sequential queries or an App Engine cron with some code to achieve sequential execution on a regular basis.
Edit (October 2019): support for scripting and stored procedures is now in beta. You can submit multiple queries separated by semicolons, and BigQuery is now able to run them.
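For example, a minimal sketch of such a script (the dataset and table names here are made up for illustration); the semicolon-separated statements run sequentially as one job:
-- Each statement runs in order; later statements can read earlier results.
CREATE OR REPLACE TABLE mydataset.step1 AS
SELECT *, CURRENT_DATE() AS todays_date
FROM mydataset.source_table;

CREATE OR REPLACE TABLE mydataset.step2 AS
SELECT todays_date, COUNT(*) AS ct
FROM mydataset.step1
GROUP BY todays_date;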

I'm not 100% sure if this is what you're looking for, but I'm confident that you won't need to orchestrate many queries to do this. It may be as simple as using the INSERT...SELECT syntax, like this:
INSERT INTO
  YourDataset.AdServer_Refine
SELECT
  Placement_ExtID,
  COALESCE(m.New_Ids, a.Placement_ExtID) AS New_Ids,
  m.Labels,
  -- A SELECT-list alias can't be reused in the same list, so the COALESCE
  -- is repeated; the DATE is cast so CONCAT receives strings.
  CONCAT(CAST(a.Date AS STRING), " - ",
         COALESCE(m.New_Ids, a.Placement_ExtID)) AS Concatenated,
  a.Placement_strategy,
  a.Campaign_Id,
  a.Campaign,
  a.Cost,
  a.Impressions,
  a.Clicks,
  a.C_Date AS Current_date,
  a.Date
FROM
  YourDataset.AdServer AS a
LEFT JOIN
  YourDataset.Matching AS m
USING (Placement_ExtID)
WHERE
  a.Date = CURRENT_DATE()
This will insert all the rows output by the SELECT portion of the query (and you can easily test the output by just running the SELECT on its own).
Another option is to create a scheduled query that outputs to your desired table from the SELECT portion of the query above.
If that isn't doing what you're expecting, please clarify the question and leave a comment and I'm happy to try to refine the answer.

Related

Is there a name for different SQL statement methods?

I have taken over a position as a BI consultant, and almost ALL of the prior SQL is built in a very funny way, especially compared to what I have been taught. I wonder if the other "way" has a particular name, for when I try to explain how the old things are built and why they are so complicated to change, compared to the way I'm building things, where changes happen pretty quickly.
I have always "learned" to write one script, with CTEs, derived tables, subqueries, etc. But all of the previous code is built in many, many steps.
Even simple tasks are built up in MANY steps.
For example, if they wanted to find info from 3 different tables, it could look like this (simplified for example purposes; I have scripts I have rewritten from 56 steps down to 2 steps and gotten a 76% performance boost):
Create Table qtemp.cust as (
  Select
    CustNo,
    CustName
  From Customer
) With Data
;
Alter Table qtemp.cust
  Add Column Revenue Decimal(15, 2)  -- a column type is required
;
Update qtemp.cust A
Set Revenue = (Select Sum(Revenue) From Sales B Where A.CustNo = B.CustNo)
;
Insert into F_SALES
Select * from qtemp.cust
It's not that they were idiots; for the statement above, where all the info is in the Sales table, they would just group by and take the info from there. But in general all the code is built like the above, and qtemp tables are often used instead of subqueries or CTEs. For lack of a better word, I call the old work a "stepped process" and my own work a "one-step process".
So is there a terminology for it?
Also, it's on an old DB2 server, which now runs the newest DB2 version. Is it legacy from a time when DB2 didn't support derived tables, CTEs, subqueries, etc.? I'm very curious why anyone would build it like the above, and whether there is a name for it.
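For contrast, a minimal sketch of what the "one-step" version of the stepped example above could look like, using the same hypothetical Customer, Sales, and F_SALES tables:
-- One statement replaces the temp table, ALTER, UPDATE, and INSERT steps.
Insert into F_SALES
Select
  c.CustNo,
  c.CustName,
  Sum(s.Revenue) as Revenue
From Customer c
Left Join Sales s On s.CustNo = c.CustNo
Group By c.CustNo, c.CustName;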

Track data load history in Snowflake

Snowflake stores a few metadata sets in its INFORMATION_SCHEMA. I tried to investigate how a specific table got loaded by a procedure/query.
The History page allows high-level investigation, but I wanted custom SQL code to drill deeper.
After executing the code below, I got a "Statement not found" error even though the Query_ID is valid.
Is there any way to navigate the load history so I can track which procedure loaded data into which table?
SELECT * FROM table(RESULT_SCAN('xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'));
Details on using RESULT_SCAN() can be found at the link below. Please note that these two conditions might be affecting your ability to run the query:
the query cannot have been executed more than 24 hours prior to the use of RESULT_SCAN()
only the user who ran the original query can use RESULT_SCAN()
https://docs.snowflake.com/en/sql-reference/functions/result_scan.html#usage-notes
As for "navigate history load so I can track what procedure loaded data to which table?": I'd strongly recommend doing your analysis on the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view.
A good starting point might be something like this:
SELECT *
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('days', -30, CURRENT_TIMESTAMP())
AND start_time <= date_trunc(HOUR, CURRENT_TIMESTAMP())
AND query_text iLike '%TABLE_NAME_HERE%'
AND query_type <> 'SELECT';
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
If you suspect the table in question has been loaded by a COPY INTO <table> command,
it'd make sense to begin by looking at the results of those loads in one of the following two views:
SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html
SNOWFLAKE.ACCOUNT_USAGE.LOAD_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/load_history.html
While the views in the ACCOUNT_USAGE "share" may have some latency (typically 10-20 minutes, possibly as much as 90 minutes), I've found that using them for analysis like yours is easier than querying INFORMATION_SCHEMA objects (opinion).
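If you go the COPY_HISTORY route, a hedged sketch of a starting query (the table name and the 7-day window are placeholders):
SELECT file_name, table_name, last_load_time, row_count, status
FROM snowflake.account_usage.copy_history
WHERE table_name = 'TABLE_NAME_HERE'
  AND last_load_time >= DATEADD('days', -7, CURRENT_TIMESTAMP());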
I hope this helps...Rich
If you wish to view the most recent query history you can use the following syntax:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_ID = 'xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'
To filter for data load queries:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_TEXT LIKE '%COPY INTO%'
Tip: the above table functions return the last 7 days' worth of data. If you require more history, use the Account Usage views.
Tip: to use the Account Usage views, switch to the ACCOUNTADMIN role.
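For example (assuming your user has been granted that role):
USE ROLE ACCOUNTADMIN;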
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
Rgds,
Dan.

SQL query execution takes too much time

Here I combine two tables and get the result.
SELECT *
FROM dbo.LabSampleCollection
WHERE CONVERT(nvarchar(20), BillNo) + CONVERT(nvarchar(20), ServiceCode)
  NOT IN (SELECT CONVERT(nvarchar(20), BillNo) + CONVERT(nvarchar(20), ServiceCode)
          FROM dbo.LabResult)
The problem is that it takes too long to execute. Is there an alternative way to handle this?
SELECT *
FROM dbo.LabSampleCollection sc
WHERE NOT EXISTS ( SELECT BillNo
FROM dbo.LabResult r
WHERE r.BillNo = sc.BillNo
AND r.ServiceCode = sc.Servicecode)
No need to combine the two fields; just check whether both are available in the same record. It would also be better to replace the * with the actual columns that you wish to retrieve. (The selected column BillNo in the second SELECT statement is just there to limit the results of the subquery.)
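If it is still slow after rewriting, a supporting index on the probed table is worth considering. A hedged sketch, assuming the column names from the question (the index name is made up):
-- Lets the NOT EXISTS anti-join resolve with index seeks instead of scans.
CREATE INDEX IX_LabResult_BillNo_ServiceCode
ON dbo.LabResult (BillNo, ServiceCode);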
Are you familiar with query execution plans? If not, I strongly recommend you read up on them. If you are going to be writing queries and troubleshooting or improving performance, they are one of the most useful tools (along with some basic understanding of what they are and how the SQL Server optimization engine works).
You can access them from SSMS via the Activity Monitor or by running the query itself (use the Include Actual Execution Plan button or Ctrl+M), and they will tell you exactly which part of the query is the most inefficient and why. There are many very good articles on the web on how to improve performance using this valuable tool, e.g. https://www.simple-talk.com/sql/performance/execution-plan-basics/

Dynamically Querying Multiple Tables In BigQuery

I have a BigQuery database where daily data is uploaded into its own table. So I have tables named "20131201", "20131202", etc. I can write a fixed query to "merge" those tables by doing:
SELECT * FROM db.20131201, db.20131202, ...
I'd like to have a single query that does not require me to update the custom SQL every time a new table is added. Something like:
SELECT * FROM db.*
Which currently doesn't work. I would like to avoid making one giant table. Is there a work-around that I can do, or will this have to be a feature request?
End-goal is for a Tableau data connection to all the tables.
This isn't exactly what you've asked for, but I've managed to use the table wildcard functions (https://developers.google.com/bigquery/query-reference#tablewildcardfunctions), in particular
TABLE_DATE_RANGE(prefix, timestamp1, timestamp2)
to achieve a similar result for use in Tableau. You'll still need to provide 2 date parameters, but it's substantially better than dynamically generating the FROM clause.
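A hedged sketch in legacy BigQuery SQL, assuming the daily tables from the question live in dataset db with no table-name prefix:
SELECT *
FROM TABLE_DATE_RANGE([db.],
                      TIMESTAMP('2013-12-01'),
                      TIMESTAMP('2013-12-31'))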
Hope this helps.
As of now, this kind of dynamic SQL [like "EXECUTE SQL" in MS SQL Server] is not available in Google BigQuery... surely Google will look into this, I believe :)

Out of the two SQL queries below, suggest which one is better: a single query with a join, or two simple queries?

Assuming the result of the first query in A) is (envelopecontrolnumber, partnerid, docfileid) = ('000000400', 31, 35):
A)
select envelopecontrolnumber, partnerid, docfileid
from envelopeheader
where envelopeid ='LT01ENV1107010000050';
select count(*)
from envelopeheader
where envelopecontrolnumber = '000000400'
and partnerid= 31 and docfileid<>35 ;
or
B)
select count(*)
from envelopeheader a
join envelopeheader b on a.envelopecontrolnumber = b.envelopecontrolnumber
and a.partnerid= b.partnerid
and a.envelopeid = 'LT01ENV1107010000050'
and b.docfileid <> a.docfileid;
I am using the above query in a SQL function. I tried the queries in pgAdmin (Postgres), and it shows 16 ms for A) and for B). When I tried the two queries from A) separately in pgAdmin, each still shows 16 ms, making 32 ms for A) in total, which seems wrong because when you run both queries in one go, it shows 16 ms. Please suggest which one is better. I am using a Postgres database.
The time displayed includes the time to:
send query to server
parse query
plan query
execute query
send results back to client
process all results
Try a simple query like "SELECT 1". You'll probably get 16 ms too.
It's quite likely you are simply measuring the ping time to your server.
If you want to know how much time on the server a query uses, you need EXPLAIN ANALYZE.
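For example, prefixing query B with it (a sketch reusing the query from the question):
EXPLAIN ANALYZE
SELECT count(*)
FROM envelopeheader a
JOIN envelopeheader b ON a.envelopecontrolnumber = b.envelopecontrolnumber
  AND a.partnerid = b.partnerid
WHERE a.envelopeid = 'LT01ENV1107010000050'
  AND b.docfileid <> a.docfileid;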
Option 1:
Run query A.
Get results.
Use these results to create query B.
Send query B.
Get results.
Option 2:
Run combined query AB.
Get results.
So, if you are using this from a client, connecting to Postgres, use the second option. There is an overhead for sending a query to the db and getting results back.
If you are using it inside an SQL function or procedure, the difference is probably negligible. I would still use the second option, though. And in either case, I would check that queries B or AB are optimized (check the query plan, whether indexes are used, etc.).
Go with option 1: the two queries are unrelated, so it's more efficient to do them separately.
Option A will be faster since you are only interested in the count.
The join will create a temporary structure to join the data based on the conditions and then perform the counting operation.
Hence option A is better and faster.