Track data load history in Snowflake

Snowflake stores a few metadata sets in its INFORMATION_SCHEMA object. I tried to investigate how a specific table got loaded by a procedure/query.
The History page allows high-level investigation, but I wanted custom SQL code to drill down deeper.
After executing the code below I got a "Statement not found" error even though the Query_ID is valid.
Is there any way to navigate the load history so I can track which procedure loaded data into which table?
SELECT * FROM table(RESULT_SCAN('xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'));

Details on using RESULT_SCAN() can be found at the link below; please note these two conditions might be affecting your ability to run the query:
the query cannot have been executed more than 24 hours prior to the use of RESULT_SCAN()
only the user who ran the original query can use RESULT_SCAN()
https://docs.snowflake.com/en/sql-reference/functions/result_scan.html#usage-notes
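A quick way to verify that RESULT_SCAN() itself works in your session is to pair it with LAST_QUERY_ID(); this is a minimal sketch under those same two conditions, not tied to any particular query:
SELECT COUNT(*) FROM information_schema.tables;  -- any query run in this session
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));  -- re-reads the result set of that query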
As for "navigate history load so I can track what procedure loaded data to which table?": I'd strongly recommend you doing your analysis on the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view.
A good starting point might be something like this:
SELECT *
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('days', -30, CURRENT_TIMESTAMP())
AND start_time <= date_trunc(HOUR, CURRENT_TIMESTAMP())
AND query_text iLike '%TABLE_NAME_HERE%'
AND query_type <> 'SELECT';
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
If you suspect the table in question has been loaded by a COPY INTO table command,
it'd make sense to begin by looking at the results of those loads in one of the following two views:
SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html
SNOWFLAKE.ACCOUNT_USAGE.LOAD_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/load_history.html
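A hedged starting point against the first of those views might look like this (the table name is a placeholder; see the docs above for the full column list):
SELECT file_name, stage_location, table_name, last_load_time, row_count, status
FROM snowflake.account_usage.copy_history
WHERE table_name = 'TABLE_NAME_HERE'
ORDER BY last_load_time DESC;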
While the views in the account_usage "share" may have some latency (typically 10-20 minutes, possibly as much as 90 minutes), I've found that using them for analysis like yours is easier than querying INFORMATION_SCHEMA objects (opinion).
I hope this helps...Rich

If you wish to view the most recent query history you can use the following syntax:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_ID = 'xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'
To filter for data load queries:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_TEXT LIKE '%COPY INTO%'
Tip: the above table functions return the last 7 days' worth of data. If you require more history, use the Account Usage views.
Tip: to use the Account Usage views, switch to the ACCOUNTADMIN role.
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
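One more option that stays within INFORMATION_SCHEMA: there is also a COPY_HISTORY table function (it covers roughly the last 14 days); a minimal sketch, with the table name and time window as placeholders:
SELECT *
FROM TABLE(information_schema.copy_history(
    TABLE_NAME => 'TABLE_NAME_HERE',
    START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
));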
Rgds,
Dan.

Related

Need help regarding running multiple queries in Big Query

I have some queries that I want to run in a sequential manner. Is it possible to schedule multiple queries under one scheduled query in BigQuery? Thanks
If you don't need all of the intermediate tables and are just interested in the final output... consider using CTEs.
with first as (
  select *, current_date() as todays_date from <table1>
),
second as (
  select current_date(), concat(field1, field2) as new_field, count(*) as ct
  from first
  group by 1, 2
)
select * from second
You can chain together as many of these as needed.
If you do need all of these intermediate tables materialized, you are venturing into ETL and orchestration tools (dbt, airflow, etc) or will need to write a custom script to execute several commands sequentially.
Not currently, but an alpha program for scripting support in BigQuery was announced at Google Cloud Next in April. You can follow the relevant feature request for updates. In the meantime, you could consider using Cloud Composer to execute multiple sequential queries or an App Engine cron with some code to achieve sequential execution on a regular basis.
Edit (October 2019): support for scripting and stored procedures is now in beta. You can submit multiple queries separated with semi-colons and BigQuery is able to run them now.
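As a sketch of what that looks like, the statements below could now run back to back as a single scripted job (the project, dataset, and table names are placeholders):
CREATE TEMP TABLE staging AS
SELECT *, CURRENT_DATE() AS todays_date
FROM `yourproject.yourdataset.table1`;

SELECT todays_date, COUNT(*) AS ct
FROM staging
GROUP BY todays_date;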
I'm not 100% sure if this is what you're looking for, but I'm confident that you won't need to orchestrate many queries to do this. It may be as simple as using the INSERT...SELECT syntax, like this:
INSERT INTO
  YourDataset.AdServer_Refine
SELECT
  Placement_ExtID,
  COALESCE(m.New_Ids, a.Placement_ExtID) AS New_Ids,
  m.Labels,
  -- a SELECT-list alias can't be referenced in the same SELECT, so the COALESCE is repeated
  CONCAT(a.Date, " - ", COALESCE(m.New_Ids, a.Placement_ExtID)) AS Concatenated,
  a.Placement_strategy,
  a.Campaign_Id,
  a.Campaign,
  a.Cost,
  a.Impressions,
  a.Clicks,
  a.C_Date AS Current_date,
  a.Date
FROM
  YourDataset.AdServer AS a
LEFT JOIN
  YourDataset.Matching AS m
USING (Placement_ExtID)
WHERE
  a.Date = CURRENT_DATE()
This will insert all the rows that are output from the SELECT portion of the query (and you can easily test the output by just running the SELECT).
Another option is to create a scheduled query that outputs to your desired table from the SELECT portion of the query above.
If that isn't doing what you're expecting, please clarify the question and leave a comment and I'm happy to try to refine the answer.

How to make Hive Terminal show rows (not just headers) after code is run?

As of now, the Hive terminal shows only column headers after a create table statement is run. What settings should I change to make the Hive terminal also show a few rows, say the first 100?
Code I am using to create table t2 from table t1 which resides in the database (I don't know how t1 is created):
create table t2 as
select *
from t1
limit 100;
For now, during development, I am writing select * from t2 limit 100; after each create table section to get the rows with headers.
You cannot
The Hive Create Table documentation does not mention anything about showing records. This, combined with my experience in Hive, makes me quite confident that you cannot achieve this through mere config changes.
Of course you could tap into the code of Hive itself, but that is not something to be attempted lightly.
And you should not want to
Changing the create command could lead to all kinds of problems, especially because, unlike the select command, it is in fact an operation on metadata followed by an insert. Neither of those would normally show you anything.
If you were to create a huge table, it would be problematic to show everything. And if you chose to always show just the first 100 rows, that would be inconsistent.
There are ways
Now, there are some things you could do:
Change hive itself (not easy, probably not desirable)
Do it in 2 steps (what you currently do)
Write a wrapper:
If you want to automate things and don't like code duplication, you can look into writing a small wrapper function to call the create and select based on just the input of source (and limit) and destination.
This kind of wrapper could be written in bash, python, or whatever you choose.
However, note that if you like executing the commands ad-hoc/manually this may not be suitable, as you will need to start a hive JVM each time you run such a program and thus response time is expected to be slow.
All in all you are probably best off just doing the create first and select second.
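If you do settle on the two-step approach, you can at least keep both statements in one script file and run them in a single Hive invocation; a minimal sketch (the file name is made up, run it via hive -f create_and_preview.hql):
-- create_and_preview.hql
create table t2 as select * from t1 limit 100;
select * from t2 limit 100;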
The command mentioned below seems correct for showing the first 100 rows:
select * from <created_table> limit 100;
Pasting the code you have written to create the table would help diagnose the issue at hand!
Nevertheless, check whether you have correctly specified the delimiters for the fields, key-value pairs, collection items, etc. while creating the table.
If you have not defined them correctly, you might end up with only the first row (header) being shown.
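For reference, a hedged sketch of declaring those delimiters explicitly; the column names and separator characters here are made up:
create table t2 (
  id int,
  tags array<string>,
  props map<string, string>
)
row format delimited
  fields terminated by ','
  collection items terminated by '|'
  map keys terminated by ':'
  lines terminated by '\n';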

View with parameters in BigQuery

We have a set of events (a kind of log) that we want to connect to get the current state. To improve performance/cost further, we would like to create snapshots (in order not to check all the events in history, but only those since the last snapshot). Logs and snapshots are tables with a date suffix.
This approach works OK in BQ, but we need to manually define the query every time. Is there any way to define a 'view' with parameters (e.g. dates for the table range query)? Or any plans to do something like that?
I know that there are some topics connected with TABLE_RANGE / QUERY in views (e.g. Use of TABLE_DATE_RANGE function in Views). Is there any new information on this subject?
That's a great feature request - but currently not supported. Please leave more details at https://code.google.com/p/google-bigquery/issues/list, the BigQuery team takes these requests very seriously!
As a workaround I wrote a small framework to generate complex queries with the help of Velocity templates. Just published it at https://github.com/softkot/gbq
Now you can use Table Functions (aka table-valued functions, TVFs) to achieve this. They are very similar to a view, but they accept a parameter. I've tested them and they really help save a lot while keeping future queries simple, since the complexity lives inside the table function definition. It receives a parameter that you can then use inside the query for filtering.
This example is from the documentation:
CREATE OR REPLACE TABLE FUNCTION mydataset.names_by_year(y INT64)
AS
SELECT year, name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = y
GROUP BY year, name
Then you just query it like this:
SELECT * FROM mydataset.names_by_year(1950)
More details can be found in the official documentation.
You can have a look at BigQuery scripting, which has been released in beta: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
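A scripting-based sketch of the same parameterized idea, using a DECLARE variable in place of a view parameter (it reuses the public dataset from the table function example above):
DECLARE y INT64 DEFAULT 1950;
SELECT year, name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = y
GROUP BY year, name;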

Dynamically Querying Multiple Tables In BigQuery

I have a BigQuery database where daily data is uploaded into its own table. So I have tables named "20131201", "20131202", etc. I can write a fixed query to "merge" those tables by doing:
SELECT * FROM db.20131201, db.20131202, ...
I'd like to have a single query that does not require me to update the Custom SQL every time a new table is added. Something like:
SELECT * FROM db.*
Which currently doesn't work. I would like to avoid making one giant table. Is there a work-around that I can do, or will this have to be a feature request?
End-goal is for a Tableau data connection to all the tables.
This isn't exactly what you've asked for, but I've managed to use the table wildcard functions at https://developers.google.com/bigquery/query-reference#tablewildcardfunctions, in particular
TABLE_DATE_RANGE(prefix, timestamp1, timestamp2)
to achieve a similar result for use in Tableau. You'll still need to provide two date parameters, but it's substantially better than dynamically generating the FROM clause.
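A hedged sketch in legacy SQL, assuming the dataset is literally named db and the tables carry bare YYYYMMDD names as in the question (the empty table-name prefix before the closing bracket is intentional, but verify it against your setup):
SELECT *
FROM TABLE_DATE_RANGE([db.], TIMESTAMP('2013-12-01'), TIMESTAMP('2013-12-31'))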
Hope this helps.
As of now, this kind of dynamic SQL (like "EXECUTE SQL" in MS SQL Server) is not available in Google BigQuery... surely Google will look into this, I believe :)

How could i write this code in a more performant way?

In our app people have 1 or multiple projects. These projects have a start and an end date. People have a limited amount of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date = Date.today)
  days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
  available = days_total - hours_engaged
end
This means that to display the page described above, the app will fire 18(!) queries at the database. We have pages that list the availability of multiple people in a table. For those pages the number of queries quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. The easiest and fastest way is in SQL:
Join your events to a date table generated from a numbers table (see "generate days from date range") so that you have a row for each day a person or people are occupied. Once you have the data in this form, it is simply a matter of grouping by the week part of the date and counting the rows per grouping.
You can extend this to group by person for multiple person queries.
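A minimal sketch of that shape, assuming a pre-generated calendar table with one row per day and Postgres-style date functions (all names here are made up; boundary handling is up to you):
SELECT p.person_id,
       date_trunc('week', c.day) AS week_start,
       COUNT(*) AS days_engaged
FROM projects p
JOIN calendar c
  ON c.day BETWEEN p.start_date AND p.finish_date
GROUP BY p.person_id, date_trunc('week', c.day);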
From a SQL point of view, I'd advise using a stored procedure and passing in your date/range requirement; you can then return a recordset for a user or possibly multiple users. This way your code only has to access the DB once.
You can then output the recordset data in one go by iterating through it.
Hope this helps.
Use a stored procedure to fire your query at SQL to get the data.
Pass parameters to the SQL query; in your case it is today's date.
Apply your conditions and logic in the SQL stored procedure. Using a procedure is a good and fast way to retrieve data from SQL, and it will also protect your code from SQL injection.
Call that SP from your code; as I don't know Ruby on Rails, I can't provide steps on how to call the stored procedure from it.
After that, the data fetched per your stored procedure will be available in a data table or something like that.
After getting the data you can perform whatever you need.
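A hedged sketch of what such a procedure might look like in MySQL syntax; the projects table and its columns follow the question's snippet, while person_id and everything else is assumed:
DELIMITER //
CREATE PROCEDURE days_available(IN query_date DATE)
BEGIN
  -- one engaged-days total per person for the given date
  SELECT person_id, SUM(days_on_project) AS days_engaged
  FROM projects
  WHERE start_date < query_date
    AND finish_date > query_date
  GROUP BY person_id;
END //
DELIMITER ;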
Hope this helps
See what query is actually executed. Then you can run EXPLAIN against your query:
explain select * from project where start_date < any_date and end_date > any_date2
You will see the query plan. Use this plan to optimize your query.
For example:
If you have an index on the end_date field, reorder the condition as (end_date > any_date2 and start_date < any_date). This will use the index if you have one on that field, but the step is DB-dependent; this example is for MySQL. If you want MySQL to use an index, the indexed condition must be in the left part of the WHERE clause.
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get stuff like - list of users for this projects for this week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well. There's also a lot of terminology that can be confusing, but for this type of 'I need to slice and dice my data across several sets of things (people and time)' problem, data warehousing techniques can be quite powerful.
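A minimal sketch of that star-schema shape; every table and column name here is hypothetical:
CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, day DATE, week_start DATE);
CREATE TABLE dim_person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_availability (
  date_id        INTEGER REFERENCES dim_date(date_id),
  person_id      INTEGER REFERENCES dim_person(person_id),
  days_engaged   NUMERIC,
  days_available NUMERIC
);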
As I don't know Ruby on Rails, from a SQL point of view I suggest you write a stored procedure and return a dataset, then do the necessary table operations on the dataset from the front end. This will reduce unnecessary calls to the DB.