"Write append" options bigquery - google-bigquery

I am working on bigquery with standard sql and I have the following problem.
I am transforming a table with millions of data, but I will only work with the data of yesterday and today.
The result of that query (which is already listed) I have to store in another table.
The problem is that what must be executed every 1 hour and when creating the scheduled query and placing the option of "write append", the data that has been previously saved will be duplicated.
I need something like "write to table if it does not exist"

You should have your scheduled query written with replace in mind:
REPLACE TABLE `dataset.mytable`
AS
SELECT 1;
this way you fully replace on each run.
Update:
You may use MERGE statement to skip existing rows and add only new ones.

Materialized views are appending only new data.
They can query only single table, support only a limited set of aggregation functions (APPROX_COUNT_DISTINCT, ARRAY_AGG, AVG, COUNT, HLL_COUNT.INIT, MAX, MIN, SUM) and do not support a computation on top of an aggregation but maybe they will fit your use case.

Related

Creating a UDF in BigQuery

I would like to create a UDF named maxDate in BigQuery that does the following:
maxDate('table_name') returns the result from running the query below:
select max(table_id) from fact.___TABLES____ where table_id < 'table_name';
I'm quite new to JS and not too sure how to start. This looks like a simple thing to write. Could anyone point me in the right way? I've read the documentation, and unsure of how to write this.
Scalar UDF are not existent yet in BigQuery
See more about BigQuery User-Defined Functions to understand what are they today.
To simplify - think of today's UDF as virtual table that you can query and this table in turn powered by real table where each row is processed row-by-row and javascript code is applied for each row and generates (instead of this input row) zero, one or many (depends of inplemented in js logic) rows)

Dynamically Querying Multiple Tables In BigQuery

I have a BigQuery database where daily data is uploaded into it's own table. So I have tables named "20131201", "20131202", etc. I can write a fixed query to "merge" those tables by doing:
SELECT * FROM db.20131201, db.20131202, ...
I'd like to have a single query that does not require me to update the Custom SQL everytime a new table is added. Something like:
SELECT * FROM db.*
Which currently doesn't work. I would like to avoid making one giant table. Is there a work-around that I can do, or will this have to be a feature request?
End-goal is for a Tableau data connection to all the tables.
This isn't exactly what you've asked for, but I've managed to use https://developers.google.com/bigquery/query-reference#tablewildcardfunctions in particular
TABLE_DATE_RANGE(prefix, timestamp1, timestamp2)
to achieve a similar result for use in tableaux. You'll still need to provide 2 date parameters, but it's substantially better than dynamically generating the FROM clause.
Hope this helps.
As of now in google bigquery this dynamic Sql [like "EXECUTE SQL" in mssqlserver] is not avilable...sulry google will look inthis i belive :)

How could i write this code in a more performant way?

In our app people have 1 or multiple projects. These projects have a start and an end date. People have a limited amount of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date=Date.today)
days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
available = days_total - hours_engaged
end
This means that to display the page descibed above the app will fire 18(!) queries into the database. We have pages that lists the availability of multiple people in a table. For these pages the amount of queries is quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. Easy and fastest way is in SQL:
Join your events to a number generated date table (see generate days from date range) so that you have a row for each day a person or people are occupied. Once you have the data in this form it is simply a matter of grouping by the week date part of the date and counting the rows per grouping.
You can extend this to group by person for multiple person queries.
From a SQL point of view, I'd advise using a stored procedure and pass in your date/range requirement, you can then return a recordset for a user or possibly multiple users. This way your code just has to access db once.
You can then output recordset data in one go, by iterating through.
Hope this helps.
USE Stored procedure to fire your query to SQL to get data.
Pass paramerts in your case it is today's date to the SQl query.
Apply your conditions and Logic in the SQL Stored procedure , Using procedure is the goood and fastest way to retrieve data from the SQL , also it will prevent your code from the SQL injection too.
Call that SP from your Code as i dont know the Ruby on raisl I cant provide you steps about how to Call the Stored procedure from it.
After that the data fdetched as per you stored procedure will be available in Data table or something like that.
After getting the data you can perform all you need
Hope this helps
see what query is executed. further you may make comand explain to your query
explain select * from project where start_date < any_date and end_date> any_date2
you see the plan of query . Use this plan to optimized your query.
for example :
if you have index using field end_date replace a condition(end_date> any_date2 and start_date < any_date) . this step will using index if you have index on this field. But it step is db dependent . example is for nysql. if you want use index in mysql you must have using index condition on left part of where
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get stuff like - list of users for this projects for this week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well, but there's also a lot of terminology that can be confusing, but for this type of 'I need to slice and dice my data across several sets of things (people and time)', Data Warehousing techniques can be quite powerful.
As I dont understand ruby on rails,from sql point of view i suggest you to write a stored procedure and return a dataset.And do the necessary table operations on the dataset from front end.It will reduce the unnecessary calls to DB.

Adding update SQL queries

I have a script that updates itself every week. I've got a warning from my hosting that I've been overloading the server with the script. The problem, I've gathered is that I use too many UPDATE queries (one for each of my 8000+ users).
It's bad coding, I know. So now I need to lump all the data into one SQL query and update it all at once. I hope that is what will fix my problem.
A quick question. If I add purely add UPDATE queries separated by a semicolon like this:
UPDATE table SET something=3 WHERE id=8; UPDATE table SET something=6 WHERE id=9;
And then update the database with one large SQL code as opposed to querying the database for each update, it will be faster right?
Is this the best way to "bunch" together UPDATE statements? Would this significantly reduce server load?
Make a delimited file with your values and use your equivalent of MySQL's LOAD DATA INFILE. This will be significantly faster than an UPDATE.
LOAD DATA INFILE '/path/to/myfile'
REPLACE INTO TABLE thetable(field1,field2, field3)
//optional field and line delimiters
;
Your best bet is to batch these statements by your "something" field:
UPDATE table SET something=3 WHERE id IN (2,4,6,8)
UPDATE table SET something=4 WHERE id IN (1,3,5,7)
Of course, knowing nothing about your requirements, there is likely a better solution out there...
It will improve IO since there is only one round trip, but the database "effort" will be the same.
A curiosity of SQL is that the following integer expression
(1 -abs(sign(A - B))) = 1 if A == B and 0 otherwise. For convenience lets call this expression _eq(A,B).
So
update table set something = 3*_eq(id,8) + 6* _eq(id,9)
where id in (8,9);
will do what you want with a single update statement.

Is there a way to parser a SQL query to pull out the column names and table names?

I have 150+ SQL queries in separate text files that I need to analyze (just the actual SQL code, not the data results) in order to identify all column names and table names used. Preferably with the number of times each column and table makes an appearance. Writing a brand new SQL parsing program is trickier than is seems, with nested SELECT statements and the like.
There has to be a program, or code out there that does this (or something close to this), but I have not found it.
I actually ended up using a tool called
SQL Pretty Printer. You can purchase a desktop version, but I just used the free online application. Just copy the query into the text box, set the Output to "List DB Object" and click the Format SQL button.
It work great using around 150 different (and complex) SQL queries.
How about using the Execution Plan report in MS SQLServer? You can save this to an xml file which can then be parsed.
You may want to looking to something like this:
JSqlParser
which uses JavaCC to parse and return the query string as an object graph. I've never used it, so I can't vouch for its quality.
If you're application needs to do it, and has access to a database that has the tables etc, you could run something like:
SELECT TOP 0 * FROM MY_TABLE
Using ADO.NET. This would give you a DataTable instance for which you could query the columns and their attributes.
Please go with antlr... Write a grammar n follow the steps..which is given in antlr site..eventually you will get AST(abstract syntax tree). For the given query... we can traverse through this and bring all table ,column which is present in the query..
In DB2 you can append your query with something such as the following, but 1 is the minimum you can specify; it will throw an error if you try to specify 0:
FETCH FIRST 1 ROW ONLY