allowLargeResults in Query job in BigQuery - google-bigquery

I'm trying to run a Query job in BigQuery and getting the following error:
Response too large to return. Consider setting allowLargeResults to
true in your job configuration
I understand that I need to set allowLargeResults to True in my job configuration, but then I also have to supply a destination table field.
I don't want to insert the results of the query to specific table, only to process it locally.
how can I manage this situation?

I don't want to insert the results of the query to specific table,
only to process it locally.
Wanted to clarify – so you hopefully feel better about using destination table:
In reality, any query result ends up in some table!
If result is smaller than 128MB - BigQuery creates temporary table on your behalf (in special dataset which name starts with underscore so it is not visible in Web UI dataset/table navigator).
This temporary table is available for 24 hours and is used if you use Query Cashing or you can even use it by yourself – you just need to find which table is created. You can find this in API – destination table – which as I said above exists even if you have not set specific table. Or you can find it in Web UI
When result is bigger than 128MB – you must set destination table. The only drawback in your case is that you need to make sure you delete this table after you don’t need it anymore otherwise you will be paying for storage
You can do this either by actually deleting table - manually (in UI) or programmatically (API). Or you can set expiration on the table (API)

First of all if it's means it's too large, then probably greater than 128MB. You need to make sure that you query is accurate and if indeed you want to return the large data. Usually people make mistakes in the queries, like join explosion, missing time filters to reduce data, or missing limits.
After you are convinced the data is too large, you need to write to a table, then export to GCS, then download, and then deal with it.
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple

Related

Bigquery caching when hitting table would provide a different result?

As part of our Bigquery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date.This check is done with the following query
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting Bigquery's internal cache even though running the query against the underlying table would provide a different result. This caching also occurs if I run this query in the web interface from Google cloud console.
If I specify for the query not to cache using the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that Bigquery automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption in which case when does it occur or is this a bug?
Well the answer for your question is: you are doing conceptually wrong. You always need to set the no cache param if you want no cache data. Even on the web UI there are options you need to use. The default is to use the cached version.
But, fundamentally you need to change the process and use the recent features:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables

Oracle - Failover table or query manipulation

In a DWH environment for performance reasons I need to materialize a view into a table with approx. 100 columns and 50.000.000 records. Daily ~ 60.000 new records are inserted and ~80.000 updates on existing records are performed. By decision I am not allowed to use materialized views because the architect claims this leads to performance issues. I can't argue the case anymore, it's an irrevocable decision and I have to accept.
So I would like to make a daily full load in the night e.g. truncate and insert. But if the job fails the table may not be empty but must contain the data from the last successful population.
Therefore I thought about something like a failover table, that will be used instead if anything wents wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for (brief) period of time of double storage, I'd recommend
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
This all assumes the performance of (1) detecting all new records and all updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into base table is "on par" with full load.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and odd-day partition. On odd-days, insert data (no commit), then truncate even-day (which simultaneously drops old day and exposes new). But is it worth the complexity? You need to add a column to partition by, and deal with complexity of reruns - if you're logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
It bears repeating: all of these methods are equivalent if resource usage to MV-refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily refresh tables SHOULD NOT have logging enabled. In recovery scenario, just make sure you have step to run refresh scripts once base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to base data that your script did not take into account; either yielding incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above&beyond what the system can tolerate, it's generally the simplest "set-it-and-forget-it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or rollback to leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.

What are the benefits of a Make Table vs a Select query in Access?

I know you can run SELECT queries on top of SELECT queries in Access, but the application also provides the Make Table query type.
I'm wondering what the benefits/reasons for using Make Table might be?
You would usually use Make Table for performance reasons. If you have a fairly complex query that returns a subset of your table's data, and that you may need to retrieve multiple times, it can be expensive to re-run the query multiple times.
Using Make Table allows you to incur the cost of running the expensive query once, and make a copy of the query results into a table. Querying this copy would then be a lot less expensive than running your original expensive query.
This is usually a good option when you don't expect your original data to change frequently, or if you don't care that you are working of a copy of the data that may not be 100% up-to-date with the original data.
Notice what the following article on Create a make table query has to say:
Typically, you create make table queries when you need to copy or archive data. For example, suppose you have a table (or tables) of past sales data, and you use that data in reports. The sales figures cannot change because the transactions are at least one day old, and constantly running a query to retrieve the data can take time — especially if you run a complex query against a large data store. Loading the data into a separate table and using that table as a data source can reduce workload and provide a convenient data archive. As you proceed, remember that the data in your new table is strictly a snapshot; it has no relationship or connection to its source table or tables.
The main defense here is that a make table query creates a table. And when you done with the table then effort and time to delete that table and recover the VERY LARGE increase in the database file will have to occur. For general reports and a query of data make much more send. A comparison would be to build a NEW garage every time you want to park your car.
The database engine and query system can fetch and pull rows at a very high rate and those results are then able to be rendered into a report or form, and this occurs without having to create a temp table. It makes little sense to go through all of the trouble of having the system create a WHOLE NEW table for such results of data when they can with ease be sent to a report.
In other words creating a whole table just to display or use some data that the database engine already fetched and returned makes little sense. A table is a set of rows that holds data that can be updated and the results are permanent. A query is a “on the fly” results or sub set of data that only exists in memory and is discarded after you use the results.
So for general reporting and display of data, it makes no sense to create a temp table. MUCH WORSE of an issue is that if you have two users wanting to run a report, if they both need different results and you send the results to the SAME temp table, then you have a big mess and collision between the two users. So use of a temp table in Access for the most part makes little sense, and this is EVEN MORE so when working in a multi-user environment. And as noted, once the table is created, then after you are done you need to delete and remove the table. And with many users in a multi-user database this becomes even more of a problem and issue.
However in a multi-user environment as pointed out that if the resulting data needs additional processing, then sending the results to a temp table can be of use. This approach however suggests that EACH USER has their own front end and own copy of the application side. And better is that the temp table is created outside of the front end application that resides on each computer. Since the application part (front end) is placed on each computer, then creating of a temp table does not occur in the production database (back end) and as a result you can have multiple users function correctly without each individual user creating a temp table in the production back end database. So if one is to adopt a make table query, it likely should occur on each local workstation and not in the back end database when you have a multiple user database application.
Thus for the most part a make table and that of reports and query of data are VERY different goals and tasks. You don't want nor as a general rule create a whole brand new table for a simple query. In a multi user database system the users might run 100's of reports in a given day and FEW if any systems will send such data to a temp table in place of sending the query results directly to the report.
It creates a table - which is useful if you have a need for that table which you may have for temporary use where you have to modify the data for calculations or further processing while not disturbing the original data.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).

The order of records in a regularly updated bigquery database

I am going to be maintaining a local copy of a database on bigquery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started in the order they are in in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether bigquery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote, however if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append to the end of the table data list, however, bigquery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you have updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.