Generic way to stay within Google BigQuery SQL Query quota

This is the SQL query I am running against a public dataset:
SELECT
  package,
  COUNT(*) count
FROM (
  SELECT
    REGEXP_EXTRACT(line, '(.*)') package,
    id
  FROM (
    SELECT
      SPLIT(content, '\n') line,
      id
    FROM
      [bigquery-public-data:github_repos.contents]
    WHERE
      sample_path LIKE '%.bashrc' OR sample_path LIKE '%.bash_profile')
  GROUP BY
    package,
    id )
GROUP BY
  1
ORDER BY
  count DESC
LIMIT
  400;
and this is the error message:
Error: Quota exceeded: Your project exceeded quota for free query
bytes scanned. For more information, see
https://cloud.google.com/bigquery/troubleshooting-errors
bigquery-public-data:github_repos.contents is too large for my quota.
bigquery-public-data:github_repos.sample_contents is too small for what I'm analyzing.
Is there any way to specify how much quota a query can utilize? For example, if I have a 1TB quota, is there a way to run this query against github_repos.contents (which would consume 2.15TB), but stop processing after consuming 1TB?

You can use Custom Cost Controls. These can be set at the project level or at the user level, and the user can be a service account. By having a different service account run each query, you can effectively "specify how much quota a query can utilize".

Related

CETAS times out for large tables in Synapse Serverless SQL

I'm trying to create a new external table using a CETAS (CREATE EXTERNAL TABLE AS SELECT * FROM <table>) statement from an already existing external table in Azure Synapse serverless SQL pool. The table I'm selecting from is a very large external table built on around 30 GB of data in Parquet format stored in ADLS Gen 2 storage, but the query always times out after about 30 minutes. I've tried using premium storage and also tried most if not all of the suggestions made here, but it didn't help and the query still times out.
The error I get in Synapse Studio is:
Statement ID: {550AF4B4-0F2F-474C-A502-6D29BAC1C558} | Query hash: 0x2FA8C2EFADC713D | Distributed request ID: {CC78C7FD-ED10-4CEF-ABB6-56A3D4212A5E}. Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. Query timeout expired.
The core use case is that assuming I only have the external table name, I want to create a copy of the data over which that external table is created in Azure storage itself.
Is there a way to resolve this timeout issue or a better way to solve the problem?
This is a limitation of Serverless.
Query timeout expired
The error Query timeout expired is returned if the query executed for more
than 30 minutes on serverless SQL pool. This is a limit of serverless
SQL pool that cannot be changed. Try to optimize your query by
applying best practices, or try to materialize parts of your queries
using CETAS. Check whether there is a concurrent workload running on the
serverless pool, because other queries might take the resources. In
that case you might split the workload across multiple workspaces.
Self-help for serverless SQL pool - Query Timeout Expired
The core use case is that assuming I only have the external table name, I want to create a copy of the data over which that external table is created in Azure storage itself.
That is simple to do with a Data Factory copy job, a Spark job, or AzCopy.
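If a single CETAS over the whole table keeps timing out, one workaround in line with the guidance above is to materialize the copy in smaller slices, each with its own CETAS statement. A minimal sketch, assuming a hypothetical external data source MyAdlsSource, file format ParquetFormat, and a filterable column part_date; all names, paths, and values are placeholders:
CREATE EXTERNAL TABLE dbo.large_table_copy_2020h1
WITH (
    LOCATION = 'copies/large_table/2020h1/',  -- destination folder under the data source
    DATA_SOURCE = MyAdlsSource,               -- hypothetical external data source
    FILE_FORMAT = ParquetFormat               -- hypothetical Parquet file format
)
AS
SELECT *
FROM dbo.large_external_table                 -- the existing external table
WHERE part_date >= '2020-01-01' AND part_date < '2020-07-01';  -- one slice per statement
Each slice scans less data, so each statement is more likely to finish inside the 30-minute limit, and the output folders together hold the full copy.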

Why does a Java OutOfMemoryError occur when selecting fewer columns in a Hive query?

I have two hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode'. All the columns are included in the result. However, the following query causes an error:
select content from ode limit 5;
Where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be a lot cheaper, so why does it cause a memory issue? How can I fix it?
When you select the whole table, Hive triggers a Fetch task instead of a MapReduce job, which involves no parsing (it is like calling hdfs dfs -cat ... | head -5).
As far as I can see, in your case the Hive client tries to run the map stage locally.
You can choose one of the two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.
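For the first option, a minimal sketch of forcing the query through MapReduce instead of the client-side fetch path, shown with the question's query (a session-level setting):
-- Disable fetch-task conversion for this session so the query runs as a map-only job
SET hive.fetch.task.conversion=none;
SELECT content FROM ode LIMIT 5;
For the second option, you would export HADOOP_CLIENT_OPTS with a larger -Xmx value in the shell before starting the Hive CLI.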

Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements

I am trying to insert records into Azure SQL Data Warehouse using Oracle ODI, but I am getting an error after some records have been inserted.
NOTE: I am trying to insert 1000 records, but the error comes after about 800.
Error Message: Caused By: java.sql.BatchUpdateException: 112007;Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements.
While Abhijith's answer is technically correct, I'd like to suggest an alternative that will give you far better performance.
The root of your problem is that you've chosen the worst possible way to load a large volume of data into Azure SQL Data Warehouse. A long list of INSERT statements is going to perform very badly, no matter how many DWUs you throw at it, because it is always going to be a single-node operation.
My recommendation is to adapt your ODI process in the following way, assuming that your Oracle database is on-premises.
Write your extract to a file
Invoke AZCOPY to move the file to Azure blob storage
CREATE EXTERNAL TABLE to map a view over the file in storage
CREATE TABLE AS or INSERT INTO to read from that view into your target table
This will be orders of magnitude faster than your current approach.
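A minimal sketch of steps 3 and 4, assuming the extract has already landed in blob storage; the external data source MyBlobStorage, the file format PipeDelimitedText, and all table and column names are placeholders you would replace:
CREATE EXTERNAL TABLE ext.stg_records (
    id INT,
    payload NVARCHAR(4000)
)
WITH (
    LOCATION = '/extracts/records/',  -- folder written by the AZCOPY step
    DATA_SOURCE = MyBlobStorage,      -- hypothetical external data source over the blob container
    FILE_FORMAT = PipeDelimitedText   -- hypothetical delimited-text file format
);

CREATE TABLE dbo.target_records
WITH (DISTRIBUTION = HASH(id))        -- CTAS loads in parallel across the distributions
AS
SELECT id, payload
FROM ext.stg_records;
Because the load goes through PolyBase it scales with the external files, instead of funnelling every row through a single session of prepared INSERT statements.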
20 MB is the defined limit, and it is a hard limit for now. Reducing the batch size will certainly help you work around this limit.
Link to capacity limits.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

Google BigQuery large table (105M records) with 'Order Each by' clause produces "Resources Exceeds Query Execution" error

I am running into a serious "Resources Exceeds Query Execution" issue when querying a Google BigQuery large table (105M records) with an 'Order Each by' clause.
Here is the sample query (which uses the public Wikipedia dataset):
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title Order by Id, Title Desc
How can I solve this without adding a LIMIT keyword?
Using ORDER BY on big data databases is not an ordinary operation, and at some point it exceeds the limits of the available resources. You should consider sharding your query or running the ORDER BY over your exported data.
As I explained to you today in your other question, adding allowLargeResults will allow you to return a large response, but you can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
One option here that you may try is sharding your query.
where ABS(HASH(Id) % 4) = 0
You can play with the above parameters a lot to achieve smaller result sets and then combine them.
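For example, plugged into the original query it would look roughly like this (a sketch; the shard count of 4 is illustrative, and you run the query once per shard value, 0 through 3, before combining the outputs):
SELECT Id, Title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE ABS(HASH(Id) % 4) = 0
GROUP EACH BY Id, Title
ORDER BY Id, Title DESC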
Also read Chapter 9 - Understanding Query Execution; it explains how sharding works internally.
You should also read Launch Checklist for BigQuery
I've run into the same problem and fixed it by following these steps:
Run the query without ORDER BY and save in a dataset table.
Export the content from that table to a bucket in GCS using wildcard (BUCKETNAME/FILENAME*.csv)
Download the files to a folder in your machine.
Install XAMPP (you may get a UAC warning) and change some settings afterwards.
Start Apache and MySQL in your XAMPP control panel.
Install HeidiSQL and establish the connection with your MySQL server (installed with XAMPP).
Create a database and a table with its fields.
Go to Tools > Import CSV file, configure accordingly and import.
Once all data is imported, do the ORDER BY and export the table.
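As a side note, on today's standard SQL the first two steps can be collapsed into a single EXPORT DATA statement; a sketch reusing the answer's bucket and file-name placeholders (the table reference is the standard-SQL name of the same Wikipedia sample):
EXPORT DATA OPTIONS(
  uri = 'gs://BUCKETNAME/FILENAME*.csv',  -- wildcard so large results split across files
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT Id, Title, COUNT(*) AS cnt
FROM `bigquery-public-data.samples.wikipedia`
GROUP BY Id, Title;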

Bigquery load job said successful but data did not get loaded into table

I submitted a BigQuery load job; it ran and returned with the status successful. But the data didn't make it into the destination table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the job status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and total size.
Does anyone know why the data didn't make it into the table even though the job was reported as successful?
Just in case this was not a mistake on our side, I ran the load job again but to a different destination table, and this time the data made it into the destination table fine.
Thank you.
I experienced this recently with BigQuery in sandbox mode without a billing account.
In this mode the partition expiration is automatically set to 60 days. If you load data into a table where the partitioning column (e.g. date) is older than 60 days, it won't show up in the table. The load job still succeeds with the correct number of output rows.
This is very surprising, but I've confirmed via the logs that this is indeed the case.
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.
We had this issue in our system, and the reason was that the table had a partition expiry of 30 days and was partitioned on a timestamp column. So when someone ingested data older than the partition expiry date, the BigQuery load jobs completed successfully in Spark, but we saw no data in the ingestion tables, since it was deleted moments after it was ingested because of the partition expiry setting.
Please check your BigQuery table's partition expiry parameters and look at the partition column values of the incoming data. If a value falls outside the partition expiry window, you won't see the data in the BigQuery table; it will get deleted just after ingestion.
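A minimal sketch of how to check and, if needed, clear that setting with standard SQL; mydataset and mytable are placeholders:
SELECT option_name, option_value
FROM mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE table_name = 'mytable'
  AND option_name = 'partition_expiration_days';  -- shows the current expiry, if any is set

ALTER TABLE mydataset.mytable
SET OPTIONS (partition_expiration_days = NULL);   -- NULL removes the expiration so older partitions are kept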