how to stop running bigquery query - google-bigquery

Is there any way to cancel a running query?
I use the web interface. First I ran a series of tests on tables of 10k and then 20k rows, and the response came back in seconds. But then I ran the triple-join query on a table of 100k rows and it seems endless after thousands of seconds.
I just wanted to run some tests before moving all the work to BigQuery, but now I'm afraid it's going to burn through the whole monthly 100 GB free limit and more.
The table is simple key-value pairs of integer values.

The shell command bq cancel job_id will do this now. You can get the job_id from the Query History tab in the BigQuery console. If you started the query via the CLI, it will have logged the job_id to the standard output.

There isn't currently a way to stop a running query, either via the API or the UI. You may be able to close the query builder (via the 'x' in the top right of the UI) and re-open it to make the UI responsive again. We're currently working on this feature in the UI.
It is surprising that the query would take so long, even for a join, for tables of that size, unless your join was joining on non-unique keys so was taking time generating the cross-products of matching keys. For example:
SELECT t1.foo
FROM (SELECT 1 as one, foo FROM table1) t1
JOIN (SELECT 1 as one, bar FROM table2) t2
ON t1.one = t2.one
Would generate n x m rows, where n is the number of rows in table1 and m is the number of rows in table2. Is there any chance your query is doing something similar? If not, can you send the query? (Maybe in another SO question, related to slow join performance.)
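A quick way to check whether non-unique join keys are the cause is to count repetitions of the column you actually join on; a minimal sketch, where key_col is a hypothetical name for that column:
-- Hypothetical column name key_col; large counts on both sides multiply in the join.
SELECT key_col, COUNT(*) AS n
FROM table1
GROUP BY key_col
ORDER BY n DESC
LIMIT 10
Repeat for table2; if the top counts are large on both sides, the join output grows roughly as their product.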

We didn't find a way to stop jobs while using the Java API, and as far as I know you can't stop a job from the web interface.

Related

Google BigQuery Query exceeded resource limits

I'm setting up a crude data warehouse for my company and I've successfully pulled contact, company, deal and association data from our CRM into BigQuery, but when I join these together into a master table for analysis via our BI platform, I continually get the error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY clauses, and have tried using WHERE clauses to do additional filtering, but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
coy.company_id,
cont.contact_id,
deals.deal_id,
{another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
ON CAST(ac.from_id AS int64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
ON coy.company_id = CAST(ad.from_id AS int64)
LEFT JOIN `{deals}` AS deals
ON ad.to_id = deals.deal_id;
FYI, {assoc-contact} and {assoc-deal} are both separate views I created from the associations table to make it easier to relate those tables to the companies table.
It should also be noted that this query has occasionally run successfully, so I know it does work, it just fails about 90% of the time due to the query being so big.
TLDR;
Check your join keys. 99% of the time the cause of the problem is a combinatoric explosion.
I can't know for sure since I don't have access to the data in the underlying tables, but I will give a general resolution method which, in my experience, has found the root cause every time.
Long Answer
Investigation method
Say you are joining two tables
SELECT
cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables.
SELECT
c1, c2, COUNT(1) as nb
FROM L
GROUP BY c1, c2
ORDER BY nb DESC
And the same thing for each table involved in a join.
I bet you will find that your join keys are duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100,000 times on both tables. That means that after your join, you will have 100,000^2 = 10 billion rows!!!
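If duplicated keys turn out to be the problem, a common fix is to deduplicate the association side before joining. A minimal sketch against the views from the question, assuming that keeping a single row per (to_id, from_id) pair is acceptable for the analysis:
-- Collapse the association view to one row per key pair so the join cannot multiply rows.
SELECT
  cont.contact_id,
  coy.company_id
FROM `{contacts}` AS cont
LEFT JOIN (
  SELECT DISTINCT to_id, from_id
  FROM `{assoc-contact}`
) AS ac
  ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
  ON CAST(ac.from_id AS INT64) = coy.company_id
The same idea applies to `{assoc-deal}`; alternatively, pre-aggregate deals per company before the final join.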
Why BigQuery gives this error
In my experience, this error message means that your query does too much computation relative to the size of its inputs.
No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
BigQuery's on-demand pricing model is based on the amount of data read from your tables. This means people could try to abuse it by, say, running CPU-intensive computations while reading small datasets. To give an extreme example, imagine someone writes a JavaScript UDF to mine Bitcoin and runs it on BigQuery:
SELECT MINE_BITCOIN_UDF()
The query will be billed $0 because it doesn't read anything, but will consume hours of Google's CPU. Of course they had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by using hours of CPU while processing a few MB of input.
Other MPP platforms with a different pricing model (e.g. Azure Synapse, which charges based on the amount of data processed rather than read, as BQ does) might have run the query without complaining, and then billed you for 10 TB of processing on that 40 MB table.
P.S.: Sorry for the late and long answer, it's probably too late for the person who asked, but hopefully it will help whoever runs into that error.

My SQL table is too big: retrieving data via paging/segmenting the result?

This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot, what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity of my machine (let's say, 2 GB). Also, simply performing chained SELECTs with LIMIT or OFFSET breaks the synchronization requirement: the table is frequently updated and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is a design-pattern question: what is the strategy for performing paginated SELECTs against a large table? (Especially when said table receives constant updates.)
What database are you using? In MySQL, for example, the following would select 20 rows starting after the first 40, but this is a MySQL-only clause (edit: it seems Postgres also allows it):
select * from cars limit 20 offset 40
If you want a "snapshot" effect you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change tracking, but that's not what you said you wanted. If you need a snapshot of the exact table state, take the snapshot, write it to a separate table, and use the limit and offset (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified... if you have 2000 clients all taking 5-million-row snapshots, you are going to start having some size issues if you don't watch out.
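A minimal sketch of that approach in T-SQL (the question mentions MSSQL 2008, which lacks LIMIT/OFFSET); the CarsSnapshot name and the Id key column are assumptions for illustration:
-- Freeze a point-in-time copy once, then page the copy instead of the live table.
-- Drop CarsSnapshot once the data file has been written.
SELECT * INTO CarsSnapshot FROM Cars;

DECLARE @page INT = 1, @pageSize INT = 1000;

SELECT *
FROM (
    SELECT c.*, ROW_NUMBER() OVER (ORDER BY c.Id) AS rn
    FROM CarsSnapshot AS c
) AS paged
WHERE paged.rn BETWEEN (@page - 1) * @pageSize + 1 AND @page * @pageSize
ORDER BY paged.rn;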
You should provide details of the format of the resultant data file. Depending on the format this could be possible directly in your database, with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there is the export utility, for SQL Server/Sybase it would be BCP, etc.
Alternatively, this is achievable by streaming the data without holding it all in memory; the details would vary depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first run:
SELECT * FROM Cars LIMIT 10
and then
SELECT * FROM Cars LIMIT 10 OFFSET 10
and so on. You will have to figure out the best pagination for this.
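Because the underlying table keeps changing and large offsets force the database to re-scan all the skipped rows on every call, a keyset-style variant is often more robust: remember the last key returned instead of an offset. A minimal MySQL-flavoured sketch, assuming an indexed ID column; it keeps each call cheap, though it still isn't a point-in-time snapshot on its own:
SELECT *
FROM Cars
WHERE ID > 0        -- replace 0 with the last ID returned by the previous page
ORDER BY ID
LIMIT 1000;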

Increasing rows returned by "select top" suddenly makes the query incredibly slower

A load that has been running in about 2 minutes suddenly turned into a 90 minute run before being manually cancelled.
It's a simple shadow query:
select fields
into shadow_table
from table
where date = '8/23/2011'
date has a non-clustered index on it.
If I change the query to a SELECT TOP:
TOP 300000 completes in 2 seconds
TOP 400000 runs in 3 minutes
TOP 500000 I got bored waiting and cancelled it
Our server team shows a lot of self blocking while it runs.
Can anyone suggest possible bottlenecks to look at?
Out of date stats.
Self-blocking only occurs with parallelism, and super-long parallel runs (compared to the norm) ordinarily mean out-of-date stats. It could also be a change in the cardinality of the data.
Step 1 should be running an UPDATE STATISTICS WITH FULLSCAN on your source table.
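For reference, a minimal sketch of that first step, with dbo.[table] standing in for the source table from the question:
-- Rebuild statistics from a full scan of the source table (name is a placeholder).
UPDATE STATISTICS dbo.[table] WITH FULLSCAN;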
Follow the proven Waits and Queues methodology to identify the bottleneck.
When a request is running parallel query the proper way to analyze blockage is to dive at the subtask level and see what is blocking each of the sub tasks. One should never stop at CXPACKET as wait type, or 'self block' as an explanation.
select w.last_wait_type,
wt.wait_type,
wt.resource_description,
wt.blocking_session_id,
t.pending_io_count,
r.*
from sys.dm_os_tasks t
left join sys.dm_os_waiting_tasks wt on wt.waiting_task_address = t.task_address
join sys.dm_os_workers w on t.worker_address = w.worker_address
join sys.dm_exec_requests r on t.session_id = r.session_id
where r.session_id = <queryspid>;
If it's what it seems - an archival query on records that won't be updated while it's running - you can turn off blocking entirely. Other queries that need integrity but use your records won't be affected; they manage their own locking.
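One way to do that in T-SQL is a table hint on the source; a sketch of the query from the question, appropriate only under the assumption above since NOLOCK permits dirty reads:
select fields
into shadow_table
from table with (nolock)
where date = '8/23/2011'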
Also make sure you have your fields as part of the INCLUDE of your nonclustered index. If you don't, you're going to have to go back to the table using an RID lookup to get that data.
create nonclustered index ix_whatever on YourTable (date)
include (field1, field2, ...)

TOP 100 causing SQL Server 2008 hang?

I have inherited a VERY poorly designed and maintained database and have been using my knowledge of SQL Server and a little luck to keep this HIGH-availability server up and not completely coming down in flames (the previous developer, who quit, basically just kept the system up for 4 years).
I have come across a very strange problem today. I hope someone can explain this to me so if this happens again there is a way to fix it.
Anyway, there is a stored proc that is pretty simple. It joins two tables together over a SHORT date/time range (a 5-minute range) and passes back the results (this query runs every 5 mins via a Windows service). The largest table has 100k rows, the smallest table has 10k rows. The stored proc is very simple and does:
NOTE: The table and column names have been changed to protect the innocent.
SELECT TOP 100 m.*
FROM dbo.mytable1 m WITH (nolock)
INNER JOIN dbo.mytable2 s WITH (nolock) ON m.Table2ID = s.Table2ID
WHERE m.RowActive = 1
AND s.DateStarted <= DATEADD(minute, -5, getdate())
ORDER BY m.DateStarted
Now, if I keep "TOP 100" in the query, the query hangs until I stop it (whether run in SSMS or via the stored proc). If I remove the TOP 100, the query works as planned and returns 50-ish rows, like it should (we don't want it to return more than 100 rows if we can help it).
So, I did some investigating, using sp_who, sp_who2, and looked at the master..sysprocesses and used DBCC INPUTBUFFER to look for any SPIDs that might be locking or blocking. No blocks and no locking.
This JUST STARTED today, with no changes to these two tables' designs, and from what I gather the last time this query/these tables were touched was 3 years ago, and it has been running without error since.
Now, a side note, and I don't know if this has anything to do with it, but I reindexed both these tables about 24 hours before because they were 99% fragmented (remember, I said this was a poorly designed and poorly maintained server).
Can anyone explain why SQL Server 2008 would do this?
The ORDER BY is the killer. It has to read all rows, sort by that ORDER BY column, and then give you the first 100 rows.
The absolute first thing I would do is a side-by-side comparison of the query plans of the full and the TOP 100 queries to see whether the TOP 100 plan is not performant. You might need to update stats, or you might even have missing indexes.
I'd presume there's no index on mytable1.DateStarted. I think something might be deciding to perform the sort earlier in the query process when you use SELECT TOP 100.
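If that is the case, an index that supports the sort lets the engine scan in DateStarted order and stop once it has found 100 qualifying rows. A sketch using the column names from the (anonymized) query; the INCLUDE list is an assumption:
-- Hypothetical index to support ORDER BY m.DateStarted; RowActive and Table2ID are
-- INCLUDEd so the WHERE filter and join key can be checked without extra lookups.
CREATE NONCLUSTERED INDEX ix_mytable1_datestarted
    ON dbo.mytable1 (DateStarted)
    INCLUDE (RowActive, Table2ID);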

Postgres: How to fire multiple queries in same time?

I have a procedure which updates record values, and I want to run it against all records in a table (over 30k records). Procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now I'm doing UPDATE table SET field = procedure_name(params); but with that amount of records it takes up to 40 min to process the whole table.
Now I'm using 4 different connections which fork to the background and fire the query with a WHERE clause that iterates over the modulo of the row IDs (WHERE id_field % 4 = ) to speed this up, and this works well and cuts the table update down to ~10 mins.
But I want to avoid using cron, shell jobs and multiple connections for this. I know it can be done with libpq, but is there a way to fire a query (4 different non-blocking queries) and not wait until it finishes executing, within a single connection?
Or can anyone point me to some clues on how to write such a function, using Postgres internals, or simply in C, and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, and not that it's either an I/O bottleneck on the DB or the possibility that procedure_name(params) is taking a long time. (If the procedure itself were taking 2-10 seconds, 30k records at ~5 s each would be about 150,000 s, i.e. roughly 2,500 minutes.) The reason I am sure is that starting 4 concurrent processes cuts the time to 1/4, so in particular it is not an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules, and the consequence is difficult maintenance. But, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
Refactor the system so procedure_name(params) can get all the data it needs to process all records via a SELECT statement. You may need to use creative joins. If it's an SP, of course, you are now moving the logic to the client.
Have the program create an XML or other importable flat-file format with the PK of the record to update and the new field value or values. Write all the updates to this file instead of executing them on the DB.
Have a temp table on the database that matches the layout of this flat file.
Run an import on the database: clear the temp table and import the file.
Do an update that joins the temp table to the table to be updated, e.g., UPDATE mytbl SET myval = mytemp.newval FROM mytemp WHERE mytbl.myPK = mytemp.myPK (use the right join syntax for your database, of course); a fuller sketch of these last steps follows this list.
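A minimal Postgres-flavoured sketch of the temp-table, import and update steps, with cars, cars_staging, id, myval and the file path as hypothetical names:
-- Temp table matching the flat-file layout (names are placeholders).
CREATE TEMP TABLE cars_staging (id integer PRIMARY KEY, new_myval text);

-- Clear (matters on re-runs) and bulk-load the file; psql's \copy is the client-side variant.
TRUNCATE cars_staging;
COPY cars_staging FROM '/tmp/updates.csv' CSV;

-- One set-based update joining the staging table to the real table.
UPDATE cars
SET myval = s.new_myval
FROM cars_staging AS s
WHERE cars.id = s.id;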
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages; you'll have to dig into the manuals.
I think you can't. A single connection can handle a single query at a time. It's described in the libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."