ST_INTERSECTS issue BigQuery Waze - google-bigquery

I'm using the following code to query a dataset based on a polygon:
SELECT *
FROM `waze-public-dataset.partner_name.view_jams_clustered`
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT("POLYGON((-99.54913355822276 27.60526592074579,-99.52673174853038 27.60526592074579,-99.52673174853038 27.590813604291416,-99.54913355822276 27.590813604291416,-99.54913355822276 27.60526592074579))")) IS TRUE
The validation message says that "This query will process 1 TB when run".
It seems like there's no problem. However, when I remove the "WHERE INTERSECTS" function, the validation message says exactly the same thing: "This query will process 1 TB when run", the same 1 TB, so I'm guessing that the ST_INTERSECTS function is not working.

When you actually run this query, the amount billed should usually be much less, as expected for a spatially clustered table. I ran a select count(*) ... query against one partner dataset: the editor UI announced 9 TB before the run, but the query reported around 150 MB processed afterwards.
The savings come from the clustered table. Which clusters intersect the polygon used in the filter depends on the actual data in the table and on how the clusters were created, so the clusters read, and thus the cost of the query, can only be determined when the query runs. The editor UI in this case shows the maximum possible cost of the query.
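To make the gap between the estimate and the actual bytes concrete, here is a toy model in Python. It is only a sketch and not how BigQuery is implemented: the Block class, the per-block bounding boxes, and the byte sizes are all made up for illustration. The idea it shows is that the pre-run estimate has to assume every block is read, while the run itself can skip blocks whose bounds cannot intersect the filter.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical per-block metadata: bounding box of the rows it holds.
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float
    size_bytes: int


def boxes_overlap(block, query_box):
    """True if the block's bounding box overlaps the query bounding box."""
    q_min_lon, q_min_lat, q_max_lon, q_max_lat = query_box
    return not (
        block.max_lon < q_min_lon
        or block.min_lon > q_max_lon
        or block.max_lat < q_min_lat
        or block.min_lat > q_max_lat
    )


def estimate_before_run(blocks):
    # Without looking at the data, the only safe answer is "all blocks".
    return sum(b.size_bytes for b in blocks)


def bytes_scanned(blocks, query_box):
    # At run time, blocks that cannot match the filter are skipped.
    return sum(b.size_bytes for b in blocks if boxes_overlap(b, query_box))


# Three made-up blocks; only the first one covers the polygon's area.
blocks = [
    Block(-100.0, 27.0, -99.0, 28.0, 150),
    Block(-80.0, 40.0, -79.0, 41.0, 400),
    Block(2.0, 48.0, 3.0, 49.0, 450),
]

# Bounding box of the polygon from the question.
query_box = (
    -99.54913355822276,
    27.590813604291416,
    -99.52673174853038,
    27.60526592074579,
)

upper_bound = estimate_before_run(blocks)
actual = bytes_scanned(blocks, query_box)
```

With these invented numbers the upper bound is the whole table and the actual scan is a single block, which is the same shape as the 9 TB versus 150 MB observation above.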

Related

Flex-slots intermittently unable to execute trivial queries

I am running some tests using flex-slots. I have created a flex-slot commitment/reservation of 100 using QUERY as the type, and assigned it to my project. There are no other queries running concurrently. I am using a trivial query without cached results and mode set to INTERACTIVE to test performance. Everything is running in the US multi-region.
The table I am querying has just 1.2M rows (only 86MB), and is the public wikipedia dataset found in bigquery-samples.wikipedia_benchmark.Wiki1M.
This is the query:
SELECT
SUM(views) AS total_views,
title,
LANGUAGE
FROM
`bigquery-samples.wikipedia_benchmark.Wiki1M`
WHERE
LANGUAGE='en'
AND TITLE LIKE "%Germany%"
GROUP BY
title,
LANGUAGE
ORDER BY
total_views DESC;
A successful run (e.g. bquxjob_1f741592_1853646efb9) runs in under 1 second and uses on average less than a single slot. 100 slots for this query should be more than adequate.
However, intermittently the query (e.g. bquxjob_1913d525_185365564ea) fails with:
"Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job."
Is there a reason why this trivial query would be failing even with 100 slots assigned and no other queries running concurrently?

Is SELECT TOP always the fastest way to get a preview of a query you want on SQL?

I have the following query that I've run:
SELECT TOP 100 certs.CertId, COUNT(cluster.BGTJobId) C
FROM [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
INNER JOIN [Eagle].[raw].[certs] certs
ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Ultimately, I want to get the full results and not just the top 100, but for previewing purposes, is this the fastest way to go?
Since you've mentioned this is for preview purposes, I'm assuming you just want some data out of the query and want it to run FAST regardless of which rows it returns. Seeing that you mentioned the query takes 14 minutes to execute, a quick 'hack-fix' would be something like the query below:
SELECT
certs.CertId
, COUNT(cluster.BGTJobId) C
FROM
(SELECT TOP 100
certs.CertId
, certs.BGTJobId -- needed by the join condition below
FROM [Eagle].[raw].[certs] certs) certs
INNER JOIN [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Aggregating data (in your case COUNT) is a very expensive operation and should be done only in the last part of the query, on as little data as possible. That is why, for "preview" purposes, I have selected only the first 100 rows from certs and made the COUNT on that data.
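Here is a minimal sketch of the same "limit first, then join and aggregate" idea, run against an in-memory SQLite database from Python so it can be tried anywhere. The table contents are invented, the table and column names only mimic the ones in the question, and SQLite spells it LIMIT 100 instead of TOP 100, so treat it as an illustration of the pattern rather than the exact SQL Server query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE certs (CertId INTEGER, BGTJobId INTEGER);
    CREATE TABLE cluster (skillClusterId INTEGER, BGTJobId INTEGER);
    """
)

# Invented data: 1000 jobs, each with one cert row and one cluster row.
conn.executemany(
    "INSERT INTO certs VALUES (?, ?)", [(i % 50, i) for i in range(1000)]
)
conn.executemany(
    "INSERT INTO cluster VALUES (?, ?)", [(i % 7, i) for i in range(1000)]
)

# Cut certs down to 100 rows BEFORE the join and the COUNT, so the
# expensive part of the query only ever sees the small sample.
preview_sql = """
SELECT c.CertId, cl.skillClusterId, COUNT(cl.BGTJobId) AS C
FROM (SELECT CertId, BGTJobId FROM certs LIMIT 100) AS c
JOIN cluster AS cl ON cl.BGTJobId = c.BGTJobId
GROUP BY cl.skillClusterId, c.CertId
"""
preview_rows = conn.execute(preview_sql).fetchall()

# Total number of joined rows that were counted in the preview.
previewed_jobs = sum(row[2] for row in preview_rows)
```

Keep in mind that the counts in such a preview describe only the 100 sampled rows; they are not the counts the full query would return.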
However, because you mentioned that the query takes 14 minutes to run, the problems are elsewhere and usually this is due to design (query design, index design or even table design).
You should ask yourself if you really want to go over all of the data in the tables and get all of the matching rows from both tables, and aren't you possibly missing a WHERE clause?
If you do decide that a WHERE clause is needed, are there any indexes to help filter the data based on the conditions of your WHERE clause (and even on the join columns, certs.BGTJobId and cluster.BGTJobId)?
Yes, a SELECT TOP query is the fastest for preview purposes; that's why it's also offered in the Management Studio GUI on right-click. But if you are running a custom query, check that the columns used in the WHERE clause, grouping, etc. are part of the clustered index.

"Shuffle failed with error" message when trying to use GROUP BY to return a table's distinct rows

We have a 1.01TB table with known duplicates we are trying to de-duplicate using GROUP EACH BY
There is an error message we'd like some help deciphering
Query Failed
Error:
Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.
Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I
The query as you'd imagine does quite a bit and looks a little something like this
SELECT Twenty, Different, Columns, For, Deduping, ...
including_some, INTEGER(conversions),
plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
with_some, joined, reference, columns,
COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...
Question
What does this error mean? Is it trying to do this kind of shuffling? ;-)
And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help us understand what's going on?
Side Note
For completeness we tried a more modest GROUP EACH
SELECT our, twenty, nine, string, column, table,
count(*) as dupe_count
FROM [MainDataSet.ReallyBigTable]
GROUP EACH BY all, those, twenty, nine, string, columns
And we receive a more subtle
Error: Resources exceeded during query execution.
Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4
Should BigQuery be able to perform these kinds of de-duplication queries? How should we best approach this problem?
Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.
When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data... you can think of it as a giant sort operation.
This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if there is a way we can figure out how to make your query work.
In the meantime, one option would be to partition your data into smaller tables and do the deduping on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:
(SELECT * from [your_big_table] WHERE ABS(HASH(column1) % 10) == 1)
Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
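The partition-then-dedupe workaround can be sketched in plain Python as follows. This is an illustration of the idea only, not BigQuery code: the rows are invented, zlib.crc32 stands in for HASH(), and the helper names are hypothetical. The point it shows is that hashing a column that is part of the grouping key sends all copies of a row to the same partition, so each partition can be deduplicated on its own and the outputs simply appended.

```python
import zlib
from collections import Counter


def partition_of(row, num_partitions):
    """Assign a row to a partition from a hash of its first column.

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    key = repr(row[0]).encode("utf-8")
    return zlib.crc32(key) % num_partitions


def dedupe_in_partitions(rows, num_partitions=10):
    """Deduplicate rows one partition at a time and append the results."""
    output = []
    for p in range(num_partitions):
        # One full pass over the input per partition, which mirrors the
        # "run the query over the big table 10 times" cost.
        counts = Counter(
            r for r in rows if partition_of(r, num_partitions) == p
        )
        output.extend((row, n) for row, n in counts.items())
    return output


# Invented sample rows with known duplicates.
rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2), ("a", 1)]
result = dict(dedupe_in_partitions(rows))
```

Each distinct row appears exactly once in the result, together with its duplicate count, just as a single GROUP BY over the whole table would produce.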

Access & SQL Server: Number of uses since date aggregate problem - new reporting problem (solved aggregate issue)

BACKGROUND:
I've been trying to streamline the work involved in running a report in my program. Lately, I've had to supply a listing of job numbers an instrument has been used on with the listing of items for cost/benefit analysis. Mostly to see how often an instrument is used since it was last serviced/calibrated and the last time anyone did use it. I was looking to integrate this into the query that helps generate the report - but I keep hitting a brick wall of sorts with the number of uses - since I want that aggregate to be based on the date the instrument was last calibrated (a field based in the same query). I can get it to give me the number of uses in the system total - but it will not accept the limitation that I want it to be only counting the times used since the last time it was calibrated
PROBLEM:
Attempts to put an aggregate function in my report for the number of uses since the item's calibration are met either with undesired results, or the dreaded 'aggregate missing' error (don't remember the exact warning).
-- Edited to add 8/12/2011 # 16:09 --
An additional problem with the use of the Max aggregate has been found for instruments that have never been used being excluded by this query.
DETAILS:
Here is the query that does work so far:
SELECT
dbo_tblPOGaugeDetail.intGagePOID,
dbo_tblPOGaugeDetail.strGageDetailID,
dbo_Gage_Master.Description,
dbo_Gage_Master.Manufacturer,
dbo_Gage_Master.Model_No,
dbo_Gage_Master.Gage_SN,
dbo_Gage_Master.Unit_of_Meas,
dbo_Gage_Master.User_Defined,
dbo_Gage_Master.Calibration_Frequency,
dbo_Gage_Master.Calibration_Frequency_UOM,
dbo_tblPOGaugeDetail.bolGageLeavePriceBlank,
dbo_tblPOGaugeDetail.intGageCost,
dbo_Gage_Master.Last_Calibration_Date,
dbo_Gage_Master.Next_Due_Date,
dbo_tblPOGaugeDetail.bolGageEvaluate,
dbo_tblPOGaugeDetail.bolGageExpedite,
dbo_tblPOGaugeDetail.bolGageAccredited,
dbo_tblPOGaugeDetail.bolGageCalibrate,
dbo_tblPOGaugeDetail.bolGageRepair,
dbo_tblPOGaugeDetail.bolGageReturned,
dbo_tblPOGaugeDetail.bolGageBER,
dbo_tblPOGaugeDetail.intTurnaroundDaysOut,
qryRCEquipmentLastUse.MaxOfdatDateEntered
FROM (dbo_tblPOGaugeDetail
INNER JOIN dbo_Gage_Master ON dbo_tblPOGaugeDetail.strGageDetailID = dbo_Gage_Master.Gage_ID)
INNER JOIN qryRCEquipmentLastUse ON dbo_Gage_Master.Gage_ID = qryRCEquipmentLastUse.Gage_ID
ORDER BY dbo_tblPOGaugeDetail.strGageDetailID;
But I can't seem to aggregate a count of Uses (making a Count(strCustomerJobNum)) from the tblGageActivity with the following fields:
strGageID
strCustomerJobNum
datDateEntered
datTimeEntered
I tried to add a field to the formerly listed query to do a Count(strCustomerJobNum) where datDateEntered matched the Last_Calibration_Date from the calling query - but I got the 'missing aggregate' error. If I leave this condition out - it will run - but will list every instrument ever sent out only if it's had a usage count of at least one (not what I want at all, sadly).
I also want to make sure that if I should get a zero uses count - I will get a zero back instead of my expected records minus the null results.
I hope someone out there can tell me where I am going wrong with this - I want to save the time I am currently spending running an activity report in another program whenever I want to generate this report. Thanks in advance, and let me know if you need me to post more information.
-- Edited to add 08/15/2011 # 14:41 --
I managed to solve the Max() aggregate problem by creating a 'pure' first-step query that lists every instrument with its most recent usage date, saved as qryRCEquipmentLastUse.
qryRCEquipmentLastUse:
SELECT dbo.tblGageActivity.strGageID, Max(dbo.tblGageActivity.datDateEntered) AS datLastDateUsed
FROM dbo.tblGageActivity
GROUP BY dbo.tblGageActivity.strGageID;
Then I created a 'pure' listing of all instruments that have no usage at all as a query named qryRCEquipmentNeverUsed.
qryRCEquipmentNeverUsed:
SELECT dbo_Gage_Master.Gage_ID, NULL AS datLastDateUsed
FROM dbo_Gage_Master LEFT JOIN dbo_tblGageActivity ON dbo_Gage_Master.Gage_ID = dbo_tblGageActivity.strGageID
WHERE (((dbo_tblGageActivity.strGageID) Is Null));
NOTE: The NULL was inserted so that the third combining UNION query will not fail due to a mismatch in the number of fields being retrieved from the tables.
At last, I created a UNION query named qryCombinedUseEquipment to combine the two into a list:
qryCombinedUseEquipment:
SELECT *
FROM qryRCEquipmentLastUse
UNION SELECT *
FROM qryRCEquipmentNeverUsed;
Using this last union query to feed the Last Used date to the parent query works in datasheet view, but when the parent query is called in the report - I get a blank report; so a nudge in the right direction would still be wonderfully appreciated.
APPENDIX
Same script as above, but with shorter table aliases (in case someone finds that clearer):
SELECT
gd.intGagePOID,
gd.strGageDetailID,
gm.Description,
gm.Manufacturer,
gm.Model_No,
gm.Gage_SN,
gm.Unit_of_Meas,
gm.User_Defined,
gm.Calibration_Frequency,
gm.Calibration_Frequency_UOM,
gd.bolGageLeavePriceBlank,
gd.intGageCost,
gm.Last_Calibration_Date,
gm.Next_Due_Date,
gd.bolGageEvaluate,
gd.bolGageExpedite,
gd.bolGageAccredited,
gd.bolGageCalibrate,
gd.bolGageRepair,
gd.bolGageReturned,
gd.bolGageBER,
gd.intTurnaroundDaysOut,
lu.MaxOfdatDateEntered
FROM (dbo_tblPOGaugeDetail gd
INNER JOIN dbo_Gage_Master gm ON gd.strGageDetailID = gm.Gage_ID)
INNER JOIN qryRCEquipmentLastUse lu ON gm.Gage_ID = lu.Gage_ID
ORDER BY gd.strGageDetailID;
Piece by piece...
First -- I suspect you're trying to answer too many questions at once (as evidenced by 23 fields in your SELECT), which will make aggregation near-impossible. Start by narrowing down the scope of the query -- What question is this query attempting to answer? (You can always make more queries to answer other questions... :-)
1) How many uses since last calibration?
2) How many uses since last ...use? (not sure what you mean by that -- maybe last sign-out, or last rental, etc.?)
Tip -- learn to use table aliases. Large queries are difficult to read; worse because of repeated table names.
1) Ex.: dbo_tblPOGaugeDetail.intGagePOID becomes d.intGagePOID
Here's a sample that might get you started:
SELECT
d.strCustomerJobNum,
Max(d.last_calibration_date), -- not sure what you named that field
Count(d.strCustomerJobNum)
FROM
dbo_tblPOGaugeDetail d
GROUP BY
d.strCustomerJobNum
Does this work:
SELECT dbo_tblPOGaugeDetail.intGagePOID, dbo_tblPOGaugeDetail.strGageDetailID,
OuterGageMaster.Description, OuterGageMaster.Manufacturer, OuterGageMaster.Model_No,
OuterGageMaster.Gage_SN, OuterGageMaster.Unit_of_Meas, OuterGageMaster.User_Defined,
OuterGageMaster.Calibration_Frequency, OuterGageMaster.Calibration_Frequency_UOM,
dbo_tblPOGaugeDetail.bolGageLeavePriceBlank, dbo_tblPOGaugeDetail.intGageCost,
OuterGageMaster.Last_Calibration_Date, OuterGageMaster.Next_Due_Date,
dbo_tblPOGaugeDetail.bolGageEvaluate, dbo_tblPOGaugeDetail.bolGageExpedite,
dbo_tblPOGaugeDetail.bolGageAccredited, dbo_tblPOGaugeDetail.bolGageCalibrate,
dbo_tblPOGaugeDetail.bolGageRepair, dbo_tblPOGaugeDetail.bolGageReturned,
dbo_tblPOGaugeDetail.bolGageBER, dbo_tblPOGaugeDetail.intTurnaroundDaysOut,
qryRCEquipmentLastUse.MaxOfdatDateEntered,
(Select Count(strCustomerJobNum)
FROM tblGageActivity WHERE
OuterGageMaster.Last_Calibration_Date=tblGageActivity.datDateEntered) As JobCount
FROM
(dbo_tblPOGaugeDetail INNER JOIN dbo_Gage_Master OuterGageMaster ON
dbo_tblPOGaugeDetail.strGageDetailID = OuterGageMaster.Gage_ID) INNER JOIN
qryRCEquipmentLastUse ON OuterGageMaster.Gage_ID = qryRCEquipmentLastUse.Gage_ID
ORDER BY
dbo_tblPOGaugeDetail.strGageDetailID;
or is that what you tried?
Summary Problem:
Attempts to put an aggregate function in my report for the number of uses since the item's calibration are met either with undesired results, or the dreaded 'aggregate missing' error.
Solution:
I decided to leave the query driving the report alone. Instead, I used the domain aggregates DLookup and DCount: DLookup retrieves the last used date from a query that provides the last used date of all the instruments, and DCount retrieves the number of uses an instrument has had since its last calibration.
Using the query described in the problem description, I am able to retrieve the last used date for all instruments. I used a =DLookup statement as the source for a text box on the report's subreport dealing with various items as such:
=IIf((DLookUp("[qryRCCombinedUseEquipment]![datLastDateUsed]","[qryRCCombinedUseEquipment]","[qryRCCombinedUseEquipment]![strGageID]=[strGageDetailID]")) Is Null Or ([bolGageReturned]=True),"",DLookUp("[qryRCCombinedUseEquipment]![datLastDateUsed]","[qryRCCombinedUseEquipment]","[qryRCCombinedUseEquipment]![strGageID]=[strGageDetailID]"))
This allows items that have never been used to return a NULL result, which will display as a blank text box.
The number of uses, however, would not feed off a query using =DCount (I tried, it would take over ten minutes to retrieve results, if it ever did). However, using the underlying activity table, I used the following statement:
=IIf([bolGageReturned],"","Used " & DCount("[dbo_tblGageActivity]![strGageID]","[dbo_tblGageActivity]","[dbo_tblGageActivity]![strGageID] = [strGageDetailID] And [dbo_tblGageActivity]![datDateEntered] Between [txtLastCalibrationDate] And date()") & " times since last calibration")
It would retrieve a number of times used since the instrument was last calibrated, but no uses that are before that or after today (some jobs are post dated, strangely). Of course, this is SLOW (about thirty seconds for a large document with thirty or forty instruments).
Does anyone else have a better solution for this, or will I have to take the performance hit? If no one has any better ideas, I will accept this as the answer after five days (8/21/2011) .
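For comparison, one set-based pattern for "uses since last calibration, with zero for instruments never used" is a LEFT JOIN whose date filter sits in the join condition, followed by a GROUP BY. The sketch below runs it in an in-memory SQLite database from Python. It is an assumption-laden illustration and not a tested Access solution: the table names gage_master and gage_activity, the sample rows, and the fixed "today" date are all invented, and Access SQL may need the date filter expressed differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gage_master (
        Gage_ID TEXT PRIMARY KEY,
        Last_Calibration_Date TEXT
    );
    CREATE TABLE gage_activity (
        strGageID TEXT,
        strCustomerJobNum TEXT,
        datDateEntered TEXT
    );
    INSERT INTO gage_master VALUES
        ('G1', '2011-06-01'),
        ('G2', '2011-07-01'),
        ('G3', '2011-05-01');
    INSERT INTO gage_activity VALUES
        ('G1', 'J100', '2011-05-15'),
        ('G1', 'J101', '2011-06-10'),
        ('G1', 'J102', '2011-07-20'),
        ('G2', 'J103', '2011-06-15');
    """
)

# The date filter lives in the ON clause, not in WHERE, so instruments
# with no qualifying activity still produce a row. COUNT over the joined
# column ignores the NULLs from unmatched rows and therefore returns 0.
sql = """
SELECT m.Gage_ID,
       COUNT(a.strCustomerJobNum) AS UsesSinceCalibration
FROM gage_master AS m
LEFT JOIN gage_activity AS a
       ON a.strGageID = m.Gage_ID
      AND a.datDateEntered BETWEEN m.Last_Calibration_Date AND :today
GROUP BY m.Gage_ID
ORDER BY m.Gage_ID
"""
uses = conn.execute(sql, {"today": "2011-08-15"}).fetchall()
```

In this made-up data G1 has two uses after its calibration date, G2 was only used before its calibration, and G3 was never used, so the last two come back with a count of zero instead of disappearing from the result.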

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give some concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a pseudo-random sample of data to work your queries out, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get, provided the IDs are not correlated with the data. Note: a prime modulus is a good choice because it is less likely to line up with periodic patterns in how the IDs were assigned, which would skew the sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random number methods than just using mod. Check the documentation to see what's available for your version.
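A quick way to check the sampling arithmetic is to run the same statement on a throwaway table. The sketch below uses an in-memory SQLite database from Python; the table X and its 101,300 invented rows are assumptions for the demo, and SQLite's % operator stands in for mod().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE X (unique_id INTEGER PRIMARY KEY, data_value REAL)"
)

# Invented data: ids 1..101300 with an arbitrary value per row.
conn.executemany(
    "INSERT INTO X VALUES (?, ?)",
    ((i, i * 0.5) for i in range(1, 101_301)),
)

# Same shape as the statement above: keep rows whose id leaves
# remainder 1 when divided by the prime 1013.
conn.execute(
    """
    CREATE TABLE sample_table AS
    SELECT unique_id, data_value
    FROM X
    WHERE unique_id % 1013 = 1
    """
)

total_rows = conn.execute("SELECT COUNT(*) FROM X").fetchone()[0]
sample_rows = conn.execute(
    "SELECT COUNT(*) FROM sample_table"
).fetchone()[0]
fraction = sample_rows / total_rows
```

Here the sample keeps one row in every 1013, which is the roughly 0.1% reduction mentioned above.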
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface you have to use an unbuffered version of the function that executes the query.