I've been trying to run this query and I keep getting a "resources exceeded" error, despite setting Allow Large Results to true and setting a destination table. I tried adding a limit as well. Is there any way I can optimize the query to avoid this error?
SELECT
repo_name,
commit,
parent,
subject,
message,
difference.*
FROM
FLATTEN(FLATTEN([bigquery-public-data:github_repos.commits], repo_name), parent)
WHERE
(REGEXP_MATCH(message,r'(?i:null pointer dereference)'))
LIMIT
5000000
Thank you very much!
It worked for me (and took about a minute and a half):
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name
CROSS JOIN UNNEST(parent) AS parent
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
I suspect that legacy SQL in BigQuery is worse at handling this kind of thing. As an aside, though: is there a reason that you want to flatten the repo_name and parent arrays? Without flattening, the query would produce 37,073 rows, whereas with flattening, it produces 11,419,166. Depending on the kind of analysis that you want to do, you may not need to flatten the data at all.
Edit: since it sounds like flattening was only necessary to work around legacy SQL's limitations related to independently repeating fields, you can remove the CROSS JOINs and avoid flattening:
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
This is faster, too--it takes about 15 seconds instead of 90.
For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: there is one Meterpoint which is a submeter of another Meterpoint, and I'd be interested in the delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution using a subquery, but it does not appear to be very efficient:
SELECT
    A.dat_Time,
    (A.int_Counts - (SELECT B.int_Counts
                     FROM tbl_Metering AS B
                     WHERE B.fk_MetPoint = 2
                       AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
(m.int_counts - m.int_counts_2) as delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_counts END) OVER (PARTITION BY dat_time) as int_counts_2
FROM tbl_Metering m
) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row, using a GROUP BY; but this is also a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
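For reference, a rough sketch of that set-based GROUP BY approach (it relies on the fact, stated above, that both Meterpoints are registered at the same Timestamp, so conditional aggregation can pair the two readings per timestamp):
SELECT dat_Time,
       MAX(CASE WHEN fk_MetPoint = 1 THEN int_Counts END)
     - MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) AS Delta
FROM tbl_Metering
WHERE fk_MetPoint IN (1, 2)   -- scan only the two relevant Meterpoints
GROUP BY dat_Time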
However if this is a metering project, then we are going to expect a high volume of records, if not now, certainly over time.
I would recommend you look into altering the input such that the delta is stored as its own first-class column; this moves much of the performance hit to the write process, which presumably will only ever occur once for each record, whereas your SELECT will be executed many times.
This can be performed using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries (a rough trigger sketch follows the list below):
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
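As a sketch only, assuming SQL Server syntax (for the INSTEAD OF trigger), an identity ID column, and a hypothetical int_Delta column added to tbl_Metering, the write-time computation could look something like this:
CREATE TRIGGER trg_Metering_ComputeDelta
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Re-issue the insert, adding the delta versus the previous reading of the same Meterpoint
    -- (int_Delta is NULL for the very first reading of a Meterpoint)
    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint, int_Delta)
    SELECT i.dat_Time,
           i.int_Counts,
           i.fk_MetPoint,
           i.int_Counts - prev.int_Counts
    FROM inserted AS i
    OUTER APPLY (SELECT TOP 1 m.int_Counts
                 FROM tbl_Metering AS m
                 WHERE m.fk_MetPoint = i.fk_MetPoint
                   AND m.dat_Time < i.dat_Time
                 ORDER BY m.dat_Time DESC) AS prev;
END;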
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
We have a 1.01TB table with known duplicates we are trying to de-duplicate using GROUP EACH BY
There is an error message we'd like some help deciphering
Query Failed
Error:
Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.
Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I
The query as you'd imagine does quite a bit and looks a little something like this
SELECT Twenty, Different, Columns, For, Deduping, ...
including_some, INTEGER(conversions),
plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
with_some, joined, reference, columns,
COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...
Question
What does this error mean? Is it trying to do this kind of shuffling? ;-)
And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help us understand what's going on?
Side Note
For completeness we tried a more modest GROUP EACH
SELECT our, twenty, nine, string, column, table,
count(*) as dupe_count
FROM [MainDataSet.ReallyBigTable]
GROUP EACH BY all, those, twenty, nine, string, columns
And we receive a more subtle
Error: Resources exceeded during query execution.
Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4
Should Bigquery be able to perform these kind of de-duplication queries? How should we best approach this problem?
Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.
When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data... you can think of it as a giant sort operation.
This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if there is a way we can figure out how to make your query work.
In the meantime, one option would be to partition your data into smaller tables and do the deduping on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:
(SELECT * from [your_big_table] WHERE ABS(HASH(column1) % 10) = 1)
Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
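As a rough sketch, reusing the placeholder column names from your side note rather than your real schema, each of the ten partitioned runs would look something like this:
SELECT our, twenty, nine, string, columns,
       COUNT(*) AS dupe_count
FROM [MainDataSet.ReallyBigTable]
WHERE ABS(HASH(our) % 10) = 0
GROUP EACH BY our, twenty, nine, string, columns
Run it with = 0 through = 9; hashing on any one of the grouped columns works, since true duplicates share every column value and therefore land in the same partition. Write each run to a destination table with Allow Large Results set, then copy/append the ten outputs into your final table.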
I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time),
       (SELECT 1000 + SUM(p2)
        FROM events e
        WHERE p1 = 'barrycarter' AND action = 'points'
          AND e.time <= e2.time) AS points
FROM events e2
WHERE p1 = 'barrycarter' AND action = 'points'
ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has @variables so you can do things like:
SELECT time, @tot := @tot + points ...
but I'm using sqlite3 and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or at least it will be terribly slow, as you have found: the query you have written is a triangular cross join, which scales badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite; e.g. if you were using Web SQL in HTML5, you could easily accumulate the total in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store running sums instead of, or as well as, storing points. (If you only store sums you can get the points by computing sum(n) - sum(n-1), which is fast; see the sketch below.)
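As a sketch of the second option, assuming a hypothetical stored total column on the events table that holds the running total at write time:
-- Running total: just read the stored column
SELECT time, total
FROM events
WHERE p1 = 'barrycarter' AND action = 'points'
ORDER BY time;

-- Points for a single event: subtract the previous stored total (fast with an index on time)
SELECT e.time,
       e.total - (SELECT e2.total
                  FROM events e2
                  WHERE e2.p1 = 'barrycarter' AND e2.action = 'points'
                    AND e2.time < e.time
                  ORDER BY e2.time DESC
                  LIMIT 1) AS points
FROM events e
WHERE e.p1 = 'barrycarter' AND e.action = 'points';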
I'm in serious trouble: I have a huge, subtle query that takes a very long time to execute. It actually freezes Access and sometimes I have to kill it. The query looks like this:
SELECT
ITEM.*,
ERA.*,
ORDR.*,
ITEM.COnTY1,
(SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1) AS NewConTy1,
ITEM.COnValue1,
(SELECT TOP 1 KBETR FROM GEN_KUMV WHERE KNUMV = ERA.DOCCOND AND KSCHL = (SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1)) AS NewCOnValue1
--... etc: this continues until ConTy40
FROM
GEN_ITEMS AS ITEM,
GEN_ORDERS AS ORDR,
GEN_ERASALES AS ERA
WHERE
ORDR.ORDER_NUM = ITEM.ORDER_NUM AND -- link between ITEM and ORDR
ERA.concat = ITEM.concat -- link between ERA and ITEM
I won't provide you with the tables' schema since the query works; what I'd like to know is whether there's a way to add the NewConTy1 and NewConValue1 columns using another technique, to make it more efficient. The thing is that the Con* fields go from 1 to 40, so I have to align them (NewConTy1 next to ConTy1, NewConValue1 next to ConValue1... etc. until 40).
ConTy# and ConTyValue# are in ITEMS (each in a field)
NewConty# and NewConValue# are in ERA (each in a record)
I really hope my explanation is enough to figure out my issue,
Looking forward to hearing from you guys
EDIT:
Ignore the TOP 1 in the SELECTs; it's there because the current dumps of data I have aren't accurate, and it's going to be removed later.
EDIT 2:
Another thing: my query also returns up to 230 fields, lol.
Thanks
Miloud
Have you considered a union query to normalize items?
SELECT "ConTy1" As CTName, Conty1 As CTVal,
"ConTyValue1" As CTVName, ConTyValue1" As CTVVal
FROM ITEMS
UNION ALL
SELECT "ConTy2" As CTName, Conty2 As CTVal,
"ConTyValue2" As CTVName, ConTyValue2" As CTVVal
FROM ITEMS
<...>
UNION ALL
SELECT "ConTy40" As CTName, Conty40 As CTVal,
"ConTyValue40" As CTVName, ConTyValue40" As CTVVal
FROM ITEMS
This can either be a separate query that links in to your main query, or a sub query of your main query, if that is more convenient. It should then be easy enough to draw in the relationship to the NewConty# and NewConValue# in ERA.
Remou's answer gives what you want: a significantly different approach. It's been a while since I've meddled with MS Access query optimization, and I have forgotten the details of its planner, but you might want to try a trivial suggestion and turn your
WHERE conditions
into
INNER JOIN ON conditions
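In Access SQL the join rewrite would look roughly like this (same tables as your query, joins nested in parentheses as Access requires; the correlated sub-selects would stay in the SELECT list as before):
SELECT
    ITEM.*,
    ERA.*,
    ORDR.*
FROM
    (GEN_ITEMS AS ITEM
     INNER JOIN GEN_ORDERS AS ORDR
         ON ORDR.ORDER_NUM = ITEM.ORDER_NUM)
     INNER JOIN GEN_ERASALES AS ERA
         ON ERA.concat = ITEM.concat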
You are firing 40-ish correlated subqueries, so the above probably will not help (again, Remou's answer takes a significantly different approach and you might see real improvements there), but do let us know, as it is trivial to test.
Another approach you can take is to materialize the expensive parts: take Remou's idea, but split it into different pieces that you can join on directly.
For example, your first subquery is correlated on ITEM.COnTY1, while your second is correlated on ERA.DOCCOND and ITEM.COnTY1.
If you classify your subqueries according to their correlated keys, then you can save them as queries (or materialize them as make-table queries) and join on them (or on the newly created tables), which might perform much faster (and in the case of make-tables will perform much faster, at the expense of materializing, so you'll have to run some queries before getting the latest data; this can be encapsulated in a macro or a VBA function/sub).
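For example, the sub-selects that are correlated only on ITEM.COnTY1 could be replaced by a single saved query (qryMappingNew is a hypothetical name) returning one New value per Old key:
SELECT Old, First(New) AS New
FROM MAPPING
GROUP BY Old;
The main query then does LEFT JOIN qryMappingNew ON qryMappingNew.Old = ITEM.COnTY1 and selects qryMappingNew.New AS NewConTy1, instead of firing a correlated sub-select per row; the GEN_KUMV lookups can be handled the same way, keyed on KNUMV and KSCHL. First() plays the role of your interim TOP 1.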
Otherwise (for example, if you run the above query regularly as part of your normal business use case), redesign your DB.
As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
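For instance, a query of this shape (table and column names made up for illustration) cannot stream partial results: every row has to be read before any group's total is final, and the LIMIT only trims the finished output.
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
LIMIT 10;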
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, if you have table-create privileges and you have some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work out your queries on, and you can inner join your sample_table against the other tables to improve the speed of testing your queries. Thanks to the sampling, your query results should be roughly representative of what you will get on the full data. Note: the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
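For example (other_table and other_column are made up for illustration), you point your test queries at the sample and join out from it:
SELECT s.unique_id, s.data_value, o.other_column
FROM sample_table s
INNER JOIN other_table o
    ON o.unique_id = s.unique_id;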
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output, such as might happen for queries with GROUP BY, ORDER BY, or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result as an attribute of the database handle rather than the default mysql_store_result. This is true for the Perl and Java interfaces; I think in the C interface, you have to use the unbuffered version of the function that executes the query.