Inconsistent results from BigQuery: same query, different number of rows - google-bigquery

I noticed today that one of my queries was returning inconsistent results: every time I ran it I got a different number of rows (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between queries, so I'm wondering if you have also noticed this kind of behaviour.
I usually run queries that return many rows (>1000), so a few missing rows here and there would be hardly noticeable. But this query returns only a few rows, and the count varies every time between 10 and 20 :-/
If a Google engineer is reading this, here are two Job IDs of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2

Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
replace SELECT * with an explicit selection of fields from your tables (a combination of fields that uniquely determines each row)
order the table by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result. After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)

Related

Index for join query with where clause PostgreSQL

I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further info:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
For most entries (98%) in first, f.attributex_id IS NOT NULL will be true. The same goes for the second condition, f.attributey_id IS NULL.
I tried to add an index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via Explain Analyze) when executing the query. What kind of indexes would I need to optimize the query and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on
second.month
first.attributex_id
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;
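A quick way to convince yourself that the rewrite is equivalent is to run both forms against the same sample data. A sketch in SQLite (schema and values invented for illustration; Postgres planner and index behaviour will of course differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE second (id INTEGER PRIMARY KEY, month INTEGER);
    CREATE TABLE first  (id INTEGER PRIMARY KEY,
                         attributex_id INTEGER,   -- FK to second.id
                         attributey_id INTEGER);
    INSERT INTO second VALUES (1, 3), (2, 1), (3, 2);
    INSERT INTO first  VALUES (10, 1, NULL), (11, 2, NULL), (12, 3, NULL);
""")

# The original form: drive from first, join to second.
original = cur.execute("""
    SELECT f.id FROM first f
    JOIN second s ON f.attributex_id = s.id
    WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
    ORDER BY s.month ASC LIMIT 100
""").fetchall()

# The rewrite: drive from second, as suggested above. The inner join
# already implies attributex_id IS NOT NULL, so that condition drops out.
rewritten = cur.execute("""
    SELECT f.id FROM second s
    JOIN first f ON f.attributex_id = s.id
    WHERE f.attributey_id IS NULL
    ORDER BY s.month ASC LIMIT 100
""").fetchall()

print(original == rewritten)  # True
```

Rows come back ordered by second.month either way; the rewrite just makes it obvious that an index on second.month is the one that matters for the LIMIT.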

MS Access SQL - Removing Duplicates From Query

MS Access SQL - This is a generic performance-related duplicates question. So, I don't have a specific example query, but I believe I have explained the situation below clearly and simply in 3 statements.
I have a standard/complex SQL query that Selects many columns; some computed, some with asterisk, and some by name - e.g. (tab1.*, (tab2.co1 & tab2.col2) as computedFld1, tab3.col4, etc).
This query Joins about 10 tables. And the Where clause is based on user specified filters that could be based on any of the fields present in all 10 tables.
Based on these filters, I can sometimes get records with the same tab4.ID value.
Question: What is the best way to eliminate duplicate result rows with the same tab4.ID value. I don't care which rows get eliminated. They will differ in non-important ways.
Or, if important, they will differ in that they will have different tab5.ID values; and I want to keep the result rows with the LARGEST tab5.ID values.
But if the first query performs better than the second, then I really don't care which rows get eliminated. The performance is more important.
I have worked on this most of the morning and I am afraid the answer is above my pay grade. I have tried GROUP BY tab4.ID, but then I can't use "*" in the SELECT clause; I've tried many other things and just keep bumping my head against a wall.
Access does not support CTEs but you can do something similar with saved queries.
So first alias the columns that have same names in your query, something like:
SELECT tab4.ID AS tab4_id, tab5.ID AS tab5_id, ........
and then save your query for example as myquery.
Then you can use this saved query like this:
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
This will return 1 row for each tab4_id if there are no duplicate tab5_ids for each tab4_id.
If there are duplicates then you must provide additional conditions.
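The correlated-subquery pattern above is portable across databases. Here is a minimal sketch of the same keep-the-largest-tab5_id logic in SQLite (the saved-query step is Access-specific, so a plain table with invented rows stands in for myquery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE myquery (tab4_id INTEGER, tab5_id INTEGER, payload TEXT);
    INSERT INTO myquery VALUES
        (1, 100, 'keep'), (1, 50, 'drop'),
        (2, 7,   'keep'),
        (3, 9,   'drop'), (3, 20, 'keep');
""")

# One row per tab4_id: the one carrying the largest tab5_id.
rows = cur.execute("""
    SELECT q1.* FROM myquery AS q1
    WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id)
                        FROM myquery AS q2
                        WHERE q2.tab4_id = q1.tab4_id)
    ORDER BY q1.tab4_id
""").fetchall()

print(rows)  # only the 'keep' row survives for each tab4_id
```

Note the caveat from the answer: if two rows shared both tab4_id and the maximal tab5_id, both would survive, so a tie-breaking condition would then be needed.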

MSSQL - Question about how insert queries run

We have two tables we want to merge. Say, table1 and table2.
They have the exact same columns, and the exact same purpose. The difference being table2 having newer data.
We used a query with a LEFT JOIN to find the rows that are common between them, and skip those rows while merging. The problem is this: both tables have 500M rows.
When we ran the query, it kept going on and on. For an hour it just kept running. We were certain this was because of the large number of rows.
But when we checked how many rows had already been inserted into table2 by running select count(*) from table2, it returned exactly the same row count as when we started.
Our question is: is that how it's supposed to work? Do the rows all get inserted at once, after all the matches have been found?
If you would like to read uncommitted data, then the count query should be modified like this:
select count(*) from table2 WITH (NOLOCK)
NOLOCK is over-used, but in this specific scenario, it might be handy.
Rows are not inserted or updated one by one; a single INSERT ... SELECT statement is atomic.
I also don't see how "SELECT COUNT(*) FROM table2 WITH (NOLOCK)" is related here.
The join is simply taking too long to produce the result set that feeds the insert operator, so nothing has been inserted because no result set has been produced yet.
The join is slow because the LEFT JOIN condition yields a very high cardinality estimate,
so the join condition has to be fixed first.
To do that, more information is needed: the table schemas, data types and lengths, existing indexes, and the requirement.
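The "rows appear all at once" point can be seen in any transactional database: until the inserting statement's transaction commits, a second connection keeps seeing the old row count. A sketch with SQLite (SQL Server's locking and NOLOCK semantics differ in detail, but the commit-visibility idea is the same; the schema is invented):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connections in autocommit mode,
# so BEGIN/COMMIT below are fully explicit.
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE table2 (id INTEGER)")
writer.execute("INSERT INTO table2 VALUES (1), (2)")

writer.execute("BEGIN")
writer.execute("INSERT INTO table2 VALUES (3)")  # not committed yet

# A second connection still sees the pre-transaction row count.
before = reader.execute("SELECT COUNT(*) FROM table2").fetchone()[0]

writer.execute("COMMIT")  # the insert becomes visible all at once

after = reader.execute("SELECT COUNT(*) FROM table2").fetchone()[0]
print(before, after)  # 2 3
```

This is exactly why counting table2 from another session during the hour-long run showed no progress: nothing is visible until the statement finishes and commits.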

SQL query, view joined to table - number of results inconsistent

Apologies in advance for the vagueness of this question, but it involves a query which is too big to describe in full, and field/table names that I can't reveal. So I'm not really expecting a solution, but if someone could give some advice on how I could proceed in solving it myself, I'd be grateful.
SQL Server 2000.
I have a query which joins a view and a table with an INNER JOIN and has a WHERE clause:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
# (23 rows)
This produces 23 results (it should be thousands). If I add another condition to the WHERE clause of the above query, I get more results instead of fewer:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE' AND
view.field2=1
This gives me a few thousand results, as was originally expected. Changing the value to 2 or 3 (the only other values present) also gives me thousands of results each, but if I change it to view.field2 IN (1,2,3) I end up with only 38 results.
Going back to the original query, which gave me 23 results, if I add the table field I have in the WHERE clause to the SELECT block, I get the right number of results:
SELECT
view.join_field,
table.other_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
# (8764 rows)
If I instead use a WHERE clause of table.other_field='GG' (the only other value present in the table), none of these strange things happen, and I get the expected number of results.
If I SELECT the contents of view into a temporary table, and use that in my query, I also get the thousands of rows I was expecting.
view itself is a LEFT OUTER JOIN of another view and two other tables. table, in my query, is not involved in any of the views.
Can anyone give me even the vaguest of ideas of what's going on? Are my tables or views corrupt, somehow?

Where clause affecting join

Dumb question time. Oracle 10g.
Is it possible for a where clause to affect a join?
I've got a query in the form of:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';
Obviously this is a contrived example. No real need to show the table structure as it's all imaginary. Unfortunately, I can't show the real table structure or query. In this case, p.product_value is a VARCHAR2 field which in certain rows has an ID stored in it rather than text. (Yes, bad design - but something I inherited and am unable to change.)
The issue is in the join. If I leave out the where clause, the query works and rows are returned. However, if I add the where clause, I get "invalid number" error on the pd.product_value = to_number(p.product_value) join condition.
Obviously, the "invalid number" error happens when rows are joined which contain non-digits in the p.product_value field. However, my question is how are those rows being selected? If the join succeeds without the outer where clause, shouldn't the outer where clause just select rows from the result of the join? It appears what is happening is the where clause is affecting what rows are joined, despite the join being in an inner query.
Is my question making sense?
It affects the plan that's generated.
The actual order that tables are joined (and so filtered) is not dictated by the order you write your query, but by the statistics on the tables.
In one version, the generated plan coincidentally means that the 'bad' rows never get processed, because the preceding joins filter the result set down to the point that those rows are never joined.
The introduction of the WHERE clause has meant that Oracle now believes a different join order is better (because filtering by the product name requires a certain index, or because it narrows the data down a lot, etc.).
This new order means that the 'bad' rows get processed before the join that filters them out.
I would endeavour to clean the data before querying it. Possibly by creating a derived column where the value is already cast to a number, or left as NULL if it is not possible to do so.
You can also use EXPLAIN PLAN to see the different plans being generated for your queries.
Short answer: yes.
Long answer: the query engine is free to rewrite your query however it wants, as long as it returns the same results. All of the query is available to it to use for the purpose of producing the most efficient query it can.
In this case, I'd guess that there is an index that covers what you are wanting but doesn't cover product_name. When you add that to the where clause, the index isn't used; instead there's a scan where both conditions are tested at the same time, hence your error.
Which is really an error in your join condition: you shouldn't be using to_number unless you are sure the value is a number.
I guess your to_number(p.product_value) is only valid for rows with a matching product_name.
What happens is that your join is applied before your where clause, so the to_number function fails on rows the where clause would have excluded.
What you need to do is include your product_name like '%prototype%' as a JOIN clause like this:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on product_name like '%prototype%' AND
pd.product_value = to_number(p.product_value));
For more background (and a really good read), I'd suggest reading Jonathan Gennick's Subquery Madness.
Basically, the problem is that Oracle is free to evaluate predicates in any order. So it is free to push (or not push) the product_name predicate into your subquery. It is free to evaluate the join conditions in any order. So if Oracle happens to pick a query plan where it filters out the non-numeric product_value rows before it applies the to_number, the query will succeed. If it happens to pick a plan where it applies the to_number before filtering out the non-numeric product_value rows, you'll get an error. Of course, it's also possible that it will return the first N rows successfully and then you'll get an error when you try to fetch row N+1 because row N+1 is the first time that it is trying to apply the to_number predicate to a non-numeric data.
Other than fixing the data model, you could potentially throw some hints into the query to force Oracle to evaluate the predicate that ensures that all the non-numeric data is filtered out before the to_number predicate is applied. But in general, it's a bit challenging to fully hint a query in a way that will force the optimizer to always evaluate things in the "proper" order.
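The hazard described above is easy to reproduce outside Oracle: any pipeline that may apply a failing conversion before the filter that guards it has the same problem, and the "derived column" fix amounts to making the conversion total. A Python sketch (the rows and helper names are invented for illustration):

```python
# Rows of (product_value, product_name); one product_value cell holds text.
rows = [("123", "prototype widget"),
        ("abc", "legacy part"),
        ("456", "prototype gadget")]

def to_number(value):
    # Raises ValueError on non-numeric input, like Oracle's to_number.
    return int(value)

# Filter first, convert second: safe, mirroring the plan that happens to work.
safe = [to_number(v) for v, name in rows if "prototype" in name]

# Convert first, filter second: blows up on 'abc', mirroring the failing plan.
try:
    unsafe = [n for n, name in ((to_number(v), name) for v, name in rows)
              if "prototype" in name]
except ValueError:
    unsafe = None

# The robust fix: a total conversion (the "derived column" suggested earlier)
# that yields None instead of raising, so evaluation order no longer matters.
def to_number_or_none(value):
    try:
        return int(value)
    except ValueError:
        return None

derived = [to_number_or_none(v) for v, _ in rows]
print(safe, unsafe, derived)  # [123, 456] None [123, None, 456]
```

Since the optimizer, not you, chooses the evaluation order, only the total-conversion approach is reliably safe.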