Getting "Unexpected. Please try again." when doing large join in Google BigQuery - google-bigquery

Ok, I have two tables. Table A has ~60k records and 145 fields. Table B has ~1k records and 9 fields. I'm doing a join between 23 fields from A to a single field on B. Each one of these joins is a LEFT OUTER join. As such, I'm selecting all 145 fields from table A (but replacing the 23 fields from the join).
If I run the joins one-by-one, results are returned in under a second. However, If I try to run the query with all the joins in place, it runs for hours then errors out with the message: "Unexpected. Please try again."
This happens if I select one field or all the fields.
Any ideas?

Of the first 6 left joins, only 2 are used elsewhere in the query. Get rid of the unused joins for starters.
Do you have indexes in [medclient.users_table] for all of the fields you are doing the left join on? If not, then consider at least adding them while this query runs.
Add a few left joins at a time, checking your actual execution plan and see which one(s) kill it.

Related

Poor performance with stacked joins

I'm not sure I can provide enough details for an answer, but my company is having a performance issue with an older mssql view. I've narrowed it down to the right outer joins, but I'm not familiar with the structure of joins following joins without a "ON" with each one, as in the code snippet below.
How do I write the joins below to either improve performance or to the simpler format of Join Tablename on Field1 = field2 format ?
FROM dbo.tblObject AS tblObject_2
JOIN dbo.tblProspectB2B PB ON PB.Object_ID = tblObject_2.Object_ID
RIGHT OUTER JOIN dbo.tblProspectB2B_CoordinatorStatus
RIGHT OUTER JOIN dbo.tblObject
INNER JOIN dbo.vwDomain_Hierarchy
INNER JOIN dbo.tblContactUser
INNER JOIN dbo.tblProcessingFile WITH ( NOLOCK )
LEFT OUTER JOIN dbo.enumRetentionRealization AS RR ON RR.RetentionRealizationID = dbo.tblProcessingFile.RetentionLeadTypeID
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1 WITH ( NOLOCK ) ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID ON dbo.tblContactUser.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.vwDomain_Hierarchy.Object_ID = tblObject_1.Domain_ID ON dbo.tblObject.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.tblProspectB2B_CoordinatorStatus.Object_ID = dbo.tblLoan.ReferralSourceContactID ON tblObject_2.Object_ID = dbo.tblLoan.ReferralSourceContactID
Your last INNER JOIN has a number of ON statements. Per this question and answer, such syntax is equivalent to a nested subquery.
That is one of the worst queries I have ever seen. Since I cannot figure out how it is supposed to work without the underlying data, this is what I suggest to you.
First find a good sample loan and write a query against this view to return where loan_id = ... Now you have a data set you chan check you changes against more easily than the, possibly, millions of records this returns. Make sure these results make sense (that right join to tbl_objects is bothering me as it makes no sense to return all the objects records)
Now start writing your query with what you think should be the first table (I would suggest that loan is the first table, if it not then the first table is Object left joined to loan)) and the where clause for the loan id.
Check your results, did you get the same loan information as teh view query with the where clause added?
Then add each join one at a time and see how it affects the query and whether the results appear to be going off track. Once you have figured out a query that gives the same results with all the tables added in, then you can try for several other loan ids to check. Once those have checked out, then run teh whole query with no where clause and check against the view results (if it is a large number you may need to just see if teh record counts match and visually check through (use order by on both things in order to make sure your results are in the same order). In the process try to use only left joins and not that combination of right and left joins (its ok to leave teh inner ones alone).
I make it a habit in complex queries to do all the inner joins first and then the left joins. I never use right joins in production code.
Now you are ready to performance tune.
I woudl guess the right join to objects is causing a problem in that it returns teh whole table and the nature of that table name and teh other joins to the same table leads me to believe that he probably wanted a left join. Without knowing the meaning of the data, it is hard to be sure. So first if you are returning too many records for one loan id, then consider if the real problem is that as tables have grown, returning too many records has become problematic.
Also consider that you can often take teh view and replace it with code to get the same results. Views calling views are a poor technique that often leads to performance issues. Often the views on top of the other views call teh same tables and thus you end up joining to them multiple times when you don;t need to.
Check your Explain plan or Execution plan depending on what database backend you have. Analysis of this should show where you might have missing indexes.
Also make sure that every table in the query is needed. This especially true when you join to a view. The view may join to 12 other tables but you only need the data from one of them and it can join to one of your tables. MAke sure that you are not using select * but only returning teh fields the view actually needs. You have inner joins so, by definition, select * is returning fields you don't need.
If your select part of teh view has a distinct in it, then consider if you can weed down the multiple records you get that made distinct needed by changing to a derived table or adding a where clause. To see what is causing the multiples, you may need to temporarily use select * to see all the columns and find out which one is not uniques and is causing the issue.
This whole process is not going to be easy or fun. Just take it slowly, work carefully and methodically and you will get there and have a query that is understandable and maintainable in the end.

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one my query was having inconsistent results: every time I run it I have a different number of rows returned (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between each query. So I'm wondering if you also have noticed that kind of behaviour.
I usually make queries that return a lot of rows (>1000) so a few missing rows here and there is hardly noticeable. But this query return a few row, and it varies everytime between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There's a couple of things you can do to try to find the cause:
replace select * by an explicit selection of fields from your tables (a combination of fields that uniquely determine each row)
order the table by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result. After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)

SQL query, view joined to table - number of results inconsistent

Apologies in advance for the vagueness of this question, but it involves a query which is too big to describe in full, and field/table names that I can't reveal. So I'm not really expecting a solution, but if someone give some advice on how I could proceed in solving it myself, I'd be grateful.
SQL Server 2000.
I have a query which joins a view and a table with an INNER JOIN and has a WHERE clause:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
# (23 rows)
This produces 23 results (it should be 1000s). If I add another WHERE clause to the above query, I get more results, instead of less:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE' AND
view.field2=1
This gives me a few thousand results, as was originally expected. Changing the value to 2 or 3 (the only other values present) also gives me thousands of results each, but if I change it to view.field2 IN (1,2,3) I end up with only 38 results.
Going back to the original query, which gave me 23 results, if I add the table field I have in the WHERE clause to the SELECT block, I get the right number of results:
SELECT
view.join_field,
table.other_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
# (8764 rows)
If I instead use a WHERE clause of table.other_field='GG' (the only other value present in the table), none of these strange things happen, and I get the expected number of results.
If I SELECT the contents of view into a temporary table, and use that in my query, I also get the thousands of rows I was expecting.
view itself is an LEFT OUTER JOIN of another view and two other tables. table, in my query, is not involved in any of the views.
Can anyone give me even the vaguest of ideas of what's going on? Are my tables or views corrupt, somehow?

SQL Query with multiple possible joins (or condition in join)

I have a problem where I have to try to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match it even if they where primary at first, but now are secondary etc.
Here was my first attempt, I'm just counting now to get the joins and conditions down. I'll select actual data later. Basically the personal table is joined once to active accounts, and another copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no mater how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then union then, then you might have problems. What if the same record pair fulfills at least two conditions? You will have duplicates in your result then.
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query, which contains two subqueries, each of them named as t1 and t2, respectively. The first subquery joins personal and consumer and will name the result as t1. The second query will join the second occurrence of personal with uncol_acct and the where clause will be inside your second join. As described before, your query will contain two subqueries, named t1 and t2, respectively. Your query will join t1 and t2. This way you opimise, as your main query will consider only the pairing of valid t1 and t2.
Also, if your where clause is outside as in your example query, then the 4-dimensional join will be executed and only after that will the where be taken into consideration. This is why the where clause should be inside the second sub-query, so the where clause will run before the main join. Also, you can create a subquery inside the second subquery to calculate the where if the condition is fulfilled rarely.
Cheers!

Number of records discrepancy - only change is sorting

I have an Access 2003 database with a query that is a left outer join of a table to another query. If I didn't sort that final query, I got 42 records. If I sorted the final query by the 2 joined fields, I got 43 records. No other changes were made to the query.
To verify this, I took the query, copied it, applied the sort with no other changes, and the record count went up by one. Perplexed, I copied the results into Excel, sorted, and compared row by row. I discovered one record was duplicated (all fields were exactly the same), where there were actually no duplicate records in the source table and query.
I would think this is a bug, and I know there are a few in Access, but has anyone heard of this behavior before?
It is possible that you have a corrupt index. It may be worth taking a back up and then compacting and repairing the database, which should rebuild the indexes.
I had a similar problem on Access 2003 where I had duplicate records where 1 field was an autonumber which should up twice.
My query was :
SELECT qry_Tasks_with_Names.*, Location.Location, WorkRequests.Element, WorkRequests.Site_ID
FROM ((WorkRequests LEFT JOIN Location ON WorkRequests.Site_ID=Location.LocationID) INNER JOIN qry_Tasks_with_Names_by_Individual ON WorkRequests.Work_Request_ID=qry_Tasks_with_Names_by_Individual.Work_Request_ID) INNER JOIN qry_Tasks_with_Names ON WorkRequests.Work_Request_ID=qry_Tasks_with_Names.Work_Request_ID
WHERE (WorkRequests.Site_ID=Forms!TaskList!comboFilter_by_Site Or Forms!TaskList!comboFilter_by_Site Is Null) And (qry_Tasks_with_Names.Assigned_to_User_ID=forms!taskList!comboFilter_by_Person Or forms!taskList!comboFilter_by_Person Is Null) And (qry_Tasks_with_Names.Assigned_to_Team_ID=forms!taskList!comboFilter_by_Team Or forms!taskList!comboFilter_by_Team Is Null) And qry_Tasks_with_Names.Assigned_to_User_ID>0
ORDER BY qry_Tasks_with_Names.SLA_Due;
The above was initially created by the query designer, but it gets itself confused at this level and it also seems that I do.
Once I removed the inner join on qry_Tasks_with_Names_by_Individual all was OK.
No idea why, but hopefully this may save someone else some tears if they have the same problem.