I’m trying to join two tables in BigQuery but it hits the 6 hour query execution time limit. For context:
The primary table is ~150M rows; the joined table is ~250M rows.
It’s a simple SELECT * FROM a LEFT JOIN b with equality comparisons on three columns, combined with ORs. The three columns are strings roughly 10-80 characters long.
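Roughly, the query has this shape (the column names below are placeholders for the three string columns, not the real schema):

-- BigQuery Standard SQL; key1/key2/key3 stand in for the real join columns
SELECT *
FROM dataset.a AS a
LEFT JOIN dataset.b AS b
  ON a.key1 = b.key1
  OR a.key2 = b.key2
  OR a.key3 = b.key3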
The same query with the same schemas for those tables, but with a sample of only 100K rows each, works quickly and without issue.
Tried: Running with fewer join clauses (3 instead of 9), enabling large results
Expected: BQ to render a result, eventually
Actual: Timeout
Suggestions appreciated!
I have an SQL query generated by Entity Framework that contains a two-level nested LEFT OUTER JOIN across tables TableA, TableB, TableC1, and TableC2, with the foreign key relations the arrows below indicate.
TableA -> TableB -> TableC1
                 -> TableC2
TableA contains 1000 rows, and each of the other tables contains around 100,000 rows.
The SQL statement looks something like this:
SELECT * FROM TableA A
LEFT JOIN TableB B ON A.Id = B.TableAId
LEFT JOIN TableC1 C1 ON B.Id = C1.TableBId
LEFT JOIN TableC2 C2 ON B.Id = C2.TableBId
When this SQL is executed on Microsoft SQL Server, it takes around 30 seconds.
However, if I select from each table separately, retrieve the rows as lists in memory, and join them programmatically in C#, it takes around 3 seconds.
Can anyone give any indication of why SQL Server is so slow?
Thanks
When you use joins in SQL you create something called a Cartesian product: every matching combination of rows from the joined tables becomes a row in the result.
For example, say I have 2 tables, Table A and Table B, where Table A has 10 rows and each of those rows has 10 related Table B rows:
If I load these 2 tables separately, I am loading 110 rows: 10 rows from Table A and 100 rows from Table B.
If I JOIN these tables I am loading 100 rows; however, each of those hundred rows represents the combined data of both tables. If Table A has 10 columns and Table B has 20 columns, the total data read loading the tables separately would be 10x10 + 100x20, or 2100 columns worth of data. With a JOIN, I am loading 100x30, or 3000 columns worth of data. That's not a huge difference, but it compounds as I join more tables.
If each Table B row in turn has related Table C rows, averaging 5 rows of 10 columns each, loading the tables separately would add 5000 (500x10) more, for a total of 7100 columns worth of data. When joined, the flattened result becomes 500 rows (10x10x5) of 40 columns each, or 20,000 columns worth of data being loaded into memory and sifted through. You should see how this can and does snowball quite quickly if you start doing SELECT * FROM ... with JOINs.
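To make the flattened shape concrete, this is roughly the kind of query being described (the names mirror the question; the row and column counts are the hypothetical ones above):

-- returns 10 x 10 x 5 = 500 rows, each 10 + 20 + 10 = 40 columns wide,
-- i.e. about 20,000 column values instead of the 7100 read when loading the tables separately
SELECT *
FROM TableA A
LEFT JOIN TableB B ON B.TableAId = A.Id
LEFT JOIN TableC1 C1 ON C1.TableBId = B.Id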
When EF builds a query that loads entity graphs (related entities), the resulting SQL will often use JOINs, producing these Cartesian results. EF loads that flattened data, then sifts through it to build the object graph, condensing the results back down into the 10 A's with their 10 B's and 5 C's, etc. But it still takes memory and time to chew through all of that flattened data. EF Core offers query splitting, which executes more like your counter-comparison: the related tables are loaded separately and pieced together, greatly reducing the total amount of data being read.
Ultimately, to improve the performance of queries generated by EF:
Use Select or AutoMapper's ProjectTo to select just the values you need from the related tables, rather than loading entities with Include to eager-load them, when reading "sets" of entities such as search results. Load entities with Include for single entities, such as when updating one.
When querying the above data, inspect the execution plans for index suggestions.
If you do need to load large amounts of related data, consider using query splitting.
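As a rough illustration, with query splitting the load happens as separate statements along these lines (a sketch only; the exact SQL EF Core emits will differ):

-- parents are read once
SELECT * FROM TableA;

-- children are fetched in separate round trips, keyed back to the parents,
-- instead of multiplying the parent rows into one flattened result
SELECT B.*
FROM TableA A
INNER JOIN TableB B ON B.TableAId = A.Id;

SELECT C1.*
FROM TableA A
INNER JOIN TableB B ON B.TableAId = A.Id
INNER JOIN TableC1 C1 ON C1.TableBId = B.Id;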
When I execute this it returns more than a million rows; the first table has 315,000 rows and the second about 14,000. What should I do to get all rows from both tables? Also, if I don't stop the server, it breaks down while listing the unexpected rows.
SELECT *
FROM tblNormativiIspratnica
INNER JOIN tblNormativiSubIspratnica
  ON tblNormativiIspratnica.ZaklucokBroj = tblNormativiSubIspratnica.ZaklucokBroj
If the first table has 315,000 rows and the second one has 14,000 rows, then the fields you are joining on do not form a good primary-key/foreign-key relationship, with the result that you are getting Cartesian-product duplicates. If you want a well-defined result, you must have well-defined fields that serve those purposes. By the way, take care with the server breakdowns: do not run queries that fetch large result sets if you have no idea what you are doing. Quickly look up the basics and understand the design by writing simple queries with more specific criteria that fetch small result sets, before you run large ones.
If you did not have the issues of server performance and breakdowns, I would have suggested DISTINCT as a (poor) quick fix for getting unique rows, but remember that DISTINCT queries can, at times, take a performance toll.
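To see where the extra rows come from, you can check how many times each join key value repeats on each side, for example:

-- key values that appear more than once here will multiply the matching rows in the join
SELECT ZaklucokBroj, COUNT(*) AS cnt
FROM tblNormativiSubIspratnica
GROUP BY ZaklucokBroj
HAVING COUNT(*) > 1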
I'm using a query which brings ~74 fields from different database tables.
The query consists of 10 FULL JOINS and 15 LEFT JOINS.
Each join condition is based on different fields.
The query fetches data from a main table which contains almost 90% foreign keys.
I'm using those joins to resolve the foreign keys, but some of the data doesn't require all of those joins, because that type of data logically doesn't use that information.
Let me give an example:
Each employee can have multiple Tasks. There are four types of tasks (1, 2, 3, 4).
Each TaskType has a different meaning. When running the query, I'm getting data for all those task types and then doing some logic to show them separately.
My question is: would it be better to use UNION ALL and split the query into the 4 different cases? That way I could use only the required joins for each case in each branch of the union.
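Roughly, the split I have in mind looks like this (table and column names are placeholders; the real query has far more joins):

-- branch for task type 1, with only the joins that case needs
SELECT e.Id, e.Name, t.TaskData, x.ExtraInfo
FROM Employee e
INNER JOIN Task t ON t.EmployeeId = e.Id AND t.TaskType = 1
LEFT JOIN TableX x ON x.TaskId = t.Id

UNION ALL

-- branch for task type 2, joining a different table
SELECT e.Id, e.Name, t.TaskData, y.OtherInfo
FROM Employee e
INNER JOIN Task t ON t.EmployeeId = e.Id AND t.TaskType = 2
LEFT JOIN TableY y ON y.TaskId = t.Id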
Thanks,
I would think it depends strongly on the size (row count) of the main table and of, e.g., the task tables.
Say your main table has tens of millions of rows and the task tables are smaller: then a UNION over all task types will scan the large main table once per branch, whereas a single query joining the smaller task tables can do it with one scan.
I have two tables having 50 million unique rows each.
Row number from one table corresponds to row number in the second table.
i.e. the 1st row in the first table joins with the 1st row in the second table, the 2nd row in the first table joins with the 2nd row in the second table, and so on. Doing an inner join is costly.
It takes more than 5 hours on clusters. Is there an efficient way to do this in SQL?
To start with: tables are just sets, so the row number of a record can be considered pure coincidence. You must not join two tables based on row numbers; you would join on IDs rather than on row numbers.
There is nothing more efficient than a simple inner join. As the whole tables must be read, you might not even gain anything from indexes (but as we are talking about IDs, there will be indexes anyhow, so nothing to ponder there).
Depending on the DBMS you may be able to parallelize the query. In Oracle for example you would use a hint such as /*+ parallel( tablename , parallel_factor ) */.
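Applied to your case it would look something like this (table names, column name, and the degree of parallelism are placeholders):

-- Oracle: read and join both tables with a parallel degree of 8
SELECT /*+ PARALLEL(t1, 8) PARALLEL(t2, 8) */
       t1.*, t2.*
FROM table1 t1
INNER JOIN table2 t2 ON t2.id = t1.id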
Try sorting both tables first (if they aren't already sorted), then use a normal SELECT on both tables (maybe with LIMIT to get the data part by part) and connect the data line by line wherever you want.
I have 9 million records in each of my partitions in Hive, and I have two partitions. The table has 20 columns. Now I want to compare the datasets between the partitions based on an id column. What is the best way to do this, considering that a self join with 9 million records will create performance issues?
Can you try an SMB (sort-merge-bucket) join? It's mostly like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same - that would be a project by itself. The first option is easier.
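A rough sketch of what the extra tables and the join could look like (table names, column list, bucket count, and the exact configuration properties are illustrative and should be checked against your Hive version):

-- bucketed + sorted copy of each partition (column list shortened; the real table has 20 columns)
CREATE TABLE part_x_sorted (id STRING, col1 STRING, col2 STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
CREATE TABLE part_y_sorted (id STRING, col1 STRING, col2 STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

-- older Hive versions need these while loading so the buckets really are sorted
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
INSERT OVERWRITE TABLE part_x_sorted
SELECT id, col1, col2 FROM my_table WHERE part_col = 'x';
INSERT OVERWRITE TABLE part_y_sorted
SELECT id, col1, col2 FROM my_table WHERE part_col = 'y';

-- ask Hive to convert the join into a sort-merge-bucket join
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SELECT x.id, y.id
FROM part_x_sorted x
FULL OUTER JOIN part_y_sorted y ON x.id = y.id;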
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually compute the full Cartesian product.
-- filter each partition before the full outer join, so ids present in only one
-- partition are kept instead of being removed by a WHERE clause applied after the join
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id
To do a full comparison of 2 tables (or of 2 partitions of the same table), my experience has shown that using some checksum mechanism is a more effective and reliable solution than joining the tables (which gives the performance problems you mentioned, and also some difficulties when keys are repeated, for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns) and shows you the differences that appear in a webpage: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same table, and use the "--source-where" and "--destination-where" options to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
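If you prefer to hand-roll a simpler version of the same idea, a per-partition checksum can tell you quickly whether the partitions differ at all before you drill down into individual ids (column names below are placeholders; this is only an illustration, not how the linked tool works internally):

-- one row per partition: row count plus an order-independent checksum over the compared columns
SELECT part_col,
       COUNT(*) AS row_cnt,
       SUM(CAST(hash(id, col1, col2) AS BIGINT)) AS chk
FROM my_table
WHERE part_col IN ('x', 'y')
GROUP BY part_col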