Large table joins small table OOM - apache-spark-sql

I have a large table, say a, with about 100 million rows. Keys may have duplicates, and some keys may appear in over 1 million rows. Another table, say b, has about 20 thousand rows, and its keys are all unique. Although b has few rows, each row has a large field, about 1 MB on average and 10 MB at most.
I am joining a with table b on a.key = b.key. However, it always fails with OOM. What can I do to optimize the join? Please help.
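For reference, a minimal Spark SQL sketch of the join described, assuming both tables are registered as a and b with a join column named key (names taken from the question); the SELECT list is illustrative only:
-- a: ~100M rows, duplicate and skewed keys; b: ~20K rows, unique keys, 1-10 MB fields
SELECT a.*, b.*
FROM a
JOIN b ON a.key = b.key;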

Related

Nested LEFT OUTER JOINs take too long compared to a programmatic solution

I have an SQL query generated by Entity Framework that contains a two-level nested LEFT OUTER JOIN on tables TableA, TableB, TableC1 and TableC2, with the following foreign key relations:
TableA -> TableB -> TableC1
TableB -> TableC2
TableA contains 1000 rows, and all other tables contain around 100,000 rows.
The SQL statement looks something like this:
select * from TableA A
LEFT JOIN TableB B on A.Id = B.TableAId
LEFT JOIN TableC1 C1 on B.Id = C1.TableBId
LEFT JOIN TableC2 C2 on B.Id = C2.TableBId
When SQL is executed on Microsoft SQL Server, it takes around 30 seconds.
However, if I do a select of each table and retrieve the rows as lists in memory and join them programmatically in C#, it takes 3 seconds or so.
Can anyone give any indication of why SQL Server is so slow?
Thanks
When you use JOINs in SQL, you create something akin to a Cartesian product: each parent row is repeated for every matching child row.
For example, if I have 2 tables, Table A and Table B, Table A has 10 rows and each row has 10 Table B references:
If I load these 2 tables separately, I am loading 110 rows. 10 rows from table A, and 100 rows from table B.
If I JOIN these tables I am loading 100 rows; however, each of those hundred rows represents the combined data of both tables. If Table A has 10 columns and Table B has 20 columns, the total data read loading the tables separately would be 10x10 + 100x20, or 2,100 columns worth of data. With a JOIN, I am loading 100x30, or 3,000 columns worth of data. That's not a huge difference, but it compounds as I join more tables.
If each Table B row has a Table C with an average of 5 rows and 10 columns, loading separately would add 5,000 (500x10), for a total of 7,100 columns worth of data. When joined, each of the 100 A-B rows is repeated for its 5 C rows, giving 500 rows of 40 columns, or 20,000 columns worth of total data being loaded into memory or sifted through. You should see how this can and does snowball quite quickly if you start doing SELECT * FROM ... with Joins.
When EF builds a query that loads entity graphs (related entities), the resulting query will often use JOINs, producing these Cartesian-style results. EF loads them, then sifts through the flattened rows to build the object graph, condensing the results back down into the 10 A's with 10 B's and 5 C's each, etc. But it still takes memory and time to chew through all of that flattened data. EF Core offers query splitting, which executes more like your in-memory comparison: the related tables are loaded separately and pieced together, greatly reducing the total amount of data being read.
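As a rough illustration (not the exact SQL EF Core emits), split queries replace the single joined statement with one statement per table, and the client stitches the rows back together by key:
-- load the parents once
SELECT * FROM TableA;
-- load each child table separately, filtered to the parents already read
SELECT * FROM TableB WHERE TableAId IN (SELECT Id FROM TableA);
SELECT * FROM TableC1 WHERE TableBId IN (SELECT Id FROM TableB);
SELECT * FROM TableC2 WHERE TableBId IN (SELECT Id FROM TableB);
Each parent row is read once instead of being duplicated into every joined child row.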
Ultimately, to improve the performance of queries generated by EF:
Use Select or AutoMapper's ProjectTo to select just the values you need from the related tables, rather than eager-loading entities with Include, when reading "sets" of entities such as search results. Load entities with Include only for single entities, e.g. when updating one.
When querying the above data, inspect the execution plans for index suggestions (a sketch follows this list).
If you do need to load large amounts of related data, consider using query splitting.
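For the execution-plan point, the usual suggestion for this shape of query is an index on each foreign key column. A hedged sketch (index names are made up; the actual plan may also suggest included columns):
CREATE INDEX IX_TableB_TableAId ON TableB (TableAId);
CREATE INDEX IX_TableC1_TableBId ON TableC1 (TableBId);
CREATE INDEX IX_TableC2_TableBId ON TableC2 (TableBId);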

How to improve SQL query in Spark when updating table? ('NOT IN' in subquery)

I have a DataFrame in Spark which is registered as a table called A and has 1 billion records and 10 columns. The first column (ID) is the primary key.
I also have another DataFrame which is registered as a table called B and has 10,000 records and 10 columns (the same columns as table A; the first column (ID) is the primary key).
Records in Table B are 'Update records'. So I need to update all 10,000 records in table A with records in table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B) and then to union that with table B. The approach is OK, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up the first query with a LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes WAY too much memory for the Spark containers (Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead).
What would be a better/faster/less memory-consuming approach?
I would go with the LEFT JOIN too, rather than NOT IN.
A couple of pieces of advice to reduce memory requirements and improve performance:
Check whether the large table is uniformly distributed by the join key (ID). If not, some tasks will be heavily burdened and others barely busy, which causes serious slowness. Do a groupBy on ID and a count to measure this.
If the join key is naturally skewed, add more columns to the join condition while keeping the result the same. More columns may increase the chance of shuffling the data uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data per task in an executor. Reducing either or both will reduce memory pressure; it will obviously run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions on the data. Say you have 10 partitions for 1B rows; make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that causes inefficiency; 10K partitions may be a bad idea.
There are some parameters to be tweaked, which are explained here.
The small table, having only 10K rows, should be broadcast automatically because it is small. If not, you can increase the broadcast threshold and apply a broadcast hint, as in the sketch below.
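Putting that advice into Spark SQL, a sketch (assuming the tables are registered as A and B with the ID column from the question; the threshold value is an example, not a recommendation):
-- 1. measure skew of the join key on the large table
SELECT ID, COUNT(*) AS cnt FROM A GROUP BY ID ORDER BY cnt DESC LIMIT 20;
-- 2. the default auto-broadcast threshold is ~10 MB; raise it if B's 10K rows exceed that
SET spark.sql.autoBroadcastJoinThreshold=104857600;
-- 3. anti join (rows of A with no match in B) with an explicit broadcast hint,
--    then union the update records from B back in
SELECT /*+ BROADCAST(B) */ A.*
FROM A LEFT ANTI JOIN B ON A.ID = B.ID
UNION ALL
SELECT * FROM B;
LEFT ANTI JOIN expresses "rows of A whose ID is not in B" directly, which is what the LEFT JOIN ... WHERE B.ID IS NULL form was emulating. On newer Spark versions a REPARTITION hint (or DataFrame.repartition) can also be used to raise the partition count as suggested above.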

Which is faster: querying with criteria in one shot, or subsetting a large table into a smaller table and then applying the criteria?

I have a large table (TB size, ~10 billion rows, ~100 million IDs).
I want to run a query to get counts for some specific IDs (say, 100k IDs). The list of needed IDs is in another table.
I know that I can run a join query to get the results, but it is extremely time-consuming (~5 days of processing).
I am wondering: if I break the script into 2 phases (1: subset the whole table based on just the IDs; 2: apply the selection criteria on the subset table), will it make any performance improvement?
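A sketch of the two-phase idea being described, with placeholder names (big_t, id_list and the some_flag criterion are hypothetical, and CREATE TABLE AS is assumed to be available):
-- phase 1: subset the big table down to just the needed IDs
CREATE TABLE subset_t AS
SELECT t.* FROM big_t t JOIN id_list l ON t.id = l.id;
-- phase 2: apply the selection criteria and count on the much smaller subset
SELECT t.id, COUNT(*) FROM subset_t t WHERE t.some_flag = 1 GROUP BY t.id;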

Oracle process getting slower over time

I am using an Oracle database and what I do is:
1. Take 1 record from table A (table A has a column P; let's say its values are x, y, z).
2. Put that record into table B, C, or D according to the value x, y, or z (if P = x then put the record into table B, if P = y then put it into table C, ...).
3. Delete that record from A once it has been inserted into B, C, or D.
Note: the size of A is about 200 million, B is about 170 million, C about 20 million and D about 10 million. The size of A is decreasing while the others stay the same (if a parameter of an A record is negative, it is not inserted into B, C, or D because it already exists in those tables, so it is just deleted from A). So there is no size change for B, C, or D; only the size of A decreases over time.
The problem is that at the beginning everything works nicely, but over time it becomes extremely slow. At first it performs approximately 40 insert+delete operations per second, but over time it drops to 1 insert+delete every 3 seconds.
All tables have indexes on the corresponding columns.
Parallel runs exist, but there is no locking.
Table sizes are approximately 60 million records.
What other effects could make it slow down over time, if there is no locking and no table size increase?
Note: it is not different processes; within the same process, when I click "execute query" it starts very fast but then becomes extremely slow.
Taking 200 million records from a staging table and inserting them into permanent tables in a single transaction is ambitious. It would be useful if you had a scheme for dividing the records from table A into chunks which could be processed in discrete batches.
Without seeing your code it's hard to tell, but I have a suspicion you are attempting this row-by-agonizing-row (RBAR) rather than with a more efficient set-based approach. I think the key here is to de-couple the insertions from clearing down table A. Insert all the records, then zap A at your leisure. Something like this:
insert all
when p = 'X' then into b
when p = 'Y' then into c
when p = 'Z' then into d
select * from a;
truncate table a;
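If a single transaction over all 200 million rows is too ambitious, a chunked variant of the same idea might look like the sketch below (assuming A has a numeric key column id, which the question does not confirm; :lo and :hi are bind variables you would advance in a loop):
-- process one id range per transaction, then move on to the next range
insert all
when p = 'X' then into b
when p = 'Y' then into c
when p = 'Z' then into d
select * from a where id between :lo and :hi;
delete from a where id between :lo and :hi;
commit;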

INNER JOINs with WHERE on the joined table

Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once with an "actual # of rows" of 2, right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... so if the WHERE returns 1000 rows, the "actual # of rows" of B will be 1000...
I don't get it.. shouldn't it be <= 2???
What am I missing... why does the optimiser proceed that way?
(SQL 2008)
Thanks
The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.
It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
'course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
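If B is missing those indexes, a hedged sketch of what would typically help (the join column of B is elided in the question as [....], so its name is unknown here):
-- supports the WHERE B.Xyz > 10 filter
CREATE INDEX IX_B_Xyz ON B (Xyz);
-- plus a similar index on whatever column of B appears in the ON clause,
-- so the 2 rows from A can seek into B instead of forcing a scan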
EDIT: Is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application where we're joining small tables (<100 rows) to large ones (>100M rows) to great effect.
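As an illustration of that unrolling, assuming (hypothetically; the question elides the join condition) that B's join column is called AId and A's two rows have Ids 1 and 2:
-- the two A keys inlined as literals, so no JOIN against A is needed at all
SELECT * FROM B WHERE B.AId IN (1, 2) AND B.Xyz > 10;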