Creating a view from a JOIN of two massive tables - sql

Context
I have a big table, say table_A, with roughly 20 billion rows and 600 columns. I don't own this table but I can read from it.
From a fraction of these columns I derive about 50 extra columns, which I store in a separate table, say table_B, which is therefore roughly 20 bn x 50 in size.
Now I need to expose the join of table_A and table_B to users, which I tried as
CREATE VIEW table_AB
AS
SELECT *
FROM table_A AS ta
LEFT JOIN table_B AS tb ON (ta.tec_key = tb.tec_key)
The problem is that even a simple query like SELECT * FROM table_AB LIMIT 2 fails because of memory issues: apparently Impala attempts to do the full join in memory first, which would result in a table of 0.5 petabytes. Hence the failure.
Question
What is the best way to create such a view?
How can one instruct SQL that, e.g., filtering operations on table_AB are to be executed before the join?
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
I have also tried [...] SELECT STRAIGHT_JOIN * [...] but it did not help.

What is the best way to create such a view?
Since both tables are huge, there will be memory problems. Here are some points I would recommend:
Assuming table_A and table_B share the same tec_key values, do an inner join.
Keep the (smaller) table_B as the driver: create the view as select ... from b join a on .... Impala keeps the driver table in memory, so it will require less memory (see the sketch after this list).
Select only the columns that are required; do not select all.
Put the filter inside the view.
Partition table_B if you can, on some date/year/region/anything that distributes the data evenly.
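Putting those points together, the view could look roughly like this sketch (the selected columns and the filter are placeholders; only tec_key comes from the question):
CREATE VIEW table_AB AS
SELECT tb.tec_key,
       ta.col_1, ta.col_2,              -- only the columns users actually need
       tb.extra_col_1, tb.extra_col_2
FROM table_B tb                          -- smaller table first, as the driver
JOIN table_A ta
  ON ta.tec_key = tb.tec_key
WHERE ta.some_date_col >= '2020-01-01';  -- filter baked into the view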
How can one instruct SQL that, e.g., filtering operations on table_AB are to be executed before the join?
You cannot guarantee whether a filter is applied before or after the join. The only way to be sure a filter improves performance is to have a partition on the filter column. Otherwise, you can try filtering first and then joining, to see if it improves performance, like this:
select ... from b
join ( select ... from a where region='Asia') a on ... -- won't improve much
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
Completely agree on this. Multiple smaller tables are way better than one giant table with 600 columns. So, create a few staging tables with only the required fields and then enrich that data. It's a difficult data set, but no one will change 20 bn rows every day, so some sort of incremental load is also possible to implement.
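As a rough illustration of the incremental idea (the staging table table_B_stg, the partition column load_date and the change-tracking column update_ts are all hypothetical), each run would append only the rows that changed since the last load:
INSERT INTO table_B_stg PARTITION (load_date = '2024-01-01')
SELECT ta.tec_key, ta.col_1, ta.col_2
FROM table_A ta
WHERE ta.update_ts >= '2024-01-01';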

Related

Nested LEFT OUTER JOINS takes too long compared to programmatical solution

I have an SQL query generated by Entity Framework that contains a two level nested LEFT OUTER JOIN on tables TableA, TableB and TableC1 and TableC2 with the following foreign key relations as the arrows indicate.
TableA -> TableB -> TableC1
TableA -> TableB -> TableC2
TableA contains 1000 rows, and all other tables contain around 100000 rows
The SQL statement looks something like this:
select * from TableA A
LEFT JOIN TableB B on A.Id = B.TableAId
LEFT JOIN TableC1 C1 on B.Id = C1.TableBId
LEFT JOIN TableC2 C2 on B.Id = C2.TableBId
When SQL is executed on Microsoft SQL Server, it takes around 30 seconds.
However, if I do a select of each table and retrieve the rows as lists in memory and join them programmatically in C#, it takes 3 seconds or so.
Can anyone give any indication of why SQL Server is so slow?
Thanks
When you use joins in SQL you create something akin to a Cartesian product: each parent row is repeated for every matching child row.
For example, if I have 2 tables, Table A and Table B, Table A has 10 rows and each row has 10 Table B references:
If I load these 2 tables separately, I am loading 110 rows. 10 rows from table A, and 100 rows from table B.
If I JOIN these tables I am loading 100 rows; however, those hundred rows each represent the combined data of both tables. If Table A has 10 columns, and Table B has 20 columns, the total data read loading the tables separately would be 10x10 + 100x20, or 2,100 columns worth of data. With a JOIN, I am loading 100x30, or 3,000 columns worth of data. That's not a huge difference, but it compounds as I join more tables.
If each Table B row has a Table C with an average of 5 rows and 10 columns, loaded separately that would add 5,000 (500x10), for a total of 7,600 columns worth of data. When joined, the 100 rows each expand to 5 rows of 40 columns (10 + 20 + 10), so 500x40 or 20,000 columns worth of total data being loaded into memory or sifted through. You should see how this can and does snowball quite quickly if you start doing SELECT * FROM ... with joins.
When EF builds a query that loads entity graphs (related entities), the resulting query will often use JOINs, producing these Cartesian results that it loads and then sifts through to build the resulting object graph, condensing the results back down into the 10 A's with 10 B's and 5 C's, etc. It still takes memory and time to chew through all of that flattened result data. EF Core offers query splitting to execute essentially what your counter-comparison did: loading the related tables separately and piecing them together, greatly reducing the total amount of data being read.
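To give a rough idea of what split queries do, the single joined statement is replaced by several smaller ones, something like the sketch below (EF Core's actual generated SQL differs, e.g. it adds ordering and key columns to correlate the results):
SELECT A.* FROM TableA A;

SELECT B.* FROM TableB B
INNER JOIN TableA A ON A.Id = B.TableAId;

SELECT C1.* FROM TableC1 C1
INNER JOIN TableB B ON B.Id = C1.TableBId
INNER JOIN TableA A ON A.Id = B.TableAId;

SELECT C2.* FROM TableC2 C2
INNER JOIN TableB B ON B.Id = C2.TableBId
INNER JOIN TableA A ON A.Id = B.TableAId;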
Ultimately to improve performance of queries generated by EF:
Use Select or AutoMapper's ProjectTo to select just the values you need from the related tables, rather than loading entities with Include to eager-load related entities, when reading "sets" of entities such as search results. Load entities with Include for single entities, such as when updating one.
When querying the above data, inspect the execution plans for index suggestions.
If you do need to load large amounts of related data, consider using query splitting.

Extract data from view ORACLE performance

Hello, I created a view to make a subquery (a select from two tables).
This is the SQL statement:
CREATE OR REPLACE VIEW EMPLOYEER_VIEW
AS
SELECT A.ID,A.FIRST_NAME||' '||A.LAST_NAME AS NAME,B.COMPANY_NAME
FROM EMPLOY A, COMPANY B
WHERE A.COMPANY_ID=B.COMPANY_ID
AND A.DEPARTEMENT !='DEP_004'
ORDER BY A.ID;
If I select data from EMPLOYEER_VIEW, the average execution time is 135,953 s.
Table EMPLOY contains 124,600,329 rows.
Table COMPANY contains 609 rows.
My question is:
How can I make the execution faster?
I created two indexes:
emply_index (ID,COMPANY_ID,DEPARTEMENT)
and company_index(COMPANY_ID)
Can you help me make the selections run faster? (By creating another index or changing the join.)
PS: I Can't create a materialized view in this database.
Thanks in advance for your help.
You have a lot of things to do.
If you must work with a view and cannot create a scheduled job to insert the data into a table, I will remove my answer.
VIEWs do not have the scope to support hundreds of millions of rows; they are for a few million.
INDEXes must be maintained while data is being inserted. If you insert data with an index in place, the process is 100 times slower. (You can drop and re-create them, or rebuild them.)
Create partitions on the company (COMPANY_ID) column (see the sketch below).
If you have a lot of IDs, use RANGE partitioning.
If you have around 100 IDs, use LIST partitioning.
You do not need an index here, because the JOIN clause is not optimized by it; indexes are meant for strict WHERE clauses.
We had a project with 433,000,000 rows, and the only way to make it work was playing with partitions.
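A minimal sketch of what list partitioning could look like here (column definitions and partition values are placeholders; in practice it is the large EMPLOY table that is worth partitioning by COMPANY_ID):
CREATE TABLE EMPLOY_PART (
    ID          NUMBER,
    FIRST_NAME  VARCHAR2(100),
    LAST_NAME   VARCHAR2(100),
    DEPARTEMENT VARCHAR2(30),
    COMPANY_ID  NUMBER
)
PARTITION BY LIST (COMPANY_ID) (
    PARTITION p_c001  VALUES (1),
    PARTITION p_c002  VALUES (2),
    PARTITION p_other VALUES (DEFAULT)
);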

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is: is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong"?
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome its peculiar locking behaviour), they are far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate results on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, create them at least as temporary or unlogged tables to minimize the I/O overhead that comes with writing that data, and thus keep as much data in memory as possible.
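For example, the first intermediate table from the question could be created as an unlogged or temporary table instead; the query itself stays the same:
CREATE UNLOGGED TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;

-- or, session-local and cleaned up automatically (temporary tables live in pg_temp, not gm):
CREATE TEMPORARY TABLE race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;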
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions, Postgres will automatically store the data on disk if it gets too big; if not, it will process it in memory. When manually creating the tables, you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do, however, suggest using (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

Comparing two partition's data in hive

I have 9 million records in each of my partitions in Hive, and I have two partitions. The table has 20 columns. Now I want to compare the data sets between the partitions based on an id column. What is the best way to do it, considering the fact that a self join with 9 million records will create performance issues?
Can you try the SMB (sort-merge-bucket) join? It is mostly like merging two sorted lists. However, in this case you will need to create two more tables (see the sketch below).
Another option would be to write a UDF to do the same, but that would be a project by itself. The first option is easier.
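A rough sketch of that first option, assuming the join key is id and using hypothetical staging tables part_x and part_y for the two partitions (my_table and partition_col are placeholder names, the column list is abbreviated, and the bucket count is arbitrary):
SET hive.enforce.bucketing = true;

CREATE TABLE part_x (id BIGINT, col1 STRING)          -- remaining columns omitted
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
INSERT OVERWRITE TABLE part_x
SELECT id, col1 FROM my_table WHERE partition_col = 'x';

-- repeat for part_y with partition_col = 'y', then enable the SMB join:
SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT a.id, b.id
FROM part_x a
FULL OUTER JOIN part_y b ON a.id = b.id;
Because both staging tables are bucketed and sorted on the join key with the same bucket count, Hive can merge matching buckets directly instead of shuffling all 9 million rows.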
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
-- filter each side before joining; a WHERE clause applied after a full outer join would drop the unmatched rows
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id
To do a full comparison of 2 tables (or of 2 partitions of the same table), my experience has shown me that using some checksum mechanism is a more effective and reliable solution than joining the tables (which gives performance problems, as you mentioned, and also causes difficulties when keys are repeated, for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
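If you want to stay in plain HiveQL, the checksum idea can be approximated per partition with something like the sketch below (my_table, partition_col, id and col1 are placeholder names, the column list is abbreviated, and the linked tool is more thorough than this):
SELECT partition_col,
       count(*) AS row_cnt,
       sum(cast(hash(concat_ws('|', cast(id AS string), cast(col1 AS string))) AS bigint)) AS chk
FROM my_table
WHERE partition_col IN ('x', 'y')
GROUP BY partition_col;
Equal row counts and checksums for both partitions suggest identical data; differing values point to a mismatch that needs a closer look.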

Slow self-join delete query

Does it get any simpler than this query?
delete a.* from matches a
inner join matches b ON (a.uid = b.matcheduid)
Yes, apparently it does... because the performance on the above query is really bad when the matches table is very large.
matches is about 220 million records. I am hoping that this DELETE query takes the size down to about 15,000 records. How can I improve the performance of the query? I have indexes on both columns. UID and MatchedUID are the only two columns in this InnoDB table, both are of type INT(10) unsigned. The query has been running for over 14 hours on my laptop (i7 processor).
Deleting so many records can take a while; I think this is as fast as it can get if you're doing it this way. If you don't want to invest in faster hardware, I suggest another approach:
If you really want to delete 220 million records so that the table then has only 15,000 records left, that's about 99.99% of all entries. Why not:
Create a new table,
just insert all the records you want to survive,
and replace your old one with the new one?
Something like this might work a little bit faster:
/* creating the new table */
CREATE TABLE matches_new
SELECT a.* FROM matches a
LEFT JOIN matches b ON (a.uid = b.matcheduid)
WHERE ISNULL(b.matcheduid);
/* renaming tables */
RENAME TABLE matches TO matches_old;
RENAME TABLE matches_new TO matches;
After this you just have to check and create your desired indexes, which should be rather fast when dealing with only 15,000 records.
Running EXPLAIN SELECT a.* FROM matches a INNER JOIN matches b ON (a.uid = b.matcheduid) would show how your indexes are present and being used.
I might be setting myself up to be roasted here, but in performing a delete operation like this in the midst of a self-join, isn't the query having to recompute the join index after each deletion?
While it is clunky and brute force, you might consider either:
A. Create a temp table to store the uids resulting from the inner join, then join to THAT, THEN perform the delete (see the sketch after this list).
OR
B. Add a boolean (bit) typed column, use the join to flag each match (this operation should be FAST), and THEN use:
DELETE FROM matches WHERE YourBitFlagColumn = TRUE
Then delete the boolean column.
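Option A could be sketched roughly like this (the temporary table name is made up; uid and matcheduid come from the question):
/* stage the uids to remove, then delete by joining to that staging table */
CREATE TEMPORARY TABLE uids_to_delete (uid INT UNSIGNED NOT NULL PRIMARY KEY);

INSERT INTO uids_to_delete
SELECT DISTINCT a.uid
FROM matches a
INNER JOIN matches b ON a.uid = b.matcheduid;

DELETE m
FROM matches m
INNER JOIN uids_to_delete d ON m.uid = d.uid;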
You probably need to batch your delete. You can do this with a recursive delete using a common table expression, or just iterate it with some batch size.
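One way to batch it without CTEs is sketched below (the batch size is arbitrary; the derived table works around MySQL's restriction on selecting from the table being deleted from):
/* delete in chunks; rerun until no rows are affected */
DELETE FROM matches
WHERE uid IN (
    SELECT uid FROM (
        SELECT a.uid
        FROM matches a
        INNER JOIN matches b ON a.uid = b.matcheduid
        LIMIT 10000
    ) AS batch
);
Each run removes at most 10,000 rows, which keeps individual transactions short and lets you monitor progress.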