JOIN VIEWs containing UNION ALL: performance disaster? - sql

Consider the following standard table/subtable relationship in a MS SQL Server DB:
PARENT
-ID (PK)
-OTHER_FIELD
CHILD
-ID (PK)
-PARENT_ID (FK)
-OTHER_FIELD
Now also consider that there exist other versions of these tables in the DB (with the same structure): PARENT_2/CHILD_2, PARENT_3/CHILD_3, PARENT_4/CHILD_4.
I want to create a standard mechanism to select data from all of the tables combined in a simple query. The first thing that came to mind is:
PARENT_VIEW = PARENT UNION ALL PARENT_2 UNION ALL PARENT_3 UNION ALL PARENT_4
CHILD_VIEW = CHILD UNION ALL CHILD_2 UNION ALL CHILD_3 UNION ALL CHILD_4
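As a rough sketch (column lists limited to the fields above), the views would be something like:
CREATE VIEW PARENT_VIEW AS
SELECT ID, OTHER_FIELD FROM PARENT
UNION ALL
SELECT ID, OTHER_FIELD FROM PARENT_2
UNION ALL
SELECT ID, OTHER_FIELD FROM PARENT_3
UNION ALL
SELECT ID, OTHER_FIELD FROM PARENT_4;
GO
CREATE VIEW CHILD_VIEW AS
SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD
UNION ALL
SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_2
UNION ALL
SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_3
UNION ALL
SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_4;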
But I'm concerned that queries containing JOINs against these views will be horrendous due to each row in the PARENT_VIEW having to scan each row in the CHILD_VIEW for each JOIN. There are actually multiple different CHILD tables (resulting in multiple JOINs) and I'll be dealing with large amounts of data in due course. I don't think this will scale.
Ideally I want to preserve the structure of the tables (as opposed to flattening them out).
Queries made against the combined tables would always be filtered using WHERE conditions.
I'm aware of the alternative approach of writing my queries for each PARENT/CHILD table separately and using UNION ALL on each result set so the JOIN would be much more efficient.
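That alternative would be shaped roughly like this (the WHERE condition is only a placeholder for whatever filter applies):
SELECT p.ID, p.OTHER_FIELD, c.OTHER_FIELD AS CHILD_OTHER_FIELD
FROM PARENT p JOIN CHILD c ON c.PARENT_ID = p.ID
WHERE p.OTHER_FIELD = @filter
UNION ALL
SELECT p.ID, p.OTHER_FIELD, c.OTHER_FIELD
FROM PARENT_2 p JOIN CHILD_2 c ON c.PARENT_ID = p.ID
WHERE p.OTHER_FIELD = @filter
-- UNION ALL the corresponding queries for the _3 and _4 pairs
Here each JOIN only has to match a PARENT table against its own CHILD table.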
I also considered a conditional join, but have not yet explored this further.
Any suggestions on how to proceed (ideally with some DBMS science to back it up)?

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables. The subscriptions table has a field called type, which can only have two values: FREE and PREMIUM.
The other two tables are called premium_users and free_users. I'd like to perform a LEFT JOIN, starting from the subscriptions table, but the thing is that depending on the value of the type field I will ONLY find the matching row in one table or the other, i.e. if type equals 'FREE', then the matching row will ONLY be in the free_users table, and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINing both tables and then using a COALESCE function to get the non-null value, or a UNION of two different queries, each using an INNER JOIN, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know is that I'm joining by the user_id field, which is the PK in both free_users and premium_users.
So, my question is: which would be the most performant way to do a JOIN that depending on the value of type column will match to one table or another. Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.
What is the best in terms of performance? Well, you should try it on your data and your systems.
My recommendation is two left joins:
select s.*,
       coalesce(fu.name, pu.name) as name
from subscriptions s
left join free_users fu
       on fu.free_id = s.subscription_id
      and s.type = 'free'
left join premium_users pu
       on pu.premium_id = s.subscription_id
      and s.type = 'premium';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because those ids should be the primary keys of their tables.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.
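For comparison, the UNION ALL approach mentioned in the question would be shaped roughly like this (just a sketch, reusing the column names assumed in the join above):
select s.*, fu.name
from subscriptions s
join free_users fu on fu.free_id = s.subscription_id
where s.type = 'free'
union all
select s.*, pu.name
from subscriptions s
join premium_users pu on pu.premium_id = s.subscription_id
where s.type = 'premium';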

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is: is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong"?
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome its peculiar locking behaviour), they are far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system spill intermediate results to disk on its own.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, at least create them as temporary or unlogged tables to minimize the I/O overhead that comes with writing that data, and thus keep as much data in memory as possible.
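For example, a sketch using the first query from your question (an unlogged table persists but skips WAL; a temporary table additionally is private to the session and dropped at its end):
CREATE UNLOGGED TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*) AS cnt
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
-- or, session-private (note: temporary tables can't live in the gm schema)
CREATE TEMPORARY TABLE race_code_and_turnout_count AS
SELECT race_code, count(*) AS cnt
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;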
However, I would always start with a single query instead of maintaining many different tables (which all need to be changed if you have to change the structure of the report).
For example, the first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions, Postgres will automatically store the data on disk if it gets too big; otherwise it will process it in memory. When you manually create the tables, you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do, however, suggest using (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

Best practices for multiple table joins UNION vs JOIN

I'm using a query which returns ~74 fields from different database tables.
The query consists of 10 FULL JOINS and 15 LEFT JOINS.
Each join condition is based on different fields.
The query fetches data from a main table which contains almost 90% foreign keys.
I'm using those joins for the foreign keys, but some of the data doesn't require those joins, because that type of data logically doesn't use that information.
Let me give an example:
Each employee can have multiple Tasks. There are four types of tasks (1, 2, 3, 4).
Each TaskType has a different meaning. When running the query, I'm getting data for all those task types and then doing some logic to show them separately.
My question is: is it better to use UNION ALL and split the 4 different cases into separate queries? This way I could use only the required joins for each case in each branch of the union.
Thanks,
I would think it depends strongly on the size (row count) of the main table and, e.g., the task tables.
Say your main table has tens of millions of rows and the task tables are smaller: then a union over all the tasks will necessitate scanning the main table for every branch, whereas a join with the smaller task tables can do this with one table scan.
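To make that concrete, the split the question asks about would look roughly like this (table and column names here are assumptions, not from the question):
SELECT e.id, e.name, t.some_field
FROM employee e
JOIN task t ON t.employee_id = e.id AND t.task_type = 1
-- joins needed only by task type 1 go here
UNION ALL
SELECT e.id, e.name, t.some_field
FROM employee e
JOIN task t ON t.employee_id = e.id AND t.task_type = 2
-- joins needed only by task type 2 go here
-- ... and likewise for task types 3 and 4
versus one query that joins every table and branches on the task type afterwards.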

Is a Join Faster than two queries

I know this question has been asked before (example), and I do agree that one query with a join is faster than performing another query for each record returned by the first query.
However, since a join tends to generate redundant fields, will this slow down the network?
Let's say I have a table HOTEL, and each hotel has a number of images in a table HOTEL_IMAGE. HOTEL has 20 fields. Performing a join on HOTEL_IMAGE will produce 20 fields for each hotel image. Will this query still be faster over the network?
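To illustrate what I mean (column names are only assumptions), compare:
-- one round trip, but the 20 HOTEL columns are repeated for every image row
SELECT h.*, i.ID AS ImageID, i.ImageFile
FROM HOTEL h
JOIN HOTEL_IMAGE i ON i.HotelID = h.ID;
-- versus two queries, where the hotel columns travel only once
SELECT * FROM HOTEL;
SELECT HotelID, ID, ImageFile FROM HOTEL_IMAGE;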
That depends a lot on your actual data, but from what I have seen, if you have a well-parameterized DB with fresh statistics, it is much better to put the join in SQL and let the DB figure out what to do.
Anyway, DB queries are in my opinion the first things you want to profile. It is not a coincidence that any good DBMS has a lot of performance measuring tools. And you need to profile with data as close to actual data as possible (recent copies of your production environment are best).
Don't use select * but only the columns that you need. If you do this, a join will be faster (and I'm not sure why you would ever want to do this with 2 queries; you would have to make 2 round trips to your database, etc.).
As a solution to avoid duplication in joined data, you can return multiple recordsets from a single query, if your database supports it. One recordset will return the master records, and the second the detail records plus the key fields from the master query.
select ID, Name, ... from HOTEL where <.... your criteria>;
select h.ID as HotelID, i.ID, i.Description, i.ImageFile, .... from HOTEL_IMAGE i
join HOTEL h on h.ID = i.HotelID and ( <.... same criteria for HOTEL> )
I'm not sure whether the query on the master table would be cached so that the second select could reuse it, but it will save traffic for sure.
We are using this approach for queries, which tend to return multi-level joined results.

MySQL - Selecting data from multiple tables all with same structure but different data

Ok, here is my dilemma: I have a database set up with about 5 tables, all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables and it needs to be sorted by a user defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statement:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
It sounds like you'd be happier with a single table. The fact that the five tables have the same schema, and sometimes need to be presented as if they came from one table, points to putting it all in one table.
Add a new column which can be used to distinguish among the five languages (I'm assuming it's the language that differs among the tables, since you said it was for localization). Don't worry about having 4.5 million records. Any real database can handle that size without a problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
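A sketch of that single table (the locale column name and the column types are assumptions):
CREATE TABLE music (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
band_name VARCHAR(255),
song_name VARCHAR(255),
album_name VARCHAR(255),
genre VARCHAR(100),
locale CHAR(2) NOT NULL,
INDEX idx_genre_locale (genre, locale)
);
INSERT INTO music (band_name, song_name, album_name, genre, locale)
SELECT band_name, song_name, album_name, genre, 'us' FROM us_music;
-- repeat for de_music and the other localized tables
SELECT * FROM music WHERE genre = 'punk' AND locale IN ('us', 'de');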
Any of the above answers are valid, or an alternative way is to qualify the column name with the table name (or even database.table) - eg:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
The column is ambiguous because it appears in both tables; you would need to specify the WHERE (or sort) field fully, such as us_music.genre or de_music.genre, but you'd usually only specify two tables if you were then going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although it's usually done to separate the dataset into distinct files as well, rather than to just split the dataset arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, then I'd build one big table with an extra "origin" field that contains a country code, but you're probably doing it for legitimate performance reasons.
Either use a UNION to combine the tables you're interested in http://dev.mysql.com/doc/refman/5.0/en/union.html or use the MERGE storage engine http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html.
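A minimal sketch of the MERGE-engine route (it requires the underlying tables to be identical MyISAM tables; the column types here are assumptions):
CREATE TABLE all_music (
id INT NOT NULL,
band_name VARCHAR(255),
song_name VARCHAR(255),
album_name VARCHAR(255),
genre VARCHAR(100),
INDEX (genre)
) ENGINE=MERGE UNION=(us_music, de_music) INSERT_METHOD=LAST;
SELECT * FROM all_music WHERE genre = 'punk';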
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the tables being combined from the condition that says how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
The UNION statement can take a great deal of time on huge data sets. It can be better to perform the select in 2 steps:
select the ids
then select from the main table using them
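A rough sketch of that two-step idea (assuming id is the primary key, as in the column list above):
-- step 1: fetch only the matching ids (a cheap, narrow result)
SELECT id FROM us_music WHERE genre = 'punk';
SELECT id FROM de_music WHERE genre = 'punk';
-- step 2: fetch the full rows for the ids collected in step 1
SELECT * FROM us_music WHERE id IN (...);
SELECT * FROM de_music WHERE id IN (...);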