Performance over PostgreSQL conditional join - Query optimization - sql

Let's assume I have three tables. The first, subscriptions, has a field called type, which can only have two values:
FREE
PREMIUM
The other two tables are called premium_users and free_users. I'd like to perform a LEFT JOIN starting from the subscriptions table, but depending on the value of the type field I will ONLY find the matching row in one table or the other, i.e. if type equals 'FREE', the matching row will ONLY be in the free_users table, and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINing both tables and then using COALESCE to get the non-null value, or a UNION of two queries that each use an INNER JOIN, but I'm not sure which would be best in terms of performance. Also, the free_users table is almost five times larger than the premium_users table. Another thing you should know is that I'm joining on the user_id field, which is the PK in both free_users and premium_users.
So, my question is: what is the most performant way to do a JOIN that, depending on the value of the type column, matches one table or another? Would the solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.

What is the best in terms of performance? Well, you should try on your data and your systems.
My recommendation is two left joins:
-- Each join can only match when s.type names that table.
select s.*,
       coalesce(fu.name, pu.name) as name
from subscriptions s left join
     free_users fu
     on fu.user_id = s.user_id and
        s.type = 'FREE' left join
     premium_users pu
     on pu.user_id = s.user_id and
        s.type = 'PREMIUM';
You want indexes on free_users(user_id) and premium_users(user_id). These are "free" here because user_id is already the primary key of both tables.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.
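To make the comparison concrete, here is a minimal sketch that runs both formulations against an in-memory SQLite database from Python. SQLite stands in for PostgreSQL only to check the results, so it says nothing about PostgreSQL's planner or timings. The name column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE free_users    (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subscriptions (subscription_id INTEGER PRIMARY KEY,
                                user_id INTEGER, type TEXT);
    INSERT INTO free_users    VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO premium_users VALUES (3, 'cy');
    -- Subscription 13 points at a user that exists in neither table.
    INSERT INTO subscriptions VALUES
        (10, 1, 'FREE'), (11, 2, 'FREE'), (12, 3, 'PREMIUM'), (13, 4, 'FREE');
""")

# Two LEFT JOINs: every subscription is kept, unmatched ones get NULL.
two_left_joins = conn.execute("""
    SELECT s.subscription_id, COALESCE(fu.name, pu.name) AS name
    FROM subscriptions s
    LEFT JOIN free_users fu    ON fu.user_id = s.user_id AND s.type = 'FREE'
    LEFT JOIN premium_users pu ON pu.user_id = s.user_id AND s.type = 'PREMIUM'
    ORDER BY s.subscription_id
""").fetchall()

# UNION ALL of two INNER JOINs: unmatched subscriptions disappear.
union_all = conn.execute("""
    SELECT s.subscription_id, fu.name
    FROM subscriptions s JOIN free_users fu ON fu.user_id = s.user_id
    WHERE s.type = 'FREE'
    UNION ALL
    SELECT s.subscription_id, pu.name
    FROM subscriptions s JOIN premium_users pu ON pu.user_id = s.user_id
    WHERE s.type = 'PREMIUM'
    ORDER BY 1
""").fetchall()

print(two_left_joins)
print(union_all)
```

Besides any performance difference, the two forms are only interchangeable when every subscription has a matching user row; otherwise the UNION ALL of inner joins returns fewer rows than the LEFT JOIN version.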

Related

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of joining on each column individually, I could create a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, to check which method performs better.
It was faster at first, but in a few CVs this method is sometimes slower, especially when filtering!
Can anyone suggest which method is more efficient?
I don't think JOIN conditions on concatenated fields will perform better.
Although it is generally said that column tables in a HANA database do not need indexes, column tables have a structure that works like an index on every column.
So if you concatenate 4 columns into a new calculated field, you first lose the option to use these indexes on the 4 columns and on the corresponding joining columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised that you say it worked faster and that you experienced problems in only a few cases,
because concatenation, or applying any function to a database column, is a workload on the SELECT by itself. It might also include an implicit type cast, which can add more work than expected.
First I would suggest considering setting your table to column store and check the new performance.
After that, if you are using an OR condition in your join, I would suggest splitting the JOIN into multiple JOINs.
Third, an INNER JOIN will generally give you better performance compared to a LEFT JOIN (LEFT OUTER JOIN).
Another thing about JOINs and performance: you had better join on PRIMARY KEYS rather than on arbitrary columns.
For me, the join on multiple fields performs faster than the join on the concatenated field in both cases. In the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables. On the other hand, when I join on the concatenated field, only one table gets filtered.
However, if you put filter on both the fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both the tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'

Comparing two partition's data in hive

I have two partitions in Hive with 9 million records each. The table has 20 columns. I want to compare the datasets of the two partitions based on an id column. What is the best way to do this, considering that a self join over 9 million records will cause performance issues?
Can you try an SMB (sort-merge-bucket) join? It works mostly like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same, but that would be a project by itself. The first option is easier.
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
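-- Filter each partition in a subquery: a WHERE on both sides after the
-- full outer join would discard the rows that have no match.
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id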
To do a full comparison of 2 tables (or comparing 2 partitions of the same table), my experience has shown me that using some checksum mechanism is a more effective and reliable solution than Joining the tables (which gives performance problems as you mentioned, and also gives some difficulties when keys are repeated for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
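The checksum idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the algorithm used by hive_compared_bq. The function names and sample rows are hypothetical, and the sketch assumes that id is unique within each partition.

```python
import hashlib


def row_checksum(row):
    """Hash every column of a row (a dict) in a fixed column order."""
    joined = "\x1f".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare_partitions(rows_x, rows_y, key="id"):
    """Return ids only in x, ids only in y, and ids whose rows differ."""
    sums_x = {r[key]: row_checksum(r) for r in rows_x}
    sums_y = {r[key]: row_checksum(r) for r in rows_y}
    only_x = sorted(sums_x.keys() - sums_y.keys())
    only_y = sorted(sums_y.keys() - sums_x.keys())
    changed = sorted(k for k in sums_x.keys() & sums_y.keys()
                     if sums_x[k] != sums_y[k])
    return only_x, only_y, changed


# Hypothetical sample data standing in for the two partitions.
partition_x = [{"id": 1, "foo": "a"}, {"id": 2, "foo": "b"}, {"id": 3, "foo": "c"}]
partition_y = [{"id": 2, "foo": "b"}, {"id": 3, "foo": "changed"}, {"id": 4, "foo": "d"}]

result = compare_partitions(partition_x, partition_y)
print(result)
```

Only one checksum per row has to be compared, regardless of how many of the 20 columns the table has.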

SQL Query with multiple possible joins (or condition in join)

I have a problem where I have to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match them even if they were primary at first but are now secondary, etc.
Here was my first attempt, I'm just counting now to get the joins and conditions down. I'll select actual data later. Basically the personal table is joined once to active accounts, and another copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found the question "Is having an 'OR' in an INNER JOIN condition a bad idea?", so I tried rewriting it as 4 queries (one per SSN combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no matter how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then union them, you might have problems. What if the same record pair fulfills at least two conditions? You will then have duplicates in your result.
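The duplicate problem is easy to reproduce. The following sketch uses an in-memory SQLite database from Python. The two tables active and delinquent are a hypothetical simplification of the joined personal/consumer and personal/uncol_acct pairs, with a single account pair that satisfies two of the four SSN conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE active     (acct INTEGER, ssn TEXT, addl_ssn TEXT);
    CREATE TABLE delinquent (acct INTEGER, ssn TEXT, addl_ssn TEXT);
    -- Primary and additional contact swapped roles between the accounts.
    INSERT INTO active     VALUES (1, 'A', 'B');
    INSERT INTO delinquent VALUES (2, 'B', 'A');
""")

select_from = ("SELECT a.acct AS active_acct, d.acct AS delinquent_acct "
               "FROM active a JOIN delinquent d ON ")
conditions = [
    "d.ssn = a.ssn",
    "d.ssn = a.addl_ssn",
    "d.addl_ssn = a.ssn",
    "d.addl_ssn = a.addl_ssn",
]

# One join with all four conditions combined by OR.
or_join = conn.execute(select_from + " OR ".join(conditions)).fetchall()


def unioned(operator):
    """Run one query per condition and combine them with the given operator."""
    parts = [select_from + cond for cond in conditions]
    return conn.execute(f" {operator} ".join(parts)).fetchall()


union_all_rows = unioned("UNION ALL")  # keeps the pair once per matching condition
union_rows = unioned("UNION")          # removes duplicate rows

print(or_join, union_all_rows, union_rows)
```

UNION removes the duplicates at the cost of an extra deduplication step, and a count(*) over UNION ALL would overcount such pairs.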
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query that contains two subqueries, named t1 and t2. The first subquery joins personal and consumer. The second joins the second occurrence of personal with uncol_acct and contains the where clause. The main query then joins t1 and t2. This way you optimise, as the main query considers only pairings of valid t1 and t2 rows.
Also, if your where clause is outside, as in your example query, the four-way join may be executed first and the where applied only afterwards, unless the optimizer pushes the condition down. Putting the where clause inside the second subquery ensures that it runs before the main join. You can also create a subquery inside the second subquery to evaluate the where first if the condition is rarely fulfilled.
Cheers!

Inefficient JOIN Method?

I'm trying to query two fairly large tables here to pull some results and am having some trouble with efficiency.
Note: I've only included relevant columns to make this not look so messy!
TableA (Stock) has productID, ownerID, and count columns
TableB (Owners) has ID, accountHolderID, and name columns
What I'm trying to do is query TableA and where productID = X pull up Stock.productID, Stock.accountHolderID and Owners.name. The relation between these two tables is Stock.ownerID = Owners.ID so if the WHERE condition pulled say five productIDs then I'd want the name from TableB that matched up to the ownerID from TableA.
The only unique ID in this situation is Owners.ID from TableB
Just doing a basic SELECT query on TableA for those products takes 15 seconds however when I add an INNER JOIN to match things up to TableB the query takes significantly longer, upwards of 10 minutes. I'm guessing I've designed this query inefficiently.
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID = 42301679
How can I make this query more efficient?
Would adding ORs to the WHERE condition allow me to pull multiple productIDs at once?
Based on your comment, it looks like you're missing a very critical index on the owners.id field. Now, keep in mind this index will help this query, but you have to take into consideration all of the other queries that run against this table to determine if it is a good idea to add that index.
At 29M rows, having an index on a table that is frequently inserted to may have a noticeable effect on insert times.
This may be a situation where different applications need different indexes - namely your OLTP app and your reporting app (which may just be you running ad hoc queries). A common solution is to have a second server that runs your reporting/data warehouse queries that has indexes properly tuned to this function.
Best of luck.
Your query looks right. Perhaps we can see the schema?
In order to pull multiple productIDs at once you can use the IN operator instead of OR
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID IN (42301679,123232,232324)
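As a small runnable sketch of the index and IN suggestions, the following Python snippet builds the two tables in an in-memory SQLite database. The index names and sample rows are assumptions for illustration, and SQLite stands in for the real database, so it says nothing about the timings on 29M rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owners (ID INTEGER, accountHolderID INTEGER, name TEXT);
    CREATE TABLE Stock  (productID INTEGER, ownerID INTEGER, "count" INTEGER);
    -- Hypothetical index names: one for the join column, one for the filter.
    CREATE INDEX idx_owners_id       ON Owners (ID);
    CREATE INDEX idx_stock_productid ON Stock (productID);
    INSERT INTO Owners VALUES (1, 100, 'alice'), (2, 200, 'bob');
    INSERT INTO Stock  VALUES (42301679, 1, 5), (123232, 2, 7), (999, 1, 1);
""")

product_ids = [42301679, 123232, 232324]
# One "?" placeholder per id keeps the values out of the SQL text.
placeholders = ", ".join("?" for _ in product_ids)

rows = conn.execute(f"""
    SELECT Owners.name, Stock.productID, Stock.ownerID
    FROM Stock
    INNER JOIN Owners ON Stock.ownerID = Owners.ID
    WHERE Stock.productID IN ({placeholders})
    ORDER BY Stock.productID
""", product_ids).fetchall()

print(rows)
```

Ids that do not exist in Stock, such as 232324 here, simply contribute no rows.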
If the productID is unique in the Stock table, it makes sense to make this the index and this can greatly improve performance as others have mentioned.
Another performance gain may come from giving the Owners.name field a fixed length. In MySQL, VARCHAR is used for strings of varied length, while a CHAR(32) column always occupies 32 characters. Shorter values are just padded, so you can really think of the (32) as indicating a maximum length. With fixed-width rows the database knows exactly how many bytes each row occupies and can use this information to improve lookup time, although how much this helps depends on the storage engine.

Tips or tricks for translating sql joins from literal language to SQL syntax?

I often know exactly what I want, and know how tables are related in order to get it, but I have a real hard time translating that literal language knowledge to SQL syntax when it comes to joins. Do you have any tips or tricks you can share that have worked for you in the past?
This is a basic, but poor example:
"I have Categories, which have one-to-many Products, which have one-to-many Variants, which have one-to-many Sources. I need all Sources that belong to Category XYZ."
I imagine doing something where you cross out certain language terms and replace them with SQL syntax. Can you share how you formulate your queries based upon some concept similar to that? Thanks!
Use SQL Query Designer to easily build join queries from the visual table collection, then, if you want to learn how they work, simply study the generated SQL. That's how I learned it.
You won't notice how charming it is till you try it.
Visual Representation of SQL Joins - A walkthrough explaining SQL JOINs.
Complete reference for SQL Server joins (Inner Join, Left Outer Join, Right Outer Join, Full Outer Join) in SQL Server 2005.
ToTraceString of Entity Framework's ObjectQuery (to which you add Include shapings) is also a good way to learn it.
SQL-Server Join types (with detailed examples for each join type):
INNER JOIN - Match rows between the two tables specified in the INNER JOIN statement based on one or more columns having matching data. Preferably the join is based on referential integrity enforcing the relationship between the tables to ensure data integrity.
Just to add a little commentary to the basic definitions above, in general the INNER JOIN option is considered to be the most common join needed in applications and/or queries. Although that is the case in some environments, it is really dependent on the database design, referential integrity and data needed for the application. As such, please take the time to understand the data being requested then select the proper join option.
Although most join logic is based on matching values between the two columns specified, it is possible to also include logic using greater than, less than, not equals, etc.
LEFT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the left table. On the right table, the matching data is returned in addition to NULL values where a record exists in the left table, but not in the right table.
Another item to keep in mind is that the LEFT and RIGHT OUTER JOIN logic is opposite of one another. So you can change either the order of the tables in the specific join statement or change the JOIN from left to right or vice versa and get the same results.
RIGHT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the right table. On the left table, the matching data is returned in addition to NULL values where a record exists in the right table but not in the left table.
Self Join - In this circumstance, the same table is specified twice with two different aliases in order to match the data within the same table.
CROSS JOIN - Based on the two tables specified in the join clause, a Cartesian product is created if a WHERE clause does not filter the rows. The size of the Cartesian product is based on multiplying the number of rows from the left table by the number of rows in the right table. Please heed caution when using a CROSS JOIN.
FULL JOIN - Based on the two tables specified in the join clause, all data is returned from both tables regardless of matching data.
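The row-count behaviour of these join types can be checked with a short sketch. It uses an in-memory SQLite database from Python rather than SQL Server, and the authors and books tables are hypothetical. RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    -- ann has two books, bob has none, and one book has no author.
    INSERT INTO books VALUES (10, 1, 'first'), (11, 1, 'second'), (12, NULL, 'orphan');
""")


def count_rows(from_clause):
    """Count the rows produced by the given FROM clause."""
    return conn.execute(f"SELECT COUNT(*) FROM {from_clause}").fetchone()[0]


# Only matching pairs: ann with each of her two books.
inner_count = count_rows("authors a INNER JOIN books b ON b.author_id = a.id")
# All authors: ann twice, plus bob with NULLs for the book columns.
left_count = count_rows("authors a LEFT OUTER JOIN books b ON b.author_id = a.id")
# Cartesian product: 2 authors x 3 books.
cross_count = count_rows("authors a CROSS JOIN books b")

print(inner_count, left_count, cross_count)
```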
I think most people approach it like this:
Look for nouns, as they can point to potential tables
Look for adjectives, because they are probably fields
Relationships between nouns give JOIN rules
Nothing is better than drawing these structures on a sheet of paper.
1) Write and debug a query which returns the fields from the table having the majority of, or the most important, data. Add constraints which depend only on that table, or which are independent of all tables.
2) Add a new where term which relates another table.
3) Repeat 2 until done.
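Applied to the Categories, Products, Variants and Sources example from the question, the steps above can be sketched as follows, using an in-memory SQLite database from Python. The table layout, column names and sample rows are assumptions, and each step relates one more table with JOIN ... ON, which is the equivalent of adding a where term.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products   (id INTEGER PRIMARY KEY, category_id INTEGER);
    CREATE TABLE variants   (id INTEGER PRIMARY KEY, product_id INTEGER);
    CREATE TABLE sources    (id INTEGER PRIMARY KEY, variant_id INTEGER, url TEXT);
    INSERT INTO categories VALUES (1, 'XYZ'), (2, 'other');
    INSERT INTO products   VALUES (10, 1), (11, 2);
    INSERT INTO variants   VALUES (100, 10), (101, 10), (102, 11);
    INSERT INTO sources    VALUES (1000, 100, 's1'), (1001, 101, 's2'), (1002, 102, 's3');
""")

# Step 1: start from the table that holds the data you want back.
step1 = "SELECT s.id FROM sources s"
# Step 2, repeated: relate one more table at a time and re-run the query.
step2 = step1 + " JOIN variants v ON v.id = s.variant_id"
step3 = step2 + " JOIN products p ON p.id = v.product_id"
step4 = step3 + " JOIN categories c ON c.id = p.category_id WHERE c.name = 'XYZ'"


def run(sql):
    """Execute a step and return the source ids in a stable order."""
    return [row[0] for row in conn.execute(sql + " ORDER BY s.id")]


all_sources = run(step1)
xyz_sources = run(step4)
print(all_sources, xyz_sources)
```

Running the intermediate steps as well makes it easy to spot the join that unexpectedly multiplies or drops rows.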
I've yet to use the join operator in a query, even after 20+ years of writing SQL queries. One can almost always write them in the form
select field, field2, field3, <etc.>
from table
where field in (select whatever from table2 where whatever) and
field2 in (select whatever from table2 where whatever) and ...
or
select field, field2, field3, <etc.>
from table1, table2, ...
where table1.field = table2.somefield and
table1.field2 = table3.someotherfield and ...
Like someone else wrote, just be bold and practice. It will be like riding a bicycle after 4 or 5 times creating such a query.
One word: Practice.
Open up the query manager and start running queries until you get what you want. Look up similar examples and adapt them to your situation. You will always have to do some trial and error with the queries to get them right.
SQL is very different from imperative programming.
1) To design tables, consider Entities (the real THINGS in the world of concern), Relationships (between the Entities), and Attributes (the values associated with an Entity).
2) to write a Select statement consider a plastic extrusion press:
a) you put in raw records From tables, Where conditions exist in the records
b) you may have to join tables to get at the data you need
c) you craft extrusion nozzles to make the plastic into the shapes you want. These are the individual expressions of the select List.
d) you may want the n-ary sets of list data to come to you in a certain order, you can apply an Order By clause.
3) crafting the List expressions is the most like imperative programming after you discover the if(exp,true-exp,false-exp) function.
Look at the ERD.
Logical or physical version, it will show what tables are related to one another. This way, you can see what table(s) you need to join to in order to get from point/table a to point/table b, and what criteria.