Configuring Merge Join in PostgreSQL

I'm using PostgreSQL with large tables, and my query takes too long.
I have two tables. The first has about 6 million rows (the data table), and the second has about 30,000 rows (the users table).
Each user has about 200 rows in the data table.
Later, both the data and users tables may grow up to 30 times larger.
My query is:
SELECT d.name, count(*) c
FROM users AS u JOIN data AS d on d.id = u.id
WHERE u.language = 'eng' GROUP BY d.name ORDER BY c DESC LIMIT 10;
90% of users have language 'eng', and the query takes 7 seconds. Every column involved is indexed!
I read that a Merge Join should be really fast, so I sorted the tables by id and forced a Merge Join, but the time increased to 20 seconds.
I suppose the table configuration is wrong, but I don't know how to fix it.
Should I make other improvements?
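(For reference, forcing a merge join in PostgreSQL is typically done by disabling the competing join strategies for the session; a minimal sketch of that setup, using the standard planner settings:)
-- Standard planner GUCs; turning off the alternatives pushes the planner toward a merge join.
SET enable_hashjoin = off;
SET enable_nestloop = off;
-- Re-run with EXPLAIN ANALYZE to confirm a Merge Join node appears in the plan:
EXPLAIN ANALYZE
SELECT d.name, count(*) AS c
FROM users AS u JOIN data AS d ON d.id = u.id
WHERE u.language = 'eng'
GROUP BY d.name ORDER BY c DESC LIMIT 10;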

For this query:
SELECT d.name, count(*) c
FROM users u
JOIN data d ON d.id = u.id
WHERE u.language = 'eng'
GROUP BY d.name
ORDER BY c DESC
LIMIT 10;
First, try indexes: users(language, id), data(id, name). See if this speeds up the query.
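A sketch of those two indexes (the names are illustrative):
CREATE INDEX idx_users_language_id ON users (language, id);
CREATE INDEX idx_data_id_name ON data (id, name);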
Second, what is d.name? Can a user have more than one of them? Is there a table of valid values? Depending on the answers to these questions, there may be other ways to structure the query.
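For example, one possible restructuring (equivalent here only because users.id is unique per user) rewrites the join as a semi-join, which gives the planner an extra strategy to consider:
SELECT d.name, count(*) AS c
FROM data AS d
WHERE EXISTS (
    SELECT 1 FROM users AS u
    WHERE u.id = d.id AND u.language = 'eng'
)
GROUP BY d.name
ORDER BY c DESC
LIMIT 10;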


PostgreSQL - Optimize subquery by referencing outer query

I have two tables: users and orders. Orders is a massive table (>100k entries) and users is relatively small (around 400 entries).
I want to find the number of orders per user. The column linking both tables is the email column.
I can achieve this with the following query:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o
WHERE o.status = 'COMPLETED'
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
However, as mentioned previously, the orders table is very large, so I would ideally like to add another condition to the subquery's WHERE clause so it only retrieves emails that exist in the users table.
I can simply add the user table to the subquery like this:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o, users AS u
WHERE o.status = 'COMPLETED'
and o.cust_email = u.email
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
This does speed up the query, but sometimes the outer query is much more complex than just selecting all entries from the users table, so this solution does not always work. The goal would be to somehow link the outer and the inner query. I've thought of joined queries but cannot figure out how to get them to work.
I noticed that the first query seems to perform faster than I expected, so perhaps PostgreSQL is already smart enough to connect the outer and inner tables. However, I was hoping that someone could shed some light on how this works and what the best way to perform these types of subqueries is.
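One way to link the outer and inner query explicitly in PostgreSQL is a LATERAL subquery, which lets the subquery reference the current row of the outer query; a sketch using the same tables (assuming createdate lives on users):
SELECT u.id, sub_1.num
FROM users AS u
CROSS JOIN LATERAL (
    -- The subquery runs per user row, so only that user's orders are aggregated.
    SELECT COUNT(o.purchaseid) AS num
    FROM orders AS o
    WHERE o.status = 'COMPLETED'
      AND o.cust_email = u.email
) sub_1
ORDER BY u.createdate DESC NULLS LAST;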

Where can I find usage statistics in Redshift?

First of all, thank you for your help!
I want to find out which tables in the database are most heavily used, i.e. the number of users that query each table, the number of times it was queried, the resources consumed by users per table, the total time the tables were queried, and any other useful data.
For now I would limit the analysis to 9 specific tables.
I tried using stl_scan and pg_user with the following two queries:
SELECT
s.perm_table_name AS table_name,
count(*) AS qty_query,
count(DISTINCT s.userid) AS qty_users
FROM stl_scan s
JOIN pg_user b
ON s.userid = b.usesysid
JOIN temp_mone_tables tmt
ON tmt.table_id = s.tbl AND tmt.table = s.perm_table_name
WHERE s.userid > 1
GROUP BY 1
ORDER BY 1;
SELECT
b.usename AS user_name,
count(*) AS qty_scans,
count(DISTINCT s.tbl) AS qty_tables,
count(DISTINCT trunc(starttime)) AS qty_days
FROM stl_scan s
JOIN pg_user b
ON s.userid = b.usesysid
JOIN temp_mone_tables tmt
ON tmt.table_id = s.tbl AND tmt.table = s.perm_table_name
WHERE s.userid > 1
GROUP BY 1
ORDER BY 1;
temp_mone_tables is a temporary table that contains the id and name of the tables I'm interested in.
With these queries I'm able to get some information, but I need more details. Surprisingly, there's not much online about this kind of statistics.
Again thank you all beforehand!
Nice work! You are on the right track using the stl_scan table. I'm not clear what further details you're looking for.
For detailed metrics on resource usage you may want to use the SVL_QUERY_METRICS_SUMMARY view. Note that this data is summarized by query, not by table, because a query is the primary unit by which resources are consumed.
Generally, have a look at the admin queries (and views) in our Redshift Utils library on GitHub, particularly v_get_tbl_scan_frequency.sql
Thanks to Joe Harris' answer I was able to add a lot of information to my previous query. Joining svl_query_metrics_summary to stl_scan gives you important data about resource consumption, and this can be extended by joining to the many views listed in Joe's answer.
For me, the solution begins with the following query:
SELECT *
FROM stl_scan ss
JOIN pg_user pu
ON ss.userid = pu.usesysid
JOIN svl_query_metrics_summary sqms
ON ss.query = sqms.query
JOIN temp_mone_tables tmt
ON tmt.table_id = ss.tbl AND tmt.table = ss.perm_table_name
The query gives you a lot of data that can be summarized in whatever ways you need.
Remember that temp_mone_tables is a temporary table containing the table_id and name of the tables I'm interested in.
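As an example of one such summary, here is a per-table roll-up; the metric columns (query_cpu_time, query_execution_time) are taken from the SVL_QUERY_METRICS_SUMMARY documentation, so verify them against your cluster version:
WITH scans AS (
    -- stl_scan has one row per slice/step, so collapse to one row per query and table
    -- to avoid counting a query's metrics once per scan slice.
    SELECT DISTINCT ss.query, ss.userid, ss.tbl, ss.perm_table_name
    FROM stl_scan ss
    WHERE ss.userid > 1
)
SELECT
    s.perm_table_name AS table_name,
    COUNT(DISTINCT s.query) AS qty_queries,
    COUNT(DISTINCT s.userid) AS qty_users,
    SUM(sqms.query_cpu_time) AS total_cpu_time,
    SUM(sqms.query_execution_time) AS total_execution_time
FROM scans s
JOIN svl_query_metrics_summary sqms ON s.query = sqms.query
JOIN temp_mone_tables tmt
    ON tmt.table_id = s.tbl AND tmt.table = s.perm_table_name
GROUP BY 1
ORDER BY total_cpu_time DESC;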

Most efficient way to get records from a table for which a record exists in another table for each month

I have two tables as below:
User: User_ID, User_name and some other columns (has approx 1000 rows)
Fee: Created_By_User_ID, Created_Date and many other columns (has 17 million records)
The Fee table does not have any indexes (and I can't create one).
I need a list of users for each month of a year (say 2016) who have created at least one fee record.
I do have a working query below, but it takes a long time to execute. Can someone help me with a better query? Maybe using an EXISTS clause (I tried one, but it still takes time because it scans the Fee table).
SELECT MONTH(f.Created_Date), f.Created_By_User_ID
FROM Fees f
JOIN [User] u ON f.Created_By_User_ID= u.User_ID
WHERE f.Created_Date BETWEEN '2016-01-01' AND '2016-12-31'
A full scan of the Fee table is unavoidable here, since it has no index. But if you use the join directly, as in your original query, the join has to work through many redundant rows. The same happens with the inner query suggested by Mansoor.
An optimization is to decrease the number of rows the join has to process.
Assuming that the User table contains one record per user and the Fee table has multiple records per person, we can first find the distinct months in which each user created a fee record, using a CTE.
Then we can join on top of this CTE; this reduces the work the join performs and should give a somewhat better runtime over a large data set.
Try this:
WITH CTE_UserMonthwiseFeeRecords AS
(
SELECT DISTINCT Created_By_User_ID, MONTH(Created_Date) AS FeeMonth
FROM Fee
WHERE Created_Date BETWEEN '2016-01-01' AND '2016-12-31'
)
SELECT User_name, FeeMonth
FROM CTE_UserMonthwiseFeeRecords f
INNER JOIN [User] u ON f.Created_By_User_ID= u.User_ID
Also, you have not mentioned whether you actually require the user names. If only the id is needed to find the distinct users creating fee records per month, you can just use the query from the CTE and skip the JOIN entirely:
SELECT DISTINCT Created_By_User_ID, MONTH(Created_Date) AS FeeMonth
FROM Fee
WHERE Created_Date BETWEEN '2016-01-01' AND '2016-12-31'
Try the query below:
SELECT MONTH(f.Created_Date), f.Created_By_User_ID
FROM Fees f
WHERE EXISTS (SELECT 1 FROM [User] u WHERE f.Created_By_User_ID = u.User_ID)
AND DATEDIFF(DAY, f.Created_Date, '2016-01-01') <= 0
AND DATEDIFF(DAY, f.Created_Date, '2016-12-31') >= 0;
You may try this approach to reduce the query run time. However, it duplicates the large data set into a separate table (Temp_Fees), and every DML operation on Fees/User requires a truncate and fresh load of Temp_Fees.
SELECT * INTO Temp_Fees FROM
(SELECT MONTH(f.Created_Date) AS Created_MONTH, f.Created_By_User_ID
FROM Fees f
WHERE f.Created_Date BETWEEN '2016-01-01' AND '2016-12-31') AS t;
SELECT f.Created_MONTH, f.Created_By_User_ID
FROM Temp_Fees f
JOIN [User] u ON f.Created_By_User_ID= u.User_ID

Optimize subquery join order

I have this query, but it's taking a very long time.
I read on Wikipedia that the join order may be a factor:
The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders-of-magnitude more time to execute than one that joins A and C first
I'm trying to get TV shows an actor has been on through their episodes.
My query looks like this:
select distinct e.show_id
from episodes e
where e.id IN
(select c.episode_id
from contributions c
where c.person_id = #{#person.id})
")
The row counts are:
2,500,000 episodes
600,000 contributions
40,000 shows
20,000 people
Am I on the right track, or should I be using joins? This query sometimes takes over 10 seconds on Heroku even though everything has an index.
Try using a JOIN instead of the nested select with IN. Something like this:
SELECT distinct e.show_id
FROM episodes e
JOIN contributions c
ON e.id = c.episode_id
WHERE c.person_id = #person.id
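If the rewrite alone doesn't help, composite indexes covering the lookup are worth checking; a sketch with illustrative index names (the single-column indexes you already have may make these partly redundant):
-- Lets Postgres find a person's episode ids without touching the table.
CREATE INDEX idx_contributions_person_episode ON contributions (person_id, episode_id);
-- Allows an index-only scan when resolving episode ids to show ids.
CREATE INDEX idx_episodes_id_show ON episodes (id, show_id);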

AND conditions in a many-to-many relation

Say I have three tables, a table of users, a table of around 500 different items, and the corresponding join table. What I would like to do is:
select * from users u join items_users iu on iu.user_id = u.id
where iu.item_id in (1,2,3,4,5)
and u.city_id = 1 limit 10;
Except, instead of an IN condition, I would like to find users that have all the corresponding items. If it helps, assume that at most 5 items will be searched for at a time. Also, I am using Postgres, and don't mind denormalizing if it would help, as it's a read-only system and speed is the highest priority.
It's another case of relational division. We have assembled quite an arsenal of queries to deal with this class of problems here.
In this case, with 5 or more items, I might try:
SELECT u.*
FROM users AS u
WHERE u.city_id = 1
AND EXISTS (
SELECT *
FROM items_users AS a
JOIN items_users AS b USING (user_id)
JOIN items_users AS c USING (user_id)
...
WHERE a.user_id = u.user_id
AND a.item_id = 1
AND b.item_id = 2
AND c.item_id = 3
...
)
LIMIT 10;
It was among the fastest in my tests, and it fits the requirement of multiple criteria on items_users while only returning columns from users.
Read about indexes in the linked answer; these are crucial for performance.
As your tables are read-only, I would also CLUSTER both tables to minimize the number of pages that have to be visited. If nothing else, CLUSTER items_users using a multi-column index on (user_id, item_id).
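A sketch of that last step, with an illustrative index name:
CREATE INDEX items_users_user_item_idx ON items_users (user_id, item_id);
-- Physically reorders the table to match the index order.
CLUSTER items_users USING items_users_user_item_idx;
-- Refresh planner statistics after the rewrite.
ANALYZE items_users;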