AND conditions in many-to-many relation - sql

Say I have three tables, a table of users, a table of around 500 different items, and the corresponding join table. What I would like to do is:
select * from users u join items_users iu on iu.user_id = u.id
where iu.item_id in (1,2,3,4,5)
and u.city_id = 1 limit 10;
Except, instead of an IN condition, I would like to find users that have all the corresponding items. If it helps, assume that at most 5 items will be searched for at a time. Also, I am using Postgres, and I don't mind denormalizing if it would help, as it's a read-only system and speed is the highest priority.

It's another case of relational division. We have assembled quite an arsenal of queries to deal with this class of problems here.
In this case, with 5 or more items, I might try:
SELECT u.*
FROM users AS u
WHERE u.city_id = 1
AND EXISTS (
SELECT *
FROM items_users AS a
JOIN items_users AS b USING (user_id)
JOIN items_users AS c USING (user_id)
...
WHERE a.user_id = u.id
AND a.item_id = 1
AND b.item_id = 2
AND c.item_id = 3
...
)
LIMIT 10;
It was among the fastest in my tests, and it fits the requirement of multiple criteria on items_users while only returning columns from users.
Read about indexes at the linked answer. These are crucial for performance.
As your tables are read-only I would also CLUSTER both tables, to minimize the number of pages that have to be visited. If nothing else, CLUSTER items_users using a multi-column index on (user_id, item_id).
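To make the EXISTS pattern concrete, here is a minimal sketch that builds the self-joins for an arbitrary list of item ids. It runs against an in-memory SQLite database purely as a stand-in for Postgres; the helper name users_with_all_items and the sample rows are made up for illustration, and the item list is assumed to be non-empty.

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres schema described in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, city_id INTEGER);
    CREATE TABLE items_users (user_id INTEGER, item_id INTEGER,
                              PRIMARY KEY (user_id, item_id));
    INSERT INTO users VALUES (1, 1), (2, 1), (3, 2);
    -- user 1 has items 1,2,3; user 2 lacks item 3; user 3 lives in another city
    INSERT INTO items_users VALUES
        (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3);
""")

def users_with_all_items(conn, city_id, item_ids, limit=10):
    """Hypothetical helper: one items_users self-join per wanted item."""
    aliases = [f"iu{i}" for i in range(len(item_ids))]
    joins = "".join(f" JOIN items_users AS {a} USING (user_id)" for a in aliases[1:])
    conds = " AND ".join(f"{a}.item_id = ?" for a in aliases)
    sql = (
        "SELECT u.id FROM users AS u "
        "WHERE u.city_id = ? AND EXISTS ("
        f"SELECT 1 FROM items_users AS {aliases[0]}{joins} "
        f"WHERE {aliases[0]}.user_id = u.id AND {conds}) "
        "ORDER BY u.id LIMIT ?"
    )
    # Placeholders appear in this order: city, item ids, limit.
    return [row[0] for row in conn.execute(sql, [city_id, *item_ids, limit])]

matching = users_with_all_items(conn, city_id=1, item_ids=[1, 2, 3])
print(matching)
```

Only the aliases are interpolated into the SQL string; the item ids themselves are passed as bound parameters.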

Related

SQL SELECT * FROM 2 tables

I am building a small database app for friends where table 1 is contacts and table 2 is users. The email address appears in both (in users as the logged-in user, and in contacts as the owner of the contact).
SELECT *
FROM contacts
WHERE contacts.username = users.email
I am trying to show all contacts fields where username is equal to the email of the currently logged-in user.
Thank you very much!
It sounds like you're trying to JOIN two tables together. Ideally, you don't want to use the email as the primary key on a table (the smaller the key, the faster your JOIN will be); a better option would be to add an auto-incrementing Id (integer) to both the Contacts and Users tables, set as the primary key (unique identifier). Joining on integers is much faster: an INT takes 4 bytes per row, whereas a string in MySQL takes 1 byte per character (with latin1 encoding) plus 1 length byte.
Anyway, back to the original question. I believe the query you're looking for (MySQL syntax) is:
SELECT c.Id, c.Col1, u.Col2, ...
FROM contacts AS c
INNER JOIN users AS u ON u.email = c.username
Additionally, I would avoid the use of *, as it slows down the query a bit. Instead, try to specify the exact columns you need.
Try the following. Also, I would suggest you learn about joins in SQL.
SELECT *
FROM contacts
INNER JOIN
users on contacts.username = users.email
Use Inner Join:
SELECT *
FROM contacts as c
INNER JOIN
users as u on u.email = c.username
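For anyone who wants to try the join end to end, here is a small sketch using an in-memory SQLite database instead of MySQL. The name column, the sample rows and the contacts_for helper are assumptions added for the example; only contacts.username and users.email come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
    -- contacts.username stores the owner's email, as in the question
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, username TEXT, name TEXT);
    INSERT INTO users (email) VALUES ('ann@example.com'), ('bob@example.com');
    INSERT INTO contacts (username, name) VALUES
        ('ann@example.com', 'Carl'),
        ('ann@example.com', 'Dana'),
        ('bob@example.com', 'Erin');
""")

def contacts_for(conn, email):
    """Hypothetical helper: names of the contacts owned by the given user."""
    rows = conn.execute(
        "SELECT c.id, c.name "
        "FROM contacts AS c "
        "INNER JOIN users AS u ON u.email = c.username "
        "WHERE u.email = ? "
        "ORDER BY c.id",
        (email,),
    )
    return [name for _id, name in rows]

ann_contacts = contacts_for(conn, "ann@example.com")
print(ann_contacts)
```

The WHERE clause on the logged-in user's email is what restricts the result to that user's contacts; without it the join returns every contact that has a matching user.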

Where can I find usage statistics in Redshift?

First of all, thank you for your help!
I want to find out which tables in the database are most heavily used, i.e. the number of users that query each table, the number of times it was queried, the resources consumed by users per table, the total time the tables were queried, and any other useful data.
For now I would limit the analysis to 9 specific tables.
I tried using stl_scan and pg_user with the following two queries:
SELECT
s.perm_table_name AS table_name,
count(*) AS qty_query,
count(DISTINCT s.userid) AS qty_users
FROM stl_scan s
JOIN pg_user b
ON s.userid = b.usesysid
JOIN temp_mone_tables tmt
ON tmt.table_id = s.tbl AND tmt.table = s.perm_table_name
WHERE s.userid > 1
GROUP BY 1
ORDER BY 1;
SELECT
b.usename AS user_name,
count(*) AS qty_scans,
count(DISTINCT s.tbl) AS qty_tables,
count(DISTINCT trunc(starttime)) AS qty_days
FROM stl_scan s
JOIN pg_user b
ON s.userid = b.usesysid
JOIN temp_mone_tables tmt
ON tmt.table_id = s.tbl AND tmt.table = s.perm_table_name
WHERE s.userid > 1
GROUP BY 1
ORDER BY 1;
temp_mone_tables is a temporary table that contains the id and name of the tables I'm interested in.
With these queries I'm able to get some information, but I need more details. Surprisingly, there's not much information online about this kind of statistics.
Again, thank you all in advance!
Nice work! You are on the right track using the stl_scan table. I'm not clear what further details you're looking for.
For detailed metrics on resource usage you may want to use the SVL_QUERY_METRICS_SUMMARY view. Note that this data is summarized by query not table because a query is the primary way resources are utilized.
Generally, have a look at the admin queries (and views) in our Redshift Utils library on GitHub, particularly v_get_tbl_scan_frequency.sql
Thanks to Joe Harris' answer I was able to add a lot of information to my previous query. With svl_query_metrics_summary joined to stl_scan you get important data about resource consumption, and this information can be extended by joining them to the many views listed in Joe's answer.
For me the solution begins with the following query:
SELECT *
FROM stl_scan ss
JOIN pg_user pu
ON ss.userid = pu.usesysid
JOIN svl_query_metrics_summary sqms
ON ss.query = sqms.query
JOIN temp_mone_tables tmt
ON tmt.table_id = ss.tbl AND tmt.table = ss.perm_table_name
The query gives you a lot of data that can be summarized in whichever way you need.
Remember that temp_mone_tables is a temp table that contains the table id and name of the tables I'm interested in.
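One possible way to summarize that join per table is sketched below. It runs on SQLite with mock tables that only imitate the handful of columns used (query, userid, tbl, perm_table_name from stl_scan; query, query_cpu_time, query_execution_time from svl_query_metrics_summary), so check the column names against your own cluster before relying on it. The sample rows are invented, and the temp_mone_tables filter is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Mock stand-ins for the Redshift system objects; only the columns used here.
    CREATE TABLE stl_scan (query INTEGER, userid INTEGER, tbl INTEGER,
                           perm_table_name TEXT);
    CREATE TABLE svl_query_metrics_summary (query INTEGER,
                                            query_cpu_time INTEGER,
                                            query_execution_time INTEGER);
    -- A query can log several scan rows for the same table, hence the repeats.
    INSERT INTO stl_scan VALUES
        (10, 100, 1, 'orders'), (10, 100, 1, 'orders'),
        (11, 101, 1, 'orders'),
        (12, 100, 2, 'customers'), (12, 100, 2, 'customers');
    INSERT INTO svl_query_metrics_summary VALUES (10, 4, 7), (11, 1, 2), (12, 3, 5);
""")

sql = """
    SELECT s.perm_table_name,
           COUNT(*)                    AS qty_queries,
           COUNT(DISTINCT s.userid)    AS qty_users,
           SUM(m.query_cpu_time)       AS cpu_time,
           SUM(m.query_execution_time) AS exec_time
    FROM (
        -- Collapse repeated scan rows so each query counts once per table.
        SELECT DISTINCT query, userid, tbl, perm_table_name FROM stl_scan
    ) AS s
    JOIN svl_query_metrics_summary AS m ON m.query = s.query
    GROUP BY s.perm_table_name
    ORDER BY s.perm_table_name
"""
summary = conn.execute(sql).fetchall()
for row in summary:
    print(row)
```

Note that the metrics are per query, so a query that scans several tables contributes its full CPU and execution time to each of them; the totals are an indicator of usage rather than an exact per-table cost.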

Output identical fields names of two LEFT JOIN tables Sql

I have two tables, with about 20 columns each
users:
id_user user ..... status token
----------------------------------
2 A 0 XdAQ
posts:
id_user post ..... status token
-------------------------------------
3 hi 1 sDyTMl
Query:
SELECT u.*,p.*
FROM posts as p
LEFT JOIN users as u ON u.id_user = p.id_user
WHERE p.id_post = 3
LIMIT 1
So in PHP, any value can then be retrieved:
....
$status=$a['status'];
$token=$a['token'];
I want to return all the fields of each table to build the post content. The problem is that the identical column names in the two tables conflict. There are more than 20 columns in each of my real tables, so I don't think writing out every column name with an alias is the way to go. Is there a way to alias only the identical columns that are in conflict?
You really should list the specific columns that you want. This is the safest way to retrieve values from the table.
If the only column that is in common is the one used for the join, you can use the USING clause:
SELECT *
FROM posts p LEFT JOIN
users as u
USING (id_user)
WHERE p.id_post = 3
LIMIT 1;
The USING clause is ANSI standard, but not all databases support it. When you use it, only one version of id_user (the join column) is in the columns returned by the SELECT *. In a LEFT JOIN, it is the version with a value.
If you have other columns with the same name, you need to use column aliases. One short-cut is to take all columns from one table and name the columns in the other:
SELECT u.*, p.col1 as p_col1, . . .
FROM posts p LEFT JOIN
users as u
USING (id_user)
WHERE p.id_post = 3
LIMIT 1;
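If typing 40 aliases by hand is the real obstacle, the select list can be generated from the table metadata. The sketch below is one way to do it, shown with Python and SQLite rather than PHP and MySQL; PRAGMA table_info is SQLite-specific (in MySQL you would read the column names from information_schema.columns instead), and the helper names, the id_post column and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by column name
conn.executescript("""
    CREATE TABLE users (id_user INTEGER, user TEXT, status INTEGER, token TEXT);
    CREATE TABLE posts (id_post INTEGER, id_user INTEGER, post TEXT,
                        status INTEGER, token TEXT);
    INSERT INTO users VALUES (2, 'A', 0, 'XdAQ');
    INSERT INTO posts VALUES (3, 2, 'hi', 1, 'sDyTMl');
""")

def column_names(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def select_list(conn, tables):
    """tables maps alias -> table name; only clashing columns get a prefix."""
    cols = {alias: column_names(conn, table) for alias, table in tables.items()}
    counts = {}
    for names in cols.values():
        for name in names:
            counts[name] = counts.get(name, 0) + 1
    parts = []
    for alias, names in cols.items():
        for name in names:
            out = f"{alias}_{name}" if counts[name] > 1 else name
            parts.append(f"{alias}.{name} AS {out}")
    return ", ".join(parts)

sql = (
    f"SELECT {select_list(conn, {'p': 'posts', 'u': 'users'})} "
    "FROM posts AS p LEFT JOIN users AS u ON u.id_user = p.id_user "
    "WHERE p.id_post = ? LIMIT 1"
)
row = dict(conn.execute(sql, (3,)).fetchone())
print(row)
```

Columns that exist in only one table keep their plain name (post, user), while the shared ones come back as p_status, u_status, p_token, u_token and so on, so nothing is overwritten in the result array.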

Configuring Merge Join in PostgreSQL

I'm using PostgreSQL with big tables, and my query takes too much time.
I have two tables. The first one has about 6 million rows (data table), and the second one has about 30000 rows (users table).
Each user has about 200 rows in data table.
Later, data and users tables may increase up to 30 times.
My query is:
SELECT d.name, count(*) c
FROM users AS u JOIN data AS d on d.id = u.id
WHERE u.language = 'eng' GROUP BY d.name ORDER BY c DESC LIMIT 10;
90% of users have the eng language, and the query takes 7 seconds. Every column is indexed!
I read about Merge Join and that it should be really fast, so I sorted the tables by id and forced a Merge Join, but the time increased to 20 seconds.
I suppose the table configuration is wrong, but I don't know how to fix it.
Should I make other improvements?
For this query:
SELECT d.name, count(*) c
FROM users u JOIN
data d
on d.id = u.id
WHERE u.language = 'eng'
GROUP BY d.name
ORDER BY c DESC
LIMIT 10;
First, try indexes: users(language, id), data(id, name). See if this speeds up the query.
Second, what is d.name? Can a user have more than one of them? Is there a table of valid values? Depending on the answers to these questions, there may be other ways to structure the query.
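As a rough illustration of the two suggested indexes, here is the same schema and query in an in-memory SQLite database. The index names and the sample rows are made up, and SQLite's planner is not PostgreSQL's, so this only shows the DDL and that the query still returns the expected counts; on the real tables you would compare timings with EXPLAIN (ANALYZE) before and after creating the indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, language TEXT);
    CREATE TABLE data (id INTEGER, name TEXT);
    -- The two composite indexes from the answer (index names are arbitrary).
    CREATE INDEX users_language_id ON users (language, id);
    CREATE INDEX data_id_name ON data (id, name);
    INSERT INTO users VALUES (1, 'eng'), (2, 'eng'), (3, 'fra');
    INSERT INTO data VALUES (1, 'a'), (1, 'b'), (2, 'a'), (3, 'a'), (3, 'c');
""")

top = conn.execute("""
    SELECT d.name, COUNT(*) AS c
    FROM users AS u
    JOIN data AS d ON d.id = u.id
    WHERE u.language = 'eng'
    GROUP BY d.name
    ORDER BY c DESC, d.name  -- name added as a tie-breaker for a stable order
    LIMIT 10
""").fetchall()
print(top)
```

Both indexes contain every column the query touches in each table, which is what gives the database a chance to answer from the indexes alone.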

What are some alternatives to a NOT IN query?

Let's say we have a database that records all the Movies a User has rated. Each rating is recorded in a MovieRating table.
When we are looking for movies user #1234 hasn't seen yet:
SELECT *
FROM Movies
WHERE id NOT IN
(SELECT DISTINCT movie_id FROM MovieRating WHERE user_id = 1234);
Querying NOT IN can be very expensive as the size of MovieRating grows. Assume MovieRating can have 100,000+ rows.
My question is: what are some more efficient alternatives to the NOT IN query? I've heard of the LEFT OUTER JOIN and NOT EXISTS queries, but is there anything else? Is there any way I can design this database differently?
A correlated sub-query using WHERE NOT EXISTS() is potentially your most efficient option if you have to do this, but you should test performance against your data.
You may also want to consider limiting your results both in terms of the select list (don't use *) and only getting TOP n rows. That is, you may not need 100k+ movies if the user hasn't seen them. You may want to page the results.
SELECT *
FROM Movies m
WHERE NOT EXISTS (SELECT 1
FROM MovieRating r
WHERE r.user_id = 1234
AND r.movie_id = m.id)
This is a mock query, because I don't have a db to test this, but something along the lines of the following should work.
select m.* from Movies m
left join MovieRating mr on mr.movie_id = m.id and mr.user_id = 1234
where mr.movie_id is null
That joins each movie to the rating user 1234 gave it, if there is one. The where clause then keeps only the rows where no rating matched, which are the movies the user hasn't rated.
You can try this :
SELECT M.*
FROM Movies as M
LEFT OUTER JOIN
MovieRating as MR on M.id = MR.movie_id
and MR.user_id = 1234
WHERE MR.movie_id IS NULL
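To check that the three formulations agree, here is a small sketch that runs NOT IN, NOT EXISTS and the LEFT JOIN ... IS NULL anti-join side by side on an in-memory SQLite database. The title and rating columns and the sample rows are invented; only Movies.id, MovieRating.user_id and MovieRating.movie_id come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE MovieRating (user_id INTEGER, movie_id INTEGER, rating INTEGER);
    INSERT INTO Movies VALUES (1, 'A'), (2, 'B'), (3, 'C');
    -- user 1234 rated movies 1 and 3; movie 2 was rated by someone else only
    INSERT INTO MovieRating VALUES (1234, 1, 5), (1234, 3, 2), (99, 2, 4);
""")

queries = {
    "not_in": """
        SELECT m.id FROM Movies AS m
        WHERE m.id NOT IN (SELECT movie_id FROM MovieRating WHERE user_id = ?)
        ORDER BY m.id
    """,
    "not_exists": """
        SELECT m.id FROM Movies AS m
        WHERE NOT EXISTS (SELECT 1 FROM MovieRating AS r
                          WHERE r.user_id = ? AND r.movie_id = m.id)
        ORDER BY m.id
    """,
    "left_join": """
        SELECT m.id FROM Movies AS m
        LEFT JOIN MovieRating AS mr
               ON mr.movie_id = m.id AND mr.user_id = ?
        WHERE mr.movie_id IS NULL
        ORDER BY m.id
    """,
}

results = {name: [row[0] for row in conn.execute(sql, (1234,))]
           for name, sql in queries.items()}
print(results)
```

The user filter has to sit in the ON clause of the LEFT JOIN; moving it to WHERE would discard the unmatched rows the query is looking for. Also keep in mind that NOT IN returns no rows at all if the subquery yields a NULL movie_id, which the other two forms do not suffer from.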