Query unknown number of keywords in Postgres - sql

I'm currently using postgres in node to query all users who have a certain tag associated with their account like so (note: I'm using node-postgres):
query = 'SELECT tags.*, pl.email FROM admin.tags tags
LEFT JOIN gameday.player_settings pl
ON tags.player_id = pl.id
WHERE tags.tag = $1'
client.query(
query,
[tagName],
function(err, results) {
...
[tagName] is then passed into the WHERE clause.
What I'm aiming to do is instead query by an unknown number of tags and return all users who have all of those tags associated with their account. So instead of [tagName] I'd like to pass in an array of unknown length, [tagNames], but I'm not sure how to accomplish this.

You need to turn the question backwards. Instead of:
Which users have all of these tags, you need to ask which users do not have one or more of these tags absent. It's a double negation.
You also need a way to pass the set of tags. The best way to do this, if the client language binding supports it, is as an array-valued query parameter. If the client binding doesn't support array-valued parameters you'll need dynamic SQL.
One formulation might be (untested, since you didn't provide sample schema and data):
SELECT pl.email
FROM gameday.player_settings pl
WHERE NOT EXISTS (
SELECT 1
FROM unnest(?) AS wanted_tags(tag)
LEFT JOIN admin.tags tags
ON tags.tag = wanted_tags.tag
WHERE tags.player_id = pl.id
AND wanted_tags.tag IS NULL
);
Doing a left join and filtering for IS NULL is called a left anti-join. It keeps the rows where the left-join condition does not match. So in this case, we retain a tag from our wanted_tags array only if there is no matching tag associated with this player. If any tags are left, the WHERE NOT EXISTS returns false, so the player is excluded.
Double-think, isn't it? It's easy to make mistakes with this so test.
Here ? should be your programming language PostgreSQL database binding's query parameter placeholder. I don't know what node.js's is. This will only work if you can pass an array as a query parameter in node. If not, you'll have to use dynamic SQL to generate an ARRAY['x','y','z'] expression or a (VALUES ('x'), ('y'), ('z')) subquery.
P.S. Please provide sample schema and data with questions when possible. http://sqlfiddle.com/ is handy.

Related

SQL Server distinct not being used properly

I'm trying to select just one account using SQL Server but am getting the following error:
ERROR: The text data type cannot be selected as DISTINCT because
it is not comparable. Error Code: 421
with the following statement:
select DISTINCT ad.*,
acc.companyname,
acc.accountnumber
from address ad
join AddressLink al on al.AddressID = ad.id
join account acc on acc.ID = al.ParentID
where acc.accountnumber like '11227'
What have I done wrong?
Edit:
New query:
select address.ID,
address.StreetAddress1,
address.StreetAddress2,
address.City,
Address.State,
Address.PostalCode,
Address.ClassTypeID,
account.companyname,
account.accountnumber,
addresslink.ID as addressLinkID,
addresslink.addresstypeid
from address
join AddressLink on address.id = addresslink.AddressID
join account on addresslink.ParentID = account.ID
where account.CompanyName like 'company name'
All the company names that I've had to blur are identical.
Try:
select ad.*,
l.companyname,
l.accountnumber
from address ad
join (select DISTINCT al.AddressID,
acc.companyname,
acc.accountnumber
from account acc
join AddressLink al on acc.ID = al.ParentID
where acc.accountnumber like '11227') l
on l.AddressID = ad.id
"Distinct", in the context you have is trying to do distinct on ALL columns. That said, there are some data types that are NOT converable, such as TEXT. So, if your table has some of these non "Distinctable" column types exists, that is what is crashing your query.
However, to get around this, if you do something like
CONVERT( char(60), YourTextColumn ) as YourTextColumn,
It should get that for you... at least its now thinking the final column type is "char"acter and CAN compare it.
You should check the data types of the columns in the address table. My guess is that one or more of them has the data type text, ntext or image.
One of the restrictions of using text, ntext or image data types is that columns defined of these data types cannot be used as part of a SELECT statement that includes the DISTINCT clause.
For what it's worth, the MSDN article for ntext, text, and image (Transact-SQL) recommends avoiding these data types and use nvarchar(max), varchar(max), and varbinary(max) instead. You may want to consider changing how that table is defined.
The accepted answer from Mark B shows a subquery (good idea to limit the domain of the DISTINCT) on AddressLink.AddressId, Account.CompanyName, and Account.AccountNumber.
Let me ask this: Does AddressLink allow more than one record to have the same value in the ParentId and AddressId fields?
If not, and assuming that Mark B's answer works, then just remove the DISTINCT because you're never going to get any duplicates inside of that subquery.
Leaving the DISTINCT in causes a performance hit because the DB has to create a temporary table that is either indexed with a btree or a hash and it has to insert every value returned by the subquery into that table to check if it invalidates the uniqueness constraint on those three fields. Note that the "optimizer" doesn't know that there won't be any dupes... if you tell it to check for DISTINCT, it will check it... With a btree index this is going to cause O(n log n) work on the number of rows returned; with a hash it would cause O(n) work but who knows how big the constant factor is in relation to the other work you're doing (it's probably larger than everything else you're doing meaning this could make this run half as fast as without the DISTINCT).
So my answer is Mark B's answer without the DISTINCT in the subquery. Let me know if AddressLink does allow repeats (can't imagine why it would).

In an EXISTS can my JOIN ON use a value from the original select

I have an order system. Users with can be attached to different orders as a type of different user. They can download documents associated with an order. Documents are only given to certain types of users on the order. I'm having trouble writing the query to check a user's permission to view a document and select the info about the document.
I have the following tables and (applicable) fields:
Docs: DocNo, FileNo
DocAccess: DocNo, UserTypeWithAccess
FileUsers: FileNo, UserType, UserNo
I have the following query:
SELECT Docs.*
FROM Docs
WHERE DocNo = 1000
AND EXISTS (
SELECT * FROM DocAccess
LEFT JOIN FileUsers
ON FileUsers.UserType = DocAccess.UserTypeWithAccess
AND FileUsers.FileNo = Docs.FileNo /* Errors here */
WHERE DocAccess.UserNo = 2000 )
The trouble is that in the Exists Select, it does not recognize Docs (at Docs.FileNo) as a valid table. If I move the second on argument to the where clause it works, but I would rather limit the initial join rather than filter them out after the fact.
I can get around this a couple ways, but this seems like it would be best. Anything I'm missing here? Or is it simply not allowed?
I think this is a limitation of your database engine. In most databases, docs would be in scope for the entire subquery -- including both the where and in clauses.
However, you do not need to worry about where you put the particular clause. SQL is a descriptive language, not a procedural language. The purpose of SQL is to describe the output. The SQL engine, parser, and compiler should be choosing the most optimal execution path. Not always true. But, move the condition to the where clause and don't worry about it.
I am not clear why do you need to join with FileUsers at all in your subquery?
What is the purpose and idea of the query (in plain English)?
In any case, if you do need to join with FileUsers then I suggest to use the inner join and move second filter to the WHERE condition. I don't think you can use it in JOIN condition in subquery - at least I've never seen it used this way before. I believe you can only correlate through WHERE clause.
You have to use aliases to get this working:
SELECT
doc.*
FROM
Docs doc
WHERE
doc.DocNo = 1000
AND EXISTS (
SELECT
*
FROM
DocAccess acc
LEFT OUTER JOIN
FileUsers usr
ON
usr.UserType = acc.UserTypeWithAccess
AND usr.FileNo = doc.FileNo
WHERE
acc.UserNo = 2000
)
This also makes it more clear which table each field belongs to (think about using the same table twice or more in the same query with different aliases).
If you would only like to limit the output to one row you can use TOP 1:
SELECT TOP 1
doc.*
FROM
Docs doc
INNER JOIN
FileUsers usr
ON
usr.FileNo = doc.FileNo
INNER JOIN
DocAccess acc
ON
acc.UserTypeWithAccess = usr.UserType
WHERE
doc.DocNo = 1000
AND acc.UserNo = 2000
Of course the second query works a bit different than the first one (both JOINS are INNER). Depeding on your data model you might even leave the TOP 1 out of that query.

"ambiguous column name" in SQLite INNER JOIN

I have two tables in a SQLite DB, INVITEM and SHOPITEM. Their shared attribute is ItemId and I want to perform an INNER JOIN. Here's the query:
SELECT INVITEM.CharId AS CharId,
INVITEM.ItemId AS ItemId
FROM (INVITEM as INVITEM
INNER JOIN SHOPITEM AS SHOPITEM
ON SHOPITEM.ItemId = INVITEM.ItemId)
WHERE ItemId = 3;
SQLite doesn't like it :
SQL error: ambiguous column name: ItemId
The error goes away if I write WHERE INVITEM.ItemId = 3, but since the WHERE condition is more or less user-specified, I rather make it work without having to specify the table. NATURAL JOIN seems to solve the issue, but I'm not sure if the solution is general enough (ie I could use in this case, but I'm not sure if I can use in every case)
Any alternate SQL syntax that would fix the problem?
I would write this query this way:
SELECT i.CharId AS CharId, i.ItemId AS ItemId
FROM INVITEM as i INNER JOIN SHOPITEM AS s USING (ItemId)
WHERE i.ItemId = 3;
I'm using the USING (ItemId) syntax which is just a matter of taste. It's equivalent to ON (i.ItemID = s.ItemID).
But I resolved the ambiguity by qualifying i.ItemID in the WHERE clause. You would think this is unnecessary, since i.ItemID = s.ItemID. They're both equal by commutativity, so there's no semantic ambiguity. But apparently SQLite isn't smart enough to know that.
I don't like to use NATURAL JOIN. It's equivalent to an equi-join of every column that exists in both tables with the same name. I don't like to use this because I don't want it to compare columns that I don't want it to, simply because they have the same name.
I would steer clear of allowing the user to write SQL clauses directly. This is the source of SQL Injection vulnerabilities.
If you need the query to be flexible, try parsing the user's input and adding the appropriate where clause.
Here is some C# code to show the general idea
// from user input
string user_column = "ItemID";
string user_value = "3";
string sql = "SELECT INVITEM.CharId AS CharId, INVITEM.ItemId AS ItemId FROM (INVITEM as INVITEM INNER JOIN SHOPITEM AS SHOPITEM ON SHOPITEM.ItemId = INVITEM.ItemId) ";
if (user_column == "ItemID")
{
// using Int32.Parse here to prevent rubbish like "0 OR 1=1; --" being entered.
sql += string.Format("WHERE INVITEM.ItemID={0}",Int32.Parse(user_value));
}
Obviously if you're dealing with more than one clause, you'd have to substitute AND for WHERE in subsequent clauses.
Just change your column alias to something similar, but unique (such as ITEM_ID).

SELECT with ORs including table joins

I've got a database with three tables: Books (with book details, PK is CopyID), Keywords (list of keywords, PK is ID) and KeywordsLink which is the many-many link table between Books and Keywords with the fields ID, BookID and KeywordID.
I'm trying to make an advanced search form in my app where you can search on various criteria. At the moment I have it working with Title, Author and Publisher (all from the Book table). It produces SQL like:
SELECT * FROM Books WHERE Title Like '%Software%' OR Author LIKE '%Spolsky%';
I want to extend this search to also search using tags - basically to add another OR clause to search the tags. I've tried to do this by doing the following
SELECT *
FROM Books, Keywords, Keywordslink
WHERE Title LIKE '%Joel%'
OR (Name LIKE '%good%' AND BookID=Books.CopyID AND KeywordID=Keywords.ID)
I thought using the brackets might separate the 2nd part into its own kinda clause, so the join was only evaluated in that part - but it doesn't seem to be so. All it gives me is a long list of multiple copies of the one book that satisfies the Title LIKE '%Joel%' bit.
Is there a way of doing this using pure SQL, or would I have to use two SQL statements and combine them in my app (removing duplicates in the process).
I'm using MySQL at the moment if that matters, but the app uses ODBC and I'm hoping to make it DB agnostic (might even use SQLite eventually or have it so the user can choose what DB to use).
You need to join the 3 tables together, which gives you a tablular resultset. You can then check any columns you like, and make sure you get distinct results (i.e. no duplicates).
Like this:
select distinct b.*
from books b
left join keywordslink kl on kl.bookid = b.bookid
left join keywords k on kl.keywordid = k.keywordid
where b.title like '%assd%'
or k.keyword like '%asdsad%'
You should also try to avoid starting your LIKE values with a percent sign (%), as this means SQL Server can't use an index on that column and has to perform a full (and slow) table scan. This starts to make your query into a "starts with" query.
Maybe consider the full-text search options in SQL Server, also.
What you've done here is made a cartesian result set by having the tables joined with the commas but not having any join criteria. Switch your statements to use outer join statements and that should allow you to reference the keywords. I don't know your schema, but maybe something like this would work:
SELECT
*
FROM
Books
LEFT OUTER JOIN KeywordsLink ON KeywordsLink.BookID = Books.CopyID
LEFT OUTER JOIN Keywords ON Keywords.ID = KeywordsLink.KeywordID
WHERE Books.Title LIKE '%JOEL%'
OR Keywords.Name LIKE '%GOOD%'
Use UNION.
(SELECT Books.* FROM <first kind of search>)
UNION
(SELECT Books.* FROM <second kind of search>)
The point is that you could write two (or more) simple and efficient queries instead of one complicated query that tries to do everything at once.
If number of resulting rows is low, then UNION will have very little overhead (and you can use faster UNION ALL if you don't have duplicates or don't care about them).
SELECT * FROM books WHERE title LIKE'%Joel%' OR bookid IN
(SELECT bookid FROM keywordslink WHERE keywordid IN
(SELECT id FROM keywords WHERE name LIKE '%good%'))
Beware that older versions of MySQL didn't like subselects. I think they've fixed that.
You must also limit the product of the join by specifying something like
Books.FK1 = Keywords.FK1 and
Books.FK2 = Keywordslink.FK2 and
Keywords.FK3 = Keywordslink.FK3
But i don't know your exact data model so your solution may be slightly different.
I'm not aware of any way to accomplish a "conditional join" in SQL. I think you'll be best served with executing the two statements separately and combining them in the application. This approach is also more likely to stay DB-agnostic.
It looks like Neil Barnwell has covered the answer that I would have given, but I'll add one thing...
Books can have more than one author. If your data model is really designed as your query implies you might want to consider changing it to accommodate that fact.

Object Relational Mapping Issues: Suggestions needed

I've been trying to come up with a good design pattern for mapping data contained in relational databases to the business objects I've created but I keep hitting a wall.
Consider the following tables:
TYPE: typeid, description
USER: userid, username, usertypeid->TYPE.typeid, imageid->IMAGE.imageid
IMAGE: imageid, location, imagetypeid->TYPE.typeid
I would like to gather all the information regarding a specific user. Creating a query for this isn't too difficult.
SELECT u.*, ut.*, i.*, it.* FROM user u
INNER JOIN type ut ON ut.typeid = u.usertypeid
INNER JOIN image i ON i.imageid = u.imageid
INNER JOIN type it ON it.typeid = i.imagetypeid
WHERE u.userid = #userid
The problem is that the field names collide and then I'm forced to alias every single field which gets out of hand very quickly.
Does anyone have a decent design pattern for this kind of thing?
I've thought about retrieving multiple results from a single stored procedure and then using a dataset to iterate through each one but I'm worried that some performance issues might bite me in the butt later. For example instead of the above query something like:
SELECT u.*, t.* FROM user u
INNER JOIN type t ON t.typeid = u.usertypeid
WHERE u.userid = #userid;
SELECT i.*, t.* FROM image i
INNER JOIN type t ON t.typeid = i.imagetypeid
INNER JOIN user u ON u.imageid = i.imageid
WHERE u.userid = #userid;
Does that seem like a decent solution? Can anyone foresee any issues with this approach?
Never use the SQL * wildcard in production code. Always spell out all the columns you want to retrieve.
Then aliasing some of them doesn't seem like such a huge amount of extra work.
Re your comment asking for background and reasoning:
Sometimes you don't really need every column from all tables, and fetching them can be needlessly costly (especially for large strings and blobs). There is no SQL syntax for "all columns except the following exceptions."
You can't alias columns that you fetch using the wildcard. Once you need to alias any of the columns, you need to expand the wildcard to list all the columns explicitly.
If the table structure changes, e.g. columns are renamed, reordered, dropped, or added, then the wildcard fetches them all, by position as defined in the tables. This may seem like a convenience, but not when your application depends on columns being in the result set by a given name or in a given position. You can get mysterious bugs where your application displays columns in the wrong order (if referencing columns by position), or shows them as blank (if referencing columns by name).
However, if the SQL query names columns explicitly, you can employ the "Fail Early" principle. This helps debugging, because it leads you directly to the SQL query that needs to be edited to account for the schema change.