Full text search across columns - sql

Sorry for the bad post title but I couldn't summarize this better.
It's better to use an example. Say I have this simple table with two text columns (I'm leaving the other columns out).
Id|Text_1|Text_2
1|a|a b
2|c|a b
Now if I want to search for '"a" and not "b"', in my current implementation I'm getting record 1 back. I understand why: the search condition matches column "Text_1" of record 1 on its own, while for record 2 no single column matches.
However, for the end user this may not be intuitive, as they probably mean to exclude record 1 as well most of the time.
So my question is, if I want to tell SQL Server to do the matching "across all columns" (meaning that if the "NOT" portion is found on ANY column, the record shouldn't match), is it possible?
EDIT: This is what my query would look like for this example:
SELECT Id, TextHits.RANK Rank, Text_1, Text_2 FROM simple_table
JOIN CONTAINSTABLE(simple_table, (Text_1, Text_2), '"a" and not "b"') TextHits
ON TextHits.[KEY] = simple_table.Id
ORDER BY Rank DESC
The actual query is a bit more complicated (more columns, more joins, etc) but this is the general idea :)
Thanks!

The logic is evaluated against each column separately, so if you want an exclusion hit in any one column of a row to exclude the whole row, you should use a NOT EXISTS and break the full-text query out into separate inclusionary and exclusionary parts:
SELECT Id,
       TextHits.RANK Rank,
       Text_1,
       Text_2
FROM simple_table
JOIN CONTAINSTABLE(simple_table, (Text_1, Text_2), '"a"') TextHits
  ON TextHits.[KEY] = simple_table.Id
WHERE NOT EXISTS (SELECT 1
                  FROM CONTAINSTABLE(simple_table, (Text_1, Text_2), '"b"') exclHits
                  WHERE TextHits.[KEY] = exclHits.[KEY])
ORDER BY Rank DESC
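The difference between per-column matching and the inclusion/NOT EXISTS split can be tried out without a full-text index. The following is only a rough sketch: it assumes SQLite through Python's sqlite3 module and stands in for CONTAINSTABLE with a whitespace-padded LIKE, which is not how SQL Server full-text search works internally. Row 3 is an extra, made-up row added so that one row survives the exclusion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simple_table (Id INTEGER PRIMARY KEY, Text_1 TEXT, Text_2 TEXT)"
)
conn.executemany(
    "INSERT INTO simple_table VALUES (?, ?, ?)",
    [(1, "a", "a b"), (2, "c", "a b"), (3, "a", "c")],  # row 3 is hypothetical
)

# Per-column evaluation: a row matches if any SINGLE column
# contains the word "a" and does not contain the word "b".
per_column_sql = """
SELECT Id FROM simple_table
WHERE (' ' || Text_1 || ' ' LIKE '% a %' AND ' ' || Text_1 || ' ' NOT LIKE '% b %')
   OR (' ' || Text_2 || ' ' LIKE '% a %' AND ' ' || Text_2 || ' ' NOT LIKE '% b %')
ORDER BY Id
"""

# Across-columns evaluation: include rows where any column has "a",
# then drop every row where any column has "b" (the NOT EXISTS part).
across_columns_sql = """
SELECT s.Id FROM simple_table s
WHERE (' ' || s.Text_1 || ' ' LIKE '% a %' OR ' ' || s.Text_2 || ' ' LIKE '% a %')
  AND NOT EXISTS (
        SELECT 1 FROM simple_table x
        WHERE x.Id = s.Id
          AND (' ' || x.Text_1 || ' ' LIKE '% b %'
               OR ' ' || x.Text_2 || ' ' LIKE '% b %'))
ORDER BY s.Id
"""

per_column_ids = [r[0] for r in conn.execute(per_column_sql)]
across_columns_ids = [r[0] for r in conn.execute(across_columns_sql)]

print(per_column_ids)      # [1, 3]
print(across_columns_ids)  # [3]
```

Record 1 shows up in the per-column result because Text_1 alone satisfies the condition, and disappears once the exclusion is checked against the whole row.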

Related

How do you JOIN tables to a view using a Vertica DB?

Good morning/afternoon! I was hoping someone could help me out with something that probably should be very simple.
Admittedly, I’m not the strongest SQL query designer. That said, I’ve spent a couple hours beating my head against my keyboard trying to get a seemingly simple three way join working.
NOTE: I'm querying a Vertica DB.
Here is my query:
SELECT A.CaseOriginalProductNumber, A.CaseCreatedDate, A.CaseNumber, B.BU2_Key as BusinessUnit, C.product_number_desc as ModelNumber
FROM pps_sfdc.v_Case A
INNER JOIN reference_data.DIM_PRODUCT_LINE_HIERARCHY B
ON B.PL_Key = A.CaseOriginalProductLine
INNER JOIN reference_data.DIM_PRODUCT C
ON C.product_line_code = A.CaseOriginalProductLine
WHERE B.BU2_Key = 'XWT'
LIMIT 20
I have a view (v_Case) that I’m trying to join to two other tables so I can lookup a value from each of them. The above query returns identical data on everything EXCEPT the last column (see below). It's like it's iterating through the last column to pull out the unique entries, sort of like a "GROUP BY" clause. What SHOULD be happening is that I get unique rows with specific "BusinessUnit" and "ModelNumber" for that record.
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 1
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 2
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 3
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 4
I modeled my solution after this post:
How to deal with multiple lookup tables for beginners of SQL?
What am I doing wrong?
Thank you for any help you can provide.
This is a data issue. As a general rule when troubleshooting these, the column whose value differs between the otherwise identical rows (in this case C.product_number_desc as ModelNumber) usually points at the table causing the problem, which is why I pointed you towards dim_product.
If you receive duplicates, the query below will help identify whether a given table is the cause. Remember that key in this statement can be multiple fields: whatever you are joining the table on.
Select key,count(1) from table group by key having count(1)>1
For the future: don't assume it's your code. Duplicates like this almost always point to dirty data (the other possibility is that you are causing cross joins because the join keys are not correct). If you had commented out the 'c' table and the column that refers to it in the select clause, you would have received one row, so your dupes were coming from the 'c' table here.
Good luck with it
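To see the duplicate-key check in action, here is a small sketch using SQLite through Python's sqlite3 module rather than Vertica. The table contents and the product line code 'PL1' are invented for illustration; only the table and column names are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE v_case (CaseNumber TEXT, CaseOriginalProductLine TEXT);
CREATE TABLE dim_product (product_line_code TEXT, product_number_desc TEXT);
INSERT INTO v_case VALUES ('3002845327', 'PL1');
-- Hypothetical data: three products share one product line code.
INSERT INTO dim_product VALUES
    ('PL1', 'Product 1'), ('PL1', 'Product 2'),
    ('PL1', 'Product 3'), ('PL2', 'Product 9');
""")

# The diagnostic query: which join keys occur more than once?
dup_keys = conn.execute("""
    SELECT product_line_code, COUNT(1)
    FROM dim_product
    GROUP BY product_line_code
    HAVING COUNT(1) > 1
""").fetchall()

# Joining on a non-unique key repeats the single case row once per product.
joined = conn.execute("""
    SELECT A.CaseNumber, C.product_number_desc
    FROM v_case A
    INNER JOIN dim_product C
            ON C.product_line_code = A.CaseOriginalProductLine
    ORDER BY C.product_number_desc
""").fetchall()

print(dup_keys)     # [('PL1', 3)]
print(len(joined))  # 3
```

A non-empty result from the first query means the join key is not unique in that table, so a join on it will multiply rows.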

Modelling database for a small soccer league

The database is quite simple. Below there is a part of a schema relevant to this question
ROUND (round_id, round_number)
TEAM (team_id, team_name)
MATCH (match_id, match_date, round_id)
OUTCOME (team_id, match_id, score)
I have a problem with the query that retrieves data for all matches played. The simple query below of course gives two rows for every match played.
select *
from round r
inner join match m on m.round_id = r.round_id
inner join outcome o on o.match_id = m.match_id
inner join team t on t.team_id = o.team_id
How should I write a query to have the match data in one row?
Or maybe should I redesign the database - drop the OUTCOME table and modify the MATCH table to look like this:
MATCH (match_id, match_date, team_away, team_home, score_away, score_home)?
You can almost generate the suggested change from the original tables using a self join on outcome table:
select o1.team_id team_id_1,
o2.team_id team_id_2,
o1.score score_1,
o2.score score_2,
o1.match_id match_id
from outcome o1
inner join outcome o2 on o1.match_id = o2.match_id and o1.team_id < o2.team_id
Of course, the home and away information cannot be generated this way, so your suggested alternative approach might be better after all. Also, take note of the condition o1.team_id < o2.team_id, which gets rid of the redundant mirrored row for each match (and, more importantly, prevents an outcome row from being joined with itself).
In any case, using this select as part of your join, you can generate one row per match.
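As a quick check of the self join, the sketch below runs it on a few made-up outcome rows, assuming SQLite through Python's sqlite3 module. The team ids and scores are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outcome (team_id INTEGER, match_id INTEGER, score INTEGER);
-- Hypothetical results: match 1 is team 1 vs team 2, match 2 is team 3 vs team 1.
INSERT INTO outcome VALUES (1, 1, 2), (2, 1, 1), (3, 2, 3), (1, 2, 0);
""")

# Pair the two outcome rows of each match; the "<" keeps one of the
# two mirrored pairs and stops a row from pairing with itself.
rows = conn.execute("""
    SELECT o1.team_id  AS team_id_1,
           o2.team_id  AS team_id_2,
           o1.score    AS score_1,
           o2.score    AS score_2,
           o1.match_id AS match_id
    FROM outcome o1
    INNER JOIN outcome o2
            ON o1.match_id = o2.match_id
           AND o1.team_id < o2.team_id
    ORDER BY o1.match_id
""").fetchall()

for row in rows:
    print(row)
# (1, 2, 2, 1, 1)
# (1, 3, 0, 3, 2)
```

Each match now appears once, with the lower team_id always in the first position, which is why home and away cannot be recovered from this shape.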
You fetch two rows for every match played, but team_id and team_name are different in each:
- one for the home team
- one for the away team
So your query is correct.
Using the match table as you describe captures the logic of a game simply and naturally and additionally shows home and away teams which your initial model does not.
You might want to add the round id as a foreign key to round table and perhaps a flag to indicate a match abandoned situation.
Drop OUTCOME. It shouldn't be a separate table, because you have exactly one outcome per match.
You may want to consider how to handle matches that are cancelled; perhaps the scores are null?

Filtering Database Results to Top n Records for Each Value in a Lookup Column

Let's say I have two tables in my database.
TABLE:Categories
ID|CategoryName
01|CategoryA
02|CategoryB
03|CategoryC
and a table that references the Categories and also has a column storing some random number.
TABLE:CategoriesAndNumbers
CategoryType|Number
CategoryA|24
CategoryA|22
CategoryC|105
.....(20,000 records)
CategoryB|3
Now, how do I filter out this data? So, I want to know what the 3 smallest numbers are out of each category and delete the rest. The end result would be like this:
TABLE:CategoriesAndNumbers
CategoryType|Number
CategoryA|2
CategoryA|5
CategoryA|18
CategoryB|3
CategoryB|500
CategoryB|1601
CategoryC|1
CategoryC|4
CategoryC|62
Right now, I can get the smallest numbers between all the categories, but I would like each category to be compared individually.
EDIT: I'm using Access and here's my code so far
SELECT TOP 10 cdt1.sourceCounty, cdt1.destCounty, cdt1.distMiles
FROM countyDistanceTable as cdt1, countyTable
WHERE cdt1.sourceCounty = countyTable.countyID
ORDER BY cdt1.sourceCounty, cdt1.distMiles, cdt1.destCounty
EDIT2: Thanks to Remou, here would be the working query that solved my problem. Thank you!
DELETE
FROM CategoriesAndNumbers a
WHERE a.Number NOT IN (
SELECT Top 3 [Number]
FROM CategoriesAndNumbers b
WHERE b.CategoryType=a.CategoryType
ORDER BY [Number])
You could use something like:
SELECT a.CategoryType, a.Number
FROM CategoriesAndNumbers a
WHERE a.Number IN (
SELECT Top 3 [Number]
FROM CategoriesAndNumbers b
WHERE b.CategoryType=a.CategoryType
ORDER BY [Number])
ORDER BY a.CategoryType
The difficulty with this is that Jet/ACE Top selects duplicate values where they exist, so you will not necessarily get three values, but more, if there are ties. The problem can often be solved with a key field, if one exists :
WHERE a.Number IN (
SELECT Top 3 [Number]
FROM CategoriesAndNumbers b
WHERE b.CategoryType=a.CategoryType
ORDER BY [Number], [KeyField])
However, I do not think it will help in this instance, because the outer table will include ties.
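The effect of ties in the outer table can be reproduced in a small sketch. This assumes SQLite through Python's sqlite3 module, where LIMIT 3 takes the place of Jet/ACE's TOP 3 (and, unlike TOP, never returns extra tied rows from the subquery itself). The sample numbers are made up, with 18 appearing twice in CategoryA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CategoriesAndNumbers (CategoryType TEXT, Number INTEGER);
-- Hypothetical data; CategoryA has a tie on 18.
INSERT INTO CategoriesAndNumbers VALUES
    ('CategoryA', 24), ('CategoryA', 2), ('CategoryA', 18),
    ('CategoryA', 18), ('CategoryA', 5),
    ('CategoryB', 3), ('CategoryB', 2000),
    ('CategoryB', 500), ('CategoryB', 1601);
""")

# For each outer row, keep it if its Number is among the three
# smallest Numbers of the same category.
rows = conn.execute("""
    SELECT a.CategoryType, a.Number
    FROM CategoriesAndNumbers a
    WHERE a.Number IN (
        SELECT b.Number
        FROM CategoriesAndNumbers b
        WHERE b.CategoryType = a.CategoryType
        ORDER BY b.Number
        LIMIT 3)
    ORDER BY a.CategoryType, a.Number
""").fetchall()

for row in rows:
    print(row)
# ('CategoryA', 2)
# ('CategoryA', 5)
# ('CategoryA', 18)
# ('CategoryA', 18)
# ('CategoryB', 3)
# ('CategoryB', 500)
# ('CategoryB', 1601)
```

CategoryA returns four rows because both outer rows with Number 18 match the third-smallest value, even though the subquery itself yields only three values.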
Order by Number and take the first 3, find the largest of those three numbers, and then delete the rows where Number is greater than it.
I imagine it would need to be two separate queries, as your business tier would hold the largest of the 3 results and dynamically build the query to delete the rest.

Explain this short SQL query to me please

"SELECT * from posts A, categories B where A.active='1'
AND A.category=B.CATID order by A.time_added desc
limit $pagingstart, $config[items_per_page]";
I think it selects the rows from the 'posts' table such that the active entry in each row is equal to 1, but I don't understand the rest. Please explain. Thank you.
It selects the columns from Posts (referred to with the alias "A") and the associated row from Categories (referred to as "B") for each post, where:
Posts.Active = 1
The post's category exists in the "Categories" table (if a post doesn't have a matching category in this table, the row won't be returned)
Orders the results by A.Time_added (in descending order, newest to oldest)
Returns just "$config[items_per_page]" rows, after skipping the first "$pagingstart" rows
I'm not sure what brand of SQL this is, as I don't recognize the limit statement or the $variables, but that's the gist.
You'll get rows
from A and B where category and CATID match (the "intersection" bit of a Venn diagram)
The rows for A are filtered to those where Active = 1
sorted by time_added, latest first
limit says y rows starting after row x. Which rows those are is determined by the sort
posts A, categories B is a so-called "implicit JOIN". It returns all possible combinations of records from A and B, which are then filtered by the WHERE conditions.
Explicit join syntax is much more readable:
SELECT *
FROM posts A
JOIN categories B
ON B.CATID = A.category
WHERE A.active='1'
ORDER BY
A.time_added DESC
LIMIT $pagingstart, $config[items_per_page]
This means: "for each record from A, take all records from B whose catid is the same as A's category".
ORDER BY A.time_added DESC returns your posts from latest to earliest.
LIMIT 100, 10 skips the first 100 rows and returns the next 10, that is, the 101st through the 110th posts.
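The offset/count behaviour of LIMIT can be checked with a short sketch. It assumes SQLite through Python's sqlite3 module, which accepts the same "LIMIT offset, count" form as MySQL; the 25 posts, their timestamps, and the single inactive post are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (CATID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, category INTEGER,
                    active TEXT, time_added INTEGER);
INSERT INTO categories VALUES (1, 'news');
""")
# Hypothetical data: 25 posts, newer posts have larger time_added;
# post 14 is inactive.
conn.executemany(
    "INSERT INTO posts VALUES (?, 1, ?, ?)",
    [(i, "0" if i == 14 else "1", i) for i in range(1, 26)],
)

pagingstart, items_per_page = 10, 5  # stand-ins for the PHP variables

rows = conn.execute("""
    SELECT A.id
    FROM posts A
    JOIN categories B ON B.CATID = A.category
    WHERE A.active = '1'
    ORDER BY A.time_added DESC
    LIMIT ?, ?
""", (pagingstart, items_per_page)).fetchall()

ids = [r[0] for r in rows]
print(ids)  # [15, 13, 12, 11, 10]
```

The ten newest active posts (25 down to 16) are skipped, and the inactive post 14 never counts towards the page because the WHERE filter is applied before the limit.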
It looks like this is trying to select all active posts, order them with the newest at the top, and limit the number of records to fit on a page. The semantics of A.active='1' probably mean that the post is active, but I'm guessing.
It looks like MySQL with PHP.
This selects entries from posts and categories, joining them together where posts.category=categories.CATID. It filters out all rows where posts.active!=1, and then orders by descending posts.time_added, returning at most $config[items_per_page] items starting from $pagingstart.
It selects all the active posts (and their category), newest first. However, it has a paging mechanism, so it shows only $config[items_per_page] posts starting at number $pagingstart.
Select the rows from the posts table and the categories table, joined into a single table by the category ID (using what I call a lazy join, but that may just be my opinion and I'm not really a database guy), sorted in descending order by the time added, displaying only $items_per_page records starting at $pagingstart.
It selects all columns from the tables posts and categories where posts.active equals 1 and posts.category matches categories.catid, ordered by posts.time_added. The start and size of the returned page are set by the two variables $pagingstart and $config[items_per_page].
It's saying:
1) Select everything from both Posts & Categories where Posts.Active = 1 and Posts.Category = Category.CATID.
2) The Order by statement then specifies that they should be presented (from top to bottom) with the newest Post.Time_Added first.
3) Finally, the limit clause says (I think, I don't use limit very often): skip the first $pagingstart rows (a variable which has been set at some point), and return at most $config[items_per_page] rows after that.

How do I remove "duplicate" rows from a view?

I have a view which was working fine when I was joining my main table:
LEFT OUTER JOIN OFFICE ON CLIENT.CASE_OFFICE = OFFICE.TABLE_CODE.
However I needed to add the following join:
LEFT OUTER JOIN OFFICE_MIS ON CLIENT.REFERRAL_OFFICE = OFFICE_MIS.TABLE_CODE
Although I added DISTINCT, I still get a "duplicate" row. I say "duplicate" because the second row has a different value.
However, if I change the LEFT OUTER to an INNER JOIN, I lose all the rows for the clients who have these "duplicate" rows.
What am I doing wrong? How can I remove these "duplicate" rows from my view?
Note:
This question is not applicable in this instance:
How can I remove duplicate rows?
DISTINCT won't help you if the rows have any columns that are different. Obviously, one of the tables you are joining to has multiple rows for a single row in another table. To get one row back, you have to eliminate the other multiple rows in the table you are joining to.
The easiest way to do this is to enhance your where clause or JOIN restriction to only join to the single record you would like. Usually this requires determining a rule which will always select the 'correct' entry from the other table.
Let us assume you have a simple problem such as this:
Person: Jane
Pets: Cat, Dog
If you create a simple join here, you would receive two records for Jane:
Jane|Cat
Jane|Dog
This is completely correct if the point of your view is to list all of the combinations of people and pets. However, if your view was instead supposed to list people with pets, or list people and display one of their pets, you hit the problem you have now. For this, you need a rule.
SELECT Person.Name, pets1.Name
FROM Person
LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
WHERE 0 = (SELECT COUNT(pets2.ID)
           FROM Pets pets2
           WHERE pets2.PersonID = pets1.PersonID
             AND pets2.ID < pets1.ID);
What this does is apply a rule that restricts the Pets record in the join to the pet with the lowest ID (the first in the Pets table). The WHERE clause essentially says "where there are no pets belonging to the same person with a lower ID value".
This would yield a one-record result:
Jane|Cat
The rule you'll need to apply to your view will depend on the data in the columns you have, and which of the 'multiple' records should be displayed in the column. However, that will wind up hiding some data, which may not be what you want. For example, the above rule hides the fact that Jane has a Dog. It makes it appear as if Jane only has a Cat, when this is not correct.
You may need to rethink the contents of your view, and what you are trying to accomplish with your view, if you are starting to filter out valid data.
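The lowest-ID rule from the Jane example can be run as a small sketch, assuming SQLite through Python's sqlite3 module. The IDs, and the second person Bob who has no pets, are made up to show that the LEFT JOIN still keeps people without pets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Pets (ID INTEGER PRIMARY KEY, PersonID INTEGER, Name TEXT);
INSERT INTO Person VALUES (1, 'Jane'), (2, 'Bob');   -- Bob is hypothetical
INSERT INTO Pets VALUES (1, 1, 'Cat'), (2, 1, 'Dog');
""")

# Keep only the joined pet for which no pet of the same person
# has a lower ID; people without pets pass because the count is 0.
rows = conn.execute("""
    SELECT Person.Name, pets1.Name
    FROM Person
    LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
    WHERE 0 = (SELECT COUNT(pets2.ID)
               FROM Pets pets2
               WHERE pets2.PersonID = pets1.PersonID
                 AND pets2.ID < pets1.ID)
    ORDER BY Person.ID
""").fetchall()

print(rows)  # [('Jane', 'Cat'), ('Bob', None)]
```

Jane's Dog is filtered out, which is exactly the hidden-data trade-off described above.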
So you added a left outer join that is matching two rows? I presume OFFICE_MIS.TABLE_CODE is not unique in that table? You need to restrict that join to grab only one row. It depends on which row you are looking for, but you can do something like this:
LEFT OUTER JOIN OFFICE_MIS ON
OFFICE_MIS.ID = /* whatever the primary key is? */
(select top 1 om2.ID
from OFFICE_MIS om2
where CLIENT.REFERRAL_OFFICE = om2.TABLE_CODE
order by om2.ID /* change the order to fit your needs */)
If the second row has even one different value, then it is not really a duplicate and should be included.
Instead of using DISTINCT, you could use a GROUP BY.
Group by all the fields that you want to be returned as unique values.
Use MIN/MAX/AVG or any other function to give you one result for fields that could return multiple values.
Example:
SELECT Office.Field1, Client.Field1, MIN(Office.Field2), MIN(Client.Field2)
FROM YourQuery
GROUP BY Office.Field1, Client.Field1
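A GROUP BY with an aggregate can be sketched against a cut-down version of the CLIENT and OFFICE_MIS join. This assumes SQLite through Python's sqlite3 module; the column names ID, NAME and DESCRIPTION and all of the data are invented for illustration, and MIN is just one possible way of picking a single value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CLIENT (ID INTEGER PRIMARY KEY, NAME TEXT, REFERRAL_OFFICE TEXT);
CREATE TABLE OFFICE_MIS (TABLE_CODE TEXT, DESCRIPTION TEXT);
INSERT INTO CLIENT VALUES (1, 'Acme', 'N1'), (2, 'Birch', 'S1');
-- Hypothetical data: TABLE_CODE 'N1' is not unique.
INSERT INTO OFFICE_MIS VALUES
    ('N1', 'North'), ('N1', 'North Annex'), ('S1', 'South');
""")

join_sql = """
    FROM CLIENT
    LEFT OUTER JOIN OFFICE_MIS
                 ON CLIENT.REFERRAL_OFFICE = OFFICE_MIS.TABLE_CODE
"""

# Plain join: Acme appears twice because 'N1' matches two rows.
plain = conn.execute(
    "SELECT CLIENT.NAME, OFFICE_MIS.DESCRIPTION" + join_sql
    + " ORDER BY CLIENT.ID, OFFICE_MIS.DESCRIPTION"
).fetchall()

# Grouped: one row per client, MIN() picks one description.
grouped = conn.execute(
    "SELECT CLIENT.NAME, MIN(OFFICE_MIS.DESCRIPTION)" + join_sql
    + " GROUP BY CLIENT.ID, CLIENT.NAME ORDER BY CLIENT.ID"
).fetchall()

print(len(plain))  # 3
print(grouped)     # [('Acme', 'North'), ('Birch', 'South')]
```

As with any rule that collapses rows, the 'North Annex' value is silently dropped, so the choice of aggregate should reflect which value is actually wanted.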
You could try using Distinct Top 1 but, as Hunter pointed out, if even one column is different then the row should either be included or, if you don't care about or need that column, the column should probably be removed. Any other suggestions would probably require more specific info.
EDIT: When using Distinct Top 1 you need to have an appropriate group by statement. You would really be using the Top 1 part. The Distinct is in there because if there is a tie for Top 1 you'll get an error without having some way to avoid a tie. The two most common ways I've seen are adding Distinct to Top 1 or you could add a column to the query that is unique so that sql would have a way to choose which record to pick in what would otherwise be a tie.