MSSQL Full Text Search - search only part of a table's rows

Is there a way to limit a full-text search to specific rows in a table?
For example, on a big table I want to search only the rows with a specific Company_ID:
select *
from Actions
where Company_ID=1253
and CONTAINS(ItemDesc, 'ABC')
SELECT
AC.*,
col1.RANK
FROM Actions AC
INNER JOIN CONTAINSTABLE(Actions, ItemDesc, 'ABC',50) as col1
on col1.[KEY] = AC.ActionId
where Company_ID=1253
I tried both examples, and I think the search first runs over all the table's rows and only then filters by Company_ID.
I'm looking for a way to limit the rows before the search runs.
Thanks.

How does something like this behave?
SELECT *
FROM
(
SELECT *
FROM Actions
WHERE Company_ID = 1253
) AS CompanyFilteredActions
WHERE CONTAINS(ItemDesc, 'ABC')
You'd need to look at query plans and performance, but I think this will at least do what you ask.
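If the plans show the full-text predicate still being resolved against the entire index first, one workaround sometimes used is to embed a filter token in the indexed text itself, so the restriction happens inside the full-text query. A sketch, assuming you can append a synthetic tag such as COMPANYID1253 to ItemDesc (or to a combined indexed column) when rows are written:
SELECT *
FROM Actions
WHERE CONTAINS(ItemDesc, '"ABC" AND "COMPANYID1253"')  -- company filter resolved by the full-text engine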

Related

Excluding Records Based on Another Column's Value

I'm working in Redshift and have two columns from an Adobe Data feed:
post_evar22 and post_page_url.
Each post_evar22 has multiple post_page_url values as they are all the pages that the ID visited.
(It's basically a visitor ID and all the pages they visited)
I want to write a query where I can list distinct post_evar22 values that have never been associated with a post_page_url that contains '%thank%' or '%confirm%'.
In the dataset below, ID1 would be completely omitted from the query results because it was associated with a thank-you page and a confirmation page.
This is a case for NOT EXISTS:
select distinct post_evar22
from table t1
where not exists (
select 1
from table t2
where t2.post_evar22 = t1.post_evar22
and (t2.post_page_url like '%thank%' or t2.post_page_url like '%confirm%')
)
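An equivalent formulation uses conditional aggregation, which often suits a columnar store like Redshift (a sketch; table stands in for the real table name, as in the question):
select post_evar22
from table
group by post_evar22
having sum(case when post_page_url like '%thank%'
                  or post_page_url like '%confirm%'
                then 1 else 0 end) = 0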
Or MINUS, if your DBMS supports it (Redshift accepts both MINUS and the standard EXCEPT):
select post_evar22 from table
minus
select post_evar22 from table where (post_page_url like '%thank%' or post_page_url like '%confirm%')
Seems fairly straightforward. Am I missing something?
SELECT DISTINCT post_evar22
FROM table
WHERE post_page_url NOT LIKE '%thank%'
AND post_page_url NOT LIKE '%confirm%'

SQL - Find duplicate fields and count how many fields are matched

I have a large customer database where customers have been added multiple times in some circumstances, which is causing problems. I can use a query to identify the records that are an exact match, although some records have slight variations such as different addresses or given names.
I want to query across 10 fields. Some records will match on all 10, which is clearly a duplicate, while others may match only 5 fields with another record and require further investigation. Therefore I want to create a result set with a field that counts how many fields matched: basically a rating of the likelihood that the result is an actual match. All 10 would be a clear duplicate, but 5 would only be a possible duplicate.
Some will only match on POSTCODE and FIRSTNAME, which can generally be discounted.
Something like this helps, but as it only returns records that explicitly match on all 3 fields, it's not really useful given the sheer amount of data.
SELECT field1,field2,field3, count(*)
FROM table_name
GROUP BY field1,field2,field3
HAVING count(*) > 1
You are just missing the magic of CUBE(), which generates all the combinations of columns automatically. In the CUBE output, the columns that were grouped away come back as NULL, so counting the non-NULL values among field1...field10 tells you how many columns that grouping actually matched on:
DECLARE @duplicate_column_threshold int = 5;
WITH cte AS (
SELECT
field1,field2,...,field10
,duplicate_column_count = (SELECT COUNT(col) FROM (VALUES (field1),(field2),...,(field10)) c(col))
FROM table_name
GROUP BY CUBE(field1,field2,...,field10)
HAVING COUNT(*) > 1
)
SELECT *
INTO #duplicated_rows
FROM cte
WHERE duplicate_column_count >= @duplicate_column_threshold
Update: to fetch the rows from the original table, join it against #duplicated_rows using a technique that treats NULLs as wildcards when comparing the columns: NULLIF(b.field1, a.field1) is NULL either when b.field1 is NULL (a wildcard produced by CUBE) or when the two values are equal.
SELECT
a.*
,b.duplicate_column_count
FROM table_name a
INNER JOIN #duplicated_rows b
ON NULLIF(b.field1,a.field1) IS NULL
AND NULLIF(b.field2,a.field2) IS NULL
...
AND NULLIF(b.field10,a.field10) IS NULL
You might try something like
Select field1, field2, field3, ... , field10, count(1)
from customerdatabase
group by field1, field2, field3, ... , field10
order by field1, field2, field3, ... , field10
where field1 through field10 are ordered from the "most identifiable/important" to the least.
This is as close as I've got to what I'm trying to achieve: it returns all records that have any duplicate fields. I want to add a column to the results indicating how many fields matched any other record in the table. There are around 40,000 records in total.
select * from [CUST].[dbo].[REPORTA] as a
where exists
(select [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
from [CUST].[dbo].[REPORTA] as b
where a.[GIVEN.NAMES] = b.[GIVEN.NAMES]
or a.[FAMILY.NAME] = b.[FAMILY.NAME]
or a.[DATE.OF.BIRTH] = b.[DATE.OF.BIRTH]
or a.[POST.CODE] = b.[POST.CODE]
or a.[STREET] = b.[STREET]
or a.[TOWN.COUNTRY] = b.[TOWN.COUNTRY]
group by [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
having count(*) >= 1)
This query returns thousands of records, but I'm generally interested in the records with a high count of exactly matching fields.
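A direct way to get that count is a self-join that adds up one CASE per compared field. A sketch, assuming the table has some unique key (the name [RECORD.ID] below is invented) so a row is never scored against itself:
select *
from (
    select a.[RECORD.ID] as id_a,
           b.[RECORD.ID] as id_b,
           ( case when a.[GIVEN.NAMES]   = b.[GIVEN.NAMES]   then 1 else 0 end
           + case when a.[FAMILY.NAME]   = b.[FAMILY.NAME]   then 1 else 0 end
           + case when a.[DATE.OF.BIRTH] = b.[DATE.OF.BIRTH] then 1 else 0 end
           + case when a.[POST.CODE]     = b.[POST.CODE]     then 1 else 0 end
           + case when a.[STREET]        = b.[STREET]        then 1 else 0 end
           + case when a.[TOWN.COUNTRY]  = b.[TOWN.COUNTRY]  then 1 else 0 end
           ) as matched_fields
    from [CUST].[dbo].[REPORTA] as a
    join [CUST].[dbo].[REPORTA] as b
        on a.[RECORD.ID] < b.[RECORD.ID]  -- each pair once, never a row with itself
) as pairs
where matched_fields >= 5
order by matched_fields desc
Bear in mind this scores every pair of rows, which is O(n²); with 40,000 records you may want to pre-filter the pairs (for example to those sharing a POST.CODE) before scoring.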

SQL Server - query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
    select ROW_NUMBER() OVER (ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
    where
        (@a is null or a = @a) and
        (@b is null or b = @b) and
        ... (over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
By default SQL Server uses the clustered index to execute the query, but total execution time is over 300. We tested another solution, such as a multi-column index over all the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to get total execution time below 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple of alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search, with no individual options.
Lucene (or something else) - search outside of SQL; this is a fairly significant change, though.
One other option that I just remembered implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query against it. This is easiest to do with dynamic SQL, but could be done using Table-Valued Parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name); unique to make the search work properly, and the index will make it perform well.
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
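For completeness, the supporting index described above could be created like this (a sketch; the names are illustrative):
CREATE UNIQUE INDEX IX_ProfileAttributes_Search
    ON ProfileAttributes (ProfileID, AttributeName)
    INCLUDE (AttributeValue);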
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
   OR (AttributeName = 'state' AND AttributeValue = 'MI')  -- OR, not AND: a single row can't match both pairs; the HAVING below requires both
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that holds the attribute name/value pairs:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
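If you go the Table-Valued Parameter route mentioned earlier instead of a temp table, the shape is much the same. A sketch for SQL Server 2008+, where the type and procedure names are illustrative:
CREATE TYPE dbo.AttributePairList AS TABLE (
    AttributeName  varchar(50)  NOT NULL PRIMARY KEY,
    AttributeValue varchar(100) NOT NULL
);
GO
CREATE PROCEDURE dbo.SearchProfiles
    @attrs dbo.AttributePairList READONLY
AS
SELECT p.*
FROM Profile p
JOIN (
    SELECT pa.ProfileID
    FROM ProfileAttributes pa
    JOIN @attrs a
        ON pa.AttributeName = a.AttributeName
       AND pa.AttributeValue = a.AttributeValue
    GROUP BY pa.ProfileID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM @attrs)  -- every requested pair matched
) SelectedProfiles ON p.ProfileID = SelectedProfiles.ProfileID;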
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
    SELECT
        [a].*
    FROM
        (SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
    INNER JOIN
        (SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
            ON ([a].id = [b].id) OR ([b].id IS NULL)
    INNER JOIN
        (SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
            ON ([a].id = [c].id) OR ([c].id IS NULL)
    ...
    INNER JOIN
        (SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
            ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
    SELECT
        ROW_NUMBER() OVER (ORDER BY @sortColumn DESC) as RowNum,
        [filter].*
    FROM
        [filter]
)
SELECT
    *
FROM
    TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
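That would mean one narrow index per searchable column, leaving the optimizer to intersect them (a sketch; the column names a and b follow the question's placeholders):
CREATE INDEX IX_Profile_a ON Profile (a);
CREATE INDEX IX_Profile_b ON Profile (b);
-- ...and so on for each searchable column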
You've got several issues, imho. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more crucial issue here is that you've got an unnecessary join:
-- this assumes the CTE selects profile.* rather than just the id, so no second lookup is needed
SELECT *
FROM TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approach of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of performance tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks, and the recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
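A minimal sketch of that pattern, showing just two of the sixty-odd parameters (names follow the question; each remaining column repeats the same IF line):
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';
IF @a IS NOT NULL SET @sql += N' AND a = @a';
IF @b IS NOT NULL SET @sql += N' AND b = @b';
-- ...one IF per searchable column
-- executing with real parameters keeps plans reusable and preserves the injection protection
EXEC sp_executesql @sql, N'@a int, @b int', @a = @a, @b = @b;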
I agree you should also look at the points mentioned above, like:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) work best as the lead column in an index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype" (see the sketch at the end of this answer). But be careful: any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to get all the other columns you want (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, it is often faster to fetch the other columns you need directly in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the I/O is, which may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples (the actual code is on my other machine).
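To make the bitmask point above concrete, here is the kind of query meant (a hypothetical sketch; the attribute_mask column and its bit assignments are invented for illustration):
-- suppose bit 0 of attribute_mask means gender and bit 1 means has_degree;
-- searching for both becomes a plain equality, which can use an index on attribute_mask
SELECT id
FROM Profile
WHERE attribute_mask = 3;  -- 1 (gender) + 2 (has_degree)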

How to compare two rows in one mdb table?

I have one mdb table with the following structure:
Field1 Field2 Field3 Field4
A ...
B ...
I'm trying to use a query to list all the fields that differ between rows A and B in a result set:
SELECT * From Table1
WHERE Field1 = 'A'
UNION
SELECT * From Table1
WHERE Field1 = 'B';
However this query has two problems:
1. it lists all the fields, including the identical cells
2. with a large table it gives out an error message: "too many fields defined"
How could I get around these issues?
Is it not easiest to just select all the fields needed from the table, based on the Field1 value, and group on the values needed?
So something like this:
SELECT field1, field2,...field195
FROM Table1
WHERE field1 = 'A' or field1 = 'B'
GROUP BY field1, field2, ....field195
This will give you all rows where field1 is A or B and there is a difference in one of the selected fields.
Oh, and for the GROUP BY statement as well as the SELECT part, you can indeed use the previously mentioned edit mode for the query. There you can add all the fields needed in the result (by selecting them in the table and dragging them down), then click the 'Totals' button in the ribbon to add the GROUP BY statements for all of them. Then you only have to add the WHERE clause and you are done.
Now that the question is clearer (you want the query to select fields instead of records, based on particular requirements), I'll have to change my answer to:
This is not possible.
(until proven otherwise) ;)
As far as I know, a query is used to select records using, for example, the WHERE clause; it is never used to determine which fields should be shown depending on a certain criterion.
One thing that MIGHT help in this case is to look at the database design. Are those tables correctly designed?
Suppose 190 of those fields are merely details of the main data. You could separate them into another table, so you have a main table and a details table.
The details table could look something like:
ID ID_Main Det_desc Det_value
This way you can filter all detail values that differ between the two main records A and B using something like (the keys 1 and 2 below stand for the main-table IDs of records A and B):
Select A.det_desc, A.det_value, B.det_value
from (Select det_desc, det_value
      from tblDetails
      where id_main = 1) as A  -- key of main record A
inner join
     (Select det_desc, det_value
      from tblDetails
      where id_main = 2) as B  -- key of main record B
on A.det_desc = B.det_desc and A.det_value <> B.det_value
This you can join with your main table again if needed.
You can full join the table on itself, matching identical rows, and then filter for mismatches, where one of the two join sides is null. For example:
select *
from (
select *
from Table1
where Field1 = 'A'
) A
full join
(
select *
from Table1
where Field1 = 'B'
) B
on A.Field2 = B.Field2
and A.Field3 = B.Field3
where A.Field1 is null
or B.Field1 is null
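One caveat: Access (mdb) SQL has no FULL JOIN, so there the same result needs a LEFT JOIN unioned with a RIGHT JOIN, along these lines (a sketch reusing the A and B subqueries above):
select *
from (select * from Table1 where Field1 = 'A') A
left join (select * from Table1 where Field1 = 'B') B
    on A.Field2 = B.Field2 and A.Field3 = B.Field3
where B.Field1 is null
union all
select *
from (select * from Table1 where Field1 = 'A') A
right join (select * from Table1 where Field1 = 'B') B
    on A.Field2 = B.Field2 and A.Field3 = B.Field3
where A.Field1 is null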
If you have 200 fields, ask Access to generate the column list by creating a query in design view. Switch to SQL view and copy/paste. An editor with column mode (like UltraEdit) will help create the query.

What is the most efficient way to write this SQL query?

I have two lists of ids. List A and List B. Both of these lists are actually the results of SQL queries (QUERY A and QUERY B respectively).
I want to 'filter' List A, by removing the ids in List A if they appear in list B.
So for example if list A looks like this:
1, 2, 3, 4, 7
and List B looks like this:
2,7
then the 'filtered' List A should have ids 2 and 7 removed, and so should look like this:
1, 3, 4
I want to write an SQL query like this (pseudo code of course):
SELECT id FROM (QUERYA) as temp_table where id not in (QUERYB)
Using classic SQL:
select [distinct] number
from list_a
where number not in (
select distinct number from list_b
);
I've put the first "distinct" in square brackets since I'm unsure whether you want duplicates removed (remove either the brackets or the entire word). The second "distinct" should be left in, just in case your DBMS doesn't optimize IN clauses.
It may be faster (measure, don't guess) with a left join along the lines of:
select [distinct] list_a.number from list_a
left join list_b on list_a.number = list_b.number
where list_b.number is null;
Same deal with the "[distinct]".
see Doing INTERSECT and MINUS in MySQL
The query:
select id
from ListA
where id not in (
select id
from ListB)
will give you the desired result.
I am not sure which way is best. In my experience, the performance can differ greatly depending on the situation and the size of the tables.
1.
select id
from ListA
where id not in (
select id
from ListB)
2.
select ListA.id
from ListA
left join ListB on ListA.id=ListB.id
where ListB.id is null
3.
select id
from ListA
where not exists (
select *
from ListB where ListB.id=ListA.id)
The 2) should usually be the fastest, as it uses a join rather than subqueries.
Some people may suggest 3) rather than 1) because it uses EXISTS, which only checks for the existence of a row rather than reading data from the table.
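One caveat on 1): if ListB.id can contain NULLs, NOT IN matches nothing at all, because id <> NULL is unknown for every comparison. Guarding the subquery avoids the surprise:
select id
from ListA
where id not in (
    select id
    from ListB
    where id is not null)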