How to efficiently SELECT rows from database table based on selected set of values - sql

I have a transaction table of 1 million rows. The table has a field named "Code" that stores the customer's ID. There are about 10,000 distinct customer codes.
I have a GUI that lets the user render a report from the transaction table. The user may select an arbitrary number of customers for rendering.
I used the IN operator first, and it works for a few customers:
SELECT * FROM TRANS_TABLE WHERE CODE IN ('...', '...', '...')
I quickly run into problems if I select a few thousand customers; there are limits on the IN operator.
An alternative is to create a temporary table with a single CODE field, inject the selected customer codes into it with INSERT statements, and then use
SELECT A.* FROM TRANS_TABLE A INNER JOIN TEMP B ON (A.CODE=B.CODE)
This works nicely for huge selections. However, there is performance overhead in creating the temporary table, running the INSERTs, and dropping the table.
Are you aware of a better solution for this situation?

If you use SQL Server 2008, the fastest way to do this is usually with a Table-Valued Parameter (TVP):
CREATE TYPE CodeTable AS TABLE
(
Code int NOT NULL PRIMARY KEY
)
DECLARE @Codes AS CodeTable
INSERT @Codes (Code) VALUES (1)
INSERT @Codes (Code) VALUES (2)
INSERT @Codes (Code) VALUES (3)
-- Snip codes
SELECT t.*
FROM @Codes c
INNER JOIN Trans_Table t
ON t.Code = c.Code
Using ADO.NET, you can populate the TVP directly from your code, so you don't need to generate all those INSERT statements - just pass in a DataTable and ADO.NET will handle the rest. So you can write a Stored Procedure like this:
CREATE PROCEDURE GetTransactions
@Codes CodeTable READONLY
AS
SELECT t.*
FROM @Codes c
INNER JOIN Trans_Table t
ON t.Code = c.Code
... and just pass in the @Codes value as a parameter.
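For illustration, calling it from plain T-SQL looks like this (a minimal sketch; in ADO.NET you would supply a DataTable as the parameter value instead):
DECLARE @Codes AS CodeTable
INSERT @Codes (Code) VALUES (1), (2), (3)
EXEC GetTransactions @Codes = @Codes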

You can generate SQL such as
SELECT * FROM TRANS_TABLE WHERE CODE IN (?,?,?,?,?,?,?,?,?,?,?)
and reuse it in a loop until you've loaded all the IDs you need. The advantage is that if you only need a few IDs, your DB doesn't have to parse a huge IN clause. If needing many IDs is the rare case, the performance hit may not matter. If you are not worried about the SQL parsing cache, you can size the IN clause up to the DB's actual limit, so that sometimes you don't need a loop and other times you do.
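The ?-placeholders above belong to a client API such as ODBC or JDBC; the same fixed-shape idea can be sketched in T-SQL with sp_executesql (parameter names are illustrative, and unused slots are padded by repeating a real value):
EXEC sp_executesql
    N'SELECT * FROM TRANS_TABLE WHERE CODE IN (@p1, @p2, @p3, @p4)',
    N'@p1 varchar(20), @p2 varchar(20), @p3 varchar(20), @p4 varchar(20)',
    @p1 = 'C001', @p2 = 'C002', @p3 = 'C003', @p4 = 'C003'  -- @p4 pads the list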

As you have to pass the IDs somehow, IN should be the fastest way.
MSDN mentions:
Including an extremely large number of values (many thousands) in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table.
If you can still use IN and the query is too slow, you could try adjusting your indexes, e.g. with a covering index for this query. Looking up random values via the clustered index can be slow because of the random disk I/O required; a covering index could reduce that problem.
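A minimal sketch of such a covering index, assuming (hypothetically) that the report only reads a date and an amount column:
CREATE NONCLUSTERED INDEX IX_TransTable_Code
ON TRANS_TABLE (CODE)
INCLUDE (TransDate, Amount)  -- cover the columns the report actually reads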
If you really do exceed the limits of IN and create a temporary table, I don't expect creating the table to be a major problem, as long as you insert the values in one go (not with thousands of individual queries, of course). Choose the method with the least overhead, like one of those mentioned here:
http://blog.sqlauthority.com/2008/07/02/sql-server-2008-insert-multiple-records-using-one-insert-statement-use-of-row-constructor/
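For example, the row-constructor syntax from that link (SQL Server 2008+) loads many codes in a single statement (#CodeList is a hypothetical temp table):
INSERT INTO #CodeList (Code)
VALUES (1), (2), (3), (4), (5)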
Of course, if there is some static pattern in your IDs, you could select by that (e.g. in stored procedures or UDFs). And if you get those thousands of IDs out of your database itself, then instead of passing them back and forth you could just store them there or use a subquery...

Maybe you could pass the customer codes to a stored procedure as a comma-separated string and use the split SQL function mentioned here: http://www.devx.com/tips/Tip/20009.
Then declare a table variable, insert the split values into it, and use an IN clause (or a join) against it.
CREATE PROCEDURE prc_dosomething (
@CustomerCodes varchar(MAX)
)
AS
DECLARE @customercodetable table(code varchar(10)) -- or whatever length you require
INSERT INTO @customercodetable
SELECT * FROM dbo.UTILfn_Split(@CustomerCodes) -- see the article above for the split function
-- do some magic stuff here :)
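Calling it is then just a matter of passing the delimited string (values illustrative):
EXEC prc_dosomething @CustomerCodes = 'C001,C002,C003'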

Related

SQL Server stored procedure to search list of values without special characters

What is the most efficient way to search a column and return all matching values while ignoring special characters?
For example, if a table has a part_number column with the values '10-01', '14-02-65', '345-23423', and the user searches for '10_01' and 140265, it should return '10-01' and '14-02-65'.
Processing the input with a regex to remove those characters is possible, so the stored procedure could be passed a parameter like '1001 140265'; it could then split that input to form a SQL statement like
SELECT *
FROM MyTable
WHERE part_number IN ('1001', '140265')
The problem here is that this will not match anything. In this case the following would work
SELECT *
FROM MyTable
WHERE REPLACE(part_number,'-','') IN ('1001', '140265')
But I need to remove all special characters, or at the very least all of these: ~!@#$%^&*()_+?/\{}[];. With a REPLACE for each of those characters, the query takes several minutes even when the number of parts in the IN clause is under 200.
Performance improves if I create a function that does the replaces; the query then takes less than a minute. But without the removals the query takes around 1 second. Is there any way to create some kind of functional index that will work on multiple SQL Server engines?
You could use a computed column and index it:
CREATE TABLE MyTable (
part_number VARCHAR(10) NOT NULL,
part_number_int AS CAST(replace(part_number, '-', '') AS int)
);
ALTER TABLE dbo.MyTable ADD PRIMARY KEY (part_number);
ALTER TABLE dbo.MyTable ADD UNIQUE (part_number_int);
INSERT INTO dbo.MyTable (part_number)
VALUES ('100-1'), ('140265');
SELECT *
FROM dbo.MyTable AS MT
WHERE MT.part_number_int IN ('1001', '140265');
Of course your REPLACE expression will be more complex, and you'll have to sanitize the user input the same way you sanitize the column values. But this is going to be the most efficient way to do it.
With that index in place, the query can seek on the computed column efficiently.
But to be honest, I'd just create a separate column to store cleansed values for querying purposes and keep the actual values for display. You'll have to take care of extra update/insert logic, but that's minimal damage.
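A sketch of that cleansed-column idea, here using a persisted computed column so SQL Server maintains it for you (the REPLACE chain is abbreviated; extend it to every character you need to strip):
ALTER TABLE dbo.MyTable ADD part_number_clean AS
    REPLACE(REPLACE(REPLACE(part_number, '-', ''), '_', ''), '/', '') PERSISTED;
CREATE INDEX IX_MyTable_clean ON dbo.MyTable (part_number_clean);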

T-Sql - Select query in another select query takes long time

I have a procedure with arguments, but calling it takes a very long time. I checked what was wrong with my query and concluded that the problem is the column IN (SELECT ...) construct.
Both queries return 1500 rows.
First query: 45 seconds
Second query: 0 seconds
1.
declare @FILTER_OPTION int
declare @ID_DISTRIBUTOR type_int_value
declare @ID_DATA_TYPE type_bigint_value
declare @ID_AGGREGATION_TYPE type_int_value
set @FILTER_OPTION = 8
insert into @ID_DISTRIBUTOR values (19)
insert into @ID_DATA_TYPE values (30025)
insert into @ID_AGGREGATION_TYPE values (10)
SELECT * FROM dbo.[DATA] WHERE
[ID_DISTRIBUTOR] IN (select [VALUE] from @ID_DISTRIBUTOR)
AND [ID_DATA_TYPE] IN (select [VALUE] from @ID_DATA_TYPE)
AND [ID_AGGREGATION_TYPE] IN (select [VALUE] from @ID_AGGREGATION_TYPE)
2.
select * FROM dbo.[DATA] WHERE
[ID_DISTRIBUTOR] IN (19)
AND [ID_DATA_TYPE] IN (30025)
AND [ID_AGGREGATION_TYPE] IN (10)
Why is this happening?
How should I create a stored procedure that takes an array of arguments and still runs quickly?
Edit:
Maybe it's a problem with indexes? Indexes exist on these three columns.
For such a large performance difference, I would guess that you have one or more indexes. In particular, if you have an index on (ID_DISTRIBUTOR, ID_DATA_TYPE, ID_AGGREGATION_TYPE), then the second query can make use of the index. SQL Server can recognize that the IN is really = and the query is a simple lookup.
In the first case, SQL Server doesn't "know" that the subqueries really have only one row in them. That requires a different set of optimizations. In particular, the above index cannot be used, because the IN generally optimizes differently from =.
As for what to do: first, look at the execution plans so you can see the difference between the two versions. Then test the second version with more than one value in the IN lists.
If you can live with just one value for each comparison, then use = rather than IN.
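If you go the index route, the composite index described above would look like this (index name is illustrative):
CREATE INDEX IX_DATA_lookup
ON dbo.[DATA] (ID_DISTRIBUTOR, ID_DATA_TYPE, ID_AGGREGATION_TYPE)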

Optimizing stored procedure with multiple "LIKE"s

I am passing in a comma-delimited list of values that I need to compare against the database.
Here is an example of the values I'm passing in:
@orgList = "1123, 223%, 54%"
To use the wildcard I think I have to use LIKE, but the query runs a long time and only returns 14 rows (the results are correct, but it's just taking forever, probably because I'm using the join incorrectly).
Can I make it better?
This is what I do now:
declare @tempTable Table (SearchOrg nvarchar(max))
insert into @tempTable
select * from dbo.udf_split(@orgList) as split
-- this splits the values at the comma and puts them in a temp table
-- then I do a join on the main table and the temp table to do a like on it....
-- but I think it's not right because it's too long.
select something
from maintable gt
join @tempTable tt on gt.org like tt.SearchOrg
where
AYEAR = ISNULL(@year, ayear)
and (AYEAR >= ISNULL(@yearR1, ayear) and ayear <= ISNULL(@yearr2, ayear))
and adate = ISNULL(@Date, adate)
and (adate >= ISNULL(@dateR1, adate) and adate <= ISNULL(@DateR2, adate))
The final result should be all rows where maintable.org is 1123, or starts with 223, or starts with 54.
The reason for my date craziness is that sometimes the stored procedure checks only a year, sometimes a year range, sometimes a specific date, and sometimes a date range... everything that's not used is passed in as null.
Maybe the problem is there?
Try something like this:
Declare @tempTable Table
(
-- Since the column is a varchar(10), you don't want to use nvarchar here.
SearchOrg varchar(20)
);
INSERT INTO @tempTable
SELECT * FROM dbo.udf_split(@orgList);
SELECT
something
FROM
maintable gt
WHERE
-- your other WHERE conditions go here
And
Exists
(
SELECT 1
FROM #tempTable tt
WHERE gt.org Like tt.SearchOrg
)
Such a dynamic query, with optional filters and LIKE patterns driven by a table (!), is very hard to optimize because almost nothing is statically known; the optimizer has to create a very general plan.
You can do two things to speed this up by orders of magnitude:
Play with OPTION (RECOMPILE). If the compile times are acceptable this will at least deal with all the optional filters (but not with the LIKE table).
Do code generation and EXEC sp_executesql the generated code. Build a query with all LIKE clauses inlined into the SQL so that it looks like this: WHERE a LIKE @like0 OR a LIKE @like1 ... (not sure if you need OR or AND). This allows the optimizer to get rid of the join and just evaluate normal predicates.
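A minimal code-generation sketch of that second option (lengths and values are illustrative):
DECLARE @sql nvarchar(max) = N'
    SELECT something
    FROM maintable gt
    WHERE gt.org LIKE @like0 OR gt.org LIKE @like1 OR gt.org LIKE @like2'
EXEC sp_executesql @sql,
    N'@like0 varchar(20), @like1 varchar(20), @like2 varchar(20)',
    @like0 = '1123', @like1 = '223%', @like2 = '54%'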
Your query may be difficult to optimize. Part of the question is what is in the WHERE clause. You probably want to apply those filters first and then do the join using LIKE; or you can try to make the join faster and then do a full scan on the results.
SQL Server should optimize a like statement of the form 'abc%' -- that is, where the wildcard is at the end. (See here, for example.) So, you can start with an index on maintable.org. Fortunately, your examples meet this criteria. However, if you have '%abc' -- the wildcard comes first -- then the optimization won't work.
For the index to work best, it might also need to take into account the conditions in the where clause. In other words, adding the index is suggestive, but the rest of the query may preclude the use of the index.
And, let me add, the best solution for these types of searches is to use the full text search capability in SQL Server (see here).
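For prefix patterns, a full-text query might look like this (a sketch that assumes a full-text index already exists on maintable.org):
SELECT something
FROM maintable
WHERE CONTAINS(org, '"1123" OR "223*" OR "54*"')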

how to overcome the limitation of the IN clause in an SQL query

I have written an SQL query like:
select field1, field2 from table_name;
The problem is that this query will return around 1 million records, or at least more than 100k records.
I have a directory of input files (around 20,000 to 50,000 records) that contain field1. This is the main data I am concerned with.
Using a Perl script, I am extracting those values from the directory.
But , if I write a query like :
select field1 , field2 from table_name
where field1 in (need to write a query to take field1 from directory);
If I use the IN clause, it is limited to 1,000 entries, so how should I overcome that limitation?
In any DBMS, I would insert them into a temporary table and perform a JOIN to work around the IN clause's limitation on the size of the list.
E.g.
CREATE TABLE #idList
(
ID INT
)
INSERT INTO #idList VALUES(1)
INSERT INTO #idList VALUES(2)
INSERT INTO #idList VALUES(3)
SELECT *
FROM
MyTable m
JOIN #idList AS t
ON m.id = t.id
In SQL Server 2005, in one of our previous projects, we used to convert this list of values (the result of querying another data store, a Lucene index) into XML, pass it as an XML variable to the SQL query, convert it into a table using the nodes() function on the XML data type, and perform a JOIN with that.
DECLARE @IdList XML
SELECT @IdList = '
<Requests>
<Request id="1" />
<Request id="2" />
<Request id="3" />
</Requests>'
SELECT *
FROM
MyTable m
JOIN (
SELECT id.value('(@id)[1]', 'INT') as 'id'
FROM @IdList.nodes('/Requests/Request') as T(id)
) AS t
ON m.id = t.id
Vikdor is right: you shouldn't be querying this with an IN() clause; it's faster and more memory-efficient to use a table and JOIN.
Expanding on his answer I would recommend the following approach:
Get a list of all input files via Perl
Think of some clever way to compute a hash value for your list that is unique and based on all input files (I'd recommend the filenames or similar)
This hash will serve as the name of the table that stores the input filenames (think of it as a quasi temporary table that gets discarded once the hash changes)
JOIN that table to return the correct records
For step 2. you could either use a cronjob or compute whenever the query is actually needed (which would delay the response, though). To get this right you need to consider how likely it is that files are added/removed.
For step 3. you would need some logic that drops the previously generated tables once the current hash value differs from last execution, then recreate the table named after the current hash.
For the quasi temporary table names I'd recommend something along the lines of
input_files_XXX (i.e. prefix_<hashvalue>)
which makes it easier to know what stale tables to drop.
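A rough SQL sketch of the whole flow (the hash suffix and column definition are illustrative):
CREATE TABLE input_files_9f8a2c (field1 VARCHAR(50) PRIMARY KEY);
-- bulk-insert the field1 values extracted from the input files, then:
SELECT t.field1, t.field2
FROM table_name t
JOIN input_files_9f8a2c i ON i.field1 = t.field1;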
You could split your 50,000 ids into 50 lists of 1,000 ids, do a query for each such list, and collect the result sets in Perl.
Oracle-wise, the best solution (short of a temporary table, which without indexing won't give you much performance) is to use a nested table type:
CREATE TYPE my_ntt is table of directory_rec;
Then create a function f1 that returns a variable of type my_ntt and use it in the query:
select field1, field2 from table_name where field1 in (select * from table(cast(f1() as my_ntt)));
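A PL/SQL sketch of such a function (the loading logic is omitted; my_ntt is the type created above):
CREATE OR REPLACE FUNCTION f1 RETURN my_ntt IS
    v_ids my_ntt := my_ntt();
BEGIN
    -- populate v_ids with the ids read from the directory,
    -- e.g. via an external table or a caller-supplied parameter
    RETURN v_ids;
END f1;
/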

Jet engine (Access) : Passing a list of values to a stored procedure

I am currently writing a VBA-based Excel add-in that's heavily based on a Jet database backend (I use the Office 2003 suite -- the problem would be the same with a more recent version of Office anyway).
During the initialization of my app, I create stored procedures that are defined in a text file. Those procedures are called by my app when needed.
Let me take a simple example to describe my issue: suppose that my app allows end-users to select the identifiers of orders for which they'd like details. Here's the table definition:
Table tblOrders: OrderID LONG, OrderDate DATE, (other fields)
The end-user may select one or more OrderIDs, displayed in a form - s/he just has to tick the checkbox of the relevant OrderIDs for which s/he'd like details (OrderDate, etc).
Because I don't know in advance how many OrderIDs s/he will select, I could dynamically create the SQL query in the VBA code by chaining OR conditions in the WHERE clause based on the choices made in the form:
SELECT * FROM tblOrders WHERE OrderID = 1 OR OrderID = 2 OR OrderID = 3
or, much simpler, by using the IN keyword:
SELECT * FROM tblOrders WHERE OrderID IN (1,2,3)
Now if I turn this simple query into a stored procedure so that I can dynamically pass the list of OrderIDs to display, how should I do it? I already tried things like:
CREATE PROCEDURE spTest (@OrderList varchar) AS
SELECT * FROM tblOrders WHERE OrderID IN (@OrderList)
But this does not work (I was expecting that), because @OrderList is interpreted as a single string (e.g. "1,2,3") and not as a list of long values. (I adapted this from code found here: Passing a list/array to SQL Server stored procedure)
I'd like to avoid dealing with this issue in pure VBA code (i.e. dynamically assigning the list of values to a query hardcoded in my application) as much as possible. I'd understand if this is not possible.
Any clue?
You can create the query-statement string dynamically. In SQL Server you can have a function whose return value is a TABLE, and invoke that function inline as if it were a table. Or in JET you could also create a kludge -- a temporary table (or persistent table that serves the function of a temporary table) that contains the values in your in-list, one per row, and join on that table. The query would thus be a two-step process: 1) populate temp table with INLIST values, then 2) execute the query joining on the temp table.
MYTEMPTABLE
    id       autoincrementing
    QueryID  some value to identify the current query, perhaps a GUID
    myvalue  one of the values in your in-list (string)
select * from foo
inner join MYTEMPTABLE on foo.column = MYTEMPTABLE.myvalue and MYTEMPTABLE.QueryId = ?
-- [cannot recall if JET allows ANDs in INNER JOIN as SQL Server does;
--  if not, adjust syntax accordingly]
instead of
select * from foo where foo.column IN (... )
In this way you could have the same table handle multiple queries concurrently, because each query would have a unique identifier. You could delete the in-list rows after you're finished with them:
DELETE FROM MYTEMPTABLE where QueryID = ?
P.S. There would be several ways of handling data type issues for the join. You could cast the string value in MYTEMPTABLE as required, or you could have multiple columns in MYTEMPTABLE of varying datatypes, inserting into and joining on the correct column:
MYTEMPTABLE
    id
    queryid
    mytextvalue
    myintvalue
    mymoneyvalue
    etc
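For instance, joining through the integer column while keeping the per-query filter in the WHERE clause (a sketch, which also sidesteps the AND-in-join question above; names as in the table sketch):
SELECT foo.*
FROM foo
INNER JOIN MYTEMPTABLE ON foo.column = MYTEMPTABLE.myintvalue
WHERE MYTEMPTABLE.queryid = ?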