How to overcome the limitation of the IN clause in an SQL query - sql

I have written an SQL query like:
select field1, field2 from table_name;
The problem is that this query will return 1 million records, or more than 100k records.
I have a directory of input files (around 20,000 to 50,000 records) that contain field1. This is the main data I am concerned with.
Using a Perl script, I am extracting those field1 values from the directory.
But if I write a query like:
select field1, field2 from table_name
where field1 in (need to write a query to take field1 from directory);
the IN clause has a limit of 1000 entries, so how should I overcome this limitation of the IN clause?

In any DBMS, I would insert them into a temporary table and perform a JOIN to work around the IN clause limitation on the size of the list.
E.g.
CREATE TABLE #idList
(
ID INT
)
INSERT INTO #idList VALUES(1)
INSERT INTO #idList VALUES(2)
INSERT INTO #idList VALUES(3)
SELECT *
FROM
MyTable m
JOIN #idList AS t
ON m.id = t.id
In SQL Server 2005, in one of our previous projects, we used to take the list of values coming from another data store (a Lucene index), convert it into XML, pass it as an XML variable to the SQL query, turn it into a table using the nodes() method on the XML data type, and perform a JOIN with that.
DECLARE @IdList XML
SELECT @IdList = '
<Requests>
<Request id="1" />
<Request id="2" />
<Request id="3" />
</Requests>'
SELECT *
FROM
MyTable m
JOIN (
SELECT T.id.value('(@id)[1]', 'INT') as 'id'
FROM @IdList.nodes('/Requests/Request') as T(id)
) AS t
ON m.id = t.id

Vikdor is right: you shouldn't be querying this with an IN() clause; it's faster and more memory-efficient to use a table and JOIN against it.
Expanding on his answer, I would recommend the following approach:
Get a list of all input files via Perl
Think of some clever way to compute a hash value for your list that is unique and based on all input files (I'd recommend the filenames or similar)
This hash will serve as the name of the table that stores the input filenames (think of it as a quasi temporary table that gets discarded once the hash changes)
JOIN that table to return the correct records
For step 2. you could either use a cronjob or compute whenever the query is actually needed (which would delay the response, though). To get this right you need to consider how likely it is that files are added/removed.
For step 3. you would need some logic that drops the previously generated tables once the current hash value differs from last execution, then recreate the table named after the current hash.
For the quasi temporary table names I'd recommend something along the lines of
input_files_XXX (i.e. prefix_<hashvalue>)
which makes it easier to know what stale tables to drop.
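A minimal sketch of steps 3 and 4, assuming (purely for illustration) that the Perl script computed the hash a1b2c3 and using the table/column names from the question; the data type is also an assumption:
CREATE TABLE input_files_a1b2c3 (field1 VARCHAR(100) PRIMARY KEY);
-- the Perl script bulk-loads the extracted field1 values into it, then:
SELECT t.field1, t.field2
FROM table_name t
JOIN input_files_a1b2c3 i ON i.field1 = t.field1;
-- once the hash changes, drop the stale table (DROP TABLE input_files_<old_hash>)
-- and recreate it under the new name.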

You could split your 50,000 IDs into 50 lists of 1,000 IDs, run a query for each list, and collect the result sets in Perl.
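A rough sketch of the chunked queries (the values shown are placeholders; in practice each IN list would carry up to 1,000 of your field1 values, with the chunking done by the Perl script):
SELECT field1, field2 FROM table_name WHERE field1 IN ('A001', 'A002', 'A003');
SELECT field1, field2 FROM table_name WHERE field1 IN ('A004', 'A005', 'A006');
-- ...one query per chunk; the client collects and merges the result sets.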

Oracle-wise, the best alternative to a temporary table (which, without indexing, won't give you much performance) is to use a nested table type.
CREATE TYPE my_ntt is table of directory_rec;
Then create a function f1 that returns a variable of the my_ntt type and use it in the query:
select field1, field2 from table_name where field1 in (select * from table(cast(f1 as my_ntt)));
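A self-contained variant of that idea, assuming (for illustration only) that field1 is a plain string, so the nested table can simply be a collection of VARCHAR2 values rather than directory_rec; names and sample values are placeholders:
CREATE TYPE my_ntt AS TABLE OF VARCHAR2(100);
/
CREATE OR REPLACE FUNCTION f1 RETURN my_ntt IS
  v my_ntt := my_ntt();
BEGIN
  -- populate the collection from wherever the directory data was staged
  v.EXTEND(2);
  v(1) := 'value_from_file_1';
  v(2) := 'value_from_file_2';
  RETURN v;
END;
/
SELECT field1, field2
FROM table_name
WHERE field1 IN (SELECT column_value FROM TABLE(f1()));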

Related

Is there any SQL query character limit when executing it using the JDBC driver? [duplicate]

I'm using the following code:
SELECT * FROM table
WHERE Col IN (123,123,222,....)
However, if I put more than ~3000 numbers in the IN clause, SQL throws an error.
Does anyone know if there's a size limit or anything similar?
Depending on the database engine you are using, there can be limits on the length of a statement.
SQL Server has a very large limit:
http://msdn.microsoft.com/en-us/library/ms143432.aspx
Oracle, on the other hand, has a limit that is very easy to reach.
So, for large IN clauses, it's better to create a temp table, insert the values and do a JOIN. It usually works faster too.
There is a limit, but you can split your values into separate blocks of in()
Select *
From table
Where Col IN (123,123,222,....)
or Col IN (456,878,888,....)
Parameterize the query and pass the ids in using a Table Valued Parameter.
For example, define the following type:
CREATE TYPE IdTable AS TABLE (Id INT NOT NULL PRIMARY KEY)
Along with the following stored procedure:
CREATE PROCEDURE sp__Procedure_Name
@OrderIDs IdTable READONLY
AS
SELECT *
FROM table
WHERE Col IN (SELECT Id FROM @OrderIDs)
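A quick sketch of how you would call it from T-SQL (the values here are just placeholders; from application code you would bind a table-valued parameter instead of running the INSERTs):
DECLARE @ids IdTable;
INSERT INTO @ids (Id) VALUES (123), (222), (456);
EXEC sp__Procedure_Name @OrderIDs = @ids;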
Why not do a WHERE IN on a sub-select...
Pre-query into a temp table or something...
CREATE TABLE SomeTempTable AS
SELECT YourColumn
FROM SomeTable
WHERE UserPickedMultipleRecordsFromSomeListOrSomething
then...
SELECT * FROM OtherTable
WHERE YourColumn IN ( SELECT YourColumn FROM SomeTempTable )
Depending on your version, use a table valued parameter in 2008, or some approach described here:
Arrays and Lists in SQL Server 2005
For MS SQL 2016, passing ints into the IN clause, it looks like it can handle close to 38,000 values.
select * from user where userId in (1,2,3,etc)
I solved this by simply using ranges:
WHERE Col >= 123 AND Col <= 10000
and then removed unwanted records in the specified range by looping in the application code. It worked well for me because I was looping over the records anyway, and ignoring a couple of thousand of them didn't make any difference.
Of course, this is not a universal solution, but it can work when most of the values between the min and max are required.
You did not specify the database engine in question; in Oracle, an option is to use tuples like this:
SELECT * FROM table
WHERE (Col, 1) IN ((123,1),(123,1),(222,1),....)
This ugly hack only works in Oracle SQL, see https://asktom.oracle.com/pls/asktom/asktom.search?tag=limit-and-conversion-very-long-in-list-where-x-in#9538075800346844400
However, a much better option is to use stored procedures and pass the values as an array.
You can use tuples like this:
SELECT * FROM table
WHERE (Col, 1) IN ((123,1),(123,1),(222,1),....)
There is no restriction on the number of these, because the limit applies only to lists of single values, not to lists of tuples. It compares pairs.

Is there a way to pull part of a SQL query from a .sql file?

Let me simplify with an example. Let's say I have the following query saved on:
C:\sample.sql
grp.id IN
(001 --Bob
,002 --Tom
,003 --Fay
)
Now, that group of IDs could change, but instead of updating those IDs in every query it relates to, I was hoping to just update them in sample.sql and have the rest of the queries pull from that SQL file directly.
For example, I have several queries that would have a section like this:
SELECT *
FROM GROUP grp
WHERE grp.DATERANGE >= '2017-12-01' AND grp.DATERANGE <= '2017-12-31'
AND -- **this is where I would need to insert that query (ie. C:\sample.sql)**
Update with more explanation:
Issue: I have several reports/queries sharing the same ID filter (that's the only thing those reports have in common).
What's needed: Instead of updating those IDs every time they change on each report, I was wondering if I could keep those IDs in their own SQL file (like the example above) and have the rest of the queries pull from there.
Note: I can't create a table or database in the database I'm using.
Maybe the bulk insert utility could help. Hold your data in csv files and load them into temp tables at run time. Use these temp tables to drive your query.
CREATE TABLE #CsvData(
Column1 VARCHAR(40),
Column2 VARCHAR(40)
)
GO
BULK
INSERT #CsvData
FROM 'c:\csvtest.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
--Use #CsvData to drive your query
SELECT *
FROM #CsvData
Maybe what you could use is a CTE (Common Table Expression) to pull your IDs using an additional query, especially if you only have read access. It would look something like this:
WITH myIDs AS (select IDs from grp where (conditions to get the IDs))
SELECT *
FROM grp
WHERE grp.DATERANGE BETWEEN '2017-12-01' AND '2017-12-31'
AND IDs in (select * from myIDs)
I've changed the date syntax to use BETWEEN since it's more practical.
Hope this helps!
Cheers!
The only chance to build a query out of text fragments is dynamic SQL.
Try this:
DECLARE @SomeCommand VARCHAR(MAX)='SELECT * FROM sys.objects';
EXEC(@SomeCommand);
This returns a list of all sys.objects entries.
Now append a WHERE clause to the string:
SET @SomeCommand=@SomeCommand + ' WHERE object_id IN(1,2,3,4,5,6,7,8,9)';
EXEC(@SomeCommand);
And you get a reduced result.
Another option is a dynamic IN-list with a CSV parameter.
This is forbidden: DECLARE @idList VARCHAR(100)='1,2,3,4' and then use it like IN (@idList).
But this works:
DECLARE @idList VARCHAR(100)='1,2,3,4,5,6,7,8,9';
SELECT sys.objects.*
FROM sys.objects
--use REPLACE to transform the list to <x>1</x><x>2</x>...
OUTER APPLY(SELECT CAST('<x>' + REPLACE(@idList,',','</x><x>') + '</x>' AS XML)) AS A(ListSplitted)
--Now use the XML (the former CSV) within your IN() as a set-based filter
WHERE @idList IS NULL OR LEN(@idList)=0 OR object_id IN(SELECT B.v.value('.','int') FROM A.ListSplitted.nodes('/x') AS B(v));
With SQL Server 2016+ this can be done much more easily using STRING_SPLIT().
This approach allows you to pass the id list as a simple text parameter.
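A minimal sketch of the STRING_SPLIT() variant (SQL Server 2016+, reusing the same sample id list as above):
DECLARE @idList VARCHAR(100)='1,2,3,4,5,6,7,8,9';
SELECT sys.objects.*
FROM sys.objects
WHERE object_id IN (SELECT CAST(value AS INT) FROM STRING_SPLIT(@idList, ','));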

SQL query to check for inclusion of any element from an array

I have a database column containing a string that might look something like this u/1u/3u/19/g1/g4 for a particular row.
Is there a performant way to get all rows that have at least one of the following elements ['u/3', 'g4'] in that column?
I know I can use AND clauses, but the number of elements to verify against varies and could become large.
I am using RoR/ActiveRecord in my project.
In SQL Server, you can use XML to convert your list of search params into a record set, then cross join that with the base table and use charIndex() to see if the column contains the substring.
Since I don't know your table or column names, I used a table (persons) that I already had data in, which has a column 'phone_home'. To search for any phone number that contains '202' or '785' I would use this query:
select person_id,phone_home,Split.data.value('.', 'VARCHAR(10)')
from (select *, cast('<n>202</n><n>785</n>' as XML) as myXML
from persons) as data cross apply myXML.nodes('/n') as Split(data)
where charindex(Split.data.value('.', 'VARCHAR(10)'),data.phone_Home) > 0
You will get duplicate records if it matches more than one value, so throw a DISTINCT in there and remove the Split value from the select statement if that is not desired.
Using XML in SQL is voodoo magic to me... I got the idea from this post: http://www.sqljason.com/2010/05/converting-single-comma-separated-row.html
No idea what the performance is like... but at least there aren't any cursors or dynamic SQL.
EDIT: Casting the XML is pretty slow, so I made it a variable so it only gets cast once.
declare @xml XML
set @xml = cast('<n>202</n><n>785</n>' as XML)
select person_id, phone_home, Split.persons.value('.', 'VARCHAR(10)')
from persons cross apply @xml.nodes('/n') as Split(persons)
where charindex(Split.persons.value('.', 'VARCHAR(10)'), phone_Home) > 0

How to efficiently SELECT rows from database table based on selected set of values

I have a transaction table of 1 million rows. The table has a field named "Code" that holds the customer's ID. There are about 10,000 different customer codes.
I have a GUI interface that allows the user to render a report from the transaction table. The user may select an arbitrary number of customers for rendering.
I use the IN operator first and it works for a few customers:
SELECT * FROM TRANS_TABLE WHERE CODE IN ('...', '...', '...')
I quickly run into a problem if I select a few thousand customers. There is a limitation on using the IN operator.
An alternative way is to create a temporary table with only one field, CODE, and inject the selected customer codes into the temporary table using INSERT statements. I may then use
SELECT A.* FROM TRANS_TABLE A INNER JOIN TEMP B ON (A.CODE=B.CODE)
This works nicely for a huge selection. However, there is performance overhead for the temporary table creation, the INSERT injection and the dropping of the temporary table.
Are you aware of a better solution to handle this situation?
If you use SQL Server 2008, the fastest way to do this is usually with a Table-Valued Parameter (TVP):
CREATE TYPE CodeTable AS TABLE
(
Code int NOT NULL PRIMARY KEY
)
DECLARE @Codes AS CodeTable
INSERT @Codes (Code) VALUES (1)
INSERT @Codes (Code) VALUES (2)
INSERT @Codes (Code) VALUES (3)
-- Snip codes
SELECT t.*
FROM @Codes c
INNER JOIN Trans_Table t
ON t.Code = c.Code
Using ADO.NET, you can populate the TVP directly from your code, so you don't need to generate all those INSERT statements - just pass in a DataTable and ADO.NET will handle the rest. So you can write a stored procedure like this:
CREATE PROCEDURE GetTransactions
@Codes CodeTable READONLY
AS
SELECT t.*
FROM @Codes c
INNER JOIN Trans_Table t
ON t.Code = c.Code
... and just pass in the @Codes value as a parameter.
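A quick T-SQL sketch of the call (the codes are placeholders; from ADO.NET you would bind a DataTable to the @Codes parameter instead of running the INSERTs):
DECLARE @MyCodes AS CodeTable;
INSERT @MyCodes (Code) VALUES (101), (202), (303);
EXEC GetTransactions @Codes = @MyCodes;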
You can generate SQL such as
SELECT * FROM TRANS_TABLE WHERE CODE IN (?,?,?,?,?,?,?,?,?,?,?)
and re-use it in a loop until you've loaded all the IDs you need. The advantage is that if you only need a few IDs, your DB doesn't need to parse all those in-clauses. If many IDs is a rare case, then the performance hit may not matter. If you are not worried about the SQL parsing cache, then you can limit the size of the in-clause to the DB's actual limit, so that sometimes you don't need a loop and other times you do.
As you have to pass the IDs somehow, IN should be the fastest way.
MSDN mentions:
Including an extremely large number of values (many thousands) in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table.
If you can still use IN and the query is too slow, you could try to adjust your indexes, e.g. by using a covering index for your query. Looking up random values by the clustered index can be slow because of the random disk I/O required. A covering index could reduce that problem.
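A hedged sketch of what such a covering index could look like for the transaction query above (the included columns are hypothetical; include whichever columns the report actually reads):
CREATE INDEX IX_TransTable_Code
ON TRANS_TABLE (CODE)
INCLUDE (TransDate, Amount); -- hypothetical report columns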
If you really do hit the limit of IN and create a temporary table, I don't expect the creation of the table to be a major problem, as long as you insert the values at once (not with thousands of individual queries, of course). Choose the method with the least overhead, like one of those mentioned here:
http://blog.sqlauthority.com/2008/07/02/sql-server-2008-insert-multiple-records-using-one-insert-statement-use-of-row-constructor/
Of course, if there is some static pattern in your IDs, you could select by that (e.g. in SPs or UDFs). If you get those thousands of IDs out of your database itself, then instead of passing them back and forth, you could just store them there or use a subquery...
Maybe you could pass the customer codes to a stored procedure comma-separated and use the split SQL function mentioned here: http://www.devx.com/tips/Tip/20009.
Then declare a table variable, insert the split values into it, and use an IN clause against it.
CREATE PROCEDURE prc_dosomething (
@CustomerCodes varchar(MAX)
)
AS
DECLARE @customercodetable table(code varchar(10)) -- or whatever length you require.
INSERT INTO @customercodetable (code) SELECT * FROM dbo.UTILfn_Split(@CustomerCodes) -- see the article above for the split function.
-- do some magic stuff here :).
