SQL argument limit in Oracle - sql

It appears that there is a limit of 1000 arguments in an Oracle SQL. I ran into this when generating queries such as....
select * from orders where user_id IN(large list of ids over 1000)
My workaround is to create a temporary table, insert the user ids into that first instead of issuing a query via JDBC that has a giant list of parameters in the IN.
Does anybody know of an easier workaround? Since we are using Hibernate I wonder if it automatically is able to do a similar workaround transparently.

An alternative approach would be to pass an array to the database and use a TABLE() function in the IN clause. This will probably perform better than a temporary table. It will certainly be more efficient than running multiple queries. But you will need to monitor PGA memory usage if you have a large number of sessions doing this stuff. Also, I'm not sure how easy it will be to wire this into Hibernate.
Note: TABLE() functions operate in the SQL engine, so they need us to declare a SQL type.
create or replace type tags_nt as table of varchar2(10);
/
The following sample populates an array with a couple of thousand random tags. It then uses the array in the IN clause of a query.
declare
search_tags tags_nt;
n pls_integer;
begin
select name
bulk collect into search_tags
from ( select name
from temp_tags
order by dbms_random.value )
where rownum <= 2000;
select count(*)
into n
from big_table
where name in ( select * from table (search_tags) );
dbms_output.put_line('tags match '||n||' rows!');
end;
/

As long as the temporary table is a global temporary table (ie only visible to the session), this is the recommended way of doing things (and I'd go that route for anything more than a dozen arguments, let alone a thousand).
I'd wonder where/how you are building that list of 1000 arguments. If this is a semi-permanent grouping (eg all employees based in a particular location) then that grouping should be in the database and the join done there. Databases are designed and built to do joins really quickly. Much quicker than pulling a bunch of id's back to the mid tier and then sending them back to the database.
select * from orders
where user_id in
(select user_id from users where location = :loc)

You can add additional predicates to split the list into chunks of 1000:
select * from orders where user_id IN (<first batch of 1000>)
OR user_id IN (<second batch of 1000>)
OR user_id IN ...

the comments regarding "if these IDs are in your database, use joins/correlation instead" hold true. However, if your list of IDs comes from elsewhere, like a SOLR result, you can get around the temp table requirement by issuing multiple queries, each with no more than 1000 ids present, and then merging the results of the query in memory. If you place the initial list of ids in a unique collection like a hashset, you can pop off 1000 ids at a time.

Related

PostgreSQL return select results AND add them to temporary table?

I want to select a set of rows and return them to the client, but I would also like to insert just the primary keys (integer id) from the result set into a temporary table for use in later joins in the same transaction.
This is for sync, where subsequent queries tend to involve a join on the results from earlier queries.
What's the most efficient way to do this?
I'm reticent to execute the query twice, although it may well be fast if it was added to the query cache. An alternative is store the entire result set into the temporary table and then select from the temporary afterward. That also seems wasteful (I only need the integer id in the temp table.) I'd be happy if there was a SELECT INTO TEMP that also returned the results.
Currently the technique used is construct an array of the integer ids in the client side and use that in subsequent queries with IN. I'm hoping for something more efficient.
I'm guessing it could be done with stored procedures? But is there a way without that?
I think you can do this with a Postgres feature that allows data modification steps in CTEs. The more typical reason to use this feature is, say, to delete records for a table and then insert them into a log table. However, it can be adapted to this purpose. Here is one possible method (I don't have Postgres on hand to test this):
with q as (
<your query here>
),
t as (
insert into temptable(pk)
select pk
from q
)
select *
from q;
Usually, you use the returning clause with the data modification queries in order to capture the data being modified.

Writing Efficient Queries in SAS Using Proc sql with Teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.
libname output '/data/files/jeff'
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
SELECT DISTINCT id
FROM mydb.sale_volume AS sv
WHERE sv.category IN ('a', 'b', 'c') AND
sv.trans_date BETWEEN &DateStart AND &DateEnd
)
CREATE TABLE output.sums AS (
SELECT id, SUM(sales)
FROM mydb.sale_volue AS sv
INNER JOIN output.id AS ids
ON ids.id = sv.id
WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
GROUP BY id
)
run;
The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.
The above approach is far slower than:
Running the first query to get the subset
Running a second query the sums every ID
Running a third query that inner joins the two result sets.
If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.
After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.
I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.
My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:
proc sql;
CREATE TABLE subset AS (
SELECT
id
FROM
bigTable
WHERE
someValue = x AND
date BETWEEN a AND b
)
This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:
proc sql;
CREATE TABLE subset_data AS (
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
INNER JOIN subset
ON subset.id = bigTable.id
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
)
For whatever reason, this takes a really long time. The difference is that the first query flags 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.
I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.
The most critical thing to understand when using SAS to access data in Teradata (or any other external database for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database specific details. SAS does this using a concept called "implict pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS only has two (and only two) data types, numeric and character.
SAS deals with translating things for you but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like column for a person's name). In the data base this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.
The alternative way is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you. For example, here is a "pass-through" query that performs a join to two tables and creates a SAS dataset as a result:
proc sql;
connect to teradata (user=userid password=password mode=teradata);
create table mydata as
select * from connection to teradata (
select a.customer_id
, a.customer_name
, b.last_payment_date
, b.last_payment_amt
from base.customers a
join base.invoices b
on a.customer_id=b.customer_id
where b.bill_month = date '2013-07-01'
and b.paid_flag = 'N'
);
quit;
Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.
The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:
proc sql;
CREATE TABLE subset_data AS
SELECT bigTable.id,
SUM(bigTable.value) AS total
FROM TDATA.bigTable bigTable
JOIN TDATA.subset subset
ON subset.id = bigTable.id
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id
;
That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to if SAS is even able to pass the complete query to Teradata. (You example doesn't show what "a" and "b" refer to. It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.
You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?
I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues
If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT, in the first query.
Imagine that the 90k rows consists of 30k unique ids, and so have an average of 3 rows per id.
And then imagine those 30k unique ids actually have 9 records in your time window, including rows where somevalue <> x.
You will then get 3x9 records back per id.
And as those two numbers grow, the number of records in your second query grows geometrically.
Alternative Query
If that's not the problem, an alternative query (which is not ideal, but possible) would be...
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
HAVING
MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
If ID is unique and a single value, then you can try constructing a format.
Create a dataset that looks like this:
fmtname, start, label
where fmtname is the same for all records, a legal format name (begins and ends with a letter, contains alphanumeric or _); start is the ID value; and label is a 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import into proc format using the CNTLIN option, and you now have a 1/0 value conversion.
Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character - whichever is right for your use.
data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you. Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
hlo='o';
call missing(start);
label='0';
output;
end;
run;
proc format cntlin=for_fmt;
quit;
Now instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster - depends on how the SQL optimizer is working.)
If the id is unique you might add a UNIQUE PRIMARY INDEX(id) to that table, otherwise it defaults to a Non-unique PI.
Knowing about uniquenes helps the optimizer to produce a better plan.
Without more info like an Explain (just put EXPLAIN in front of the SELECT) it's hard to tell how this can be improved.
One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:
proc sql;
create view blah as select ... (your join);
quit;
proc freq data=blah;
tables id/out=summary(rename=count=total keep=id count);
run;
Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL (depending on some details, such as how your data is organized, what you're actually doing, and how much memory you have available). It has the added benefit that SAS might choose to do this in-database, if you create the view in the database, which might be faster. (In fact, if you just run the freq off the base table, it's possible that would be even faster, and then join the results to the smaller table).

PostgreSQL query decomposition

I fail to decompose simple SQL queries. I use PostgreSQL but my question is also related to other RDBMS.
Consider the following example. We have table orders and we want to find first order after which total amount exceeded some limit:
drop table if exists orders cascade;
/**
Table with clients' orders
*/
create table orders(
date timestamp,
amount integer
/**
Other columns omitted
*/
);
/**
Populate with test data
*/
insert into orders(date,amount)
values
('2011-01-01',50),
('2011-01-02',49),
('2011-01-03',2),
('2011-01-04',1000);
/**
Selects first order that caused exceeding of limit
*/
create view first_limit_exceed
as
select min(date) from
(
select o1.date
from orders o1,
orders o2
where o2.date<=o1.date
group by o1.date
having sum(o2.amount) > 100
) limit_exceed;
/**
returns "2011-01-03 00:00:00"
*/
select * from first_limit_exceed;
Now let's make the problem a little harder. Consider we want to find total amount only for rows that satisfy some predicate. We have a lot of such predicates and creating separate version of view first_limit_exceed would be terrible code duplication. So we need some way to create parameterized view and pass either filtered set of rows or predicate itself to it.
In Postgres we can use query language functions as parameterized views. But Postgres does not allow function to get as argument neither set of row nor another function.
I still can use string interpolation on client's side or in plpgsql function, but it is error-prone and hard to test and debug.
Any advice?
In PostgreSQL 8.4 and later:
SELECT *
FROM (
SELECT *,
SUM(amount) OVER (ORDER BY date) AS psum
FROM orders
) q
WHERE psum > 100
ORDER BY
date
LIMIT 1
Add any predicates you want into the inner query:
SELECT *
FROM (
SELECT *,
SUM(amount) OVER (ORDER BY date) AS psum
FROM orders
WHERE date >= '2011-01-03'
) q
WHERE psum > 100
ORDER BY
date
LIMIT 1
It sounds a bit like you're trying to put too much code into the database. If you are interested in the rows of a certain relation that satisfy a particular predicate, just execute a select statement with an appropriate where clause in the client code. Having views that take predicates as parameters is reinventing the wheel that sql already solves nicely.
On the other hand, I can see an argument for storing queries themselves in the database, so that they can be composed into larger reports. This two is still better handled by application code. I might approach a problem like that by using a library that's good at dynamic sql generatation, (for example sqlalchemy), and then storing the query representations (sqlalchemy expression objects are 'pickleable') as blobs in the database.
To put it another way, databases are representers of facts, You store knowledge in them. applications have the duty of acting on user requests, When you find yourself defining transformations on the data, that's really more a matter of anticipating and implementing the requests of actual users, rather than just faithfully preserving knowledge.
Views are best used when the schema inevitably changes, so you can leave older applications that don't need to know about the new schema in a working state.

IN vs OR of Oracle, which faster?

I'm developing an application which processes many data in Oracle database.
In some case, I have to get many object based on a given list of conditions, and I use SELECT ...FROM.. WHERE... IN..., but the IN expression just accepts a list whose size is maximum 1,000 items.
So I use OR expression instead, but as I observe -- perhaps this query (using OR) is slower than IN (with the same list of condition). Is it right? And if so, how to improve the speed of query?
IN is preferable to OR -- OR is a notoriously bad performer, and can cause other issues that would require using parenthesis in complex queries.
Better option than either IN or OR, is to join to a table containing the values you want (or don't want). This table for comparison can be derived, temporary, or already existing in your schema.
In this scenario I would do this:
Create a one column global temporary table
Populate this table with your list from the external source (and quickly - another whole discussion)
Do your query by joining the temporary table to the other table (consider dynamic sampling as the temporary table will not have good statistics)
This means you can leave the sort to the database and write a simple query.
Oracle internally converts IN lists to lists of ORs anyway so there should really be no performance differences. The only difference is that Oracle has to transform INs but has longer strings to parse if you supply ORs yourself.
Here is how you test that.
CREATE TABLE my_test (id NUMBER);
SELECT 1
FROM my_test
WHERE id IN (1,2,3,4,5,6,7,8,9,10,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,
91,92,93,94,95,96,97,98,99,100
);
SELECT sql_text, hash_value
FROM v$sql
WHERE sql_text LIKE '%my_test%';
SELECT operation, options, filter_predicates
FROM v$sql_plan
WHERE hash_value = '1181594990'; -- hash_value from previous query
SELECT STATEMENT
TABLE ACCESS FULL ("ID"=1 OR "ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5
OR "ID"=6 OR "ID"=7 OR "ID"=8 OR "ID"=9 OR "ID"=10 OR "ID"=21 OR
"ID"=22 OR "ID"=23 OR "ID"=24 OR "ID"=25 OR "ID"=26 OR "ID"=27 OR
"ID"=28 OR "ID"=29 OR "ID"=30 OR "ID"=31 OR "ID"=32 OR "ID"=33 OR
"ID"=34 OR "ID"=35 OR "ID"=36 OR "ID"=37 OR "ID"=38 OR "ID"=39 OR
"ID"=40 OR "ID"=41 OR "ID"=42 OR "ID"=43 OR "ID"=44 OR "ID"=45 OR
"ID"=46 OR "ID"=47 OR "ID"=48 OR "ID"=49 OR "ID"=50 OR "ID"=51 OR
"ID"=52 OR "ID"=53 OR "ID"=54 OR "ID"=55 OR "ID"=56 OR "ID"=57 OR
"ID"=58 OR "ID"=59 OR "ID"=60 OR "ID"=61 OR "ID"=62 OR "ID"=63 OR
"ID"=64 OR "ID"=65 OR "ID"=66 OR "ID"=67 OR "ID"=68 OR "ID"=69 OR
"ID"=70 OR "ID"=71 OR "ID"=72 OR "ID"=73 OR "ID"=74 OR "ID"=75 OR
"ID"=76 OR "ID"=77 OR "ID"=78 OR "ID"=79 OR "ID"=80 OR "ID"=81 OR
"ID"=82 OR "ID"=83 OR "ID"=84 OR "ID"=85 OR "ID"=86 OR "ID"=87 OR
"ID"=88 OR "ID"=89 OR "ID"=90 OR "ID"=91 OR "ID"=92 OR "ID"=93 OR
"ID"=94 OR "ID"=95 OR "ID"=96 OR "ID"=97 OR "ID"=98 OR "ID"=99 OR
"ID"=100)
I would question the whole approach. The client of the SP has to send 100000 IDs. Where does the client get those IDs from? Sending such a large number of ID as the parameter of the proc is going to cost significantly anyway.
If you create the table with a primary key:
CREATE TABLE my_test (id NUMBER,
CONSTRAINT PK PRIMARY KEY (id));
and go through the same SELECTs to run the query with the multiple IN values, followed by retrieving the execution plan via hash value, what you get is:
SELECT STATEMENT
INLIST ITERATOR
INDEX RANGE SCAN
This seems to imply that when you have an IN list and are using this with a PK column, Oracle keeps the list internally as an "INLIST" because it is more efficient to process this, rather than converting it to ORs as in the case of an un-indexed table.
I was using Oracle 10gR2 above.

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partitition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>…/*
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed infromation on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps however please feel free to pose further questions.
Cheers, John
I use wrapper queries to select the core query and then just isolate the ROW numbers that i wish to take from the query - this allows the SQL server to do all the heavy lifting inside the CORE query and just pass out the small amount of the table that i have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are TEMPORARY table created by the SQL server as the wrapper tables.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead
Use TOP to select only a limited amont of rows like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.