How to structure SQL - select first X rows for each value of a column?


I have a table with the following type of data:
create table store (
n_id serial not null primary key,
n_place_id integer not null references place(n_id),
dt_modified timestamp not null,
t_tag varchar(4),
n_status integer not null default 0
...
(about 50 more fields)
);
There are indices on n_id, n_place_id, dt_modified and all other fields used in the query below.
This table contains about 100,000 rows at present, but may grow to closer to a million or even more. Yet, for now let's assume we're staying at around the 100K mark.
I'm trying to select rows from this table where one of two conditions is met:
All rows where n_place_id is in a specific subset (this part is easy); or
For all other n_place_id values the first ten rows sorted by dt_modified (this is where it becomes more complicated).
Doing it in a single SQL statement seems too painful, so I'm happy with a stored function for this. I have my function defined thus:
create or replace function api2.fn_api_mobile_objects()
returns setof store as
$body$
declare
maxres_free integer := 10;
resulter store%rowtype;
mcnt integer := 0;
previd integer := 0;
begin
create temporary table paid on commit drop as
select n_place_id from payments where t_reference is not null and now()::date between dt_paid and dt_valid;
for resulter in
select * from store where n_status > 0 and t_tag is not null order by n_place_id, dt_modified desc
loop
if resulter.n_place_id in (select n_place_id from paid) then
return next resulter;
else
if previd <> resulter.n_place_id then
mcnt := 0;
previd := resulter.n_place_id;
end if;
if mcnt < maxres_free then
return next resulter;
mcnt := mcnt + 1;
end if;
end if;
end loop;
end;$body$
language 'plpgsql' volatile;
The problem is that
select * from api2.fn_api_mobile_objects()
takes about 6-7 seconds to execute. Considering that after that this resultset needs to be joined to 3 other tables with a bunch of additional conditions applied and further sorting applied, this is clearly unacceptable.
Well, I still do need to get this data, so either I am missing something in the function or I need to rethink the entire algorithm. Either way, I need help with this.

CREATE TABLE store
( n_id serial not null primary key
, n_place_id integer not null -- references place(n_id)
, dt_modified timestamp not null
, t_tag varchar(4)
, n_status integer not null default 0
);
INSERT INTO store(n_place_id,dt_modified,n_status)
SELECT n,d,n%4
FROM generate_series(1,100) n
, generate_series('2012-01-01'::date ,'2012-10-01'::date, '1 day'::interval ) d
;
WITH zzz AS (
SELECT n_id AS n_id
, rank() OVER (partition BY n_place_id ORDER BY dt_modified) AS rnk
FROM store
)
SELECT st.*
FROM store st
JOIN zzz ON zzz.n_id = st.n_id
WHERE st.n_place_id IN ( 1,22,333)
OR zzz.rnk <=10
;
Update: here is the same self-join construct as a subquery (CTEs are treated a bit differently by the planner):
SELECT st.*
FROM store st
JOIN ( SELECT sx.n_id AS n_id
, rank() OVER (partition BY sx.n_place_id ORDER BY sx.dt_modified) AS zrnk
FROM store sx
) xxx ON xxx.n_id = st.n_id
WHERE st.n_place_id IN ( 1,22,333)
OR xxx.zrnk <=10
;

After much struggle, I managed to get the stored function to return the results in just over 1 second (which is a huge improvement). Now the function looks like this (I added the additional condition, which didn't affect the performance much):
create or replace function api2.fn_api_mobile_objects(t_search varchar)
returns setof store as
$body$
declare
maxres_free integer := 10;
resulter store%rowtype;
mid integer := 0;
begin
create temporary table paid on commit drop as
select n_place_id from payments where t_reference is not null and now()::date between dt_paid and dt_valid
union
select n_place_id from store where n_status > 0 and t_tag is not null group by n_place_id having count(1) <= 10;
for resulter in
select * from store
where n_status > 0 and t_tag is not null
and (t_name ~* t_search or t_description ~* t_search)
and n_place_id in (select n_place_id from paid)
loop
return next resulter;
end loop;
for mid in
select distinct n_place_id from store where n_place_id not in (select n_place_id from paid)
loop
for resulter in
select * from store where n_status > 0 and t_tag is not null and n_place_id = mid order by dt_modified desc limit maxres_free
loop
return next resulter;
end loop;
end loop;
end;$body$
language 'plpgsql' volatile;
This runs in just over 1 second on my local machine and in about 0.8-1.0 seconds on live. For my purpose, this is good enough, although I am not sure what will happen as the amount of data grows.

As a simple suggestion, the way I like to do this sort of troubleshooting is to construct a query that gets me most of the way there, and optimize it properly, and then add the necessary pl/pgsql stuff around it. The major advantage to this approach is that you can optimize based on query plans.
Also if you aren't dealing with a lot of rows, array_agg() and unnest() are your friends as they allow you (on Pg 8.4 and later!) to dispense with the temporary table management overhead and simply construct and query an array of tuples in memory as a relation. It may perform better also if you are just hitting an array in memory instead of a temp table (less planning overhead and less query overhead too).
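As a rough illustration of the array idea, here is a minimal PL/pgSQL sketch that collects the paid place ids into an in-memory array instead of a temporary table. The table and column names come from the question; the variable names paid_ids and n_rows are made up for this example, and the final count is only there to show the array being used in a query.

```sql
DO $$
DECLARE
    paid_ids integer[];   -- hypothetical variable: ids of paid places
    n_rows   bigint;      -- hypothetical variable: used only for the demo output
BEGIN
    -- Build the array once, instead of creating a temp table.
    SELECT array_agg(DISTINCT n_place_id) INTO paid_ids
    FROM payments
    WHERE t_reference IS NOT NULL
      AND now()::date BETWEEN dt_paid AND dt_valid;

    -- Use the array directly in a filter.
    SELECT count(*) INTO n_rows
    FROM store
    WHERE n_place_id = ANY (paid_ids);

    RAISE NOTICE 'rows for paid places: %', n_rows;
END
$$;
```

If no payment qualifies, paid_ids is NULL and the ANY filter simply matches no rows.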
Also on your updated query I would look at replacing that final loop with a subquery or a join, allowing the planner to decide when to do a nested loop lookup or when to try to find a better way.
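One possible shape for such a loop-free query is sketched below. It is not the poster's code: it reuses the table and column names from the question, assumes that "first ten" means the ten most recently modified rows per place (as in the function), and uses row_number() rather than rank() so that ties on dt_modified cannot return more than ten rows.

```sql
-- Sketch only: paid places get all rows, every other place gets its 10 newest rows.
WITH paid AS (
    SELECT DISTINCT n_place_id
    FROM payments
    WHERE t_reference IS NOT NULL
      AND now()::date BETWEEN dt_paid AND dt_valid
), ranked AS (
    SELECT s.*,
           row_number() OVER (PARTITION BY s.n_place_id
                              ORDER BY s.dt_modified DESC) AS rn
    FROM store s
    WHERE s.n_status > 0
      AND s.t_tag IS NOT NULL
)
SELECT r.*
FROM ranked r
LEFT JOIN paid p ON p.n_place_id = r.n_place_id
WHERE p.n_place_id IS NOT NULL   -- paid place: keep every row
   OR r.rn <= 10;                -- free place: keep the 10 newest rows
```

Note that r.* carries the extra rn column, so the result no longer matches the store row type exactly; list the columns explicitly if that matters. Checking the plan with EXPLAIN ANALYZE on real data is the only way to tell whether this beats the function.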

Related

value limitation in an IN clause Oracle

I work for a company that has a DW - ETL setup. I need to write a query that looks for more than 2500 values in a CASE WHEN - IN clause and also more than 1000 values in a WHERE - IN clause. Basically it would look like the following:
SELECT user_id
,CASE WHEN user_id IN ('user_n', +2500 user_[n+1] ) THEN 1
ELSE 0 END AS user_group
,item_id
FROM user_table
WHERE item_id IN ('item_n', +1000 item_[n+1] );
As you probably already know, Oracle allows a maximum of 1000 values in an IN list (ORA-01795), so I tried adding OR - IN clauses (as suggested in other Stack Overflow threads):
SELECT user_id
,CASE WHEN user_id IN ('user_n', +999 user_[n+1] )
OR user_id IN ('user_n', +999 user_[n+1] )
OR user_id IN ('user_n', +999 user_[n+1] ) THEN 1
ELSE 0 END AS user_group
,item_id
FROM user_table
WHERE item_id IN ('item_n', +999 item_[n+1] )
OR item_id IN ('item_n', +999 item_[n+1] );
NOTE: I know the math is erroneous in the examples above, but you get the point.
The problem is that queries have a maximum execution time of 120 minutes and the job is automatically killed. So I googled for solutions, and it seems temporary tables could be what I'm looking for, but in all honesty none of the examples I found is clear enough on how to load the values I want into the table and how to use this table in my original query. Not even the Oracle documentation was of much help.
Another potential problem is that I have limited rights and I've seen other people mention that in their companies they don't have the rights to create temporary tables.
Some of the info I found in my research:
ORACLE documentation
StackOverflow thread
[StackOverflow thread 2]
Another solution I found was using tuples instead, as mentioned in THIS thread. I haven't tried it because, as another user mentions, performance seems to be greatly affected.
Any guidance on how to use a Temporary Table or if anyone has another way of dealing with this limitation would be greatly appreciated.

Create a global temporary table, which generates much less redo than a regular table:
CREATE GLOBAL TEMPORARY TABLE <table_name> (
<column_name> <column_data_type>,
<column_name> <column_data_type>,
<column_name> <column_data_type>)
ON COMMIT DELETE ROWS;
Then, depending on how the user list arrives, import the data into a holding table and run
select 'INSERT INTO <global_temporary_table> (<column_name>) VALUES (''' || <column_name> || ''');'
FROM <holding_table>;
This gives you insert statements as output which you run to insert the data.
Then
SELECT <some_column>
FROM <some_table>
WHERE <some_value> IN
(SELECT <some_column> from <global_temporary_table>);

Use a collection:
CREATE TYPE Ints_Table AS TABLE OF INT;
CREATE TYPE IDs_Table AS TABLE OF CHAR(5);
Something like this:
SELECT user_id,
CASE WHEN user_id MEMBER OF Ints_Table( 1, 2, 3, /* ... */ 2500 )
THEN 1
ELSE 0
END
,item_id
FROM user_table
WHERE item_id MEMBER OF IDs_table( 'ABSC2', 'DITO9', 'KMKM9', /* ... */ 'QD3R5' );
Or you can use PL/SQL to populate a collection:
VARIABLE cur REFCURSOR;
DECLARE
t_users Ints_Table := Ints_Table(); -- collections must be initialized before EXTEND
t_items IDs_Table := IDs_Table();
f UTL_FILE.FILE_TYPE;
line VARCHAR2(4000);
BEGIN
t_users.EXTEND( 2500 );
FOR i IN 1 .. 2500 LOOP
t_users( i ) := i;
END LOOP;
-- load data from a file
f := UTL_FILE.FOPEN('DIRECTORY_HANDLE','datafile.txt','R');
IF UTL_FILE.IS_OPEN(f) THEN
LOOP
BEGIN
UTL_FILE.GET_LINE(f,line);
EXCEPTION
WHEN NO_DATA_FOUND THEN EXIT; -- GET_LINE raises this at end of file
END;
IF line IS NULL THEN EXIT; END IF;
t_items.EXTEND;
t_items( t_items.COUNT ) := line;
END LOOP;
UTL_FILE.FCLOSE(f);
END IF;
OPEN :cur FOR
SELECT user_id,
CASE WHEN user_id MEMBER OF t_users
THEN 1
ELSE 0
END
,item_id
FROM user_table
WHERE item_id MEMBER OF t_items;
END;
/
PRINT cur;
Or if you are using another language to call the query then you could pass the collections as a bind value (as shown here).

In PL/SQL you could use a collection type. You could create your own like this:
create type string_table is table of varchar2(100);
Or use an existing type such as SYS.DBMS_DEBUG_VC2COLL which is a table of VARCHAR2(1000).
Now you can declare a collection of this type for each of your lists, populate it, and use it in the query - something like this:
declare
strings1 SYS.DBMS_DEBUG_VC2COLL := SYS.DBMS_DEBUG_VC2COLL();
strings2 SYS.DBMS_DEBUG_VC2COLL := SYS.DBMS_DEBUG_VC2COLL();
procedure add_string1 (p_string varchar2) is
begin
strings1.extend();
strings1(strings1.count) := p_string;
end;
procedure add_string2 (p_string varchar2) is
begin
strings2.extend();
strings2(strings2.count) := p_string;
end;
begin
add_string1('1');
add_string1('2');
add_string1('3');
-- and so on...
add_string1('2500');
add_string2('1');
add_string2('2');
add_string2('3');
-- and so on...
add_string2('1400');
for r in (
select user_id
, case when user_id in (select column_value from table(strings2)) then 1 else 0 end as indicator
, item_id
from user_table
where item_id in (select column_value from table(strings1))
)
loop
dbms_output.put_Line(r.user_id||' '||r.indicator);
end loop;
end;
/

You can use below example to understand Global temporary tables and the type of GTT.
CREATE GLOBAL TEMPORARY TABLE GTT_PRESERVE_ROWS (ID NUMBER) ON COMMIT PRESERVE ROWS;
INSERT INTO GTT_PRESERVE_ROWS VALUES (1);
COMMIT;
SELECT * FROM GTT_PRESERVE_ROWS;
DELETE FROM GTT_PRESERVE_ROWS;
COMMIT;
TRUNCATE TABLE GTT_PRESERVE_ROWS;
DROP TABLE GTT_PRESERVE_ROWS; -- won't work if you did not truncate the table or if the table is in use in another session
CREATE GLOBAL TEMPORARY TABLE GTT_DELETE_ROWS (ID NUMBER) ON COMMIT DELETE ROWS;
INSERT INTO GTT_DELETE_ROWS VALUES (1);
SELECT * FROM GTT_DELETE_ROWS;
COMMIT;
SELECT * FROM GTT_DELETE_ROWS;
DROP TABLE GTT_DELETE_ROWS;
However, as you mentioned, you receive the input in an Excel file, so you can simply create a table and load the data into it. Once the data is loaded, you can use it in the IN clause of your query.
select * from employee where empid in (select empid from temptable);

create temporary table userids (userid int);
insert into userids(...)
then a join or in subquery
select ...
where user_id in (select userid from userids);
drop temporary table userids;

Display Number of Rows based on input parameter

CREATE OR REPLACE PROCEDURE test_max_rows (
max_rows IN NUMBER DEFAULT 1000
)
IS
CURSOR cur_test ( max_rows IN number ) IS
SELECT id FROM test_table
WHERE user_id = 'ABC'
AND ROWNUM <= max_rows;
id test_table.id%TYPE;
BEGIN
OPEN cur_test(max_rows) ;
LOOP
FETCH cur_test INTO id;
EXIT WHEN cur_test%NOTFOUND;
DBMS_OUTPUT.PUT_LINE('ID:' || id);
END LOOP;
END;
My requirement is to modify the above code so that when I pass -1 for max_rows, the proc should return all the rows returned by the query. Otherwise, it should limit the rows as per max_rows.
For example:
EXECUTE test_max_rows(-1);
This command should return all the rows returned by the SELECT statement above.
EXECUTE test_max_rows(10);
This command should return only 10 rows.

You can do this with an OR condition; change:
AND ROWNUM <= max_rows;
to:
AND (max_rows < 1 OR ROWNUM <= max_rows);
Then passing zero, -1, or any negative number will fetch all rows, and any positive number will return a restricted list. You could also replace the default 1000 clause with default null, and then test for null instead, which might be a bit more obvious:
AND (max_rows is null OR ROWNUM <= max_rows);
Note that which rows you get with a passed value will be indeterminate because you don't have an order by clause at the moment.
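One way to make the limited result deterministic, sketched here under the assumption that ordering by id is acceptable (the question does not say which order is wanted), is to sort in an inline view and apply ROWNUM outside it. The :max_rows bind variable stands in for the procedure parameter.

```sql
SELECT id
FROM (
    SELECT id
    FROM test_table
    WHERE user_id = 'ABC'
    ORDER BY id              -- assumed ordering column
)
WHERE :max_rows < 1          -- zero or negative: no limit
   OR ROWNUM <= :max_rows;   -- positive: first max_rows rows in sorted order
```

ROWNUM is assigned before an ORDER BY in the same query block is applied, which is why the sort has to happen in the inner query.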
Doing this in a procedure also seems a bit odd, and you're assuming whoever calls it will be able to see the output - i.e. will have done set serveroutput on or the equivalent for their client - which is not a very safe assumption. An alternative, if you can't specify the row limit in a simple query, might be to use a pipelined function instead - you could at least then call that from plain SQL.
CREATE OR REPLACE FUNCTION test_max_rows (max_rows IN NUMBER DEFAULT NULL)
RETURN sys.odcinumberlist PIPELINED
AS
BEGIN
FOR r IN (
SELECT id FROM test_table
WHERE user_id = 'ABC'
AND (max_rows IS NULL OR ROWNUM <= max_rows)
) LOOP
PIPE ROW (r.id);
END LOOP;
END;
/
And then call it as:
SELECT * FROM TABLE(test_max_rows);
or
SELECT * FROM TABLE(test_max_rows(10));
Here's a quick SQL Fiddle demo. But you should still consider whether you can do the whole thing in plain SQL and avoid PL/SQL altogether.

PL/pgSQL checking if a row exists

I'm writing a function in PL/pgSQL, and I'm looking for the simplest way to check if a row exists.
Right now I'm SELECTing an integer into a boolean, which doesn't really work. I'm not experienced with PL/pgSQL enough yet to know the best way of doing this.
Here's part of my function:
DECLARE person_exists boolean;
BEGIN
person_exists := FALSE;
SELECT "person_id" INTO person_exists
FROM "people" p
WHERE p.person_id = my_person_id
LIMIT 1;
IF person_exists THEN
-- Do something
END IF;
END; $$ LANGUAGE plpgsql;
Update - I'm doing something like this for now:
DECLARE person_exists integer;
BEGIN
person_exists := 0;
SELECT count("person_id") INTO person_exists
FROM "people" p
WHERE p.person_id = my_person_id
LIMIT 1;
IF person_exists < 1 THEN
-- Do something
END IF;

Simpler, shorter, faster: EXISTS.
IF EXISTS (SELECT FROM people p WHERE p.person_id = my_person_id) THEN
-- do something
END IF;
The query planner can stop at the first row found - as opposed to count(), which scans all (qualifying) rows regardless. Makes a big difference with big tables. The difference is small for a condition on a unique column: only one row qualifies and there is an index to look it up quickly.
Only the existence of at least one qualifying row matters. The SELECT list can be empty - in fact, that's shortest and cheapest. (Some other RDBMS don't allow an empty SELECT list on principle.)
Improved with @a_horse_with_no_name's comments.

Use count(*)
declare
cnt integer;
begin
SELECT count(*) INTO cnt
FROM people
WHERE person_id = my_person_id;
IF cnt > 0 THEN
-- Do something
END IF;
Edit (for the downvoter who didn't read the statement and others who might be doing something similar)
The solution is only effective because there is a where clause on a column (and the name of the column suggests that it's the primary key, so the where clause is highly selective).
Because of that where clause there is no need to use a LIMIT or something else to test the presence of a row that is identified by its primary key. It is an effective way to test this.

PL/SQL loop through cursor

My problem isn't overly complicated, but I am a newbie to PL/SQL.
I need to make a selection from a COMPANIES table based on certain conditions. I then need to loop through these and convert some of the fields into a different format (I have created functions for this), and finally use this converted version to join to a reference table to get the score variable I need. So basically:
select id, total_emps, bank from COMPANIES where turnover > 100000
loop through this selection
insert into MY_TABLE (select score from REF where conversion_func(MY_CURSOR.total_emps) = REF.total_emps)
This is basically what I am looking to do. It's slightly more complicated but I'm just looking for the basics and how to approach it to get me started!

Here's the basic syntax for cursor loops in PL/SQL:
BEGIN
FOR r_company IN (
SELECT
ID,
total_emps,
bank
FROM
companies
WHERE
turnover > 100000
) LOOP
INSERT INTO
my_table
SELECT
score
FROM
ref
WHERE
ref.total_emps = conversion_func( r_company.total_emps )
;
END LOOP;
END;
/

You don't need to use PL/SQL to do this:
insert into my_table
select score
from ref r
join companies c
on r.total_emps = conversion_func(c.total_emps)
where c.turnover > 100000
If you have to do this in a PL/SQL loop as asked, then I'd ensure that you do as little work as possible. I would, however, recommend bulk collect instead of the loop.
begin
for xx in ( select conversion_func(total_emps) as tot_emp
from companies
where turnover > 100000 ) loop
insert into my_table
select score
from ref
where total_emps = xx.tot_emp
;
end loop;
end;
/
For either method you need one index on ref.total_emps and preferably one on companies.turnover
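The bulk collect recommendation above could look roughly like the sketch below. It assumes conversion_func returns a NUMBER (the question does not show its signature); the type name t_emps and variable l_emps are made up for this example.

```sql
DECLARE
    TYPE t_emps IS TABLE OF NUMBER;  -- assumption: conversion_func returns a NUMBER
    l_emps t_emps;
BEGIN
    -- Fetch all converted values in one round trip.
    SELECT conversion_func(total_emps)
    BULK COLLECT INTO l_emps
    FROM companies
    WHERE turnover > 100000;

    -- Run the insert once per collected value, sent to the SQL engine as one batch.
    FORALL i IN 1 .. l_emps.COUNT
        INSERT INTO my_table
        SELECT score
        FROM ref
        WHERE total_emps = l_emps(i);
END;
/
```

For a very large companies table, fetching in chunks with a LIMIT clause on an explicit cursor would keep memory use bounded; the plain INSERT ... SELECT join is still likely to be the simplest and fastest option.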

Getting weird issue with TO_NUMBER function in Oracle

I have been getting an intermittent issue when executing the to_number function in the where clause on a varchar2 column if the number of records exceeds a certain number n. I use n because there is no exact number of records at which it happens: on one DB it happens once n reaches 1 million, on another at 0.1 million.
E.g. I have a table Country with 10 million records, which has field1 varchar2 containing numeric data, and Id.
If I do a query as an example
select *
from country
where to_number(field1) = 23
and id >1 and id < 100000
This works
But if I do the query
select *
from country
where to_number(field1) = 23
and id >1 and id < 100001
It fails saying invalid number
Next I try the query
select *
from country
where to_number(field1) = 23
and id >2 and id < 100001
It works again
As I only got invalid number it was confusing, but in the log file it said
Memory Notification: Library Cache Object loaded into SGA
Heap size 3823K exceeds notification threshold (2048K)
KGL object name :with sqlplan as (
select c006 object_owner, c007 object_type,c008 object_name
from htmldb_collections
where COLLECTION_NAME='HTMLDB_QUERY_PLAN'
and c007 in ('TABLE','INDEX','MATERIALIZED VIEW','INDEX (UNIQUE)')),
ws_schemas as(
select schema
from wwv_flow_company_schemas
where security_group_id = :flow_security_group_id),
t as(
select s.object_owner table_owner,s.object_name table_name,
d.OBJECT_ID
from sqlplan s,sys.dba_objects d
It seems its related to SGA size, but google did not give me much help on this.
Does anyone have any idea about this issue with TO_NUMBER or oracle functions for large data?

"which has field1 varchar2 containing numeric data"
This is not good practice. Numeric data should be kept in NUMBER columns. The reason is simple: if we don't enforce a strong data type we might find ourselves with non-numeric data in our varchar2 column. If that were to happen then a filter like this
where to_number(field1) = 23
would fail with ORA-01722: invalid number.
I can't say for certain that this is what is happening in your scenario, because I don't understand why apparently insignificant changes in the ID filters change whether the query succeeds. It would be instructive to see the execution plans for the different versions of the queries. But I think it is more likely to be a problem with your data than a bug in the SGA.
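To compare the plans mentioned above, something along these lines should work in SQL*Plus or a similar client; the query is the failing one from the question, and the same two statements can be repeated for the variants that succeed.

```sql
EXPLAIN PLAN FOR
SELECT *
FROM country
WHERE to_number(field1) = 23
  AND id > 1 AND id < 100001;

-- Show the plan just explained, including the Predicate Information section.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

The Predicate Information section shows which conditions are applied as access predicates and which as filters. If the to_number filter is evaluated on rows outside the id range in one plan but not in another, that would explain the intermittent ORA-01722.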

Assuming you know that the given range of ids will always result in field1 containing numeric data, you could do this instead:
select *
from (
select /*+NO_MERGE*/ *
from country
where id >1 and id < 100000
)
where to_number(field1) = 23;

Suggest doing the following to determine for sure whether there are records containing non-numeric data. As others have said, variations in the execution plan and order of evaluation could explain why the error does not appear consistently.
(assuming SQLPlus as the client)
SET SERVEROUTPUT ON
DECLARE
x NUMBER;
BEGIN
FOR rec IN (SELECT id, field1 FROM country) LOOP
BEGIN
x := TO_NUMBER( rec.field1 );
EXCEPTION
WHEN OTHERS THEN
dbms_output.put_line( rec.id || ' ' || rec.field1 );
END;
END LOOP;
END;
/
An alternative workaround to your original issue would be to rewrite the query to avoid implicit type conversion, e.g.
SELECT id, TO_NUMBER( field1 )
FROM country
WHERE field1 = '23'
AND <whatever condition on id you want, if any>

Consider writing an IS_NUMBER PL/SQL function:
CREATE OR REPLACE FUNCTION IS_NUMBER (p_input IN VARCHAR2) RETURN NUMBER
AS
BEGIN
RETURN TO_NUMBER (p_input);
EXCEPTION
WHEN OTHERS THEN RETURN NULL;
END IS_NUMBER;
/
SQL> SELECT COUNT(*) FROM DUAL WHERE IS_NUMBER ('TEST') IS NOT NULL;
COUNT(*)
----------
0
SQL> SELECT COUNT(*) FROM DUAL WHERE IS_NUMBER ('123.45') IS NOT NULL;
COUNT(*)
----------
1
