Search table for rows with combination of N columns matching any from the provided list - sql

I have a COFFEE_CONSUMPTION table with 5 columns in Postgres. It can grow up to e.g. 1 million rows and more.
EMPLOYEE_ID|ROOM_ID|DDAY |COUNT|TYPE |
-----------------------------------------------------
1 |1 |2023-02-16|1 |latte |
1 |1 |2023-02-16|3 |espresso |
2 |1 |2023-02-16|2 |latte |
3 |2 |2023-02-16|1 |espresso |
4 |2 |2023-02-17|3 |frappuccino |
... |... |... |... |... |
We need to search for records with EMPLOYEE_ID+DAY+TYPE combination matching one of those from the provided list.
Looking for an index-wise implementation I came up with such a solution:
Created 2 immutable functions
CREATE OR REPLACE FUNCTION C_CONCAT(text, VARIADIC text[])
RETURNS text
LANGUAGE sql
IMMUTABLE PARALLEL SAFE AS
'SELECT array_to_string($2, $1)';
CREATE FUNCTION C_TO_CHAR(date) RETURNS text AS
$$
select to_char($1, 'YYYY-MM-DD');
$$
LANGUAGE sql immutable;
Added index to the table:
CREATE INDEX COFFEE_CONSUMPTION_IDX
ON COFFEE_CONSUMPTION (C_CONCAT('', EMPLOYEE_ID, C_TO_CHAR(DDAY), TYPE));
Wrote a query:
SELECT *
FROM COFFEE_CONSUMPTION CC
WHERE C_CONCAT('', CC.EMPLOYEE_ID, C_TO_CHAR(CC.DDAY), CC.TYPE) IN ?
Any ideas on how to improve this and/or get rid of custom functions?
One of possible approaches would be to create a multi-column index, populate a temporary table with field combinations and inner join in select (as described here). But I believe this would be more verbose, more complex from client programming (e.g. JPA) perspective and equally performant.
UPD: the main problem turned out to be not the SQL query but JPA+Hibernate stack which doesn't allow querying by list of lists. So I decided to stick to the functions solution for now.

I don't see why you need that complication with a custom function:
create index on coffee_consumption (employee_id, dday, type);
That index should be used if you run:
select *
from coffee_consumption cc
where (cc.employee_id, cc.dday, cc.type) in (?,?,?)
or
select *
from coffee_consumption cc
where cc.employee_id = ?
cc.dday = ?
cc.type = ?

Related

SQL extract IDS based on two columns on common

i need to figure out how i can i accomplish this task that was given to me, you see, i have imported an Excel, cleaned out the information and used this information to start joining the tables i need to, when i started i realized i needed to make it very precisely so i needed the id of the data i'm using which doesn't come in this Excel document i imported (since the id are stored in the database and the Excel was built by other people who don't handle databases) so i have a workmate whom i asked about how to do this task, he told me to do the inner join on the columns in common, but the way i did it appeared an error and logically didn't work, therefore i thought extracting the id from the table they are stored would be a good idea (well maybe not) but i don't know how to do it nor if it Will work, i'll give you some examples of how the tables would look like:
table 1
----------------------
|ID|column_a|column_b|
|1 |2234 |3 |
|2 |41245 |23 |
|3 |442 |434 |
|4 |1243 |1 |
----------------------
table 2
---------------------------------
|creation_date|column_a|column_b|
|1/12/2018 |2234 |3 |
|4/31/2011 |41245 |23 |
|7/22/2014 |442 |434 |
|10/14/2017 |1243 |1 |
---------------------------------
as you can see, the values of the columns a and b match perfectly, so there could be a bridge between the two tables, i tried to join the data by the column a but did not work since the output was much larger that i should, i also tried doing a simple query with an IN statement but did not work either since i brought up nearly the entire databases duplicated (i'm working with big databases the table 1 contains nearly 35.000 rows and the table 2 contains nearly 10.000) extracting the ids ad if they were row files won't work since they are very different from what is in the id tables in the actual table i'm working with, so what do you think it would be the best way to achieve this task? any kind of help i would be grateful, thanks in advance.
EDIT
Based on the answer of R3_ i tried his query but adapted to my needs and worked in some cases but in others i got the cartesian product, the example i'm using is that i have in table 2 in column_a the number 1000 and column_b has number 1, table 1 has 10 ids for that combination of numbers since the 1000-1 number is not the same (technically it is, but it has stored different information and is usually differenced by the ID) so the output is either 10 rows (assuming that it is only picking those with id) or 450 not the 45 i need as result, the query i'm using is like this:
SELECT DISTINCT table_1.id, table_2.column_a, table_2.column_b --if i pick the columns from table 1 returns 10 rows if i pick them from table 2 it returns 450
FROM table_2
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_1.column_b = table_2.column_b
WHERE table_2.column_a = 1022 AND table_2.column_b = 1
so the big deal has to do with the 10 id that has that 1000-1 combination so the sql doesn't know how to identify where the id should go, how can i do to obtain those 45 i need?
also i figured out that if i do the general query, there are some rows missing, here is how i print it:
SELECT table_1.id, table_1.column_a, table_1.column_b
FROM table_2 --in this case i try switching the columns i return from table 1 or 2
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_2.column_b = table_1.column_b
the output of the latter example is 2666 rows and should be 2733, what am i doing wrong?
SELECT DISTINCT -- Adding DISTINCT clause for unique pairs of ID and creation_date
ID, tab1.column_a, tab1.column_b, creation_date
FROM [table 1] as tab1
LEFT JOIN [table 2] as tab2 -- OR INNER JOIN
ON tab1.column_a = tab2.column_a
AND tab1.column_b = tab2.column_b
-- WHERE ID IN ('01', '02') -- Filtering by desired ID

Self-Join with Natural Join

What is the difference between
select * from degreeprogram NATURAL JOIN degreeprogram ;
and
select * from degreeprogram d1 NATURAL JOIN degreeprogram d2;
in oracle?
I expected that they return the same result set, however, they do not. The second query does what I expect: it joins the two relations using the same named attributes and so it returns the same tuples as stored in degreeprogram. However, the first query is confusing for me: here, each tuple occurs several times in the result set-> what join condition is used here?
Thank you
NATURAL JOIN means join the two tables based on all columns having the same name in both tables.
I imagine that for each column in your table, Oracle is internally writing a condition like:
degreeprogram.column1 = degreeprogram.column1
(which you would not be able to write yourself due to ORA-00918 column ambiguously defined error)
And then, I imagine, Oracle is optimizing that away to just
degreeprogram.column1 is not null
So, you're not exactly getting a CROSS JOIN of your table with itself -- only a CROSS JOIN of those rows having no null columns.
UPDATE: Since this was the selected answer, I will just add from Thorsten Kettner's answer that this behavior is probably a bug on Oracle's part. In 18c, Oracle behaves properly and returns an ORA-00918 error when you try to NATURAL JOIN a table to itself.
The difference between those two statements is that the second explicitly defines a self join on the table, where the first statement, the optimizer is trying to figure out what you really want. On my database, the first statement performs a cartesian merge join and is not optimized at all, and the second statement has a better explain plan, using a single full table access with index scanning.
I'd call this a bug. This query:
select * from degreeprogram d1 NATURAL JOIN degreeprogram d2;
translates to
select col1, col2, ... -- all columns
from degreeprogram d1
join degreeprogram d2 using (col1, col2, ...)
and gives you all rows from the table where all columns are not null (because using(col) never matches nulls).
This query, however:
select * from degreeprogram NATURAL JOIN degreeprogram;
is invalid according to standard SQL, because every table must have a unique name or alias in a query. Oracle lets this pass, but doing so it should do something still to keep the table instances apart (e.g. create internally an alias for them). It obviously doesn't and multiplies the result with the number of rows in the table. A bug.
A so-called natural join instructs the database to
Find all column names common to both tables (in this case, degreeprogram and degreeprogram, which of course have the same columns.)
Generate a join condition for each pair of matching column names, in the form table1.column1 = table2.column1 (in this case, there will be one for every column in degreeprogram.)
Therefore a query like this
select count(*) from demo natural join demo;
will be transformed into
select count(*) from demo, demo where demo.x = demo.x;
I checked this by creating a table with one column and two rows:
create table demo (x integer);
insert into demo values (1);
insert into demo values (2);
commit;
and then tracing the session:
SQL> alter session set tracefile_identifier='demo_trace';
Session altered.
SQL> alter session set events 'trace [SQL_Compiler.*]';
Session altered.
SQL> select /* nj test */ count(*) from demo natural join demo;
COUNT(*)
----------
4
1 row selected.
SQL> alter session set events 'trace [SQL_Compiler.*] off';
Session altered.
Then in twelve_ora_6196_demo_trace.trc I found this line:
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT COUNT(*) "COUNT(*)" FROM "WILLIAM"."DEMO" "DEMO","WILLIAM"."DEMO" "DEMO" WHERE "DEMO"."X"="DEMO"."X"
and a few lines later:
try to generate single-table filter predicates from ORs for query block SEL$58A6D7F6 (#0)
finally: "DEMO"."X" IS NOT NULL
(This is merely an optimisation on top of the generated query above, as column X is nullable but the join allows the optimiser to infer that only non-null values are required. It doesn't replace the joins.)
Hence the execution plan:
-----------------------------------------+-----------------------------------+
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------+-----------------------------------+
| 0 | SELECT STATEMENT | | | | 7 | |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
| 2 | MERGE JOIN CARTESIAN | | 4 | 52 | 7 | 00:00:01 |
| 3 | TABLE ACCESS FULL | DEMO | 2 | 26 | 3 | 00:00:01 |
| 4 | BUFFER SORT | | 2 | | 4 | 00:00:01 |
| 5 | TABLE ACCESS FULL | DEMO | 2 | | 2 | 00:00:01 |
-----------------------------------------+-----------------------------------+
Query Block Name / Object Alias(identified by operation id):
------------------------------------------------------------
1 - SEL$58A6D7F6
3 - SEL$58A6D7F6 / DEMO_0001#SEL$1
5 - SEL$58A6D7F6 / DEMO_0002#SEL$1
------------------------------------------------------------
Predicate Information:
----------------------
3 - filter("DEMO"."X" IS NOT NULL)
Alternatively, let's see what dbms_utility.expand_sql_text does with it. I'm not quite sure what to make of this given the trace file above, but it shows a similar expansion taking place:
SQL> var result varchar2(1000)
SQL> exec dbms_utility.expand_sql_text('select count(*) from demo natural join demo', :result)
PL/SQL procedure successfully completed.
RESULT
----------------------------------------------------------------------------------------------------------------------------------
SELECT COUNT(*) "COUNT(*)" FROM (SELECT "A2"."X" "X" FROM "WILLIAM"."DEMO" "A3","WILLIAM"."DEMO" "A2" WHERE "A2"."X"="A2"."X") "A1"
Lesson: NATURAL JOIN is evil. Everybody knows this.

Creating dynamic columns from table data

Table data sample:
--------------------------
| key | domain | value |
--------------------------
| a | en | English |
| a | de | Germany |
Query which returns result I need:
select * from
(
select t1.key,
(select value from TABLE where t1.key=key AND code='en') en,
(select value from TABLE where t1.key=key AND code='de') de
from TABLE t1
) as t2
Data returned from query:
---------------------------
| key | en | de |
---------------------------
| a | English | Germany |
I do not want to list all available domains with:
(select value from TABLE where t1.key=key AND code='*') *
Is it possible to make this query more dynamic in Postgres: automatically add all domain columns that exist in the table?
For more than a few domains use crosstab() to make the query shorter and faster.
PostgreSQL Crosstab Query
A completely dynamic query, returning a dynamic number of columns based on data in your table is not possible, because SQL is strictly typed. Whatever you try, you'll end up needing two steps. Step 1: generate the query, step 2: execute it.
Execute a dynamic crosstab query
Or you return something more flexible instead of table columns, like an array or a document type like json. Details:
Dynamic alternative to pivot with CASE and GROUP BY
Refactor a PL/pgSQL function to return the output of various SELECT queries

How do I use variables in a select query?

I have this following select query that uses a scalar function to get full name. I want to eliminate the redundancy by using variable but so far there is no success. My query is
select
a.Id,
a.UserName,
getFullName(a.UserName),
a.CreateTime
from DataTable;
I don't want to retrieve 'a.User' two times. I would prefer if I can save a.User in a variable and then pass it to the function hence improving the efficiency.
Currently the work around I came up with is as following
select
Id,
UserName,
getFullName(UserName),
CreateTime
from (select a.Id, a.UserName, a.CreateTime from DataTable) temp
This solves the performance issue but adds the overhead to write same select two time. Any other suggestions would be great.
DataTable looks like this
+----+----------+------------+
| Id | UserName | CreateTime |
+----+----------+------------+
| 1 | ab | 10:00 |
| 2 | cd | 11:00 |
| 3 | ef | 12:00 |
+----+----------+------------+
Here is the NamesTable used to get the full names
+----------+----------+
| UserName | FullName |
+----------+----------+
| ab | Aa BB |
| cd | Cc Dd |
| ef | Ee Ff |
+----------+----------+
Here is the function that gets the full name
Create function [dbo].[getFullName](#user varchar(150)) returns varchar(500)
as
begin
declare #Result varchar(500);
select #Result = FullName from dbo.NamesTable where UserName = #user;
return #Result;
end;
You're solving a problem that doesn't exist. You seem to think that
select
a.Id,
a.UserName,
getFullName(a.UserName),
a.CreateTime
from DataTable;
Has some relatively expensive process behind it to get UserName that is happening twice. In reality, once the record is located, getting the UserName value is an virtually instant process since it will probably be stored in a "variable" by the SQL engine behind the scenes. You should have little to no performance difference between that query and
select
a.Id,
getFullName(a.UserName),
a.CreateTime
from DataTable;
The scalar function itself may have a performance issue, but it's not because you are "pulling" the UserName value "twice".
A better method would be to join to the other table:
select
a.Id,
a.UserName,
b.FullName,
a.CreateTime
from DataTable a
LEFT JOIN dbo.NamesTable b
ON a.UserName = b.UserName
As D Stanley says, you're trying to solve some problem that doesn't exist. I would further add that you shouldn't be using the function at all. SQL is meant to perform set-based operations. When you use a function like that you're now making it perform the same function over and over again for every row - a horrible practice. Instead, just JOIN in the other table (a set-based operation) and let SQL do what it does best:
SELECT
DT.Id,
DT.UserName,
NT.fullname,
DT.CreateTime
FROM
DataTable DT
INNER JOIN NamesTable NT ON NT.username = DT.username;
Also, DataTable and NamesTable are terrible names for tables. Of course they're tables, so there's no need to put "table" on the end of the name. Further, of course the first one holds "data", it's a database. Your table names should be descriptive. What exactly does DataTable hold?
If you're going to be doing SQL development in the future then I strongly suggest that you read several introductory books on the subject and watch as many tutorial videos as you can find.
Scalar UDF will execute for every row,but not defintely the way you think.below is sample demo and execution plan which proves the same..
create table testid
(
id int,
name varchar(20)
)
insert into testid
select n,'abc'
from numbers
where n<=1000000
create index nci_get on dbo.testid(id,name)
select id,name,dbo.getusername(id) from dbo.testid where id>4
below is the execution plan for above query
Decoding above plan:
Index seek outputs id,name
Then compute scalar tries to calculate new rows from existing row values.in this case expr1003 which is our function
Index seek cost is 97%,compute scalar cost is 3% and as you might be aware index seek is not an operator which goes to table to get data.so hopefully this clears your question

Why is Oracle SQL Optimizer ignoring index predicate for this view?

I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n)
select
o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from
TBL_A o left outer join TBL_A n
on o.id=n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. Then it filters that set using the values from the array. However if I simply give the list of ids as a list of numbers the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal.).Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 8168 | 00:00:01 |
-------------------------------------------------------------------------
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 7 | 00:00:01 |
-------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)