Row-Level Security Predicate Filter - sql

On Oracle 19c.
We have users whose accounts are provisioned by specifying a comma-separated list of department_code values. Each department_code is a string of five alphanumeric [A-Z0-9] characters. This comma-separated list of five-character department_codes is what we call the user's security_string. We use the security_string to limit which rows the user may retrieve from a table, Restricted, by applying the following predicate.
select *
from Restricted R
where :security_string like '%' || R.department_code || '%';
A given department_code can be in Restricted many times and a given user can have many department_codes in their comma-separated value :security_string.
This predicate approach to applying row-level security is inefficient. No index can be used and it requires a full table scan on Restricted.
An alternative is to use dynamic SQL to do something like the following.
execute immediate 'select *
from Restricted R
where R.department_code in(' || udf_quoted_values(:security_string) || ')';
Where udf_quoted_values is a user-defined function (UDF) that wraps each department_code value within the :security_string in single quotes.
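For concreteness, such a UDF might look something like this (a sketch only; the actual implementation isn't shown here, and the sample codes are hypothetical):
create or replace function udf_quoted_values(p_csv in varchar2)
   return varchar2
   deterministic
is
begin
   -- e.g. a hypothetical 'A1B2C,D3E4F' becomes 'A1B2C','D3E4F'
   return '''' || replace(p_csv, ',', ''',''') || '''';
end;
/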
However, this alternative also seems unsatisfactory: it requires a UDF and dynamic SQL, and a full table scan is still likely.
I've considered bit-masking, but the number of bits needed is large, about 60 million (36^5 = 60,466,176), and it would still require a UDF, dynamic SQL, and a full table scan (a function-based index doesn't seem to be a candidate here). Also, bit-masking doesn't make much sense here, as there is no nesting/hierarchy of department_codes.
execute immediate 'select *
from Restricted R
where BITAND(R.department_code_encoded,' || udf_encoded_number(:security_string) || ') > 0';
Where Restricted.department_code_encoded is a numeric encoded value of Restricted.department_code and udf_encoded_number is a user-defined function (UDF) that returns a number encoding the department_codes in the :security_string.
I've considered creating a separate table of just department codes, Department, and joining that to the Restricted table.
select *
from Restricted R
join Department D
on R.department_code = D.department_code
where :security_string like '%' || D.department_code || '%';
We still have the same problems as before, but now they apply to the smaller Department table (Department.department_code is unique, whereas Restricted.department_code is not). This gives a smaller full table scan on Department than on Restricted, but now we have a join.
It is possible for us to change the security_string or add additional user-specific security values when the account is provisioned. We can also change the Oracle objects and queries. Note that the department_codes are not static, but they don't change all that often either.
Any recommendations? Thank you in advance.

Why not convert the string to a table, as suggested here, and then do a join?
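For illustration, a sketch of that idea in Oracle: split :security_string into one row per code with regexp_substr, then join (assuming an index exists on Restricted(department_code) for the optimizer to use):
select R.*
from   Restricted R
join  (select regexp_substr(:security_string, '[^,]+', 1, level) as department_code
       from   dual
       connect by regexp_substr(:security_string, '[^,]+', 1, level) is not null) S
  on   R.department_code = S.department_code;
With the codes exposed as ordinary join values, the optimizer can drive an index range scan per code instead of a full scan on Restricted.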

Related

Check if CSV string column contains desired values

I am new to PostgreSQL and I want to split a string of the following format:
0:1:19
with : as the delimiter. After splitting, I need to check whether the split string contains either 0 or 1 as a whole number, and select only those rows.
For example:
Table A:

Customer | role
---------+-------
A        | 0:1:2
B        | 19
C        | 2:1
I want to select rows which satisfy the criteria of having whole numbers 0 or 1 in role.
Desired Output:
Customer | role
---------+-------
A        | 0:1:2
C        | 2:1
Convert to an array, and use the overlap operator &&:
SELECT *
FROM tbl
WHERE string_to_array(role, ':') && '{0,1}'::text[];
To make this fast, you could support it with a GIN index on the same expression:
CREATE INDEX ON tbl USING GIN (string_to_array(role, ':'));
See:
Can PostgreSQL index array columns?
Check if value exists in Postgres array
Alternatively consider a proper one-to-many relational design, or at least an actual array column instead of the string. Would make index and query cheaper.
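For illustration, a minimal sketch of that one-to-many design (the table and column names are hypothetical):
CREATE TABLE customer_role (
  customer text NOT NULL,
  role     int  NOT NULL,
  PRIMARY KEY (customer, role)
);

-- customers having role 0 or 1; a plain b-tree index on (role) could serve this
SELECT DISTINCT customer
FROM   customer_role
WHERE  role IN (0, 1);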
We can use LIKE here:
SELECT Customer, role
FROM TableA
WHERE ':' || role || ':' LIKE '%:0:%' OR ':' || role || ':' LIKE '%:1:%';
But you should generally avoid storing CSV in your SQL tables if your design would allow for that.

PostgreSQL: Pattern matching only whole words

I have a "queries" table that holds hundreds of SQL queries and I am trying to filter out queries that can only be executed on the DB I am using. Because some of these queries refer to tables that exists only in another DB, so only a fraction of them can be executed successfully.
My query so far looks like this:
SELECT rr.name AS query_name,
       (SELECT string_agg(it.table_name::character varying, ', ' ORDER BY it.table_name)
        FROM   information_schema.tables it
        WHERE  rr.config ->> 'queries' SIMILAR TO ('%' || it.table_name || '%')
       ) AS related_tables
FROM queries rr
and it does work fine, except that the pattern I provided is not the best for filtering out edge cases.
Let's say that I have a table called "customers_archived" in the old DB that does not exist in the new one, and a table called "customers" that exists in both the old and the new DB.
Now, with the query I wrote, the engine thinks, "Well, I have a table called customers, so any query that includes the word customers must be valid" -- but the engine is wrong, because it also picks up queries that use the "customers_archived" table, which does not exist in that DB.
So I tried to match whole words only, but I could not get it to work because the \ character won't work in PGSQL as far as I can tell. How can I get this query to do what I am trying to achieve?
There is no totally reliable way of finding the tables referenced by a query short of building a full PostgreSQL SQL parser. For starters, the name could occur in a string literal, or the query could be
DO $$BEGIN EXECUTE 'SELECT * FROM my' || 'table'; END;$$;
But I think you would be better off if you make sure that there are non-word characters around your name in the match (note the parentheses around the concatenation: ~ and || share the same operator precedence in PostgreSQL, so the pattern must be grouped explicitly):
WHERE rr.config ->> 'queries' ~ ('\y' || it.table_name || '\y')
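A quick sanity check of the word-boundary behaviour, with hypothetical literals (the underscore counts as a word character, so "customers" no longer matches inside "customers_archived"):
SELECT 'SELECT * FROM customers'          ~ ('\y' || 'customers' || '\y') AS plain_match,    -- true
       'SELECT * FROM customers_archived' ~ ('\y' || 'customers' || '\y') AS archived_match; -- false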

How to get unique values from each column based on a condition?

I have been trying to find an optimal solution to select unique values from each column. My problem is that I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, which I can do with the query below:
select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%'
Sample output for column names:
c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...
Then I would use the returned column names to get the unique/distinct values in each column, and not just distinct rows.
I know the simplest (and lousy) way is to write select distinct column_name from table where column_name = 'something' for every single column (around 20-50 times), and it's very time-consuming too. Since I can't use more than one distinct per column_name, I am stuck with this old-school solution.
I am sure there must be a faster and more elegant way to achieve this; I just couldn't figure out how. I would really appreciate any help on this.
You can't just return rows, since distinct values don't go together any more.
You could return arrays, which can be had more simply than you might expect:
SELECT array_agg(DISTINCT c1) AS c1_arr
,array_agg(DISTINCT c2a) AS c2a_arr
,array_agg(DISTINCT c2b) AS c2b_arr
, ...
FROM m0301010000_ds;
This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
Build SQL automatically
CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %1$I) AS %1$I_arr'
, attname)
, E'\n ,' ORDER BY attnum)
|| E'\nFROM ' || _tbl
FROM pg_attribute
WHERE attrelid = _tbl -- valid, visible table name
AND attnum >= 1 -- exclude tableoid & friends
AND NOT attisdropped -- exclude dropped columns
$func$ LANGUAGE sql;
Call:
SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');
Returns an SQL string as displayed above.
I use the system catalog pg_attribute instead of the information schema. And the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table
If you need this in "real time", you won't be able to achieve it with SQL that needs to do a full table scan.
I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from @Erwin Brandstetter ;) and maintain it using a trigger on the original table.
Your new table will have one column per field; the number of rows will equal the maximum number of distinct values for any one field.
On insert: for each maintained field, check whether the value is already there; if not, add it.
On update: for each maintained field whose old value differs from the new value, check whether the new value is already there; if not, add it. For the old value, check whether any other row still has it; if not, remove it from the list (set the field to null).
On delete: for each maintained field, check whether any other row still has the value; if not, remove it from the list (set the field to null).
This way the load is mainly moved to the trigger, and SQL against the value-list table will be super fast.
P.S.: Make sure to run all the SQL in your trigger through EXPLAIN to confirm it uses the best possible index and execution plan. For update/delete, just check whether the old value exists (limit 1). A minimal sketch of the insert case for a single maintained field follows.
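Assuming a value-list table dist_vals with one column per maintained field (the names here are hypothetical; the update/delete cases would follow the same pattern):
CREATE OR REPLACE FUNCTION trg_maintain_dist_vals()
  RETURNS trigger AS
$$
BEGIN
   -- add the new value to the list table only if it is not there yet
   INSERT INTO dist_vals (c1)
   SELECT NEW.c1
   WHERE  NOT EXISTS (SELECT 1 FROM dist_vals WHERE c1 = NEW.c1);
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER maintain_dist_vals
AFTER INSERT ON m0301010000_ds
FOR EACH ROW EXECUTE PROCEDURE trg_maintain_dist_vals();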

Suggestions for Querying Database for Names

I have an Oracle database that, like many, has a table containing biographical information, on which I would like to search by name in a "natural" way.
The table has forename and surname fields and, currently, I am using something like this:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
This works, but it can be very slow because the indices on this table obviously can't account for the preceding wildcard. Also, users will usually be searching for people based on what they tell them over the phone -- including a huge number of non-English names -- so it would be nice to also do some phonetic analysis.
As such, I have been experimenting with Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
This has the advantage of using the Soundex algorithm as well as full text indices, so it should be a little more efficient. (Although, my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:
Firstly, the text indexes need to be refreshed in some meaningful way. Using on commit would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database; so this requires some thought...
Secondly, the results returned by Oracle aren't exactly naturally sorted; I'm not really sure about this score function. For example, my development data shows "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith".
I'm thinking that removing the preceding wildcard might improve performance without degrading the results since, in real life, you would never search for a chunk in the middle of a name. Otherwise, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach to what I'm doing/considering now?
Thanks :)
I have come up with a solution which works pretty well, following the suggestions in the comments -- particularly @X-Zero's suggestion of creating a table of Soundexes. (In my case, I can create new tables, but altering the existing schema is not allowed!)
So, my process is as follows:
Create a new table with columns ID, token, sound and position, with the primary key over (ID, sound, position) and an additional index over (ID, sound).
Go through each person in the biographical table:
Concatenate their forename and surname.
Change the codepage to us7ascii, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.
Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.
Like so:
declare
   nameString varchar2(82);
   token      varchar2(40);
   posn       integer;
   cursor myNames is
      select id,
             forename||' '||surname person_name
      from   mypeople;
begin
   for person in myNames
   loop
      -- strip non-alphabetic characters, collapse whitespace, then
      -- normalise accented characters to us7ascii so Soundex can cope
      nameString := trim(
                       utl_i18n.escape_reference(
                          regexp_replace(
                             regexp_replace(person.person_name,'[^[:alpha:]]',' '),
                             '\s+',' '),
                          'us7ascii')
                    )||' ';
      posn := 1;
      -- walk the space-delimited tokens, storing each token with its
      -- Soundex and position against the person's ID
      while nameString is not null
      loop
         token      := substr(nameString,1,instr(nameString,' ') - 1);
         insert into personsearch values (person.id,lower(token),soundex(token),posn);
         nameString := substr(nameString,instr(nameString,' ') + 1);
         posn       := posn + 1;
      end loop;
   end loop;
end;
/
So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into personsearch along with their ID.
To search, we do the same process: tokenise the search criteria and then return results where the Soundexes and relative positions match. We order by the position and then the Levenshtein distance (ld) from the original search for each token, in turn.
This query, for example, will search against two tokens (i.e., pre-tokenised search string):
with searchcriteria as (
   select 'john'  token1,
          'smith' token2
   from   dual)
select alpha.id,
       mypeople.forename||' '||mypeople.surname
from   personsearch alpha
join   mypeople
  on   mypeople.id = alpha.id
join   personsearch beta
  on   beta.id = alpha.id
 and   beta.position > alpha.position
join   searchcriteria
  on   1 = 1
where  alpha.sound = soundex(searchcriteria.token1)
and    beta.sound = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token,searchcriteria.token1),
         beta.position,
         ld(beta.token,searchcriteria.token2),
         alpha.id;
To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in the joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :)
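For what it's worth, a rough, untested sketch of how such a builder might assemble the SQL (reusing the names above; sys.odcivarchar2list is just a convenient built-in collection type, and the returned string would be opened as a ref cursor with each token bound to :tok1, :tok2, ...):
create or replace function build_name_search(tokens sys.odcivarchar2list)
   return varchar2
is
   stmt varchar2(4000);
begin
   stmt := 'select t1.id from personsearch t1';
   -- one extra self-join per additional token, each at a later position
   for i in 2 .. tokens.count loop
      stmt := stmt || ' join personsearch t' || i
                   || ' on t' || i || '.id = t1.id'
                   || ' and t' || i || '.position > t' || (i - 1) || '.position';
   end loop;
   -- match each token's Soundex via bind variables rather than literals
   for i in 1 .. tokens.count loop
      stmt := stmt || case when i = 1 then ' where ' else ' and ' end
                   || 't' || i || '.sound = soundex(:tok' || i || ')';
   end loop;
   return stmt;
end;
/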
As I say, this works pretty well: it returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s, returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:
The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem, however, is that a trigger needs to be put on the mypeople table to insert/update/delete tokens in the search table whenever the corresponding operation is performed on mypeople. This may slow down the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis.
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...
Anyway, hope that helps :) Comments welcome!
EDIT My full solution (write up and implementation) is now here, using Metaphone and the Damerau-Levenshtein Distance.

How can I run a query on IDs in a string?

I have a table A with this column:
IDS(VARCHAR)
1|56|23
I need to run this query:
select TEST from TEXTS where ID in ( select IDS from A where A.ID = xxx )
TEXTS.ID is an INTEGER. How can I split the string A.IDS into several ints for the join?
Must work on MySQL and Oracle. SQL99 preferred.
First of all, you should not store data like this in a column. You should split that out into a separate table, then you would have a normal join, and not this problem.
Having said that, what you have to do is the following:
Convert the number to a string
Pad it with the | (your separator) character, before it, and after it (I'll tell you why below)
Pad the text you're looking in with the same separator, before and after
Do a LIKE on it
This will run slow!
Here's the SQL that does what you want (assuming all the operators and functions work in your SQL dialect; you don't say exactly which database engine this is):
SELECT
TEXT -- assuming this was misspelt?
FROM
TEXTS -- and this as well?
JOIN A ON
'|' + A.IDS + '|' LIKE '%|' + CONVERT(TEXTS.ID) + '|%'
The reason why you need to pad the two with the separator before and after is this: what if you're looking for the number 5? You need to ensure it wouldn't accidentally match 56, just because 56 contains the digit 5.
Basically, we will do this:
... '|1|56|23|' LIKE '%|56|%'
If there is ever only going to be 1 row in A, it might run faster if you do this (but I am not sure; you would need to measure it):
SELECT
TEXT -- assuming this was misspelt?
FROM
TEXTS -- and this as well?
WHERE
(SELECT '|' + IDS + '|' FROM A) LIKE '%|' + CONVERT(TEXTS.ID) + '|%'
If there are many rows in your TEXTS table, it will be worth the effort to add code that generates the appropriate SQL: first retrieve the values from the A table, construct an appropriate SQL statement with IN, and use that instead:
SELECT
TEXT -- assuming this was misspelt?
FROM
TEXTS -- and this as well?
WHERE
ID IN (1, 56, 23)
This will run much faster since now it can use an index on this query.
If you had A.ID as a column, and the values as separate rows, here's how you would do the query:
SELECT
TEXT -- assuming this was misspelt?
FROM
TEXTS -- and this as well?
INNER JOIN A ON TEXTS.ID = A.ID
This will run slightly slower than the previous one, but the previous one has the overhead of first retrieving A.IDS, building the query, and risking a new execution plan that has to be compiled.