Select a portion of a comma delimited string in DB2/DB2400 - sql

I need to select a value within a comma delimited string using only SQL. Is this possible?
Data
A | B     | C
1 | Luigi | Apple,Banana,Pineapple,,Citrus
I need to select specifically the 2nd item in column C, in this case 'Banana'. I cannot create new SQL functions; I can only use plain SQL. This is the AS400, so the SQL is somewhat old tech.
Update:
With help from @Sandeep we were able to come up with:
SELECT xmlcast(
         xmlquery('$x/Names/Name[2]'
           passing xmlparse(document
             CONCAT(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><Names><Name>',
                           REPLACE(ODWDATA,',','</Name><Name>')),
                    '</Name></Names>')) as "x")
         as varchar(1000))
FROM ACL00
I'm getting this error:
Keyword PASSING not expected. Valid tokens: ) ,.
New update: problem solved by creating a UDF equivalent of Oracle's INSTR (see my answer below).

I'm assuming DB2, which I don't use, so the following syntax may not be bang on, but the approach works. In Oracle I'd use INSTR() and SUBSTR(); Google suggests LOCATE() and SUBSTR() for DB2.
Use LOCATE to get the position of the first comma, and use that value in SUBSTR to grab the end of YourColumn starting after the first comma (note that DB2's LOCATE takes the search string first):
SUBSTR(YourColumn, LOCATE(',', YourColumn) + 1)
You started with "Apple,Banana,Pineapple,,Citrus" and you should now have "Banana,Pineapple,,Citrus", so we use LOCATE and SUBSTR again on the string returned above:
SUBSTR(SUBSTR(YourColumn, LOCATE(',', YourColumn) + 1), 1, LOCATE(',', SUBSTR(YourColumn, LOCATE(',', YourColumn) + 1)) - 1)
The first SUBSTR gets the right-hand side of the string, so we only need a start-position parameter; the second SUBSTR grabs the left side of the string, so we need two: the start position and the length to return.
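Putting that together for the question's data (a sketch assuming the table ACL00 and column C from the question):

SELECT SUBSTR(SUBSTR(C, LOCATE(',', C) + 1), 1,
              LOCATE(',', SUBSTR(C, LOCATE(',', C) + 1)) - 1) AS second_item
FROM ACL00
-- 'Apple,Banana,Pineapple,,Citrus' -> 'Banana'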

If you want the 2nd item only, then you can use the substring function:
DECLARE @TABLE TABLE
(
A INT,
B VARCHAR(100),
C VARCHAR(100)
)
DECLARE @NTH INT = 3
INSERT INTO @TABLE VALUES (1,'Luigi','Apple,Banana,Pineapple,,Citrus')
SELECT REPLACE(REPLACE(CAST(CAST('<Name>' + REPLACE(C,',','</Name><Name>') + '</Name>' AS XML).query('/Name[sql:variable("@NTH")]') AS VARCHAR(1000)),'<Name>',''),'</Name>','') FROM @TABLE

I am answering my own question now. It is impossible to do this with the built-in functions on the AS400.
You have to create a UDF equivalent of Oracle's INSTR.
Enter this within STRSQL; it will create a new function called INSTRB:
CREATE FUNCTION INSTRB (C1 VarChar(4000), C2 VarChar(4000), N integer, M integer)
RETURNS Integer
SPECIFIC INSTRBOracleBase
LANGUAGE SQL
CONTAINS SQL
NO EXTERNAL ACTION
DETERMINISTIC
BEGIN ATOMIC
  DECLARE Pos, R, C2L Integer;
  SET C2L = LENGTH(C2);
  IF N > 0 THEN
    -- Positive N: scan forward from position N for the Mth occurrence of C2 in C1
    SET (Pos, R) = (N, 0);
    WHILE R < M AND Pos > 0 DO
      SET Pos = LOCATE(C2, C1, Pos);
      IF Pos > 0 THEN
        SET (Pos, R) = (Pos + 1, R + 1);
      END IF;
    END WHILE;
    -- returns 0 if fewer than M occurrences were found
    RETURN (Pos - 1)*(1 - SIGN(M - R));
  ELSE
    -- Negative N: scan backward from the end of C1
    SET (Pos, R) = (LENGTH(C1) + N, 0);
    WHILE R < M AND Pos > 0 DO
      IF SUBSTR(C1, Pos, C2L) = C2 THEN
        SET R = R + 1;
      END IF;
      SET Pos = Pos - 1;
    END WHILE;
    RETURN (Pos + 1)*(1 - SIGN(M - R));
  END IF;
END
Then, to select the nth delimited value within a comma-delimited string (in this case the 14th), use this query utilizing the new function:
SELECT SUBSTRING(C,INSTRB(C,',',1,13)+1,INSTRB(C,',',1,14)-INSTRB(C,',',1,13)-1) FROM TABLE
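For the original example (the 2nd item of column C), the same pattern gives (a sketch against the question's ACL00 table):

SELECT SUBSTRING(C, INSTRB(C,',',1,1)+1, INSTRB(C,',',1,2)-INSTRB(C,',',1,1)-1) AS second_item
FROM ACL00
-- 'Apple,Banana,Pineapple,,Citrus' -> 'Banana'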

A much prettier solution, IMO, would be to encapsulate a recursive common table expression (recursive CTE, aka RCTE) over the data from column C to generate a result TABLE [i.e. a user-defined table function (a table UDF, aka UDTF)], then use a scalar subselect to choose the effective record/row number.
select
a
, b
, ( select S.token_vc
from table( split_tokens(c) ) as S
where S.token_nbr = 2
) as "2nd Item of column C"
from The_Table /* in OP described with columns a,b,c but no DDL */
Yet prettier would be to make the result of that same RCTE a scalar value, allowing it to be invoked simply as a scalar UDF with the effective row number [as another argument] defining which element to select.
select
a
, b
, split_tokens(c, 2) as "2nd Item of column C"
from The_Table /* in OP described with columns a,b,c but no DDL */
The latter could be more efficient, limiting the row data produced by the RCTE to only the desired numbered token and the numbered tokens preceding it. I cannot comment on the efficiency with regard to CPU and storage impacts as contrasted with the other answers offered, but my own experience with the temporary-storage implementation and the overall quickness of RCTE results has been positive, especially when other row selection limits the number of derived-table results that must be produced for the overall query request.
The UDF [and/or UDTF and the RCTE that implements them] is left as an exercise for the reader; mostly because I do not have a system on a release that supports recursive table expressions. If asked [e.g. in a comment to this answer], I could provide untested source code.
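In that spirit, here is a hedged, untested sketch of what such a split_tokens UDTF might look like (it assumes a DB2 for i release that allows a recursive CTE inside an SQL table function; the column names match the queries above):

CREATE FUNCTION split_tokens (input_vc VARCHAR(1000))
RETURNS TABLE (token_nbr INTEGER, token_vc VARCHAR(1000))
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
RETURN
WITH rcte (token_nbr, token_vc, rest) AS (
  -- first token: everything before the first comma (or the whole string)
  SELECT 1,
         CASE WHEN LOCATE(',', input_vc) = 0 THEN input_vc
              ELSE SUBSTR(input_vc, 1, LOCATE(',', input_vc) - 1) END,
         CASE WHEN LOCATE(',', input_vc) = 0 THEN CAST('' AS VARCHAR(1000))
              ELSE SUBSTR(input_vc, LOCATE(',', input_vc) + 1) END
  FROM SYSIBM.SYSDUMMY1
  UNION ALL
  -- peel one token off the remainder per recursion step
  SELECT token_nbr + 1,
         CASE WHEN LOCATE(',', rest) = 0 THEN rest
              ELSE SUBSTR(rest, 1, LOCATE(',', rest) - 1) END,
         CASE WHEN LOCATE(',', rest) = 0 THEN CAST('' AS VARCHAR(1000))
              ELSE SUBSTR(rest, LOCATE(',', rest) + 1) END
  FROM rcte
  WHERE rest <> ''
)
SELECT token_nbr, token_vc FROM rcte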

I have found the locate_in_string function to work very well in this case.
select substr(
         c,
         locate_in_string(c, ',') + 1,
         locate_in_string(c, ',', locate_in_string(c, ',') + 1) - locate_in_string(c, ',') - 1
       ) as fruit2
from ACL00 for read only with ur;


Can we sort the characters of a string in SQL oracle? [duplicate]

I'm looking for a function that would sort the characters of a varchar2 alphabetically.
Is there something built into Oracle that I can use, or do I need to create a custom function in PL/SQL?
From an answer at http://forums.oracle.com/forums/thread.jspa?messageID=1791550 this might work, but I don't have 10g to test on...
SELECT MIN(permutations)
FROM (SELECT REPLACE (SYS_CONNECT_BY_PATH (n, ','), ',') permutations
FROM (SELECT LEVEL l, SUBSTR ('&col', LEVEL, 1) n
FROM DUAL
CONNECT BY LEVEL <= LENGTH ('&col')) yourtable
CONNECT BY NOCYCLE l != PRIOR l)
WHERE LENGTH (permutations) = LENGTH ('&col')
In the example col is defined in SQL*Plus, but if you make this a function you can pass it in, or could rework it to take a table column directly I suppose.
I'd take that as a start point rather than a solution; the original question was about anagrams so it's designed to find all permutations, so something similar but simplified might be possible. I suspect this doesn't scale very well for large values.
So eventually I went the PL/SQL route, because after searching for some time I realized there is no built-in function that I can use.
Here is what I came up with. It's based on a feature of associative arrays: Oracle keeps the keys in sorted order.
create or replace function sort_chars(p_string in varchar2) return varchar deterministic
as
  rv     varchar2(4000);
  ch     varchar2(1);
  type vcArray is table of varchar(4000) index by varchar2(1);
  sorted vcArray;
  key    varchar2(1);
begin
  for i in 1 .. length(p_string)
  loop
    ch := substr(p_string, i, 1);
    if (sorted.exists(ch)) then
      sorted(ch) := sorted(ch) || ch;
    else
      sorted(ch) := ch;
    end if;
  end loop;

  rv := '';
  key := sorted.FIRST;
  WHILE key IS NOT NULL LOOP
    rv := rv || sorted(key);
    key := sorted.NEXT(key);
  END LOOP;
  return rv;
end;
Simple performance test:
set timing on;
create table test_sort_fn as
select t1.object_name || rownum as test from user_objects t1, user_objects t2;
select count(distinct test) from test_sort_fn;
select count (*) from (select sort_chars(test) from test_sort_fn);
Table created.
Elapsed: 00:00:01.32
COUNT(DISTINCTTEST)
-------------------
384400
1 row selected.
Elapsed: 00:00:00.57
COUNT(*)
----------
384400
1 row selected.
Elapsed: 00:00:00.06
You could use the following query:
select listagg(letter)
within group (order by UPPER(letter), ASCII(letter) DESC)
from
(
select regexp_substr('gfedcbaGFEDCBA', '.', level) as letter from dual
connect by regexp_substr('gfedcbaGFEDCBA', '.', level) is not null
);
The subquery splits the string into records (single character each) using regexp_substr, and the outer query merges the records into one string using listagg, after sorting them.
Here you should be careful, because alphabetical sorting depends on your database configuration, as Cine pointed out.
In the above example the letters are sorted ascending "alphabetically" and descending by ascii code, which - in my case - results in "aAbBcCdDeEfFgG".
The result in your case may be different.
You may also sort the letters using nlssort; it gives you better control of the sort order and makes you independent of your database configuration.
select listagg(letter)
within group (order by nlssort(letter, 'nls_sort=german'))
from
(
select regexp_substr('gfedcbaGFEDCBA', '.', level) as letter from dual
connect by regexp_substr('gfedcbaGFEDCBA', '.', level) is not null
);
The query above would give you also "aAbBcCdDeEfFgG", but if you changed "german" to "spanish", you would get "AaBbCcDdEeFfGg" instead.
You should remember that there is no common agreement what "alphabetically" means. It all depends on which country it is, and who is looking at your data and what context it is in.
For instance in DK, there are a large number of different sortings of a,aa,b,c,æ,ø,å
per the alphabet: a,aa,b,c,æ,ø,å
for some dictionary: a,aa,å,b,c,æ,ø
for other dictionaries: a,b,c,æ,ø,aa,å
per Microsoft standard: a,b,c,æ,ø,aa,å
Check out http://www.siao2.com/2006/04/27/584439.aspx for more info; it also happens to be a great blog for issues like these.
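For instance, the listagg approach from the earlier answer can be pointed at a specific locale via nlssort (an untested sketch; DANISH is a valid Oracle NLS_SORT value, but which of the orders above it yields depends on your Oracle version's NLS data):

select listagg(letter)
within group (order by nlssort(letter, 'nls_sort=danish'))
from
(
select regexp_substr('baæcaåø', '.', level) as letter from dual
connect by regexp_substr('baæcaåø', '.', level) is not null
);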
Assuming you don't mind having the characters returned 1 per row:
select substr(str, r, 1) X from (
select 'CAB' str,
rownum r
from dual connect by level <= 4000
) where r <= length(str) order by X;
X
=
A
B
C
For people using Oracle 10g, select listagg within group won't work. The accepted answer does work, but it generates every possible permutation of the input string, which results in terrible performance; my Oracle database struggles with an input string only 10 characters long.
Here is another alternative that works on Oracle 10g. It's similar to Jeffrey Kemp's answer, only the result isn't split into rows:
select replace(wm_concat(ch), ',', '') from (
select substr('CAB', level, 1) ch from dual
connect by level <= length('CAB')
order by ch
);
-- output: 'ABC'
wm_concat simply concatenates records from different rows into a single string, using commas as separator (that's why we are also doing a replace later).
Please note that, if your input string had commas, they will be lost. Also, wm_concat is an undocumented feature, and according to this answer it has been removed in Oracle 12c. Use it only if you're stuck with 10g and don't have a better option (such as listagg, if you can use 11g instead).
From Oracle 12, you can use:
SELECT *
FROM table_name t
CROSS JOIN LATERAL (
SELECT LISTAGG(SUBSTR(t.value, LEVEL, 1), NULL) WITHIN GROUP (
ORDER BY SUBSTR(t.value, LEVEL, 1)
) AS ordered_value
FROM DUAL
CONNECT BY LEVEL <= LENGTH(t.value)
)
Which, for the sample data:
CREATE TABLE table_name (value) AS
SELECT 'ZYX' FROM DUAL UNION ALL
SELECT 'HELLO world' FROM DUAL UNION ALL
SELECT 'aæøåbcæøåaæøå' FROM DUAL;
Outputs:
VALUE         | ORDERED_VALUE
--------------|--------------
ZYX           | XYZ
HELLO world   |  EHLLOdlorw
aæøåbcæøåaæøå | aabcåååæææøøø
If you want to change how the letters are sorted then use NLSSORT:
SELECT *
FROM table_name t
CROSS JOIN LATERAL (
SELECT LISTAGG(SUBSTR(t.value, LEVEL, 1), NULL) WITHIN GROUP (
ORDER BY NLSSORT(SUBSTR(t.value, LEVEL, 1), 'NLS_SORT = BINARY_AI')
) AS ordered_value
FROM DUAL
CONNECT BY LEVEL <= LENGTH(t.value)
)
Outputs:
VALUE         | ORDERED_VALUE
--------------|--------------
ZYX           | XYZ
HELLO world   |  dEHLLlOorw
aæøåbcæøåaæøå | aaåååbcøøøæææ

Implement find/find next algorithm

I have a database table (mysql/pgsql) with the following format:
id|text
1| the cat is black
2| a cat is a cat
3| a dog
I need to select the line that contains nth match of a word:
eg: "Select the 3rd match for the word cat, that is the number 2 entry."
Results: the 2nd row from the result where the 3rd word is cat
The only solution I could find is to search for all entries that contain the text "cat", load them in memory, and find the match by counting. But this is not efficient for a big number of matches (>1 million).
How would you handle this in an efficient way? Is there anything you can do directly in the database? Maybe using other technologies like lucene?
Update: having 1 million strings in memory might not be a big issue but the expectation of the application is to have between 1k-50k active users that might do this operation concurrently.
Consider creating another table with the below structure
Table : index_table
columns :
index_id , word, occurrence, id(foreign key to your original table)
Do a one-time indexing process as follows: iterate over each entry in your original table, split its text into words, and for each word look it up in the new table. If the word is not present, insert a new entry with occurrence set to 1; if it exists, insert a new entry with occurrence = existing occurrence + 1.
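A hypothetical shape for that table (illustrative DDL, not from the original answer):

CREATE TABLE index_table (
  index_id   INTEGER PRIMARY KEY,  -- surrogate key
  word       VARCHAR(100),         -- one word from the text
  occurrence INTEGER,              -- running nth occurrence of this word across all rows
  id         INTEGER               -- foreign key to the original table's id
);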
Once you have done this one-off indexing, your selects become pretty simple.
For example, the 3rd match for "cat" will be:
SELECT *
FROM original_table o, index_table idx
WHERE idx.word = 'cat'
AND idx.occurrence = 3
AND o.id = idx.id
You do not need Lucene for this job. Furthermore, if you have a large number of positive matches, the effort to pump all required data out of your DB will well exceed the computational cost.
Here's a simple solution:
Index: we require two properties:
efficiently access the words for each id
efficiently access all IDs in ascending order
as follows:
create index i_words on example_data (id, string_to_array(txt, ' '));
Query: find the ID associated with the nth match with the following query:
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = :w -- :w = 'cat'
offset :n - 1 -- :n = 3
limit 1;
Executes in 2ms on 1 million rows.
Here's the full PostgreSQL setup if you'd rather try for yourself than take my word for it:
drop table if exists example_data;
create table example_data (
id integer primary key,
txt text not null
);
insert into example_data
(select generate_series(1, 1000000, 3) as id, 'the cat is black' as txt
union all
select generate_series(2, 1000000, 3), 'a cat is a cat'
union all
select generate_series(3, 1000000, 3), 'a dog'
order by id);
commit;
drop index if exists i_words;
create index i_words on example_data (id, string_to_array(txt, ' '));
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
select
id, word
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
Note that I'm still unsure what exactly "Select the 3rd match for the word cat, that is the number 2 entry" is supposed to mean.
Possible meanings:
the 2nd row from the result where the 3rd word is cat
the 3rd row where the 2nd word is "cat"
from all rows where "cat" appears at least 3 times, take the second row
from all rows where "cat" appears at least 2 times, take the third row
If it's 1 or 2, I think this could be done at acceptable speed by using a trigram index to reduce the number of possibly matching lines. A trigram index (supplied by the module pg_trgm) allows Postgres to use an index for a condition such as like '%cat%'.
Assuming that only a small number of rows will satisfy that condition, the resulting lines can then be split into arrays and checked for the nth word.
Something like this:
with matching_rows as (
select id, line,
row_number() over (order by id) as rn
from the_table
where line like '%cat%' -- this hopefully reduces the result to only very few rows
)
select *
from matching_rows
where rn = 3 --<< "the third match for the word cat"
and (string_to_array(line, ' '))[2] = 'cat' -- "the second word is cat"
Note that a trigram index does have disadvantages as well. Maintaining such an index is much more expensive (=slower) than maintaining a regular b-tree index. So if your table is heavily updated, this might not be a good solution - but you need to test that for yourself.
Also, if the condition like '%cat%' doesn't really reduce the number of rows substantially, this is probably not going to perform well either.
Some more information on trigram indexes:
http://www.depesz.com/index.php/2011/02/19/waiting-for-9-1-faster-likeilike/
http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html
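For reference, creating such a trigram index might look like this (a sketch assuming the pg_trgm extension is available; the table and column names follow the query above):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX the_table_line_trgm_idx ON the_table USING gin (line gin_trgm_ops);
-- with this in place, the planner can answer "line like '%cat%'" with a bitmap index scan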
Another option would be to filter out the "relevant" rows using Postgres' full text search instead of a plain LIKE condition.
Whatever algorithm you come up with for the database as it is will likely be slow for this kind of data. You do need an efficient text-based search; lucene-based solutions like Solr or Elasticsearch will do nicely here. It would be the best option, though finding a match against the 3rd token in a string is not something I know how to build without further googling.
You can also write a job in your db which builds a reverse map, string -> id, like this:
rownum, id, text
1 1 the cat is black
2 3 nice cat
to
key, rownum, id
1_the 1 1
2_cat 1 1
3_is 1 1
4_black 1 1
1_nice 2 3
2_cat 2 3
If you can order by ID you don't need rownum. You should also call the column something other than rownum; I leave it like that for clarity.
Now you can search for the 1st ID where the word "cat" is the 2nd word, by searching:
SELECT id FROM your_map_table WHERE rownum = 1 AND key = '2_cat'
Provided you created an (id, key) or (key, id) index, your searches should be pretty quick.
If you can fit all that data into memory, then you can use a simple Map<MyKey, Long> to do your search. MyKey would be, more or less Pair<Long,String> with proper equals and hashCode (and/or Comparable, if you use TreeMap) implementations.
(Thanks to Daniel Grosskopf for pointing out that I initially misinterpreted the question.)
This query will give you what you want with just SQL. It gets a running total of the counts of the occurrences of a word (e.g. 'cat') within the text, and then it returns the first row that hits the threshold that you want (e.g. 3).
SELECT id, text
FROM (SELECT entries.*,
SUM((SELECT COUNT(*)
FROM regexp_split_to_table(text, E'\\s+') AS words(word)
WHERE word = 'cat')) OVER (ORDER BY id) AS running_count
FROM entries) AS entries_with_running_count
WHERE running_count >= 3
LIMIT 1
See it in action in SQL Fiddle
How would you handle this in an efficient way? Is there any trick you
can do directly in the database?
You are not specifying what other restrictions/requirements you may have, or what your definition of a big number of matches is.
As a general answer I would say that doing string manipulation in the database is not an efficient approach.
It is too slow and imposes much work on your DB which is usually a shared resource.
IMO you should do this programmatically.
A way to do this could be to keep metadata in another table, i.e. indexes of the rows that contain the text "cat" and where in the sentence it appears.
You can query this meta-table in order to figure out which rows to query from your main table.
This extra table is more efficient than searching your main table, because queries with LIKE on suffixes cannot use an index; you would end up with serial scans, which means very slow performance.
Solution for the Postgres database:
Add a new column to your table:
alter table my_table add text_as_array text[];
This column will contain the sentence spliced into words:
"the cat is black" -> ["the","cat","is","black"]
Populate this column with values from current records:
update my_table set text_as_array = string_to_array(text,' ');
(and don't forget to set its value to string_to_array(text,' ') when inserting new records)
Create a gin index on it:
create index my_table_text_as_array_index on my_table using gin (text_as_array);
analyze my_table;
Then all you need is run a fast query as simple as this:
select *
from my_table
where text_as_array #> ARRAY['cat']
and text_as_array[3] = 'cat' -- third word in sentence
order by id
limit 1
offset 2 -- second occurrence
It took 11ms to search over ~2,400,000 records in tests I did on my machine.
Explain:
Limit (cost=11252.08..11252.08 rows=1 width=104)
-> Sort (cost=11252.07..11252.12 rows=19 width=104)
Sort Key: id
-> Bitmap Heap Scan on my_table (cost=48.21..11251.83 rows=19 width=104)
Recheck Cond: (text_as_array #> '{cat}'::text[])
Filter: (text_as_array[3] = 'cat'::text)
-> Bitmap Index Scan on my_table_text_as_array_index (cost=0.00..48.20 rows=3761 width=0)
Index Cond: (text_as_array #> '{cat}'::text[])
A "directly in the database" solution seems preferable from an efficiency standpoint as most types of abstraction layer or loading/processing elsewhere are likely to incur additional overheads.
If the source text can be massaged such that only spaces separate the words (as mentioned in the comments - perhaps by pre-processing to suitably replace all non-alphabetical characters?), the following (My)SQL-only solution will work:
#############################################################
SET @searchWord = 'cat',  # Search word: Must be lower case  #
    @n = 1,               # n where nth match is to be found #
#############################################################
    @matches = 0;         # Initialise local variable

SELECT s.*
FROM sentence s
WHERE id =
  (SELECT subq.id
   FROM
     (SELECT *,
             @matches AS prevMatches,
             (@matches := @matches + LENGTH(`text`) - LENGTH(
                 REPLACE(LOWER(`text`),
                         CONCAT(' ', @searchWord, ' '),
                         CONCAT(@searchWord, ' ')))
               + CASE WHEN LEFT(LOWER(`text`), 4) = CONCAT(@searchWord, ' ') THEN 1 ELSE 0 END
               + CASE WHEN RIGHT(LOWER(`text`), 4) = CONCAT(' ', @searchWord) THEN 1 ELSE 0 END)
             AS matches
      FROM sentence) AS subq
   WHERE subq.prevMatches < @n AND @n <= subq.matches);
Explanation
All instances of ' cat ' on each line are replaced with a word that is one letter shorter. The difference in length is then calculated to find the number of instances. Finally, the single possibilities of 'cat ' and ' cat' appearing at the start and end of the line are respectively catered for. Having done this, a cumulative total of matches is maintained for each line. This is bundled up into a subquery from which the nth match can be picked by finding the row where the cumulative number of matches is no less than n but the previous total is less than n.
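As a quick sanity check of the counting trick (a hedged, standalone example for the word 'cat'; the hard-coded 4 mirrors the query above):

SELECT LENGTH('a cat is a cat')
       - LENGTH(REPLACE('a cat is a cat', ' cat ', 'cat '))              # one internal ' cat '
       + CASE WHEN LEFT('a cat is a cat', 4)  = 'cat ' THEN 1 ELSE 0 END # no match at the start
       + CASE WHEN RIGHT('a cat is a cat', 4) = ' cat' THEN 1 ELSE 0 END # one match at the end
       AS matches;                                                       # returns 2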
Further potential improvements
The above could of course be slightly simplified by making the source text lower case (which seems sensible if it is being pre-processed) and removing all calls to LOWER().
The subquery calculates a cumulative total number of matches. If it is likely that the same search terms will be reused, it might conceivably be possible to cache these results in another table and use triggers to maintain this whenever records are updated, inserted or deleted - however this would greatly add to the complexity and data storage requirements.
I would search for all rows with "cat" but limit the rows by n. This should give you a reasonably sized subset of your data that is guaranteed to contain the row you are looking for. The SQL would look similar to this:
select id, text
from your_table
where text ~* 'cat'
order by id
limit 3 --nth time cat appears
I would then implement your solution as a pl/pgsql function to get the id that contains the nth occurrence of your word:
CREATE OR REPLACE FUNCTION your_schema.row_with_nth_occurrence(character varying, integer)
RETURNS integer AS
$BODY$
Declare
arg_search_word ALIAS FOR $1;
arg_occurrence ALIAS FOR $2;
v_sql text;
v_sql2 text;
v_count integer;
v_count_total integer;
v_record your_table%ROWTYPE;
BEGIN
v_sql := 'select id, text
from your_table
where text ~* ' || arg_search_word || '
order by id
limit ' || arg_occurrence || ';';
v_count := 0;
v_count_total := 0;
FOR v_record IN v_sql LOOP
v_sql2 := 'SELECT count(*)
FROM regexp_split_to_table('||v_record.text||', E'\\s+') a
WHERE a = '|| arg_search_word ||';';
EXECUTE v_sql2 INTO v_count;
v_count_total := v_count_total + v_count;
IF v_count_total >= arg_occurrence THEN
RETURN v_record.id;
END IF;
END LOOP;
RAISE EXCEPTION '% does not occur % times in the database.', arg_search_word, arg_occurrence;
END;
All this function does is loop through the subset of rows potentially containing the desired word, counts the number of times it occurs in each row, and then returns the Id when it finds the row with the nth occurrence of the word.
Solution one:
Keep the rows in memory but centralized. All clients loop over the same list. Probably fast enough and reasonably memory friendly.
Solution two:
Use the streaming ResultSet technique from the JDBC driver; e.g.
Statement select = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
select.setFetchSize(Integer.MIN_VALUE);
ResultSet result = select.executeQuery(sql);
As explained in http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html (scroll down to ResultSet). This should be memory friendly.
Now simply count on the result rows until satisfied and close the result.
I am having trouble understanding your statement:
eg: "Select the 3rd match for the word cat, that is the number 2
entry." Results: the 2nd row from the result where the 3rd word is cat
I will assume that you mean you want to search for entries where the 3rd word of the text is "cat", and from those entries you want the second entry.
Since you mentioned that your problem lies with concurrent access and speed, you will need to somehow build an index which is optimized for your query. You could use anything for this: database, lucene, etc. My suggestion would be to build the index in-memory. Just think of it as a warm-up for your service before it can start serving requests.
In your case, you would want some kind of map with the word and word position as the key. This key will then map to a list of row numbers which is matching the key. So in the end, you will just have to do a lookup twice, first is to get a list of row numbers where it matches, then the row number which you want. So the performance you will need in the end will be a simple map lookup + array list lookup (constant).
I've provided a very simple example below. It's untested code, but it should roughly give you an idea.
You could also save the index into a file after it has been built if you want. After you have built the index and loaded it into memory, lookups will be very, very fast.
// text entry from the DB
public class TextEntry {
private int rowNb;
private String text;
// getters & setters
}
// your index class
public class Index {
private Map<Key, List<Integer>> indexMap;
// getters and setters
public static class Key {
private int wordPosition;
private String word;
// getters and setters
}
}
// your searcher class
public class Searcher {
private static Index index = null;
private static List<TextEntry> allTextEntries = null;
public static init() {
// init all data with some synchronization check
// synchronization check whether index has been built
allTextEntries.forEach(entry -> {
// split the words, and build the index based on the word position and the word
String[] words = entry.getText().split(" ");
for (int i = 0; i < words.length; i++) {
Index.Key key = new Index.Key(i + 1, words[i]);
int rowNumber = entry.getRowNb();
// if the key is already there, just add the row number if it's not the last one
if (indexMap.containsKey(key)) {
List entryMatch = indexMap.get(key);
if (entryMatch.get(entryMatch.size() - 1) != rowNumber) {
entryMatch.add(rowNumber);
}
} else {
// if key is not there, add a new one
List entryMatch = new ArrayList<Integer>();
entryMatch.add(rowNumber);
indexMap.put(key, entryMatch);
}
}
});
}
public static TextEntry search(String word, int wordPosition, int resultNb) {
// call init if not yet called, do some check
int rowNb = index.getIndexMap().get(new Index.Key(wordPosition, word)).get(resultNb - 1);
return allTextEntries.get(rowNb);
}
}
In MySQL we need a function that counts the number of occurrences of a given substring in a field.
Create the function (it counts occurrences of the substring in the given column):
CREATE FUNCTION substrCount(
x varchar(255), delim varchar(12)) returns int
return (length(x)-length(REPLACE(x,delim, '')))/length(delim);
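A quick sanity check of the function (assuming it was created as above):

SELECT substrCount('a cat is a cat', 'cat'); -- returns 2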
This function should be able to find how many times 'cat' is present in the text.
Please bear with me on the syntax, as the code may not be fully functional (correct as required).
I will break this problem into 3 parts, handled with the help of a stored procedure:
1. Select all the rows containing the string 'cat' (or any other input). This should select at most n rows (n = number of occurrences), so we will use LIMIT in our query.
2. With a cursor, iterate over the matched rows in a loop.
3. Increment the number of occurrence matches per row in a count variable and exit once the required number of matches is found (it should be found within 1 to n loops).
Then create the stored procedure.
Assuming a proper index, this should be fast.
DELIMITER $$
CREATE PROCEDURE find_match(INOUT string_to_match varchar(100),
                            INOUT occurence_count INTEGER, OUT match_field varchar(100))
BEGIN
  DECLARE v_count INTEGER DEFAULT 0;
  DECLARE v_finished INTEGER DEFAULT 0;
  DECLARE v_text varchar(100) DEFAULT "";
  -- declare cursor and select by the order you want
  DECLARE matcher_cursor CURSOR FOR
    SELECT textField FROM myTable
    WHERE textField LIKE CONCAT('%', string_to_match, '%')
    ORDER BY id
    LIMIT 0, occurence_count;
  -- declare NOT FOUND handler
  DECLARE CONTINUE HANDLER
    FOR NOT FOUND SET v_finished = 1;
  OPEN matcher_cursor;
  get_matching_occurence: LOOP
    FETCH matcher_cursor INTO v_text;
    IF v_finished = 1 THEN
      LEAVE get_matching_occurence;
    END IF;
    -- use the substring count function
    SET v_count = v_count + substrCount(v_text, string_to_match);
    -- if the count reaches the requested occurrence, the matching row is found
    IF v_count >= occurence_count THEN
      SET match_field = v_text;
      LEAVE get_matching_occurence;
    END IF;
  END LOOP get_matching_occurence;
  CLOSE matcher_cursor;
END$$
DELIMITER ;
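A hypothetical invocation (user variables are needed because the first two parameters are declared INOUT):

SET @word = 'cat';
SET @n = 3;
CALL find_match(@word, @n, @matched_text);
SELECT @matched_text; -- the text of the row containing the 3rd occurrence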
I tested this on a table with 1.2 million rows and it returns data in less than a second. I am using a split function (a modified form of Jeff Moden's splitter) from here: http://sqlperformance.com/2012/08/t-sql-queries/splitting-strings-follow-up
-- Step 1. Create table
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Sentence](
[id] [int] IDENTITY(1,1) NOT NULL,
[Text] [varchar](250) NULL,
CONSTRAINT [PK_Sentence] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
Step 2. Create a split function
CREATE FUNCTION [dbo].[SplitSentence]
(
    @CSVString NVARCHAR(MAX),
    @Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
     E2(N) AS (SELECT 1 FROM E1 a, E1 b),
     cteTally(N) AS (SELECT 0
                     UNION ALL
                     SELECT TOP (DATALENGTH(ISNULL(@CSVString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E2),
     cteStart(N1) AS (SELECT t.N+1
                      FROM cteTally t
                      WHERE (SUBSTRING(@CSVString,t.N,1) = @Delimiter OR t.N = 0))
SELECT Word = SUBSTRING(@CSVString, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@CSVString,s.N1),0)-s.N1,50))
FROM cteStart s;
Step 3. Create a sql script to return the required data
DECLARE @n int = 3
DECLARE @Word varchar(50) = 'cat'

;WITH myData AS
  (SELECT TOP (@n)
          id
         ,[Text]
         ,sp.word
         ,ROW_NUMBER() OVER (ORDER BY Id) RowNo
   FROM Sentence
   CROSS APPLY (SELECT * FROM SplitSentence(Sentence.[Text],' ')) sp
   WHERE Word = @Word)
SELECT *
FROM myData
WHERE RowNo = @n
Assumptions:
1. The sentence has a max length of 250 characters. If needed this can be modified in the create table statement.
2. The sentence will not have more than a 100 words. If more than 100 words are needed, the split function will have to be modified.
3. Any word in the sentence has a max length of 50 characters.
SQL Fiddle demo here: http://sqlfiddle.com/#!3/0a1d0/1
Notes:
I am aware that the original requirement is for MySQL/pgsql,
but I have limited knowledge of these and therefore my solution has been created/tested in MSSQL.
I would simply count the number of words in each line and then do a cumulative sum. I'm not sure what the most efficient way is to count words, but a difference of lengths might win:
select t.*
from (select t.*, sum(cnt) over (order by id) as cumecnt
      from (select t.*,
                   (length(' ' || str || ' ') - length(replace(' ' || str || ' ', ' cat ', ''))) / length(' cat ') as cnt
            from t
           ) t
      where cnt > 0
     ) t
where cumecnt >= 3 and cumecnt - cnt < 3;
You would simply replace "3" and "cat" with the appropriate strings.
This method requires scanning the strings a handful of times in each row (once for each of the lengths and once for the replace). My guess is that this is faster than various array operations, regular expressions, or full-text machinery. If you have a more complicated definition of what a word is, then you probably need to use a regular-expression replace instead.
Doing the work in the database is usually a big win. However, if you are looking for the 6th match out of one million rows, it might be faster to read back the values from the subquery and do the accumulation in the application. I don't think there is a way to short-circuit the database calculation to stop just on the "6th" row.

How to get the first field from an anonymous row type in PostgreSQL 9.4?

=# select row(0, 1) ;
row
-------
(0,1)
(1 row)
How do I get 0 within the same query? I figured out the below, which sort of works, but is there any simpler way?
=# select json_agg(row(0, 1))->0->'f1' ;
?column?
----------
0
(1 row)
No luck with array-like syntax [0].
Thanks!
Your row type is anonymous and therefore you cannot access its elements easily. What you can do is create a TYPE and then cast your anonymous row to that type and access the elements defined in the type:
CREATE TYPE my_row AS (
x integer,
y integer
);
SELECT (row(0,1)::my_row).x;
Like Craig Ringer commented in your question, you should avoid producing anonymous rows to begin with, if you can help it, and type whatever data you use in your data model and queries.
If you just want the first element from any row, convert the row to JSON and select f1...
SELECT row_to_json(row(0,1))->'f1'
Or, if you are always going to have two integers or a strict structure, you can create a temporary table (or type) and a function that selects the first column.
CREATE TABLE tmptable(f1 int, f2 int);
CREATE FUNCTION gettmpf1(tmptable) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;
SELECT gettmpf1(ROW(0,1));
Resources:
https://www.postgresql.org/docs/9.2/static/functions-json.html
https://www.postgresql.org/docs/9.2/static/sql-expressions.html
The json solution is very elegant. Just for fun, this is a solution using regexp (much uglier):
WITH r AS (SELECT row('quotes, "commas",
and a line break".',null,null,'"fourth,field"')::text AS r)
--WITH r AS (SELECT row('',null,null,'')::text AS r)
--WITH r AS (SELECT row(0,1)::text AS r)
SELECT CASE WHEN r.r ~ '^\("",' THEN ''
WHEN r.r ~ '^\("' THEN regexp_replace(regexp_replace(regexp_replace(right(r.r, -2), '""', '\"', 'g'), '([^\\])",.*', '\1'), '\\"', '"', 'g')
ELSE (regexp_matches(right(r.r, -1), '^[^,]*'))[1] END
FROM r
When converting a row to text, PostgreSQL uses quoted CSV formatting. I couldn't find any tools for importing quoted CSV into an array, so the above is a crude text manipulation via mostly regular expressions. Maybe someone will find this useful!
With Postgresql 13+, you can just reference individual elements in the row with .fN notation. For your example:
select (row(0, 1)).f1; --> returns 0.
See https://www.postgresql.org/docs/13/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

Parameters in query with in clause?

I want to use parameter for query like this :
SELECT * FROM MATABLE
WHERE MT_ID IN (368134, 181956)
so I think about this
SELECT * FROM MATABLE
WHERE MT_ID IN (:MYPARAM)
but it doesn't work...
Is there a way to do this ?
I actually use IBX and Firebird 2.1
I don't know in advance how many parameters the IN clause will have.
For whoever is still interested: I did it in Firebird 2.5 using another stored procedure, inspired by this post:
How to split comma separated string inside stored procedure?
CREATE OR ALTER PROCEDURE SPLIT_STRING (
  ainput varchar(8192))
RETURNS (
  result varchar(255))
AS
DECLARE variable lastpos integer;
DECLARE variable nextpos integer;
DECLARE variable tempstr varchar(8192);
BEGIN
  AINPUT = :AINPUT || ',';
  LASTPOS = 1;
  NEXTPOS = position(',', :AINPUT, LASTPOS);
  WHILE (:NEXTPOS > 1) do
  BEGIN
    TEMPSTR = substring(:AINPUT from :LASTPOS for :NEXTPOS - :LASTPOS);
    RESULT = :TEMPSTR;
    LASTPOS = :NEXTPOS + 1;
    NEXTPOS = position(',', :AINPUT, LASTPOS);
    suspend;
  END
END
When you pass the SP the following list
CommaSeperatedList = 1,2,3,4
and call
SELECT * FROM SPLIT_STRING(:CommaSeperatedList)
the result will be :
RESULT
1
2
3
4
And can be used as follows:
SELECT * FROM MyTable where MyKeyField in ( SELECT * FROM SPLIT_STRING(:CommaSeperatedList) )
I ended up using a global temporary table in Firebird, inserting parameter values first; to retrieve results I use a regular JOIN instead of a WHERE ... IN clause. The temporary table is transaction-specific and cleared on commit (ON COMMIT DELETE ROWS).
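A sketch of that approach (untested; Firebird 2.1+ global temporary table syntax, with illustrative names):

CREATE GLOBAL TEMPORARY TABLE param_ids (id INTEGER)
  ON COMMIT DELETE ROWS;

-- per transaction: insert each value, then join instead of IN
INSERT INTO param_ids (id) VALUES (368134);
INSERT INTO param_ids (id) VALUES (181956);

SELECT m.*
FROM MATABLE m
JOIN param_ids p ON p.id = m.MT_ID;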
Maybe you should write it like this:
SELECT * FROM MATABLE
WHERE MT_ID IN (:MYPARAM1 , :MYPARAM2)
I don't think it's something that can be done. Is there any particular reason why you don't want to build the query yourself?
I've used this method a couple of times; it doesn't use parameters though. It uses a string list and its DelimitedText property. You create an IDList and populate it with your IDs.
Query.SQL.Add(Format('MT_ID IN (%s)', [IDList.DelimitedText]));
You might also be interested in reading the following:
http://www.sommarskog.se/dynamic_sql.html
and
http://www.sommarskog.se/arrays-in-sql-2005.html
Covers dynamic sql with 'in' clauses and all sorts. Very interesting.
Parameters are placeholders for single values, that means that an IN clause, that accepts a comma delimited list of values, cannot be used with parameters.
Think of it this way: wherever I place a value, I can use a parameter.
So, in a clause like: IN (:param)
I can bind the variable to a value, but only 1 value, eg: IN (4)
Now, if you consider an "IN clause value expression", you get a string of values: IN (1, 4, 6) -> that's 3 values with commas between them. That's part of the SQL string, not part of a value, which is why it cannot be bound by a parameter.
Obviously, this is not what you want, but it's the only thing possible with parameters.
The answer from Yurish is a solution in two out of three cases:
if you have a limited number of items to be added to your in clause
or, if you are willing to create parameters on the fly for each needed element (you don't know the number of elements in design time)
But if you want an arbitrary number of elements, and sometimes no elements at all, then you can generate the SQL statement on the fly. Using Format helps.
Instead of binding :MYPARAM as a single parameter in
SELECT * FROM MATABLE
WHERE MT_ID IN (:MYPARAM)
split the list inside the query (Oracle syntax):
SELECT * FROM MATABLE
WHERE MT_ID IN (SELECT REGEXP_SUBSTR(MYPARAM, '[^,]+', 1, LEVEL)
                FROM DUAL
                CONNECT BY REGEXP_SUBSTR(MYPARAM, '[^,]+', 1, LEVEL) IS NOT NULL)
with MYPARAM = '368134,181956'.
If you are using Oracle, then you should definitely check out Tom Kyte's blog post on exactly this subject (link).
Following Mr Kyte's lead, here is an example:
SELECT *
FROM MATABLE
WHERE MT_ID IN
(SELECT TRIM(substr(text, instr(text, sep, 1, LEVEL) + 1,
instr(text, sep, 1, LEVEL + 1) -
instr(text, sep, 1, LEVEL) - 1)) AS token
FROM (SELECT sep, sep || :myparam || sep AS text
FROM (SELECT ',' AS sep
FROM dual))
CONNECT BY LEVEL <= length(text) - length(REPLACE(text, sep, '')) - 1)
Where you would bind :MYPARAM to '368134,181956' in your case.
Here is a technique I have used in the past to get around that IN-statement problem. It builds an OR list based on the number of (unique) values supplied as parameters. Then all I had to do was add the parameters in the order they appeared in the supplied value list.
var
FilterValues: TStringList;
i: Integer;
FilterList: String;
Values: String;
FieldName: String;
begin
Query.SQL.Text := 'SELECT * FROM table WHERE '; // set base sql
FieldName := 'some_id'; // field to filter on
Values := '1,4,97'; // list of supplied values in delimited format
FilterList := '';
FilterValues := TStringList.Create; // will get the supplied values so we can loop
try
FilterValues.CommaText := Values;
for i := 0 to FilterValues.Count - 1 do
begin
if FilterList = '' then
FilterList := Format('%s=:param%u', [FieldName, i]) // build the filter list
else
FilterList := Format('%s OR %s=:param%u', [FilterList, FieldName, i]); // and an OR
end;
Query.SQL.Text := Query.SQL.Text + FilterList; // append the OR list to the base sql
// ShowMessage(FilterList); // see what the list looks like.
if Query.ParamCount <> FilterValues.Count then
raise Exception.Create('Param count and Value count differs.'); // check to make sure the supplied values have parameters built for them
for i := 0 to FilterValues.Count - 1 do
begin
Query.Params[i].Value := FilterValues[i]; // now add the values
end;
Query.Open;
finally
FilterValues.Free;
end;
end;
Hope this helps.
There is one trick: use a reversed SQL LIKE condition.
You pass the list as a string (VARCHAR) parameter like '~12~23~46~567~'.
Then you have a query like:
where ... :List_Param LIKE ('%~' || CAST( NumField AS VARCHAR(20)) || '~%')
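Applied to the question's table, that might look like this (a sketch; the tilde-wrapped list would normally arrive as the string parameter):

SELECT *
FROM MATABLE
WHERE '~368134~181956~' LIKE '%~' || CAST(MT_ID AS VARCHAR(20)) || '~%'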
CREATE PROCEDURE TRY_LIST (PARAM_LIST VARCHAR(255)) RETURNS (FIELD1....)
AS
BEGIN
/* Check if :PARAM_LIST begins with a comma "," and ends with a comma ","
the list should look like this --> e.g. ",1,3,4,66,778,33,"
if the format of the list is right then GO; if not, just add the commas
*/
IF (NOT SUBSTRING(:PARAM_LIST FROM 1 FOR 1)=',') THEN PARAM_LIST=','||PARAM_LIST;
IF (NOT SUBSTRING(:PARAM_LIST FROM CHAR_LENGTH(:PARAM_LIST) FOR 1)=',') THEN PARAM_LIST=PARAM_LIST||',';
/* Now you are sure that :PARAM_LIST format is correct */
/* NOW! */
FOR SELECT * FROM MY_TABLE WHERE POSITION(','||MY_FIELD||',' in :PARAM_LIST)>0
INTO :FIELD1, :FIELD2 etc... DO
BEGIN
SUSPEND;
END
END
How to use it:
SELECT * FROM TRY_LIST('3,4,544,87,66,23')
or SELECT * FROM TRY_LIST(',3,4,544,87,66,23,')
If the list has to be longer than 255 characters, then just change the header, e.g. PARAM_LIST VARCHAR(4000).

sql function to return table of names and values given a querystring

Anyone have a t-sql function that takes a querystring from a url and returns a table of name/value pairs?
eg I have a value like this stored in my database:
foo=bar&baz=qux&x=y
and I want to produce a 2-column (key and val) table (with 3 rows in this example), like this:
name | value
-------------
foo | bar
baz | qux
x | y
UPDATE: there's a reason I need this in a t-sql function; I can't do it in application code. Perhaps I could use CLR code in the function, but I'd prefer not to.
UPDATE: by 'querystring' I mean the part of the url after the '?'. I don't mean that part of a query will be in the url; the querystring is just used as data.
create function dbo.fn_splitQuerystring(@querystring nvarchar(4000))
returns table
as
/*
 * Splits a querystring-formatted string into a table of name-value pairs
 * Example Usage:
   select * from dbo.fn_splitQueryString('foo=bar&baz=qux&x=y&y&abc=')
*/
return (
  select 'name'  = SUBSTRING(s, 1, case when charindex('=',s)=0 then LEN(s) else charindex('=',s)-1 end)
       , 'value' = case when charindex('=',s)=0 then '' else SUBSTRING(s, charindex('=',s)+1, 4000) end
  from dbo.fn_split('&', @querystring)
)
go
Which utilises this general-purpose split function:
create function dbo.fn_split(@sep nchar(1), @s nvarchar(4000))
returns table
/*
 * From https://stackoverflow.com/questions/314824/
 * Splits a string into a table of values, with single-char delimiter.
 * Example Usage:
   select * from dbo.fn_split(',', '1,2,5,2,,dggsfdsg,456,df,1,2,5,2,,dggsfdsg,456,df,1,2,5,2,,')
*/
AS
RETURN (
  WITH Pieces(pn, start, stop) AS (
    SELECT 1, 1, CHARINDEX(@sep, @s)
    UNION ALL
    SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
    FROM Pieces
    WHERE stop > 0
  )
  SELECT pn,
         SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 4000 END) AS s
  FROM Pieces
)
go
Ultimately letting you do something like this:
select name, value
from dbo.fn_splitQuerystring('foo=bar&baz=something&x=y&y&abc=&=whatever')
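For the example above, the output should look something like this (hedged: exactly how the empty and valueless tokens land follows from the charindex logic in fn_splitQuerystring):

name | value
-------------
foo  | bar
baz  | something
x    | y
y    |
abc  |
     | whatever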
I'm sure TSQL could be coerced to jump through this hoop for you, but why not parse the querystring in your application code where it most probably belongs?
Then you can look at this answer for what others have done to parse querystrings into name/value pairs.
Please don't encode your query strings directly in URLs, for security reasons: anyone can easily substitute any old query to gain access to information they shouldn't have -- or worse, "DROP DATABASE;". Checking for suspicious "keywords" or things like quote characters is not a solution -- creative hackers will work around these measures, and you'll annoy everyone whose last name is "O'Reilly."
Exceptions: in-house-only servers or public https URLS. But even then, there's no reason why you can't build the SQL query on the client side and submit it from there.