I have a sqlite table containing records of variable length number prefixes. I want to be able to find the most complete prefix against another variable length number in the most efficient way:
eg. The table contains a column called prefix with the following numbers:
1. 1234
2. 12345
3. 123456
What would be an efficient sqlite query to find the second record as being the most complete match against 12345999.
Thanks.
A neat trick here is to reverse a LIKE clause -- rather than saying
WHERE prefix LIKE '...something...'
as you would often do, turn the prefix into the pattern by appending a % to the end and comparing it to your input as the fixed string. Order by length of prefix descending, and pick the top 1 result.
I've never used Sqlite before, but just downloaded it and this works fine:
sqlite> CREATE TABLE whatever(prefix VARCHAR(100));
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('1234');
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('12345');
sqlite> INSERT INTO WHATEVER(prefix) VALUES ('123456');
sqlite> SELECT * FROM whatever WHERE '12345999' LIKE (prefix || '%')
ORDER BY length(prefix) DESC LIMIT 1;
output:
12345
Personally I use next method, it will use indexes:
statement '('1','12','123','1234','12345','123459','1234599','12345999','123459999')'
should be generated by client
SELECT * FROM whatever WHERE prefix in
('1','12','123','1234','12345','123459','1234599','12345999','123459999')
ORDER BY length(prefix) DESC LIMIT 1;
select foo, 1 quality from bar where foo like "123*"
union
select foo, 2 quality from bar where foo like "1234*"
order by quality desc limit 1
I haven't tested it, but the idea would work in other dialects of SQL
a couple of assumptions.
you are joining with some other table so you want to know the largest variable length prefix for each record in the table you are joining with.
your table of prefixes is actually more than just the three you provide in your example...otherwise you can hardcode the logic and move on.
prefix_table.prefix
1234
12345
123456
etc.
foo.field
12345999
123999
select
a.field,
b.prefix,
max(len(b.prefix)) as length
from
foo a inner join prefix_table b on b.prefix = left(a.field, len(b.prefix))
group by
a.field,
b.prefix
note that this is untested but logically should make sense.
Without resorting to a specialized index, the best performing strategy may be to hunt for the answer.
Issue a LIKE query for each possible prefix, starting with the longest. Stop once you get rows returned.
It's certainly not the prettiest way to achieve what you wan't but as opposed to the other suggestions, indexes will be considered by the query planner. As always, it depends on your actual data. In particular, on how many rows in your table, and how long the average hunt will be.
Related
I feel like this should be simple but I'm relatively unskilled in SQL and I can't seem to figure it out. I'm used to wrangling data in python (pandas) or Spark (usually pyspark) and this would be a one-liner in either of those. Specifically, I'm using Snowflake SQL, but I think this is probably relevant to a lot of flavors of SQL.
Essentially I just want to trim the first character off of a specific column. More generally, what I'm trying to do is replace a column with a substring of the same column. I would even settle for creating a new column that's a substring of an existing column. I can't figure out how to do any of these things.
On obvious solution would be to create a temporary table with something like
CREATE TEMPORARY TABLE tmp_sub AS
SELECT id_col, substr(id_col, 2, 10) AS id_col_sub FROM table1
and then join it back and write a new table
CREATE TABLE table2 AS
SELECT
b.id_col_sub as id_col,
a.some_col1, a.some_col2, ...
FROM table1 a
JOIN tmp_sub b
ON a.id_col = b.id_col
My tables have roughly a billion rows though and this feels extremely inefficient. Maybe I'm wrong? Maybe this is just the right way to do it? I guess I could replace the CREATE TABLE table2 AS... to INSERT OVERWRITE INTO table1 ... and at least that wouldn't store an extra copy of the whole thing.
Any thoughts and ideas are most welcome. I come at this humbly from the perspective of someone who is baffled by a language that so many people seem to have mastery over.
I'm not sure the exact syntax/functions in Snowflake but generally speaking there's a few different ways of achieving this.
I guess the general approach that would work universally is using the SUBSTRING function that's available in any database.
Assuming you have a table called Table1 with the following data:
+-------+-----------------------------------------+
Code | Desc
+-------+-----------------------------------------+
0001 | 1First Character Will be Removed
0002 | xCharacter to be Removed
+-------+-----------------------------------------+
The SQL code to remove the first character would be:
select SUBSTRING(Desc,2,len(desc)) from Table1
Please note that the "SUBSTRING" function may vary according to different databases. In Oracle for example the function is "SUBSTR". You just need to find the Snowflake correspondent.
Another approach that would work at least in SQLServer and MySQL would be using the "RIGHT" function
select RIGHT(Desc,len(Desc) - 1) from Table1
Based on your question I assume you actually want to update the actual data within the table. In that case you can use the same function above in an update statement.
update Table1 set Desc = SUBSTRING(Desc,2,len(desc))
You didn't try this?
UPDATE tableX
SET columnY = substr(columnY, 2, 10 ) ;
-Paul-
There is no need to specify the length, as is evidenced from the following simple test harness:
SELECT $1
,SUBSTR($1, 2)
,RIGHT($1, -2)
FROM VALUES
('abcde')
,('bcd')
,('cdef')
,('defghi')
,('e')
,('fg')
,('')
;
Both expressions here - SUBSTR(<col>, 2) and RIGHT(<col>, -2) - effectively remove the first character of the <col> column value.
As for the strategy of using UPDATE versus INSERT OVERWRITE, I do not believe that there will be any difference in performance or outcome, so I might opt for the UPDATE since it is simpler. So, in conclusion, I would use:
UPDATE tableX
SET columnY = SUBSTR(columnY, 2)
;
I have table where millions of records are there I'm just posting sample data. Actually I'm looking to get only Endorsement data by using LIKE or LEFT but there is no difference between them in Execution time. IS there any fine way to get data in less time while dealing with Alphanumeric Data. I have 4.4M records in table. Suggest me
declare #t table (val varchar(50))
insert into #t(val)values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal')
SELECT * FROM #t where RIGHT(val,11) = 'Endorsement'
SELECT * FROM #t where val like '%Endorsement%'
Imagine you'd have to find names in a telephone book that end with a certain string. All you could do is read every single name and compare. It doesn't help you at all to see where the names with A, B, C, etc. start, because you are not interested in the initial characters of the names but only in the last characters instead. Well, the only thing you could do to speed this up is ask some friends to help you and each person scans a range of pages only. In a DBMS it is the same. The DBMS performs a full table scan and does this parallelized if possible.
If however you had a telephone book listing the words backwards, so you'd see which words end with A, B, C, etc., that sure would help. In SQL Server: Create a computed column on the reverse string:
alter table t add reverse_val as reverse(val);
And add an index:
create index idx_reverse_val on t(reverse_val);
Then query the string with LIKE. The DBMS should notice that it can use the index for speeding up the search process.
select * from t where reverse_val like reverse('Endorsement') + '%';
Having said this, it seems strange that you are interested in the end of your strings at all. In a good database you store atomic information, e.g. you would not store a person's name and birthdate in the same column ('John Miller 12.12.2000'), but in separate columns instead. Sure, it does happen that you store names and want to look for names starting with, ending with, containing substrings, but this is a rare thing after all. Check your column and think about whether its content should be separate columns instead. If you had the string ('Endorsement', 'Renewal', etc.) in a separate column, this would really speed up the lookup, because all you'd have to do is ask where val = 'Endorsement' and with an index on that column this is a super-simple task for the DBMS.
try charindex or patindex:
SELECT *
FROM #t t
WHERE CHARINDEX('endorsement', t.val) > 0
SELECT *
FROM #t t
WHERE PATINDEX('%endorsement%', t.val) > 0
CREATE TABLE tbl
(val varchar(50));
insert into tbl(val)values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal');
CREATE CLUSTERED INDEX inx
ON dbo.tbl(val)
SELECT * FROM tbl where val like '%Endorsement';
--LIKE '%Endorsement' will give better performance it will utilize the index well efficiently than RIGHT(val,ll)
I am trying to write code that allows me to check if there are any cases of a particular pattern inside a table.
The way I am currently doing is with something like
select count(*)
from database.table
where column like (some pattern)
and seeing if the count is greater than 0.
I am curious to see if there is any way I can speed up this process as this type of pattern finding happens in a loop in my query and all I need to know is if there is even one such case rather than the total number of cases.
Any suggestions will be appreciated.
EDIT: I am running this inside a Teradata stored procedure for the purpose of data quality validation.
Using EXISTS will be faster if you don't actually need to know how many matches there are. Something like this would work:
IF EXISTS (
SELECT *
FROM bigTbl
WHERE label LIKE '%test%'
)
SELECT 'match'
ELSE
SELECT 'no match'
This is faster because once it finds a single match it can return a result.
If you don't need the actual count, the most efficient way in Teradata will use EXISTS:
select 1
where exists
( select *
from database.table
where column like (some pattern)
)
This will return an empty result set if the pattern doesn't exist.
In terms of performance, a better approach is to:
select the result set based on your pattern;
limit the result set's size to 1.
Check whether a result was returned.
Doing this prevents the database engine from having to do a full table scan, and the query will return as soon as the first matching record is encountered.
The actual query depends on the database you're using. In MySQL, it would look something like:
SELECT id FROM database.table WHERE column LIKE '%some pattern%' LIMIT 1;
In Oracle it would look like this:
SELECT id FROM database.table WHERE column LIKE '%some pattern%' AND ROWNUM = 1;
Such a query as in the title would look like this I guess:
select * from table t where (t.k1='apple' and t.k2='pie') or (t.k1='strawberry' and t.k2='shortcake')
... --10000 more key pairs here
This looks quite verbose to me. Any better alternatives? (Currently using SQLite, might use MYSQL/Oracle.)
You can use for example this on Oracle, i assume that if you use regular concatenate() instead of Oracle's || on other DB, it would work too (as it is simply just a string comparison with the IN list). Note that such query might have suboptimal execution plan.
SELECT *
FROM
TABLE t
WHERE
t.k1||','||t.k2 IN ('apple,pie',
'strawberry,shortcake' );
But if you have your value list stored in other table, Oracle supports also the format below.
SELECT *
FROM
TABLE t
WHERE (t.k1,t.k2) IN ( SELECT x.k1, x.k2 FROM x );
Don't be afraid of verbose syntax. Concatenation tricks can easily mess up the selectivity estimates or even prevent the database from using indexes.
Here is another syntax that may or may not work in your database.
select *
from table t
where (k1, k2) in(
('apple', 'pie')
,('strawberry', 'shortcake')
,('banana', 'split')
,('raspberry', 'vodka')
,('melon', 'shot')
);
A final comment is that if you find yourself wanting to submit 1000 values as filters you should most likely look for a different approach all together :)
select * from table t
where (t.k1+':'+t.k2)
in ('strawberry:shortcake','apple:pie','banana:split','etc:etc')
This will work in most of the cases as it concatenate and finds in as one column
off-course you need to choose a proper separator which will never come in the value of k1 and k2.
for e.g. if k1 and k2 are of type int you can take any character as separator
SELECT * FROM tableName t
WHERE t.k1=( CASE WHEN t.k2=VALUE THEN someValue
WHEN t.k2=otherVALUE THEN someotherValue END)
- SQL FIDDLE
It is possible to write a query get all that record from a table where a certain field contains a numeric value?
something like "select street from tbladdress where street like '%0%' or street like '%1%' ect ect"
only then with one function?
Try this
declare #t table(street varchar(50))
insert into #t
select 'this address is 45/5, Some Road' union all
select 'this address is only text'
select street from #t
where street like '%[0-9]%'
street
this address is 45/5, Some Road
Yes, but it will be inefficient, and probably slow, with a wildcard on the leading edge of the pattern
LIKE '%[0-9]%'
Searching for text within a column is horrendously inefficient and does not scale well (per-row functions, as a rule, all have this problem).
What you should be doing is trading disk space (which is cheap) for performance (which is never cheap) by creating a new column, hasNumerics for example, adding an index to it, then using an insert/update trigger to set it based on the data going into the real column.
This means the calculation is done only when the row is created or modified, not every single time you extract the data. Databases are almost always read far more often than they're written and using this solution allows you to amortize the cost of the calculation over many select statement executions.
Then, when you want your data, just use:
select * from mytable where hasNumerics = 1; -- or true or ...
and watch it leave a regular expression query or like '%...%' monstrosity in its dust.
To fetch rows that contain only numbers,use this query
select street
from tbladdress
where upper(street) = lower(street)
Works in oracle .
I found this solution " select street from tbladresse with(nolock) where patindex('%[0-9]%',street) = 1"
it took me 2 mins to search 3 million on an unindexed field