Oracle DBMS_LOB.INSTR and CONTAINS performance - sql

Is there any performance difference between dbms_lob.instr and contains or am I doing something wrong?
Here is my code
SELECT DISTINCT ha.HRE_A_ID, ha.HRE_A_FIRSTNAME, ha.HRE_A_SURNAME, ha.HRE_A_CITY,
ha.HRE_A_EMAIL, ha.HRE_A_PHONE_MOBIL
FROM HRE_APPLICANT ha WHERE ha.HRE_A_STATUS_ID=1 AND ha.HRE_A_CURRENT_STATUS_ID <= '7'
AND ((DBMS_LOB.INSTR(hre_a_for_search,'java') > 0)
OR EXISTS
(SELECT 1 FROM gob_attachment, gob_table WHERE hre_a_id=gob_a_record_id
AND gob_a_table_id = gob_t_id AND gob_t_code = 'HRE_APPLICANT'
AND CONTAINS (gob_a_document, 'java') > 0))
ORDER BY HRE_A_SURNAME
and here are the last two lines changed to use INSTR:
AND dbms_lob.instr(gob_a_document,utl_raw.cast_to_raw('java')) <> 0))
ORDER BY HRE_A_SURNAME
My problem is that I would like to use INSTR instead of CONTAINS, but INSTR seems to be a lot slower than CONTAINS.

CONTAINS will use an Oracle Text index so you'd expect it to be much more efficient than something like INSTR that has to read the entire CLOB at runtime. If you generate the query plans for the two statements, I expect that you'll see that the difference is related to the Oracle Text index.
Why do you want to use INSTR rather than CONTAINS?
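To make the index-versus-scan point concrete, here is a minimal Python sketch. It is not Oracle code: the sample documents and the helper names (build_token_index, contains_like, instr_like) are invented for illustration. It assumes a word-based index that is built once, loosely in the spirit of an Oracle Text index, next to a plain case-sensitive substring scan.

```python
import re
from collections import defaultdict

# Hypothetical sample documents keyed by record id (not from the original post).
DOCS = {
    1: "Senior Java developer with Oracle experience",
    2: "JavaScript and CSS front-end work",
    3: "Project manager, no programming",
}

def build_token_index(docs):
    """Map each lower-cased word to the set of ids containing it (built once)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def contains_like(index, word):
    """Index lookup: one dictionary probe, independent of document size."""
    return sorted(index.get(word.lower(), set()))

def instr_like(docs, needle):
    """Substring scan: every document is read in full on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if needle in text)

INDEX = build_token_index(DOCS)
print(contains_like(INDEX, "java"))   # whole-word, case-insensitive lookup
print(instr_like(DOCS, "java"))       # case-sensitive substring scan
print(instr_like(DOCS, "Java"))
```

Besides the cost difference, the two approaches do not answer the same question: the lookup matches whole words regardless of case, while the scan matches any case-sensitive substring, so swapping one for the other can change the result set.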

Related

Find duplicates in case-sensitive query in MS Access

I have a table containing Japanese text, in which I believe that there are some duplicate rows. I want to write a SELECT query that returns all duplicate rows. So I tried running the following query based on an answer from this site (I wasn't able to relocate the source):
SELECT [KeywordID], [Keyword]
FROM Keyword
WHERE [Keyword] IN (SELECT [Keyword]
FROM [Keyword] GROUP BY [Keyword] HAVING COUNT(*) > 1);
The problem is that Access' equality operator treats the two Japanese writing systems - hiragana and katakana - as the same thing, where they should be treated as distinct. Both writing systems have the same phonetic value, although the written characters used to represent the sound are different - e.g. あ (hiragana) and ア (katakana) both represent the sound 'a'.
When I run the above query, however, both of these characters will appear, as according to Access, they're the same character and therefore a duplicate. Essentially it's a case-insensitive search where I need a case-sensitive one.
I got around this issue when doing a simple SELECT to find a Keyword using StrComp to perform a binary comparison, because this method correctly treats hiragana and katakana as distinct. I don't know how I can adapt the query above to use StrComp, though, because it's not directly evaluating one string against another as in the linked question.
Basically what I'm asking is: how can I do a query that will return all duplicates in a table, case-sensitive?
You can use exists instead:
SELECT [KeywordID], [Keyword]
FROM Keyword as k
WHERE EXISTS (SELECT 1
FROM Keyword as k2
WHERE STRCOMP(k2.Keyword, k.KeyWord, 0) = 0 AND
k.KeywordID <> k2.KeywordID
);
Try with a self join:
SELECT k1.[KeywordID], k1.[Keyword], k2.[KeywordID], k2.[Keyword]
FROM Keyword AS k1 INNER JOIN Keyword AS k2
ON k1.[KeywordID] < k2.[KeywordID] AND STRCOMP(k1.[Keyword], k2.[Keyword], 0) = 0
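For checking the result outside Access, the same binary comparison can be sketched in Python, where string equality is already code-point exact. The function name binary_duplicates and the sample rows are hypothetical.

```python
from collections import defaultdict

def binary_duplicates(rows):
    """Group (keyword_id, keyword) rows by the exact keyword string.

    Python compares strings code point by code point, so hiragana and
    katakana are distinct, like StrComp(..., 0) in the queries above.
    """
    groups = defaultdict(list)
    for keyword_id, keyword in rows:
        groups[keyword].append(keyword_id)
    # Keep only keywords that occur on more than one row.
    return {kw: ids for kw, ids in groups.items() if len(ids) > 1}

# Hypothetical sample data: rows 1 and 3 are true duplicates, row 2 is not.
sample = [(1, "あ"), (2, "ア"), (3, "あ")]
print(binary_duplicates(sample))
```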

SQL full text search behavior on numeric values

I have a table with about 200 million records. One of the columns is defined as varchar(100) and it's included in a full text index. Most of the values are numeric. Only few are not numeric.
The problem is that it's not working well. For example, if a row contains the value '123456789' and I look for '567', it's not returning this row. It will only return rows where the value is exactly '567'.
What am I doing wrong?
SQL Server 2012.
Thanks.
Full text search doesn't support leading wildcards
In my setup, these return the same
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'28400')
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"2840*"')
This gives zero rows
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"*840*"')
You'll have to use LIKE or some fancy trigram approach
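Why a prefix term works and an infix term does not can be sketched in Python. This is only a rough model, assuming the full-text index behaves like a sorted list of whole tokens; the token list and function names are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical indexed tokens, kept sorted as an index would keep them.
TOKENS = sorted(["123456789", "28400", "28401", "567", "error"])

def prefix_search(tokens, prefix):
    """Jump to the first candidate with a binary search, then read neighbours.

    All tokens sharing a prefix sit next to each other in sorted order,
    which is why a trailing wildcard is cheap.
    """
    start = bisect_left(tokens, prefix)
    matches = []
    for token in tokens[start:]:
        if not token.startswith(prefix):
            break
        matches.append(token)
    return matches

def infix_search(tokens, fragment):
    """No ordering helps here: every token has to be inspected."""
    return [token for token in tokens if fragment in token]

print(prefix_search(TOKENS, "2840"))
print(prefix_search(TOKENS, "567"))
print(infix_search(TOKENS, "567"))
```

In this model '123456789' is a single token, so a search for 567 only finds it with the full scan, which is what LIKE '%567%' amounts to.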
The problem is probably that you are using a wrong tool since Full-text queries perform linguistic searches and it seems like you want to use simple "like" condition.
If you want to get a solution to your needs then you can post DDL+DML+'desired result'
You can do this:
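....your_query.... LIKE '%567%' ;
This will return all the rows whose value contains 567 at the beginning, at the end, or anywhere in between. Note that a pattern with a leading % cannot use an index seek, so on a table this size expect a scan.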
Most likely you are missing the % before and after the search string in the LIKE clause.
For example:
SELECT * FROM t WHERE att LIKE '66'
is the same as using WHERE att = '66'.
If you write:
SELECT * FROM t WHERE att LIKE '%66%'
it will return all rows that contain two sixes in a row.

How do I sort a VARCHAR column in PostgreSQL that contains words and numbers?

I need to order a select query by a varchar column, using numerical and text order. The query will be run from a Java program, using JDBC over PostgreSQL.
If I use ORDER BY in the select clause I obtain:
1
11
2
abc
However, I need to obtain:
1
2
11
abc
The problem is that the column can also contain text.
This question is similar (but targeted for SQL Server):
How do I sort a VARCHAR column in SQL server that contains words and numbers?
However, the solution proposed did not work with PostgreSQL.
Thanks in advance, regards,
I had the same problem and the following code solves it:
SELECT ...
FROM table
order by
CASE WHEN column < 'A'
THEN lpad(column, size, '0')
ELSE column
END;
The size variable is the length of the varchar column, e.g. 255 for varying(255).
You can use regular expression to do this kind of thing:
select THECOL from ...
order by
case
when substring(THECOL from '^\d+$') is null then 9999
else cast(THECOL as integer)
end,
THECOL
First you use regular expression to detect whether the content of the column is a number or not. In this case I use '^\d+$' but you can modify it to suit the situation.
If the regexp doesn't match, return a big number so this row will fall to the bottom of the order.
If the regexp matches, convert the string to number and then sort on that.
After this, sort regularly with the column.
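Since the query runs from application code anyway, the same ordering can also be sketched client-side. The Python below mirrors the steps above under one change: instead of a large sentinel such as 9999, it uses a leading flag in the sort key so text always sorts after numbers, however large they are. The name natural_key is hypothetical.

```python
import re

def natural_key(value):
    """Sort key: pure-digit strings first by numeric value, then other text."""
    if re.fullmatch(r"[0-9]+", value):
        # Flag 0 puts numbers first; the int gives numeric order;
        # the original string breaks ties such as "1" and "01".
        return (0, int(value), value)
    # Flag 1 pushes everything else after the numbers, in plain text order.
    return (1, 0, value)

values = ["1", "11", "2", "abc"]
print(sorted(values, key=natural_key))
```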
I'm not aware of any database having a built-in "natural sort" like the one PHP provides. All I've found is various functions:
Natural order sort in Postgres
Comment in the PostgreSQL ORDER BY documentation

Regular expressions inside SQL Server

I have stored values in my database that look like 5XXXXXX, where X can be any digit. In other words, I need to match incoming SQL query strings like 5349878.
Does anyone have an idea how to do it?
I have different cases like XXXX7XX for example, so it has to be generic. I don't care about representing the pattern in a different way inside the SQL Server.
I'm working with C# in .NET.
You can write queries like this in SQL Server:
--each [0-9] matches a single digit, this would match 5xx
SELECT * FROM YourTable WHERE SomeField LIKE '5[0-9][0-9]'
stored value in DB is: 5XXXXXX [where x can be any digit]
You don't mention data types - if numeric, you'll likely have to use CAST/CONVERT to change the data type to [n]varchar.
Use:
WHERE CHARINDEX('5', column) = 1 --the string to find comes first, then the string to search
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
CHARINDEX
ISNUMERIC
I also have different cases like XXXX7XX for example, so it has to be generic.
Use:
WHERE PATINDEX('%7%', column) = 5
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
PATINDEX
Regex Support
SQL Server 2005 and later support regex through CLR integration, but the catch is that you have to create a CLR user-defined function before you can use it. There are numerous articles providing example code if you google them. Once you have that in place, you can use:
5\d{6} for your first example
\d{4}7\d{2} for your second example
For more info on regular expressions, I highly recommend this website.
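Before wiring up a CLR function, the two expressions can be sanity-checked in any regex engine. Here is a small Python check; the constant and helper names are invented, and it assumes the whole value must match (a full match, not a search).

```python
import re

# re.ASCII restricts \d to 0-9, which is what the stored values use.
FIVE_THEN_SIX_DIGITS = re.compile(r"5\d{6}", re.ASCII)
SEVEN_IN_FIFTH_PLACE = re.compile(r"\d{4}7\d{2}", re.ASCII)

def matches(pattern, value):
    """True only when the entire value fits the pattern."""
    return pattern.fullmatch(value) is not None

print(matches(FIVE_THEN_SIX_DIGITS, "5349878"))
print(matches(FIVE_THEN_SIX_DIGITS, "53498789"))
print(matches(SEVEN_IN_FIFTH_PLACE, "1234756"))
```

If the CLR function you write searches rather than full-matches, anchor the expressions with ^ and $ to get the same behaviour.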
Try this
select * from mytable
where p1 not like '%[^0-9]%' and substring(p1,1,1)='5'
Of course, you'll need to adjust the substring value, but the rest should work...
In order to match a digit, you can use [0-9].
So you could use 5[0-9][0-9][0-9][0-9][0-9][0-9] and [0-9][0-9][0-9][0-9]7[0-9][0-9]. I do this a lot for zip codes.
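Because the masks have to be generic, the [0-9] patterns can be generated rather than typed by hand. A minimal Python sketch, assuming masks contain only digits and the placeholder X (the function name mask_to_like is hypothetical, and LIKE metacharacters such as % or _ in a mask are not escaped here):

```python
def mask_to_like(mask, placeholder="X"):
    """Turn a mask such as '5XXXXXX' into a SQL Server LIKE pattern.

    Every placeholder becomes the single-digit class [0-9];
    every other character is copied through unchanged.
    """
    return "".join("[0-9]" if ch == placeholder else ch for ch in mask)

print(mask_to_like("5XXXXXX"))
print(mask_to_like("XXXX7XX"))
```

Pass the resulting string to the query as a parameter, e.g. WHERE SomeField LIKE @pattern, rather than concatenating it into the SQL text.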
SQL Wildcards are enough for this purpose. Follow this link: http://www.w3schools.com/SQL/sql_wildcards.asp
you need to use a query like this:
select * from mytable where msisdn like '%7%'
or
select * from mytable where msisdn like '56655%'

SQL produced by Entity Framework for string matching

Given this linq query against an EF data context:
var customers = data.Customers.Where(c => c.EmailDomain.StartsWith(term))
You’d expect it to produce SQL like this, right?
SELECT {cols} FROM Customers WHERE EmailDomain LIKE @term + '%'
Well, actually, it does something like this:
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Do you know why?
Also, replacing the Where selector to:
c => c.EmailDomain.Substring(0, term.Length) == term
it runs 10 times faster but still produces some pretty yucky SQL.
NOTE: Linq to SQL correctly translates StartsWith into Like {term}%, and nHibernate has a dedicated LikeExpression.
I don't know about MS SQL Server, but on SQL Server Compact LIKE 'foo%' is thousands of times faster than CHARINDEX if you have an index on the search column. And now I'm sitting here pulling my hair out over how to force it to use LIKE.
http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/1b835b94-7259-4284-a2a6-3d5ebda76e4b
The reason is that CharIndex is a lot faster and cleaner for SQL to perform than LIKE. The reason is, that you can have some crazy "LIKE" clauses. Example:
SELECT * FROM Customer WHERE EmailDomain LIKE 'abc%de%sss%'
But, the "CHARINDEX" function (which is basically "IndexOf") ONLY handles finding the first instance of a set of characters... no wildcards are allowed.
So, there's your answer :)
EDIT: I just wanted to add that I encourage people to use CHARINDEX in their SQL queries for things that they didn't need "LIKE" for. It is important to note though that in SQL Server 2000... a "Text" field can use the LIKE method, but not CHARINDEX.
Performance seems to be about equal between LIKE and CHARINDEX, so that should not be the reason. See here or here for some discussion. Also the CAST is very weird because CHARINDEX returns an int.
CHARINDEX returns the position of the first argument within the second argument.
SQL positions start at 1, so 0 means "not found" and 1 means the string starts with the term.
http://msdn.microsoft.com/en-us/library/ms186323.aspx
I don't know why EF generates that syntax, but that's how it works.
I agree that it is no faster; I was retrieving tens of thousands of rows from our database with the letter i in the name. I did find, however, that you need to use > 0 rather than = 1 to match the term anywhere in the string, so use
{cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) > 0)
rather than
{cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Here are my two tests ....
select * from members where surname like '%i%' --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) > 0) --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) = 1) --too few results
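The difference between the = 1 and > 0 forms is easier to see with a small emulation. The Python below is only a sketch of CHARINDEX's 1-based result for a non-empty search term; it ignores collation, so unlike many SQL Server setups it is case-sensitive, and the sample values are invented.

```python
def charindex(find, search):
    """1-based position of find inside search, or 0 when absent.

    Mirrors CHARINDEX(expressionToFind, expressionToSearch) for a
    non-empty find; str.find returns -1 when absent, hence the + 1.
    """
    return search.find(find) + 1

email_domain = "gmail.com"

# "= 1" is the StartsWith translation: the term must sit at position 1.
print(charindex("gmail", email_domain) == 1)
print(charindex("mail", email_domain) == 1)

# "> 0" is a Contains test: the term may appear anywhere.
print(charindex("mail", email_domain) > 0)
```

So = 1 returning "too few results" in the test above is expected: it only keeps surnames that begin with i, which is what StartsWith asks for.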