How to search for different character sets in PostgreSQL?

I want to search a table in a postgres DB which contains both Arabic and English text. For example:
id | content
-----------------
1 | دجاج
2 | chicken
3 | دجاج chicken
The result would get me row 3.
I imagine this has to do with limiting characters using regex, but I cannot find a clean solution to select both. I tried:
SELECT regexp_matches(content, '^([x00-\xFF]+[a-zA-Z][x00-\xFF]+)*')
FROM mg.messages;
However, this only matches English and some non-English characters within {}.

I know nothing about Arabic text or RTL languages in general, but this worked:
create table phrase (
id serial,
phrase text
);
insert into phrase (phrase) values ('apple pie');
insert into phrase (phrase) values ('فطيرة التفاح');
select *
from phrase
where phrase like ('apple%')
or phrase like ('فطيرة%');
http://sqlfiddle.com/#!15/75b29/2

If you want to find all rows that have at least one Unicode character from the Arabic range (U+0600 to U+06FF), you can use the following:
SELECT content FROM mg.messages WHERE content ~ E'[\u0600-\u06FF]';
This would also return id 1 (Arabic only), so to get only the mixed rows you would have to adapt the pattern to match an Arabic character followed or preceded by an ASCII (English) letter.
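The adaptation described above can be sketched outside the database. The following is a minimal Python illustration, not the answerer's code: it treats a row as "mixed" when it contains at least one character from the Arabic block and at least one ASCII letter. The `rows` dictionary and the `is_mixed` helper are hypothetical names that mirror the sample table in the question. In PostgreSQL the same idea would be two `~` conditions joined with AND.

```python
import re

# Arabic Unicode block, the same range used in the SQL pattern above.
ARABIC = re.compile(r"[\u0600-\u06FF]")
# ASCII letters only; an assumption about what "English" means here.
LATIN = re.compile(r"[A-Za-z]")


def is_mixed(text: str) -> bool:
    """Return True when text has both an Arabic character and an ASCII letter."""
    return bool(ARABIC.search(text)) and bool(LATIN.search(text))


# Hypothetical copy of the sample rows from the question.
rows = {1: "دجاج", 2: "chicken", 3: "دجاج chicken"}

mixed_ids = [row_id for row_id, content in rows.items() if is_mixed(content)]
print(mixed_ids)
```

Checking for the two scripts separately avoids having to enumerate every order in which the Arabic and English parts might appear.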
If you want to search for any other character set (range), here is a list of all the Unicode Blocks (Hebrew, Greek, Cyrillic, Hieroglyphs, Ideographs, dingbats, etc.)

Related

Getting the Column containing the non-english language in ORACLE

I have the above entries in my database. My requirement is to extract the rows containing non-English characters (including rows where the data combines English and non-English characters, like the HotelName field for ID 45).
I tried the regexp_like function, looking for alphanumeric and non-alphanumeric characters, but some of my data combines both, and the condition fails there.
Thanks in Advance
Raghavan
Does this do what you want?
where regexp_like(hotelname, '[^a-zA-Z0-9 ]')
That is, where the hotel name contains any character that is not a "letter" or digit. You may need to take additional characters into account as well, such as commas, periods, and hyphens.
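To see how that character class behaves, here is a small Python sketch of the same pattern. It is only an illustration of the regex logic, not Oracle code, and the hotel names are made-up sample data. It also shows the caveat mentioned above: ordinary punctuation such as a hyphen is flagged too, unless it is added to the allowed set.

```python
import re

# Same class as in the regexp_like call: anything that is not
# an ASCII letter, an ASCII digit, or a space.
NON_PLAIN = re.compile(r"[^a-zA-Z0-9 ]")


def has_non_plain_char(name: str) -> bool:
    """Return True when name contains a character outside [a-zA-Z0-9 ]."""
    return NON_PLAIN.search(name) is not None


# Hypothetical hotel names, not taken from the question's data.
hotel_names = ["Grand Hotel 45", "Hotel \u6771\u4eac", "H\u00f4tel Ritz", "Bed-and-Breakfast"]

flagged = [name for name in hotel_names if has_non_plain_char(name)]
print(flagged)
```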

SQL Text Function for special characters

I have a field with text reviews in it and I want to spot where people have used special characters to get offensive words etc past the filters, so instead of typing badword they type b.a.d.w.o.r.d or b*a*d*w*o*r*d,
Is there a way to look for say 3 or more special characters in word in a text review, maybe some sort of count function for special characters?
If you have a table with a field containing the words you don't want to allow, you could use it in your WHERE clause like so, using REGEXP_REPLACE:
SELECT yourfield
FROM yourtable
WHERE REGEXP_REPLACE(yourfield,'[^a-zA-Z'']','') NOT IN (SELECT badwords
FROM badwordstable)
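The question also asks about counting special characters inside a word. A rough Python sketch of that idea follows. It is an assumption-laden illustration rather than a database solution: it works per whitespace-separated word instead of on the whole field, `BAD_WORDS` stands in for the hypothetical badwordstable, and the threshold of 3 comes from the question.

```python
import re

# Hypothetical blocklist standing in for badwordstable.
BAD_WORDS = {"badword"}


def flagged_words(review: str, min_specials: int = 3) -> list:
    """Return words that hide a blocked word behind special characters."""
    hits = []
    for token in review.split():
        # Strip everything except letters and apostrophes, as in the SQL above.
        stripped = re.sub(r"[^a-zA-Z']", "", token).lower()
        # Count the characters that are not letters, digits, or apostrophes.
        specials = len(re.findall(r"[^a-zA-Z0-9']", token))
        if specials >= min_specials and stripped in BAD_WORDS:
            hits.append(token)
    return hits


print(flagged_words("what a b.a.d.w.o.r.d this is"))
```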

Strange behaviour with Fulltext search in SQL Server

I have MyTable with a Column Message NVARCHAR(MAX).
Record with ID 1 contains the Message '0123456789333444 Test'
When I run the following query
DECLARE @Keyword NVARCHAR(100)
SET @Keyword = '0123456789000001*'
SELECT *
FROM MyTable
WHERE CONTAINS(Message, @Keyword)
Record ID 1 shows up in the results, and in my opinion it should not, because 0123456789333444 does not contain 0123456789000001.
Can someone explain why the record shows up anyway?
EDIT
select * from sys.dm_fts_parser('"0123456789333444 Test"',1033,0,0)
returns the following:
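group_id | phrase_id | occurrence | special_term | display_term       | expansion_type | source_term
---------+-----------+------------+--------------+--------------------+----------------+----------------------
1        | 0         | 1          | Exact Match  | 0123456789333444   | 0              | 0123456789333444 Test
1        | 0         | 1          | Exact Match  | nn0123456789333444 | 0              | 0123456789333444 Test
1        | 0         | 2          | Exact Match  | test               | 0              | 0123456789333444 Test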
This is because @Keyword is not wrapped in double quotes, which is what makes the asterisk match zero, one, or more characters.
Specifies a match of words or phrases beginning with
the specified text. Enclose a prefix term in double quotation marks
("") and add an asterisk (*) before the ending quotation mark, so that
all text starting with the simple term specified before the asterisk
is matched. The clause should be specified this way: CONTAINS (column,
'"text*"'). The asterisk matches zero, one, or more characters (of the
root word or words in the word or phrase). If the text and asterisk
are not delimited by double quotation marks, so the predicate reads
CONTAINS (column, 'text*'), full-text search considers the asterisk as
a character and searches for exact matches to text*. The full-text
engine will not find words with the asterisk (*) character because
word breakers typically ignore such characters.
When <prefix_term> is a phrase, each word contained in the phrase is
considered to be a separate prefix. Therefore, a query specifying a
prefix term of "local wine*" matches any rows with the text of "local
winery", "locally wined and dined", and so on.
Have a look at the MSDN documentation on the topic.
Have you tried to query the following view to see what's on the system stoplist?
select * from sys.fulltext_system_stopwords where language_id = 1033;
Found a solution that works: I added language 1033 as an additional parameter.
SELECT * FROM MyTable WHERE CONTAINS(Message, @Keyword, LANGUAGE 1033)

PostgreSQL String search for partial patterns removing extraneous characters

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that allows a flexible search, so that dashes, spaces and the like are ignored and only the raw characters are compared against the table.
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If user types UTR-1 it fails to bring up UTR1 or UTR 1 stored in the database.
But the matches do not happen when a part_no has a dash and the user omits this character (or vice versa)
EXAMPLE search for part UTR-1 should find all matches below.
UTR1
UTR --1
UTR 1
any suggestions...
You may well find the official, built-in (from 8.3 at least) full-text search capabilities in PostgreSQL worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the
same piece of text.
As an example, a hyphenated word will be reported both as the entire word
and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
alias | description | token
-----------------+------------------------------------------+---------------
numhword | Hyphenated word, letters and digits | foo-bar-beta1
hword_asciipart | Hyphenated word part, all ASCII | foo
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | bar
blank | Space symbols | -
hword_numpart | Hyphenated word part, letters and digits | beta1
SELECT *
FROM Productions
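WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g')
= REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '', 'g')
In PostgreSQL the 'g' flag is needed so that every non-alphanumeric character is removed, not just the first one. Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g') for this to work fast.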
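The comparison above boils down to normalizing both sides before testing equality. A minimal Python sketch of that logic follows, using the part numbers from the question plus one made-up non-matching entry (`UTX-1`). The `normalize` helper is a hypothetical name, and it uses an ASCII-only class where the SQL uses `[:alnum:]`.

```python
import re


def normalize(part_no: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", part_no)


# Values from the question, plus a hypothetical non-matching part.
stored = ["UTR1", "UTR --1", "UTR 1", "UTX-1"]
search_term = "UTR-1"

matches = [p for p in stored if normalize(p) == normalize(search_term)]
print(matches)
```

Because the stored side is normalized with a fixed expression, the database can precompute it, which is why an expression index helps.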

In Postgres / SQL how can I search for names that don't start with a letter?

I want to find all the names that start with numbers, weird chars (.,-#$, etc) and everything else that isn't a letter.
For example, I have 3 names: John, #1 John and 2John. What I want to get is the last 2 names (and I don't know what weird chars the names can start with, so it must be something like ![a-Z]).
I'm using postgresql.
SELECT *
FROM Table
WHERE name ~ '^[^a-zA-Z]'
If accented or non-Latin characters don't fall under your definition of "weird stuff", you may use:
SELECT *
FROM Table
WHERE name ~ '^[^[:alpha:]]'
PostgreSQL Manual: Pattern Matching
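The difference between the two patterns can be shown with a short Python sketch. This is only an approximation of the SQL: the sample names are the ones from the question plus a made-up accented name, and `str.isalpha` stands in for `[:alpha:]`, whose exact behavior in PostgreSQL depends on the database locale.

```python
import re

# Names from the question, plus a hypothetical accented one.
names = ["John", "#1 John", "2John", "\u00c9lodie"]

# Equivalent of '^[^a-zA-Z]': the first character is not an ASCII letter.
ascii_only = [n for n in names if re.match(r"[^a-zA-Z]", n)]

# Rough stand-in for '^[^[:alpha:]]': the first character is not a letter
# in any script (Unicode-aware, unlike the ASCII class above).
unicode_aware = [n for n in names if n and not n[0].isalpha()]

print(ascii_only)
print(unicode_aware)
```

The accented name is caught by the ASCII-only pattern but not by the letter-aware one, which is the distinction the second query is making.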