I'm trying to create a regex that will look for French words whether or not the user types the accented characters. So if the user is searching for "déclaré" but types in declare instead, I would still like to match the text. I'm having difficulty making this dynamic enough that it can match any French word...
Closest example from another user from a different post was:
d[eèéê]cl[aàáâ]r[eèéê]
Is it even possible to write a regex for something like this?
Any advice would be much appreciated.
I once had to create something like that.
The best thing I could come up with was something akin to having a dictionary of known letters with diacritics and replace them on the search terms, before creating a pattern for a regular expression.
Pretty much like you did on your own example.
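A minimal sketch of that dictionary approach, in Python for illustration. The ACCENT_VARIANTS map and the accent_insensitive_pattern name are made up for this example, and the map covers only a handful of French letters, so treat it as a starting point rather than a complete solution:

```python
import re

# Hypothetical, deliberately small map: base letter -> the letter plus its
# accented variants. Extend it for the letters your data really contains.
ACCENT_VARIANTS = {
    "a": "aàáâä",
    "c": "cç",
    "e": "eèéêë",
    "i": "iîï",
    "o": "oôö",
    "u": "uùûü",
}

# Reverse lookup so that a typed "é" is treated like a typed "e".
BASE_OF = {v: base for base, variants in ACCENT_VARIANTS.items() for v in variants}


def accent_insensitive_pattern(term):
    """Build a regex that matches `term` with or without accents."""
    parts = []
    for ch in term.lower():
        base = BASE_OF.get(ch, ch)
        variants = ACCENT_VARIANTS.get(base)
        if variants:
            parts.append("[" + variants + "]")   # e.g. [eèéêë]
        else:
            parts.append(re.escape(ch))          # any other character, literally
    return re.compile("".join(parts), re.IGNORECASE)


pattern = accent_insensitive_pattern("declare")
match = pattern.search("Il a déclaré forfait.")
```

Because the search term is reduced to base letters first, typing "déclaré" and typing "declare" produce the same pattern.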
I am trying to pull the 'COURSE_TITLE' column from the 'PS_TRAINING' table in PeopleSoft and write it to a UTF-8 text file to be loaded into the Workday system. The file errors out while loading because of bad characters (Ã, â and many more) present in the column. I have used a procedure that converts non-ASCII values into spaces, but because of this procedure, 'Course_Title' values written in non-English languages such as Chinese, Korean and Spanish are also replaced with spaces.
I even tried using regular expressions (regexp_like(course_title, 'Ã')) just to find the bad characters, but since the table has hundreds of thousands of rows, it would be difficult to find all of them. Please suggest a way to solve this.
If you change your approach, this may work.
Define what you want, and retrieve it.
select *
from PS_TRAINING
where not regexp_like(course_title, '[0-9A-Za-z]')
If that pulls out too much data, just add the extra characters to the regex.
I'm a bit lost.
I've had a look at the documentation, but I'm not sure whether LIKE pattern matching works the same way in BigQuery as it does in SSMS.
The code shown here works in SSMS, but the results are not correct in BigQuery, so I was wondering if there is another way to do it.
WHERE column_name NOT LIKE '[a-Z]%'
I'm looking to return strings which contain special characters or numerics.
Use REGEXP_CONTAINS instead
where not regexp_contains(column_name, r'[a-zA-Z]')
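LIKE is also supported in BigQuery as a comparison operator, but only with the % and _ wildcards; bracket character classes such as [a-Z] are not interpreted the way they are in SSMS.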
I have a database that has around 10k records and some of them contain HTML characters which I would like to replace.
For example, I can find all occurrences:
SELECT * FROM TABLE
WHERE TEXTFIELD LIKE '%&#47%'
The original string example:
this is the cool mega string that contains &#47
How do I replace all &#47 with / ?
The end result should be:
this is the cool mega string that contains /
If you want to replace a specific string with another string or a transformation of that string, you can use the "replace" function in PostgreSQL. For instance, to replace all occurrences of "cat" with "dog" in the column "myfield", you would do:
UPDATE tablename
SET myfield = replace(myfield, 'cat', 'dog');
You could add a WHERE clause or any other logic as you see fit.
Alternatively, if you are trying to convert HTML entities, ASCII characters, or between various encoding schemes, PostgreSQL has functions for that as well; see the PostgreSQL String Functions documentation.
The answer given by #davesnitty will work, but you need to think very carefully about whether the text pattern you're replacing could appear embedded in a longer pattern you don't want to modify. Otherwise you'll find someone's nooking a fire, and that's just weird.
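To make the embedded-pattern hazard concrete, here is a small Python illustration using the "cat" to "dog" example from the earlier answer. The sample sentence is invented for the demonstration:

```python
import re

text = "The cat sat on the catalog."

# Plain substring replacement also rewrites "cat" inside "catalog".
naive = text.replace("cat", "dog")

# Restricting the match to whole words leaves "catalog" alone.
whole_word = re.sub(r"\bcat\b", "dog", text)

print(naive)       # The dog sat on the dogalog.
print(whole_word)  # The dog sat on the catalog.
```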
If possible, use a suitable dedicated tool for what you're un-escaping. Got URL-encoded text? Use a URL decoder. Got XML entities? Process them through an XSLT stylesheet in text output mode, etc. These are usually safer for your data than hacking it with find-and-replace, because find-and-replace often has unfortunate side effects if not applied very carefully, as noted above.
It's possible you may want to use a regular expression. They are not a universal solution to all problems but are really handy for some jobs.
If you want to unconditionally replace all instances of "&#47" with "/", you don't need a regexp.
If you want to replace "&#47" but not "&#471", you might need a regexp, because you can do things like match only whole words, match various patterns, specify min/max runs of digits, etc.
In the PostgreSQL string functions and operators documentation you'll find the regexp_replace function, which will let you apply a regexp during an UPDATE statement.
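As a rough illustration of the kind of pattern meant here, the Python sketch below assumes the entity in question is the numeric reference &#47 (a slash), optionally followed by a semicolon, and that a longer reference such as &#471 must be left alone. The function name is made up for the example, and the same idea would still need to be adapted to regexp_replace in your database:

```python
import re


def decode_slash_entity(text):
    """Replace &#47 (with or without a trailing ';') by '/'.

    The negative lookahead (?!\\d) rejects the match when another digit
    follows, so longer numeric references like &#471 are not touched.
    """
    return re.sub(r"&#47(?!\d);?", "/", text)


fixed = decode_slash_entity("this is the cool mega string that contains &#47")
untouched = decode_slash_entity("keep &#471 as it is")
```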
To be able to say much more I'd need to know what your real data is and what you're really trying to do.
If you don't have Postgres, you can export the whole database to an SQL file, replace the string with a text editor, delete the database on your host, and re-import the edited file.
PS: be careful
Basically my issue is that users would like to search for a French word that has accented characters without typing in the accented characters, and then have the actual accented word appear highlighted if found... So for example they would type in "declare", but in the result set it would look like "déclare", and if found, "déclare" would be highlighted.
My first thought was to simply replace the characters with a regex, but then I remembered that I would need to re-insert the replaced characters after the search... I was then thinking of using some sort of character map that would track position and character, so that when the search was finished I could put the result set back the way it was. This seems a little brute force to me and I was wondering if anyone had a better alternative? I'm using Visual Studio 2005 with this app.
Any advice would be much appreciated!
Thanks
A regular expression by default matches text. The "replacement" mode is not the normal mode. So, what you want is in fact the default. The precise syntax will depend on your Regex engine, e.g. in .Net you'd use Regex.IsMatch()
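To illustrate the point in a language-neutral way, here is a short Python sketch (the question itself targets .NET, so this is only an analogy). It reuses the character-class pattern quoted in the related question and wraps each match in markers; the highlight function and the "<b>" markers are assumptions for the example. The original text is never modified before matching, so nothing has to be re-inserted afterwards:

```python
import re


def highlight(pattern, text, start="<b>", end="</b>"):
    """Wrap every match in start/end markers, keeping the matched text as-is."""
    return pattern.sub(lambda m: start + m.group(0) + end, text)


# Pattern taken from the related question; it matches with or without accents.
pattern = re.compile("d[eèéê]cl[aàáâ]r[eèéê]", re.IGNORECASE)

highlighted = highlight(pattern, "Il a déclaré forfait.")
```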
I'm working on a search module that searches in text columns that contain HTML code. The queries are constructed like: WHERE htmlcolumn LIKE '% searchterm %';
By default the module searches with spaces at both ends of the search term; when a wildcard is given at the beginning and/or end of the search term, the corresponding space is removed (*searchterm -> LIKE '%searchterm %'). I've also added the possibility to exclude results containing certain words (-searchterm -> NOT LIKE '% searchterm %'). So far so good.
The problem is that words preceded by an HTML tag are not found (<br/>searchterm is not found when searching with LIKE '% searchterm..), nor are words that come after a comma, end with a period, etc.
What I would like to do is search for words that are not preceded or followed by the characters A-Z and a-z. All other characters are OK.
Any ideas how i should achieve this? Thanks!
Look into MySQL's full-text search; it might be able to use non-letter characters as delimiters. It will also be much faster than a %term% search, since that requires a full table scan.
You could use a regular expression: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Generally speaking, it is better to use full text search facilities, but if you really want a small SQL, here it is:
SELECT * FROM `t` WHERE `htmlcolumn` REGEXP '[[:<:]]term[[:>:]]'
It returns all records that contain word 'term' whether it is surrounded with spaces, punctuation, special characters etc
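The same "not preceded or followed by a letter" rule can be sketched outside the database with lookarounds. This Python example is only an illustration of the matching rule, not of MySQL syntax; the contains_word name is hypothetical, and "letter" is assumed to mean A-Z and a-z as in the question:

```python
import re


def contains_word(term, html):
    """True if `term` occurs with no ASCII letter directly before or after it."""
    pattern = r"(?<![A-Za-z])" + re.escape(term) + r"(?![A-Za-z])"
    return re.search(pattern, html, re.IGNORECASE) is not None


print(contains_word("term", "<br/>term, and more"))  # True
print(contains_word("term", "a long terminal"))      # False
```

Note that a rule like this also matches text inside the markup itself: searching for "br" would hit the tag in "<br/>".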
I don't think SQL's "LIKE" operator alone is the right tool for the job you are trying to do. Consider using Lucene, or something like it. I was able to integrate Lucene.NET into my application in a couple days. You'll spend more time than that trying to salvage your current approach.
If you have no choice but to make your current approach work, then consider storing the text in two columns in your database. The first column holds the original text, with punctuation etc. The second column holds the text after preprocessing: just words, no punctuation, normalized so that it is easier to search with your "LIKE" approach.
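A rough sketch of how the second column could be produced, in Python. The function name is made up, the tag stripping is a crude regex rather than a real HTML parser, and treating only A-Z and a-z as word characters follows the question, so accented letters and digits would be dropped as written:

```python
import re


def normalize_for_like(html):
    """Reduce HTML to lowercase words separated and padded by single spaces."""
    text = re.sub(r"<[^>]*>", " ", html)      # crude: replace tags with a space
    text = re.sub(r"[^A-Za-z]+", " ", text)   # runs of non-letters become one space
    return " " + text.strip().lower() + " "


normalized = normalize_for_like("<p>Hello,<br/>world.</p>")
print(repr(normalized))  # ' hello world '
```

With the padding in place, a query of the form LIKE '% world %' finds the word even though the original text had a tag before it and a period after it.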