PostgreSQL: Getting Strings between double Brackets - sql

I am new to PostgreSQL and decided to play around with it a little bit. I want to extract strings from marked-up text, but I do not get the result I want.
To do that I have a simple table named book and a column with a text data type called page.
The data looks something like this:
Simon the Sorcerer is a [[teenager]] transported into a fantasy world as a
[[Sorcerer (person)|sorcerer]] dressed in a cloak and [[pointy hat]]; his
cloak and hat are purple in the first game, but change to red for the rest
of the series (aside from possible magical colour changes in the third game).
He must use his logic and magical skills to solve puzzles as he progresses
through the games.
Now I want to extract each String between the brackets [[ ]]. But how do I do that correctly? Or is that even possible without Plug-ins?
I tried different versions with LIKE which did not produce any results, e.g.:
SELECT * FROM book WHERE page LIKE '[[%]]';
SELECT * FROM book WHERE page LIKE '[[teenager]]';
...
I also played a bit with regex, but I only got the whole dataset back instead of the String between the brackets:
SELECT * from book where page ~ '[^[[]+(?=\]])';
Is there a solution for my problem? Help will be much appreciated!

Use the regexp_matches function:
select regexp_matches(page,'[^[[]+(?=\]])','g') from book
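For comparison, here is a minimal Python sketch of the same extraction done on the client side. The pattern (a lazy capturing group between the double brackets) is my own choice rather than the one used in the answer above, and extract_links is just a hypothetical helper name.

```python
import re

# Lazy capture of everything between "[[" and the next "]]".
LINK = re.compile(r"\[\[(.+?)\]\]")

def extract_links(page: str) -> list[str]:
    """Return the text inside every [[...]] pair, in order of appearance."""
    return LINK.findall(page)

page = (
    "Simon the Sorcerer is a [[teenager]] transported into a fantasy world as a\n"
    "[[Sorcerer (person)|sorcerer]] dressed in a cloak and [[pointy hat]]"
)
print(extract_links(page))
# ['teenager', 'Sorcerer (person)|sorcerer', 'pointy hat']
```

If only the link target is wanted, splitting each result on "|" and taking the first part would be one way to get it.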

Related

Search postgresql database for strings containing specific words

I'm looking to query a PostgreSQL database full of strings, specifically for strings containing the word 'LOVE'. I mean only this exact word, not anything where love is the stem or where that sequence of characters appears inside another word. So far I've been using SELECT * FROM songs WHERE title LIKE '%LOVE%';, which mostly returns the desired results.
However, it also returns results like CRIMSON AND CLOVER, LOVESTONED/I THINK SHE KNOWS (INTERLUDE), LOVER YOU SHOULD'VE COME OVER and TO BE LOVED, which I want to exclude because they do not contain 'LOVE' as a standalone word.
I know you can use SELECT * FROM songs WHERE title = 'LOVE';, but this will obviously miss any string that isn't exactly 'LOVE'. Is there an operation in postgresql that can return the results I need?
You can use a regular expression that looks for love either with a space before or after, or if the word is at the start or end of the string:
with songs (title) as (
values
('Crimson And Clover'),
('Love hurts'),
('Only love can tear us apart'),
('To be loved'),
('Tainted love')
)
select *
from songs
where title ~* '\mlove\M';
The ~* is the regex operator and uses case insensitive comparison. The \m and \M restrict the match to the beginning and end of a word.
returns:
title
---------------------------
Love hurts
Only love can tear us apart
Tainted love
Online example: http://rextester.com/EUTHKM33922
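The same whole-word idea can be sketched outside the database. In Python's re module the word boundary is written \b instead of PostgreSQL's \m and \M, and re.IGNORECASE plays the role of ~*. The helper name titles_with_word is made up for this illustration.

```python
import re

def titles_with_word(titles, word):
    """Keep titles that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [t for t in titles if pattern.search(t)]

songs = [
    "Crimson And Clover",
    "Love hurts",
    "Only love can tear us apart",
    "To be loved",
    "Tainted love",
]
print(titles_with_word(songs, "love"))
# ['Love hurts', 'Only love can tear us apart', 'Tainted love']
```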

Semantic mediawiki - Set property to range of values

I'm trying to set a specific property to a non-exact value. For example, say that I want to define the height of a pine tree as usually between 3 and 80 m (according to Wikipedia). I would like to set something like [[Has height::3-80]] (of course this doesn't work) and define the unit as meters with "custom units". Then I would like to be able to query for, say, "trees that can reach a height of 70 meters" and have the pine tree included. I've been searching and trying different angles for hours now and I can't figure it out. I tried #set_recurring_event, but that seems to be only for dates/times. I also understood how to set multiple values for a property with #arraymap, but this doesn't seem to help me here. I would really appreciate help with this (it's probably very easy and right in front of me). Thx! COG
There's no such thing. But you can create a template with the parameters you want. Then you just use code like {{range|min|max|units}}. For example, your range of heights would look like {{range|3|80|m}}.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why doesn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
1. You can disable the host parsing in the database. See the PostgreSQL documentation for details. E.g. something like
ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
2. You can write your own custom dictionary.
3. You can pre-parse your data in some manner before it's inserted into the database (maybe splitting all domains before they go into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1 for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (e.g. www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries, which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).
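As a rough illustration of the pre-processing route, here is a small Python sketch that appends the dot-separated parts of anything that looks like a hostname, so the text handed to to_tsvector contains both the full host and its pieces. The pattern is a deliberately loose assumption of mine (it will also catch things like version numbers), and explode_hosts is a hypothetical name.

```python
import re

# Loose "looks like a hostname" pattern: word-ish parts joined by dots.
HOST = re.compile(r"\b[\w-]+(?:\.[\w-]+)+\b")

def explode_hosts(text: str) -> str:
    """Follow every host-like token with its dot-separated parts."""
    def expand(match):
        host = match.group(0)
        return host + " " + " ".join(host.split("."))
    return HOST.sub(expand, text)

print(explode_hosts("Visit Google.com today"))
# Visit Google.com Google com today
```

The result would then be passed to to_tsvector in place of the raw text. Keeping the original host in the output means a search for the full name should still have something to match.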

Add spaces between words in spaceless string

I'm on OS X, and in objective-c I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes some misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I attempt to write a system only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (eg dictionary lookup) then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But, you don't have to stop here. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    move first letter from l to the end of w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    FindWords(sentences, s, w, l);
    put last letter from w back onto the front of l;
}
There are, of course, a number of optimizations you could perform to make it go faster. For instance, you can check whether the current word is the start of any word in the dictionary and abandon the branch if it is not. But this is the basic approach that will give you all possible sentences.
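A runnable version of this search could look roughly like the Python sketch below. It is my own translation of the idea, not the answerer's code: instead of moving one letter at a time it tries every prefix of the remaining letters, which explores the same splits. The function name and the tiny dictionary are assumptions for the example.

```python
def find_sentences(letters: str, dictionary: set[str]) -> list[str]:
    """Return every way to split `letters` into words from `dictionary`."""
    sentences = []

    def search(start: int, words: list[str]) -> None:
        if start == len(letters):
            # All letters used up: the words collected so far form a sentence.
            sentences.append(" ".join(words))
            return
        for end in range(start + 1, len(letters) + 1):
            word = letters[start:end]
            if word in dictionary:
                words.append(word)
                search(end, words)
                words.pop()  # backtrack and try a longer word

    search(0, [])
    return sentences

words = {"bob", "ate", "a", "tea", "green", "apple"}
for sentence in find_sentences("bobateagreenapple", words):
    print(sentence)
# bob a tea green apple
# bob ate a green apple
```

With a real dictionary the number of results grows quickly, so some form of pruning or ranking is needed on top of this.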
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on a PDF and correct its output than it would just to correct what this system might give you, let alone program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them), then favor the ones with the longest words, because 2-, 3- or 4-character words can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To enumerate all possible splits I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).
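The ranking idea can be sketched in a few lines of Python. This is only my reading of the heuristic described above, not the CodeProject code: candidates are ordered first by how many characters they cover, then by how few words they need (fewer words for the same coverage means longer words). The name rank_candidates and the sample data are made up.

```python
def rank_candidates(candidates: list[list[str]]) -> list[list[str]]:
    """Order candidate splits: most characters covered first, then fewest words."""
    def score(words: list[str]) -> tuple[int, int]:
        covered = sum(len(w) for w in words)
        return (covered, -len(words))
    return sorted(candidates, key=score, reverse=True)

# Three hypothetical splits of "thereisnoway"; the last one leaves "way" unused.
candidates = [
    ["the", "re", "is", "no", "way"],
    ["there", "is", "no"],
    ["there", "is", "no", "way"],
]
print(rank_candidates(candidates)[0])
# ['there', 'is', 'no', 'way']
```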

Data Cleanup, post conversion from ALLCAPS to Title Case

Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words/names, some examples follow:
MacDonald, PhD, CPA, III
Does anyone know of an existing script that will clean up all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, e.g. "MacDonalz").
I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.
There is a thread that posed a related question, sometimes touching on this problem, but not addressing this problem specifically. You can see it here:
SQL Server: Make all UPPER case to Proper Case/Title Case
Don't know if this is of any help
private static function ucNames($surname) {
    // Capitalise each word, then re-capitalise the letter that follows O', -, Mc/Mac or Fitz.
    // Note: ucwords() does not lower-case anything, so $surname should already be lower case.
    $replaceValue = ucwords($surname);
    // preg_replace_callback is used because the /e modifier was removed in PHP 7.
    return preg_replace_callback('/
        (?: ^ | \\b )               # assertion: beginning of string or a word boundary
        ( O\' | \- | Ma?c | Fitz )  # attempt to match Irish, Scottish and double-barrelled surnames
        ( [^\W\d_] )                # match next char; we exclude digits and _ from \w
        /x',
        function ($m) { return $m[1] . strtoupper($m[2]); },
        $replaceValue);
}
It's a simple PHP function that I use to set surnames to correct case that works for names like O'Connor, McDonald and MacBeth, FitzPatrick, and double-barrelled names like Hedley-Smythe
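If PHP is not an option, the same rule translates fairly directly. Below is a hedged Python sketch that lower-cases each token first (so it can be fed ALL CAPS data) and adds a small lookup for tokens that need a fixed spelling. The EXCEPTIONS table and the function names are assumptions for illustration, and the prefix rule will mis-handle names such as "Macy", so the output still needs a manual check.

```python
import re

# Hypothetical lookup of tokens that must keep a fixed spelling.
EXCEPTIONS = {"phd": "PhD", "cpa": "CPA", "md": "MD", "ii": "II", "iii": "III"}

# Re-capitalise the letter that follows O', -, Mc/Mac or Fitz.
PREFIX = re.compile(r"\b(O'|-|Ma?c|Fitz)([^\W\d_])")

def fix_token(token: str) -> str:
    lowered = token.lower()
    if lowered in EXCEPTIONS:
        return EXCEPTIONS[lowered]
    return PREFIX.sub(lambda m: m.group(1) + m.group(2).upper(),
                      lowered.capitalize())

def fix_name(name: str) -> str:
    """Title-case a space-separated name, applying the surname rule per token."""
    return " ".join(fix_token(t) for t in name.split())

print(fix_name("JOHN MACDONALD III"))   # John MacDonald III
print(fix_name("MARY O'CONNOR CPA"))    # Mary O'Connor CPA
print(fix_name("ANN HEDLEY-SMYTHE"))    # Ann Hedley-Smythe
```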
Here is the answer I was looking for:
There is a data company, Melissa Data, that publishes some APIs and applications for database cleanup -- geared mostly around the direct marketing industry.
I was able to use two applications to solve my problem.
StyleList: this app, among other things, converts ALL CAPS to mixed case and in the process it does not dirty up the data, leaving titles such as CPA, MD, III, etc. intact, as well as natural, common camel-case names such as McDonalds.
Personator: I used Personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other useful things as well, including assigning gender to most names. It's available as an API you can call, too.
Here is a link to the solutions offered by Melissa Data:
http://www.melissadata.com/dqt/index.htm
For me, the Melissa Data apps did much of the heavy lifting and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT x or RIGHT x counts -- the dirt typically has the least uniqueness, patterns easily discovered and fixed.
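For anyone working from an exported text file rather than in SQL, the LEFT x / RIGHT x counting idea can be approximated in Python as sketched below. This is my interpretation of the approach, with a hypothetical helper name and made-up sample rows: count how often each leading and trailing fragment occurs, then eyeball the most frequent ones for patterns worth fixing in bulk.

```python
from collections import Counter

def edge_counts(values, n=3):
    """Count the first n and the last n characters across all values."""
    left = Counter(v[:n] for v in values)
    right = Counter(v[-n:] for v in values)
    return left, right

rows = ["John Smith Iii", "Ann Lee Iii", "Bob Ray Cpa", "Sue Macdonald"]
left, right = edge_counts(rows)
print(right.most_common(1))
# [('Iii', 2)]
```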