Spacy: Detect if entity is quoted - spacy

I need to detect if a given entity is surrounded by quotes, either single or double quotes. How would I go about this.
My first thought was to add a custom extension to the span:
def is_quoted(span):
prev_token = span.doc[span.start - 1]
next_token = span.doc[span.end + 1]
return prev_token in ["\"", "'"] and next_token in ["\"", "'"]
Span.set_extension("is_quoted", getter=is_quoted)
But would this really be the most efficient way of doing this? I only want to do this on entities.
Or am I better of just writing a custom matcher, with a specific regex? But this will then run on my entire document.

Your custom extension looks fine if you only care about quotes immediately before and after. You just need to handle the case where the span is at the start or end of the doc correctly, which you aren't doing now - if the span is at the start you'll check doc[-1], the last token, for example.
Do you care about things like John said, "I never though I'd meet Peter Smith!", where "Peter Smith" is an entity? If so I would figure out a policy for nested quotes (maybe just ignore them if rare) and create an extension that walks through each sentence and marks each token as in quotes, or not in quotes (with quotes themselves defined however you want).
If you care about complex cases I wouldn't use a Matcher for this - it can't handle nesting well, and I think any solution with it would be more complicated than basic state tracking. (If you only care about immediately before and after it should be fine.)

Related

Why programming documentation has square brackets and commas in weird places?

Why in various programming documentation for functions do they have square brackets around parameters, but they are ordered such that the later parameters seem to be subsets of the first? Or if the brackets in that language delineate arrays it's as if the second parameter is supposed to be inside of the array of the first, but often the parameters are not even supposed to be arrays, and also they have commas in weird places.
I've seen this style all over the place and tried to find some place where it is written down why they do this. Maybe someone just arbitrarily decided on that and other programmers thought, "oh that looks cool, I'll try that in writing my own documentation.."
Or maybe there is some big book of rules for how to make programming docs? If so I'd like to know about it.
Here is an example: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice
if you go to the link in the blue box near the top of the page right bellow the h2? heading "syntax" it says this: arr.slice([begin[, end]]) meaning that the first parameter is begin, and the next parameter is end, for the slice method. When you first see something like this it looks like the brackets and commas are randomly placed.. but they do it all over the place and the same way. There must be some method to the madness!
Brackets around a parameter name indicate that it is an optional parameter. E.g. you can call the slice method without an end parameter. This is just a general rule of language syntax documentation, square brackets indicate optional words/tokens.

IEnumString searching substrings - possible?

I've implemented auto completion to a combobox like this article shows. Is it possible to make it search for substrings instead of just the beginning of the words?
http://www.codeproject.com/Articles/2371/IAutoComplete-and-custom-IEnumString-implementatio
I haven't found any way to customize how IEnumString/IAutoComplete compares the strings. Is it possible?
The built in search options help a bit but it is complete chaos. To find instring matches you need to set flag AcoWordFilter. But this will prevent from numbers being matched!! However, there is a trick to get the numbers to match: preced with a double-quote as in "3 to find a string containing or starting with "3". Some more chaos? In the AcoWordFilter you also need to prefix other characters not considered part of a "word", eg. you need to prefix parentheses with a " but then you will not find parentheses at the first position!
So the solution is either to create your own implementation of IAutoComplete or offer the user to switch between the modes (a bit awkward).
I dont think that the MS engineers are especially proud of such chaos. How about one more option: AcoSearchAnwhere?
After retrieving the Edit control's IAutoComplete interface, query it for an IAutoComplete2 interface. Calling its SetOptions member you can disable prefix filtering by specifying the ACO_NOPREFIXFILTERING AUTOCOMPLETEOPTIONS.
This is available on Windows Vista and later. If you need a solution that works with pre-Vista versions, you'll have to write your own.

Change Url using Regex

I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D\2. I'm assuming the extension is 3-5 alphanumeric numbers but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then #mathematical.coffee's solution will suit. However if you have a chunk of text within which is one or more URLs and you have to locate and change just those then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see #mathematical.coffee's solution for an example.
The best way to learn regexs is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

How can I validate text box input?

I am creating a program and I need to validate my text boxes. For the program the user needs to put in a phrase. But I am not sure how to make sure that the user actually entered in a phrase, the phrase isn't (ex.) skldkfdl, or that there isn't a space.
Strings in Java
You could do a String.Trim() to get rid of trailing whitespaces first...
then do a String.IndexOf(" ") to check for a space.
If the function returns -1, it means there is no space in the string.
Running on the assumption that you're using VB.Net - Add an event handler for the event where you want to validate the text, such as when a "Submit" button is clicked. You may want to use a CancelEventHandler, so that you can cancel the click.
In the event handler, if you're looking for just simple validation, you can use if-statements to check some simple conditions, such as if you just want to check "if input.equals(password)".
Look here for an example of using CancelEventHandler
If you're looking for some more complex validation, you'll want to use regular expressions.
This page might help get you started
Checking to see if something is "a phrase", as in, proper English, would be very difficult. You would need to make sure that all of the words are in the dictionary, and then you would need to check for proper grammar, which is incredibly complex, given English grammar rules. You may want to simplify your approach, depending on your problem. For example, maybe just check that no weird characters are used, that there is more than one space, and that each word contains a vowel.