Find a url for a file in html using a regular expression - objective-c

I've set myself a somewhat ambitious first task in learning regular expressions (and one which relates to a problem I'm trying to solve). I need to find any instance of a url that ends in .m4v, in a big html string.
My first attempt was this for jpg files
http.*jpg
Which of course seems correct on first glance, but of course returns stuff like this:
http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg
Which does match the expression in theory. So really, I need to put something in http.*m4v that says 'only the closest instance between http and m4v'. Any ideas?

As you've noticed, an expression such as the following is greedy:
http:.*\.jpg
That means it reads as much input as possible while satisfying the expression.
It's the "*" operator that makes it greedy. There's a well-defined regex technique to making this non-greedy… use the "?" modifier after the "*".
http:.*?\.jpg
Now it will match as little as possible while still satisifying the expression (i.e. it will stop searching at the first occurrence of ".jpg".
Of course, if you have a .jpg in the middle of a URL, like:
http://mydomain.com/some.jpg-folder/foo.jpg
It will not match the full URL.
You'll want to define the end of the URL as something that can't be considered part of the URL, such as a space, or a new line, or (if the URL in nested inside parentheses), a closing parenthesis. This can't be solved with just one little regex however if it's included in written language, since URLs are often ambiguous.
Take for example:
At this page, http://mysite.com/puppy.html, there's a cute little puppy dog.
The comma could technically be a part of a URL. You have to deal with a lot of ambiguities like this when looking for URLs in written text, and it's hard not to have bugs due to the ambiguities.
EDIT | Here's an example of a regex in PHP that is a quick and dirty solution, being greedy only where needed and trying to deal with the English language:
<?php
$str = "Checkout http://www.foo.com/test?items=bat,ball, for info about bats and balls";
preg_match('/https?:\/\/([a-zA-Z0-9][a-zA-Z0-9-]*)(\.[a-zA-Z0-9-]+)*((\/[^\s]*)(?=[\s\.,;!\?]))\b/i', $str, $matches);
var_dump($matches);
It outputs:
array(5) {
[0]=>
string(38) "http://www.foo.com/test?items=bat,ball"
[1]=>
string(3) "www"
[2]=>
string(4) ".com"
[3]=>
string(20) "/test?items=bat,ball"
[4]=>
string(20) "/test?items=bat,ball"
}
The explanation is in the comments.

Perl, ruby, php and javascript should all work with these:
/(http:\/\/(?:(?:(?!\http:\/\/).))+\.jpg)/
The URLs will be stored in the matched groups. Tested this out against "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg" and it worked correctly without being too greedy.

Related

Ambigous route matches on slugified urls

I'm getting an ambiguous match for uri: Toyota-Corolla-vehicles/2 Iv'e isolated the problem down to these two routes
[HttpGet("{make}-vehicles/{makeId:int}")]
[HttpGet("{make}-{query}-vehicles/{makeId:int}")]
It looks pretty unambiguous to me. Shouldn't the uri match the route with two dashes in it?
For more context:
I'm using readable url's like this Toyota-vehicles-in-2005. So I can't use forward slashes for separation.
[HttpGet("{make}-vehicles-in-{year}/{makeId:int}")]
The docs state:
Complex segments (for example, [Route("/dog{token}cat")]), are
processed by matching up literals from right to left in a non-greedy
way. See the source code for a description. For more information, see
this issue.
https://github.com/aspnet/Routing/blob/9cea167cfac36cf034dbb780e3f783114ef94780/src/Microsoft.AspNetCore.Routing/Patterns/RoutePatternMatcher.cs#L296
https://github.com/aspnet/AspNetCore.Docs/issues/8197
I would suggest you to use these routes:
[HttpGet("{make}/vehicles/{makeId:int}")]
[HttpGet("{make}/{query}/vehicles/{makeId:int}")]
There is no ambiguity between Toyota/vehicles/2 and Toyota/Corolla/vehicles/2 in that case. In your case it has ambiguity due to fact, that {query} is type of string, so Toyota-Corolla-vehicles string matches both {make}-vehicles and {make}-{query}-vehicles, because we can parse it like:
ALL {make} parameter equals to Toyota-Corolla;
{make}-{query} equals to Toyota-Corolla, where {make} is Toyota and {query} is Corolla correspondingly.
So, the problem is with your character -. If you don't want to change your routes, you can leave only [HttpGet("{make}-vehicles/{makeId:int}")] and distinguish then Toyota-Corolla by string.Split method.

Doing multiple gsub in a rails 3 app on email body template

I have an email template where the user can enter text like this:
Hello {first_name}, how are you?
and when the email actually gets sent it replaces the placeholder text {first_name} with the actual value.
There will be several of these placeholders, though, and I wasn't sure that gsub is meant to be used like this.
body = #email.body.gsub("{first_name}", #person.first_name)gsub("{last_name}", #person.last_name).gsub("",...).gsub("",...).gsub("",...).etc...
Is there a cleaner solution to achieving this functionality? Also, if anyone's done something similar to this, did they find that they eventually hit a point where using multiple gsubs on a few paragraphs for hundreds of emails was just too slow?
EDIT
I ran some tests comparing multiple gsubs vs using regex and it came out that the regex was usually 3x FASTER than using multiple gsubs. However, I think the regex code is a littler harder to read as-is, so I'm going to have to clean it up a bit but it does indeed seem that using regex is significantly faster than multiple gsubs. Since my use case will involve multiple substitutions for a large number of documents, the faster solution is better for me, even though I'll have to add a little more documentation.
You have to put in regular expressions all strings you want to catch and in the hash you put the replacement of all catches:
"123456789".gsub /(123|456)/, "123" => "ABC",
"456" => "DEF"
This code only works for ruby 1.9.
If you can use a template library like erb or haml, they are the proper tool for this kind of task.

Change Url using Regex

I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D\2. I'm assuming the extension is 3-5 alphanumeric numbers but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then #mathematical.coffee's solution will suit. However if you have a chunk of text within which is one or more URLs and you have to locate and change just those then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see #mathematical.coffee's solution for an example.
The best way to learn regexs is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

How to Parse Some Wiki Markup

Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: Should I do this in passes? So for example, the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them into different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!
This code assumes you are using RegexKitLite:
NSString *data = #"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
NSString *captureRegex = #"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
if ( captureRange.location != NSNotFound )
{
NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
stringRange.location = captureRange.location + captureRange.length;
stringRange.length = data.length - stringRange.location;
NSLog(#"Year:%#, Stuff:%#", year, textStuff);
}
}
while ( captureRange.location != NSNotFound );
Note that you really need to study up on RegEx's to build these well, but here's what the one I have is saying:
(?i)
Ignore case, I could have left that out since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this block, I escape * to match it, then there are zero or more spaces (" *") then I escape out two brackets (since brackets are also special characters in a regex).
([0-9]*)
Grab anything that is a number.
(?:\]\] \– )
Here's where we ignore stuff again, basically matching " – ". Note any "\" in the regex, I have to add another one to in the Objective-C string above since "\" is a special character in a string... and yes that means matching a regex escaped single "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab anything else, by default the RegEX engine will stop matching at the end of a line which is why it doesn't just match everything else. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the file without re-matching original matches. So to speak.
Don't forget after you add the RegExKitLite class files, you also need to add the special linker flag or you'll get lots of link errors (the RegexKitLite site has installation instructions).
I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.
If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to go to cope with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!