VB expressions to help search through scraped data in UiPath - vb.net

I have made a process that reads PDFs and scrapes their text in UiPath. I am struggling to come up with a regular expression that I can use to search for a PO Number. The text that comes from the scrape is fairly unstructured so my best bet is to search for a set of numbers that starts with a 'PO' with no space. For example, "PO1234567890". I will be setting a variable so the system knows that no PO number was found if the string doesn't come up with anything. Any reference material would be welcome as I am a beginner to VB. Thanks!
I have researched and cannot find a way to do the type of search I would like to do.
I expect to be able to search for something like "PO1234567890" and not let a bare "PO" be saved. So I need to search for "PO" followed by at least two digits, plus any digits after that, with no whitespace in between.

Just try the following:
Dim regex As New System.Text.RegularExpressions.Regex("PO[0-9]+")
' Matches returns a MatchCollection; store it so the results can be inspected
Dim matches As System.Text.RegularExpressions.MatchCollection = regex.Matches(SearchString)
The regex string PO[0-9]+ means:
PO followed by at least one digit.
If you want to require more digits, for example three, use PO[0-9]{3}[0-9]* which means:
PO followed by three digits and then as many further digits as it can match.
If you need help using regex matches just ask.
Hope it helps!
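The same idea, including the "no PO number found" fallback the question asks for, can be sketched in Python. This is only an illustration under stated assumptions: the function name find_po_number is hypothetical, and the pattern assumes "PO" at a word boundary followed by at least two digits, which is one reading of the question.

```python
import re

# Assumption: a PO number is "PO" at a word boundary followed by
# two or more digits, with no whitespace in between.
PO_PATTERN = re.compile(r"\bPO[0-9]{2,}")


def find_po_number(scraped_text):
    """Return the first PO number in the text, or None if there is none."""
    match = PO_PATTERN.search(scraped_text)
    return match.group(0) if match else None
```

Returning None plays the role of the variable that tells the rest of the process that nothing was found. In VB.NET the equivalent check would be whether the MatchCollection is empty.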

Related

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.
From your example Tsmoon you show partial words (T), ignoring case (s, m) and allow anything between each entered character. So as a first attempt you can:
Set the ignore case option
Between each character of the input, insert a regular expression that matches zero or more of anything. You can choose whether to match the shortest or longest run.
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck ask a new question showing your code and the RE constructed and explain what happens/doesn't work as expected.
HTH
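The two steps above can be sketched in Python rather than NSRegularExpression. The function names are hypothetical, and dropping whitespace from the search string is an assumption made here so that typed spaces do not have to line up with spaces in the target.

```python
import re


def build_sequence_pattern(query):
    """Build a case-insensitive pattern that allows anything between characters."""
    # Escape each typed character so regex metacharacters are taken literally.
    # Assumption: whitespace in the query is ignored.
    chars = [re.escape(c) for c in query if not c.isspace()]
    # ".*?" matches the shortest run of anything between two typed characters.
    return re.compile(".*?".join(chars), re.IGNORECASE | re.DOTALL)


def matches_in_sequence(query, text):
    """Return True if the query's characters appear in order within text."""
    return build_sequence_pattern(query).search(text) is not None
```

Using ".*" instead of ".*?" would pick the longest run between characters. Either choice gives the same yes/no answer, and only the matched span differs.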

Find All in a Textbox

I am working on an application that searches a text file and builds a list of every occurrence of a string (or a variation of it). It is a bit like the Find All function in a text editor, except that I can build a list from what is found, such as
S350
S250
S270
S5000
What can I use to do this search? It will have one value that does not change (the S in this case) followed by up to 4 digits.
RegEx seems like a good choice.
Something like S\d{1,4} might work for you. Note that making the digits optional, as in S(\d{1,4})?, would also match a bare S.
Expresso is my preferred regular expression composer.
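A "find all" along these lines can be sketched in Python. The function name find_codes is hypothetical, and the word boundaries are an added assumption so that longer runs such as S12345 are not reported as partial hits.

```python
import re


def find_codes(text, prefix="S"):
    """Return every occurrence of the prefix followed by 1 to 4 digits."""
    # \b on both sides keeps the match from starting or ending mid-word.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\d{1,4}\b")
    return pattern.findall(text)
```

The result is a plain list in document order, which can then be shown or processed however the application needs.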

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, lookup the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original - writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup, if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
The net result is that when I recreate the input, it's perfect except that individual words run into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
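For comparison, the same word-by-word substitution can be sketched with a regex substitution instead of a scanner. This is a different technique from the NSScanner approach above, shown in Python. The function name translate is hypothetical, and a "word" is assumed to be a run of ASCII letters, so everything else (spaces and punctuation) is left exactly where it was.

```python
import re


def translate(text, english_words, alternative_words):
    """Replace each known English word with its alternative, ignoring case."""
    # Build a lowercase lookup from the two index-aligned lists.
    lookup = {e.lower(): a for e, a in zip(english_words, alternative_words)}

    def swap(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return lookup.get(word.lower(), word)

    # Assumption: a word is a run of ASCII letters; only those runs are touched.
    return re.sub(r"[A-Za-z]+", swap, text)
```

Because only the matched words are rewritten, the original spacing and punctuation survive without having to be captured and reassembled.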

Change Url using Regex

I have a URL, for example:
http://i.myhost.com/myimage.jpg
I want to change this URL to
http://i.myhost.com/myimageD.jpg
(Add D after the image name and before the dot.)
That is, I want to add some text after the image name and before the dot using a regex.
What is the best way to do this?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D.\2 (the pattern consumes the dot, so the replacement has to put it back). I'm assuming the extension is 3-5 letters, but you can modify it to suit. E.g. if it's just jpg images then you can put jpg instead of the [a-zA-Z]{3,5}.
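A minimal Python sketch of this capture-and-replace idea follows. The function name is hypothetical, and the pattern is anchored at the end of the string as an added assumption. The dot sits outside both groups, so the replacement writes it back explicitly.

```python
import re


def add_suffix_before_extension(url, suffix="D"):
    """Insert suffix between the file name and its extension."""
    # Group 1: everything up to the last dot; group 2: a 3-5 letter extension.
    pattern = r"^(.*)\.([a-zA-Z]{3,5})$"
    # Use a function so the suffix is inserted literally, not parsed as a template.
    return re.sub(pattern, lambda m: m.group(1) + suffix + "." + m.group(2), url)
```

A string that does not end in a dot plus 3-5 letters is returned unchanged.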
This sounds like a homework question, given that the solution must use a regex. On that assumption, here is an outline to get you going.
If all you have is a URL then @mathematical.coffee's solution will suit. However, if you have a chunk of text containing one or more URLs and you have to locate and change just those, then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see @mathematical.coffee's solution for an example.
The best way to learn regexes is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.
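The part-by-part construction described in this answer can be sketched in Python. The part patterns below are deliberately simplified assumptions and not a full URL grammar, and all names are hypothetical. The point is only to show the pieces being concatenated and the "file" item being captured separately.

```python
import re

# Simplified part patterns (assumptions, not a complete URL grammar).
PROTOCOL = r"(?:https?|ftp)://"
ADDRESS = r"[\w.-]+"
DIRECTORIES = r"(?:/[\w.-]+)*"   # zero or more "directory" items
FILE_NAME = r"/[\w-]+"           # last item, up to its extension
EXTENSION = r"[A-Za-z]{3,5}"

# Group 1: everything up to the file name; group 2: the extension.
URL_WITH_FILE = re.compile(
    "(" + PROTOCOL + ADDRESS + DIRECTORIES + FILE_NAME + r")\.(" + EXTENSION + r")\b"
)


def add_suffix_to_urls(text, suffix="D"):
    """Insert suffix before the extension of every file URL found in text."""
    return URL_WITH_FILE.sub(
        lambda m: m.group(1) + suffix + "." + m.group(2), text
    )
```

URLs that have no "name.extension" item at the end, such as a bare host, do not match and are left alone.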

How to change sentence construction using Word VBA?

I have over a hundred text files and I need to change the construction of several sentences using a specific format. I am not very familiar or experienced with Word VBA but I hope I could get some ideas to help me get started. I have below the original paragraph and its desired output. Basically I need to place the values (e.g. 40-120 parts) after each item (e.g. isoleucine) and enclose those with "(" and ")".
Original: An acid combination for increasing immunity, comprising the following raw materials by weight: 40-120 parts of isoleucine, 45-135 parts of leucine, 76.5-229.5 parts of lysine hydrochloride, 21.5-64.5 parts of methionine, 35-105 parts of phenylalanine, 40-120 parts of valine, 30-90 parts of threonine, 39-117 parts of arginine, 23-69 parts of histidine, 37.5-112.5 parts of glycine, 50-150 parts of aspartate, 900-2700 parts of dried mushroom, 750-2250 parts of medlar and 250-750 parts of licorice.
Desired Output: An acid combination for increasing immunity comprises (pts.wt.): isoleucine (40-120), leucine (45-135), lysine hydrochloride (76.5-229.5), methionine (21.5-64.5), phenylalanine (35-105), valine (40-120), threonine (30-90), arginine (39-117), histidine (23-69), glycine (37.5-112.5), aspartate (50-150), dried mushroom (900-2700), medlar (750-2250) and licorice (250-750).
Maybe you could try the following sequence:
Find the part you want to change (numbers separated by "-" and followed by "parts") with the Find function and a well-formed pattern (in Word this means wildcards rather than full regular expressions)
Set the brackets at the beginning and at the end of the matched element (use the Range object)
Delete the last word ("parts") - or whatever you want to do
Loop through every result to do the same
Don't forget you can record a macro if you are looking for some tips or specific objects (even if the code produced is less complete than the code produced by Excel VBA).
Please don't hesitate to post some code if you want some more help,
Regards,
Max
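Since the inputs are plain text files, the rewrite can also be prototyped outside Word. The sketch below uses a Python regex rather than Word's wildcard Find, so it is a different technique from the one outlined above. The function name is hypothetical, and it assumes every item has the form "<low>-<high> parts of <name>" and that the introductory phrase is worded exactly as in the example.

```python
import re

# Group 1: the range, e.g. "76.5-229.5"; group 2: the item name.
# The name ends at a comma, at " and " followed by a digit, or at a period.
ITEM = re.compile(
    r"(\d+(?:\.\d+)?-\d+(?:\.\d+)?) parts of ([^,.]+?)(?=,| and \d|\.)"
)

# Assumption: the lead-in phrase is fixed, as in the example paragraph.
OLD_INTRO = ", comprising the following raw materials by weight:"
NEW_INTRO = " comprises (pts.wt.):"


def restructure(sentence):
    """Move each 'X-Y parts of name' into the form 'name (X-Y)'."""
    sentence = sentence.replace(OLD_INTRO, NEW_INTRO)
    return ITEM.sub(r"\2 (\1)", sentence)
```

The same pattern logic (find the range, find the name, swap them, add parentheses) is what the Find loop in Word would have to reproduce with wildcards and the Range object.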