How to change sentence construction using Word VBA? - vba

I have over a hundred text files and I need to change the construction of several sentences using a specific format. I am not very familiar or experienced with Word VBA but I hope I could get some ideas to help me get started. I have below the original paragraph and its desired output. Basically I need to place the values (e.g. 40-120 parts) after each item (e.g. isoleucine) and enclose those with "(" and ")".
Original: An acid combination for increasing immunity, comprising the following raw materials by weight: 40-120 parts of isoleucine, 45-135 parts of leucine, 76.5-229.5 parts of lysine hydrochloride, 21.5-64.5 parts of methionine, 35-105 parts of phenylalanine, 40-120 parts of valine, 30-90 parts of threonine, 39-117 parts of arginine, 23-69 parts of histidine, 37.5-112.5 parts of glycine, 50-150 parts of aspartate, 900-2700 parts of dried mushroom, 750-2250 parts of medlar and 250-750 parts of licorice.
Desired Output: An acid combination for increasing immunity comprises (pts.wt.): isoleucine (40-120), leucine (45-135), lysine hydrochloride (76.5-229.5), methionine (21.5-64.5), phenylalanine (35-105), valine (40-120), threonine (30-90), arginine (39-117), histidine (23-69), glycine (37.5-112.5), aspartate (50-150), dried mushroom (900-2700), medlar (750-2250) and licorice (250-750).

You could try the following sequence:
Find the part you want to change (numbers separated by "-" and followed by "parts") with the Find function and a well-formed wildcard pattern (Word's counterpart to a regular expression).
Put the brackets at the beginning and at the end of the matched element (use the Range object).
Delete the last word ("parts"), or do whatever else you need with it.
Loop through every result and do the same.
Don't forget that you can record a macro if you are looking for tips or for specific objects (even if the code the Word recorder produces is less complete than what the Excel VBA recorder produces).
Please don't hesitate to post some code if you want some more help,
Regards,
Max
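The find-and-rearrange step described above can be sketched outside Word. This is only an illustration in Python, not Word VBA, and it uses a regular expression rather than Word wildcards; the function name restructure and the pattern are assumptions, not part of the original answer. It handles the item list only and leaves the opening clause ("comprising the following raw materials by weight:") untouched.

```python
import re

# Assumed pattern: "<low>-<high> parts of <name>", where <name> ends at a
# comma, at " and <digit>", at a full stop, or at the end of the text.
ITEM = re.compile(
    r"(\d+(?:\.\d+)?-\d+(?:\.\d+)?) parts of ([^,.]+?)(?=,| and \d|\.|$)"
)

def restructure(text):
    """Move each range after its item name and wrap it in parentheses."""
    return ITEM.sub(r"\2 (\1)", text)

sample = ("40-120 parts of isoleucine, 76.5-229.5 parts of lysine hydrochloride, "
          "750-2250 parts of medlar and 250-750 parts of licorice.")
print(restructure(sample))
```

The same capture-and-swap idea carries over to Word's Find with wildcards, although the pattern syntax there differs.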

Related

Writing to a spreadsheet in Game maker studio 2.0

I'm making a short game for one of my classes. It's for a small research study, so I want to be able to write each participant's game data to a new line in a spreadsheet. I've seen material on reading from a .csv file but nothing about writing to one, and I was wondering what I need to do for that.
CSV, as per the name, is a very simple format: records go one per line, and the items within a record are separated by commas. If an item contains a comma or is multi-line, it should be surrounded by double quotes ". If the item also needs to contain double quotes, they are replaced by pairs of double quotes "".
name,value
ex1,hello
ex2,"hello, you!"
ex3,"hello, ""quotes""!"
ex4,"hello,
lines!"
Rest assured, this is not hard to produce with file_text_* or buffer functions. You can check this implementation for an example.
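The quoting rules above are small enough to write by hand. The following is a minimal sketch in Python rather than GML, so the file_text_* and buffer functions are not used here; the names csv_field, csv_row and append_row are made up for illustration.

```python
def csv_field(value):
    """Quote one item only when it contains a comma, a quote or a line break."""
    text = str(value)
    if any(ch in text for ch in ',"\n\r'):
        # Double any embedded quotes, then wrap the whole item in quotes.
        return '"' + text.replace('"', '""') + '"'
    return text

def csv_row(values):
    """Join the escaped items of one record with commas."""
    return ",".join(csv_field(v) for v in values)

def append_row(path, values):
    """Append one record as a new line at the end of the file."""
    with open(path, "a", encoding="utf-8", newline="") as handle:
        handle.write(csv_row(values) + "\n")

print(csv_row(["ex3", 'hello, "quotes"!']))
```

Appending one line per participant keeps earlier rows intact, which is the behaviour the question asks for.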

Use MeCab to separate Japanese sentences into words not morphemes in vb.net

I am using the following code to split Japanese sentences into their words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(",")
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that folks usually combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might; these two are usually in a single bunsetsu. (J.DepP is also pretty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
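The dictionary-scanning idea from the last strategy can be sketched roughly as follows. This is a toy illustration, not how Curtiz or any parser implements it: the morphemes are hand-written (surface, lemma) pairs rather than real MeCab output, the dictionary is a small Python set standing in for JMdict, and find_runs is a hypothetical helper name.

```python
def find_runs(morphemes, dictionary, max_len=4):
    """Greedily find runs of adjacent morphemes that form a dictionary entry.

    morphemes: list of (surface, lemma) pairs in sentence order.
    Returns a list of (start, end, entry) with end exclusive.
    """
    hits = []
    i = 0
    while i < len(morphemes):
        matched = False
        # Try the longest run first, down to runs of two morphemes.
        for length in range(min(max_len, len(morphemes) - i), 1, -1):
            run = morphemes[i:i + length]
            literal = "".join(surface for surface, _ in run)
            # Deconjugated variant: literal surfaces, but the last morpheme's lemma.
            lemma_form = "".join(surface for surface, _ in run[:-1]) + run[-1][1]
            entry = next((c for c in (literal, lemma_form) if c in dictionary), None)
            if entry is not None:
                hits.append((i, i + length, entry))
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return hits

# Hand-written stand-ins for parser output and for dictionary entries.
sentence = [("それ", "それ"), ("に", "に"), ("応じ", "応じる"), ("て", "て"),
            ("大きく", "大きい"), ("に", "に"), ("なり", "なる"), ("ます", "ます"), ("。", "。")]
toy_dictionary = {"に応じて", "食べすぎる"}
print(find_runs(sentence, toy_dictionary))
```

Checking both the literal form and the lemma form of the final morpheme is what lets a conjugated phrase in the sentence match a dictionary headword.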
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks them into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my activities in learning Japanese and making Japanese learning tools but it shows how the above pieces can be combined to get to what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

Freemarker: Removing Items from One Sequence from Another Sequence

This may be something really simple, but I couldn't figure it out and have been trying to find an example online to no avail. I'm basically trying to remove items found in one sequence from another sequence.
Example #1
Items added to the cart are in one sequence; items removed from the cart are in another sequence:
<#assign Added_Items_to_Cart = "AAAA,BBBB,CCCC,DDDD,EEEE,FFFF">
<#assign Deleted_Items_from_Cart = "BBBB,DDDD">
The result I'm looking for is: AAAA,CCCC,EEEE,FFFF
Example #2
What if all the items added to and deleted from the cart are in the same sequence?
<#assign Cart_Activity = "AAAA,BBBB,BBBB,CCCC,DDDD,EEEE,DDDD,FFFF,Add,Add,Delete,Add,Add,Add,Delete,Add">
The result I'm looking for is the same: AAAA,CCCC,EEEE,FFFF
First things first: you ask about sequences, but the data you are dealing with are strings.
I know you are using the string as if it were a sequence (and that works), but sequences are sequences and strings are strings, and they are handled in different ways. I just felt this was important to clarify in case someone who is starting to learn how to program finds this answer.
Some assumptions, since you're providing strings with data separated by commas:
You want a string with data separated by commas as a result.
You know how to properly create strings with data separated by commas.
You don't have commas in your item names.
Observations:
I'll give you the logic but not the finished code, as this can be a great chance for you to learn/practice FreeMarker (Stack Overflow spirit, you know...)
Your question is not about something specific to FreeMarker (it just happens to be the language you want to work with). Think about adding the logic tag to your question. :-)
Now to the answer on how to do what you want on a "string that is working as a sequence":
Example #1
Change your string to a real sequence :-)
1 - Use a built-in to split your string on commas. Do it for both Added_Items_to_Cart and Deleted_Items_from_Cart. Now you have two real sequences to work with.
2 - Create a new string that will be your result.
3 - Iterate over the sequence of added items.
4 - For each item of the added list, check whether the deleted list also contains this item.
4.1 - If the deleted list contains the item, you do nothing.
4.2 - If the deleted list does not contain the item, you add that item to your result string.
At the end of this nested iteration (that's another hint) you should get the result you're looking for.
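For readers who want to check the logic of Example #1 before writing the template, here is a minimal sketch of the same steps in Python rather than FreeMarker. The function name remove_deleted is made up for illustration, and the sketch relies on the assumption stated above that item names contain no commas.

```python
def remove_deleted(added_csv, deleted_csv):
    """Return the added items that do not appear in the deleted list."""
    added = added_csv.split(",")      # step 1: real sequences instead of strings
    deleted = deleted_csv.split(",")
    kept = []                         # step 2: the result being built
    for item in added:                # step 3: iterate over the added items
        if item not in deleted:       # step 4: the membership test is the inner loop
            kept.append(item)
    return ",".join(kept)

print(remove_deleted("AAAA,BBBB,CCCC,DDDD,EEEE,FFFF", "BBBB,DDDD"))
```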
Example #2
There are many ways of doing it, and I'll just share the one that comes to mind right now.
I think it's noteworthy that with this approach the list always has an even number of entries, as you always insert two pieces of information each time: the item and the action.
So the first half will always be the 'item list' and the second half will always be the 'action list'.
1 - Change that string to a sequence (yes, like on the other example).
2 - Get half of its size (in your example size = 16 so half of it is 8)
3 - Iterate over a range from 0 to half-1 (in your example 0 to 7)
4 - At each iteration you'll have a number. Let's call it num (yes, I'm very creative):
4.1 - If at the position num + half you have the word "Add" you add the item of position num in your result string
4.2 - If at the position num + half you have the word "Delete" you remove the item of position num from your result string
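The steps for Example #2 can be checked the same way. This is again a sketch in Python, not FreeMarker, and replay_cart is a hypothetical name. It keeps the result as a list and joins it at the end, and it assumes the input has the even-length item/action layout described above.

```python
def replay_cart(activity_csv):
    """Replay the Add/Delete actions over the items in the first half."""
    entries = activity_csv.split(",")   # step 1: string to sequence
    half = len(entries) // 2            # step 2: the items end where the actions begin
    result = []
    for num in range(half):             # step 3: positions 0 .. half - 1
        item = entries[num]
        action = entries[num + half]    # step 4: the action paired with this item
        if action == "Add":
            result.append(item)
        elif action == "Delete" and item in result:
            result.remove(item)
    return ",".join(result)

print(replay_cart("AAAA,BBBB,BBBB,CCCC,DDDD,EEEE,DDDD,FFFF,"
                  "Add,Add,Delete,Add,Add,Add,Delete,Add"))
```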
And for the grand finale, some really useful links that will help you in your FreeMarker life forever!!!
All built-ins from freemarker:
https://freemarker.apache.org/docs/ref_builtins.html
All directives from freemarker:
https://freemarker.apache.org/docs/ref_directive_alphaidx.html
FreeMarker cheatsheet:
https://freemarker.apache.org/docs/dgui_template_exp.html#exp_cheatsheet

VB expressions to help search through scraped data in UiPath

I have made a process that reads PDFs and scrapes their text in UiPath. I am struggling to come up with a regular expression that I can use to search for a PO Number. The text that comes from the scrape is fairly unstructured so my best bet is to search for a set of numbers that starts with a 'PO' with no space. For example, "PO1234567890". I will be setting a variable so the system knows that no PO number was found if the string doesn't come up with anything. Any reference material would be welcome as I am a beginner to VB. Thanks!
I have researched and cannot find a way to do the type of search I would like to do.
I expect to be able to find "PO1234567890" without letting a bare "PO" through. So I somehow need to be able to search for "PO" followed by two digits and then any further digits, with no whitespace.
Just try the following:
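Dim poRegex As New System.Text.RegularExpressions.Regex("PO[0-9]+")
' Matches returns a MatchCollection; its Count is 0 when no PO number is found
Dim poMatches As System.Text.RegularExpressions.MatchCollection = poRegex.Matches(SearchString)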
The regex string PO[0-9]+ means:
PO followed by at least one digit.
If you want to require more digits, for example three, use PO[0-9]{3}[0-9]*, which means:
PO followed by three digits and then as many further digits as it can match.
If you need help using regex matches just ask.
Hope it helps!
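To try the pattern on sample text before wiring it into UiPath, a small sketch in Python (not VB) may help. It uses the two-digit minimum from the question; find_po is a made-up helper name, and returning None stands in for the "no PO number found" variable the question mentions.

```python
import re

# "PO", then at least two digits, then any further digits, with no whitespace.
PO_PATTERN = re.compile(r"PO[0-9]{2}[0-9]*")

def find_po(text):
    """Return the first PO number in the text, or None when there is none."""
    match = PO_PATTERN.search(text)
    return match.group(0) if match else None

print(find_po("Invoice ref PO1234567890 due in 30 days"))
print(find_po("Send to PO Box 12"))
```

The same pattern string should work unchanged in the .NET Regex class, since it only uses character classes and quantifiers.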

Add spaces between words in spaceless string

I'm on OS X, and in objective-c I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes some misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I attempt to write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to search through all the possible words that can be made from the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (eg dictionary lookup) then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But, you don't have to stop here. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
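FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    // every letter consumed and no partial word pending: s is a full sentence
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    // letters ran out in the middle of a non-word: dead end
    if (l.empty())
        return;
    move first letter from l onto the end of w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l)
        remove w from s
    }
    // also try extending w into a longer word
    FindWords(sentences, s, w, l)
    put last letter from w back onto l
}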
There are, of course, a number of optimizations you could perform to make it faster. For instance, you can check whether the current partial word is the start of any word in the dictionary and abandon the branch if it is not. But this is the basic approach that will give you all possible sentences.
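The pseudo-code above translates fairly directly into a runnable form. The following is a minimal sketch in Python rather than Objective-C; the function name segment and the tiny word set are assumptions for illustration, and no prefix pruning is applied.

```python
def segment(letters, dictionary):
    """Return every way of splitting letters into dictionary words."""
    sentences = []

    def search(start, words):
        if start == len(letters):
            # All letters consumed by complete words: record the sentence.
            sentences.append(" ".join(words))
            return
        for end in range(start + 1, len(letters) + 1):
            word = letters[start:end]
            if word in dictionary:
                words.append(word)
                search(end, words)
                words.pop()  # backtrack and try a longer word from here

    search(0, [])
    return sentences

toy_words = {"bob", "ate", "a", "tea", "green", "apple"}
for sentence in segment("bobateagreenapple", toy_words):
    print(sentence)
```

With this toy word set the search finds two sentences, which shows why some ranking step is needed on top of plain enumeration.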
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than it would to correct what this system might give you, let alone to program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them) and then to favor the ones with the longest words, because words of 2, 3 or 4 characters can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To enumerate all possible segmentations I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).
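The ranking idea can be sketched as a sort key. This is not the CodeProject code; it is a small Python illustration under one assumption: "longest words" is interpreted here as "fewest words for the same number of characters covered". The name rank_segmentations is made up.

```python
def rank_segmentations(candidates):
    """Order candidates: most characters covered first, then fewest words."""
    def key(words):
        covered = sum(len(word) for word in words)
        # Negate coverage so larger coverage sorts first; fewer words break ties.
        return (-covered, len(words))
    return sorted(candidates, key=key)

ranked = rank_segmentations([["a", "t", "e"], ["ate"], ["at"]])
print(ranked)
```

Two candidates with the same coverage and the same word count still tie under this key, so a real implementation would need a further tie-breaker.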