How to awk to read a dictionary and replace words in a file? - awk

We have a source file ("source-A") that looks like this (if you see blue text, it comes from stackoverflow, not the text file):
The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences
Each sentence in "source-A" is on its own line and terminates with a newline (\n)
We have a dictionary/conversion file ("converse-B") that looks like this:
aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes
"converse-B" is a two column, tab delimited file.
Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n)
How to read "converse-B", and replace terms in "source-A" where a term in "converse-B" column-1 is replaced with the term in column-2, and then write to an output file ("output-C")?
For example, the "output-C" would look like this:
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
The tricky part is the term potato.
If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.
In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space separated, unambiguous words.
An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.

sed probably suits better since since it's only phrase/word replacements. Note that if the same words appear in multiple phrases first come first serve; so change your dictionary order accordingly.
$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
file substitute sed statement converts dictionary entries into sed expressions and the main sed uses them for the content replacements.
NB: Note that production quality script should take of word cases and also word boundaries to eliminate unwanted substring substitution, which are ignored here.

Related

ANSI escape codes in GNU Smalltalk

I'm trying to make a console-based program that makes use of ANSI escape codes with GNU Smalltalk. I can't seem to figure out how to go about printing a string object formatted with ANSI escape codes. I've tried the following.
'\x1b[31mHi' displayNl
This prints the entire string, including the escape code, without any formatting. I would have expected this to print "Hi" in red (and then everything else in the console after that, as I didn't reset the color.)
After googling a bit, I was able to find a couple issues on mailing lists where people were trying to produce things like newlines using "\n". Most of the answers were using the Transcript object's cr method, but I didn't find anything about colors in the textCollector class.
It looks like it shouldn't be all that hard to create my own module in C to achieve this functionality, but I'd like to know if there's a better way first.
I'm aware of the ncurses bindings, but I'm not sure that'd be practical for just making certain pieces of text in the program colored. So, is there a standard way of outputting colored text to the terminal in GNU Smalltalk using ANSI escape sequences?
Ended up getting an answer on the GNU Smalltalk mailing list. Looks like you can use an interpolation operator to achieve this.
For example ('%1[31mHi' % #($<16r1B>)) displayNl. would change the color to red, and ('%1[34mHi' % #($<16r1B>)) displayNl. would change the color to blue.
Basically, the % operator looks for a sequences that look like "%(number)" and replaces them with the objects in the array to the right of the operator. In our case, the array has one item, which is the ascii escape character in hexadecimal. So the "%1" in "%1[31mHi' is being replaced with the escape character, and then printed.
(This answer was stolen almost verbatim from Paolo on the GNU Smalltalk mailing list.)

In awk, is there a functional difference between putting conditions outside the brackets or inside?

As far as I can tell, this:
$2 > 50{print}
functions exactly the same way as this:
{if ($2 > 50) print}
When should I use one vs the other, or is it a matter of style?
2nd one is much more readable in my opinion (especially for non awk gurus who may have to maintain you code).
It's mostly style. The first example is canonical AWK. There are, of course, times when an if statement is required.
Consider that:
$2 > 50 {foo; print}
takes 20 characters versus 27 for the following:
{if ($2 > 50) {foo; print}}
If you remove spaces, it would be 16 versus 22 characters.
So the former is more desirable if you are interested in economy (both in terms of actual characters and visually).
As potong points out, the if form requires curly braces if there are more than one statement (as I have shown in my second example above). If there is only one statement, they are optional. This is the source of subtle bugs if it's written without and additional statements are later added without adding braces. Also, for the reader of a script, the intent is much clearer if the braces are included.
Yes, they are the same. As with most syntax alternatives, there are trade offs.
I mostly agree with John3136 that 2nd form has a higher likelyhood of being understood by someone with background in c, c++, C#, java, and others.
And as Dennis Williamson points out, there are times you have to use the if (...) form.
For one-lines, I think the first form is useful, acceptable, and helps reinforce the primary design of awk, i.e. [pattern]-[action] (being optional or providing a default behaviour).
If you look at comp.lang.awk, you'll see that the long-timers mostly prefer style 1.
Large blocks of code in style 1, to my eye, are difficult to read, but do have the look of lex code.
So.... depends ;-) .... are you writing code that will ultimately be maintained by likely non-awk experts? Then stick w style 2. Writing for yourself? Use both so you don't get rusty!
I hope this helps.
Yes they are the same.
On the first form, you can omit { print } because it's the default action. So, that works too:
$2 > 50

How do I create a sub arrary in awk?

Given a list like:
Dog bone
Cat catnip
Human ipad
Dog collar
Dog collar
Cat collar
Human car
Human laptop
Cat catnip
Human ipad
How can I get results like this, using awk:
Dog bone 1
Dog collar 2
Cat catnip 2
Cat collar 1
Human car 1
Human laptop 1
Human ipad 2
Do I need a sub array? It seems to me like a need an array of "owners" which is populated by arrays of "things."
I'd like to use awk to do this, as this is a subscript of another program in awk, and for now, I'd rather not create a separate program.
By the way, I can already do it using sort and grep -c, and a few other pipes, but I really won't be able to do that on gigantic data files, as it would be too slow. Awk is generally much faster for this kind of thing, I'm told.
Thanks,
Kevin
EDIT: Be aware, that the columns are actually not next to eachother like this, in the real file, they are more like column $8 and $11. I say this because I suppose if they were next to eachother I could incorporate an awk regex ~/Dog\ Collar/ or something. But I won't have that option. -thanks!
awk does not have multi-dimensional arrays, but you can manage by constructing 2D-ish array keys:
awk '{count[$1 " " $2]++} END {for (key in count) print key, count[key]}' | sort
which, from your input, outputs
Cat catnip 2
Cat collar 1
Dog bone 1
Dog collar 2
Human car 1
Human ipad 2
Human laptop 1
Here, I use a space to separate the key values. If your data contains spaces, you can use some other character that does not appear in your input. I typically use array[$a FS $b] when I have a specific field separator, since that's guaranteed not to appear in the field values.
GNU Awk has some support for multi-dimensional arrays, but it's really just cleverly concatenating keys to form a sort of compound key.
I'd recommend learning Perl, which will be fairly familiar to you if you like awk, but Perl supports true Lists of Lists. In general, Perl will take you much further than awk.
Re your comment:
I'm not trying to be superior. I understand you asked how to accomplish a task with a specific tool, awk. I did give a link to the documentation for simulating multi-dimensional arrays in awk. But awk doesn't do that task well, and it was effectively replaced by Perl nearly 20 years ago.
If you ask how to cross a lake on a bicycle, and I tell you it'll be easier in a boat, I don't think that's unreasonable. If I tell you it'll be easier to first build a bridge, or first invent a Star Trek transporter, then that would be unreasonable.

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote beings with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = #"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthly amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
#"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
#"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.

Is there a tool to clean the output of the script(1) tool?

script(1) is a tool for keeping a record of an interactive terminal session; by default it writes to the file transcript. My problem is that I use ksh93, which has readline features, and so the transcript is mucked up with all sorts of terminal escape sequences and it can be very difficult to reconstruct the command that was actually executed. Not to mention the stray ^M's and the like.
I'm looking for a tool that will read a transcript file written by script, remove all the junk, and reconstruct what the shell thought it was executing, so I have something that shows $PS1 and the commands actually executed. Failing that, I'm looking for suggestions on how to write such a tool, ideally using knowledge from the terminfo database, or failing that, just using ANSI escape sequences.
A cheat that looks in shell history, as long as it really really works, would also be acceptable.
Doesn't cat/more work by default for browsing the transcript? Do you intend to create a script out of the commands actually executed (which in my experience can be dangerous)?
Anyway, 3 years without an answer, so I will give it a shot with an incomplete solution. If your are only interested in the commands actually typed, remove the non-printable characters, then replace PS1' with something readable and unique, and grep for that unique string. Like this:
$ sed -i 's/[^[:print:]]//g' transcript
$ sed 's/]0;cartman#southpark: ~cartman#southpark:~/CARTMAN/g' transcript | grep CARTMAN
Explanation: After first sed, PS1' can be taken from one of the first few lines of the transcript file, as is -- PS1' is different from PS1 -- and can be modified with a unique readable string ("CARTMAN" here). Note that the dollar sign at the end of the prompt was left out intentionally.
In the few examples that I tried, this didn't solve everything but took care of most issues.
This is essentially the same question asked recently in Can I programmatically “burn in” ANSI control codes to a file using unix utils? -- removing all nonprinting characters will not fix
embedded escape sequences
backspace/overstriking for underlining
use of carriage-returns for overstriking