Simplest way to find text by regex and replace by lookup table - awk

A legacy web application needs to be internationalized. Error messages are currently written inside source code in this way:
addErrorMessage("some text here");
These calls can easily be found and extracted with a regex. They should be replaced with something like this:
addErrorMessage(ResourceBundle.getBundle("/Bundle", locale).getString("key for text here"));
The correspondence between key for text here and some text here will be in a .properties file.
According to a Linux guru this can be achieved with awk, but I don't know anything about it. I could write a small application to do the task, but that seems like overkill. Are there IDE plugins or existing applications for this purpose?
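For example, the corresponding Bundle.properties entry could look something like this (the key name here is just a placeholder):
key.for.text.here=some text here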

awk -v TextOrg='some text here' -v key='key for text here' '
index( $0, "addErrorMessage(\"" TextOrg "\")") {
   gsub( "addErrorMessage[(]\"" TextOrg "\"[)]" \
       , "addErrorMessage(ResourceBundle.getBundle(\"/Bundle\", locale).getString(\"" key "\"))")
   }
1
' YourFile
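For a quick check, a line like
addErrorMessage("some text here");
in YourFile comes out as
addErrorMessage(ResourceBundle.getBundle("/Bundle", locale).getString("key for text here"));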
This is one way to do it for one specific pair. Be careful with:
the assignment of values (the -v ... values are subject to shell quoting/interpretation here)
gsub uses a regex to find the text, so any regex metacharacters in the text must be escaped (e.g. "this f***ing text" -> "this f\*\*\*ing text"); see the escaping sketch after the Trad.txt example below
You will certainly want to do this for several pairs.
Here is a version driven by a file containing the pairs.
Assuming Trad.txt is a file containing a series of 2-line pairs: the first line is the original text, the second the key (this avoids a separator character that would need complex escaping if it appeared in the text).
ex: Trad.txt
some text
key text
other text
other key
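As a side note on the escaping caveat above: if the texts do have to go through a regex (as in gsub), a small helper along these lines could neutralise the metacharacters first (a sketch only; it is not wired into the script below, which side-steps the problem by using index() instead):
# return s with every ERE metacharacter preceded by a backslash
function esc_re(s,    out, i, c) {
    out = ""
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (index("\\^$.[]|()*+?{}", c))
            out = out "\\"
        out = out c
    }
    return out
}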
Sample code (simple, no exhaustive error handling). Not tested, but it shows the concept with awk:
awk '
# for the first file only
FNR == NR {
    # keep the first line of each pair in memory as the text to change
    if ( NR % 2 ) TextOrg = $0
    else {
        # load the corresponding key into an array (indexed by the text to change)
        Key[ TextOrg] = $0
        Len[ TextOrg] = length( "addErrorMessage(\"" TextOrg "\")" )
    }
    # do not go further in the script for this line
    next
}
# this point and below is reached only for the second file
# if addErrorMessage is found
/addErrorMessage\(".*"\)/ {
    # try each pair for a possible change (a more complex loop checking only the
    # necessary replacements would be faster, but this one does the job)
    for ( TextOrg in Key) {
        # index() is used to avoid regex interpretation
        # assuming for this sample code that there is at most 1 replacement per line
        # (a loop would normally be needed)
        Here = index( $0, "addErrorMessage(\"" TextOrg "\")")
        if ( Here > 0) {
            # got a match: rebuild the full line with the replacement in the middle
            $0 = substr( $0, 1, Here - 1) \
                 "addErrorMessage(ResourceBundle.getBundle(\"/Bundle\", locale).getString(\"" Key[ TextOrg] "\"))" \
                 substr( $0, Here + Len[ TextOrg])
        }
    }
}
# print the line in its current state (modified or not)
1
' Trad.txt YourFile
Finally, this is a workaround solution because lots of special cases can occur: for example a line like "ref: function addErrorMessage(\" ...\") bla bla" will be a problem, spaces inside the () are not handled here, nor are lines split inside the (), ...

awk replace string with another with new lines ( one time ) after finding another string

I wanted to replace ___SIGNATURE___ with an HTML code signature after the first occurrence of "text/html", and for only that one ___SIGNATURE___ string. Any remaining ___SIGNATURE___ tags should simply be removed.
I am processing an email message where the header has a multipart boundary so there are two body parts, one with text/plain and another with text/html and the ___SIGNATURE___ tag exists in both.
So the relevant part of my script looks like this:
awk -v signature="$(cat $disclaimer_file)" '/text\/html/ {html=1} html==1 && !swap {swap=sub(/___SIGNATURE___/, signature)} 1' in.$$ > temp.mail && mv temp.mail in.$$
sed -i "s/charset=us-ascii/charset=utf-8/1;s/___SIGNATURE___//" in.$$
It works, but is that the optimal solution?
I have used altermime before, but it was not a good solution for my case.
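For context, the message structure is roughly like this (heavily trimmed; the boundary string and body text are placeholders):
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset=us-ascii

plain-text body ...
___SIGNATURE___

--BOUNDARY
Content-Type: text/html; charset=us-ascii

HTML body ...
___SIGNATURE___

--BOUNDARY--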
Without access to sample messages, it's hard to predict what exactly will work, and whether we need to properly parse the MIME structures or if we can just blindly treat the message as text.
In the latter case, refactoring to something like
awk 'NR==FNR { signature = signature ORS $0; next }
    { sub(/charset="?[Uu][Ss]-[Aa][Ss][Cc][Ii][Ii]"?/, "charset=\"utf-8\"") }
    /text\/html/  { html = 1 }
    /text\/plain/ { html = 0 }
    /___SIGNATURE___/ {
        if (html && signature) {
            # substr because there is an ORS before the text
            sub(/___SIGNATURE___/, substr(signature, 2))
            signature = ""
        } else
            sub(/___SIGNATURE___/, "")
    } 1' "$disclaimer_file" "in.$$"
would avoid invoking both Awk and sed (and cat, and the quite pesky temporary file), where just Awk can reasonably and quite comfortably do all the work.
If you need a proper MIME parser, I would look into writing a simple Python script. The email library in Python 3.6+ is quite easy to use and flexible (but avoid copy/pasting old code which uses raw MIMEMultipart etc; you want to use the (no longer very) new EmailMessage class).

Double-quote match between pattern1 and first pattern2 and replace newlines within a file (output whole file)

Problem definition
I’d like to capture a multiline substring from a file (or STDIN), double-quote it and replace all newlines (\n → \\n) within it (not outside of it).
There might be several start and end patterns in the file; I want to modify all instances. Some instances might be on a single line.
I prefer GNU sed to GNU awk (just because I use sed more than awk); however, it does not matter which one is used in the solution, just make it work in Bash on Linux.
Example of the matched substring
date: await (async (y: number): Promise<Dayjs> => {
const firstDayOfOT = dayjs((await Seasons.earlyOrdinaryTime(y, config.epiphanyOnSunday))[0].date);
return firstDayOfOT.add(2, 'week').startOf('week');
})(year),
What I have tried
I have tried the following command, but sed is greedy, i.e. it matches the last end pattern and therefore does not work when there is more than one instance. Also note that this command only double-quotes the captured match; it does not replace newlines.
sed -n '1h; 1!H; ${ g; s/date: \(await.*[(]year[)]\)/date: "\1"/p }' file
Bonus points
While above I have talked about a single start/end pattern pair, there are actually three (plus two one-liner variations of one of them); see below. Note that I have removed the indentation (it does not actually matter, as it can easily be matched by ^\s*).
If you are eager enough to help me out, you can include these patterns too.
date: ((y: number): dayjs.Dayjs => {
const date = dayjs.utc(`${y}-11-1`);
if (date.day() === 6) {
return dayjs.utc(`${y}-11-2`);
} else {
return date;
}
})(year),
date: await (async (y: number): Promise<Dayjs> => {
const firstDayOfOT = dayjs((await Seasons.earlyOrdinaryTime(y,
config.epiphanyOnSunday))[0].date);
return firstDayOfOT.add(2, 'week').startOf('week');
})(year)
date: ((): dayjs.Dayjs => {
const firstDay = dayjs.utc(`${year}-1-1`);
const feastDay = 22 - (firstDay.day() == 0 ? 7 : firstDay.day());
return dayjs.utc(`${year}-1-${feastDay}`);
})(),
// One-liners
date: ((y: number): Dayjs => Dates.pentecostSunday(y).add(1, 'day'))(year),
date: ((y: number): dayjs.Dayjs => Dates.pentecostSunday(y).add(1, 'day'))(year),
Why do I need this?
I’d like to parse the calendar files from here (all those files except for index.ts and test.ts). I wish jq could parse TypeScript objects (or whatever they are called), but because it can’t do that, I want to ‘convert’ it using hjson to a proper JSON string and then parse it using jq.
Now, in order to make the file(s) ‘convertable’ by hjson, I need to do the following:
remove lines above const _dates: Array<RomcalLiturgicalDayInput> (that line included);
remove lines below Get localized liturgical day names (that line included);
format arrays:
hjson does not like the square brackets on the same line as the array items (they must be on separate lines);
each array item must be on a separate line;
when not quoted, each array item must not be followed by a comma, otherwise that comma becomes part of the array item;
no value can be a multiline one unless newlines are replaced by \\n;
if special characters (interpreted by TS/JS, e.g. {}[], backtick or even a comma) are included, that value must be quoted.
Except for the final two, I have already done this. See the following command (general.ts is from here). Note that I am sure that this could be optimised and perfected.
sed '1,/const _dates: Array<RomcalLiturgicalDayInput> = \[/d;/Get localized liturgical day names/,$d' general.ts | \
sed -z 's/^/[\n/;s/\];/]/; \
s/\([^\n]\)\(Titles.[^, \n]*\),*/\1\n\2/g; \
s/\(Titles.[^, \n]*\),*\s*\]/\1\n]/g; \
s/\([^\n]\)\(LiturgicalColors.[^, \n]*\),*/\1\n\2/g; \
s/\(LiturgicalColors.[^, \n]*\),*\s*\]/\1\n]/g; \
s/\(cycles: {\) \(celebrationCycle: CelebrationsCycle.TEMPORALE\) \(}\)/\1\n\2\n\3/g' | \
sed -n '1h; 1!H; ${ g; s/date: \(await.*[(]year[)]\)/date: "\1"/p }' | less
This might work for you (GNU sed):
sed '/^date: \(await\|((y: number):\|(():\)/{
:a;/\((year)\|()\),\?$/!{N;ba};s/\n/\\n/g;s/.*/"&"/}' file
Gather up lines beginning with date: followed by either await, ((y: number): or (():, through to a line ending in (year) or () with or without a trailing ,. Then replace all \n by \\n and surround the collection with double quotes.
This may need some tweaking to satisfy all your requirements.
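The same commands spelled out as a multi-line GNU sed script with comments (GNU sed accepts # comment lines inside a script):
sed '
# a line beginning with date: followed by await, ((y: number): or (():
/^date: \(await\|((y: number):\|(():\)/{
:a
# keep appending the next line until the block ends in (year) or (), with or without a trailing comma
/\((year)\|()\),\?$/!{N;ba}
# turn the embedded newlines into literal \n and wrap the whole block in double quotes
s/\n/\\n/g
s/.*/"&"/
}' file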

How to insert variable in user-defined character class?

What I am trying to do is to allow the program to define a character class depending on the text encountered. However, <[]> takes characters literally, and the following yields an error:
my $all1Line = slurp "htmlFile";
my @a = ($all1Line ~~ m:g/ (\" || \') ~ $0 {} :my $marker = $0; http <-[ $marker ]>*? page <-[ $marker ]>*? /); # error: $marker is taken literally as $ m a r k e r
I wanted to match all links that are the format "https://foo?page=0?ssl=1" or 'http ... page ...'
Based on your example code and text, I'm not entirely sure what your source data looks like, so I can't provide much more detailed information. That said, based on how to match characters from an earlier part of the match, the easiest way to do this is with array matching:
my $input = "(abc)aaaaaa(def)ddee(ghi)gihgih(jkl)mnmnoo";
my @output = $input ~~ m:g/
    :my @valid;                 # initialize variable in regex scope
    '(' ~ ')' $<valid>=(.*?)    # capture initial text
    { @valid = $<valid>.comb }  # split the text into characters
    $<text>=(@valid+)           # capture text, so long as it contains the characters
/;
say @output;
.say for @output.map(*<text>.Str);
The output of which is
[「(abc)aaaaaa」
valid => 「abc」
text => 「aaaaaa」 「(def)ddee」
valid => 「def」
text => 「ddee」 「(ghi)gihgih」
valid => 「ghi」
text => 「gihgih」]
aaaaaa
ddee
gihgih
Alternatively, you could store the entire character class definition in a variable and reference the variable as <$marker-char-class>, or, if you want to avoid that, you can define it all inline as code to be interpreted as a regex with <{ '<[' ~ $marker ~ ']>' }>. Note that both methods are subject to the same problem: you're constructing the character class from the regex syntax, which may require escape characters or particular ordering, and so is definitely suboptimal.
If it's something you'll do very often and not just ad hoc, you could also define your own regex or token for it, but that's probably overkill and would serve better as its own question.

Parsing and creating new arguments with getline AWK code

I am writing a pretty long AWK program (NOT terminal script) to parse through a network trace file. I have a situation where the next line in the trace file is ALWAYS a certain type of 'receive' (3 possible types) - however, I only want AWK to handle/print on one type. In short, I want to tell AWK if the next line contains a certain receive type, do not include it. It is my understanding that getline is the best way to go about this.
I have tried a couple of different variations of getline and getline VAR from the manual, but I still cannot seem to search through and reference fields in the next line like I want. Updated from edit:
if ((event=="r") && (hopSource == hopDest)) {
getline x
if ((x $31 =="arp") || (x $35 =="AODV")) {
#printf("Badline %s %s \n", $31, $35)
}
else {
macLinkRec++;
#printf("MAC Link Recieved from HEAD - %d to MEMBER %d \n", messageSource, messageDest)
}
}
I am using the "Badline" printf just as a marker to see what is going on. I fully understand how to restructure the code once I get the search and reference correct. I am also able to print the correct 'next' lines. However, I would expect to be able to search through the next line and create new arguments based on what is contained in the next line. How do I search a 'next line' based on an argument in AWK? How do I reference fields in that line to create new arguments?
Final note, the 'next line' number of fields (NF) varies, but I feel that the $35 field reference should handle any problems there.
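For reference: getline x only stores the next record in the variable x; it does not split it into fields, so $31 and $35 still refer to the current line, and x $31 simply concatenates the two values. A minimal sketch of one way to inspect the lookahead line (the field numbers, variable names and counters are taken from the question, so they are assumptions about the trace format):
if ((event == "r") && (hopSource == hopDest)) {
    if ((getline nextline) > 0) {          # read the following record into a variable
        n = split(nextline, f)             # split it into f[1]..f[n] ourselves
        if (f[31] == "arp" || f[35] == "AODV") {
            # unwanted receive type on the next line: skip it
        }
        else {
            macLinkRec++
        }
    }
}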

Mark duplicate headers in a fasta file

I have a big Fasta file which I want to modify. It basically consists of many sequences with headers that start with ">". My problem is that some of the headers are not unique, even though the sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big Fasta file and append numbers to the duplicates, so that I know which duplicate it is (1,2,3,...,x). When a new set of duplicates is found (one with a different header), the counter should start from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
Thanks in advance for the help.
An awk script:
BEGIN {
    OFS = "\n";
    ORS = RS = ">";
}
{
    name = $1;
    $1 = "";
    suffix = names[name] ? "-" names[name] : "";
    print name suffix $0, "\n";
    names[name]++;
}
The above uses ">" as the record separator and takes the first field as the header name that can be duplicated. For each record it prints, it appends a suffix to the header name counting how many times that name has already appeared (i.e. '-1' for the first duplicate, '-2' for the second, ...).
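To run it over every Fasta file in a directory, as asked, a simple shell loop will do (a sketch; mark_dups.awk is a hypothetical file holding the script above, and the .fasta extension is an assumption about your file names):
for f in *.fasta; do
    awk -f mark_dups.awk "$f" > "${f%.fasta}.renamed.fasta"
done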