How can sed replace but skip the replaced characters in the pipeline?

I'm looking to escape some characters, ( and ); their respective escape codes are (40) and (41).
echo 'Hello (world)' | sed 's/(/(40)/g;s/)/(41)/g'
This code fails with Hello (40(41)world(41) because the second substitution also processes the output of the first. Is there any way I can skip the replacement characters, or use conditional branches here? I don't want to use a temporary placeholder character (as the input sequence could contain anything).

All you need is:
$ echo 'Hello (world)' | sed 's/(/(40\n/g; s/)/(41)/g; s/\n/)/g'
Hello (40)world(41)
The above is safe because \n can't be present in the input since sed reads one line at a time. With some seds you might need to use a backslash followed by a literal newline or $'\n' instead of just \n.
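For example, the portable form with a literal newline in the script would look like this (a sketch of the same idea; exact escaping varies between GNU and BSD seds, so test on yours):
$ echo 'Hello (world)' | sed 's/(/(40\
/g; s/)/(41)/g; s/\
/)/g'
Hello (40)world(41)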
Given the answer you posted, though, this may be what you really want (uses GNU awk for ord(), multi-char RS, and RT):
$ cat tst.awk
@load "ordchr"
BEGIN { RS = "[][(){}]"; ORS="" }
{ print $0 ( RT=="" ? "" : "(" ord(RT) ")" ) }
$ echo 'Hello (world) foo [bar] other {stuff} etc.' | awk -f tst.awk
Hello (40)world(41) foo (91)bar(93) other (123)stuff(125) etc.
If you have an older gawk that doesn't support @load then get a new one, but if that's impossible for some reason then just create an array of the values, e.g.:
$ cat tst.awk
BEGIN {
    RS = "[][(){}]"
    ORS = ""
    for (i=0; i<=255; i++) {
        char = sprintf("%c",i)
        map[char] = "(" i ")"
    }
}
{ print $0 ( RT=="" ? "" : map[RT] ) }
$ echo 'Hello (world) foo [bar] other {stuff} etc.' | awk -f tst.awk
Hello (40)world(41) foo (91)bar(93) other (123)stuff(125) etc.
EDIT: timing data
Given a file that has these 10 lines:
$ head -10 file1m
When (chapman) billies leave [the] street, And drouthy {neibors}, neibors, meet;
As market days are wearing late, And folk begin to [tak] the gate,
While (we) sit bousing {at} the nappy, An' getting [fou] and unco happy,
We think na on the [lang] Scots (miles), The mosses, {waters}, slaps and stiles,
That lie between us and our hame, Where sits our sulky, sullen dame,
Gathering her [brows] like gathering storm, (Nursing) her wrath to keep it warm.
This truth fand honest Tam o' Shanter,
As he frae Ayr ae night did canter:
(Auld Ayr, wham ne'er a town surpasses,
For honest men and bonie lasses).
repeated to a total of 1 million lines, 10.5 million words, 60.4 million bytes:
$ wc file1m
1000000 10500000 60400000 file1m
the 3rd-run timing stats for the sed script and both awk scripts above are:
$ time sed 's/(/(40\n/g; s/)/(41)/g; s/\n/)/g; s/\[/(91)/g; s/\]/(93)/g; s/{/(123)/g; s/}/(125)/g;' file1m > sed.out
real 0m7.488s
user 0m7.378s
sys 0m0.093s
$ cat function.awk
@load "ordchr"
BEGIN { RS = "[][(){}]"; ORS="" }
{ print $0 ( RT=="" ? "" : "(" ord(RT) ")" ) }
$ time awk -f function.awk file1m > awk_function.out
real 0m7.426s
user 0m7.269s
sys 0m0.155s
$ cat array.awk
BEGIN {
    RS = "[][(){}]"
    ORS = ""
    for (i=0; i<=255; i++) {
        char = sprintf("%c",i)
        map[char] = "(" i ")"
    }
}
{ print $0 ( RT=="" ? "" : map[RT] ) }
$ time awk -f array.awk file1m > awk_array.out
real 0m4.758s
user 0m4.648s
sys 0m0.092s
I verified that all 3 scripts produce the same, successfully modified output:
$ head -10 sed.out
When (40)chapman(41) billies leave (91)the(93) street, And drouthy (123)neibors(125), neibors, meet;
As market days are wearing late, And folk begin to (91)tak(93) the gate,
While (40)we(41) sit bousing (123)at(125) the nappy, An' getting (91)fou(93) and unco happy,
We think na on the (91)lang(93) Scots (40)miles(41), The mosses, (123)waters(125), slaps and stiles,
That lie between us and our hame, Where sits our sulky, sullen dame,
Gathering her (91)brows(93) like gathering storm, (40)Nursing(41) her wrath to keep it warm.
This truth fand honest Tam o' Shanter,
As he frae Ayr ae night did canter:
(40)Auld Ayr, wham ne'er a town surpasses,
For honest men and bonie lasses(41).
$ wc sed.out
1000000 10500000 68800000 sed.out
$ diff sed.out awk_function.out
$ diff sed.out awk_array.out
$

The problem is solved by creating an ord function in awk. It doesn't appear that sed has this functionality.
#! /bin/sh
awk '
BEGIN { _ord_init() }
function _ord_init(low, high, i, t) {
    low = sprintf("%c", 7)  # BEL is ascii 7
    if (low == "\a") {      # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        low = 128           # ascii with mark parity
        high = 255
    } else {                # ebcdic
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
function ord(str, c) {
    c = substr(str, 1, 1)
    return _ord_[c]
}
{
    split($0, array, "\\[|\\]|\\(|\\)|\\{|\\}", separators)
    len = length(array)
    for (i = 1; i < len; ++i) {
        printf "%s(%s)", array[i], ord(separators[i])
    }
    printf "%s\n", array[len]  # \n added so each input line produces one output line
}
'

You can do this in perl, which supports one-liners and look-behind in regular expressions. Simply require the close-paren not be part of an existing escape:
$ echo 'Hello (world)' | perl -pe 's/\(/(40)/g; s/(?<!\(40)\)/(41)/g'
Hello (40)world(41)

It's tricky in sed but easy in any language with associative arrays.
perl -pe 'BEGIN { %h = ("(" => "(40)", ")" => "(41)" );
$r = join("|", map { quotemeta } keys %h); }
s/($r)/$h{$1}/g'
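Extending the table to the other bracket characters is then trivial, which is the point of the associative-array approach; for example (a sketch, not from the original answer):
$ echo 'Hello (world) [bar] {stuff}' | perl -pe 'BEGIN { %h = map { $_ => sprintf("(%d)", ord) } split //, "()[]{}";
$r = join("|", map { quotemeta } keys %h); }
s/($r)/$h{$1}/g'
Hello (40)world(41) (91)bar(93) (123)stuff(125)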

Related

Run awk in parallel

I have the code below, which works successfully. It is used to parse and clean log files (very large in size) and output into smaller files. The output filename is the first 2 characters of each line. However, if there is a special character in these 2 characters, it needs to be replaced with a '_', to ensure there is no illegal character in the filename.
This would take about 12-14 mins to process 1 GB worth of logs (on my laptop). Can this be made faster?
Is it possible to run this in parallel? I am aware I could do }' "$FILE" &. However, I tested and that does not help much. Is it possible to ask awk to output in parallel - what is the equivalent of print $0 >> Fpath & ?
Any help will be appreciated.
Sample log file
"email1#foo.com:datahere2
email2#foo.com:datahere2
email3#foo.com datahere2
email5#foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii#row.com;data.is.junk-Œœ
email3#foo.com:datahere2
Expected Output
# cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
# cat errorfile
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
Code:
#!/bin/sh
pushd "_test2" > /dev/null
for FILE in *
do
    awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
        }
        else
            print $0 >> "errorfile"
    }' "$FILE"
done
popd > /dev/null
Look up the man page for the GNU tool named parallel if you want to run things in parallel, but we can vastly improve the execution speed just by improving your script.
Your current script makes 2 mistakes that greatly impact efficiency:
Calling awk once per file instead of once for all files, and
Leaving all output files open while the script is running, so awk has to manage them.
You currently, essentially, do:
for file in *; do
awk '
{
Fpath = substr($1,1,2)
Fpath = gensub(/[^[:alnum:]]/,"_","g",Fpath)
print > Fpath
}
' "$file"
done
If you do this instead it'll run much faster:
sort * |
awk '
{ curr = substr($0,1,2) }
curr != prev {
close(Fpath)
Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
prev = curr
}
{ print > Fpath }
'
Having said that, you're manipulating your input lines before figuring out the output file names so - this is untested but I THINK your whole script should look like this:
#!/usr/bin/env bash
pushd "_test2" > /dev/null
awk '
{
gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
sub(/[,|;: \t]+/, ":")
if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:[\x00-\x7F]+$/) {
print
}
else {
print > "errorfile"
}
}
' * |
sort -t':' -k1,1 |
awk '
{ curr = substr($0,1,2) }
curr != prev {
close(Fpath)
Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
prev = curr
}
{ print > Fpath }
'
popd > /dev/null
Note the use of $0 instead of $1 in the scripts - that's another performance improvement because awk only does field splitting (which takes time of course) if you name specific fields in your script.
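If you want to see that cost yourself, a rough comparison like this illustrates it (illustrative sketch; bigfile stands for any large input file, and the effect is most visible in gawk, which splits fields lazily):
$ time awk '{ n++ } END { print n }' bigfile                 # never touches a field
$ time awk '{ if ($1 != "") n++ } END { print n }' bigfile   # forces field splitting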
Assuming multiple cores are available, the simple way to run in parallel is to use xargs. Depending on your config, try 2, 3, 4, 5, ... until you find the optimal number. This assumes that there are multiple input files, and that there is no single file that is much larger than all the others.
Notice the added fflush calls so that lines will not be split across writes. This has some negative performance impact, but is required, assuming you want the individual input files to be merged into a single set of output files. A possible workaround is to split each file and then merge the combined files.
#! /bin/sh
pushd "_test2" > /dev/null
ls * | xargs --max-procs=4 -L1 awk '
BEGIN {
    FS=":"
}
{
    gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
    $0=gensub("[,|;: \t]+",":",1,$0)
    if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
    {
        Fpath=tolower(substr($1,1,2))
        Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
        print $0 >> Fpath
        fflush(Fpath)
    }
    else
    {
        print $0 >> "errorfile"
        fflush("errorfile")
    }
}'
popd > /dev/null
From a practical point of view you might want to create an awk script, e.g., split.awk:
#!/usr/bin/awk -f
BEGIN {
    FS=":"
}
{
    # no shell quoting needed here since this is a standalone awk file
    gsub(/^[ \t"']+|[ \t"']+$/, "")
    $0=gensub("[,|;: \t]+",":",1,$0)
    if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
    {
        Fpath=tolower(substr($1,1,2))
        Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
        print $0 >> Fpath
    }
    else
        print $0 >> "errorfile"
}
And then the 'main' code will look like the below, which is easier to manage:
ls * | xargs --max-procs=4 -L1 awk -f split.awk

How can I use sed to generate an awk file?

How do I write sed commands to generate an awk file?
Here is my problem:
For example, I have a text file, A.txt which contains a word on each line.
app#
#ple
#ol#
The # marks where the rest of the word goes: app# shows that the word starts with 'app', #ple shows that the word ends with 'ple', and #ol# shows that the word has 'ol' in the middle.
I have to generate an awk file from sed commands; the awk file reads in another file, B.txt (which contains a word on each line), and increments the variables start, end, and middle.
How do I write sed commands whereby for each line in the text file, A.txt, it will generate awk code, i.e.
{ if ($1 ~ /^app/)
    start++;
}
For example, suppose the other file, B.txt, contains these words and is fed into the awk script:
application
people
bold
cold
The output would be: start = 1, end = 1, middle = 2.
I'd use ed over sed for this, actually.
A quick script that creates A.awk from A.txt and runs it on B.txt:
#!/bin/sh
ed -s A.txt <<'EOF'
1,$ s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
1,$ s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
1,$ s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
0 a
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
.
$ a
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
.
w A.awk
EOF
# awk -f A.awk B.txt would work too, but this demonstrates a self-contained awk script
chmod +x A.awk
./A.awk B.txt
Running it:
$ ./translate.sh
start = 1, end = 1, middle = 2
$ cat A.awk
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
$0 ~ /^app/ { start++ }
$0 ~ /ple$/ { end++ }
$0 ~ /.+ol.+/ { middle++ }
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
Note: This assumes that the middle patterns shouldn't match at the start or end of a line.
But here's an attempt using sed to create A.awk, putting all the sed commands in a file, as trying to do this as a one-liner using -e and getting all the escaping right is not something I feel up to at the moment:
Contents of makeA.sed:
s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
1 i\
#!/usr/bin/awk -f\
BEGIN { start = end = middle = 0 }
$ a\
END { printf "start = %d, end = %d, middle = %d\\n", start, end, middle }
Running it:
$ sed -f makeA.sed A.txt > A.awk
$ awk -f A.awk B.txt
start = 1, end = 1, middle = 2
Off the top of my head, and not tested:
/^#\(.*\)#$/s//{if ($1 ~ \/\1\/) { middle++; next }}/
/^\(.*\)#$/s//{if ($1 ~ \/^\1\/) { start++; next }}/
/^#\(.*\)$/s//{if ($1 ~ \/\1$\/) { end++; next }}/
The construct \(.*\) matches any text and saves it in a back-reference, then \1 recalls the back-reference. The empty pattern following the s command reuses the address pattern that matched the line (the inner slashes must be escaped since / is also the s delimiter). The #...# rule has to come first so the unanchored rules can't consume a middle pattern, and the next in the generated awk prevents the unanchored middle test from also firing on a line that already matched start or end.
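As a quick illustration of the empty-pattern trick (not from the original answer), the address regex is reused by s//, back-reference included:
$ printf 'app#\n' | sed '/^\(.*\)#$/s//start:\1/'
start:app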

How to merge lines using the awk command so that there is a specific number of fields in a line

I want to merge some rows in a file so that each resulting line contains 22 fields separated by ~.
Input file looks like this.
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and so on.
The first line looks fine. The 2nd and 3rd lines need to be merged so that they become one line with 22 fields. The 4th, 5th and 6th lines should be merged, and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB of data but the code I wrote (using a while loop) is taking too much time to execute. How can I solve this problem using an awk/sed command?
Code Used:
IFS=$'\n'
set -f
while read line
do
count_tild=`echo $line | grep -o '~' | wc -l`
if [ $count_tild == 21 ]
then
echo $line
else
checkLine
fi
done < file.txt
function checkLine
{
current_line=$line
read line1
next_line=$line1
new_line=`echo "$current_line$next_line"`
count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
if [ $count_tild_mod == 21 ]
then
echo "$new_line"
else
line=$new_line
checkLine
fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
awk -F '~' 'NF==1 { next } # Hack; see below
NF<22 {
for(i=1; i<=NF; i++) f[++a]=$i }
a==22 {
for(i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
a=0 }
NF==22
END {
if(a) for(i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~") }' file.txt > file.new
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
exit 1 }
The NF==1 block is a hack to bypass the weirdness of the completely empty line 5 in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
sub(/^[0-9]+\./,"")
gsub(/[[:space:]]+/,"")
$0 = prev $0
if ( NF == 22 ) {
print ++cnt "." $0
prev = ""
}
else {
prev = $0
}
}
$ awk -f tst.awk file
1.200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2.2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
3.209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on one line, nor do you exceed 22 in any concatenation of contiguous lines that each have fewer than 22 fields, just like you show in your sample input.
You can try this awk, which keeps appending the next line with getline until the record has 22 fields:
awk '
BEGIN {
    FS=OFS="~"
}
{
    while (NF<22) {
        if (NF==0)
            break
        a=$0
        getline
        $0=a$0
    }
    if (NF!=0)
        print
}
' infile
or this sed, which appends input lines with N until the pattern space holds 21 ~ separators, then deletes the embedded newlines:
sed -E '
:A
s/((.*~){21})([^~]*)/\1\3/
tB
N
bA
:B
s/\n//g
' infile

awk output format for average

I am computing average of many values and printing it using awk using following script.
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
but the problem is, it is printing the value as 1.2345e+05, which I do not want; I want it to print the values as rounded whole numbers, but I am unable to find where to set the output format.
EDIT: using {print "average,%3d = ",sum/NR} in place of {print sum/NR} is not helping, because it is printing "average,%3d 1.2345e+05".
You need printf instead of simply print; print is a much simpler routine and does not support format strings.
for j in *.txt; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2} END {printf j, i, "average %6d", sum/NR}' "$j"
done
echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
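For instance, this shows the difference between truncating with %d and rounding with %.0f (illustrative values):
$ awk 'BEGIN { x = 123456.78; printf "truncated: %d  rounded: %.0f\n", x, x }'
truncated: 123456  rounded: 123457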
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
    cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2"
    n = split(cmdstring, cmdarray)
    for (i = 1; i <= n; i++) {
        cmds[cmdarray[i]]
    }
}
$1 in cmds {
    sums[$1, FILENAME] += $2
    counts[$1, FILENAME]++
    files[FILENAME]
}
END {
    for (file in files) {
        for (cmd in cmds) {
            if ((cmd, file) in counts)   # skip commands that never appeared in this file
                printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
        }
    }
}' *.txt

Concatenating multiple lines with a discriminator

I have the input like this
Input:
a,b,c
d,e,f
g,h,i
k,l,m
n,o,p
q,r,s
I want to be able to concatenate the lines with a discriminator like "|".
Output:
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
The file has 1 million lines and I want to be able to concatenate lines as in the example above.
Any ideas about how to approach this?
@OP, if you want to group them for every 3 records:
$ awk 'ORS=(NR%3==0)?"\n":"|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
with Perl,
$ perl -lne 'print $_ if $\ = ($. % 3 == 0) ? "\n" : "|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
Since your tags include sed, here is a way to do it: N appends the next input line to the pattern space, so two Ns collect three lines, and the s command then replaces the embedded newlines with |.
sed 'N;N;s/\n/|/g' datafile
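For example, on the sample input (expected output):
$ sed 'N;N;s/\n/|/g' datafile
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s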
gawk:
BEGIN {
state=0
}
state==0 {
line=$0
state=1
next
}
state==1 {
line=line "|" $0
state=2
next
}
state==2 {
print line "|" $0
state=0
next
}
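Save it as, say, group3.awk (the name is just for illustration) and run it like this (expected output):
$ gawk -f group3.awk file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s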
If Perl is fine, you can try:
$i = 0;
$line = "";
while (<>) {
    chomp;
    $line .= $_;
    $i++;
    if ($i % 3 == 0) {      # every third line, print the group
        print "$line\n";
        $line = "";
    } else {
        $line .= "|";
    }
}
print "$line\n" if $line ne "";  # flush a final partial group
to run:
perl perlfile.pl 1millionlinesfile.txt
$ paste -sd'|' input | sed -re 's/([^|]+\|[^|]+\|[^|]+)\|/\1\n/g'
With paste, we join the lines together, and then sed dices them up. The pattern grabs runs of 3 pipe-terminated fields and replaces their respective final pipes with newlines.
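On the sample input this should produce (expected output, not a verified run):
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s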
With Perl:
#! /usr/bin/perl -ln
push @a => $_;
if (@a == 3) {
    print join "|" => @a;
    @a = ();
}
END { print join "|" => @a if @a }