How to "do something" for each input text files - awk

Say that I read in the following information stored in three different text files (there can be many more):
File 1
1 2 rt 45
2 3 er 44
File 2
rf r 4 5
3 er 4 t
er t yu 4
File 3
er tyu 3er 3r
der 4r 5e
edr rty tyu 4r
edr 5t yt5 45
When I read in this information I want the data from each file to go into its own array and be printed separately; right now everything is printed together.
This is the script I currently have, which prints all the information at once:
{
TESTd[NR-1] = $2; g++
}
END {
for (i = 0 ; i <= g-1; i ++ ) {
print " [\"" TESTd[i] "\"]"
}
print " _____"
}
But is there a way to read in multiple files and do this for each text file separately? For example, instead of getting this output when running awk -f test.awk 1.txt 2.txt 3.txt:
["2"]
["3"]
["r"]
["er"]
["t"]
["tyu"]
["4r"]
["rty"]
["5t"]
_____
I would like to get this output:
["2"]
["3"]
_____
["r"]
["er"]
["t"]
_____
["tyu"]
["4r"]
["rty"]
["5t"]
_____
And reading in one file at a time is preferably not an option here, since I will have around 30 text files.
EDIT: I want to do this in awk if possible, because I'm going to do something like this:
{
PRINTONCE[NR-1] = $2; g++
PRINTONEATTIME[NR-1] = $3
}
END {
#Do this for all arguments once
for (i = 0 ; i <= g-1; i ++ ) {
print " [\"" PRINTONCE[i] "\"] \n"
}
print " _____"
#Do this for loop for every .txt file that is read in as an argument
#for(j=0;j<args.length;j++){
for (i = 0 ; i <= g-1; i ++ ) {
print " [\"" PRINTONEATTIME[i] "\"] \n"
}
print " _____"
}

From what I understand, you have an awk script that works, and you want to run it on many files with something (a blank line or _____) in between so you can tell which output came from which file.
Try this bash script:
dir=~/*.txt #all txt files in ~(home) directory
for f in $dir
do
echo "File is $f"
awk 'BEGIN{print "Hello"}' "$f" #your awk code will take the file $f as input.
echo "------------------"; echo;
done
Also, if you do not want to do this to all files you can write the for loop as for f in 1.txt 2.txt 3.txt.

If you don't want to do it in awk directly, you can call it like this from bash or zsh, for example:
for fic in test*.txt; do awk -f test.awk "$fic"; done

It's quite simple to do it directly in awk:
# define a function to print out the array
function dump(array, n) {
for (i = 0 ; i <= n-1; i ++ ) {
print " [\"" array[i] "\"]"
}
print " _____"
}
# dump and reset when starting a new file
FNR==1 && NR!=1 {
dump(TESTd, g)
delete TESTd
g = 0
}
# add data to the array
{
TESTd[FNR-1] = $2; g++
}
# dump at the end
END {
dump(TESTd, g)
}
N.B. using delete TESTd to clear the whole array is a non-standard gawk feature (the portable equivalent is split("", TESTd)), but the question is tagged gawk so I assumed it's OK to use it.
Alternatively you could use one or more of ARGIND, ARGV, ARGC or FILENAME to distinguish the different files.
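For example, a minimal sketch of the FILENAME variant (untested; it reuses the dump() function defined above):
# untested sketch: detect a file change via FILENAME instead of FNR==1 / NR!=1
FILENAME != prevfile && prevfile != "" {
dump(TESTd, g)
delete TESTd
g = 0
}
{ prevfile = FILENAME; TESTd[FNR-1] = $2; g++ }
END { dump(TESTd, g) }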
Or, as suggested in https://stackoverflow.com/a/10691259/981959, with gawk 4 you can use an ENDFILE block instead of END in your original:
{
TESTd[FNR-1] = $2; g++
}
ENDFILE {
for (i = 0 ; i <= g-1; i ++ ) {
print " [\"" TESTd[i] "\"]"
}
print " _____"
delete TESTd
g = 0
}
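Either way you should be able to run it exactly as before, e.g. gawk -f test.awk 1.txt 2.txt 3.txt; with ENDFILE each file's block (followed by the _____ marker) is printed as soon as that file has been read.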

Write a shell script. Put the lines below into test.sh, then run it with /bin/sh test.sh or /bin/bash test.sh, whichever works on your system:
for f in *.txt
do
echo "File is $f"
awk -F '\t' 'blah blah' "$f" >> output.txt
done
Or write a bash shell script to call your awk script
for f in *.txt
do
echo "File is $f"
/bin/sh yourscript.sh "$f" # pass the current file to your script
done

Run awk in parallel

I have the code below, which works successfully, and is used to parse and clean log files (very large in size) and split them into smaller files. The output filename is the first 2 characters of each line. However, if there is a special character among these 2 characters, it needs to be replaced with a '_'. This helps ensure there is no illegal character in the filename.
This would take about 12-14 mins to process 1 GB worth of logs (on my laptop). Can this be made faster?
Is it possible to run this in parallel? I am aware I could do }' "$FILE" &. However, I tested that and it does not help much. Is it possible to ask awk to output in parallel - what is the equivalent of print $0 >> Fpath & ?
Any help will be appreciated.
Sample log file
"email1#foo.com:datahere2
email2#foo.com:datahere2
email3#foo.com datahere2
email5#foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii#row.com;data.is.junk-Œœ
email3#foo.com:datahere2
Expected Output
# cat em
email1#foo.com:datahere2
email2#foo.com:datahere2
email3#foo.com:datahere2
email5#foo.com:dtat'ah'ere2
email3#foo.com:datahere2
# cat errorfile
wrongemailfoo.com
nonascii#row.com;data.is.junk-Œœ
Code:
#!/bin/sh
pushd "_test2" > /dev/null
for FILE in *
do
awk '
BEGIN {
FS=":"
}
{
gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
$0=gensub("[,|;: \t]+",":",1,$0)
if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
{
Fpath=tolower(substr($1,1,2))
Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
print $0 >> Fpath
}
else
print $0 >> "errorfile"
}' "$FILE"
done
popd > /dev/null
Look up the man page for the GNU tool named parallel if you want to run things in parallel, but we can vastly improve the execution speed just by improving your script.
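As a rough, untested sketch of what that would look like (clean.awk here is a hypothetical file holding your per-file awk program), GNU parallel could run four jobs at a time with:
parallel -j4 awk -f clean.awk {} ::: *
Keep reading, though, because invoking awk once per file is itself part of the problem.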
Your current script makes 2 mistakes that greatly impact efficiency:
Calling awk once per file instead of once for all files, and
Leaving all output files open while the script is running so awk has to manage them
You currently, essentially, do:
for file in *; do
awk '
{
Fpath = substr($1,1,2)
Fpath = gensub(/[^[:alnum:]]/,"_","g",Fpath)
print > Fpath
}
' "$file"
done
If you do this instead it'll run much faster:
sort * |
awk '
{ curr = substr($0,1,2) }
curr != prev {
close(Fpath)
Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
prev = curr
}
{ print > Fpath }
'
Having said that, you're manipulating your input lines before figuring out the output file names, so - this is untested, but I THINK your whole script should look like this:
#!/usr/bin/env bash
pushd "_test2" > /dev/null
awk '
{
gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
sub(/[,|;: \t]+/, ":")
if (/^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+:[\x00-\x7F]+$/) {
print
}
else {
print > "errorfile"
}
}
' * |
sort -t':' -k1,1 |
awk '
{ curr = substr($0,1,2) }
curr != prev {
close(Fpath)
Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
prev = curr
}
{ print > Fpath }
'
popd > /dev/null
Note the use of $0 instead of $1 in the scripts - that's another performance improvement, because awk only does field splitting (which of course takes time) if you reference specific fields in your script.
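A quick, informal way to see that effect for yourself (big.log is a hypothetical file name; the exact numbers will vary by awk implementation and data):
time awk '{ n++ } END { print n }' big.log                 # no field references, so no splitting is needed
time awk '{ if ($3 != "") n++ } END { print n }' big.log   # referencing $3 forces field splitting on every line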
Assuming multiple cores are available, the simple way to run in parallel is to use xargs. Depending on your config, try 2, 3, 4, 5, ... until you find the optimal number. This assumes that there are multiple input files and that no single file is much larger than all the others.
Notice the added fflush calls, so that output lines will not be split between processes. This has some negative performance impact, but it is required if you want the individual input files merged into a single set of output files. A possible workaround is to write a separate set of output files per input file and then merge them afterwards.
#! /bin/sh
pushd "_test2" > /dev/null
ls * | xargs --max-procs=4 -L1 awk '
BEGIN {
FS=":"
}
{
gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
$0=gensub("[,|;: \t]+",":",1,$0)
if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
{
Fpath=tolower(substr($1,1,2))
Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
print $0 >> Fpath
fflush(Fpath)
}
else {
print $0 >> "errorfile"
fflush("errorfile")
}
}'
popd > /dev/null
From a practical point of view you might want to create an awk script, e.g. split.awk:
#!/usr/bin/awk -f
BEGIN {
FS=":"
}
{
gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
$0=gensub("[,|;: \t]+",":",1,$0)
if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
{
Fpath=tolower(substr($1,1,2))
Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
print $0 >> Fpath
}
else
print $0 >> "errorfile"
}
And then the 'main' code will look like the below, which is easier to manage:
ls * | xargs --max-procs=4 -L1 awk -f split.awk

How to merge lines using awk so that each line has a specific number of fields

I want to merge some rows in a file so that each line contains 22 fields separated by ~.
The input file looks like this:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and so on.
The first line looks fine. The 2nd and 3rd lines need to be merged so that they become one line with 22 fields; the 4th, 5th and 6th lines should be merged, and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB of data, but the code I wrote (using a while loop) is taking too much time to execute. How can I solve this problem using an awk/sed command?
Code Used:
IFS=$'\n'
set -f
while read line
do
count_tild=`echo $line | grep -o '~' | wc -l`
if [ $count_tild == 21 ]
then
echo $line
else
checkLine
fi
done < file.txt
function checkLine
{
current_line=$line
read line1
next_line=$line1
new_line=`echo "$current_line$next_line"`
count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
if [ $count_tild_mod == 21 ]
then
echo "$new_line"
else
line=$new_line
checkLine
fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
awk -F '~' 'NF==1 { next } # Hack; see below
NF<22 {
for(i=1; i<=NF; i++) f[++a]=$i }
a==22 {
for(i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
a=0 }
NF==22
END {
if(a) for(i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~") }' file.txt>file.new
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
exit 1 }
The NF==1 block is a hack to bypass the weirdness of the completely empty line 5 in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
sub(/^[0-9]+\./,"")
gsub(/[[:space:]]+/,"")
$0 = prev $0
if ( NF == 22 ) {
print ++cnt "." $0
prev = ""
}
else {
prev = $0
}
}
$ awk -f tst.awk file
1.200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2.2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
3.209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on 1 line nor do you exceed 22 in any concatenation of the contiguous lines that are each less than 22 fields, just like you show in your sample input.
You can try this awk
awk '
BEGIN {
FS=OFS="~"
}
{
while(NF<22) {
if(NF==0)
break
a=$0
getline
$0=a$0
}
if(NF!=0)
print
}
' infile
or this sed:
sed -E '
:A
# this s/// succeeds only once the pattern space already holds 21 "~" separators,
# so the t below acts as a "do we have a full record yet?" test
s/((.*~){21})([^~]*)/\1\3/
tB
# not enough fields yet: append the next input line and try again
N
bA
:B
# join the accumulated lines by removing the embedded newlines
s/\n//g
' infile

Set number of lines to variable in awk

I am experimenting with an awk script (an independent file).
I want it to process a text file which looks like this:
value1: n
value2: n
value3: n
value4: n
value5: n
value6: n
value7: n
value1: n
:
The text file contains a lot of these blocks, with 7 values in each of them. I want the awk script to print some of these values (the name of the value and "n") into a new file or to the command line. I thought I'd process it with a while loop that uses a variable set to the total number of lines, but I just can't get the total number of lines in the file into a variable. It seems I have to process every line until the end of the file to get the total, whereas I'd like to have the total in a variable first and then loop until it is reached.
Do you have an idea?
Where $1 is the input parameter to your script: myscript textfile.txt
count="`wc -l $1 | cut -d' ' -f1`"
echo "Number of lines in $1 is $count"
Then do your awk command utilising $count as your line count
Edit: courtesy of fedorqui
count="`wc -l <$1`"
echo "Number of lines in $1 is $count"
Edit 2 (forgive my awk command, it's not something that I use much):
count="`wc -l </etc/fstab`"
echo "Number of lines in /etc/fstab is $count"
awk '{print $0,"\t","\tLine ",NR," of ","'$count'";}' /etc/fstab
Either do two passes over the file:
awk 'NR==FNR { lines = NR; next }
{ ... this happens on the second pass, use lines as you wish ... }' file file
or read the lines into an array and process it in END:
awk '{ a[NR] = $0 }
END { lines = NR; for(i=1; i<=lines; ++i) { line = a[i]; ... } }' file
The first consumes I/O, the second memory.
In more detail,
awk 'NR==FNR { count++; next }
{ print "Item " FNR " of " count ": " $0 }' file file
or similarly
awk '{ a[NR] = $0; }
END { for (i=1; i<=NR; ++i) print "Item " i " of " NR ": " a[i] }' file
If you need the line count outside of Awk, you will need to print it and capture it from your script. If you have an Awk script which performs something useful and produces the count as a side effect, you will want to print the line count to standard output, and take care that all other output is directed somewhere else. Perhaps like this:
lines=$(awk '{ sum += $1; }
END { print "Average: " sum/NR >"average.txt"; print NR }')
# Back in Bash
echo "Total of $lines processed, output in average.txt"

Redirect input for gawk to a system command

Usually a gawk script processes each line of its stdin. Is it possible to instead specify a system command in the script and use each line of that command's output in the rest of the script?
For example consider the following simple interaction:
$ { echo "abc"; echo "def"; } | gawk '{print NR ":" $0; }'
1:abc
2:def
I would like to get the same output without using pipe, specifying instead the echo commands as a system command.
I can of course use the pipe but that would force me to either use two different scripts or specify the gawk script inside the bash script and I am trying to avoid that.
UPDATE
The previous example is not quite representative of my usecase, this is somewhat closer:
$ { echo "abc"; echo "def"; } | gawk '/d/ {print NR ":" $0; }'
2:def
UPDATE 2
A shell-script parallel would be as follows. Without the exec line the script reads from stdin; with the exec line it uses the output of the command on that line as its input:
/tmp> cat t.sh
#!/bin/bash
exec 0< <(echo abc; echo def)
while read l; do
echo "line:" $l
done
/tmp> ./t.sh
line: abc
line: def
From all of your comments, it sounds like what you want is:
$ cat tst.awk
BEGIN {
if ( ("mktemp" | getline file) > 0 ) {
system("(echo abc; echo def) > " file)
ARGV[ARGC++] = file
}
close("mktemp")
}
{ print FILENAME, NR, $0 }
END {
if (file!="") {
system("rm -f \"" file "\"")
}
}
$ awk -f tst.awk
/tmp/tmp.ooAfgMNetB 1 abc
/tmp/tmp.ooAfgMNetB 2 def
but honestly, I wouldn't do it. You're munging what the shell is good at (creating/destroying files and processes) with what awk is good at (manipulating text).
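For example, keeping the awk part down to just the text handling (here only the print FILENAME, NR, $0 rule) and letting bash create the process via process substitution would look like this (bash-specific, untested sketch):
awk '{ print FILENAME, NR, $0 }' <(echo abc; echo def)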
I believe what you're looking for is getline:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ print line} }' <<< ''
abc
def
Adjusting the answer to your second example:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ counter++; if ( line ~ /d/){print counter":"line} } }' <<< ''
2:def
Let's break it down:
awk '{
cmd = "echo abc; echo def"
# the line below will create a line variable containing the output of cmd
while ( ( cmd | getline line) > 0){
# we need a counter because NR will not work for us
counter++;
# if the line contais the letter d
if ( line ~ /d/){
print counter":"line
}
}
}' <<< ''
2:def

awk output format for average

I am computing the average of many values and printing it with awk, using the following script:
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
But the problem is that it prints the values like 1.2345e+05, which I do not want; I want it to print rounded figures, and I am unable to find where to pass the output format.
EDIT: using {print "average,%3d = ",sum/NR}' in place of {print sum/NR}' does not help, because it prints "average,%3d 1.2345e+05".
You need printf instead of simply print. Print is a much simpler routine than printf is.
for j in *.txt; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2} END {printf j, i, "average %6d", sum/NR}' "$j"
done
echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
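A small throwaway example of the difference (print falls back to OFMT, which defaults to %.6g, for non-integral values, while printf uses the format you give it):
$ awk 'BEGIN { x = 1234567.8; print x; printf "%6d\n", x }'
1.23457e+06
1234567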
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2";
n = split(cmdstring, cmdarray);
for (i = 1; i <= n; i++) {
cmds[cmdarray[i]]
}
}
$1 in cmds {
sums[$1, FILENAME] += $2;
counts[$1, FILENAME]++
files[FILENAME]
}
END {
for (file in files) {
for (cmd in cmds) {
if ((cmd, file) in counts)
printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
}
}
}' *.txt