How to update a variable from inside an AWK function - awk

I run this script from a loop inside another script and I want to:
a) print errors into a file, keeping track of the line number, the file name and the error;
b) print into another file the unique names of the files in which an error has been found. A single file can have more than one error, and I don't want repetitions.
I know I can run sort | uniq on the file at the end from the calling script, but is there another technique?
Something like:
if(tempVar != FILENAME)
{
print FILENAME >> uniqueFiles;
}
tempVar= FILENAME;
Here's my script:
awk '
function errorHandler(error1)
{
print FILENAME >> uniqueFiles;
print FILENAME";"NR";"error >> errorListing;
uniqueFiles = FILENAME;
}
BEGIN {
uniqueFiles="files.txt";
errorListing="errorList.txt";
error1="Error code 1"
}
{
if(NR>1)
{
if(length($1) != 10)
{
errorHandler(error1);
}
}
}
END{}' $1
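
One way to avoid the sort | uniq pass, offered as a sketch rather than as the asker's method: awk variables are global unless they are declared as function parameters, so a function can update a shared array. The example below remembers each file name in an array (seen, a name introduced here) and writes it only the first time an error is reported for that file. It assumes all files are passed to a single awk invocation, which is why it uses FNR and plain > redirection; the sample files a.txt, b.txt and c.txt are made up for the demonstration. It also prints the function's own parameter, whereas the script above declares error1 but prints the unset variable error.

```shell
tmp=$(mktemp -d)
# Made-up sample input: a header line, then values that should be 10 characters long
printf 'id\n0123456789\nshort\nbad\n' > "$tmp/a.txt"
printf 'id\n0123456789\n'             > "$tmp/b.txt"
printf 'id\noops\n'                   > "$tmp/c.txt"

(
    cd "$tmp" || exit 1
    awk '
    function errorHandler(msg) {
        # seen[] is global, so the update made here persists between calls
        if (!(FILENAME in seen)) {
            seen[FILENAME] = 1
            print FILENAME > uniqueFiles
        }
        print FILENAME ";" FNR ";" msg > errorListing
    }
    BEGIN {
        uniqueFiles  = "files.txt"
        errorListing = "errorList.txt"
        error1       = "Error code 1"
    }
    # Skip each header line (FNR == 1) and flag values of the wrong length
    FNR > 1 && length($1) != 10 { errorHandler(error1) }
    ' a.txt b.txt c.txt
)
cat "$tmp/files.txt"
```

If the calling script must keep invoking awk once per file, the same test works with >> redirection, because each invocation then sees exactly one FILENAME.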

Related

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change the column order of CSV files based on inputs.
Example:
The data file (source file) always has the same standard columns, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) are standard for the source files. I am looking for a way to generate a new file that contains only the columns we choose, in the order we choose.
Note: the source/data file is always comma-delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, I would like only those columns and their data to be extracted from the source file into a new file.
Something like: script_name <output_column> <Source_File_name> <target_file_name>
Target file example:
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE", the output file should look like this:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envois,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried an awk script that I found on a website, but it did not work as expected, and since I don't have much Unix knowledge I am finding it difficult to modify.
awk code:
BEGIN {
FS = ","
}
NR==1 {
split(c, ca, ",")
for (i = 1 ; i <= length(ca) ; i++) {
gsub(/ /, "", ca[i])
cm[ca[i]] = 1
}
for (i = 1 ; i <= NF ; i++) {
if (cm[$i] == 1) {
cc[i] = 1
}
}
if (length(cc) == 0) {
exit 1
}
}
{
ci = ""
for (i = 1 ; i <= NF ; i++) {
if (cc[i] == 1) {
if (ci == "") {
ci = $i
} else {
ci = ci "," $i
}
}
}
print ci
}
The above code is saved as Remove.awk and is called by another script as below:
var1="BATCHID,LV2P_DESCRIPTION"
## input field names used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" ' # Pass the fields to print as the variable flds
BEGIN {
split(flds,map,",") # Split flds into the array map using , as the delimiter
}
NR==1 { for (i=1;i<=NF;i++) {
map1[$i]=i # Loop through the header and create an array map1 with the column header as the index and the column number as the value
}
}
{ printf "%s",$map1[map[1]]; # Print the first field specified (index of map)
for(i=2;i<=length(map);i++) {
printf ",%s",$map1[map[i]] # Loop through the other field numbers specified, printing the contents
}
printf "\n"
}' file
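
The question asked for a script_name <output_column> <Source_File_name> <target_file_name> interface. A minimal sketch of such a wrapper follows; it uses the same header-lookup idea as the answer above but is not the answerer's code. The function name select_cols, the array names want and pos, and the two-row sample file are assumptions made for the demonstration. It also stops with an error when a requested column is missing from the header, since an empty lookup would otherwise select $0, the whole line.

```shell
# Usage: select_cols <output_columns> <source_file> <target_file>
select_cols() {
    awk -F, -v flds="$1" '
    BEGIN { n = split(flds, want, ",") }       # requested column names, in output order
    NR == 1 {
        for (i = 1; i <= NF; i++) pos[$i] = i  # header name -> column number
        for (i = 1; i <= n; i++)
            if (!(want[i] in pos)) {
                print "unknown column: " want[i] > "/dev/stderr"
                exit 1
            }
    }
    {
        # Print the requested fields, comma-separated, for the header and every data row
        for (i = 1; i <= n; i++)
            printf "%s%s", (i > 1 ? "," : ""), $(pos[want[i]])
        print ""
    }' "$2" > "$3"
}

tmp=$(mktemp -d)
cat > "$tmp/RESULT.csv" <<'EOF'
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
EOF
select_cols "LV1P_DESCRIPTION,PRODUCTCODE" "$tmp/RESULT.csv" "$tmp/test.csv"
cat "$tmp/test.csv"
```

Note that the test call in the question asks for LV2P_DESCRIPTION, which is not among the standard headers listed; with this sketch such a request is reported instead of being silently dropped.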

GAWK does not terminate after ENDFILE block with single file

I have a gawk script below that reads a protein FASTA file and only prints out the records that don't have an X in their sequence and are within a certain length range. I wanted to modify the file in place, so I had the script write to a temporary file and then rename it to the original file. The BEGINFILE and ENDFILE constructs in gawk seemed convenient for this. However, for some reason, gawk does not exit after executing the code in the ENDFILE block, even if it is given a single file argument. It seems to jump back to another line of code and then just hang. Does anyone know what could cause this to happen? The weird part is that this doesn't happen for every FASTA file, only a few, and I can't tell what is different between the ones that trigger the bug and the ones that don't.
#! /bin/gawk -f
function trim(s) {
gsub(/^[ \t]+|[ \t]+$/, "", s)
return s
}
function printFasta(header, seq, outfile, seq_line_max_chars) {
print ">" header > outfile
seq_line_max_chars = 80
start = 1
end = length(seq)
while (start <= end) {
print substr(seq, start, seq_line_max_chars) > outfile
start += seq_line_max_chars
}
}
BEGIN {
min_prot_len = 400
}
BEGINFILE {
tmp_out = FILENAME ".tmp"
}
/^>.+/ {
headerStartIdx = index($0, ">") + 1
header = trim(substr($0, headerStartIdx))
getline
sequence = ""
while ($0 !~ /(^>.+)|(^[[:space:]]*$)/) {
x_matched = match($0, "X")
if (x_matched != 0) {
next
}
gsub("*", "")
sequence = sequence $0
getline
}
if (length(sequence) >= min_prot_len) {
printFasta(header, sequence, tmp_out)
}
}
ENDFILE {
print "move called"
# system(("mv " tmp_out " " FILENAME))
}
I called the script with
$ ./filter_proteins.awk test.faa
When I run this, move called is printed and then it hangs. I tried stepping through with the debugger and I see that it reaches the ENDFILE block having processed all the lines in the input file, but when I type the next command, it jumps to the getline statement on line 44. After several iterations of next and print $0 it seems that the program is stuck reading the last line of the input file till the end of time. Perhaps this is a bug?
I am using GAWK 5.1.0
Edit
A minimal input file.
https://github.com/CuriousTim/pastebin/blob/main/mb.34.faa
When I run the script with only a few sequences, it works, but when I use the whole file, it hangs. I wasn't sure how to make a minimal example without providing the whole file.
We'll know for sure after you provide sample input to test with, but my money's on the call to getline within the loop reaching the end of the file, which triggers the ENDFILE rule while you're still inside the loop.
Look:
$ cat file
foo
bar
$ cat tst.awk
{
while (1) {
print "about to execute getline"
getline
print "just executed getline:", $0
if (++c == 5) {
exit
}
}
}
ENDFILE {
print "*** in ENDFILE ***"
}
$ awk -f tst.awk file
about to execute getline
just executed getline: bar
about to execute getline
*** in ENDFILE ***
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
Calling getline in a loop is not how you want to write an awk script, and the bare getline in your code is the wrong syntax to use at any time because it never tests whether the read succeeded; see http://awk.freeshell.org/AllAboutGetline.
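
For the cases where getline really is needed, a minimal sketch of the guarded form: getline returns 1 when it reads a record, 0 at end of input and -1 on error, so the loop condition should test for a positive value. The variable name line and the two-line sample file are assumptions for the demonstration; this is not the asker's script.

```shell
tmp=$(mktemp -d)
printf 'foo\nbar\n' > "$tmp/file"

out=$(awk '
{
    print "record:", $0
    # Stop as soon as getline reports end of input (0) or an error (-1)
    while ((getline line) > 0)
        print "getline read:", line
    print "input exhausted"
}' "$tmp/file")
printf '%s\n' "$out"
```

Because the loop ends when the input does, the program terminates instead of re-reading the last line forever.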
I modified my script to remove the getline, in case anyone finds it useful.
#! /bin/gawk -f
function printFasta(header, seq, seq_line_max_chars) {
print ">" header
seq_line_max_chars = 80
start = 1
end = length(seq)
while (start <= end) {
print substr(seq, start, seq_line_max_chars)
start += seq_line_max_chars
}
}
BEGIN {
FS = ">"
min_prot_len = 400
max_prot_len = 700
}
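NF > 1 {
if (header && sequence &&
length(sequence) >= min_prot_len &&
length(sequence) <= max_prot_len) {
printFasta(header, sequence)
}
# Start a new record: with FS = ">" the header text is the second field
header = $2
sequence = ""
next
}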
{
if (!header) {
next
} else {
x_in_seq = match($0, "X")
if (!x_in_seq) {
gsub(/\*/, "")
sequence = sequence $0
} else {
header = ""
}
}
}
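END {
# Apply the same length filter to the last record in the file
if (header &&
length(sequence) >= min_prot_len &&
length(sequence) <= max_prot_len) {
printFasta(header, sequence)
}
}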
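
A script that writes to standard output can still update a file in place if the shell does the temporary-file step, which avoids BEGINFILE and ENDFILE altogether. The sketch below is a compact stand-in for the filter, not the asker's script: the function name emit, the flag bad, the tiny thresholds (min=3, max=5) and the sample file test.faa are all assumptions chosen so that the toy input exercises each rule.

```shell
tmp=$(mktemp -d)
cat > "$tmp/test.faa" <<'EOF'
>keep
MKVL*
>hasx
MXVL
>short
MK
EOF

awk -v min=3 -v max=5 '
# Print the buffered record if it passed every check; s is a local variable
function emit(    s) {
    if (header != "" && !bad && length(seq) >= min && length(seq) <= max) {
        print ">" header
        for (s = 1; s <= length(seq); s += 80)
            print substr(seq, s, 80)
    }
}
/^>/ { emit(); header = substr($0, 2); seq = ""; bad = 0; next }  # new record starts
/X/  { bad = 1; next }                                            # reject sequences containing X
     { gsub(/\*/, ""); seq = seq $0 }                             # drop stop symbols, accumulate
END  { emit() }                                                   # do not forget the last record
' "$tmp/test.faa" > "$tmp/test.faa.tmp" && mv "$tmp/test.faa.tmp" "$tmp/test.faa"

cat "$tmp/test.faa"
```

The mv only runs when awk exits successfully, so a failed run leaves the original file untouched.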

awk 1 unexpected character '.' suddenly appeared

The script was working. I added some comments, renamed it, and submitted it. Today my instructor told me it doesn't work and gives the error: awk 1 unexpected character '.'
The script is supposed to read a name from the command line and print the student information for that name.
I have just checked it and, surprisingly, it gives me the same error.
I should run it with a command like this:
scriptName -v name="aname" -f filename
What is the problem, and which part of my code causes it?
#!/usr/bin/awk
BEGIN{
tmp=name;
nameIsValid;
if (name && tolower(name) eq ~/^[a-z]+$/ )
{
inputName=tolower(name)
nameIsValid++;
}
else
{
print "you have not entered the student name"
printf "Enter the student's name: "
getline inputName < "-"
tmp=inputName;
if (tolower(inputName) eq ~/^[a-z]+$/)
{
tmpName=inputName
nameIsValid++
}
else
{
print "Enter a valid name!"
exit
}
}
inputName=tolower(inputName)
FS=":"
}
{
if($1=="Student Number")
{
split ($0,header,FS)
}
if ($1 ~/^[0-9]+$/ && length($1)==8)
{
split($2,names," ")
if (tolower(names[1]) == inputName || tolower(names[2])==inputName )
{
counter++
for (i=1;i<=NF;i++)
{
printf"%s:%s ",header[i], $i
}
printf "\n"
}
}
}
END{
if (counter == 0 && nameIsValid)
{
printf "There is no record for the %-10s\n" , tmp
}
}
Here are the steps to fix the script:
Get rid of all those spurious NULL statements (trailing semi-colons at the end of lines).
Get rid of the unset variable eq (it is NOT an equality operator!) from all of your comparisons.
Clean up the indenting.
Get rid of that first non-functional nameIsValid; statement.
Change printf "\n" to the simpler print "".
Get rid of the useless ,FS arg to split().
Change name && tolower(name) ~ /^[a-z]+$/ to just the second part of that condition since if that matches then of course name is populated.
Get rid of all of those tolower()s and use character classes instead of explicit a-z ranges.
Get rid of the tmp variable.
Simplify your BEGIN logic.
Get rid of the unnecessary nameIsValid variable completely.
Make the awk body a bit more awk-like.
And here's the result (untested since no sample input/output posted):
BEGIN {
if (name !~ /^[[:alpha:]]+$/ ) {
print "you have not entered the student name"
printf "Enter the student's name: "
getline name < "-"
}
if (name ~ /^[[:alpha:]]+$/) {
inputName=tolower(name)
FS=":"
}
else {
print "Enter a valid name!"
exit
}
}
$1=="Student Number" { split ($0,header) }
$1 ~ /^[[:digit:]]+$/ && length($1)==8 {
split(tolower($2),names," ")
if (names[1]==inputName || names[2]==inputName ) {
counter++
for (i=1;i<=NF;i++) {
printf "%s:%s ",header[i], $i
}
print ""
}
}
END {
if (counter == 0 && inputName) {
printf "There is no record for the %-10s\n" , name
}
}
I changed the shebang line to:
#!/usr/bin/awk -f
and then didn't use -f on the command line. It is working now.
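
Why the -f matters, as a sketch: when an executable script starts with #!/usr/bin/awk -f, the kernel runs /usr/bin/awk -f script args..., so awk reads its program from the file. Without -f, the script's path itself is handed to awk as the program text, which would explain the complaint about an unexpected '.' if the script was started as ./scriptName. The file name greet.awk and its contents are made up for the demonstration, and /usr/bin/awk is assumed to be where awk lives on the system.

```shell
tmp=$(mktemp -d)
cat > "$tmp/greet.awk" <<'EOF'
#!/usr/bin/awk -f
BEGIN { print "hello, " name }
EOF
chmod +x "$tmp/greet.awk"

# Running "$tmp/greet.awk" -v name="aname" directly makes the kernel execute
# the equivalent of the command below, using the interpreter path from the
# shebang line. awk itself treats that first line as a comment.
out=$(awk -f "$tmp/greet.awk" -v name="aname")
echo "$out"
```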
Run the script in the following way:
awk -f script_name.awk input_file.txt
This seems to suppress the warnings and errors.
In my case, the problem was that I had set the IFS variable to IFS="," as suggested in this answer for splitting a string into an array. I reset the IFS variable afterwards and my code worked.
IFS=', '
read -r -a array <<< "$string"
IFS=' ' # reset IFS

awk: catch `exit' in the END block

I'm using awk for formatting an input file in an output file. I have several patterns to fill variables (like "some pattern" in the example). These variables are printed in the required format in the END block. The output has to be done there because the order of appearance in the input file is not guaranteed, but the order in the output file must be always the same.
BEGIN {
FS = "=|,"
}
/some pattern/ {
if ($1 == 8) {
var = $1
} else {
# Incorrect field value
exit 1
}
}
END {
# Output the variables
print var
}
So my problem is the exit statement in the pattern action. If there is an error and exit is invoked, there should be no output at all, or at most an error message. But as the gawk manual (here) says, if exit is invoked in a pattern-action block, the END block is still executed. Is there any way to catch the exit, like:
if (!exit_invoked) {
print var
}
or some other way to avoid printing the output in the END block?
Stefan
edit: Used the solution from shellter.
You'll have to handle it explicitly by setting exit_invoked before the exit line, i.e.
BEGIN {
FS = "=|,"
}
/some pattern/ {
if ($1 == 8) {
var = $1
} else {
# Incorrect field value
exit_invoked=1
exit 1
}
}
END {
if (! exit_invoked ) {
# Output the variables
print var
}
}
I hope this helps.
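
The flag technique can be checked from the shell. The sketch below wraps the same logic in a shell function (format, a name introduced here, with the flag shortened to failed) and feeds it one valid and one invalid made-up input line. The invalid line produces no output, and the status given to exit in the main rule is still the program's exit status because END finishes without calling exit again.

```shell
format() {
    awk -F'=|,' '
    /some pattern/ {
        if ($1 == 8) {
            var = $1
        } else {
            failed = 1      # remember that we are leaving because of an error
            exit 1
        }
    }
    END {
        if (!failed)        # skipped when exit was called above
            print var
    }'
}

good=$(printf '8=some pattern\n' | format); good_status=$?
bad=$(printf '9=some pattern\n' | format);  bad_status=$?
echo "good: '$good' (status $good_status)  bad: '$bad' (status $bad_status)"
```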
END {
# If we got here from an exit in a main block after an error, the input is unlikely to be at EOF.
# So if the input can still be read, exit with the previously set status rather than run the rest of the END block.
if (getline) exit
......
Being a fan of short syntax, and to avoid futile {}s (or having to add them later to pre-existing programs), instead of:
...
else {
exit_invoked=1
exit 1
}
...
END {
if (! exit_invoked ) {
print var
}
}
I use:
else
exit (e=1) # the point
...
END {
if(!e)
print v
}

Reading from file -- awk

I would like to read a file like this
1.23213213
0.12321321
-1.12321321
0.23232322
into a variable, or array to use it somewhere in the main {} code.
But I would like to use it if this file exists. How can I check if it already exists or not, and if not, then do not use that variable or array?
I don't completely understand what you want to achieve, but perhaps something like this is useful to you.
It processes the file line by line and saves each line in an array, keyed by line number so the order is kept. The END section checks how many lines were processed to tell whether the file had content.
awk '{ line[ FNR ] = $0 } END { if ( FNR > 0 ) { print "File" } else { print "NO file" } }' infile
EDIT in response to a comment:
But in awk you can process many files from the command line.
BEGIN {
...
}
## Processing of first file in command line.
FNR == NR {
a[ FNR ] = $0
next
}
## Processing of second file in command line
FNR < NR {
## Check if array 'a' has the values you want and use them
## 'for(...)variable += a[i]' or whatever.
}
Run script like:
awk -f script.awk first_file.txt second_file.txt
But if first_file.txt doesn't exist, awk will complain with an error.
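
One way to tolerate a missing file, offered as a sketch rather than as part of the answer above: read the optional file in BEGIN with getline < file and test the return value, which is 1 for each line read, 0 at end of file and -1 when the file cannot be opened. The variable names vals, a and n and the file names values.txt, missing.txt and main.txt are assumptions for the demonstration.

```shell
tmp=$(mktemp -d)
printf '1.23213213\n0.12321321\n-1.12321321\n0.23232322\n' > "$tmp/values.txt"
printf 'x\n' > "$tmp/main.txt"

prog='
BEGIN {
    # A missing or unreadable file makes getline return -1, so the loop is simply skipped
    while ((getline line < vals) > 0)
        a[++n] = line
    close(vals)
}
{
    # Main rule: use the array only if something was loaded
    if (n > 0)
        print "loaded " n " values, first is " a[1]
    else
        print "no values file"
}'

with=$(awk -v vals="$tmp/values.txt" "$prog" "$tmp/main.txt")
without=$(awk -v vals="$tmp/missing.txt" "$prog" "$tmp/main.txt")
printf '%s\n%s\n' "$with" "$without"
```

Because the optional file is never named as a command-line operand, awk does not abort when it is absent.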