awk compare adjacent lines and print based on if statements

I have one file with multiple lines (reads from a genome), sorted by location. I want to loop over these lines and, if multiple lines have the same ID (column 4), keep either the first one if column 3 is a plus, or the last one if column 3 is a minus. This is my code, but it seems like my variable (lastID) is not properly updated after each line.
Tips are much appreciated.
awk 'BEGIN {lastline=""; lastID=""}
{if ($lastline != "" && $4 != $lastID)
{print $lastline; lastline=""};
if ($3 == "+" && $4 != $lastID)
{print $0; lastline=""}
else if ($3 == "+" && $4 == $lastID)
{lastli=""}
else if ($3 == "-")
{lastline=$0};
lastID=$4
}' file

To access the value of a variable in awk you just use the name of the variable, just like in C and most other Algol-based languages. You don't stick a $ in front of it like you would with shell. Try changing:
$lastline != "" && $4 != $lastID
to:
lastline != "" && $4 != lastID
etc.
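To see why the $ matters, here is a minimal sketch (the variable names n and empty are made up for illustration). In awk, $ is the field operator, so $n means "the field whose number is stored in n", and $ applied to an empty value is $0, the whole line:

```shell
out=$(printf '%s\n' 'a b c' | awk '{
    n = 2
    print n        # the variable itself: 2
    print $n       # field number 2: b
    empty = ""
    print $empty   # "" is treated as 0, so this is $0: a b c
}')
printf '%s\n' "$out"
```

So with lastline still empty, $lastline != "" compares the whole current line against "", which is true for any non-empty line.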
This might be what you're trying to do (your BEGIN section was doing nothing useful so I just removed it):
awk '
(lastline != "") && ($4 != lastID) {
print lastline
lastline=""
}
$3 == "+" {
if ($4 == lastID) {
lastli=""
}
else {
print $0
lastline=""
}
}
$3 == "-" {
lastline=$0
}
{ lastID=$4 }
' file
When formatted sensibly like that, you can see that lastli is never used anywhere except where it's set to "", so that's probably a bug. Maybe it's supposed to be lastline, in which case the assignment can be made common rather than being set in both the if and else legs?
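For reference, a complete sketch of the stated goal might look like the following. It is not the asker's or the answerer's exact code: it assumes that lines with the same ID are adjacent and share the same strand, the sample reads are invented, and pending is a name chosen here for the remembered "-" line. It also adds an END rule, since a "-" group at the end of the file would otherwise never be printed.

```shell
out=$(printf '%s\n' \
    'chr1 100 + id1' \
    'chr1 120 + id1' \
    'chr1 200 - id2' \
    'chr1 220 - id2' \
    'chr1 300 + id3' |
awk '
$4 != lastID {                       # first line of a new ID group
    if (pending != "") print pending # flush the last "-" line of the previous group
    pending = ""
    if ($3 == "+") print             # "+" group: keep its first line
}
$3 == "-" { pending = $0 }           # "-" group: remember the most recent line
{ lastID = $4 }
END { if (pending != "") print pending }   # the file may end inside a "-" group
')
printf '%s\n' "$out"
```

On this sample the first id1 line, the last id2 line and the single id3 line are kept.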

You may want to use awk's own condition {action} structure. This layout is not universally accepted, but I find it easier to read for short statements.
$ awk 'lastline != "" && $4 != lastID {print lastline; lastline=""}
$3=="+" && $4 != lastID {print; lastline=""}
$3=="+" && $4 == lastID {lastline=""}
$3=="-" {lastline=$0}
{lastID=$4}' file

Related

Using If-Statement to assign Variable in awk

I am trying to create an awk script file that takes an input file and converts the fourth column of information for the first three lines into a single row.
For example, if input.txt looks like this:
XX YY val1 1234
XX YY val2 2345
XX YY val3 3456
stuff random garbage junk extrajunk
useless 343059 random3
I want to print the fourth column for rows 1, 2 and 3 into a single row:
1234 2345 3456
I was trying to do this by using if/else-if statements so my file looks like this right now:
#!/usr/bin/awk -f
{
if ($1 == "XX" && $3 == "val1")
{
var1=$4;
}
else if ($1 == "XX" && $3 == "val2")
{
var2=$4;
}
else if ($1 == "XX" && $3 == "val3")
{
var3=$4;
}
}
END{
print var1,var2,var3
}
and then I would print the variables on one line. However, when I try to implement this, I get syntax errors pointing to the "=" symbol in the var2=$4 line.
EDIT
Solved, in my real file I had named the variables funky (yet descriptive) names and that was messing it all up. - Oops.
Thanks

You can write something like this:
$ awk '$1=="XX"{if($3=="val1") var1=$4
else if($3=="val2") var2=$4
else if($3=="val3") var3=$4}
# ... do something with the vars ...
' file
However, if you just want to print the fourth column of the first 3 lines:
$ awk '{printf "%s ", $4} NR==3{exit}' file
1234 2345 3456

Try this instead:
#!/usr/bin/env bash
awk '
$1 == "XX" { var[$3] = $4 }
END { print var["val1"], var["val2"], var["val3"] }
' "$@"
There's almost certainly a much simpler solution depending on your real requirements though, e.g. maybe:
awk '
{ vars = (NR>1 ? vars OFS : "") $4 }
NR == 3 { print vars; exit }
' "$@"
For ease of future enhancements if nothing else, don't call awk from a shebang; call it explicitly from a shell script.
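The array approach can be tried directly on the sample input from the question. This is only a sketch: the list of wanted labels ("val1 val2 val3") and the helper names want and line are assumptions made for the example.

```shell
out=$(printf '%s\n' \
    'XX YY val1 1234' \
    'XX YY val2 2345' \
    'XX YY val3 3456' \
    'stuff random garbage junk extrajunk' \
    'useless 343059 random3' |
awk '
$1 == "XX" { var[$3] = $4 }          # key each value by its label in column 3
END {
    n = split("val1 val2 val3", want, " ")   # output order is fixed by this list
    for (i = 1; i <= n; i++)
        line = (i > 1 ? line OFS : "") var[want[i]]
    print line
}')
printf '%s\n' "$out"
```

Keying on the label rather than on the line number means the three lines may appear in any order, or be separated by unrelated lines.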

Using an if statement within awk to check if one variable meets a format if so follow this path

I want to check the data held in one variable: if it is "B", use one regex; if it contains something else, use a different regex.
awk '{if ($1 == "B")
($2 ~ /^".+"$/) && (length($2) <= 10) {print "45th field invalid-HEADER-FILE";}
else
($2 ~ /^".+"|""$/) && (length($2) <= 10) {print "45th field invalid-HEADER-FILE";}
'
Sample input
$1 == "B"
$2 == "random string"
Expected output
there should be no output as the regex passed
alt sample input
$1 == "B"
$2 == "" (null/empty)
Expected output
there should be 45th field invalid-HEADER-FILE displayed on screen

Update:
The conditions can be combined:
($45 ~ /^".+"$/) && (length($45) <= 2502) becomes ($45~/^".{1,2500}"$/).
($45 ~ /^".+"|""$/) && (length($45) <= 2502) becomes ($45~/^".{0,2500}"$/).
Also, if no quote may appear inside the quotes (as should be the case), it is more exact to use:
($45~/^"[^"]{1,2500}"$/) and ($45~/^"[^"]{0,2500}"$/).
So you can do the checking like this:
awk '
$44 == "B" && ($45~/^"[^"]{1,2500}"$/) {print "45th field invalid-HEADER-FILE";} # <-- you can add "next" after the semicolon if no other code needs to run for this line
$44 != "B" && ($45~/^"[^"]{0,2500}"$/) {print "45th field invalid-HEADER-FILE";}
'
Since $44 either equals "B" or it doesn't, ANDing $44 == "B" and $44 != "B" with the respective conditions will serve your need.
Or put everything inside the main block, properly braced, like this:
awk '
{
if ($44 == "B") {
if ($45~/^"[^"]{1,2500}"$/) {
print "45th field invalid-HEADER-FILE";
}
} else {
if ($45~/^"[^"]{0,2500}"$/) {
print "45th field invalid-HEADER-FILE";
}
}
}'
When properly quoted and indented, you can see the structure clearly.
BTW, you can change length($45) <= 2502 to length($45) < 2503 for conciseness, since length returns an integer.
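Interval expressions such as {1,2500} are not supported by every awk (some older implementations lack them), so here is a hedged, portable sketch that keeps the separate length() test from the question. It is reduced to two tab-separated fields with a limit of 10, and it assumes the message should be printed when the field fails the check, as the expected output in the question suggests. The sample rows and the wording of the message are made up.

```shell
out=$(printf 'B\t"abc"\nB\t""\nA\t""\nA\t"toolongvalue"\n' |
awk -F'\t' '
{
    if ($1 == "B")
        ok = ($2 ~ /^"[^"]+"$/)    # B: quoted and non-empty
    else
        ok = ($2 ~ /^"[^"]*"$/)    # anything else: quoted, may be empty
    if (!ok || length($2) > 10)
        print "line " NR ": 2nd field invalid"
}')
printf '%s\n' "$out"
```

Line 2 fails because a "B" row may not have an empty value, and line 4 fails the length test.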

I really wish you'd post some sample data, preferably not 45 fields wide with 2502 characters in any of them. Post a sample with 2 fields and reduce the width to something reasonable, like 3:
$ cat file
A ""
A "123"
A "1234"
B ""
B "123"
B "1234"
Script:
$ awk '$1=="B" && $2~/^".{0,3}"$/{print $0}' file
And its output (these lines would get your fail message; they are printed as-is here for demonstration purposes):
B ""
B "123"
That would translate roughly to:
$ awk '$44=="B" && $45~/^".{0,2500}"$/{print "45th field invalid-HEADER-FILE"}' file
Is this what you wanted?

AWK - if and printf, before or within

I have the following printf
printf ("%-6s\t%6.3f\n",msg,sum[msg]/count[msg])
The arrays used here are populated as follows:
sum[$2] += $3
count[$2]++
$2 = test1
$3 = 0 - 9 on different lines at random
I need an if statement so that if either sum[msg] or count[msg] == 0, the value is replaced with -
I am not sure if I have to do the following:
if (count[msg] == 0 || sum[msg] == 0) {
printf ("%-6s\t-\n",msg)
}
or is it possible to put the if within the printf like the following:
printf ("%-6s\t%6.3f\n",msg,if (count[msg] == 0 || sum[msg] == 0) {print -} else sum[msg]/count[msg])
Also, if sum[$2] += $3 has no matching rows, will that return 0 or null?
All help is greatly appreciated.

printf "%-6s\t%s\n",msg,(sum[msg]*count[msg]?sprintf("%6.3f",sum[msg]/count[msg]):"-")
But please do read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to get a foundation so you don't have to keep asking for help every step of the way.
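A small runnable sketch of that conditional expression, using invented input rows (test1 with values 3 and 6, test2 with a single 0). The output is piped through sort only because the order of for (msg in count) is unspecified.

```shell
out=$(printf '%s\n' 'x test1 3' 'x test1 6' 'x test2 0' |
awk '
{ sum[$2] += $3; count[$2]++ }
END {
    for (msg in count)
        printf "%-6s\t%s\n", msg,
            (sum[msg] * count[msg] ? sprintf("%6.3f", sum[msg] / count[msg]) : "-")
}' | sort)
printf '%s\n' "$out"
```

An if statement cannot appear inside the printf argument list, but the ?: operator can, because it is an expression. As for the last question: an array element that was never assigned behaves as both 0 and the empty string, so the product test above also covers labels with no matching rows.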

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt separates the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.

I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
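The two-pass idiom can be checked on the sample data with a sketch like this. The temporary directory and the dir variable are additions made here so the output files do not land in the current directory; the awk logic is otherwise the same as above.

```shell
tmp=$(mktemp -d)
cat > "$tmp/input" <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF
awk -F, -v dir="$tmp" '
NR == FNR { count[$NF]++; next }   # pass 1: NR equals FNR only while reading the first file
FNR == 1  { next }                 # pass 2: skip the header line
{ print > (dir "/" (count[$NF] == 1 ? "nondup" : "dup")) }
' "$tmp/input" "$tmp/input"
dups=$(cat "$tmp/dup")
nondups=$(cat "$tmp/nondup")
printf 'dup:\n%s\nnondup:\n%s\n' "$dups" "$nondups"
rm -r "$tmp"
```

Because all counts are known before the second pass starts, the first occurrence of a duplicated value is routed to the dup file like the later ones.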

$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N

This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file

A little late, but here is my version in awk:
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another version, similar to glenn jackman's but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file

Convert scientific notation to decimal in multiple fields

The following works great on my data in column 12 but I have over 70 columns that are not all the same and I need to output all of the columns, the converted ones replacing the scientific values.
awk -F',' '{printf "%.41f\n", $12}' $file
Thanks
This is one line..
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,2.600000000000000e+001,4.152327506554059e+001,-7.893523806678388e+001,5.447572631835938e+002,2.093000000000000e+003,5.295000000000000e+003,1,194733,1.647400093078613e+001,31047680,1152540,29895140,4738,1.586914062500000e+000,-1.150000000000000e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,3.606000000000000e+003,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,4.557073364257813e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,11,0.000000000000000e+000,2.000000000000000e+000,0,0,0,0,4.466836981009692e-004,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
UPDATE
This is working for the non converted output. I am having a bit of trouble inserting an else if statement for some reason. Everything seems to give me a syntax error in a file or on cli.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i == 19||i == 20||i == 21||i == 22|| i == 40|| i == 43||i == 44||i == 45||i == 46||i >= 51) printf ($i",")};}' $file
I would like to insert the following statement into the code above??
else if (i == 10) printf ("%.41f", $i)
SOLVED
Got it worked out. Thanks for all the great ideas. I can't seem to make it work in a file with awk -f but on the command line this is working great. I put this one liner in my program.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i >= 19&&i <= 22|| i == 40|| i >= 43&&i <= 46||i >= 51&&i <= 70) printf($i","); else if (i == 10||i == 18) printf("%.2f,", $i); else if (i == 11||i == 12) printf("%.41f,", $i); else if (i == 13) printf("%.1f,", $i); else if (i == 14||i == 15||i >= 24&&i <= 46) printf ("%d,", $i); else if (i == 23) printf("%.4f,", $i); else if (i >= 47&&i <= 50) printf("%.6f,", $i); if (i == 71) printf ($i"\n")};}'
RESULT
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,26.00,41.52327506554058800247730687260627746582031,-78.93523806678388154978165403008460998535156,544.8,2093,5295,1,194733,16.47,31047680,1152540,29895140,4738,1.5869,-115,0,0,0,0,0,0,0,3606,0,0,0,455,0,0,0,11,0,2,0,0,0,0,0.000447,0.000000,0.000000,0.000000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-

You can do regex matching in a loop to choose the format for each field since numbers are also strings in AWK:
#!/usr/bin/awk -f
BEGIN {
d = "[[:digit:]]"
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
if ($i ~ d "e[+-]" d d d "$") {
printf "%s%.41f", delim, $i
}
else {
printf "%s%s", delim, $i
}
delim = OFS
}
printf "\n"
}
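A shortened, runnable sketch of the same idea on a few fields taken from the sample line. In a regular expression a bare + is a repetition operator, so the sign of the exponent is matched here with the bracket expression [+-]. The %.6f format is an arbitrary choice for brevity, not the asker's %.41f.

```shell
out=$(printf '%s\n' '2012-07-01T21:59:50,1817,2.600000000000000e+001,4.466836981009692e-004,-' |
awk 'BEGIN { FS = OFS = "," }
{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^-?[0-9]+\.[0-9]+e[+-][0-9]+$/)   # looks like scientific notation
            $i = sprintf("%.6f", $i)                # rewrite the field in place
    print                                           # $0 is rebuilt using OFS
}')
printf '%s\n' "$out"
```

Assigning to $i makes awk rebuild the record with OFS, which is why both FS and OFS are set to a comma.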
Edit:
I've changed the version above so you can see how it would be used in a file as an AWK script. Save it (I'll call it "scinote") and set it as executable chmod u+x scinote, then you can run it like this: ./scinote inputfile
I've also modified the latest version you added to your question to make it a little simpler and so it's ready to go into a script file as above.
#!/usr/bin/awk -f
BEGIN {
plainlist = "16 17 19 20 21 22 40 43 44 45 46"
split(plainlist, arr)
for (i in arr) {
plainfmt[arr[i]] = "%s"
}
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
printf "%s", delim
if (i <= 9 || i in plainfmt || i >= 51) {
printf "%s", $i
}
else if (i == 10) {
printf "%.41f", $i
}
else if (i == 12) {
printf "%.12f", $i
}
delim = OFS
}
printf "\n"
}
If you had several fields sharing each of the other formats (rather than just one field per format), you could do something similar to the plainfmt array.
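One way to sketch that generalisation is a lookup table from field number to format, with every unlisted field passed through unchanged. The field numbers and formats below are invented for a four-field sample; fmt is a name chosen for this example.

```shell
out=$(printf '%s\n' 'a,2.600000000000000e+001,4.152327506554059e+001,5.295000000000000e+003' |
awk 'BEGIN {
    FS = OFS = ","
    fmt[2] = "%.2f"; fmt[3] = "%.4f"; fmt[4] = "%d"   # field number -> printf format
}
{
    for (i = 1; i <= NF; i++)
        if (i in fmt) $i = sprintf(fmt[i], $i)        # unlisted fields stay as they are
    print
}')
printf '%s\n' "$out"
```

Adding or changing a column then only means editing the table in BEGIN, not the if/else chain.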

You could always loop through all of your data fields and use them in your printf. For a simple file just to test the mechanics you could try this:
awk '{for (i=1; i<=NF; i++) printf("%d = %s\n", i, $i);}' data.txt
Note that -F is not set here, so fields will be split by whitespace.
NF is the predefined variable for number of fields on a line, fields start with 1 (e.g., $1, $2, etc until $NF). $0 is the whole line.
So for your example this may work:
awk -F',' '{for (i=1; i<=NF; i++) printf "%.41f\n", $i}' $file
Update based on the comment below (I'm not on a system where I can test the syntax):
If certain fields need to be treated differently, you may have to resort to a switch statement (a GNU awk extension) or an if statement. This would be easier if you stored your script in a file, let's call it so.awk, and invoked it like this:
awk -f so.awk $file
Your script might contain something along these lines:
BEGIN{ FS="," }
{ for (i=1; i<=NF; i++)
{
if (i == 20 || i == 22|| i == 30)
printf( " .. ", $i)
else if ( i == 13 || i == 24)
printf( " ....", $i)
# etc.
}
}
You can of course also use if (i > 2) ... or other ranges to avoid having to list out every single field if possible.
As an alternative to this series of if-statements see the switch statement mentioned above.
