I have one file with multiple lines (reads from a genome), sorted by location. I want to loop over these lines and, when multiple lines have the same ID (column 4), keep only the first if column 3 is a plus, or only the last if column 3 is a minus. This is my code, but it seems my variable (lastID) is not properly updated after each line.
Tips are much appreciated.
awk 'BEGIN {lastline=""; lastID=""}
{if ($lastline != "" && $4 != $lastID)
{print $lastline; lastline=""};
if ($3 == "+" && $4 != $lastID)
{print $0; lastline=""}
else if ($3 == "+" && $4 == $lastID)
{lastli=""}
else if ($3 == "-")
{lastline=$0};
lastID=$4
}' file
To access the value of a variable in awk you just use the name of the variable, just like in C and most other Algol-based languages. You don't stick a $ in front of it like you would with shell. Try changing:
$lastline != "" && $4 != $lastID
to:
lastline != "" && $4 != lastID
etc.
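To see why the stray $ matters, here is a minimal sketch (the function name dollar_demo and the variables n and empty are made up for illustration). In awk, $ is the field operator, so $name means "the field whose number is the value of name"; an empty or non-numeric value counts as 0, which makes it $0, the whole line.

```shell
# Hypothetical demo: n holds a field number, empty holds an empty string.
dollar_demo() {
    printf 'a b c\n' | awk '{
        n = 2
        empty = ""
        print n        # the variable itself
        print $n       # the field whose number is stored in n
        print $empty   # "" counts as 0, so this is $0, the whole line
    }'
}
dollar_demo
```

This should print 2, then b, then a b c. It is also why $lastline != "" in the question ends up comparing the whole current line against "" rather than testing the saved line.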
This might be what you're trying to do (your BEGIN section was doing nothing useful so I just removed it):
awk '
(lastline != "") && ($4 != lastID) {
print lastline
lastline=""
}
$3 == "+" {
if ($4 == lastID) {
lastli=""
}
else {
print $0
lastline=""
}
}
$3 == "-" {
lastline=$0
}
{ lastID=$4 }
' file
When formatted sensibly like that you can see that lastli is never used anywhere except where it's set to "", so that's probably a bug. Maybe it's supposed to be lastline, in which case it can be made common rather than being set in both the if and else legs?
You may want to use awk's own condition {statement} structure. Note that this code layout is not universally accepted, but I find it easier to read for short statements.
$ awk 'lastline!="" && $4 != lastID {print lastline; lastline=""}
$3=="+" && $4 != lastID {print; lastline=""}
$3=="+" && $4 == lastID {lastline=""}
$3=="-" {lastline=$0}
{lastID=$4}' file
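For reference, here is one possible sketch of the whole task, tested only against made-up data. It assumes whitespace-separated columns with the strand in column 3 and the ID in column 4, as the question describes. The function name pick_reads and the sample lines are hypothetical, and the END rule is an addition not present in the code above: without it, a trailing group of "-" lines would never be printed.

```shell
# Hypothetical helper: keep the first line of each "+" group
# and the last line of each "-" group. Reads from stdin.
pick_reads() {
    awk '
        $4 != lastID {                        # a new ID starts here
            if (lastline != "") print lastline   # flush the pending "-" line
            lastline = ""
            if ($3 == "+") print              # first line of a "+" group
        }
        $3 == "-" { lastline = $0 }           # remember the latest "-" line
        { lastID = $4 }
        END { if (lastline != "") print lastline }   # flush a trailing "-" group
    '
}

# Made-up sample data for illustration only.
printf '%s\n' \
    'chr1 100 + geneA' \
    'chr1 200 + geneA' \
    'chr1 300 - geneB' \
    'chr1 400 - geneB' \
    'chr1 500 + geneC' | pick_reads
```

With this sample the lines at positions 100, 400 and 500 should come out.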
I am trying to create an awk script file that takes an input file and converts the fourth column of information for the first three lines into a single row.
For example, if input.txt looks like this:
XX YY val1 1234
XX YY val2 2345
XX YY val3 3456
stuff random garbage junk extrajunk
useless 343059 random3
I want to print the fourth column for rows 1, 2 and 3 into a single row:
1234 2345 3456
I was trying to do this by using if/else-if statements so my file looks like this right now:
#!/usr/bin/awk -f
{
if ($1 == "XX" && $3 == "val1")
{
var1=$4;
}
else if ($1 == "XX" && $3 == "val2")
{
var2=$4;
}
else if ($1 == "XX" && $3 == "val3")
{
var3=$4;
}
}
END{
print var1,var2,var3
}
The END block then prints the variables on one line. However, when I try to run this, I get syntax errors pointing to the "=" symbol in the var2=$4 line.
EDIT
Solved, in my real file I had named the variables funky (yet descriptive) names and that was messing it all up. - Oops.
Thanks
You can write something like this:
$ awk '$1=="XX"{if($3=="val1") var1=$4
else if($3=="val2") var2=$4
else if($3=="val3") var3=$4}
END{print var1, var2, var3}' file
However, if you just want to print the fourth column of the first 3 lines:
$ awk '{printf "%s ", $4} NR==3{exit}' file
1234 2345 3456
Try this instead:
#!/bin/env bash
awk '
$1 == "XX" { var[$3] = $4 }
END { print var["val1"], var["val2"], var["val3"] }
' "$@"
There's almost certainly a much simpler solution depending on your real requirements though, e.g. maybe:
awk '
{ vars = (NR>1 ? vars OFS : "") $4 }
NR == 3 { print vars; exit }
' "$@"
For ease of future enhancements if nothing else, don't call awk from a shebang, just call it explicitly.
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
The following attempt separates the duplicates without any problem. However, the first occurrence of each duplicate is still written to the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
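A minimal sketch of why the file is named twice, using the sample data from the question. NR counts records across all inputs while FNR restarts at 1 for each file, so NR==FNR is only true during the first read. The function name classify and the temporary file are illustrative, and the label is printed next to each line instead of being used as an output file name so the result is easy to inspect.

```shell
# Hypothetical helper: label each data line of a CSV file as dup/nondup
# by its last column. $1 is the path to a file with a header line.
classify() {
    awk -F, '
        NR == FNR { count[$NF]++; next }   # first read: only count the keys
        FNR == 1  { next }                 # second read: skip the header
        {
            label = (count[$NF] == 1 ? "nondup" : "dup")
            print label, $0
        }
    ' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' \
    'TargetIDs,CPD,Value,SMILES' \
    '95,CPD-1111111,-2,c1ccccc1' \
    '95,CPD-2222222,-3,c1ccccc1' \
    '95,CPD-2222222,-4,c1ccccc1' \
    '95,CPD-3333333,-1,c1ccccc1N' > "$tmp"
classify "$tmp"
rm -f "$tmp"
```

The three c1ccccc1 lines should be labelled dup and the c1ccccc1N line nondup.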
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another one, built similarly to glenn jackman's but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
I am testing awk and had this thought. We know that:
raja@badfox:~/Desktop/trails$ cat num.txt
1 2 3 4
1 2 3 4
4 1 2 31
raja@badfox:~/Desktop/trails$ awk '{ if ($1 == '4') print $0}' num.txt
4 1 2 31
raja@badfox:~/Desktop/trails$
So the command checks for 4 in the 1st column of num.txt.
Now I also want the 4 in column 4 to be reported. For example, if I have 100 columns of information, I want the output to tell me how many columns contain the term I am searching for.
I mean, in the above example I am searching for 4 and I want column 4 and column 1 as the output.
If you are trying to find the rows which contain your search item (in this case, the value 4), and you want a count of how many such values appear in the row (as well as the row's data), then you need something like:
awk '{ count=0
for (i = 1; i <= NF; i++) if ($i == 4) count++
if (count) print count ": " $0
}'
That doesn't quite fit onto one SO line.
If you merely want to identify which columns contain the search value, then you can use:
awk '{ for (i = 1; i <= NF; i++) if ($i == 4) column[i] = 1 }
END { for (i in column) print i }'
This sets the (associative) array element column[i] to 1 for each column that contains the search value, 4. The loop at the end prints the column numbers that contain 4 in an indeterminate (not sorted) order. GNU awk includes sort functions (asort, asorti); POSIX awk does not. If sorted order is crucial, then consider:
awk 'BEGIN { max = 0 }
{ for (i = 1; i <= NF; i++) if ($i == 4) { column[i] = 1; if (i > max) max = i } }
END { for (i = 1; i <= max; i++) if (column[i] == 1) print i }'
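If the matching columns are wanted per row rather than for the file as a whole, a sketch along the following lines might do. The function name match_columns, the variable want and the "row N:" output format are all made up for illustration; the search value is passed in with -v instead of being hard-coded.

```shell
# Hypothetical helper: for each input row, list the column numbers
# whose value equals the search value given as $1. Reads from stdin.
match_columns() {
    awk -v want="$1" '{
        cols = ""
        for (i = 1; i <= NF; i++)
            if ($i == want)
                cols = cols (cols == "" ? "" : " ") i
        if (cols != "") print "row " NR ": " cols
    }'
}

printf '1 2 3 4\n1 2 3 4\n4 1 2 31\n' | match_columns 4
```

On the num.txt contents from the question this should report column 4 for rows 1 and 2 and column 1 for row 3. Because both sides of the comparison look numeric, 31 does not count as a match for 4.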
You are looking for the NF variable. It's the number of fields in the line.
Here's an example of how to use it:
{
if (NF == 8) {
print $3, $8;
} else if (NF == 9) {
print $3, $9;
}
}
Or inside a loop:
# This will print the line if any field has the value 4
for (i=1; i<=NF; i++) {
if ($i == 4)
print $0
}
The following works great on my data in column 12 but I have over 70 columns that are not all the same and I need to output all of the columns, the converted ones replacing the scientific values.
awk -F',' '{printf "%.41f\n", $12}' $file
Thanks
This is one line..
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,2.600000000000000e+001,4.152327506554059e+001,-7.893523806678388e+001,5.447572631835938e+002,2.093000000000000e+003,5.295000000000000e+003,1,194733,1.647400093078613e+001,31047680,1152540,29895140,4738,1.586914062500000e+000,-1.150000000000000e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,3.606000000000000e+003,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,4.557073364257813e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,11,0.000000000000000e+000,2.000000000000000e+000,0,0,0,0,4.466836981009692e-004,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
UPDATE
This is working for the non converted output. I am having a bit of trouble inserting an else if statement for some reason. Everything seems to give me a syntax error in a file or on cli.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i == 19||i == 20||i == 21||i == 22|| i == 40|| i == 43||i == 44||i == 45||i == 46||i >= 51) printf ($i",")};}' $file
I would like to insert the following statement into the code above??
else if (i == 10) printf ("%.41f", $i)
SOLVED
Got it worked out. Thanks for all the great ideas. I can't seem to make it work in a file with awk -f but on the command line this is working great. I put this one liner in my program.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i >= 19&&i <= 22|| i == 40|| i >= 43&&i <= 46||i >= 51&&i <= 70) printf($i","); else if (i == 10||i == 18) printf("%.2f,", $i); else if (i == 11||i == 12) printf("%.41f,", $i); else if (i == 13) printf("%.1f,", $i); else if (i == 14||i == 15||i >= 24&&i <= 46) printf ("%d,", $i); else if (i == 23) printf("%.4f,", $i); else if (i >= 47&&i <= 50) printf("%.6f,", $i); if (i == 71) printf ($i"\n")};}'
RESULT
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,26.00,41.52327506554058800247730687260627746582031,-78.93523806678388154978165403008460998535156,544.8,2093,5295,1,194733,16.47,31047680,1152540,29895140,4738,1.5869,-115,0,0,0,0,0,0,0,3606,0,0,0,455,0,0,0,11,0,2,0,0,0,0,0.000447,0.000000,0.000000,0.000000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
You can do regex matching in a loop to choose the format for each field since numbers are also strings in AWK:
#!/usr/bin/awk -f
BEGIN {
d = "[[:digit:]]"
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
if ($i ~ d "e[-+]" d d d "$") {
printf "%s%.41f", delim, $i
}
else {
printf "%s%s", delim, $i
}
delim = OFS
}
printf "\n"
}
Edit:
I've changed the version above so you can see how it would be used in a file as an AWK script. Save it (I'll call it "scinote") and set it as executable with chmod u+x scinote, then you can run it like this: ./scinote inputfile
I've also modified the latest version you added to your question to make it a little simpler and so it's ready to go into a script file as above.
#!/usr/bin/awk -f
BEGIN {
plainlist = "16 17 19 20 21 22 40 43 44 45 46"
split(plainlist, arr)
for (i in arr) {
plainfmt[arr[i]] = "%s"
}
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
printf "%s", delim
if (i <= 9 || i in plainfmt || i >= 51) {
printf "%s", $i
}
else if (i == 10) {
printf "%.41f", $i
}
else if (i == 12) {
printf "%.12f", $i
}
delim = OFS
}
printf "\n"
}
If you had more fields with other formats (rather than just one per), you could do something similar to the plainfmt array.
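As a rough sketch of that idea, the format for each field can be looked up in an array, with "%s" as the fallback for anything not listed. The function name format_fields, the array name fmt and the field numbers used here are made up for a four-field sample line; they are not the 71-field layout from the question.

```shell
# Hypothetical helper: print each comma-separated field with a per-field
# format taken from fmt[], falling back to "%s". Reads from stdin.
format_fields() {
    awk '
        BEGIN {
            FS = OFS = ","
            fmt[2] = "%.2f"    # illustrative choices, not the real layout
            fmt[3] = "%d"
        }
        {
            for (i = 1; i <= NF; i++) {
                f = (i in fmt) ? fmt[i] : "%s"
                sep = (i > 1) ? OFS : ""
                printf("%s" f, sep, $i)
            }
            printf "\n"
        }
    '
}

printf 'a,2.600000000000000e+001,2.093000000000000e+003,x\n' | format_fields
```

For this sample line the output should be a,26.00,2093,x. Adding another format is then a one-line change in the BEGIN block rather than another else-if branch.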
You could always loop through all of your data fields and use them in your printf. For a simple file just to test the mechanics you could try this:
awk '{for (i=1; i<=NF; i++) printf("%d = %s\n", i, $i);}' data.txt
Note that -F is not set here, so fields will be split by whitespace.
NF is the predefined variable for number of fields on a line, fields start with 1 (e.g., $1, $2, etc until $NF). $0 is the whole line.
So for your example this may work:
awk -F',' '{for (i=1; i<=NF; i++) printf "%.41f\n", $i}' $file
Update based on the comment below (I'm not on a system where I can test the syntax):
If you have certain fields that need to be treated differently, you may have to resort to a switch statement or an if-statement to treat different fields differently. This would be easier if you stored your script in a file, let's call it so.awk and invoked it like this:
awk -f so.awk $file
Your script might contain something along these lines:
BEGIN{ FS="," }
{ for (i=1; i<=NF; i++)
{
if (i == 20 || i == 22|| i == 30)
printf( " .. ", $i)
else if ( i == 13 || i == 24)
printf( " ....", $i)
etc.
}
}
You can of course also use if (i > 2) ... or other ranges to avoid having to list out every single field if possible.
As an alternative to this series of if-statements, see the switch statement mentioned above. Note that switch is a GNU awk (gawk) extension and is not part of POSIX awk.