awk - ternary conditional expression not working

Input (sample)
=== account ===
title,altTitle,platform,url,
title,altTitle,platform,url,
title,altTitle,platform,url,
title,altTitle,platform,url,
title,altTitle,platform,url,
__collate-by-account.awk
#! /usr/bin/awk -f
#
# Group together lines (records) by account name
BEGIN { FS = ":" }
### generate headers ###
{s = $1}
{if (s != p)
print "\n\n=== ", s " ==="
}
{p = s}
### process records ###
# print field $2 to last field
{for (i = 2; i <= NF; ++i)
# {if (i!=NF) printf $i":"; else printf $i}
{ i != NF ? printf $i":" : printf $i }
}
{printf "\n"}
This part works as intended:
{if (i!=NF) printf $i":"; else printf $i}
Why doesn't this work:
{ i != NF ? printf $i":" : printf $i }
Getting the following errors:
awk: scripts/utils/metadata/__collate-by-account.awk:18: { i != NF ? printf $i":" : printf $i }
awk: scripts/utils/metadata/__collate-by-account.awk:18: ^ syntax error
awk: scripts/utils/metadata/__collate-by-account.awk:18: { i != NF ? printf $i":" : printf $i }
awk: scripts/utils/metadata/__collate-by-account.awk:18: ^ syntax error

Solution, thanks to @James Brown:
### process records ###
# print field $2 to last field
{for (i = 2; i <= NF; ++i)
{ printf "%s%s",$i,(i!=NF?":":"") }
}
{printf "\n"}
Explanation:
First off, note that printf is a statement, not an expression, so it can't appear inside the ternary operator: neither as the condition being evaluated nor as either of the two result branches. All three operands of ?: must be expressions that yield a value.
printf formats and prints its arguments.
The %s%s format specifiers substitute the next two arguments as strings:
https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html
https://en.wikipedia.org/wiki/Printf_format_string
$i simply outputs the field that's being looped over; see the for loop above.
(i!=NF?":":"")
output ":" if i is not equal to NF,
otherwise output empty string ""
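To see the difference in isolation, here is a minimal sketch. The join_tail function name and the sample line are made up for illustration, and the newline is folded into the ternary as a variation on the solution above. The ternary only chooses a string; the single printf statement does the printing:

```shell
# Hypothetical helper: print fields 2..NF of each colon-separated line,
# joined by ":" and terminated by a newline.
join_tail() {
    awk -F: '{
        for (i = 2; i <= NF; i++)
            # The ternary yields a separator string; printf is the only statement.
            printf "%s%s", $i, (i != NF ? ":" : "\n")
    }'
}

printf '%s\n' 'account:title:altTitle:platform' | join_tail
```

Note that a line with fewer than two fields prints nothing at all in this variant.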

Related

awk: reformat date in a list of data-files (mass edit)

I want to run awk over a list of data files. All records before a given marker line (there is an unknown number of them), e.g.,
/10-12-2014 06:47:59/{p=1}
are to be skipped.
A brief template of one data file looks like this:
data_file_001
0; n records to be skipped
1;10-12-2014 06:47:59;
2;12-12-2014 10:17:44;
3;12-12-2014 10:37:44;
4;14-12-2014 10:00:32;
5;;movefield
6;16-12-2014 04:15:39;
needed Output ($2 datefield reformatted and $3 moved to $4):
colnum;date;col3;col4;col5
2;12.12.14;;
3;12.12.14;;
4;14.12.14;;
5;;;movefield;moved
6;16.12.14;;
My source file is this at the moment:
BEGIN { OFS=FS=";" ; print "colnum;date;col3;col4;col5"}
FNR == 1 { p=0 }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""}
#(x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)}
$2=substr($2,1,11)
p!=0{print};
/10-12-2014 06:47:59/{p=1}
I have problems reformatting the date fields: neither the pattern-action (x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)} nor $2=substr($2,1,11) works in conjunction with the movefield action. Note that the record where the movefield field appears has no date field.
Please keep in mind that the awk script is meant to be used on a bunch of files (in a loop).
With GNU awk for inplace editing, no loop required:
awk -i inplace '
BEGIN { OFS=FS=";" ; print "colnum","date","col3","col4","col5" }
FNR==1 { next }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""; print; next }
{ sub(/ .*/,"",$2); gsub(/-/,".",$2); print $0, ""}
' file*
Another in GNU awk:
$ awk '
function refmt(str) { # reformat date for comparing
split(str,d,"[ :-]")
return mktime(d[3] " " d[2] " " d[1] " " d[4] " " d[5] " " d[6])
}
BEGIN {
FS=OFS=";"
start=refmt("10-12-2014 06:47:59") # reformat the threshold date
print "colnum","date","col3","col4" # print header (why 5?)
}
refmt($2)>start || $2=="" { # if date > start or empty
sub(/ .*/,"",$2) # delete time part
gsub(/-/,".",$2) # replace - by .
$4=$3; $3="" # or $3 = OFS $3
print # output
}' file
colnum;date;col3;col4
2;12.12.2014;;
3;12.12.2014;;
4;14.12.2014;;
5;;;movefield
6;16.12.2014;;
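For the two-digit year shown in the needed output, a portable sketch built on the p flag from the question might look like the following. It assumes the marker is the line containing 10-12-2014 06:47:59 and that dates are always DD-MM-YYYY; the reformat function name and the shortened sample input are invented here. It does not pad short records with extra empty fields.

```shell
# Hypothetical helper: skip everything up to and including the marker line,
# then shorten dates to DD.MM.YY and move "movefield" from $3 to $4.
reformat() {
    awk 'BEGIN { FS = OFS = ";" }
    p {                                   # true only after the marker line was seen
        if ($3 == "movefield") { $4 = $3; $5 = "moved"; $3 = "" }
        else if (split($2, d, /[- ]/) >= 3)
            $2 = d[1] "." d[2] "." substr(d[3], 3)   # DD-MM-YYYY hh:mm:ss -> DD.MM.YY
        print
    }
    /10-12-2014 06:47:59/ { p = 1 }       # the marker line itself is not printed
    '
}

printf '%s\n' \
    '0;skipped;' \
    '1;10-12-2014 06:47:59;' \
    '2;12-12-2014 10:17:44;' \
    '5;;movefield' \
    '6;16-12-2014 04:15:39;' | reformat
```

Because the flag is set after the print rule, the marker record is consumed without being output. If several files are passed in one call, resetting the flag with FNR == 1 { p = 0 } as in the question keeps them independent.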

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt separates the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
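To make the pre-processing step concrete, this sketch shows what the first input looks like for the sample data. The temporary file and the counts variable exist only for the demo; LC_ALL=C is added so the sort order is predictable, and the trailing awk just trims the padding that uniq -c puts before each count.

```shell
# Demo only: write the sample data to a temporary file.
input=$(mktemp)
printf '%s\n' \
    'TargetIDs,CPD,Value,SMILES' \
    '95,CPD-1111111,-2,c1ccccc1' \
    '95,CPD-2222222,-3,c1ccccc1' \
    '95,CPD-2222222,-4,c1ccccc1' \
    '95,CPD-3333333,-1,c1ccccc1N' > "$input"

# Count how often each value of column 4 occurs (one "count value" pair per line).
counts=$(cut -d, -f4 "$input" | LC_ALL=C sort | uniq -c | awk '{ print $1, $2 }')
printf '%s\n' "$counts"

rm -f "$input"
```

Only c1ccccc1 has a count above 1, so its three lines go to dup and the remaining data line goes to nondup.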
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another, built similarly to glenn jackman's but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file

missing field and extra space after using for loop in awk

I need to use an awk script to extract some information from a file.
I have a title line which has 11 fields, and I split it into an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a matching line I need to print its fields, each preceded by its title. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has two problems:
first, there is an extra space between each title and its value, and second, the last field of the matched line is missing:
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there a problem with the \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands these line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk (for the multi-character RS). Also, split() splits on FS by default, so there is no need to pass it as an argument; it's common to use "next" rather than specifying opposite conditions; and it's common to use print "" instead of printf "\n" so that you use the ORS setting rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
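If gawk is not available, a portable variation (not taken from the answers above) is to strip the carriage return inside awk instead of changing RS. The label_fields name and the shortened sample are made up for this sketch:

```shell
# Hypothetical helper: pair each field with its title from the first line,
# tolerating Windows (CRLF) line endings.
label_fields() {
    awk -F: '
    { sub(/\r$/, "") }                       # drop a trailing carriage return, if any
    NR == 1 { split($0, title); next }       # first line holds the titles
    {
        for (i = 1; i <= NF; i++)
            printf "%s:%s%s", title[i], $i, (i < NF ? " " : "\n")
    }'
}

printf 'Student Number:Name:Lab1\r\n92839342:Robert Bloomingdale:9\r\n' | label_fields
```

Modifying $0 with sub() makes awk re-split the record, so the last field no longer carries the stray \r. Choosing the separator per field also avoids the trailing space after the last value.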

changing the appearance of awk output

I used the following code to extract protein residues from text files.
awk '{
if (FNR == 1 ) print ">" FILENAME
if ($5 == 1 && $4 > 30) {
printf $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt
I got the following output when I used the above code.
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR>1axc
RQTSMTDFYHSKRRLIFS>1bxc
RQTSMTDFYHSKRRLIFSPRR>1axF
RQTSMTDFYHSKRR>1qqt
ARPYQGVRVKEPVKELLRRKRG
I would like to get the output as shown below. How do I change the above code to get the following output?
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR
>1axc
RQTSMTDFYHSKRRLIFS
>1bxc
RQTSMTDFYHSKRRLIFSPRR
>1axF
RQTSMTDFYHSKRR
>1qqt
ARPYQGVRVKEPVKELLRRKRG
This might work for you:
awk '{
if (FNR == 1 ) print newline ">" FILENAME
if ($5 == 1 && $4 > 30) {
newline="\n";
printf "%s", $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt
With gawk version 4, you can write:
gawk '
BEGINFILE {print ">" FILENAME}
($5 == 1 && $4 > 30) {printf "%s", $3}
ENDFILE {print ""}
' filename ...
http://www.gnu.org/software/gawk/manual/html_node/BEGINFILE_002fENDFILE.html#BEGINFILE_002fENDFILE
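For awks without BEGINFILE/ENDFILE, roughly the same per-file behaviour can be sketched with a flag that remembers whether a sequence line is still open. The to_fasta name, the two tiny input files and their column layout are invented for the demo; only the $3/$4/$5 tests come from the question.

```shell
# Hypothetical helper: one ">FILENAME" header per file, residues on the next line.
to_fasta() {
    awk '
    FNR == 1 {
        if (pending) print ""        # close the previous sequence line
        print ">" FILENAME
        pending = 0
    }
    $5 == 1 && $4 > 30 { printf "%s", $3; pending = 1 }
    END { if (pending) print "" }
    ' "$@"
}

# Demo only: two small made-up input files.
dir=$(mktemp -d)
printf 'x x M 31 1\nx x D 40 1\nx x Q 10 1\n' > "$dir/1abd"
printf 'x x R 35 1\n' > "$dir/1axc"
fasta=$(cd "$dir" && to_fasta 1abd 1axc)
printf '%s\n' "$fasta"
rm -rf "$dir"
```

Unlike BEGINFILE, the FNR == 1 rule never fires for an empty file, so such a file gets no header in this version.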

Convert scientific notation to decimal in multiple fields

The following works great on my data in column 12 but I have over 70 columns that are not all the same and I need to output all of the columns, the converted ones replacing the scientific values.
awk -F',' '{printf "%.41f\n", $12}' $file
Thanks
This is one line..
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,2.600000000000000e+001,4.152327506554059e+001,-7.893523806678388e+001,5.447572631835938e+002,2.093000000000000e+003,5.295000000000000e+003,1,194733,1.647400093078613e+001,31047680,1152540,29895140,4738,1.586914062500000e+000,-1.150000000000000e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,3.606000000000000e+003,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,4.557073364257813e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,11,0.000000000000000e+000,2.000000000000000e+000,0,0,0,0,4.466836981009692e-004,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
UPDATE
This is working for the non converted output. I am having a bit of trouble inserting an else if statement for some reason. Everything seems to give me a syntax error in a file or on cli.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i == 19||i == 20||i == 21||i == 22|| i == 40|| i == 43||i == 44||i == 45||i == 46||i >= 51) printf ($i",")};}' $file
I would like to insert the following statement into the code above??
else if (i == 10) printf ("%.41f", $i)
SOLVED
Got it worked out. Thanks for all the great ideas. I can't seem to make it work in a file with awk -f but on the command line this is working great. I put this one liner in my program.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i >= 19&&i <= 22|| i == 40|| i >= 43&&i <= 46||i >= 51&&i <= 70) printf($i","); else if (i == 10||i == 18) printf("%.2f,", $i); else if (i == 11||i == 12) printf("%.41f,", $i); else if (i == 13) printf("%.1f,", $i); else if (i == 14||i == 15||i >= 24&&i <= 46) printf ("%d,", $i); else if (i == 23) printf("%.4f,", $i); else if (i >= 47&&i <= 50) printf("%.6f,", $i); if (i == 71) printf ($i"\n")};}'
RESULT
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,26.00,41.52327506554058800247730687260627746582031,-78.93523806678388154978165403008460998535156,544.8,2093,5295,1,194733,16.47,31047680,1152540,29895140,4738,1.5869,-115,0,0,0,0,0,0,0,3606,0,0,0,455,0,0,0,11,0,2,0,0,0,0,0.000447,0.000000,0.000000,0.000000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
You can do regex matching in a loop to choose the format for each field since numbers are also strings in AWK:
#!/usr/bin/awk -f
BEGIN {
d = "[[:digit:]]"
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
if ($i ~ d "e[+-]" d d d "$") {
printf "%s%.41f", delim, $i
}
else {
printf "%s%s", delim, $i
}
delim = OFS
}
printf "\n"
}
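A small self-contained sketch of the same idea, for trying it out on a short line. It uses a stricter pattern that accepts either sign and any exponent length, and %.6f instead of %.41f to keep the output short; both are choices made for this demo, and the sci_to_fixed name is made up:

```shell
# Hypothetical helper: rewrite fields that look like scientific notation
# as fixed-point numbers; leave all other fields untouched.
sci_to_fixed() {
    awk 'BEGIN { FS = OFS = "," }
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[-+]?[0-9]+(\.[0-9]*)?[eE][-+]?[0-9]+$/)
                $i = sprintf("%.6f", $i)     # assigning to $i rebuilds $0 with OFS
        print
    }'
}

printf '%s\n' '1817,2.600000000000000e+001,-7.893523806678388e+001,abc,4.466836981009692e-004' | sci_to_fixed
```

Assigning the converted value back to the field and printing the whole record once avoids having to track the delimiter by hand.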
Edit:
I've changed the version above so you can see how it would be used in a file as an AWK script. Save it (I'll call it "scinote") and make it executable with chmod u+x scinote; then you can run it like this: ./scinote inputfile
I've also modified the latest version you added to your question to make it a little simpler and so it's ready to go into a script file as above.
#!/usr/bin/awk -f
BEGIN {
plainlist = "16 17 19 20 21 22 40 43 44 45 46"
split(plainlist, arr)
for (i in arr) {
plainfmt[arr[i]] = "%s"
}
OFS = FS = ","
}
{
delim = ""
for (i = 1; i <= NF; i++) {
printf "%s", delim
if (i <= 9 || i in plainfmt || i >= 51) {
printf "%s", $i # print the field unchanged
}
else if (i == 10) {
printf "%.41f", $i
}
else if (i == 12) {
printf "%.12f", $i
}
delim = OFS
}
printf "\n"
}
If you had more fields with other formats (rather than just one per), you could do something similar to the plainfmt array.
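One way to sketch that idea: keep the printf format for each special field in an array indexed by field number, and print every other field unchanged. The field numbers and formats below are placeholders, not the asker's real layout, and format_by_field is a made-up name:

```shell
# Hypothetical helper: apply a per-field format where one is defined.
format_by_field() {
    awk 'BEGIN {
        FS = OFS = ","
        fmt[2] = "%.2f"      # placeholder: field 2 with two decimals
        fmt[3] = "%d"        # placeholder: field 3 as an integer
    }
    {
        for (i = 1; i <= NF; i++)
            if (i in fmt)
                $i = sprintf(fmt[i], $i)
        print
    }'
}

printf '%s\n' 'a,2.600000000000000e+001,2.093000000000000e+003,x' | format_by_field
```

Adding a field then only means adding one fmt[...] line rather than another else-if branch.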
You could always loop through all of your data fields and use them in your printf. For a simple file just to test the mechanics you could try this:
awk '{for (i=1; i<=NF; i++) printf("%d = %s\n", i, $i);}' data.txt
Note that -F is not set here, so fields will be split on whitespace.
NF is the predefined variable for number of fields on a line, fields start with 1 (e.g., $1, $2, etc until $NF). $0 is the whole line.
So for your example this may work:
awk -F',' '{for (i=1; i<=NF; i++) printf "%.41f\n", $i}' $file
Update based on the comment below (I'm not on a system where I can test the syntax):
If you have certain fields that need to be treated differently, you may have to resort to a switch statement or an if-statement to treat different fields differently. This would be easier if you stored your script in a file, let's call it so.awk and invoked it like this:
awk -f so.awk $file
Your script might contain something along these lines:
BEGIN { FS = "," }
{ for (i=1; i<=NF; i++)
{
if (i == 20 || i == 22 || i == 30)
printf(" .. ", $i) # " .. " stands for the format these fields need
else if (i == 13 || i == 24)
printf(" ....", $i) # likewise a placeholder format
# etc.
}
}
You can of course also use if (i > 2) ... or other ranges to avoid having to list out every single field if possible.
As an alternative to this series of if-statements, see the switch statement mentioned above (a gawk extension).