awk printing lines in different order to original file - awk

I have a csv file with 6 columns. Col3 is an ID, and Col4 is a count.
I want to print Col3, and then convert Col4 to a frequency.
Col1,Col2,Col3,Col4,Col5,Col6
9,19,9,7,9,6
10,132,10,131,10,65
10.3,0,10.3,0,10.3,1
11,128,11,182,11,82
My command
awk -F"," '{if (NR!=1) f[$3] = $4; SUM += $4} END { for (i in f) { print i, f[i]/SUM } }' myfile.csv > myoutfile.txt
Unexpectedly, it's printing the output lines in the wrong order: 10.3 comes before 10.
Is there a way to fix this?
9,0.021875
10.3,0
10,0.409375
11,0.56875

Here is one way using awk:
awk 'BEGIN{FS=OFS=","}FNR==1{next}NR==FNR{sum+=$4;next}{print $3,(sum>0?$4/sum:0)}' file file
9,0.021875
10,0.409375
10.3,0
11,0.56875
This makes two passes over the file. In both passes, FNR==1{next} skips the header line. In the first pass (when NR==FNR), we keep adding the column 4 values to a variable sum. In the second pass, we print the 3rd column along with the frequency (the 4th column divided by sum).
Notice that the file is named twice (file file) because of the two passes. You can also use brace expansion and write file{,}.
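The root cause of the original problem is that for (i in f) iterates in an unspecified, implementation-dependent order. If you want the output in the file's original order, one alternative to the two-pass approach is a single pass that records the order in which each ID is first seen (a sketch; the order/n variable names are mine):

```shell
awk 'BEGIN { FS = OFS = "," }
     NR == 1 { next }                   # skip the header
     !($3 in f) { order[++n] = $3 }     # remember first-seen order of IDs
     { f[$3] = $4; sum += $4 }          # store count, accumulate total
     END { for (i = 1; i <= n; i++) print order[i], f[order[i]]/sum }
' myfile.csv
```

This prints the IDs in the order they appear in the file, so 10 comes before 10.3 as in the input.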

Related

awk script for counting records based on multiple conditions

I have a file with 3 columns as shown below
col1,col2
a,x,1
b,y,2
a,x,0
b,x,2
b,y,0
a,y,0
I am working on an awk script to get the following output (grouped by col1 and col2, with counts of total, condition 1, and condition 2):
col1,col2,total count,count where col3=0, count where col3>0
a,x,2,1,1
a,y,1,1,0
b,x,1,0,1
b,y,2,1,1
I worked out a script to get all 3 separately by using the following command:
for case 3 (col3>0):
awk -F',' '($3>0)NR>1{arr[$1","$2]++}END{for (a in arr) print a, arr[a]}' file
Similar commands work for the other cases as well.
I am unable to create a command/script that solves all 3 cases in the same pass.
Any help is appreciated.
P.S.: This sample file is small, so I could run 3 scripts/commands and join the results, but the real file is too big to run the same thing 3 times.
Here's one:
$ awk '
BEGIN {
FS=OFS="," # field separators
}
NR>1 { # after header
k=$1 OFS $2 # set the key
a[k]++ # total count of unique $1 $2
b[k]+=($3==0) # count where $3==0
c[k]+=($3>0) # count where $3>0
}
END { # after all processing is done
for(i in a) # output values
print i,a[i],b[i],c[i]
}' file
Output (in random order, but you can fix that with @Inian's tip in the comments):
a,y,1,1,0
b,x,1,0,1
b,y,2,1,1
a,x,2,1,1
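Since the sample keys sort cleanly, one portable way to get deterministic order is to pipe the unordered END output through sort (a sketch; with GNU awk you could instead control the for-in order via PROCINFO["sorted_in"]):

```shell
awk 'BEGIN { FS = OFS = "," }
     NR > 1 {
       k = $1 OFS $2      # group by col1,col2
       a[k]++             # total count
       b[k] += ($3 == 0)  # count where $3 == 0
       c[k] += ($3 > 0)   # count where $3 > 0
     }
     END { for (i in a) print i, a[i], b[i], c[i] }
' file | sort
```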

Duplicate Lines 2 times and transpose from row to column

I would like to duplicate each line and print the values of columns 5 and 6 on separate lines (i.e. transpose the values of columns 5 and 6 from columns to rows):
the value of column 5 on the first line, and the value of column 6 on the second line.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to duplicate the lines 2 times, but I can't figure out how to handle the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of the line once for each of the last N fields, where N is 2 in this case:
$ awk -v n=2 '
BEGIN { FS=OFS="," }
{
base = $0
sub("("FS"[^"FS"]+){"n"}$","",base)
for (i=NF-n+1; i<=NF; i++) {
print base, $i
}
}
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case, where the last field has to be dropped and then printed on a line of its own, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
x = $6 # remember sixth field
NF = 5 # Set field number to 5, so the last one won't be printed
print # print those first five fields
$5 = x # replace value of fifth field with remembered value of sixth
print # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question.
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
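For example, the separator-only case mentioned above can be sketched like this; assigning $1 = $1 forces $0 to be rebuilt with the new OFS:

```shell
# Change the field separator from comma to semicolon without touching the data.
printf 'a,b,c\n' | awk -F',' -v OFS=';' '{ $1 = $1; print }'
# prints: a;b;c
```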
Could you please try the following (assuming your Input_file is always in the format shown, and you need to print the first four fields every time, followed by each of the remaining fields in turn):
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
This replaces the last comma with a newline followed by a duplicate of all the fields up to and including the penultimate comma, so the line is printed once ending in the fifth field and once ending in the sixth.

awk to store field length in variable then use in print

In the awk below I am trying to store the length of $5 in a variable il when the condition is met (it is, in both lines), and then add that variable to $3 in the print statement. The two sub statements remove the match from both $5 and $6. The script as-is executes and produces the current output below, but il does not seem to be populated and added in the print. It seems close, but I'm not sure why the variable isn't being stored. Thank you :)
awk
awk 'BEGIN{FS=OFS="\t"} # define fs and output
FNR==NR{ # process each field in each line of file
if(length($5) < length($6)) { # condition
il=$(length($5))
echo $il
sub($5,"",$6) && sub($6,"",$5) # removing matching
print $1,$2,$3+$il,$3+$il,"-",$6 # print desired output
next
}
}' in
in tab-delimited
id1 1 116268178 GAAA GAAAA
id2 2 228197304 A AATCC
current output tab-delimited
id1 1 116268178 116268178 - A
id2 2 228197304 228197304 - ATCC
desired output tab-delimited
since the length of `$5` is 4 in line 1, that is added to `$3`
since the length of `$5` is 1 in line 2, that is added to `$3`
id1 1 116268181 116268181 - A
id2 2 228197305 228197305 - ATCC
The following awk may help you here. (As to why your variable wasn't stored: il=$(length($5)) is a field reference, so it stores the contents of the field whose number is length($5), not the length itself; use il=length($5). Also, echo is shell syntax, not an awk print statement.)
awk '{$3+=length($4);$3=$3 OFS $3;sub($4,"",$5);$4="-"} 1' Input_file
Please add BEGIN{FS=OFS="\t"} in case your Input_file is TAB delimited and you require output in TAB delimited form too.
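Put together for tab-delimited input, the combined command would look like this, shown against the sample input (the position field is advanced by length($4), per the arithmetic in the one-liner above):

```shell
printf 'id1\t1\t116268178\tGAAA\tGAAAA\nid2\t2\t228197304\tA\tAATCC\n' |
awk 'BEGIN { FS = OFS = "\t" }
     { $3 += length($4)   # advance position by the length of $4
       $3 = $3 OFS $3     # duplicate the position field
       sub($4, "", $5)    # strip the matching part from $5
       $4 = "-"           # replace $4 with a dash
     } 1'
```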

Sum values for similar lines using awk

From the example below, I want to sum the scores for the rows where Target and miRNA are the same. Please see below:
Target miRNA Score
NM_198900 hsa-miR-423-5p -0.244
NM_198900 hsa-miR-423-5p -0.6112
NM_1989230 hsa-miR-413-5p -0.644
NM_1989230 hsa-miR-413-5p -0.912
Output:
NM_198900 hsa-miR-423-5p -0.8552
NM_1989230 hsa-miR-413-5p -1.556
Like this:
awk '{x[$1 " " $2]+=$3} END{for (r in x)print r,x[r]}' file
As it sees each line, it adds the third field ($3) into an array x[] as indexed by joining fields 1 and 2 with a space between them. At the end, it prints all elements of x[].
Following @jaypal's suggestion, you may prefer this, which retains your header line (NR==1) and uses TABs as the Output Field Separator:
awk 'NR==1{OFS="\t";print;next} {x[$1 OFS $2]+=$3} END{for (r in x)print r,x[r]}' file

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tab. I am trying to print all columns except the first one but want to print them all in only one column with AWK. The format of the file is
col 1 col 2 ... col n
Each row has at least 2 columns.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print column 2 through column n (for n greater than 2), without printing blank lines when there is no info in column n of a row, all in one column, like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this on the command line, and it works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
It is more a case of seeking approval than asking a question: is this the right way of doing something like this in AWK, or is there a better/shorter way of doing it?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several predefined awk variables; it holds the number of fields on the current input line. For instance, it is useful when you always want to print the last field of a line (print $NF), or when you want to iterate through all or part of the fields on a line.
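For instance, printing only the last field of each line, whatever the field count:

```shell
# $NF is the last field, regardless of how many fields the line has.
printf '2012029754 901749095\n2012028240 901744459 258789\n' |
awk '{ print $NF }'
```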
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that with -s, this avoids printing blank lines as stated in the original problem.