I have an input file that contains, per row, a value and two weights.
I would like to generate two output files - where the value in the first column is repeated once per line, according to the weights. This is probably best explained with a short example. If the input file is:
file.in:
35 2 0
37 2 3
38 0 4
Then I would like to generate two output files:
file.out1:
35
35
37
37
file.out2:
37
37
37
38
38
38
38
I will then use these output files to calculate the average and median of the first column according to the weights in the second and third columns.
This is pretty easy in awk.
awk '{for(i=0;i<$2;i++) print $1;}' file.in > file.out1
generates the first file, and
awk '{for(i=0;i<$3;i++) print $1;}' file.in > file.out2
generates the second.
It is not clear from your question whether you know how to compute the mean and median from these files; it seems you just wanted to create the output files. Let me know if the rest is giving you trouble, or if the above scripts are not clear (I think they are pretty self-explanatory).
If I understood correctly, you need the average and median of the first column (the commands below do not take the weights into account).
Average:
awk '{a+=$1}END{print a/NR}' file.in
36.6667
Median:
awk '{print $1}' file.in | sort -n | awk '{a[NR]=$1}END{ b=NR/2; b=b%1?int(b)+1:b; print a[b] }'
37
Explanation:
Put simply, NR is a variable that holds the number of lines read so far (in the END block, the total number of lines). For the average you want the sum of every line divided by the number of lines.
For the median you want your input sorted and then pick the middle value. It's not quite that simple for your input: if you divide the number of lines, which is 3, by 2, you get 1.5, so you need a ceiling function, which awk doesn't have. I emulate it with b=NR/2; b=b%1?int(b)+1:b;
I hope this helps.
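If the intermediate files are only a stepping stone to the weighted statistics, they can be skipped altogether. The following is a minimal sketch, assuming the three-column layout of file.in from the question and non-negative integer weights. The function name weighted_stats is made up for this example, and for an even total weight the median is taken as the mean of the two middle values, which is only one of several conventions.

```sh
# weighted_stats COLUMN: read "value weight weight" rows on stdin and print
# the weighted mean and median of column 1, using the weights in COLUMN.
# (Hypothetical helper, not part of any answer above.)
weighted_stats() {
    sort -n -k1,1 | awk -v w="$1" '
        $w > 0 { val[++n] = $1; wt[n] = $w; total += $w; sum += $1 * $w }
        END {
            if (total == 0) { print "no data"; exit }
            # 1-based positions of the middle element(s) in the expanded list
            lo = int((total + 1) / 2); hi = int(total / 2) + 1
            for (i = 1; i <= n; i++) {
                seen += wt[i]
                if (!gotlo && seen >= lo) { mlo = val[i]; gotlo = 1 }
                if (seen >= hi) { mhi = val[i]; break }
            }
            printf "mean %g median %g\n", sum / total, (mlo + mhi) / 2
        }'
}

# Sample data from the question
printf '35 2 0\n37 2 3\n38 0 4\n' | weighted_stats 2
printf '35 2 0\n37 2 3\n38 0 4\n' | weighted_stats 3
```

For the sample data this prints "mean 36 median 36" for the second column and "mean 37.5714 median 38" for the third, which is what you would get from file.out1 and file.out2.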
Related
I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool),
I want to extract only the rows whose SRMAPQ is over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Update: the SRMAPQ entry can be anywhere in the line, for example:
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk '{ v = 0 } match($0, /SRMAPQ=[0-9]+/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) + 0 } v > 60' file
I would use GNU AWK in the following way. Let the content of file.txt be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING=2 to test that it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so whatever is before it becomes the first field ($1) and whatever is after it becomes the second field ($2). In the 2nd line $2 is 67;SOMETHING=2, which does not look like a number, so a bare $2>60 would compare it as a string. Adding 0 forces a numeric conversion, which uses the longest prefix that constitutes a number, in this case 67. The other lines have plain numbers. Disclaimer: this solution assumes that every entry except the last has a trailing ;. If this does not hold true, please test my solution fully before use.
(tested in gawk 4.2.1)
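Another option is to walk the key=value pairs explicitly, so that only the exact key SRMAPQ is considered, wherever it sits in the list. This is a minimal sketch, assuming the pairs are ;-separated and form the last whitespace-separated field, as in the samples above. The function name filter_srmapq and its threshold argument are made up for this example.

```sh
# filter_srmapq THRESHOLD: print the stdin lines whose SRMAPQ value
# is greater than THRESHOLD. (Hypothetical helper for illustration.)
filter_srmapq() {
    awk -v min="$1" '{
        n = split($NF, pair, ";")            # key=value pairs of the last field
        for (i = 1; i <= n; i++) {
            split(pair[i], kv, "=")
            # adding 0 forces a numeric comparison
            if (kv[1] == "SRMAPQ" && kv[2] + 0 > min + 0) { print; break }
        }
    }'
}

printf '%s\n' \
    '1 3 MAPQ=0;CT=3to5;SRMAPQ=60' \
    '2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2' \
    '4 56 MAPQ=67;CT=3to5;SRMAPQ=50' \
    '5 7 MAPQ=44;SRMAPQ=61;CT=3to5' | filter_srmapq 60
```

With this input the second and fourth lines are printed. The third line is rejected even though it contains MAPQ=67, because the key is compared as a whole rather than matched as a substring.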
How do I get AWK to correctly exchange the values of data file for the string "greater" (than before) or "smaller" (than before)"?
I'm trying:
awk '{if ($1>prev); print ($1="greater"); prev="smaller"}' arraydatafile
arraydatafile file:
2
7
6
1
7
3
Desired output:
smaller
greater ##### because 7 is greater than previous which is 2 ..
smaller ##### because 6 is smaller than previous which is 7 ..
smaller ##### because 1 is smaller than previous which is 6 ..
greater ##### etc etc
smaller
I get a garbled thing instead.
Extremely grateful for your insight on this.
UPDATE:
There's a new data file to clarify the task at hand:
arraydatafile:
2
7
6
6
1
7
3
desired output
smaller
greater
smaller
equal ##### will remove it later; it would be nice if this could be done right from the script, preferably a one-liner
smaller
greater
smaller
How do I get a one-liner that does this in AWK? It just compares each value to the previous one and reports how they relate. I'll then delete the "equal" output lines with another one-liner instead of complicating this little script, to keep the current task simple.
The script replace.awk:
#!/bin/awk -f
NR>1 {
if($1==p){
# Skip identical lines
next
}
if($1>p){
print "smaller"
}else{
print "greater"
}
}
# Store previous value
{p=$1}
run on the arraydatafile
2
7
6
6
1
7
3
yields the following
root#debian:/home/user/Documents/# awk -f replace.awk arraydatafile
smaller
greater
greater
smaller
greater
How do I get the desired output instead?
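Like this (the two labels were swapped; the first line prints nothing because it has no previous value):
NR>1 {
if($1==p){
# Skip identical lines
next
}
if($1>p){
# Current value is larger than the previous one
print "greater"
}else{
print "smaller"
}
}
# Store previous value
{p=$1}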
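For the updated data file, which also expects a label for the first line and an "equal" label, a compact variant could look like the sketch below. It assumes one number per line. The function name trend is made up for this example, and labelling the first line "smaller" is taken purely from the desired output in the question.

```sh
# trend: label each number on stdin relative to the previous one.
# (Hypothetical helper for illustration.)
trend() {
    awk 'NR == 1    { print "smaller"; prev = $1; next }   # first line: label assumed from the desired output
         $1 > prev  { print "greater" }
         $1 < prev  { print "smaller" }
         $1 == prev { print "equal" }
                    { prev = $1 }'
}

printf '2\n7\n6\n6\n1\n7\n3\n' | trend
```

This prints smaller, greater, smaller, equal, smaller, greater, smaller, one per line. To drop the "equal" lines right away, delete the rule that prints "equal".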
Is there a way to print the output of an awk script to an existing file as a new field every time?
Hi!
I'm very new at awk (so my terminology might not be correct, sorry about that!) and I'm trying to print the output of a script that will operate on several hundred files to the same file, in different fields.
For example, my data files have this structure:
#File1
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
#File2
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
I have several of these files, which are divided by numbered month, with the same names, but different values. I want files that are named by the name of the person, and the values in fields by month, like so:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56
In my script, I search for the word "values", and determine which records to print (based on the number after "value"). This works fine. Then I want to print these values. It works fine for one file, with the command:
print $0 > name # the variable name holds $3 of the matching row
This creates three files correctly named "Hanna", "Josh" and "Anna", with their values. However, I would like to run the script for all my datafiles, and append them to only one "Hanna"-file etc, in a new field.
So what I'm looking for is something like print $0 > $month name, reading out like "print the record to the field corresponding to the month"
I have tried to find a solution, but most solutions either just paste temporary files together or append the values after the existing ones (so that they all are in field 1). I want to avoid the temporary files and have them in different fields (so that I get a kind of matrix-structure).
Thank you in advance!
Try the following. I have not checked every possible combination and only considered the sample in your post. Also, the Josh column in your expected output did not look consistent to me (please let us know if there are more conditions behind it). Let me know how it goes.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
EDIT: Adding a non-one liner form of solution too.
awk 'FNR==NR{
if($0 ~ /^Values/){
Q=$NF;
B[$NF]=$NF;
i="";
next
};
A[Q,++i]=$0;
next
}
/^Values/{
V=$NF;
print "#"B[V];
i="";
next
}
B[V]{
print A[V,++i],$0
}
' file1 file2
EDIT2: Adding explanation too now for same.
awk 'FNR==NR{ ###This condition is TRUE only while the first file, file1, is being read. FNR and NR both count input lines; FNR is reset to 1 whenever the next Input_file starts, while NR keeps increasing until all Input_files are read.
if($0 ~ /^Values/){ ###Checking here if any line starts from string Values if yes then perform following operations.
Q=$NF; ###Creating a variable named Q whose value is the last field of the line.
B[$NF]=$NF;###Creating an array named B whose index is $NF(last field of the line) and value is same too.
i=""; ###Making variable i value to NULL now.
next ###using next here, it is built-in keyword for awk and it will skip all further statements now.
};
A[Q,++i]=$0; ###Creating an array named A whose index is Q and variable i with increasing value with 1 to it, each time it comes on this statement.
next ###Using next will skip all further statements now.
}
/^Values/{ ###All statements from here will be executed when second file named file2 is being read. So I am checking here if a line starts from string Values then do following.
V=$NF; ###create variable V whose value is $NF of current line.
print "#"B[V]; ###printing the string # then value of array B whose index is variable V.
i=""; ###Nullifying the variable i value here.
next ###next will skip all the further statements now.
}
B[V]{ ###Checking here if array B with index V is having a value in it, then perform following on it too.
print A[V,++i],$0 ###printing the value of array A whose index is variable V and variable i increasing value with 1 and current line.
}
' file1 file2 ###Mentioning the Input_files here named file1 and file2.
I'm having trouble comparing each line with the next one: if the number on the current line is greater than the number on the next line, it should print something like "the number 53 is greater than 23", and then it should compare the next two lines: "the number 54 is less than 76". I was thinking of something along the lines of NR%2, but I'm not sure what to do after that. Any hints or suggestions on how this could be done would be greatly appreciated, thanks.
An example of this file is:
53
23
54
76
12
42
Expected outcome
the number 53 is greater than 23
the number 54 is less than 76
the number 12 is less than 42
this would be what you want:
awk '!(NR%2){print p>=$0?p">="$0:p"<"$0;next}{p=$0}' file
output:
53>=23
54<76
12<42
output with your new input file:
53>=23
54<76
12<42
43>=4
1<63
34<56
You can adjust the text ("greater/less than") and also handle the == case if you want.
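A variant that prints the exact sentences from the question and treats equal pairs separately might look like this. It is a minimal sketch, assuming one number per line; the function name compare_pairs and the "equal to" wording are made up for this example.

```sh
# compare_pairs: compare lines 1 and 2, 3 and 4, ... of stdin.
# (Hypothetical helper for illustration.)
compare_pairs() {
    awk 'NR % 2 { first = $1; next }      # odd line: remember it and wait for its partner
         {
             rel = ($1 < first) ? "greater than" : (($1 > first) ? "less than" : "equal to")
             print "the number", first, "is", rel, $1
         }'
}

printf '53\n23\n54\n76\n12\n42\n' | compare_pairs
```

A trailing line without a partner (an odd number of input lines) is silently ignored.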
Just for fun here is one way of doing it with coreutils, bc and sed:
<infile paste -d' ' - - <( <infile paste -d'<' - - | bc ) |
sed -E 's/1$/less/; s/0$/greater/; s/([0-9]+) ([0-9]+) (.*)/the number \1 is \3 than \2/'
Output:
the number 53 is greater than 23
the number 54 is less than 76
the number 12 is less than 42
Explanation
The inner paste pipes n1<n2 to bc, which returns a binary vector. The outer paste columnates this vector with pairs of numbers from the input. sed reorganizes its input based on the binary vector.
So if you were only interested in knowing if pairs of lines are greater or less than each other this bit would be enough:
<infile paste -d'<' - - | bc
Output:
0
1
1
I have a file with the lines as:
5 3 6 4 2 3 5
1 4 3 2 6 5 8
..
I want to get the min on each line, so for example with the input given above, I should get:
min of first line: 2
min of second line: 1
..
How can I use awk to do this for any arbitrary number of columns in each line?
If you don't mind the output using digits instead of words you can use this one liner:
$ awk '{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print "min of line",NR": ",m}' file
min of line 1: 2
min of line 2: 1
If you really do want to count in ordinal numbers:
BEGIN {
split("first second third fourth",count," ")
}
{
min=$1
for(i=1;i<=NF;i++)
if($i<min)
min=$i
print "min of",count[NR],"line: \t",min
}
Save this to script.awk and run like:
$ awk -f script.awk file
min of first line: 2
min of second line: 1
Obviously this will only work for files with up to 4 lines, but you can extend the list of ordinal numbers to the maximum you think you will need. You should be able to find such a list online pretty easily.
Your problem is pretty simple. All you need is a variable min that is reset at each line, plus a simple C-like minimum search: set min to the first field, then compare it with the next field, and so on until you reach the last field of the line, updating min whenever a smaller field is found. The number of fields in the line is available in the variable NF, so it's just a matter of writing a for loop. Once the loop has finished for the line, min holds the minimum element and you can print it.