Finding greater number using awk - awk

I'm having trouble figuring out how to compare each line with the line after it. If the first number of a pair is greater than the second, it should print something like "the number 53 is greater than 23", and then move on to compare the next two lines: "the number 54 is less than 76". I'm thinking of something along the lines of NR%2, but I'm not sure what to do after that. Any hints or suggestions on how this could be done would be greatly appreciated, thanks.
An example of this file is:
53
23
54
76
12
42
Expected outcome
the number 53 is greater than 23
the number 54 is less than 76
the number 12 is less than 42

this would be what you want:
awk '!(NR%2){print p>=$0?p">="$0:p"<"$0;next}{p=$0}' file
output:
53>=23
54<76
12<42
output with your new input file:
53>=23
54<76
12<42
43>=4
1<63
34<56
You can adjust the text ("greater than"/"less than") and also handle the == case if you want.
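For reference, here is a minimal sketch of that adjustment, printing the full sentence from the question and handling the equal case. The function name compare_pairs and the wording "equal to" are my own choices, not something from the question.

```shell
# compare_pairs (hypothetical name): read one number per line and
# compare each odd-numbered line with the even-numbered line after it.
compare_pairs() {
    awk '
        NR % 2 { prev = $0; next }      # odd line: remember it, wait for its partner
        {
            rel = "equal to"            # assumed wording for the == case
            if (prev + 0 > $0 + 0)      rel = "greater than"
            else if (prev + 0 < $0 + 0) rel = "less than"
            printf "the number %s is %s %s\n", prev, rel, $0
        }
    ' "$@"
}

printf '%s\n' 53 23 54 76 12 42 | compare_pairs
```

Adding 0 to both operands forces a numeric comparison even if a line carries stray whitespace. A trailing unpaired line is silently ignored.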

Just for fun here is one way of doing it with coreutils, bc and sed:
<infile paste -d' ' - - <( <infile paste -d'<' - - | bc ) |
sed -E 's/1$/less/; s/0$/greater/; s/([0-9]+) ([0-9]+) (.*)/the number \1 is \3 than \2/'
Output:
the number 53 is greater than 23
the number 54 is less than 76
the number 12 is less than 42
Explanation
The inner paste pipes n1<n2 to bc, which returns a binary vector (1 where the comparison is true, 0 where it is false). The outer paste columnates this vector with the pairs of numbers from the input, and sed rearranges each line based on that value.
So if you were only interested in knowing if pairs of lines are greater or less than each other this bit would be enough:
<infile paste -d'<' - - | bc
Output:
0
1
1

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk '{ v = 0 } match($0, /SRMAPQ=[0-9]+/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) + 0 } v > 60' file
I would use GNU AWK in the following way. Let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test whether it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so what is before it becomes the first field ($1) and what is after it becomes the second field ($2). In the 2nd line $2 is 67;SOMETHING=2; adding 0 forces a numeric conversion, which uses the longest leading prefix that forms a number, in this case 67. Without +0 such a field does not look numeric, so awk would compare it to 60 as a string. The other lines have just numbers. Disclaimer: this solution assumes that all but the last field have a trailing ;. If this does not hold true, please test my solution fully before usage.
(tested in gawk 4.2.1)
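As a more explicit alternative, the sketch below splits the annotation column on ; and compares the SRMAPQ value numerically. It assumes the key=value list is the last whitespace-separated field, as in the samples; the function name srmapq_over and the threshold argument are hypothetical.

```shell
# srmapq_over (hypothetical name): print rows whose SRMAPQ exceeds a threshold.
# $1 = threshold; remaining arguments = input files (stdin if none).
srmapq_over() {
    limit=$1; shift
    awk -v limit="$limit" '
        {
            # Assumption: the key=value list is the last field of the row.
            n = split($NF, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SRMAPQ=[0-9]+$/) {
                    sub(/^SRMAPQ=/, "", kv[i])          # keep only the digits
                    if (kv[i] + 0 > limit + 0) print    # +0 forces a numeric comparison
                    break
                }
        }
    ' "$@"
}

printf '%s\n' '1 3 MAPQ=0;CT=3to5;SRMAPQ=60' '5 7 MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5' | srmapq_over 60
```

Because the match is anchored on ^SRMAPQ=, a large MAPQ value elsewhere in the list is not mistaken for SRMAPQ.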

How to replace content matching a specific pattern using sed without losing info?

I have a text file with a bunch of data. I was able to extract exactly what I want using sed, but I need to replace only the specific pattern I searched for, without losing the other content of the file.
I'm using the following sed command, but I don't know how to do the replacement.
cat file.txt | sed -rn '/([a-z0-9]{2}\s){6}/p' > output.txt
The sed command searches for the following pattern: ## ## ## ## ## ##, but I want to replace that pattern like this: ######-######.
Output:
1 | ec eb b8 7b e3 c0 47
9 | 90 20 c2 f6 3d c0 1/1/1
25 | 00 fd 45 3d a7 80 31
Desired Output:
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31
Thanks
With your shown samples please try following awk program.
awk '
BEGIN{ FS=OFS="|" }
{
$2=substr($2,1,3) substr($2,5,2) substr($2,8,2)"-" substr($2,11,2) substr($2,14,2) substr($2,17,2) substr($2,19)
}
1
' Input_file
Explanation: Set FS and OFS to | for the awk program. Then rebuild the 2nd field with awk's substr function, keeping only the needed characters as per the OP's shown samples. substr(string, start, length) returns length characters of string beginning at position start (or everything from start onwards when length is omitted). The result is saved back into the 2nd field, and the trailing 1 prints the current line.
With awk:
awk '{$3=$3$4$5"-"$6$7$8; print $1"\t",$2,$3,$NF}' file
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31
This might work for you (GNU sed):
sed -E 's/ (\S\S) (\S\S) (\S\S)/ \1\2\3/;s//-\1\2\3/' file
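Match three space-separated pairs twice: the first substitution removes the spaces inside the first triple, and the second (reusing the same regex through the empty pattern) replaces the space in front of the second triple with - and removes the spaces inside it.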
If you want to extract specific substrings, you'll need to write a more specific regex to pull out exactly those.
sed -r 's/([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s/\1\2\3-\4\5\6 /' file.txt > output.txt
Notice also how easy it is to avoid the useless cat.
\s is generally not portable; the POSIX equivalent is [[:space:]].
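A sketch of the same substitution written with [[:space:]] instead of \s might look like the following. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the function name join_pairs is only illustrative.

```shell
# join_pairs (hypothetical name): turn "## ## ## ## ## ##" into "######-######"
# and leave every other part of the line, and every other line, untouched.
join_pairs() {
    # The seventh group captures the whitespace after the last pair so it is preserved.
    sed -E 's/([a-z0-9]{2})[[:space:]]([a-z0-9]{2})[[:space:]]([a-z0-9]{2})[[:space:]]([a-z0-9]{2})[[:space:]]([a-z0-9]{2})[[:space:]]([a-z0-9]{2})([[:space:]])/\1\2\3-\4\5\6\7/' "$@"
}

printf '%s\n' '1 | ec eb b8 7b e3 c0 47' | join_pairs
```

Since there is no -n and no p flag, lines that do not contain the pattern are printed unchanged, which is what "without losing info" seems to call for.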

Compare and Replace inline using Awk/Sed command

I have a fixed-length file in which I search for 2 consecutive lines starting with the number 30, then compare the values in positions 183-187 and, if both match, print the line number. I am able to achieve the desired results up to this stage. But I would like to replace the value in that line with spaces without changing the fixed length.
awk '/^30*/{a=substr($0,183,5);getline;b=substr($0,183,5); if(a==b) print NR}' file
Explanation of the command above:
line number start with 30*
assign value to a from position between 183 to 187
get next line
assign value to b from position between 183 to 187
compare a & b - if matches it proves the value between 183 to 187 in 2 consecutive lines which starts with 30.
print line number (this is the 2nd match line number)
Above command is working as expected and printing the line number.
Example Record (just for explanation purpose hence not used fixed length)
10 ABC
20 XXX
30 XYZ
30 XYZ
30 XYZ
30 XYZ
40 YYY
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
With the above command I am able to get line numbers 3 and 4, but I am unable to replace the value in the 4th line with spaces (inline replace) so that the fixed width is not compromised.
Expected Output
10 ABC
20 XXX
30 XYZ
30
30
30
40 YYY
10 ABC
20 XXX
30 XYZ
30
40 YYY
The length of all the above lines should be 255 chars. When the replacement happens it has to be inline, overwriting the existing characters rather than adding new spaces.
Any help will be highly appreciated. Thanks.
I would use GNU AWK and treat every character as a field. Consider the following example; let file.txt content be
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
then
awk 'BEGIN{FPAT=".";OFS=""}prev~/^30*/{a=substr(prev,4,3);b=substr($0,4,3);if(a==b){$4=$5=$6=" "}}{print}{prev=$0}' file.txt
output
10 ABC
20 XXX
30 XYZ
30
40 YYY
Explanation: I chose to store the whole line in a variable called prev rather than use getline, so {prev=$0} is the last action. I set FPAT to . so that every single character is treated as a field, and OFS (output field separator) to the empty string so that no unwanted characters are added when the line is rebuilt. If prev (the previous line, or an empty string for the first line) matches /^30*/, I take characters 4, 5 and 6 of the previous line (prev) into variable a and characters 4, 5 and 6 of the current line ($0) into variable b; if a and b are equal, I change the 4th, 5th and 6th characters to a space each. Whether or not the line was changed, I print it. Disclaimer: this assumes that you want to deal with at most 2 subsequent lines having an equal substring. Note that /^30*/ does not check whether the string starts with 30 but rather whether it starts with 3, e.g. it will match 312; you should probably use /^30/ instead. I left your pattern unchanged as you imply it works as intended for your data.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed -E '/^30/{N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /}' file
Match on a line beginning with 30 and append the following line.
Using pattern matching, if the 5 characters at positions 183-187 are the same in both lines, replace the second group of 5 characters with 5 spaces.
For multiple adjacent lines use:
sed -E '/^30/{:a;N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /;ta}' file
Or alternatively:
sed -E ':a;$!N;s/^(30.{180}(\S{5}).*\n.{182})\2/\1     /;ta;P;D' file
It sounds like this is what you want, using any awk in any shell on every Unix box:
$ awk -v tgt=30 -v beg=11 -v end=13 '
($1==tgt) && seen[$1]++ { $0=substr($0,1,beg-1) sprintf("%*s",end-beg+1,"") substr($0,end+1) }
1' file
10 ABC
20 XXX
30 XYZ
30
40 YYY
Just change -v beg=11 -v end=13 to -v beg=183 -v end=187 for your real data.
If you're ever again tempted to use getline make sure to read awk.freeshell.org/AllAboutGetline first as it's usually the wrong approach.
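A getline-free sketch that compares each line only with the line immediately before it, so that a new group of 30 records starts fresh, could look like this. The function name blank_repeats and its argument order are my own; the column range is inclusive.

```shell
# blank_repeats (hypothetical name): blank columns beg..end of a record when the
# previous line has the same prefix and the same value in those columns.
# $1 = record prefix, $2 = first column, $3 = last column; rest = input files.
blank_repeats() {
    tgt=$1 beg=$2 end=$3; shift 3
    awk -v tgt="$tgt" -v beg="$beg" -v end="$end" '
        {
            len    = end - beg + 1
            is_tgt = (index($0, tgt) == 1)              # line starts with the prefix
            val    = is_tgt ? substr($0, beg, len) : "" # value before any blanking
            if (is_tgt && prev_tgt && val == prev_val)
                # Overwrite with exactly len spaces so the line length is unchanged.
                $0 = substr($0, 1, beg - 1) sprintf("%" len "s", "") substr($0, end + 1)
            prev_tgt = is_tgt
            prev_val = val
            print
        }
    ' "$@"
}

printf '%s\n' '10 ABC' '30 XYZ' '30 XYZ' '40 YYY' | blank_repeats 30 4 6
```

For the real data the call would presumably be blank_repeats 30 183 187 file. Because the original value is remembered rather than the blanked one, a run of three or more identical records keeps only its first value.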

Extract all numbers from string in list

Given some string 's' I would like to extract only the numbers from that string. I would like the output numbers to be separated from each other by a single space.
Example input -> output
....IN:1,2,3
OUT:1 2 3
....IN:1 2 a b c 3
OUT:1 2 3
....IN:ab#35jh71 1,2,3 kj$d3kjl23
OUT:35 71 1 2 3 3 23
I have tried combinations of grep -o [0-9] and grep -v [a-z] -v [A-Z] but the issue is that other chars like - and # could be used between the numbers. Regardless of the number of non-numeric characters between the numbers I need them to be replaced with a single space.
I have also been experimenting with awk and sed but have had little luck.
Not sure about spaces in your expected output, based on your shown samples, could you please try following.
awk '{gsub(/[^0-9]+/," ")} 1' Input_file
Explanation: Globally substituting anything apart from digit with spaces. Mentioning 1 will print current line.
In case you want to remove initial/starting space and ending space in output then try following.
awk '{gsub(/[^0-9]+/," ");gsub(/^ +| +$/,"")} 1' Input_file
Explanation: Globally substitute every run of non-digit characters in the current line with a single space, then remove the leading and trailing spaces by substituting them with an empty string. The trailing 1 prints the current line, whether or not it was edited.
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*'
35
71
1
2
3
3
23
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | tr -sc '[:digit:]' ' '
35 71 1 2 3 3 23

Copy lines by rows in awk

I have an input file that contains, per row, a value and two weights.
I would like to generate two output files - where the value in the first column is repeated once per line, according to the weights. This is probably best explained with a short example. If the input file is:
file.in:
35 2 0
37 2 3
38 0 4
Then I would like to generate two output files:
file.out1:
35
35
37
37
file.out2:
37
37
37
38
38
38
38
I will then use these output files to calculate the average and median of first column according to the weights in the second and third column.
This is pretty easy in awk.
awk '{for(i=0;i<$2;i++) print $1;}' file.in > file.out1
generates the first file, and
awk '{for(i=0;i<$3;i++) print $1;}' file.in > file.out2
generates the second
It is not clear from your question whether you know how to compute the mean and median from these files - it seems you just wanted to create these output files. Let me know if the rest is giving you trouble, or whether the above scripts are not clear (I think they are pretty self-explanatory).
If I understood well you need average and median.
Average:
awk '{a+=$1}END{print a/NR}' file.in
36.6667
Median:
awk '{print $1}' file.in | sort -n | awk '{a[NR]=$1}END{ b=NR/2; b=b%1?int(b)+1:b; print a[b] }'
37
Explanation:
Put simply, NR is a variable that holds the number of lines read so far; for the average you want the sum of all values divided by the number of lines.
For the median you want your input sorted numerically and then pick the middle value. It's not quite that simple for your input, because if you divide the number of lines (3) by 2 you get 1.5, so you need a ceiling function, which awk doesn't have. I emulate it with b=NR/2; b=b%1?int(b)+1:b;
I hope this helps.
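If the end goal is the average and median according to the weights, the intermediate files may not be needed at all. The sketch below computes both directly, assuming the weights are non-negative integers and taking the median of the expanded list (averaging the two middle values when the total weight is even). The function name weighted_stats and the output format are hypothetical.

```shell
# weighted_stats (hypothetical name): weighted mean and median of column 1.
# $1 = weight column (2 or 3); remaining arguments = input files (stdin if none).
weighted_stats() {
    col=$1; shift
    sort -n "$@" | awk -v col="$col" '
        # Keep rows with a positive weight, with running totals of the weights.
        $col > 0 { n++; val[n] = $1; total += $col; cum[n] = total; sum += $1 * $col }

        # Value at 1-based position k of the list expanded by weight.
        function at(k,    i) { for (i = 1; i <= n; i++) if (cum[i] >= k) return val[i] }

        END {
            if (!total) exit
            lo = int((total + 1) / 2)   # lower middle position
            hi = int(total / 2) + 1     # upper middle position (same as lo when total is odd)
            printf "mean %.4f median %g\n", sum / total, (at(lo) + at(hi)) / 2
        }
    '
}

weighted_stats 2 <<'EOF'
35 2 0
37 2 3
38 0 4
EOF
```

With the sample input, weight column 2 corresponds to the expanded list 35 35 37 37 and weight column 3 to 37 37 37 38 38 38 38.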