Compare and Replace inline using Awk/Sed command - awk

I have a fixed-length file in which I search for two consecutive lines starting with the number 30, compare the values in positions 183-187, and print the line number if both match. I am able to achieve the desired result up to this stage, but I would like to replace the value in the matched line with spaces without breaking the fixed length.
awk '/^30*/{a=substr($0,183,5);getline;b=substr($0,183,5); if(a==b) print NR}' file
Explanation of the command above:
match a line starting with 30
assign to a the value in positions 183 to 187
get the next line
assign to b the value in positions 183 to 187
compare a and b; if they match, the value in positions 183 to 187 is the same in 2 consecutive lines starting with 30
print the line number (this is the line number of the 2nd matching line)
The above command works as expected and prints the line number.
Example records (for illustration only, so not fixed length):
10 ABC
20 XXX
30 XYZ
30 XYZ
30 XYZ
30 XYZ
40 YYY
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
With the above command I am able to get line numbers 3 and 4, but I am unable to replace the value in the 4th line with spaces (inline replace) so that the fixed width is not compromised.
Expected Output
10 ABC
20 XXX
30 XYZ
30
30
30
40 YYY
10 ABC
20 XXX
30 XYZ
30
40 YYY
All of the above lines should be 255 characters long; the replacement has to happen in place, overwriting the existing characters rather than adding new spaces.
Any help will be highly appreciated. Thanks.

I would use GNU AWK and treat every character as a field. Consider the following example; let the content of file.txt be
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
then
awk 'BEGIN{FPAT=".";OFS=""}prev~/^30*/{a=substr(prev,4,3);b=substr($0,4,3);if(a==b){$4=$5=$6=" "}}{print}{prev=$0}' file.txt
output
10 ABC
20 XXX
30 XYZ
30
40 YYY
Explanation: I chose to store the whole line in a variable called prev rather than use getline, so {prev=$0} is the last action. I set FPAT to . so that every single character is treated as a field, and OFS (output field separator) to the empty string so that no unwanted characters are added when the line is rebuilt. If prev (the previous line, or the empty string for the first line) starts with 3, I take characters 4, 5 and 6 of the previous line (prev) as a and characters 4, 5 and 6 of the current line ($0) as b; if a and b are equal, I set the 4th, 5th and 6th characters to a space each. Whether or not the line was changed, I print it.
Disclaimer: this assumes that you want to deal with at most 2 consecutive lines having an equal substring.
Note: /^30*/ does not check whether the string starts with 30 but rather whether it starts with 3, e.g. it will match 312. You should probably use /^30/ instead; I kept your pattern unchanged as you imply it works as intended for your data.
(tested in gawk 4.2.1)
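For runs of more than two adjacent lines, the same idea can be written with substr alone, which also avoids the GNU-only FPAT. This is a minimal sketch, not a tested drop-in for the real data. blank_repeats is a hypothetical helper name, and its BEG and LEN arguments are assumed to be the 1-based start position and the width of the compared field (183 and 5 for the real file, 4 and 3 for the short sample). Each line is compared with the original, unblanked value of the line before it.

```shell
# blank_repeats BEG LEN [FILE...]  (hypothetical helper)
# Within each run of adjacent lines starting with "30", blank the field at
# BEG..BEG+LEN-1 on every line whose field equals the one on the line before.
blank_repeats() {
  beg=$1 len=$2
  shift 2
  awk -v beg="$beg" -v len="$len" '
    /^30/ {
      key = substr($0, beg, len)          # original value, before any blanking
      if (inrun && key == prevkey)
        $0 = substr($0, 1, beg - 1) sprintf("%" len "s", "") substr($0, beg + len)
      prevkey = key
      inrun = 1
      print
      next
    }
    { inrun = 0; print }                  # any other line ends the run
  ' "$@"
}

# Short sample from the question: field at position 4, width 3
printf '%s\n' '10 ABC' '20 XXX' '30 XYZ' '30 XYZ' '30 XYZ' '40 YYY' |
  blank_repeats 4 3
```

Because the blanked span is rebuilt with exactly LEN spaces, the line length does not change.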

This might work for you (GNU sed):
sed -E '/^30/{N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /}' file
Match on a line beginning 30 and append the following line.
Using pattern matching, if the 5 characters from 183-187 for both lines are the same, replace the second group of 5 characters with 5 spaces.
For multiple adjacent lines use:
sed -E '/^30/{:a;N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /;ta}' file
Or alternative:
sed -E ':a;$!N;s/^(30.{180}(\S{5}).*\n.{182})\2/\1     /;ta;P;D' file

It sounds like this is what you want, using any awk in any shell on every Unix box:
$ awk -v tgt=30 -v beg=11 -v end=13 '
($1==tgt) && seen[$1]++ { $0=substr($0,1,beg-1) sprintf("%*s",end-beg+1,"") substr($0,end+1) }
1' file
10 ABC
20 XXX
30 XYZ
30
40 YYY
Just change -v beg=11 -v end=13 to -v beg=183 -v end=187 for your real data.
If you're ever again tempted to use getline make sure to read awk.freeshell.org/AllAboutGetline first as it's usually the wrong approach.

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Update: the SRMAPQ field can be anywhere in the line, e.g.
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk 'match($0, /SRMAPQ=[0-9]+/){ l = length("SRMAPQ="); if (substr($0, RSTART+l, RLENGTH-l) + 0 > 60) print }' file
I would use GNU AWK in the following way. Let the content of file.txt be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test whether it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so what is before it becomes the first field ($1) and what is after it becomes the second field ($2). In the 2nd line this is 67;SOMETHING=2; when such a value is converted to a number, awk uses its longest leading numeric prefix, in this case 67. The other lines have just numbers. Disclaimer: this solution assumes that all fields but the last have a trailing ;. If this does not hold, please test my solution fully before using it.
(tested in gawk 4.2.1)
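Another way to handle SRMAPQ appearing anywhere in the list is to split the key=value pairs and look the key up by name. This is only a sketch under stated assumptions: srmapq_over is a hypothetical helper name, the key=value list is assumed to be the last whitespace-separated field, and pairs are assumed to be separated by ; with a single = inside each pair. Adding 0 forces a numeric comparison, which matters because values produced by split and substr that do not look like plain numbers are otherwise compared as strings.

```shell
# srmapq_over MIN [FILE...]  (hypothetical helper)
# Print rows whose SRMAPQ value is numerically greater than MIN.
srmapq_over() {
  min=$1
  shift
  awk -v min="$min" '{
    n = split($NF, kv, ";")               # assumes the key=value list is the last field
    for (i = 1; i <= n; i++)
      if (split(kv[i], pair, "=") == 2 && pair[1] == "SRMAPQ" && pair[2] + 0 > min + 0) {
        print
        next
      }
  }' "$@"
}

printf '%s\n' \
  '1 3 MAPQ=0;CT=3to5;SRMAPQ=60' \
  '2 34 MAPQ=60;CT=3to5;SRMAPQ=67;DT=3to5' \
  '4 56 MAPQ=67;CT=3to5;SRMAPQ=50' \
  '5 7 MAPQ=44;CT=3to5;SRMAPQ=100' |
  srmapq_over 60
```

Matching the key exactly also keeps MAPQ=67 on the third row from being mistaken for SRMAPQ.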

Extract all numbers from string in list

Given some string 's' I would like to extract only the numbers from that string. I would like the output numbers to each be separated by a single space.
Example input -> output
....IN:1,2,3
OUT:1 2 3
....IN:1 2 a b c 3
OUT:1 2 3
....IN:ab#35jh71 1,2,3 kj$d3kjl23
OUT:35 71 1 2 3 3 23
I have tried combinations of grep -o [0-9] and grep -v [a-z] -v [A-Z] but the issue is that other chars like - and # could be used between the numbers. Regardless of the number of non-numeric characters between the numbers I need them to be replaced with a single space.
I have also been experimenting with awk and sed but have had little luck.
I am not sure about the spaces in your expected output; based on the samples shown, could you please try the following.
awk '{gsub(/[^0-9]+/," ")} 1' Input_file
Explanation: globally substitute every run of non-digit characters with a single space. The trailing 1 prints the current line.
In case you want to remove the leading and trailing spaces from the output, try the following.
awk '{gsub(/[^0-9]+/," ");gsub(/^ +| +$/,"")} 1' Input_file
Explanation: globally substitute every run of non-digit characters with a single space in the current line, then globally substitute the leading and trailing spaces with the empty string. The trailing 1 prints the current line, edited or not.
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*'
35
71
1
2
3
3
23
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | tr -sc '[:digit:]' ' '
35 71 1 2 3 3 23
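If the output must have no leading or trailing space and should work in any POSIX awk, the numbers can also be collected one at a time with match. This is a small sketch; extract_numbers is a hypothetical helper name that reads lines on standard input.

```shell
# extract_numbers (hypothetical helper): for each input line, print the runs
# of digits it contains, separated by single spaces.
extract_numbers() {
  awk '{
    out = ""
    while (match($0, /[0-9]+/)) {
      out = out (out == "" ? "" : " ") substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + RLENGTH)   # continue after the number just found
    }
    print out
  }'
}

echo 'ab#35jh71 1,2,3 kj$d3kjl23' | extract_numbers
```

A line without any digits produces an empty output line, so the line count of the input is preserved.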

How can I pass a predefined variable into an awk column function?

I'd like to pass a predefined variable as the column number for an awk script. I've stripped out the unnecessary bits and below is an example of what I'd like to get done. Further below is a portion of what I've tried so far.
Reason: This is a semi-long script that currently works though I'd like to define the columns early in the script as this would make the script much easier to update as columns change.
I'd like for the "state" variable to be passed on to awk's column identifier, eg:
#!/bin/bash
export state='$6'
cat ~/file | awk -v column="$state" 'state!="FAILED"'
Running the above code produces rows that do indeed have column 6 as "FAILED", so there must be something wrong. While awk '$6!="FAILED"' works as expected
Different things I've tried so far:
1. Defining $state as 6 rather than '$6' and including the $ in the awk != command.
2. awk '{ENVIRON["state"]!="FAILED"}', with the same modifications as 1.
This should work:
state=6
cat ~/file | awk -v column="$state" '$column != "FAILED"'
$var in awk will get the field specified by the value of variable var.
So, $NF will get the last field. Note that the awk variable here is column, not state.
For example:
% seq 1 20 | paste - - - -
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
% seq 1 20 | paste - - - - | awk -v column=3 '{print $column}'
3
7
11
15
19
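Since the question already exports the variable, the ENVIRON route can also be made to work: the value has to be a plain column number, and the $ operator has to be applied to it inside a pattern rather than inside an action block. The following is a sketch only; not_failed is a hypothetical helper name and the two sample rows are made up for illustration.

```shell
# not_failed (hypothetical helper): keep rows whose column number $state,
# read from the environment, is not "FAILED".
not_failed() {
  awk '$(ENVIRON["state"] + 0) != "FAILED"'
}

export state=6                            # a column number, not the string '$6'
printf '%s\n' 'a b c d e FAILED' 'a b c d e OK' | not_failed
```

Adding 0 makes the string from the environment an explicit number before it is used as a field index.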

I want to split a merged field delimited by tab using awk for window?

The fields in columns 5 and 17 were merged together, and I want to split the merged ones into separate fields.
My data looks like this:
326502010-12-10 320100807
368902010-12-14 420100716
But I want it to look like this:
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
Using awk,
$ awk -vOFS="\t" '{sub(/.{5}/, "&\t", $1); sub(/./, "&\t", $2)}1' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
sub(/.{5}/, "&\t", $1) replaces the first 5 characters of the first field with themselves followed by \t.
sub(/./, "&\t", $2) does the same for the first character of the second field.
1 always evaluates to true, so awk performs its default action and prints the (modified) line.
In case the length of the number preceding the date varies, use this:
$ awk '{sub(/....-..-../,"\t&",$1); sub(/^./,"&\t",$2)} 1' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
sub replaces the date part with a tab (\t) followed by the matched text (&), i.e. the date. The second sub does much the same for $2.
Better use sed to split by characters:
$ sed -r 's/^(.{5})(.{12})/\1\t\2\t/' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
This captures the given characters and prints them back with a tab in between them.
You can also use cut for this:
$ cut --output-delimiter=$'\t' -c 1-5,6-17,18- file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
With -c option you can set a list representing the part of the line you want to cut. The comma , is replaced by the --output-delimiter which is set as a tab.
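The same character positions can be used from awk with substr. This is a sketch that assumes the input is laid out as in the cut example: characters 1-5, then 6-17 (which already contain the original tab), then 18 to the end of the line. split_fixed is a hypothetical helper name.

```shell
# split_fixed [FILE...]  (hypothetical helper)
# Insert a tab after character 5 and after character 17 of every line.
split_fixed() {
  awk '{ print substr($0, 1, 5) "\t" substr($0, 6, 12) "\t" substr($0, 18) }' "$@"
}

printf '326502010-12-10\t320100807\n' | split_fixed
```

If the leading number is not always 5 characters wide, fixed positions no longer apply and the date-based sub shown earlier is the safer choice.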

AWK - find min value of each row with arbitrary size

I have a file with the lines as:
5 3 6 4 2 3 5
1 4 3 2 6 5 8
..
I want to get the min on each line, so for example with the input given above, I should get:
min of first line: 2
min of second line: 1
..
How can I use awk to do this for any arbitrary number of columns in each line?
If you don't mind the output using digits instead of words you can use this one liner:
$ awk '{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print "min of line",NR": ",m}' file
min of line 1: 2
min of line 2: 1
If you really do want to count in ordinal numbers:
BEGIN {
    split("first second third fourth",count," ")
}
{
    min=$1
    for(i=1;i<=NF;i++)
        if($i<min)
            min=$i
    print "min of",count[NR],"line: \t",min
}
Save this to script.awk and run like:
$ awk -f script.awk file
min of first line: 2
min of second line: 1
Obviously this will only work for files with up to 4 lines, but you can extend the list of ordinal numbers to the maximum you think you will need. You should be able to find a list online pretty easily.
Your problem is pretty simple. All you need to do is define a variable min in the BEGIN part of your script and, for each line, run a simple C-like minimum search: set min to the first field, compare it with the next field, and so on until you reach the last field of the line. The total number of fields in the line is available in the variable NF, so it is just a matter of writing a for loop. Once the for loop has finished for the line, min holds the minimum element and you can print it.
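The loop described above can be sketched as follows. row_min is a hypothetical helper name; the sketch assumes whitespace-separated numeric fields and skips empty lines. Because min is reset from the first field on every line, it does not strictly need to be initialised in BEGIN.

```shell
# row_min [FILE...]  (hypothetical helper)
# Print the smallest field of every non-empty line.
row_min() {
  awk 'NF {
    min = $1 + 0                          # start with the first field, as a number
    for (i = 2; i <= NF; i++)
      if ($i + 0 < min)
        min = $i + 0
    print "min of line " NR ": " min
  }' "$@"
}

printf '%s\n' '5 3 6 4 2 3 5' '1 4 3 2 6 5 8' | row_min
```

NF is evaluated per line, so rows with different numbers of columns are handled without any change.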