Using SED/AWK to replace letters after a certain position - awk

I have a file with words (1 word per line). I need to censor all letters in the word, except the first five, with a *.
Ex.
Authority -> Autho****
I'm not very sure how to do this.

If you are lucky, all you need is
sed 's/./*/6g' file
When I originally posted this, I believed it to be reasonably portable; but as @ghoti's comment points out, it is not.

Perl to the rescue:
perl -pe 'substr($_, 5) =~ s/./*/g' -- file
-p reads the input line by line and prints each line after processing
substr returns a substring of the given string starting at the given position.
s/./*/g replaces any character with an asterisk. The g means the substitution will happen as many times as possible, not just once, so all the characters will be replaced.
In some versions of sed, you can specify which substitution should happen by appending a number to the operation:
sed -e 's/./*/g6'
This will replace all (again, because of g) characters, starting from the 6th position.

Here's a portable solution for sed:
$ echo abcdefghi | sed -e 's/\(.\{5\}\)./\1*/;:x' -e 's/\*[a-z]/**/;t x'
abcde****
Here's how it works:
's/\(.\{5\}\)./\1*/' - preserve the first five characters, replacing the 6th with an asterisk.
':x' - set a "label", which we can branch back to later.
's/\*[a-z]/**/' - substitute the letter following an asterisk with an asterisk.
't x' - if the last substitution succeeded, jump back to the label "x".
This works equally well in GNU and BSD sed.
Of course, adjust the regexes to suit.
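Adjusting the regexes might look like the sketch below, which swaps [a-z] for [^*] so that capitals, digits and punctuation are masked too. It assumes the input contains no literal asterisks, and the function name censor_tail is made up for illustration.

```shell
# Hypothetical helper: keep the first 5 characters of each line, mask the rest.
# Assumes the input contains no literal "*" characters.
censor_tail() {
  sed -e 's/^\(.\{5\}\)./\1*/' \
      -e ':x' \
      -e 's/\*[^*]/**/' \
      -e 't x'
}

# Lines of five characters or fewer pass through unchanged.
printf '%s\n' 'Authority' 'Route-66' 'cat' | censor_tail
```

Putting the label in its own -e keeps the script readable and avoids relying on how a given sed parses text after a label name.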

The following awk solutions may help.
1st solution: GNU awk, using substr and gensub (gensub is a gawk extension).
awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' Input_file
2nd solution (lines of five characters or fewer are not printed):
awk 'NF{len=length($0);if(len>5){i=6;while(i<=len){val=val?val "*":"*";i++};print substr($0,1,5) val};val=i=""}' Input_file
Autho****
EDIT: Here is the same solution in multi-line form, with explanatory comments.
awk '
NF{                        ##Check that the line is non-empty.
  len=length($0);          ##Store the length of the current line in len.
  if(len>5){               ##Only lines longer than 5 characters are processed and printed.
    i=6;                   ##Start at the 6th character.
    while(i<=len){         ##Loop from position 6 to the end of the line.
      val=val?val "*":"*"; ##Append one * to val on each iteration.
      i++                  ##Move on to the next position.
    };
    print substr($0,1,5) val ##Print the first 5 characters followed by the asterisks in val.
  };
  val=i=""                 ##Reset val and i for the next line.
}
' Input_file               ##Input_file is the name of the file to read.
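The loop above builds the run of asterisks one character at a time and skips lines of five characters or fewer. A shorter sketch, using only POSIX awk features, masks the tail with gsub on a copy and prints short lines unchanged. The function name censor_awk is hypothetical.

```shell
# Hypothetical helper: mask everything after the 5th character of each line.
censor_awk() {
  awk '{
    head = substr($0, 1, 5)   # first five characters, kept as-is
    tail = substr($0, 6)      # the rest (empty for short lines)
    gsub(/./, "*", tail)      # replace every character of the tail with *
    print head tail
  }' "$@"
}

printf '%s\n' 'Authority' 'cat' | censor_awk
```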

Personally I'd just use sed for this (see @triplee's answer) but if you want to do it in awk it'd be:
$ awk '{t=substr($0,1,5); gsub(/./,"*"); print t substr($0,6)}' file
Autho****
or with GNU awk for gensub():
$ awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' file
Autho****

It is also possible and quite straightforward with sed:
sed 's/./\*/6;:loop;s/\*[^\*]/\**/;/\*[^\*]/b loop' file_to_censor.txt
output:
explanation:
s/./\*/6 #replace the 6th character of the line with *
:loop #define a label to jump back to
s/\*[^\*]/\**/ #replace a * followed by a non-* character with **
/\*[^\*]/b loop #loop while a * followed by a non-* character still exists

Here is a pretty straightforward sed solution (that does not require GNU sed):
sed -e :a -e 's/^\(.....\**\)[^*]/\1*/;ta' filename
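To make the loop easier to follow, here is a sketch of the same command with the number of kept characters turned into a parameter. The wrapper name censor_after is hypothetical, and the argument is assumed to be a positive integer.

```shell
# Hypothetical wrapper: keep the first N characters, mask the rest with "*".
# $1 is assumed to be a positive integer.
censor_after() {
  n=$1
  # Each pass turns the first non-asterisk after the kept prefix (and after
  # any asterisks already written) into "*"; "ta" repeats until no change.
  sed -e :a -e "s/^\(.\{$n\}\**\)[^*]/\1*/;ta"
}

printf '%s\n' 'Authority' | censor_after 3   # prints Aut******
```

Because the pattern is anchored with ^ and always skips the first N characters, the kept prefix is never modified.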

Related

Using awk(or sed) to replace specific group

For example, if I want to change 424, or any number, to 1 from below string.
<revision>424</revison>
I usually do this with sed -i 's|<revision>.*</revision>|<revision>777</revision>|g' and it works.
But I have to run a lot of similar commands,
and I want to know if I can use a group like <revision>(.*)</revision> and replace only \1 with 777. How do I do this?
With GNU awk and your shown samples, please try the following awk program. It uses the match function with four capturing groups: the 1st captures <revision>, the 2nd captures the digits, the 3rd captures <\/revison>, and the 4th captures everything after it (if anything). If the match succeeds, it prints the 1st element of arr, then newVal (an awk variable holding the new value), then the 3rd and 4th elements of arr.
awk -v newVal="777" '
match($0,/(<revision>)([0-9]+)(<\/revison>)(.*)/,arr){
print arr[1] newVal arr[3] arr[4]
}
' Input_file
Using GNU sed you can use a back-reference to a captured group inside the pattern itself, like:
s='<revision>424</revision>'
sed -E 's~<(revision)>[0-9]*</\1>~<\1>777</\1>~g' <<< "$s"
<revision>777</revision>
However, if you want to give perl a chance, you can shorten it further by using lookaround assertions:
perl -pe 's~(?<=<(revision)>)\d*(?=</\1>)~777~g' <<< "$s"
<revision>777</revision>
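sed cannot replace only the contents of a group, but capturing the text on either side of the value gives the same effect. The following is a minimal sketch in portable (BRE) sed; the function name set_tag_value is hypothetical, and the tag and the new value are assumed not to contain |, & or backslashes.

```shell
# Hypothetical helper: replace the text between <tag> and </tag> with a new value.
# Assumes neither argument contains "|", "&" or a backslash.
set_tag_value() {
  tag=$1
  new=$2
  # \1 is the opening tag and \2 the closing tag; only the text between them changes.
  sed "s|\(<$tag>\)[^<]*\(</$tag>\)|\1$new\2|g"
}

printf '%s\n' '<revision>424</revision>' | set_tag_value revision 777
```

Since sed back-references are a single digit, \1 followed directly by 777 is still read as group 1 and then the literal 777.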

How to extract (First match)text between two words

I have a file having the following structure
destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint
I want to extract the word "d-435-435"(Only the first match, this need not be same value always) in between the words "from station" and "to point"
How can I achieve this?
What I have tried so far?
id=$(sed 's/.*from station \(.*\) to.*/\1/' input.txt)
But this returns the following value: destination list d-435-435 move from mainpoint
1st solution: With your shown samples, please try the following GNU awk code. It uses the match function with the regex from station\s+\S+\s+to point to find the requested value, then removes from station\s+ and \s+to point from the matched text and prints what is left.
awk '
match($0,/from station\s+\S+\s+to point/){
val=substr($0,RSTART,RLENGTH)
gsub(/from station\s+|\s+to point/,"",val)
print val
exit
}
' Input_file
2nd solution: Using GNU grep, please try the following. The -o option prints only the matched portion and -P enables PCRE. The pattern matches the string from station followed by space(s); \K then discards everything matched so far (since we don't need it in the output). Next it matches \S+ (non-space characters), and a positive lookahead checks that space(s) and to point follow without including them in the output.
grep -oP -m1 'from station\s+\K\S+(?=\s+to point)' Input_file
If GNU sed is available, how about:
id=$(sed -nE '0,/from station.*to/ s/.*from station (.*) to.*/\1/p' input.txt)
The -n option suppresses printing unless the substitution succeeds.
The address range 0,/pattern/ acts like a flip-flop: it stops matching
after the first line that matches the pattern. The 0 address is a GNU sed extension that
allows the pattern to match on the 1st line as well.
With awk you can test the fields before and after
field $4, where d-435-435 is, then print that field on the first match only and stop with exit after the print statement:
awk '$2=="from" && $3=="station" && $5=="to" && $6=="point" {print $4; exit}' file
d-435-435
or using GNU awk for the 3rd arg to match():
awk 'match($0,/from station\s+(.*)\s+to point/,a){print a[1];exit}' file
d-435-435
The regexp contains a parenthesized group, so a[1] holds the portion of the string between from station followed by space(s) (\s+) and space(s) (\s+) followed by to point.
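Both \s and the third argument to match() are GNU extensions. Where only a POSIX awk is available, a sketch along the following lines uses index() and substr() on fixed strings instead. The function name first_station is hypothetical, and the markers are assumed to be separated from the value by single spaces, as in the sample.

```shell
# Hypothetical helper: print the text between "from station " and " to point"
# on the first line that contains both, then stop reading.
first_station() {
  awk '{
    start = index($0, "from station ")
    if (!start) next                         # marker not on this line
    rest = substr($0, start + length("from station "))
    stop = index(rest, " to point")
    if (!stop) next                          # closing marker missing
    print substr(rest, 1, stop - 1)
    exit                                     # first match only
  }' "$@"
}

printf '%s\n' 'destination list' \
  'move from station d-435-435 to point place1' \
  'move from station d-435-435 to point place2' \
  'move from mainpoint' | first_station
```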
This might work for you (GNU sed):
sed -nE '/.*station (\S+) to point.*/{s//\1/;H;x;/\n(\S+)\n.*\1/{s/\n\S+$//;x;d};x;p}' file
The command-line options -nE turn off implicit printing and turn on extended regexps.
If a line matches the required criteria, extract the required string, append a copy to the hold space, check if the match has already been seen and if not print it. If the match has been seen, remove it from the hold space.
Otherwise, do not print anything.
This should work in any sed:
sed -e '/.*from station \([^ ]*\) to .*/!d' -e 's//\1/' -e q file

Can I delete a field in awk?

This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
I hope, when printing $0, that the result is:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good though this is actually a tailor made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
I believe the simplest approach is to use the sub function to replace the first occurrence of consecutive commas ,, (created when the 2nd field was set to an empty string) with a single ,. But this assumes that the field values themselves do not contain commas.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: Alternatively, use the match function to find the text from the first comma to the next comma, and print the parts of the line before and after the match.
awk '
match($0,/,[^,]*,/){
print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
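The same shift-and-shrink idea can be wrapped so that the field number is a parameter. This is only a sketch: the function name drop_field is hypothetical, and it relies on decrementing NF rebuilding $0, as the awk versions mentioned above do.

```shell
# Hypothetical helper: delete comma-separated field number $1 from each line.
# Relies on "NF--" rebuilding $0, which may not work on older awks.
drop_field() {
  awk -F, -v OFS=, -v col="$1" '{
    if (col < 1 || col > NF) { print; next }   # nothing to delete on this line
    for (i = col; i < NF; i++) $i = $(i + 1)   # shift later fields down by one
    NF--                                       # drop the now-duplicated last field
    print
  }'
}

printf '%s\n' '0x01,0xB2,0x00,0x76' | drop_field 2
```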
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove the nth field would require you to put in a regex which matches the first n - 1 fields followed by the nth, then replace that with just the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
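As a sketch of the generic case, the function below builds the repetition count from n. Unlike the expressions above, it removes the comma before the nth field rather than the one after it, so it also works when the nth field is the last one. The name drop_field_sed is hypothetical, n is assumed to be a positive integer, and the sed is assumed to accept \{0\} as a repetition count (GNU sed does).

```shell
# Hypothetical helper: delete comma-separated field number $1 using sed only.
# Assumes $1 is a positive integer and fields contain no quoted commas.
drop_field_sed() {
  n=$1
  if [ "$n" -eq 1 ]; then
    # First field: remove it together with the comma that follows.
    sed 's/^[^,]*,//'
  else
    # Keep fields 1..n-1 in \1, then remove ",field_n".
    sed "s/^\(\([^,]*,\)\{$((n - 2))\}[^,]*\),[^,]*/\1/"
  fi
}

printf '%s\n' '0x01,0xDF,0x93,0x65,0xF8' | drop_field_sed 2
```

Lines with fewer than n fields do not match and are printed unchanged.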
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
Using awk in a regex-free way, with the option to choose which column will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
col = 2 # defines which column will be deleted
n = split($0,arr,",") # each line is split into an array
# n is the number of elements in the array
line = "" # this will be the new line
for (i = 1; i <= n; i++) # roaming through all elements in the array
line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
# appends a comma (except if line is still empty)
# and the current array element to the line (except when on the selected column)
print line # prints line
}
Another solution:
You can just pipe the output to sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1 ' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of @RavinderSingh13 using the sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
"It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using sub() in this case: "The field is still there; it just has an empty value, delimited by the two colons":
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
regex = "^"$1","$2
sub(regex, $1, $0);
print $0;
}' test.txt
or one line code:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary.
I would do it the following way. Let the content of file.txt be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to the empty string, then prepend , to the 2nd and every following column. Finally I set the 2nd column (now a comma plus the value) to the empty string. Keep in mind that this solution would need rework if you wish to remove the 1st column.

How to filter the OTU by counts with AWK?

I am trying to filter all the singletons from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this will remove all the header for each OTU.
Could anyone help me out?
Could you please try the following. It keeps every OTU whose size is at least 2, i.e. everything except singletons.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in prev. If the size is >= 2, print the previous line and the current line. RS is used to print the newline between them.
While all the above methods work, they are limited by the fact that the input always has to look the same, i.e. the sequence name in your fasta file needs to have the form:
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; match($0,ere);value=0+substr($0,RSTART+length(ere)); f=(value>limit)}
f' <file>
In the above, ere is a regular expression we try to match. We use it to find the location of the value attached to label xxx. The substring starting there will have non-numeric characters after the value, but adding 0 to it converts it to a number, dropping everything from the first non-numeric character on (i.e. 3;label4=value4; is converted to 3). We then check whether the value is bigger than our limit, and print the sequence based on that result.
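Applied to the original question, the same idea can be sketched as a small filter on the size label. The function name filter_by_size and its minimum-size argument are hypothetical; headers without a ;size= label are assumed to be unwanted and are dropped along with their sequence lines.

```shell
# Hypothetical helper: keep fasta records whose ;size=N; label has N >= $1.
# Records without a size label are dropped; multi-line sequences are handled.
filter_by_size() {
  awk -v min="$1" '
    /^>/ {
      keep = 0
      # ";size=" is 6 characters long, so the digits start at RSTART + 6.
      if (match($0, /;size=[0-9]+/))
        keep = (substr($0, RSTART + 6, RLENGTH - 6) + 0 >= min + 0)
    }
    keep   # print header and sequence lines of records that passed
  '
}

printf '%s\n' '>OTU1;size=3;' 'ATTCCCCGGGGGGG' '>OTU2;size=1;' 'ATCCGGGACTGATC' |
  filter_by_size 2
```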

Merge the next line if the current line contains a pattern at the end

I have a very big text file. I want to merge the next line into the current line if the current line has a word OR in the end.
Eg. Like in the lines below
somerandomstring OR
someotherrandomstring
The above 2 lines should become
somerandomstring OR someotherrandomstring
Only those lines should change. Rest of the lines must be kept as they are. Thanks in advance.
Allow me to extend the question a bit further.
I want to also see if the next line starts with OR and the OR is not in the end of the current line, then how to achieve the above case and this case together?
With GNU sed:
sed '/ OR$/{N;s/\n/ /}' file
Search for white space followed by OR at end of line ($) and, if found, read the next line (N) into the pattern space and replace the newline in the pattern space (with s///) by one white space.
If you want to edit your file "in place" use sed's option -i.
You can do this in awk by assigning the "OR" row to a variable and printing whatever is stored in the variable when there is no "OR" found.
awk '$NF=="OR"{buffer=buffer$0" "} $NF!="OR"{print buffer$0;buffer=""}' test.txt
This also works on multiple rows that may have "OR" by concatenating the row to the buffer variable until it finds a non-OR row, prints it and clears the buffer variable.
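The buffering idea can be extended to the follow-up case as well, where the next line starts with OR. The sketch below holds one pending line and joins the current line onto it when either condition is true. The function name merge_or is hypothetical, and a trailing OR is assumed to be preceded by a space and a leading OR to be followed by one.

```shell
# Hypothetical helper: join a line onto the previous one when the previous
# line ends in " OR" or the current line starts with "OR ".
merge_or() {
  awk '
    NR == 1 { line = $0; next }                        # first line: just hold it
    line ~ / OR$/ || $0 ~ /^OR / { line = line " " $0; next }
    { print line; line = $0 }                          # no join: flush and hold
    END { if (NR) print line }                         # flush the last held line
  ' "$@"
}

printf '%s\n' 'x OR' 'y' 'x' 'OR y' 'z' | merge_or
```

Holding the pending line until END means a file whose last line ends in OR still has that line printed.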
Another option in awk is to use printf on the OR rows and print on the non-OR rows (which is kind of similar to the GNU sed example by @cyrus, but in awk)
awk '$NF=="OR"{printf "%s ", $0} $NF!="OR"{print $0}' test.txt
And this is the same beast, but using a ternary operator within printf:
awk '{printf $NF=="OR"?"%s ":"%s\n", $0}' test.txt
another awk
$ awk -v RS='^$' -v k='OR' '{gsub(k ORS,k FS); gsub(ORS k,FS k); printf "%s",$0}' file
x OR y
x OR y
for test input file
$ cat file
x OR
y
x
OR y