How can I replace all middle characters with '*'? - awk

I would like to replace middle of word with ****.
For example :
ifbewofiwfib
wofhwifwbif
iwjfhwi
owfhewifewifewiwei
fejnwfu
fehiw
wfebnueiwbfiefi
Should become :
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
So far I managed to replace all but the first 2 chars with:
sed -e 's/./*/g3'
Or do it the long way:
grep -o '^..' file > start
cat file | sed 's:^..\(.*\)..$:\1:' | awk -F. '{for (i=1;i<=length($1);i++) a=a"*";$1=a;a=""}1' > stars
grep -o '..$' file > end
paste -d "" start stars > temp
paste -d "" temp end > final

I would use Awk for this, if you have a GNU Awk to set the field separator to an empty string (How to set the field separator to an empty string?).
This way, you can loop through the chars and replace the desired ones with "*". In this case, replace from the 3rd to the 3rd last:
$ awk 'BEGIN{FS=OFS=""}{for (i=3; i<=NF-2; i++) $i="*"} 1' file
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi

If perl is okay:
$ perl -pe 's/..\K.*(?=..)/"*" x length($&)/e' ip.txt
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
..\K.*(?=..) to match characters other than first/last two characters
See regex lookarounds section for details
e modifier allows to use Perl code in replacement section
"*" x length($&) use length function and string repetition operator to get desired replacement string

You can do it with a repetitive substitution, e.g.:
sed -E ':a; s/^(..)([*]*)[^*](.*..)$/\1\2*\3/; ta'
Explanation
This works by repeating the substitution until no change happens, that is what the :a; ...; ta bit does. The substitution consists of 3 matched groups and a non-asterisk character:
(..) the start of the string.
([*]*) any already replaced characters.
[^*] the character to be replaced next.
(.*..) any remaining characters to replace and the end of the string.
Alternative GNU sed answer
You could also do this by using the hold space which might be simpler to read, e.g.:
h # save a copy to hold space
s/./*/g3 # replace all but 2 by *
G # append hold space to pattern space
s/^(..)([*]*)..\n.*(..)$/\1\2\3/ # reformat pattern space
Run it like this:
sed -Ef parse.sed input.txt
Output in both cases
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi

Following awk may help you on same. It should work in any kind of awk versions.
awk '{len=length($0);for(i=3;i<=(len-2);i++){val=val "*"};print substr($0,1,2) val substr($0,len-1);val=""}' Input_file
Adding a non-one liner form of solution too now.
awk '
{
len=length($0);
for(i=3;i<=(len-2);i++){
val=val "*"};
print substr($0,1,2) val substr($0,len-1);
val=""
}
' Input_file
Explanation: Adding explanation now for above code too.
awk '
{
len=length($0); ##Creating variable named len whose value is length of current line.
for(i=3;i<=(len-2);i++){ ##Starting for loop which starts from i=3 too till len-2 value and doing following:
val=val "*"}; ##Creating a variable val whose value is concatenating the value of it within itself.
print substr($0,1,2) val substr($0,len-1);##Printing substring first 2 chars and variable val and then last 2 chars of the current line.
val="" ##Nullifying the variable val here, so that old values should be nullified for this variable.
}
' Input_file ##Mentioning the Input_file name here.

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, , how could the number of comma-delimited strings in square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}' .
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to characters [ and ], then split your $2 to an array and capture the count of elements returned from split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
With your shown samples only and with shown attempts please try following awk code.
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation: Simple explanation would be:
gsub(/.*\[|\].*$/,""): Globally substituting everything from starting to till [ AND substituting from [ to till end of value with NULL in current line.
print gsub(/,/,"&")+1: Globally substituting , with itself(just to count it) and adding 1 to it and printing it as pre requirement.
A naive approach with NF gets fields from the original input string
gensub does not change string it is working on, you might use sub (or gsub) which will alter string it is working at which will alter relevant built-in variables values that is
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub to delete everything before [ and [, then ] and everything behind it, print number of fields.
(tested in GNU Awk 5.0.1)

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. In the first column, there are several strings (IDs) and in the second values. In the strings, there are a number of dots that can be variable. I would like to split these strings based on the last dot. I found in the forum how remove the last past after the last dot, but I don't want to remove it. I would like to create a new column with the last part of the strings, using bash command (e.g. awk)
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #2 to match 0 or more whitespaces followed by 1+ non-whitespace characters.
\.: Match a dot. Since this regex pattern is greedy it will match until last dot
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: Simple explanation would be, using rev to print reverse order(from last character to first character) for each line, then sending its output as a standard input to awk program where substituting first dot(which is last dot as per OP's shown samples only) with spaces and printing all lines. Then sending this output as a standard input to rev again to print output into correct order(to remove effect of 1st rev command here).
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Add additional fields based on field count

I have data in below format in a file
"123","XYZ","M","N","P,Q"
"345",
"987","MNO","A,B,C"
I always want to have 5 entries in the row , so if the count of fields in 2 then 3 extra ("") needs to be added.
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
I looked upto the solution on the page
Add Extra Strings Based on count of fields- Sed/Awk
which has very similar requirement but when I try it fails as I have comma (,) within the field also.
Thanks.
In GNU awk with your shown samples, please try following code.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program from here setting FPAT to csv file parsing here.
BEGIN{ OFS="," } ##Starting BEGIN section of this program setting OFS to comma here.
FNR==NR{ ##Checking condition FNR==NR here, which will be true for first time file reading.
nof=(NF>nof?NF:nof) ##Create nof to get highest NF value here.
next ##next will skip all further statements from here.
}
NF<nof{ ##checking if NF is lesser than nof then do following.
val="" ##Nullify val here.
i=($0~/,$/?NF:NF+1) ##Setting value of i as per condition here.
for(;i<=nof;i++){ ##Running loop till value of nof matches i here.
val=(val?val OFS:"")s1 s1 ##Creating val which has value of "" in it.
}
sub(/,$/,"") ##Removing ending , here.
$0=$0 OFS val ##Concatinate val here.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
EDIT: Adding this code here, where keeping a variable named nof where we can give our number of fields value which should be added minimum in all missing lines, in case any line is having more than minimum field values then it will take that value to add those many number of fields in missing field line.
awk -v s1="\"" -v nof="5" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Here is one for GNU awk using FPAT when [you] always want to have 5 entries in the row :
$ awk '
BEGIN {
FPAT="([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
NF=5 # set NF to limit too long records
for(i=1;i<=NF;i++) # iterate to NF and set empties to ""
if($i=="")
$i="\"\""
}1' file
Output:
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
Here is a an awk command that would work with any version of awk:
awk -v n=5 -v ef=',""' -F '","' '
{
sub(/,+$/, "")
for (i=NF; i<n; ++i)
$0 = $0 ef
} 1' file
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
With perl, assuming every field is double quoted:
$ perl -pe 's/,$//; s/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
# if the , at the end of line isn't present
$ perl -pe 's/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
s|","|$&|g will search for "," and replace it back. The return value is number of replacements, which is then used to determine how many fields have to be appended.
The e flag allows you to use Perl code in the replacement section.
q operator helps to use different delimiter for single quoted string.
Here's an alternate solution that creates an array and then adds empty fields if necessary.
perl -lne '#f = /"[^"]+"|[^,]+/g; print join ",", #f, qw("") x (4 - $#f)'
/"[^"]+"|[^,]+/g defines fields as double quoted strings (with no double quote inside, so escaped quotes won't work with this solution) or non , characters (at least one, so , at end of line will be ignored).
qw("") x (4 - $#f) determines the extra fields to be appended. qw("") creates an array with single element of value "" which is then multiplied using the x operator.
Another perl way using -a for autosplit and -F to set the separator:
perl -lanF'/"*,*"/' -e 'print join ",", map "\"$_\"", #F[1..5]'
-F'/"*,*"/' - this uses an autosplit separator of double quote optionally preceeded by commas and quotes
-a uses that separator to autosplit into #F
-l adds linebreaks to print and -n will process input in stream mode w/o printing unless explicitly told to
map "\"$_\"", #F[1..5] takes exactly 5 fields, even undefined ones, and adds double quotes
print join ",", map ... takes the results of the map above, joins into a string with commas, and prints
(Note: because each line starts with a field delimiter, I'm ignoring the empty $F[0] element)
This might work for you (GNU sed):
sed ':a;s/"[^"]*"/&/5;t;s/$/,""/;ta' file
If there are 5 fields, bail out.
Otherwise, append an empty field and repeat.

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
awk solutions work great. Here is tr + sed solution:
tr '\n' '|' < file | sed 's/\|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : —-NF )'
ascii | = octal \174 = hex 0x7C. The reason for —-NF is that more often than not, the input includes a trailing new line, which makes field count 1 too many and result in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n will get replaced by the unicode properly 1高2高3高 , but if you're on gnu-tr, it would only keep the leading byte of the unicode, and result in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if u attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing while gnu-tr would gladly oblige, even if it means slicing straight through UTF8-compliant characters :
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it following way, using GNU AWK, let test.txt content be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: If it is first line print that line content sans trailing newline, otherwise | followed by line content sans trailing newline. Note that I assumed that test.txt has not trailing newline, if this is not case test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
With a GNU sed, you can pass -z option to match line breaks, and thus all you need is replace each newline but the last one at the end of string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
See the online demo.
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes inline add -i option (mind this is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt

Replacing with Awk and preserving the FS to OFS

I have a file with input text below (this is not the original file and just example of input text ) and I want to replace all the 2 letter string to numeric 100 . In this file FS can be :,| or " " (space) , I have no other choice but to treat all three of them as FS, and I want to preserve these field separators at the original position (as in input file) in the output
A:B C|D
AA:C EE G
BB|FF XX1 H
DD:MM:YY K
I have tried
awk -F"[:| ]" '{gsub(/[A-Z]{2}/,"100");print}'
but this does not seem to work , please suggest.
Desired output:
A:B C|D
100:C 1000 G
100|100 1001 H
100:100:100 K
There is no functionality in POSIX awk to retain the strings that match the string defined by RS (POSIX) or regexp defined by FS. Since in POSIX RS is just a string there's no need for such functionality and to do it for every FS matching string would be unnecessarily inefficient given it's rarely needed.
With GNU awk where RS can be a regexp, not just a string, you can retain the string that matched the regexp RS with RT but there is no functionality that retains the values that match FS for the same efficiency reason that POSIX doesn't do it. Instead in GNU awk they added a 4th arg to split() so you can retain the strings that match FS in an array yourself if you want it (seps[] below):
$ awk -v FS='[:| ]' '{
split($0,flds,FS,seps)
gsub(/[A-Z]{2}/,"100")
for (i=1;i<=NF;i++) {
printf "%s%s", $i, seps[i]
}
print ""
}' file
A:B C|D
100:C 100 G
100|100 1001 H
100:100:100 K
Look up split() in the GNU awk manual for more info.
in this case
sed 's/[A-Z]\{2\}/100/g' YourFile
awk '{gsub(/[A-Z]{2}/, "100"); print}' YourFile
no need of field separation in this case, change all group of upper letter by "100", unless you specify other constraint than in OP (like other element in the string, you than need to specify what is possible and idealy, add a sample of expected result to be univoq)
Now you certainly have lot more thing around, so this code will certainly failed by changing thing like ABC:DEF with 100C:100F that is certainly not expected
in this case
awk -F '[[:blank:]:|]+' '
{
split( $0, aS, /[^[:blank:]:|]+/)
for( i=1;i<=NF;i++){
if( $i ~ /^[A-Z][A-Z]$/) $i = "100"
printf( "%s%s", $i, aS[i+1])
}
printf( "\n" )
} ' YourFile
Give this sed one-liner a try:
kent$ sed -r 's/(^|[:| ])[A-Z][A-Z]([:| ]|$)/\1100\2/g' file
A:B C|D
100:C 100 G
100|FF XX1 H
100:MM:100 K
Note:
this will search and replace pattern: exact two [A-Z] between two delimiters. If this is not what you want exactly, paste the desired output.
Your code seems to work just fine with my Gnu awk:
A:B C|D
100:C 100 G # even the typo in this record got fixed.
100|100 1001 H
100:100:100 K
I'd say the problem is that the regex /[A-Z]{2}/ should be written /[A-Z][A-Z]/.