Keep first 3 characters of every word containing a colon - awk

I have a large text file with lines like:
01 81118 9164.47 0/0:6,0:6:18:.:.:0,18,172:. 0/0:2,0:2:6:.:.:0,6,74:. 0/1:4,5:9:81:.:.:148,0,81:.
What I need is to keep just the first three characters of all the columns containing a colon, i.e.
01 81118 9164.47 0/0 0/0 0/1
Where the number of chars after the first 3 can vary. I started by removing everything after a colon, but that removes the entire rest of the line rather than working per word:
sed 's/:.*//g' file.txt
Alternatively, I've been trying to bring in the word boundary (\b) and hack away at removing everything after colons several times:
sed 's/\b:[^ ]//g' file.txt | sed 's/\b:[^ ]//g'
But this is not a good way to go about it. What's the best approach?

Using a sed that has -E to enable EREs (e.g. GNU or BSD/OSX sed):
$ sed -E 's/([^[:space:]]{3}):[^[:space:]]+/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
With a POSIX sed:
$ sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
The above will work regardless of whether the spaces in your input are blanks or tabs or both.

Using awk: print only the first 3 characters of any field containing a colon, and print the rest as-is.
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
substr() is one of awk's standard string functions.
The 1 at the end of the script is a condition that is always true, with the default action {print}, so the (possibly modified) line is printed.
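As a quick illustration of that idiom (a sketch, not part of the original answer):
awk '1' file         # 1 is an always-true condition; the default action prints every line
awk '{print}' file   # equivalent explicit form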
Regarding output format, if input is tab separated and you want to keep the tabs, you can run:
awk 'BEGIN{OFS=FS="\t"} { for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
or another idea is to pretty-print with column -t (it does not insert real tabs, but pads fields with the appropriate number of spaces)
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file |column -t

If, as in your example, the colon is not part of the string which should be preserved, try
sed 's/\(\(^\| \)[^ :][^ :][^ :]\)[^ :]*:[^ ]*/\1/g' file
The literal spaces in the character classes may need to be augmented with tabs and possibly other whitespace characters.
(The regex could be prettier if your sed supports extended regex with -E or -r or some such nonstandard option; but this ugly sucker should be portable most anywhere.)
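For reference, a sketch of the same substitution in ERE form (assuming your sed accepts -E):
sed -E 's/((^| )[^ :]{3})[^ :]*:[^ ]*/\1/g' file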

Using GNU sed with regular expression extensions, a one-liner could be:
sed -E 's/(\S{3})\S*:\S*/\1/g' file
\S matches non-whitespace characters (a GNU extension).

This might work for you (GNU sed):
sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g' file
Prepend a newline to any non-whitespace string which contains a :.
If such a string contains at least 3 non-whitespace characters, keep the first 3 characters and remove the newline and the rest.
Clean up the newlines left in front of any :-containing strings that were shorter than 3 non-whitespace characters.

Optional: set _ = "[[:space:]]*" if you want to use the formal POSIX character class.
echo "${input}" |
mawk 'BEGIN { __ = OFS ="\f\r\t"
FS = "^"(_ = "[ \t]*")"|(:"(_)")?"(_)
_ = sub("[(]..", "&^", FS) } $_ = __$_'
01
81118
9164.47
0/0
0/0
0/1
tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macos nawk
The ultra-brute-force method would be:
mawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t'
01 81118 9164.47 0/0 0/0 0/1
To handle leading/trailing spaces and tabs in the brute-force approach:
gawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t' | column -t
01 81118 9164.47 0/0 0/0 0/1

Extracting and rearranging columns

I read lines from stdin which contain fields. The field delimiter is a semicolon. There are no quoting characters in the input (i.e. the fields can't themselves contain semicolons or newline characters). The number of input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but with fields 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care about the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a possibility in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh, which could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk you could try the following code. It uses GNU awk's match() function with the regex ^[^;]*;([^;]*;)([^;]*;)(.*)$, which creates 3 capturing groups whose values are stored in the array named arr (a GNU awk feature); the program then prints the captured values in the required order.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file
If perl is accepted, it provides a join() function to join elements on a delimiter. In awk, though, you'd have to define one explicitly (which isn't complex, just a few more lines of code; see the sketch below the one-liner).
perl -F';' -nlae '$t = $F[2]; $F[2] = $F[1]; $F[1] = $t; print join(";", @F[1..$#F])' file
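For comparison, here is a sketch of what defining a join() in awk might look like (the function and variable names are mine, not from any of the answers):
awk -F';' '
function join(a, first, last, sep,    out, i) {
    out = a[first]
    for (i = first + 1; i <= last; i++)
        out = out sep a[i]
    return out
}
{
    for (i = 1; i <= NF; i++) f[i] = $i   # copy fields into an array
    t = f[2]; f[2] = f[3]; f[3] = t       # swap fields 2 and 3
    print join(f, 2, NF, ";")             # print fields 2..NF joined by ";"
}' file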
With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", @F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specify the input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version will preserve input lines with fewer than 3 fields.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
3;2
;fig
;
Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With gnu awk and gensub switching the position of 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class, match optional chars other than ; and then match ;
([^;]*);([^;]*) 2 capture groups, both capturing chars other than ; and match ; in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third column, then pipes to cut to extract the desired columns.
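For example, on the sample input the intermediate output of the awk stage (before cut) would be expected to be:
c;a;bb;c;D;e;fff
ii;gg;h;ii;jj;kk;l;m;n
cut then keeps fields 1, 3 and 5 onwards, dropping the original first field and the duplicate of the third.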
Here is one way to do it using only zsh:
rearrange() {
local -a lines=(${(@f)$(</dev/stdin)})
for line in $lines; do
local -a flds=(${(s.;.)line})
print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
done
}
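For example, feeding the sample input to the function (a usage sketch; output as expected from the code above):
print -l 'a;bb;c;D;e;fff' 'gg;h;ii;jj;kk;l;m;n' | rearrange
c;bb;D;e;fff
ii;h;jj;kk;l;m;n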
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(@f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(@) - treat split as an array.
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.

Extract substring from a field with single awk in AIX

I have a file, file, with content like:
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
I have to get a substring from field 2 (the first two dot-separated values, with the dot between them).
I am currently doing cat file | awk '{print $2}' | awk -F. '{print $1"."$2}' and getting the expected output:
8.0
12.01
The question is: how can I do this with a single awk?
I have tried match() but don't see an option for a back-reference.
Any help would be appreciated.
You can do something like this.
$ awk '{ split($2,str,"."); print str[1]"."str[2] }' file
8.0
12.01
Also, keep in mind that your cat is not needed. Simply give the file directly to awk.
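For instance, the original two-step pipeline can be written without the cat as:
awk '{print $2}' file | awk -F. '{print $1"."$2}'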
With GNU grep, please try the following command:
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
Explanation: this uses GNU grep's -o option to print only the matched part and -P to enable PCRE. The pattern matches the non-space characters at the start of the line and the following spaces, then \K discards that part from the reported match. It then matches one or more digits, a dot, and one or more digits; if a match is found, only that matched value is printed.
I would use awk's split() function as follows. Let file.txt content be
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
then
awk '{split($2,arr,".");print arr[1]"."arr[2]}' file.txt
output
8.0
12.01
Explanation: split the 2nd field at . and put the elements into the array arr.
(tested in gawk 4.2.1)
You could match digits . digits from the second column and print if there is a match:
awk 'match($2, /^[[:digit:]]+\.[[:digit:]]+/) {
print substr($2, RSTART, RLENGTH)
}
' file
Output
8.0
12.01
Also with GNU awk and gensub():
awk '{print gensub(/([[:digit:]]+[.][[:digit:]]+)(.*)/,"\\1","g",$2)}' file
8.0
12.01
gensub() provides the ability to specify components of a regexp in the replacement text using parentheses in the regexp to mark the components and then specifying \\n in the replacement text, where n is a digit from 1 to 9.
You should perhaps not use awk at all (or any other external program, for that matter) but rely on the field-splitting capabilities of the shell and some variable expansion. For instance:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "=%s= =%s= =%s=\n" "$first" "$second" "$third"
done
=stringa= =8.0.1.2= =stringx=
=stringb= =12.01.0.0= =stringx=
As you can see, the value is already captured in the variable "$second", and you just need to further isolate the parts you want to see - the first and second components separated by a dot. You can do that either with parameter expansion:
# variable="8.0.1.2"
# echo ${variable%.*.*}
8.0
or like this (the inner expansion ${variable#*.*.} strips everything up to and including the second dot, leaving 0.0, and the outer expansion then removes that remainder, along with the dot before it, from the end):
# variable="12.01.0.0"
# echo ${variable%.${variable#*.*.}}
12.01
or you can use a further read-statement to separate the parts and then put them back together:
# variable="12.01.0.0"
# echo ${variable} | IFS=. read parta partb junk
# echo ${parta}.${partb}
12.01
So, putting all together:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "%s\n" "$second" | IFS=. read parta partb junk
printf "%s.%s\n" "$parta" "$partb"
done
8.0
12.01

How to convert only certain columns of whitespace into tab?

I know I can use sed 's/[[:blank:]]/,/g' to convert blanks into commas (or anything else of my choosing) in my file, but is there a way to set it up so that only the first 5 instances of whitespace are converted into commas?
This is because my last column has a lot of information written out, so it is annoying when sed converts all the spaces in that column into commas.
Sample input file:
sample1 gi|11| 123 33 97.23 This is a sentence
sample2 gi|22| 234 33 97.05 Keep these spaces
And the output I was looking for:
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
Only the first 5 runs of whitespace are changed to commas.
With GNU awk for the 3rd arg to match():
$ awk '{ match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,",",a[1]); print a[1] a[3] }' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
but I'd recommend you actually turn it into valid CSV (i.e. one that conforms to RFC 4180), one that could be read by MS-Excel and other tools, since "This is a sentence" (and possibly other fields) can presumably include commas and double quotes:
$ awk '{
gsub(/"/,"\"\"");
match($0,/((\S+\s+){5})(.*)/,a)
gsub(/\s+/,"\",\"",a[1])
print "\"" a[1] a[3] "\""
}' file
"sample1","gi|11|","123","33","97.23","This is a sentence"
"sample2","gi|22|","234","33","97.05","This is a sentence"
For example given this input:
$ cat file
sample1 gi|11| 123 33 97.23 This is a sentence
a,b,sample2 gi|22| 234 33 97.05 This is, "typically", a sentence
The output from the first script is not valid CSV:
$ awk '{ match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,",",a[1]); print a[1] a[3] }' file
sample1,gi|11|,123,33,97.23,This is a sentence
a,b,sample2,gi|22|,234,33,97.05,This is, "typically", a sentence
while the output from the 2nd script IS valid CSV:
$ awk '{ gsub(/"/,"\"\""); match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,"\",\"",a[1]); print "\"" a[1] a[3] "\"" }' file
"sample1","gi|11|","123","33","97.23","This is a sentence"
"a,b,sample2","gi|22|","234","33","97.05","This is, ""typically"", a sentence"
perl's split can limit the number of fields with its third argument:
$ perl -lnE 'say join ",", split(" ",$_,6)' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
If fields might require quoting:
perl -lnE 'say join ",", map { s/"/""/g || y/,// ? "\"$_\"" : $_ } split(" ",$_,6)
' file
Ruby has a str.split that can take a limit:
ruby -ne 'puts $_.split(/\s+/,6).join(",")' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
As does Perl:
perl -lnE 'say join ",", split /\s+/,$_,6 ' file
# same
This might work for you (GNU sed):
sed -E 's/\s+/&\n/5;h;s/\n.*//;s/\s+/,/g;G;s/\n.*\n//' file
Append a newline to the 5th occurrence of a group of whitespace.
Make a copy of the amended line in the hold space.
Remove the section from the inserted newline to the end of the line.
Translate groups of whitespace into commas.
Append the copy.
Remove the section between newlines.
Thus the first five groups of whitespace are converted to commas and the remaining groups are untouched.
Here is a way to do it with 3 sed commands. This requires GNU sed, which supports the /Ng (e.g. /6g) pattern flag; it applies the substitution only from the nth occurrence onwards. Also note: this method will compress multiple spaces in the last column permanently.
sed 's/ \+/␣/6g' | sed 's/ \+/,/g' | sed 's/␣/ /g'
Another variation: Do the multiple space compression as a separate step with tr -s ' '. This may be more readable.
tr -s ' ' | sed 's/ /␣/6g' | sed 's/ /,/g' | sed 's/␣/ /g'
Another variation: Compress all whitespaces, not just spaces with \s
sed 's/\s\+/␣/6g' | sed 's/\s\+/,/g' | sed 's/␣/ /g'
Explanation:
The first step "protects" the spaces from the 6th occurrence on, by converting them to a special character. I've used ␣ here (unicode U+2423), but you could use any character that doesn't exist in the source data, such as \x00, {space}, etc.
sed 's/ \+/␣/6g'
The second step converts the remaining spaces to commas.
sed 's/ \+/,/g'
The third step converts the "protected" spaces back to spaces.
sed 's/␣/ /g'

Replace character sequence with several characters using AWK gsub()

I'm trying to convert text by replacing runs of identical letters (more than 3 of the same letter) with two characters (two *).
My input:
ffffOOOOuuuurrrr
fffffiiiiivvvvveeeee
What should I get:
**OOOO****
********
My test command is:
awk '{gsub(/[a-z]{4}/,"*"); print}' textfile
I don't understand how to transform {4} into 'more than 3'.
Also how to print * two times (like multiply it).
I'm also fairly sure that a plain 'more than three' condition (without requiring the letters to be identical) will convert the input into:
**OOOO**
**
Is there any way to avoid this (i.e. to replace only sequences of identical letters)?
Or is it not possible to fit this in one small command?
POSIX awk doesn't support back-references, and they are undefined in POSIX EREs. You will need to use GNU sed or perl:
sed -E 's/([a-z])\1{3,}/**/g' file
**OOOO****
********
or using perl:
perl -pe 's/([a-z])\1{3,}/**/g' file
RegEx Details:
([a-z]): Match [a-z] and capture in group #1
\1: Back-reference of the letter captured in group #1
{3,}: Repeat 3 or more times
You mentioned sed as an option in the tags:
echo "fffffiiiiivvvvveeeee" | sed 's/\([A-Za-z]\)\1\1\1\+/\1/g'
five
echo "fffffiiiiivvveeeee" | sed 's/\([A-Za-z]\)\1\1\1\+/\1/g'
fivvve
Here's how to do it with any awk assuming a locale where lower case letters are a-z = ASCII 97-122:
$ cat tst.awk
{
for (i=97; i<=122; i++) {
gsub(sprintf("%c{4,}",i),"**")
}
print
}
$ awk -f tst.awk file
**OOOO****
********
otherwise with GNU awk for the ord() function:
$ cat tst.awk
#load "ordchr"
{
for (i=ord("a"); i<=ord("z"); i++) {
gsub(sprintf("%c{4,}",i),"**")
}
print
}
$ awk -f tst.awk file
**OOOO****
********
or you can use a different numeric loop range, or split("abc...z",...), or whatever else to build the loop, but the point is: you need to loop over each character.
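Here is a sketch of that split() variant (assuming an awk, like gawk or a recent mawk, where split() with an empty separator splits the string into individual characters):
awk '
BEGIN { n = split("abcdefghijklmnopqrstuvwxyz", chars, "") }
{
    for (i = 1; i <= n; i++)
        gsub(chars[i] "{4,}", "**")   # 4 or more of the same letter -> **
    print
}' file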

How to extract something occurring after some common paths?

I want to filter out anything that occurs after some common paths. Example: print out the next path component that occurs after every pytests/ OR after src/.
for "src/cs-test/test_bugcheck_0001.py"
awk -F"/" '{print $2}' works
for "metadata/pytests/ipa-cert.yaml"
awk -F"/pytest/" '{print $2}' | awk -F"." '{print $1}' works
But I want to have these in one awk statement.
metadata/pytests/ipa-cert.yaml
src/cs-test/test_bugcheck_0001.py
Expected result:
ipa-cert
cs-test
I suggest using
sed -E 's,^(.*/pytests/|[^/]+/)([^/.]+).*,\2,' file > newfile
POSIX ERE pattern details
^ - start of line
(.*/pytests/|[^/]+/) - Group 1: either of the two alternatives:
.*/pytests/ - any 0+ chars as many as possible and then /pytests/ string
| - or
[^/]+/ - a negated bracket expression matching 1+ chars other than / and then a /
([^/.]+) - Group 2: a negated bracket expression matching 1 or more chars other than / and .
.* - any 0 or more chars up to the line end.
The , chars are used as the delimiter in the sed command so as to avoid having to escape the many / chars in the pattern.
Simple substitutions on individual strings are what sed is designed to do. With GNU or OSX/BSD sed for -E:
$ sed -E 's:(^|.*/)(pytests|src)/([^/.]+).*:\3:' file
ipa-cert
cs-test
or if you really want to use awk for some reason then with GNU awk for gensub():
$ awk '{print gensub(/(^|.*\/)(pytests|src)\/([^/.]+).*/,"\\3",1)}' file
ipa-cert
cs-test
and with any awk:
$ awk 'match($0,/(^|.*\/)(pytests|src)\/[^/.]+/){$0=substr($0,1,RLENGTH); sub(/.*\//,"")} 1' file
ipa-cert
cs-test