How to extract multiple strings with single regex expression in Awk - awk

I have the following strings:
Mike has XXX cats and XXXXX dogs.
MikehasXXXcatsandXXXXXdogs
I would like to replace Xs with the digits corresponding to the number of Xs:
I tried:
awk '{ match($0, /[X]+/);
a = length(substr($0, RSTART, RLENGTH));
gsub(/[X]+/, a) }1'
But it captures only the first match.
Expected output:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs

With your shown samples, could you please try following. Written and tested in GNU awk(should work in any awk).
awk '{for(i=1;i<=NF;i++){if($i~/^X+$/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
Sample output will be:
Mike has 3 cats and 5 dogs.
Explanation: Going through all the fields(space delimited) and checking if field starts from X and has only X till end of current field, if yes then globally substituting it with its own value(to get the count) and saving into current field itself. Then mentioning 1 will print current line.
NOTE: As per Ed sir's comment(under question section), in case your fields may have values other X too then try(this will even cover XXX456 value in any column too):
awk '{for(i=1;i<=NF;i++){if($i~/X/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
EDIT: Since OP's samples are changed so adding this solution here, written and tested with GNU awk.
awk -v RS='X+' '{ORS=(RT ? gsub(/./,"",RT) : "")} 1' Input_file
OR
awk -v RS='X+' '{ORS=(RT ? length(RT) : "")} 1' Input_file
Output will be as follows for above code:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs

another awk
$ awk '{for(i=1;i<=NF;i++) if($i~/^X+$/) $i=length($i)}1' file
Mike has 3 cats and 5 dogs.

$ awk '{while( match($0,/X+/) ) $0=substr($0,1,RSTART-1) RLENGTH substr($0,RSTART+RLENGTH)} 1' file
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs

If Perl is okay:
$ perl -pe 's/X+/length $&/ge' ip.txt
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
The e flag allows Perl code in replacement section. $& will have the matched portion.

Here's the cleanest awk-based solution i can think of
{mawk/mawk2/gawk} 'BEGIN { FS = "^$" } /X/ {
while(match($0, /[X]+/)) { sub(/[X]+/, RLENGTH) } } 1'
downside of this is having to use regex engine twice for every replacmeent. upside is that it avoids a bunch of substr( ) ops.

Related

print specific value from 7th column using pattern matching along with first 6 columns

file1
1 123 ab456 A G PASS AC=0.15;FB=1.5;BV=45; 0|0 0|0 0|1 0|0
4 789 ab123 C T PASS FB=90;AC=2.15;BV=12; 0|1 0|1 0|0 0|0
desired output
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
I used
awk '{print $1,$2,$3,$4,$5,$6,$7}' file1 > out1.txt
sed -i 's/;/\t/g' out1.txt
awk '{print $1,$2,$3,$4,$5,$6,$7,$8}' out1.txt
output generated
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS FB=90
I want to print first 6 columns along with value of AC=(*) from 7th column.
With your shown samples, please try following awk code.
awk '
{
val=""
while(match($7,/AC=[^;]*/)){
val=(val?val:"")substr($7,RSTART,RLENGTH)
$7=substr($7,RSTART+RLENGTH)
}
print $1,$2,$3,$4,$5,$6,val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
val="" ##Nullifying val here.
while(match($7,/AC=[^;]*/)){ ##Running while loop to use match function to match AC= till semi colon all occurrences here.
val=(val?val:"")substr($7,RSTART,RLENGTH) ##Creating val and keep adding matched regex value to it, from 7th column.
$7=substr($7,RSTART+RLENGTH) ##Assigning rest pending values to 7th column itself.
}
print $1,$2,$3,$4,$5,$6,val ##Printing appropriate columns required by OP along with val here.
}
' Input_file ##Mentioning Input_file name here.
$ awk '{
n=split($7,a,/;/) # split $7 on ;s
for(i=1;i<=n&&a[i]!~/^AC=/;i++); # just loop looking for AC
print $1,$2,$3,$4,$5,$6,a[i] # output
}' file
Output:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
If AC= was not found, and empty field is outputed instead.
Any time you have tag=value pairs in your data I find it best to first populate an array (f[] below) to hold those tag-value mappings so you can print/test/rearrange those values by their tags (names).
Using any awk in any shell on every Unix box:
$ cat tst.awk
{
split($7,tmp,/[=;]/)
for (i=1; i<NF; i+=2) {
f[tmp[i]] = tmp[i] "=" tmp[i+1]
}
sub(/[[:space:]]*[^[:space:]]+;.*/,"")
print $0, f["AC"]
}
$ awk -f tst.awk file
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
This might work for you (GNU sed):
sed -nE 's/^((\S+\s){6})\S*;?(AC=[^;]*);.*/\1\3/p' file
Turn off implicit printing -n and add easier regexp -E.
Match the first six fields and their delimiters and append the AC tag and its value from the next.
With only GNU sed:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/' file1
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
Lines without a AC=... part in the 7th field will be printed without modification. If you prefer removing the 7th field and the end of the line, use:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/;t;s/\S+;.*//' file1

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values that belong to the first column, all in one commando with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of the seem to work. They just print out all unique pairs, while I need count of unique values from the second column considering the value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
With complete awk solution could you please try following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if 1st and 2nd column is NOT present in found array then do following.
val[$1]++ ##Creating val with 1st column inex and keep increasing its value here.
}
END{ ##Starting END block of this progra from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
Requires the input file to be sorted on the first column, like your sample. If real file isn't, add -s to the options.
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)

Print rows that has numbers in it

this is my data - i've more than 1000rows . how to get only the the rec's with numbers in it.
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to linux programming, any help would be highly useful to me. thanks all
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

awk script to parse between two strings having same name

Lets say we I have an text like this
Hello, 12345
This is going to be fun
ABC:172-1345,
172-1323
There is more string here.
Hello, 34567
This is not going to be fun
ABC:172-2345
There is more string here
Output Should be
12345 ABC:172-1345
34567 ABC:172-2345
Can we achieve this in awk?
We also have to consider the last Hello, as it would not be having another Hello to have the end parse string.
Most simply:
awk -v RS=Hello, 'NR != 1 { print $1, $NF }'
This splits the file into records delimited by Hello, and prints the first and last token in each record. NR == 1 is excluded because it is the empty bit before the first Hello,.
Note that multi-character RS is not strictly POSIX-conforming, although the most common awks (mawk and gawk) accept it.
$ awk -v RS= '{print $2,$NF}' file
12345 ABC:172-1345
34567 ABC:172-2345

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line sed '$d'.
This seems for me a bit of duplicate work, surely it is possible to do the above only with awk - without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of contents of the line to p if the first field is greater than 3.