return matching but not exact same strings - awk

Is there any way to find a word that contains a given string but is not an exact match? For example:
# cat t.txt
first line
ind is a shortform of india
I am trying to return the word "india" because it contains the string "ind", but I do not want the exact match "ind" itself. I have tried this:
# grep -o 'ind' t.txt
ind
ind

Would you please try the following:
grep -Eo '[A-Za-z]+ind|ind[A-Za-z]+' t.txt
Output:
india
The regex [A-Za-z]+ind|ind[A-Za-z]+ matches ind together with at least one preceding or following letter, so a bare ind is not matched.

$ grep -Eo '[[:alpha:]]+ind[[:alpha:]]*|[[:alpha:]]*ind[[:alpha:]]+' file
india
fooindbar
The above was run on this input file (note the added test case with ind in the middle of a word rather than only at the start or end):
$ cat file
first line
ind is a shortform of india
this fooindbar is the mid-word text
You can do the same with GNU awk (for multi-char RS, RT, and \s shorthand for [[:space:]]) if you prefer:
$ awk -v RS='\\s+' '/[[:alpha:]]+ind[[:alpha:]]*|[[:alpha:]]*ind[[:alpha:]]+/' file
india
fooindbar
or:
$ awk -v RS='[[:alpha:]]+ind[[:alpha:]]*|[[:alpha:]]*ind[[:alpha:]]+' 'RT{print RT}' file
india
fooindbar

I would use GNU AWK for this task in the following way. Let the content of file.txt be
first line
ind is a shortform of india
then
awk 'BEGIN{RS="[[:space:]]+"}match($0,/ind/)&&length>RLENGTH{print}' file.txt
output
india
Explanation: I tell GNU AWK that the record separator (RS) is one or more whitespace characters, so every word is treated as a record. For every record (that is, every word) I call the match function, which returns the position where the match starts (non-zero) if one is found, else 0, and sets RSTART and RLENGTH. If a match is found, I check whether the length of the current record (the word) is greater than the length of the match; if so, I print the word. Note that every word is output on its own line, so if, for example, the input file content were
india ind india ind india
then output would be
india
india
india
(tested in gawk 4.2.1)
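For awks without regex RS support, a minimal portable sketch is to rely on the default field splitting and compare each field against the search string. The variable name s is only illustrative, and the input is the three-line test file used in the answers above.

```shell
# Print every whitespace-separated word that contains "ind" but is not exactly "ind".
# "s" is an illustrative variable name, not taken from any answer above.
out=$(printf '%s\n' 'first line' 'ind is a shortform of india' 'this fooindbar is the mid-word text' |
  awk -v s='ind' '{
    for (i = 1; i <= NF; i++)
      if (index($i, s) > 0 && $i != s)   # substring present, but not the whole word
        print $i
  }')
printf '%s\n' "$out"
```

index() does a plain substring search, so characters that are special in a regex need no escaping. Punctuation attached to a word (for example "ind,") counts as part of the word here, so such a token would be printed.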

Related

Extracting lines from a file if a certain substring is present

I have a file that looks like:
Stef NY ID=1;CITY=NY
John SE ID=0;CITY=SE
Stef SE ID=2;CITY=SE
I want to extract only those lines where ID in third column is greater than 1 so the expected output becomes:
Stef SE ID=2;CITY=SE
The bash script I have takes care of removing either ID=1 or ID=0, but I don't know how to handle both together. This is what I have:
awk '$3 !~ /^ID=1;/' file.txt > output.txt
But this gives me an output:
John SE ID=0;CITY=SE
Stef SE ID=2;CITY=SE
How can I add ID=0 in my bash statement above? Insights will be appreciated.
It's a bit fragile, but you could try:
$ cat input
Stef NY ID=1;CITY=NY
John SE ID=0;CITY=SE
Stef SE ID=2;CITY=SE
$ awk '$2>1' FS='[=;]' input
Stef SE ID=2;CITY=SE
That is, split the line on the = and ; so that the number you're looking to compare is in field 2.
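To see why the number lands in field 2, this small sketch prints each field of one sample line after splitting on = and ;.

```shell
# Show how FS="[=;]" splits one sample line into numbered fields.
out=$(printf '%s\n' 'Stef SE ID=2;CITY=SE' |
  awk -F'[=;]' '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')
printf '%s\n' "$out"
```

This also shows where the fragility comes from: the first field is the whole "Stef SE ID" prefix, so any = or ; in the earlier columns would shift the number out of field 2.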
awk '$3 !~ /^ID=1;/' file.txt > output.txt
How it works
Your AWK command (anything between the quotes) works like a filter.
$3 !~ // filters by a condition on the 3rd field ($3). The condition is that the field does not match (!~) the regular expression between the slashes (//).
^ID=1; is a regular expression matching a field that starts with (^) ID=1;.
Adjust the regex
As Charles Duffy commented, you could simply replace the literal pattern ID=1 with a more flexible one, such as any of these:
ID=[01]; the ID can be any character inside the character set (the set inside square brackets []), so either 0 or 1
a similar set defined as a range: ID=[0-1]; (from 0 to 1)
or distinct alternatives: ID=(0|1); where the alternatives are listed in a group (wrapped in parentheses) and separated by a pipe symbol (|, which often means logical OR)
All of the above match both cases (ID=0 and ID=1).
The expression is a regular expression so you can use:
awk '$3 !~ /^ID=[01];/' file.txt > output.txt
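As a quick check, the three pattern variants can be run side by side on the sample input. This is only a sketch: the variable re is illustrative, and the pattern is passed as a dynamic regex rather than written between slashes.

```shell
# Run each regex variant against the sample data; all three should keep only the ID=2 line.
input='Stef NY ID=1;CITY=NY
John SE ID=0;CITY=SE
Stef SE ID=2;CITY=SE'
out=$(for re in '^ID=[01];' '^ID=[0-1];' '^ID=(0|1);'; do
  printf '%s\n' "$input" | awk -v re="$re" '$3 !~ re'   # keep lines whose 3rd field does not match
done)
printf '%s\n' "$out"
```

Note that these patterns only exclude the literal values 0 and 1; they are a text match, not a numeric comparison.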
Here is a way to do numerical comparison by stripping out all unwanted characters from last field:
awk '{val=$NF; gsub(/(^|.*;)ID=|;.*/, "", val)} val+0 > 1' file
Stef SE ID=2;CITY=SE
This will also work correctly for input like this:
Stef NY ID=1;CITY=NY
Stef NY ID=01;CITY=NY
John SE ID=0;CITY=SE
Stef SE ID=2;CITY=SE
Stef SE ID=04;CITY=SE
Another possibility with awk could be:
awk '$NF ~ /^ID=[[:digit:]]+/ {split($NF,a,/=|;/);if (a[1]=="ID" && a[2] > 1) print $0}' file
Stef SE ID=2;CITY=SE
Initial condition: proceed only if the last field begins with the sequence of characters matched by the regexp /^ID=[[:digit:]]+/
Action: split the field on the separator = or ;, then check the condition (a[1]=="ID" && a[2] > 1). If it is true, print the current line.
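The split step can be inspected on its own. The sketch below prints what split() produces for one last field with a leading zero; the verdict variable and the explicit +0 are illustrative additions that force a numeric comparison.

```shell
# Inspect the array produced by split() on the last field.
out=$(printf '%s\n' 'Stef SE ID=04;CITY=SE' |
  awk '{
    n = split($NF, a, /[=;]/)                     # a[1]="ID", a[2]="04", a[3]="CITY", a[4]="SE"
    verdict = (a[2] + 0 > 1) ? "keep" : "drop"    # +0 forces a numeric comparison
    print n, a[1], a[2], verdict
  }')
printf '%s\n' "$out"
```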

How to extract multiple strings with single regex expression in Awk

I have the following strings:
Mike has XXX cats and XXXXX dogs.
MikehasXXXcatsandXXXXXdogs
I would like to replace Xs with the digits corresponding to the number of Xs:
I tried:
awk '{ match($0, /[X]+/);
a = length(substr($0, RSTART, RLENGTH));
gsub(/[X]+/, a) }1'
But it captures only the first match.
Expected output:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
With your shown samples, please try the following. Written and tested in GNU awk (it should work in any awk).
awk '{for(i=1;i<=NF;i++){if($i~/^X+$/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
Sample output will be:
Mike has 3 cats and 5 dogs.
Explanation: Loop over all the (space-delimited) fields and check whether the field consists only of X characters from start to end. If so, globally substitute each X with itself (gsub returns the number of substitutions, which is the count of Xs) and save that count into the current field. The trailing 1 then prints the current line.
NOTE: As per Ed's comment (under the question), if your fields may contain values other than X too, then try this (it also covers a value like XXX456 in any column):
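The counting trick relies on the return value of gsub(), which is the number of substitutions made. A minimal sketch on a single value:

```shell
# gsub() returns how many substitutions it made; replacing X with "&" leaves the text unchanged.
out=$(printf '%s\n' 'XXXXX' | awk '{ n = gsub(/X/, "&"); print n, $0 }')
printf '%s\n' "$out"
```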
awk '{for(i=1;i<=NF;i++){if($i~/X/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
EDIT: Since the OP's samples have changed, here is another solution, written and tested with GNU awk.
awk -v RS='X+' '{ORS=(RT ? gsub(/./,"",RT) : "")} 1' Input_file
OR
awk -v RS='X+' '{ORS=(RT ? length(RT) : "")} 1' Input_file
Output will be as follows for above code:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
another awk
$ awk '{for(i=1;i<=NF;i++) if($i~/^X+$/) $i=length($i)}1' file
Mike has 3 cats and 5 dogs.
$ awk '{while( match($0,/X+/) ) $0=substr($0,1,RSTART-1) RLENGTH substr($0,RSTART+RLENGTH)} 1' file
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
If Perl is okay:
$ perl -pe 's/X+/length $&/ge' ip.txt
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
The e flag allows Perl code in replacement section. $& will have the matched portion.
Here's the cleanest awk-based solution I can think of (works with mawk, mawk2, or gawk):
awk 'BEGIN { FS = "^$" } /X/ {
while(match($0, /[X]+/)) { sub(/[X]+/, RLENGTH) } } 1'
The downside is that the regex engine runs twice for every replacement. The upside is that it avoids a bunch of substr() calls.
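To make the match-then-sub loop easier to follow, here is a small trace of the same idea that prints the line after each replacement. The variable n and the "step:" label are illustrative only.

```shell
# Trace the loop: match() measures the leftmost run of X, sub() then replaces that same run.
out=$(printf '%s\n' 'MikehasXXXcatsandXXXXXdogs' |
  awk '{
    while (match($0, /X+/)) {
      n = RLENGTH        # length of the leftmost run of X
      sub(/X+/, n)       # second regex pass: replace that run with its length
      print "step:", $0
    }
  }')
printf '%s\n' "$out"
```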

How do I delete a line that contains a specific string at a specific location using sed or awk?

I want to find and delete all lines that have a specific string, with a specific length, at a specific location.
One line in my data set looks something like this:
STRING 1234567 1234567 7654321 6543217 5432176
Notes:
Entries have field widths of 8
Identification numbers can be repeated in the same line
Identification numbers can be repeated on a different line, but at a different location - these lines should not be deleted
In this example, I want to find lines containing "1234567" located at column 17 and spanning to column 24 (i.e. the third field) and delete them. How can I do this with sed or awk?
I have used the following, but it deletes lines that I want to keep:
sed -i '/1234567/d' ./file_name.dat
Cheers!
You may use
sed -i '/^.\{17\}1234567/d' ./file_name.dat
Details
^ - start of a line
.\{17\} - any 17 chars (BRE interval syntax), so the substring must start at the 18th character of the line
1234567 - a substring.
See the online sed demo:
s="STRING 1234567 1234567 7654321 6543217 5432176
STRING 1234567 5534567 7654321 6543217 5432176"
sed '/^.\{17\}1234567/d' <<< "$s"
# => STRING 1234567 5534567 7654321 6543217 5432176
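The interval count depends on how the fixed-width fields are padded. The sketch below assumes left-aligned 8-character fields (an assumption, since the spacing of the sample data is not certain); under that assumption the third field starts at column 17, so 16 characters must be skipped.

```shell
# Build two records with left-aligned 8-character fields, then delete the one
# whose value at columns 17-23 is 1234567. \{16\} skips the first 16 characters.
out=$(printf '%-8s%-8s%-8s\n' STRING 1234567 1234567 STRING 1234567 5534567 |
  sed '/^.\{16\}1234567/d')
printf '%s\n' "$out"
```

If the fields are right-aligned instead, the value starts one column later and the count would be 17.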
With awk, print every line except those where the substring matches:
$ awk 'substr($0,17,7)=="1234567"{next}1' file > output_file
or perhaps inverse logic is easier
$ awk 'substr($0,17,7)!="1234567"' file > output_file
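A sketch of the substr() test on unambiguous fixed-width data, again assuming left-aligned 8-character fields (an assumption about the real file). substr($0, 17, 7) takes 7 characters starting at column 17.

```shell
# Two records with left-aligned 8-character fields. Only the first has 1234567
# in the third field; the second has it in the second field and must be kept.
out=$({
  printf '%-8s%-8s%-8s%-8s\n' STRING 1234567 1234567 7654321
  printf '%-8s%-8s%-8s%-8s\n' STRING 1234567 5534567 7654321
} | awk 'substr($0, 17, 7) != "1234567"')
printf '%s\n' "$out"
```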

awk: identify column by condition, change value, and finally print all columns

I want to extract the value in each row of a file that comes after AA. I can do this like so:
awk -F'[;=|]' '{for(i=1;i<=NF;i++)if($i=="AA"){print toupper($(i+1));next}}'
This gives me the exact information I need and converts to uppercase, which is exactly what I want to do. How can I do this and then print the entire row with this altered value in its previous position? I am essentially trying to do a find and replace where the value is changed to uppercase.
EDIT:
Here is a sample input line:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP
and here is a how I would like the output to look:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=G|||;VT=SNP
All that has changed is the g after AA= is changed to uppercase.
The following awk may help:
awk '
{
match($0,/AA=[^|]*/);
print substr($0,1,RSTART+2) toupper(substr($0,RSTART+3,RLENGTH-3)) substr($0,RSTART+RLENGTH)
}
' Input_file
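An annotated sketch of the same match/substr approach, run on a shortened sample. The names head, val, and tail are illustrative, and the guard for lines without AA= is an addition, not part of the answer above.

```shell
# Split the line around the AA= value, uppercase only the value, and reassemble.
out=$(printf '%s\n' 'SAS_AF=0.0072;AA=g|||;VT=SNP' |
  awk '{
    if (match($0, /AA=[^|]*/)) {
      head = substr($0, 1, RSTART + 2)             # everything up to and including "AA="
      val  = substr($0, RSTART + 3, RLENGTH - 3)   # the value after "AA="
      tail = substr($0, RSTART + RLENGTH)          # the rest of the line
      print head toupper(val) tail
    } else print                                   # no AA= on this line: print unchanged
  }')
printf '%s\n' "$out"
```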
With GNU sed and perl, using word boundaries
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | sed 's/\bAA=[^;=|]*\b/\U&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | perl -pe 's/\bAA=[^;=|]*\b/\U$&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
\U uppercases the string that follows it, until the end, \E, or another case modifier.
Use the g modifier if there can be more than one match per line.

awk script to parse between two strings having same name

Let's say I have a text like this:
Hello, 12345
This is going to be fun
ABC:172-1345,
172-1323
There is more string here.
Hello, 34567
This is not going to be fun
ABC:172-2345
There is more string here
Output Should be
12345 ABC:172-1345
34567 ABC:172-2345
Can we achieve this in awk?
We also have to handle the last Hello, block, since no further Hello, follows it to mark where it ends.
Most simply:
awk -v RS=Hello, 'NR != 1 { print $1, $NF }'
This splits the file into records delimited by Hello, and prints the first and last token in each record. NR == 1 is excluded because it is the empty bit before the first Hello,.
Note that multi-character RS is not strictly POSIX-conforming, although the most common awks (mawk and gawk) accept it.
$ awk -v RS= '{print $2,$NF}' file
12345 ABC:172-1345
34567 ABC:172-2345
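If the blank-line or multi-character RS approaches do not fit the real file layout, a line-oriented sketch avoids record separators entirely. It assumes that every block has one line starting with "Hello," and one starting with "ABC:"; the variable id is illustrative.

```shell
# Remember the number from each "Hello," line and print it with the next "ABC:" token.
input='Hello, 12345
This is going to be fun
ABC:172-1345,
172-1323
There is more string here.
Hello, 34567
This is not going to be fun
ABC:172-2345
There is more string here'
out=$(printf '%s\n' "$input" |
  awk '/^Hello,/ { id = $2 }                               # number after "Hello,"
       /^ABC:/   { sub(/,$/, ""); print id, $1 }')         # strip a trailing comma, then print
printf '%s\n' "$out"
```

Because the state is carried in a variable, the last block needs no closing Hello, line.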