awk match ONLY X and Y - awk

I want to do strict matching on a text file so that it only returns the patterns I have anded. So for example in a file:
xyz
xy
yx
zyx
I want to run a command similar to:
awk '/x/ && /y/' filename.txt
and I would like it to return only the lines:
yx
xy
and ignore the others because although they do contain an x and a y, they also have a z so they are ignored.
Is this possible in awk?

I'd keep it clear and simple. Your example doesn't include lines that contain only x or only y, so depending on whether those should match too, use either this (which accepts them):
$ awk '/^[xy]+$/' file
xy
yx
or:
$ awk '/x/ && /y/ && !/[^xy]/' file
xy
yx
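To see how the two commands differ, here is a small sketch that runs both on the same input. The extra lines xx and y are my own additions for illustration, and the function names are made up:

```shell
# Sample input: the four lines from the question plus two hypothetical extras.
sample() { printf '%s\n' xyz xy yx zyx xx y; }

# Lines made up solely of x and/or y characters (xx and y qualify).
only_xy_chars() { awk '/^[xy]+$/'; }

# Lines with at least one x AND at least one y, and no other character.
both_x_and_y_only() { awk '/x/ && /y/ && !/[^xy]/'; }

sample | only_xy_chars
sample | both_x_and_y_only
```

The first command also prints xx and y, while the second prints only xy and yx.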

The condition /x/ && /y/ matches any line that contains both an x and a y, whatever other characters are present.
Edit:
To allow the same chars in the whole string, you can use a repeated character class and assert the start and end of the string:
awk '/^[xy]+$/' file
If you also want to allow matching spaces, uppercase X and Y and do not want to match empty lines:
awk '/^[[:space:]]*[xyXY][[:space:]xyXY]*$/' file
The pattern matches:
^ Start of string
[[:space:]]* Match optional spaces
[xyXY] Match a single char x y X Y
[[:space:]xyXY]* Match optional spaces or x y X Y
$ End of string
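As a quick check of the pattern broken down above, a sketch like the following can be run against a few test lines. The sample lines and the function name are assumptions for illustration, and it needs an awk that understands POSIX character classes such as [[:space:]]:

```shell
# Keep lines made only of x/y (either case) and white space,
# with at least one letter present.
xy_with_spaces() { awk '/^[[:space:]]*[xyXY][[:space:]xyXY]*$/'; }

# Only the first two lines should survive: xyz contains a z,
# and the last two lines contain no letter at all.
printf '%s\n' 'xy' 'Xy xY' 'xyz' '   ' '' | xy_with_spaces
```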

Assumptions:
user provides a list of characters to match on (x and y in the provided example)
lines of interest are those that contain only said characters (plus white space)
matches should be case insensitive, ie, x will match on both x and X
blank/empty lines, and lines with only white space, are to be ignored
Adding more lines to the sample input:
$ cat filename.txt
xyz
xy
yx
zyx
---------
xxx
abc def xy
Xy xY XY
z x yy z
x; y; X; Y:
xxyYxy XXyxyy yx # tab delimited
# 1 space
# blank/empty line
NOTE: comments added for clarification; file does not contain any comments
One awk idea:
awk -v chars='xY' ' # provide list of characters (in the form of a string) to match on
BEGIN { regex="[" tolower(chars) "]" } # build regex of lowercase characters, eg: "[xy]"
{ line=tolower($0) # make copy of all lowercase line
gsub(/[[:space:]]/,"",line) # remove all white space
if (length(line) == 0) # if length of line==0 (blank/empty lines, lines with only white space) then ...
next # skip to next line of input
gsub(regex,"",line) # remove all characters matching regex
if (length(line) == 0) # if length of line == 0 (ie, no other characters) then ...
print $0 # print current line to stdout
}
' filename.txt
This generates:
xy
yx
xxx
Xy xY XY
xxyYxy XXyxyy yx
NOTE: the last 2 input lines (1 space, blank/empty) are ignored

This awk solution applies the condition on the main block to process only lines containing 'x' and 'y' using /x/&&/y/.
Inside the action block the record $0 is assigned to a variable named temp which then has the 'x' and 'y' occurrences removed using gsub(/[xy]/, "",temp). A conditional block then determines the length of temp after the substitution: if the length is 0, the line could only have contained 'x' and 'y' characters, so the line is printed.
awk '/x/&&/y/ { temp=$0; gsub(/[xy]/, "",temp); if (length(temp)==0){print $0}}' input.txt
tested with input.txt file:
xyz
xy
yx
zyx
y
x
xxy
yyx
result:
xy
yx
xxy
yyx

You can treat the strings as a set of characters and do a set equality on the two strings.
awk -v set='xy' '
function cmp(s1, s2) {
# turns s1 and s2 into associative arrays to do a set equality comparison
# cmp("xy", "xyxyxyxy") returns 1; cmp("xy", "xyz") returns 0
split("", a1); split("", a2) # clear the arrays from last use
split(s1, tmp, ""); for (i in tmp) a1[tmp[i]]
split(s2, tmp, ""); for (i in tmp) a2[tmp[i]]
if (length(a1) != length(a2)) return 0
for (e in a1) if (!(e in a2)) return 0
return 1
}
cmp(set, $1)' file
Prints:
xy
yx
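The script above relies on split(s, arr, "") producing one element per character and on length() of an array, which gawk supports but which is not guaranteed in every awk. A possible variant of the same set-equality idea using only substr() and index() might look like this (same_charset and same are names I made up):

```shell
same_charset() {
  # $1 = set of characters; input lines are read from stdin
  awk -v set="$1" '
    function same(a, b,    i) {
      # every character of a must occur in b ...
      for (i = 1; i <= length(a); i++)
        if (index(b, substr(a, i, 1)) == 0) return 0
      # ... and every character of b must occur in a
      for (i = 1; i <= length(b); i++)
        if (index(a, substr(b, i, 1)) == 0) return 0
      return 1
    }
    same(set, $1)
  '
}

printf '%s\n' xyz xy yx zyx xxy x | same_charset xy
```

With this input only xy, yx and xxy are printed; a line consisting of just x is rejected because the set also requires a y.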

Related

Force grep to grab matching line twice

Assume that I have:
Z 10
Z 11
Y 10
I used:
$ grep "Z" <above_file> -A 1
Z 10
Z 11
Y 10
How can I get it to return:
Z 10
Z 11
Z 11
Y 10
In essence, if grep sees that the next line also matches the pattern, I want it duplicated. Is the best/only solution to go through the file manually line by line, or to use a complex awk statement with conditionals? There's further processing after this step, but this is the edge case that is holding me up.
Try:
$ awk 'f{print; f=0} /Z/{print; f=1}' file
Z 10
Z 11
Z 11
Y 10
How it works
Awk implicitly reads through the input file one line at a time. The script uses a single variable, f, which is true (non-zero) if the previous line matched Z.
f{print; f=0}
If f is non-zero, print this line and set f=0.
/Z/{print; f=1}
If this line matches the regex Z, then print this line and set f=1.
Note that there is no need to initialize f. In awk, undefined variables default to either zero (in a numeric context) or an empty string (in a character context). In either case, an undefined variable is logical-false.
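The same flag technique can be wrapped so that the pattern is passed in rather than hard-coded. This is only a sketch; the function name dup_after_match and the variable names pat and prev are my own:

```shell
# Print every line that matches pat, and also print the line that
# follows a match (so a matching line after a match appears twice).
dup_after_match() {
  awk -v pat="$1" '
    prev     { print; prev = 0 }   # previous line matched: echo this line
    $0 ~ pat { print; prev = 1 }   # this line matches: print it and remember
  '
}

printf '%s\n' 'Z 10' 'Z 11' 'Y 10' | dup_after_match Z
```

The order of the two rules matters: the "previous line matched" rule has to run first, otherwise the flag set by the current line would immediately trigger a second print of the same line.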
Maybe this would do too:
$ sed -n ':check /^Z/ {p; n; h; p; x; b check}' file
-- :check is a label for branching, for lines matching /^Z/ (so, starting with Z) sed goes through loop:
print the line (= print the matched line)
go to next one
copy it to the hold buffer
print it (= print the line after matched)
exchange the line, i.e. move hold buffer back (= return the line after matched one)
branch to check to repeat the whole process if the line matches ^Z (= check it)
In principle, sed should be good with this kind of recursion (sed doesn't store any stacks, right?), but it may not be.
Also I'm not sure if the script is really correct :)

Print lines containing exact strings in several columns awk ubuntu

How do I print out the lines in awk, that contain certain strings in certain columns e.g. str = "x" in first column and str = "y" in second column?
x y
d y
f o
x o
So that in this example only the first line is printed?
Thanks in advance!
$ awk '$1=="x" && $2=="y"' file
x y
How it works
awk statements consist of conditions and actions. In this case, the condition is that the first column equals x and the second column equals y. Since we don't specify an action, awk performs its default action which is to print the line.
In other words, $1=="x" && $2=="y" is a condition. && means logical-and. Thus, this condition is true only if both $1=="x" and $2=="y" are true.
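If the strings to look for are not fixed, the same condition can take them from variables. A minimal sketch, assuming whitespace-separated columns as in the example (match_cols, a and b are hypothetical names):

```shell
match_cols() {
  # $1 and $2 are the exact strings required in columns 1 and 2
  awk -v a="$1" -v b="$2" '$1 == a && $2 == b'
}

printf '%s\n' 'x y' 'd y' 'f o' 'x o' | match_cols x y
```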

Extracting ID and sequence from a FASTQ file

I'm trying to manipulate a Fastq file.
It looks like this:
#HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B#=4A9#:6#96:1??9;>##########
#HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA#B7BB:7?#####################
#HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=#CB#?BA/#BA;6>BBA8A6A<?A4?B=
...
...
...
#HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B#BA#AB#B:74:;8=>7
My expected output is:
#HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
#HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
#HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
So, the ID lines are those starting with #HWUSI (e.g. #HWUSI-EAS610:1:1:7:1951#0/1). After each ID there is a line with its sequence.
Now, I would like to obtain a file with only each ID and its corresponding sequence, and the sequence should be reverse-complemented (A=T, T=A, C=G, G=C).
With Sed I can obtain all the sequence reverse and complementary with the command
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I obtain also the corresponding ID?
With sed:
sed -n '/#HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
This works as follows:
/#HWUSI/ { # If a line contains #HWUSI
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then invert it
p # and print it.
}
I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p with a P (upper case instead of lower case). Where p prints the whole pattern space, P only prints the stuff before the first newline.
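For comparison, roughly the same result can be had from awk. This is only a sketch and it makes an assumption the sed version does not: that the file is made of strict 4-line records (ID, sequence, +, quality), so the ID is line 1 and the sequence is line 2 of every group of four. The function name revcomp_ids is made up:

```shell
revcomp_ids() {
  awk '
    BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
    NR % 4 == 1 { print; next }            # ID line: print unchanged
    NR % 4 == 2 {                          # sequence line
      out = ""
      for (i = length($0); i >= 1; i--) {  # walk the sequence backwards
        ch = substr($0, i, 1)
        c = (ch in comp) ? comp[ch] : ch   # complement; N and others stay as-is
        out = out c
      }
      print out
    }
  ' "$@"
}

printf '%s\n' '#id1' 'GATTNC' '+' 'BBBBBB' | revcomp_ids
```

For the toy record above this prints the ID followed by GNAATC.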
$ sed -n '/^[^#]/y/ATCG/TAGC/;/^#/p;/^[ATCGN]*$/p' file
#HWUSI-EAS610:1:1:3:1131#0/1
CTACGATTCGGGGATTCCAGTATTCTGACNNTNCAG
#HWUSI-EAS610:1:1:3:888#0/1
CTATCCTGGTTTGTAGATTGTAGAAGGGCNNCNAAG
#HWUSI-EAS610:1:1:4:941#0/1
CGAATCCTTCCTTCCTTCCTTCCCCACAAGACATCA
#HWUSI-EAS610:1:1:7:1951#0/1
ACTATCTATTCACGGATGGACGAATGCAATGAGAGG
Explanation (note that this complements each sequence but does not reverse it):
/^[^#]/y/ATCG/TAGC/ # Translate bases on lines that don't start with an #
/^#/p # Print IDs
/^[ATCGN]*$/p # Print sequence lines

In AWK, is it possible to specify "ranges" of fields?

In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split /\t/;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields and will only work if your FS is a single character.
I'm late, but this is quick and to the point, so I'll leave it here. In cases like this I normally just remove the fields I don't need with sub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields (using [^\t]* rather than \w, so that fields longer than one character are handled too):
awk '{sub(/^([^\t]*\t){31}/,"");print}'
Example of removing 4 fields, because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{sub(/^([^\t]*\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember and uses less CPU cycles than horrendous loops.
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start="$start_field" -v end="$end_field" 'BEGIN{FS=OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s", $i; # the comma is required; without it $i becomes part of the format
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}' foo > bar
This looks a bit hacky, however:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought that you can always find a character that is not in the input: use \n.
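Put together with a newline as the marker character, the idea might look like the sketch below. It assumes an awk that rebuilds the record when NF is assigned (gawk and mawk do); field_range is a hypothetical name, and for a tab-separated file you would also set FS and OFS to a tab:

```shell
# Print fields s..e, using a newline as the marker character.
field_range() {
  awk -v s="$1" -v e="$2" '{
    NF = e              # drop the fields after e
    $s = "\n" $s        # put a newline in front of field s
    sub(/.*\n/, "")     # drop everything up to and including the marker
    print
  }'
}

printf '%s\n' '1 2 3 4 5 6 7 8 9' 'a b c d e f g h i' | field_range 3 6
```

Since awk reads its input one line at a time, a newline can never occur inside a record, which is what makes it a safe marker.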
Unfortunately I don't seem to have access to my account anymore, and I don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the field separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and cuts off everything before the first comma by showing everything from the second field on, resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l, s,i) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
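A possible way to call it from the command line is sketched below. The wrapper print_range and the variables first and last are my own names, and i is declared as an extra local so the loop counter does not leak into the rest of the program:

```shell
print_range() {
  awk -v first="$1" -v last="$2" '
    function subby(f, l,    s, i) {   # s and i are local variables
      s = $f
      for (i = f + 1; i <= l; i++)
        s = sprintf("%s %s", s, $i)
      return s
    }
    { print subby(first, last) }
  '
}

printf '%s\n' '1 2 3 4 5 6 7 8 9' 'a b c d e f g h i' | print_range 3 6
```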
(I know OP requested "in AWK" but ... )
Using bash expansion on the command line to generate arguments list;
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
explanation ;
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
placed on single line using semi-colons, inside $() to evaluate/expand in place.
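If the brace expansion is not available (it is a bash feature), roughly the same list can be built with a plain loop in a small helper. This is just a sketch and field_list is a made-up name:

```shell
# Build a list such as $3,$4,$5 for use inside an awk print statement.
field_list() {
  list=""
  i=$1
  while [ "$i" -le "$2" ]; do
    list="${list:+$list,}\$$i"   # add a comma only between entries
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

printf '%s\n' '1 2 3 4 5 6 7 8 9' | awk "{print $(field_list 3 7)}"
```

Because the comma is only added between entries, there is no trailing comma to strip afterwards.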

Delete matching nth line until blank line in awk/sed/grep

I need to delete the nth matching line in a file from the match up to the next blank line (i.e. one chunk of blank line delimited text starting with the nth match).
This will delete a chunk of text that starts and ends with a blank line starting with the fourth blank line. It also deletes those delimiting lines.
sed -n '/^$/!{p;b};H;x;/^\(\n[^\n]*\)\{4\}/{:a;n;/^$/!ba;d};x;p' inputfile
Change the first /^$/ to change the start match. Change the second one to change the end match.
Given this input:
aaa
---
bbb
---
ccc
---
ddd delete me
eee delete me
===
fff
---
ggg
This version of the command:
sed -n '/^---$/!{p;b};H;x;/^\(\n[^\n]*\)\{3\}/{:a;n;/^===$/!ba;d};x;p' inputfile
would give this as the result:
aaa
---
bbb
---
ccc
fff
---
ggg
Edit:
I removed an extraneous b instruction from the sed commands above.
Here's a commented version:
sed -n ' # don't print by default
/^---$/!{ # if the input line doesn't match the begin block marker
p; # print it
b}; # branch to end of script and start processing next input line
H; # line matches begin mark, append to hold space
x; # swap pattern space and hold space
/^\(\n[^\n]*\)\{3\}/{ # if what was in hold consists of 3 lines
# in other words, 3 copies of the begin marker
:a; # label a
n; # read the next line
/^===$/!ba; # if it's not the end of block marker, branch to :a
d}; # otherwise, delete it, d branches to the end automatically
x; # swap pattern space and hold space
p; # print the line (it's outside the block we're looking for)
' inputfile # end of script, name of input file
Any unambiguous pattern should work for the begin and end markers. They can be the same or different.
perl -00 -pe 'if (/pattern/) {++$count == $n and $_ = "$`\n";}' file
-00 is to read the file in "paragraph" mode (record separator is one or more blank lines)
$` is Perl's special variable for the "prematch" (text in front of the matching pattern)
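awk has a similar paragraph mode when RS is set to the empty string. The sketch below is not equivalent to the perl one-liner: it drops the whole n-th matching paragraph, including any lines that come before the match inside that paragraph. The function name and the variables n, pat and count are assumptions of mine:

```shell
# Drop the n-th blank-line-delimited paragraph that matches pat.
drop_nth_paragraph() {
  awk -v RS= -v ORS='\n\n' -v n="$1" -v pat="$2" '
    $0 ~ pat && ++count == n { next }   # skip the n-th matching paragraph
    { print }
  '
}

printf 'HEADER\na\n\nHEADER\nb\n\nHEADER\nc\n' | drop_nth_paragraph 2 '^HEADER'
```

Note that every printed paragraph is followed by an empty line because of ORS, so the output ends with one extra blank line.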
In AWK
/m1/ {i++};
(i==3) {while (getline temp > 0 && temp != "" ){}; if (temp == "") {i++;next}};
{print}
Transforms this (each block ends with a blank line, which the listing does not show):
m1 1
first
m1 2
second
m1 3
third delete me!
m1 4
fourth
m1 5
last
into this:
m1 1
first
m1 2
second
m1 4
fourth
m1 5
last
deleting the third block of "m1" ...
HTH!
Obligatory awk script. Just change n=2 to whatever your nth match should be.
n=2; awk -v n=$n '/^HEADER$/{++i==n && ++flag} !flag; /^$/&&flag{flag=0}' ./file
Input (each block is followed by a blank line, which the listing does not show)
$ cat ./file
HEADER
line1a
line2a
line3a
HEADER
line1b
line2b
line3b
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
Output
$ n=2; awk -v n=$n '/^HEADER$/{++i==n&&++flag} !flag; /^$/&&flag{flag=0}' ./file
HEADER
line1a
line2a
line3a
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d