Assume that I have:
Z 10
Z 11
Y 10
I used:
$ grep "Z" <above_file> -A 1
Z 10
Z 11
Y 10
How can I get it to return:
Z 10
Z 11
Z 11
Y 10
In essence, if grep sees that the next line also matches the pattern, I want it duplicated. Is the best/only solution to manually go through line by line or use a complex awk statement with conditionals? There's further processing after this step, but this is the edge case that is holding me up.
Try:
$ awk 'f{print; f=0} /Z/{print; f=1}' file
Z 10
Z 11
Z 11
Y 10
How it works
Awk implicitly reads through the input file one line at a time. The script uses a single variable, f, which is true (non-zero) if the previous line matched Z.
f{print; f=0}
If f is non-zero, print this line and set f=0.
/Z/{print; f=1}
If this line matches the regex Z, then print this line and set f=1.
Note that there is no need to initialize f. In awk, undefined variables default to either zero (in a numeric context) or an empty string (in a character context). In either case, an undefined variable is logical-false.
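As a rough sketch of the same flag technique, the pattern can be passed in as a variable instead of being hard-coded. The function name dup_after_match and the awk variable pat are made up for this illustration; they are not part of the original answer.

```shell
# Hypothetical helper: print every line matching the regex in $1,
# and also print the line that follows each match.
dup_after_match() {
    awk -v pat="$1" '
        f        { print; f = 0 }   # previous line matched: print this line
        $0 ~ pat { print; f = 1 }   # this line matches: print it and set the flag
    '
}

printf 'Z 10\nZ 11\nY 10\n' | dup_after_match 'Z'
```

Because both rules run on the same line, a line that follows a match and also matches itself is printed twice, which is the duplication asked for.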
Maybe this would do too:
$ sed -n ':check /^Z/ {p; n; h; p; x; b check}' file
-- :check is a label for branching, for lines matching /^Z/ (so, starting with Z) sed goes through loop:
print the line (= print the matched line)
go to next one
copy it to the hold buffer
print it (= print the line after matched)
exchange the line, i.e. move hold buffer back (= return the line after matched one)
branch to check to repeat the whole process if the line matches ^Z (= check it)
In principle, sed should be good with this kind of recursion (sed doesn't store any stacks, right?), but it may not be.
Also I'm not sure if the script is really correct :)
How do I print out the lines in awk, that contain certain strings in certain columns e.g. str = "x" in first column and str = "y" in second column?
x y
d y
f o
x o
So that in this example only the first line is printed?
Thanks in advance!
$ awk '$1=="x" && $2=="y"' file
x y
How it works
awk statements consist of conditions and actions. In this case, the condition is that the first column equals x and the second column equals y. Since we don't specify an action, awk performs its default action which is to print the line.
In other words, $1=="x" && $2=="y" is a condition. && means logical-and. Thus, this condition is true only if both $1=="x" and $2=="y" are true.
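For illustration, here is the same condition with the default action written out and the two strings passed in with -v. The helper name match_cols and the variables a and b are assumptions made for this sketch only.

```shell
# Hypothetical helper: print lines whose first column equals $1
# and whose second column equals $2.
match_cols() {
    awk -v a="$1" -v b="$2" '$1 == a && $2 == b { print }'
}

printf 'x y\nd y\nf o\nx o\n' | match_cols x y
```

Writing { print } explicitly behaves the same as omitting the action.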
I'm trying to manipulate a Fastq file.
It looks like this:
#HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B#=4A9#:6#96:1??9;>##########
#HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA#B7BB:7?#####################
#HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=#CB#?BA/#BA;6>BBA8A6A<?A4?B=
...
...
...
#HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B#BA#AB#B:74:;8=>7
My expected output is:
#HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
#HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
#HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
So, the ID lines are those starting with #HWUSI (e.g. #HWUSI-EAS610:1:1:7:1951#0/1). After each ID there is a line with its sequence.
Now, I would like to obtain a file with only each ID and its corresponding sequence, and the sequence should be reversed and complemented (A=T, T=A, C=G, G=C).
With sed, rev and tr I can obtain all the sequences reversed and complemented with the command
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I obtain also the corresponding ID?
With sed:
sed -n '/#HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
This works as follows:
/#HWUSI/ { # If a line contains #HWUSI
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then complement the bases
p # and print it.
}
I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p with a P (upper case instead of lower case). Where p prints the whole pattern space, P only prints the stuff before the first newline.
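If the sed loop is hard to follow, the same reverse-and-complement step can be sketched in awk. This is only a sketch under the assumption that every record is exactly four lines (ID, sequence, +, quality), as in the sample; the function name revcomp_fastq and the array name comp are made up here.

```shell
# Hypothetical helper: print each ID line followed by the reverse
# complement of its sequence, assuming strict 4-line FASTQ records.
revcomp_fastq() {
    awk '
        BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
        NR % 4 == 1 { print }                          # ID line
        NR % 4 == 2 {                                  # sequence line
            out = ""
            for (i = length($0); i >= 1; i--) {        # walk the sequence backwards
                b = substr($0, i, 1)
                out = out ((b in comp) ? comp[b] : b)  # N and other symbols pass through
            }
            print out
        }
    '
}

printf '#id1\nGATN\n+\nBBBB\n' | revcomp_fastq
```

Selecting lines by position (NR % 4) rather than by pattern avoids misreading a quality line that happens to start with #.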
$ sed -n '/^[^#]/y/ATCG/TAGC/;/^#/p;/^[ATCGN]*$/p' file
#HWUSI-EAS610:1:1:3:1131#0/1
CTACGATTCGGGGATTCCAGTATTCTGACNNTNCAG
#HWUSI-EAS610:1:1:3:888#0/1
CTATCCTGGTTTGTAGATTGTAGAAGGGCNNCNAAG
#HWUSI-EAS610:1:1:4:941#0/1
CGAATCCTTCCTTCCTTCCTTCCCCACAAGACATCA
#HWUSI-EAS610:1:1:7:1951#0/1
ACTATCTATTCACGGATGGACGAATGCAATGAGAGG
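Explanation (note that this command complements the bases but does not reverse them, so its output is not the reverse complement asked for in the question)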
/^[^#]/y/ATCG/TAGC/ # Translate bases on lines that don't start with an #
/^#/p # Print IDs
/^[ATCGN]*$/p # Print sequence lines
In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields and will only work if your FS is a single character.
I'm late but this is quick and to the point so I'll leave it here. In cases like this I normally just remove the fields I don't need with sub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{sub(/^([^\t]*\t){31}/,"");print}'
example of removing 4 fields because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{sub(/^([^\t]*\t){4}/,"");print}'
Output:
e f
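This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops. Note that it only removes the leading fields; any fields after the end of the range are still printed.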
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
# Reads from standard input; prints fields start..end separated by OFS.
awk -v start="$start_field" -v end="$end_field" 'BEGIN{OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s", $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}'
This looks a bit hacky, however:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
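A compact variant of the loop, wrapped in a shell function so the range can be given as arguments, might look like the sketch below. The name field_range is invented for this example, and it assumes tab-separated input (FS is set to a tab, which the script above does not do).

```shell
# Hypothetical helper: print tab-separated fields $1..$2 of every input line.
field_range() {
    awk -F'\t' -v OFS='\t' -v s="$1" -v e="$2" '{
        for (i = s; i <= e; i++)
            printf "%s%s", $i, (i < e ? OFS : ORS)   # OFS between fields, ORS after the last
    }'
}

printf 'a\tb\tc\td\te\tf\n' | field_range 2 4
```

For the file in the question this would be called as field_range 32 57 < foo > bar.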
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought that you can always find a character that is not in the input: use \n.
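Here is a sketch of that idea with a newline as the marker. It assumes tab-separated input and an awk that rebuilds the record when NF is assigned (gawk and mawk do); the function name field_slice is made up for the illustration.

```shell
# Hypothetical helper: keep only tab-separated fields $1..$2 of each line,
# using a newline as the "character that cannot occur in the input".
field_slice() {
    awk -F'\t' -v OFS='\t' -v s="$1" -v e="$2" '{
        NF = e                  # drop the fields after e
        $s = "\n" $s            # put a newline marker in front of field s
        sub(/^[^\n]*\n/, "")    # drop everything up to and including the marker
        print                   # print the edited line
    }'
}

printf 'a\tb\tc\td\te\tf\tg\n' | field_slice 3 5
```

Since awk reads the input one line at a time, a record never contains a newline of its own, so the marker cannot collide with the data.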
Unfortunately I don't seem to have access to my account anymore, and I don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the field separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and cuts off everything before the first comma by showing everything from the second field on, resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l, s,i) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
(I know OP requested "in AWK" but ... )
Using bash expansion on the command line to generate arguments list;
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
explanation ;
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
placed on single line using semi-colons, inside $() to evaluate/expand in place.
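Brace expansion such as {3..7} cannot take shell variables as its bounds, so a small function with an explicit loop may be handier when the range is not fixed. This is a sketch only; the name field_list is hypothetical.

```shell
# Hypothetical helper: emit "$S,$S+1,...,$E" for use inside an awk print statement.
field_list() {
    list=""
    i=$1
    while [ "$i" -le "$2" ]; do
        list="$list\$$i,"      # append a literal $, the field number and a comma
        i=$((i + 1))
    done
    printf '%s\n' "${list%,}"  # strip the trailing comma
}

field_list 3 7
printf '1 2 3 4 5 6 7 8 9\n' | awk "{print $(field_list 3 7)}"
```

The awk program is in double quotes so that the shell expands $(field_list 3 7) before awk sees it.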
I need to delete the nth matching line in a file from the match up to the next blank line (i.e. one chunk of blank line delimited text starting with the nth match).
This deletes the chunk of text that begins at the fourth blank line and ends at the next blank line, including both delimiting blank lines.
sed -n '/^$/!{p;b};H;x;/^\(\n[^\n]*\)\{4\}/{:a;n;/^$/!ba;d};x;p' inputfile
Change the first /^$/ to change the start match. Change the second one to change the end match.
Given this input:
aaa
---
bbb
---
ccc
---
ddd delete me
eee delete me
===
fff
---
ggg
This version of the command:
sed -n '/^---$/!{p;b};H;x;/^\(\n[^\n]*\)\{3\}/{:a;n;/^===$/!ba;d};x;p' inputfile
would give this as the result:
aaa
---
bbb
---
ccc
fff
---
ggg
Edit:
I removed an extraneous b instruction from the sed commands above.
Here's a commented version:
sed -n ' # don't print by default
/^---$/!{ # if the input line doesn't match the begin block marker
p; # print it
b}; # branch to end of script and start processing next input line
H; # line matches begin mark, append to hold space
x; # swap pattern space and hold space
/^\(\n[^\n]*\)\{3\}/{ # if what was in hold consists of 3 lines
# in other words, 3 copies of the begin marker
:a; # label a
n; # read the next line
/^===$/!ba; # if it's not the end of block marker, branch to :a
d}; # otherwise, delete it, d branches to the end automatically
x; # swap pattern space and hold space
p; # print the line (it's outside the block we're looking for)
' inputfile # end of script, name of input file
Any unambiguous pattern should work for the begin and end markers. They can be the same or different.
perl -00 -pe 'if (/pattern/) {++$count == $n and $_ = "$`\n";}' file
-00 is to read the file in "paragraph" mode (record separator is one or more blank lines)
$` is Perl's special variable for the "prematch" (text in front of the matching pattern)
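awk has a comparable paragraph mode when RS is set to the empty string. The sketch below is a simplification rather than a translation of the Perl command: it drops the whole nth matching paragraph instead of keeping the text before the match. The function name delete_nth_paragraph and the variables pat, n and count are assumptions for this example.

```shell
# Hypothetical helper: drop the nth blank-line-delimited paragraph
# that matches the regex in $1 ($2 is n).
delete_nth_paragraph() {
    awk -v RS= -v ORS='\n\n' -v pat="$1" -v n="$2" '
        $0 ~ pat && ++count == n { next }   # skip the nth matching paragraph
        { print }                           # print every other paragraph
    '
}

printf 'm1 1\nfirst\n\nm1 2\nsecond\n\nm1 3\nthird\n' | delete_nth_paragraph '^m1' 2
```

With RS empty, each record is one paragraph, so ^ anchors at the start of the paragraph rather than at the start of every line.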
In AWK
/m1/ {i++};
(i==3) {while (getline temp > 0 && temp != "" ){}; if (temp == "") {i++;next}};
{print}
Transforms this:
m1 1
first
m1 2
second
m1 3
third delete me!
m1 4
fourth
m1 5
last
into this:
m1 1
first
m1 2
second
m1 4
fourth
m1 5
last
deleting the third block of "m1" ...
HTH!
Obligatory awk script. Just change n=2 to whatever your nth match should be.
n=2; awk -v n=$n '/^HEADER$/{++i==n && ++flag} !flag; /^$/&&flag{flag=0}' ./file
Input
$ cat ./file
HEADER
line1a
line2a
line3a
HEADER
line1b
line2b
line3b
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
Output
$ n=2; awk -v n=$n '/^HEADER$/{++i==n&&++flag} !flag; /^$/&&flag{flag=0}' ./file
HEADER
line1a
line2a
line3a
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
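For readers who find the one-liner dense, here is an unrolled sketch of the same logic with the rules commented. It assumes, as the script does, that blocks start with a line reading exactly HEADER and are separated by blank lines; the function name delete_nth_block and the variable name skip are invented here.

```shell
# Hypothetical helper: delete the nth HEADER block, up to and including
# the blank line that ends it.
delete_nth_block() {
    awk -v n="$1" '
        /^HEADER$/ { if (++i == n) skip = 1 }   # nth header: start skipping
        !skip                                   # print lines outside the skipped block
        /^$/ && skip { skip = 0 }               # a blank line ends the skipped block
    '
}

printf 'HEADER\na\n\nHEADER\nb\n\nHEADER\nc\n' | delete_nth_block 2
```

The order of the rules matters: the blank line that closes the block is tested for printing before the flag is reset, so it is dropped along with the block.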