How I can print line every four lines [duplicate] - awk

This question already has answers here:
Print every n lines from a file
(4 answers)
Closed 3 years ago.
I have a file that contains 8000 lines, I want to print lines 1,4,8,12,...,7996.
I tried this code
for j in {1 .. 8000}
do
k= $((4 * $j))
print k
sed -n $k P test.dat >> test.dat1
done
but this error appears:
./test.csh: line 3: 4 * {1: syntax error: operand expected (error token is "{1")
what is the problem, how can I do this?

Use awk command:
awk 'NR%4==1{print}' input.txt
Explanation:
NR % 4 == 1 { # for every input line, which line number (NR) modulo 4 is 1
print $0; # print the line
}

If you prefer sed
sed -n '0~4p'
Interesting enough, this is the exact sample shown in man sed
'FIRST~STEP'
This GNU extension matches every STEPth line starting with line
FIRST. In particular, lines will be selected when there exists a
non-negative N such that the current line-number equals FIRST + (N
* STEP). Thus, one would use '1~2' to select the odd-numbered
lines and '0~2' for even-numbered lines; to pick every third line
starting with the second, '2~3' would be used; to pick every fifth
line starting with the tenth, use '10~5'; and '50~0' is just an
obscure way of saying '50'.
The following commands demonstrate the step address usage:
$ seq 10 | sed -n '0~4p'
4
8

You seem to be attempting to use Bash syntax even though your question is tagged csh.
Even so, this has multiple errors. You can't have a space after the assignment operator, and you need to quote the sed script (otherwise it thinks P is the first file name ... Or I guess you actually mean p).
But the idea that you need to loop over the file 8000 times and print one line out of it in each iteration is also flawed. You can tell sed to print every fourth line with a single script like
sed -n -e 1p -e 5p -e 9p ... filename
I would switch to Awk for this, though.
awk 'FNR%4==1' filename

Related

How to use awk to count the occurence of a word beginning with something?

I have a file that looks like this:
**FID IID**
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurences of IDs beginning with 'RQ' in column 2.
So for the little snapshot, it should be 2. After the RQ, the numbers differ so I want a count with anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
Once Field 2 starts with RQ, increment cnt and once the file is processed print cnt.
See the online demo.
You did
{if(match("^RQ$",$2))print}
but compulsory arguments to match function are string, regexp. Also do not use $ if you are interesting in finding strings starting with as $ denotes end. After fixing that issues code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer does describe solely fixing problems with your current code, it does not contain any ways to ameliorate your code.
Also apart from the reversed parameters for match, the file ID.txt should come right after the closing single quote.
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt

How to print lines only after specific line (pattern) in awk? [duplicate]

Question: I'd like to print a single line directly following a line that contains a matching pattern.
My version of sed will not take the following syntax (it bombs out on +1p) which would seem like a simple solution:
sed -n '/ABC/,+1p' infile
I assume awk would be better to do multiline processing, but I am not sure how to do it.
Never use the word "pattern" in this context as it is ambiguous. Always use "string" or "regexp" (or in shell "globbing pattern"), whichever it is you really mean. See How do I find the text that matches a pattern? for more about that.
The specific answer you want is:
awk 'f{print;f=0} /regexp/{f=1}' file
or specializing the more general solution of the Nth record after a regexp (idiom "c" below):
awk 'c&&!--c; /regexp/{c=1}' file
The following idioms describe how to select a range of records given a specific regexp to match:
a) Print all records from some regexp:
awk '/regexp/{f=1}f' file
b) Print all records after some regexp:
awk 'f;/regexp/{f=1}' file
c) Print the Nth record after some regexp:
awk 'c&&!--c;/regexp/{c=N}' file
d) Print every record except the Nth record after some regexp:
awk 'c&&!--c{next}/regexp/{c=N}1' file
e) Print the N records after some regexp:
awk 'c&&c--;/regexp/{c=N}' file
f) Print every record except the N records after some regexp:
awk 'c&&c--{next}/regexp/{c=N}1' file
g) Print the N records from some regexp:
awk '/regexp/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.
f is short for found. Its a boolean flag that I'm setting to 1 (true) when I find a string matching the regular expression regexp in the input (/regexp/{f=1}). The other place you see f on its own in each script it's being tested as a condition and when true causes awk to execute its default action of printing the current record. So input records only get output after we see regexp and set f to 1/true.
c && c-- { foo } means "if c is non-zero then decrement it and if it's still non-zero then execute foo" so if c starts at 3 then it'll be decremented to 2 and then foo executed, and on the next input line c is now 2 so it'll be decremented to 1 and then foo executed again, and on the next input line c is now 1 so it'll be decremented to 0 but this time foo will not be executed because 0 is a false condition. We do c && c-- instead of just testing for c-- > 0 so we can't run into a case with a huge input file where c hits zero and continues getting decremented so often it wraps around and becomes positive again.
It's the line after that match that you're interesting in, right? In sed, that could be accomplished like so:
sed -n '/ABC/{n;p}' infile
Alternatively, grep's A option might be what you're looking for.
-A NUM, Print NUM lines of trailing context after matching lines.
For example, given the following input file:
foo
bar
baz
bash
bongo
You could use the following:
$ grep -A 1 "bar" file
bar
baz
$ sed -n '/bar/{n;p}' file
baz
I needed to print ALL lines after the pattern ( ok Ed, REGEX ), so I settled on this one:
sed -n '/pattern/,$p' # prints all lines after ( and including ) the pattern
But since I wanted to print all the lines AFTER ( and exclude the pattern )
sed -n '/pattern/,$p' | tail -n+2 # all lines after first occurrence of pattern
I suppose in your case you can add a head -1 at the end
sed -n '/pattern/,$p' | tail -n+2 | head -1 # prints line after pattern
And I really should include tlwhitec's comment in this answer (since their sed-strict approach is the more elegant than my suggestions):
sed '0,/pattern/d'
The above script deletes every line starting with the first and stopping with (and including) the line that matches the pattern. All lines after that are printed.
awk Version:
awk '/regexp/ { getline; print $0; }' filetosearch
If pattern match, copy next line into the pattern buffer, delete a return, then quit -- side effect is to print.
sed '/pattern/ { N; s/.*\n//; q }; d'
Actually sed -n '/pattern/{n;p}' filename will fail if the pattern match continuous lines:
$ seq 15 |sed -n '/1/{n;p}'
2
11
13
15
The expected answers should be:
2
11
12
13
14
15
My solution is:
$ sed -n -r 'x;/_/{x;p;x};x;/pattern/!s/.*//;/pattern/s/.*/_/;h' filename
For example:
$ seq 15 |sed -n -r 'x;/_/{x;p;x};x;/1/!s/.*//;/1/s/.*/_/;h'
2
11
12
13
14
15
Explains:
x;: at the beginning of each line from input, use x command to exchange the contents in pattern space & hold space.
/_/{x;p;x};: if pattern space, which is the hold space actually, contains _ (this is just a indicator indicating if last line matched the pattern or not), then use x to exchange the actual content of current line to pattern space, use p to print current line, and x to recover this operation.
x: recover the contents in pattern space and hold space.
/pattern/!s/.*//: if current line does NOT match pattern, which means we should NOT print the NEXT following line, then use s/.*// command to delete all contents in pattern space.
/pattern/s/.*/_/: if current line matches pattern, which means we should print the NEXT following line, then we need to set a indicator to tell sed to print NEXT line, so use s/.*/_/ to substitute all contents in pattern space to a _(the second command will use it to judge if last line matched the pattern or not).
h: overwrite the hold space with the contents in pattern space; then, the content in hold space is ^_$ which means current line matches the pattern, or ^$, which means current line does NOT match the pattern.
the fifth step and sixth step can NOT exchange, because after s/.*/_/, the pattern space can NOT match /pattern/, so the s/.*// MUST be executed!
This might work for you (GNU sed):
sed -n ':a;/regexp/{n;h;p;x;ba}' file
Use seds grep-like option -n and if the current line contains the required regexp replace the current line with the next, copy that line to the hold space (HS), print the line, swap the pattern space (PS) for the HS and repeat.
Piping some greps can do it (it runs in POSIX shell and under BusyBox):
cat my-file | grep -A1 my-regexp | grep -v -- '--' | grep -v my-regexp
-v will show non-matching lines
-- is printed by grep to separate each match, so we skip that too
If you just want the next line after a pattern, this sed command will work
sed -n -e '/pattern/{n;p;}'
-n supresses output (quiet mode);
-e denotes a sed command (not required in this case);
/pattern/ is a regex search for lines containing the literal combination of the characters pattern (Use /^pattern$/ for line consisting of only of “pattern”;
n replaces the pattern space with the next line;
p prints;
For example:
seq 10 | sed -n -e '/5/{n;p;}'
Note that the above command will print a single line after every line containing pattern. If you just want the first one use sed -n -e '/pattern/{n;p;q;}'. This is also more efficient as the whole file is not read.
This strictly sed command will print all lines after your pattern.
sed -n '/pattern/,${/pattern/!p;}
Formatted as a sed script this would be:
/pattern/,${
/pattern/!p
}
Here’s a short example:
seq 10 | sed -n '/5/,${/5/!p;}'
/pattern/,$ will select all the lines from pattern to the end of the file.
{} groups the next set of commands (c-like block command)
/pattern/!p; prints lines that doesn’t match pattern. Note that the ; is required in early versions, and some non-GNU, of sed. This turns the instruction into a exclusive range - sed ranges are normally inclusive for both start and end of the range.
To exclude the end of range you could do something like this:
sed -n '/pattern/,/endpattern/{/pattern/!{/endpattern/d;p;}}
/pattern/,/endpattern/{
/pattern/!{
/endpattern/d
p
}
}
/endpattern/d is deleted from the “pattern space” and the script restarts from the top, skipping the p command for that line.
Another pithy example:
seq 10 | sed -n '/5/,/8/{/5/!{/8/d;p}}'
If you have GNU sed you can add the debug switch:
seq 5 | sed -n --debug '/2/,/4/{/2/!{/4/d;p}}'
Output:
SED PROGRAM:
/2/,/4/ {
/2/! {
/4/ d
p
}
}
INPUT: 'STDIN' line 1
PATTERN: 1
COMMAND: /2/,/4/ {
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 2
PATTERN: 2
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: }
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 3
PATTERN: 3
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: /4/ d
COMMAND: p
3
COMMAND: }
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 4
PATTERN: 4
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: /4/ d
END-OF-CYCLE:
INPUT: 'STDIN' line 5
PATTERN: 5
COMMAND: /2/,/4/ {
COMMAND: }
END-OF-CYCLE:

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
if you're fine with zero based indices, this may be simpler.
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
assumes you have the header row as posted, if not remove sed command. Note that this assumes single byte chars as shown. For extended charsets it won't be the char position but byte-offset.

How to print the 'nth + x' lines after a match is found?

I have a file which contains the output below. I want only the lines which contain the actual vm_id number.
I want to match pattern 'vm_id' and print 2nd line + all other lines until 'rows' is reached.
FILE BEGIN:
vm_id
--------------------------------------
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
(6 rows)
datacenter=
FILE END:
So the resulting output would be;
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
Also, the number of VM Id's will vary, this example has 6 while others could have 3 or 300.
I have tried the following but they only output a single line that's specified;
awk 'c&&!--c;/vm_id/{c=2}'
and
awk 'c&&!--c;/vm_id/{c=2+1}'
$ awk '/rows/{f=0} f&&(++c>2); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
If you wanted that first line of hex(?) printed too then just change the starting number to compare c to from 2 to 1 (or 3 or 127 or however many lines you want to skip after hitting the vm_id line):
$ awk '/rows/{f=0} f&&(++c>1); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
What about this:
awk '/vm_id/{p=1;getline;next}/\([0-9]+ rows/{p=0}p'
I'm setting the p flag on vm_id and resetting it on (0-9+ rows).
Also sed comes in mind, the command follows basically the same logic as the awk command above:
sed -n '/vm_id/{n;:a;n;/([0-9]* rows)/!{p;ba}}'
Another thing, if it is safe that the only GUIDs in your input file are the vm ids, grep might be the tool of choise:
grep -Eo '([0-9a-f]+-){4}([0-9a-f]+)'
It's not 100% bullet proof in this form, but it should be good enough for the most use cases.
Bullet proof would be:
grep -Eoi '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'

Regarding duplicate entries from a file [duplicate]

Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that AWK will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works except for recent versions of sed when the input file ends with a blank line and no characterss. On my Mac my CPU just spins.
This is an infinite loop if the last line is blank and doesn't have any characterss:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print only once of each duplicate consecutive lines at its last appearance and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the contents of the current pattern space is two duplicate strings separated by \n, which means the next line is the same with current line, we can not print it according to our core idea; otherwise, which means the current line is the last appearance of all of its duplicate consecutive lines. We can now use the P command to print the characters in the current pattern space until \n (\n also printed).
D: we use the D command to delete the characters in the current pattern space until \n (\n also deleted), and then the content of pattern space is the next line.
and the D command will force sed to jump to its first command $!N, but not read the next line from a file or standard input stream.
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
print only once of each duplicate consecutive lines at its first appearance and use the : command and t command to implement LOOP.
Explanation:
read a new line from the input stream or file and print it once.
use the :loop command to set a label named loop.
use N to read the next line into the pattern space.
use s/^(.*)\n\1$/\1/ to delete the current line if the next line is the same with the current line. We use the s command to do the delete action.
if the s command is executed successfully, then use the tloop command to force sed to jump to the label named loop, which will do the same loop to the next lines until there are no duplicate consecutive lines of the line which is latest printed; otherwise, use the D command to delete the line which is the same with the latest-printed line, and force sed to jump to the first command, which is the p command. The content of the current pattern space is the next new line.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparison, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work, at the character set match.
The are three variations. The first is to eliminate adjacent repeat lines, the second to eliminate repeat lines wherever they occur, and the third to eliminate all but the last instance of lines in file.
pastebin
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
sed -E '
$!{
N;
s/[ \t]+$//;
/^(.*)\n\1$/!P;
D;
}
';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
sed -n -E '
s/[ \t]+$//;
G;
/^(\n){2,}/d;
/^([^\n]+).*\n\1(\n|$)/d;
h;
P;
';
}
lastrepeat() {
sed -n -E '
s/[ \t]+$//;
/^$/{
H;
d;
};
G;
# delete previous repeated line if found
s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
# after searching for previous repeat, move tested last line to end
s/^([^\n]+)(\n)(.*)/\3\2\1/;
$!{
h;
d;
};
# squeeze blank lines to one
s/(\n){3,}/\n\n/g;
s/^\n//;
p;
';
}
This can be achieved using AWK.
The below line will display unique values:
awk file_name | uniq
You can output these unique values to a new file:
awk file_name | uniq > uniq_file_name
The new file uniq_file_name will contain only unique values, without any duplicates.
Use:
cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
It deletes the duplicate lines using AWK.