Extract lines after a pattern - awk

I have 50 files in a folder, all of which contain a common pattern "^^". I want to print everything after "^^", together with the filename, and collect all the extracted lines into one output file. My code works fine on a single file, but it doesn't work across all the files.
awk '/\^^/{getline; getline; print FILENAME; print}' *.txt > output
Example
1.txt
ghghh hghg
ghfg hghg hjg
jhhkjh
kjhkjh kjh
^^
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
hghjhg hgj
jhgj
jhgjh kjgh
jhg
^^
bbbbbbbbbbbbbbbbbbbbbbb
Desired output.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
My actual output
1.txt
ghghh hghg
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz

To print the line after ^^, try:
$ awk 'f{print FILENAME ORS $0; f=0} /\^\^/{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
How it works:
f{print FILENAME ORS $0; f=0}
If variable f is true (nonzero), print the filename, the output record separator, and the current line. Then set f back to zero.
/\^\^/{f=1}
If the current line contains ^^, set f to one.
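A quick way to sanity-check this: recreate shortened stand-ins for the question's two files in a scratch directory and run the one-liner (any modern awk should do):

```shell
# Shortened stand-ins for the question's 1.txt and 2.txt
cd "$(mktemp -d)"
printf 'ghghh hghg\n^^\nzzzzz\n' > 1.txt
printf 'jhgj\n^^\nbbbbb\n' > 2.txt

# f is raised by a ^^ line and consumed by the line after it
out=$(awk 'f{print FILENAME ORS $0; f=0} /\^\^/{f=1}' *.txt)
printf '%s\n' "$out"
```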

$ awk 'FNR==1{print FILENAME; f=0} f; $1=="^^"{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb

I like a more "bash(ish)" approach.
grep -Hn '^^' *.txt |
cut -d: -f1,2 --output-delimiter=' ' |
while read -r f n; do echo "$f"; tail -n +$((n+1)) "$f"; done
grep -Hn prints the filename and line number for each match of your pattern.
With cut we keep only the two fields we need.
In the loop we read those two pieces of information into variables, so we can use them freely.
tail can print not only the last N lines, but also everything from line N onwards if you use a plus sign.
The arithmetic inside $((...)) skips over the pattern line itself.
This solves your issue, and it can print all the lines after the pattern, not only the next one.
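The pipeline can be exercised the same way; here is a minimal sketch with shortened sample files (assumes GNU grep, cut, and tail, since --output-delimiter is a GNU extension):

```shell
cd "$(mktemp -d)"
printf 'aaa\n^^\nzzzzz\n' > 1.txt
printf 'bbb\n^^\nyyyyy\nxxxxx\n' > 2.txt

# grep -Hn -> "1.txt:2:^^"; cut keeps "1.txt 2"; tail prints from line n+1 on
out=$(grep -Hn '^^' *.txt |
  cut -d: -f1,2 --output-delimiter=' ' |
  while read -r f n; do echo "$f"; tail -n +$((n+1)) "$f"; done)
printf '%s\n' "$out"
```

Note that 2.txt contributes both of its post-pattern lines, not just the first.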

use awk:
awk 'FNR==1{print FILENAME} FNR==1,/\^\^/{next}1' *.txt
Where:
print FILENAME when FNR == 1
FNR==1,/\^\^/{next}: all lines between FNR==1 and the first line matching ^^ will be skipped
1 at the end prints the remaining lines after the matched ^^ line
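The range version can be checked the same way, on shortened sample files:

```shell
cd "$(mktemp -d)"
printf 'aaa\nbbb\n^^\nzzzzz\n' > 1.txt
printf 'ccc\n^^\nyyyyy\n' > 2.txt

# Lines from FNR==1 through the ^^ line are skipped; the rest are printed by 1
out=$(awk 'FNR==1{print FILENAME} FNR==1,/\^\^/{next}1' *.txt)
printf '%s\n' "$out"
```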

The following outputs only if we have a file that matches our pattern:
awk 'FNR==1 { f=0 }; f; /\^\^/ { f=1; print FILENAME }' *.txt > output
Reset flag f on every new file.
Print if f is set.
Set f and print FILENAME if we match our pattern.
This one prints out the FILENAME regardless of matching pattern:
awk 'FNR==1 { f=0; print FILENAME }; f; /\^\^/ { f=1 }' *.txt > output
We can adjust the pattern matching in step 3 to whatever is required... exact matching, for instance, can be done with $0=="^^".

Assuming your files are named 1.txt through 50.txt:
for f in {1..50}.txt
do
    sed -nE "/^\^\^\s*$/{N;s/.+\n(.+)/$f\n\1/p}" "$f" > "$f.result.txt"
done

Stealing from some answers and comments to your previous question on this topic, you can also use grep -A and format the output with sed.
$ grep -A100 '^^' *.txt | sed '/\^^/d;/--/d;s/-/\n/'
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
Assuming 100 lines is sufficient, and that you don't have hyphens of your own.
If you only need one line, use -A1
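For reference, a shortened reproduction of this pipeline (assumes GNU grep, which prints a -- group separator between the per-file match groups):

```shell
cd "$(mktemp -d)"
printf 'aaa\n^^\nzzzzz\n' > 1.txt
printf 'bbb\n^^\nyyyyy\n' > 2.txt

# grep emits "1.txt:^^", "1.txt-zzzzz", "--", ...; sed drops the ^^ and --
# lines and turns the first hyphen of each remaining line into a newline
out=$(grep -A100 '^^' *.txt | sed '/\^^/d;/--/d;s/-/\n/')
printf '%s\n' "$out"
```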

This might work for you (GNU sed):
sed -s '1,/^^^/{/^^^/F;d}' file1 file2 file3 ... >fileOut

Related

How to delete the top and last non-empty lines of a file

I want to delete the top and last non-empty line of the file.
Example:
cat test.txt
//blank_line
abc
def
xyz
//blank_line
qwe
mnp
//blank_line
Then output should be:
def
xyz
//blank_line
qwe
I have tried with commands
sed "$(awk '/./{line=NR} END{print line}' test.txt)d" test.txt
to remove the last non-empty line. Here there are two commands, (1) sed and (2) awk, but I want to do it with a single command.
Reading the whole file in memory at once with GNU sed for -E and -z:
$ sed -Ez 's/^\s*\S+\n//; s/\n\s*\S+\s*$/\n/' test.txt
def
xyz
qwe
or with GNU awk for multi-char RS:
$ awk -v RS='^$' '{gsub(/^\s*\S+\n|\n\S+\s*$/,"")} 1' test.txt
def
xyz
qwe
Both GNU tools accept \s and \S as shorthand for [[:space:]] and [^[:space:]] respectively and GNU sed accepts the non-POSIX-sed-standard \n as meaning newline.
This is a double pass method:
awk '(NR==FNR) { if(NF) {t=FNR;if(!h) h=FNR}; next}
(h<FNR && FNR<t)' file file
The integers h and t keep track of the head and the tail. In this case, empty lines can also contain blanks. You could replace if(NF) by if(length($0)!=0) to be more strict.
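Here is the two-pass run against the question's test.txt, rebuilt in a scratch directory:

```shell
cd "$(mktemp -d)"
printf '\nabc\ndef\nxyz\n\nqwe\nmnp\n\n' > test.txt

# Pass 1 (NR==FNR): h = first non-empty line number, t = last non-empty one
# Pass 2: print only the lines strictly between h and t
out=$(awk '(NR==FNR) { if(NF) {t=FNR;if(!h) h=FNR}; next}
           (h<FNR && FNR<t)' test.txt test.txt)
printf '%s\n' "$out"
```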
This one reads everything into memory and does a simple replace at the end:
$ awk '{b=b RS $0}
END{ sub(/^[[:blank:]\n]*[^\n]+\n/,"",b);
sub(/\n[^\n]+[[:blank:]\n]*$/,"",b);
print b }' file
A single-pass, fast and relatively memory-efficient approach utilising a buffer:
awk 'f {
if(NF) {
printf "%s",buf
buf=""
}
buf=(buf $0 ORS)
next
}
NF {
f=1
}' file
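Run against the question's sample, the buffered one-pass version produces the same result:

```shell
cd "$(mktemp -d)"
printf '\nabc\ndef\nxyz\n\nqwe\nmnp\n\n' > test.txt

# The buffer is only flushed when another non-empty line arrives, so the
# last non-empty line (and anything after it) is never printed
out=$(awk 'f {
         if(NF) { printf "%s", buf; buf="" }
         buf = buf $0 ORS
         next
       }
       NF { f=1 }' test.txt)
printf '%s\n' "$out"
```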
Here is a golfed version of @kvantour's solution:
$ awk 'NR==(n=FNR){e=!NF?e:n;b=!b?e:b}b<n&&n<e' file{,}
This might work for you (GNU sed):
sed -E '0,/\S/d;H;$!d;x;s/.(.*)\n.*\S.*/\1/' file
Use a range to delete upto and including the first line containing a non-space character. Then copy the remains of the file into the hold space and at the end of file use substitution to remove the last line containing a non-space character and any empty lines to the end of the file.
Alternative:
sed '0,/\S/d' file | tac | sed '0,/\S/d'| tac
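And the pipeline variant, checked on the same sample (assumes GNU sed for the 0 address and \S, plus tac from GNU coreutils):

```shell
cd "$(mktemp -d)"
printf '\nabc\ndef\nxyz\n\nqwe\nmnp\n\n' > test.txt

# Strip through the first non-blank line, reverse, strip again, reverse back
out=$(sed '0,/\S/d' test.txt | tac | sed '0,/\S/d' | tac)
printf '%s\n' "$out"
```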

How can I print only lines that are immediately preceded by an empty line in a file using sed?

I have a text file with the following structure:
bla1
bla2
bla3
bla4
bla5
So you can see that some lines of text are preceded by an empty line.
I understand that sed has the concept of two buffers, a pattern space and a hold space, so I'm guessing these need to come into play here, but I'm unclear on how to use them to accomplish what I need.
In my contrived example above, I'd expect to see the following lines in the output:
bla3
bla5
sed is for doing s/old/new on individual lines, that is all. Any time you start talking about buffers or doing anything related to multi-lines comparisons you're using the wrong tool.
You could do this with awk:
$ awk -v RS= -F'\n' 'NR>1{print $1}' file
bla3
bla5
but it would fail to print the first non-empty line if the first line(s) in the file were empty. So this may be what you want, if you want lines consisting entirely of space characters to be treated as empty lines:
$ awk 'NF && !p{print} {p=NF}' file
bla3
bla5
and this otherwise:
$ awk '($0!="") && (p==""){print} {p=$0}' file
bla3
bla5
All of the above will work even if there are multiple empty lines preceding any given non-empty line.
To see the difference between the 3 approaches (which you won't see given the sample input in the question):
PS1> printf '\nfoo\n \nbar\n\netc\n' | cat -E
$
foo$
$
bar$
$
etc$
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk -v RS= -F'\n' 'NR>1{print $1}'
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk 'NF && !p{print} {p=NF}'
foo
bar
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk '($0!="") && (p==""){print} {p=$0}'
foo
etc
You can use the hold buffer easily to print the line before the blank like this:
sed -n -e '/^$/{x; p;}' -e h input
But I don't see an easy way to use it for your use case. For your case, instead of using the hold buffer, you could do:
sed -n -e '/^$/ba' -e d -e :a -e n -e p input
But I would do this with awk.
awk 'NR!=1{print $1}' RS= FS=\\n input-file
awk 'p;{p=/^$/}' file
The above command does this for each line:
if p is 1, print line;
if line is empty, set p to 1.
if lines consisting of one or more spaces are also considered empty:
awk 'p;{p=!NF}' file
to print non-empty lines each coming right after an empty line, you can use this:
awk 'p*!(p=/^$/)' file
if p is 1 and this line is not empty (1*!(0) = 1*1 = 1), print this line;
otherwise (1*!(1) = 1*0 = 0, 0*anything = 0), don't print anything.
Note that this one may not work with all awks; a portable version would look like:
awk 'p*(/./);{p=/^$/}' file
if lines consisting of one or more spaces are also considered empty:
awk 'p*NF;{p=!NF}' file
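A quick check of the golfed form with the question's data (run here with gawk; as noted above, the assignment-inside-pattern style may not be portable):

```shell
cd "$(mktemp -d)"
printf 'bla1\nbla2\n\nbla3\nbla4\n\nbla5\n' > file

# p records "previous line was empty"; the multiply-and-reassign prints only
# non-empty lines whose predecessor was empty
out=$(awk 'p*!(p=/^$/)' file)
printf '%s\n' "$out"
```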
If sed/awk is not mandatory, you can do it with grep:
grep -A 1 '^$' input.txt | grep -v -E '^$|--'
You can use sed to match a range of lines and do sub-matches inside the matches, like so:
# - use the "-n" option to omit printing of lines
# - match lines between a blank line (/^$/) and a non-blank one (/^./),
# then print only the line that contains at least a character,
# i.e., the non-blank line.
sed -ne '
/^$/,/^./ {
/^./{ p; }
}' input.txt
Tested with GNU sed, with your data in file 'a':
$ sed -nE '/^$/{N;s/\n(.+)/\1/p}' a
bla3
bla5
Add the -i option (before -n) to actually edit the file in place.

sed delete lines matching pattern between 2 files

Hey, I'm still a beginner with sed, and I'm trying to write a sed script that outputs only those lines of 1.txt that do not contain any /pattern/ listed in 2.txt. I have the following:
1.txt
demo#example.de:boo
demo2#example.com:foo
demo3#example.nl:foo
2.txt
#example.de
#example.com
The desired output would be
demo3#example.nl:foo
I've tried these commands, but they don't work:
$ grep -f 2.txt 1.txt
$ cat 2.txt | xargs -I {} sed -n "/{}/p" 1.txt
You can do this using the following awk command.
awk -F '[#:]' 'NR == FNR { blacklist[$2]; next } !($2 in blacklist)' 2.txt 1.txt
Explanation:
-F '[#:]' tells awk that fields in input lines are separated by a # or :. (demo#example.com:foo -> $1 = demo, $2 = example.com, $3 = foo)
NR == FNR <action> means do the following action only while processing the first file given as an argument to awk.
blacklist[$2] registers a key in array blacklist with the domain name in the current line.
next means skip to next line.
!($2 in blacklist) means print the current line if the domain name in it does not exist in array blacklist.
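End to end, with the question's two files (any POSIX awk should do):

```shell
cd "$(mktemp -d)"
printf 'demo#example.de:boo\ndemo2#example.com:foo\ndemo3#example.nl:foo\n' > 1.txt
printf '#example.de\n#example.com\n' > 2.txt

# With -F '[#:]', $2 is the domain on both kinds of line
out=$(awk -F '[#:]' 'NR == FNR { blacklist[$2]; next } !($2 in blacklist)' 2.txt 1.txt)
printf '%s\n' "$out"
```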
You can use -v option of grep, no need to use sed:
grep -vFf 2.txt 1.txt
demo3#example.nl:foo

Why does awk not filter the first column in the first line of my files?

I've got a file with following records:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt;2;CLI001
depots/import/HDN1YYAA_20102018.txt;32;CLI001
depots/import/HDN1YYAA_25102018.txt;1;CAB001
depots/import/HDN1YYAA_50102018.txt;1;CAB001
depots/import/HDN1YYAA_65102018.txt;1;CAB001
depots/import/HDN1YYAA_80102018.txt;2;CLI001
depots/import/HDN1YYAA_93102018.txt;2;CLI001
When I execute the following awk one-liner:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR==1){print $1}}END {}'
the output is not the expected:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
while I was supposed to get only the first column.
If I run it through all the records:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR>0){print $1}}END {}'
then it will start filtering only after the second line and I get the following output:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_25102018.txt
depots/import/HDN1YYAA_50102018.txt
depots/import/HDN1YYAA_65102018.txt
depots/import/HDN1YYAA_80102018.txt
depots/import/HDN1YYAA_93102018.txt
Does anybody know why awk is skipping only the first line?
I tried deleting the first record, but the behaviour is the same: it still skips the first line.
First, it should be
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}END {}' filename
In your version, FS=";" is a pattern that is evaluated for every record, but awk splits a record into fields before evaluating your code for it. So the first line is still split with the default whitespace separator (making $1 the whole line), and the new FS only takes effect from the second line on.
You can omit the END block if it is empty:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}' filename
You can use the -F command line argument to set the field delimiter:
awk -F';' '{if(NR==1){print $1}}' filename
Furthermore, awk programs consist of a sequence of CONDITION [{ACTIONS}] elements, so you can omit the if:
awk -F';' 'NR==1 {print $1}' filename
You need to specify the delimiter either in a BEGIN block or as a command-line option:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}'
awk -F ';' '{ if(NR==1){print $1}}'
cut might be better suited here, for all lines
$ cut -d';' -f1 file
to skip the first line
$ sed 1d file | cut -d';' -f1
to get the first line only
$ sed 1q file | cut -d';' -f1
however at this point it's better to switch to awk
If you have a large file and are only interested in the first line, it's better to exit early:
$ awk -F';' '{print $1; exit}' file
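For example, with a shortened two-line stand-in for the file, the early exit prints the first field of line 1 and stops reading:

```shell
cd "$(mktemp -d)"
printf 'depots/import/a.txt;1;CAB001\ndepots/import/b.txt;2;CLI001\n' > file

# Print the first ;-separated field of the first line, then stop
out=$(awk -F';' '{print $1; exit}' file)
printf '%s\n' "$out"
```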

Concatenating multiple files into a single line in linux

I have 3 fasta files like following
>file_1_head
haszhaskjkjkjkfaiezqbsga
>file_1_body
loizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdja
>file_1_tail
mnnbasnbdnztoaosdhgas
I would like to concatenate them into a single like following
>file_1
haszhaskjkjkjkfaiezqbsgaloizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdjamnnbasnbdnztoaosdhgas
I tried the cat command (cat file_1_head.fasta file_1_body.fasta file_1_tail.fasta), but it doesn't concatenate them into a single line like the above. Is it possible with awk? Kindly guide me.
Do you mean your three files have the content
file_1_head.fasta
>file_1_head
haszhaskjkjkjkfaiezqbsga
file_1_body.fasta
>file_1_body
loizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdja
and file_1_tail.fasta
>file_1_tail
mnnbasnbdnztoaosdhgas
including the name of each of them within them as the first line?
Then you could do
(echo ">file_1"; tail -qn -1 file_1_{head,body,tail}.fasta | tr -d "\n\t ") > file_1.fasta
to get file_1.fasta as
>file_1
haszhaskjkjkjkfaiezqbsgaloizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdjamnnbasnbdnztoaosdhgas
This also removes some extra whitespace at the end of the lines in your input that I got when I copied them verbatim.
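A reproducible miniature of this approach (sequence data invented for the example; tail -q is a GNU coreutils option):

```shell
cd "$(mktemp -d)"
printf '>file_1_head\nAAACCC\n' > file_1_head.fasta
printf '>file_1_body\nGGGTTT\n' > file_1_body.fasta
printf '>file_1_tail\nTTTAAA\n' > file_1_tail.fasta

# tail -q -n 1 takes the last (sequence) line of each file without headers;
# tr strips the newlines and stray whitespace so the sequences join up
(echo ">file_1"; tail -q -n 1 file_1_head.fasta file_1_body.fasta file_1_tail.fasta |
  tr -d '\n\t ') > file_1.fasta
cat file_1.fasta
```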
You can do this simply with
cat file1 file2 file3 | tr -d '\n' > new_file
tr deletes the newline character.
EDIT:
For your specific first line just do
echo ">file_1" > new_file
cat file1 file2 file3 | tr -d '\n' >> new_file
The first command creates the file with the single line >file_1 in it. Then the cat ... command appends to this file.
What about this?
awk 'BEGIN { RS=""} {for (i=1;i<=NF;i++) { printf "%s",$i } }' f1_head f1_body f1_tail