sed delete lines matching pattern between 2 files - awk

Hey, I'm still a beginner with sed and I'm trying to write a sed script that outputs only the lines of 1.txt that do not match any pattern from 2.txt. I have the following:
1.txt
demo#example.de:boo
demo2#example.com:foo
demo3#example.nl:foo
2.txt
#example.de
#example.com
The desired output would be
demo3#example.nl:foo
I've tried these commands, but they don't do what I want:
$ grep -f 2.txt 1.txt
$ cat 2.txt | xargs -I {} sed -n "/{}/p" 1.txt

You can do this using the following awk command:
awk -F '[#:]' 'NR == FNR { blacklist[$2]; next } !($2 in blacklist)' 2.txt 1.txt
Explanation:
-F '[#:]' tells awk that fields in input lines are separated by a # or :. (demo#example.com:foo -> $1 = demo, $2 = example.com, $3 = foo)
NR == FNR <action> means do the following action only while processing the first file given as an argument to awk.
blacklist[$2] registers a key in array blacklist with the domain name in the current line.
next means skip to the next line.
!($2 in blacklist) means print the current line if the domain name in it does not exist in array blacklist.
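Running this on the sample files above produces the desired output:
demo3#example.nl:foo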

You can use the -v (invert match) option of grep; there is no need for sed. The -F flag treats the patterns in 2.txt as fixed strings, so the dots do not act as regex wildcards:
grep -vFf 2.txt 1.txt
demo3#example.nl:foo

awk :: how to find matching words in two files

Some good folk here on StackOverflow helped me find common lines in two files using awk:
awk 'NR==FNR{a[tolower($0)]; next} tolower($0) in a' 1.txt 2.txt
But how do I find common words in two files when there can be several words per line?
For example, let's say that I have 1.txt with these words:
apple
orange
butter
flower
And then 2.txt with these words:
dog cat Butter tower
How do I return butter or Butter?
I just want to find the common words.
This grep should do the job:
grep -oiwFf 1.txt 2.txt
Butter
Alternatively, this simple GNU awk command would also work:
awk -v RS='[[:space:]]+' 'NR==FNR {w[tolower($1)]; next} tolower($1) in w' 1.txt 2.txt
Butter
Given:
$ cat file1
apple
orange
butter
flower
$ cat file2
dog cat Butter tower
I would write it this way:
awk 'FNR==NR{for(i=1;i<=NF;i++) words[tolower($i)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
Note that there is a field-by-field loop in the FNR==NR case to handle files that may have more than one word per line. If you know that is not the case, you can simplify to:
awk 'FNR==NR{words[tolower($1)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
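Either way, given the files above, the common word is printed with the case it has in file2:
Butter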
If this is not working on Windows, it may be an issue with \r\n line endings. If awk is using the default RS of \n, then the \r is left at the end of the last word on each line; butter\r does not match butter.
Try:
awk -v RS='[ \r\n\t]' 'FNR==NR{words[tolower($0)]; next}
tolower($0) in words' file1 file2
Regarding your WSL comments in the link: there are many workarounds for DOS-format files on Unix.
Create file1 with DOS line endings this way:
$ printf 'apple\r\norange\r\nbutter\r\nflower\r\n' >file1
Now you can test / see the file has those line endings with cat -v:
$ cat -v file1
apple^M
orange^M
butter^M
flower^M
You can also remove those line endings with sed, perl, awk, etc. Here is an awk command removing the \r from the file:
$ cat -v <(awk 1 RS='\r\n' ORS='\n' file1)
apple
orange
butter
flower
And with sed or perl:
$ cat -v <(sed 's/\r$//' file1)
#same
or
$ cat -v <(perl -0777 -lpe 's/\r\n/\n/g' file1)
etc. Then use the same construct with awk on Windows:
awk 'your_awk_program' <(awk 1 RS='\r\n' ORS='\n' file1) <(awk 1 RS='\r\n' ORS='\n' file2)
The downside: while each input is still treated as a separate logical file, so the FNR==NR awk test still works, the awk special variable FILENAME is lost in the process (it becomes the /dev/fd path of the process substitution). If you want to keep FILENAME associated with the actual file, you need to preprocess the files before feeding them to awk, or deal with the \r inside your awk script.
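For instance, a minimal sketch of that last option, stripping the \r at the top of the script so FILENAME stays intact (assuming one word per line in file1, as in the example):
awk '{ sub(/\r$/, "") }                      # strip a trailing CR from every record
     FNR==NR { words[tolower($1)]; next }    # first file: collect words
     { for (i=1; i<=NF; i++)                 # second file: print fields seen in file1
         if (tolower($i) in words) print $i }
' file1 file2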
You need to loop over every field per line (of 2.txt) and check:
awk 'NR==FNR{a[tolower($0)];next}{for(i=1;i<=NF;i++){if(tolower($i) in a){print $i}}}' \
1.txt 2.txt
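With the example files, the output is:
Butter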
An alternative way to do this in awk would be to add whitespace to the input record separator when processing the 2nd file:
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' 1.txt RS="[\n ]" 2.txt
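Note that the RS assignment sits between the file arguments, so it only takes effect when awk starts reading 2.txt; 1.txt is still read line by line. With the example files this also prints:
Butter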

Extract lines after a pattern

I have 50 files in a folder, all containing the pattern "^^". I want to print everything after "^^", preceded by the filename, and collect all the extracted lines in one output file. My code works fine with a single file but not across all the files.
awk '/\^^/{getline; getline; print FILENAME; print}' *.txt > output
Example
1.txt
ghghh hghg
ghfg hghg hjg
jhhkjh
kjhkjh kjh
^^
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
hghjhg hgj
jhgj
jhgjh kjgh
jhg
^^
bbbbbbbbbbbbbbbbbbbbbbb
Desired output.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
My actual output
1.txt
ghghh hghg
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzz
To print the line after ^^, try:
$ awk 'f{print FILENAME ORS $0; f=0} /\^\^/{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
How it works:
f{print FILENAME ORS $0; f=0}
If variable f is true (nonzero), print the filename, the output record separator (a newline by default), and the current line; then set f back to zero.
/\^\^/{f=1}
If the current line contains ^^, set f to one.
Another take, printing the filename at the start of every file and using exact matching for the marker:
$ awk 'FNR==1{print FILENAME; f=0} f; $1=="^^"{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
I like a more "bash(ish)" approach.
grep -Hn '^^' *.txt |
cut -d: -f1,2 --output-delimiter=' ' |
while read f n; do echo "$f"; tail -n +$((n+1)) "$f"; done
grep -Hn prints the file name and the line number of your pattern.
With cut we keep only the two fields we need.
In the loop we read the two pieces of information into variables, so we can use them freely.
tail can read not only the last N lines, but also everything from line N onward if you use the plus sign.
We can do arithmetic inside $((...)) to skip over the pattern line itself.
This solves your issue, and it can print all lines after the pattern, not only the next one.
use awk:
awk 'FNR==1{print FILENAME} FNR==1,/\^\^/{next}1' *.txt
Where:
print FILENAME when FNR == 1
FNR==1,/\^\^/{next}: all lines between FNR==1 and the first line matching ^^ will be skipped
1 at the end to print the rest of the lines after the matched ^^ line
The following produces output only for files that match our pattern:
awk 'FNR==1 { f=0 }; f; /\^\^/ { f=1; print FILENAME }' *.txt > output
Reset flag f on every new file.
Print if f is set.
Set f and print FILENAME if we match our pattern.
This one prints out the FILENAME regardless of matching pattern:
awk 'FNR==1 { f=0; print FILENAME }; f; /\^\^/ { f=1 }' *.txt > output
We can adjust the pattern matching in step 3 in accordance with whatever is required; exact matching, for instance, can be done with $0=="^^", as shown below.
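For example, the first command with exact matching becomes:
awk 'FNR==1 { f=0 }; f; $0=="^^" { f=1; print FILENAME }' *.txt > output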
Assuming your files are named 1.txt through 50.txt:
for f in {1..50}.txt
do
    sed -nE "/^\^\^\s*$/{N;s/.+\n(.+)/$f\n\1/p}" "$f" > "$f.result.txt"
done
Stealing from some answers and comments to your previous question on this topic, you can also use grep -A and format the output with sed.
$ grep -A100 '^^' *.txt | sed '/\^^/d;/--/d;s/-/\n/'
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
Assuming 100 lines is sufficient, and that you don't have hyphens of your own.
If you only need one line, use -A1
This might work for you (GNU sed):
sed -s '1,/^^^/{/^^^/F;d}' file1 file2 file3 ... >fileOut
Here -s treats each file separately: everything from the first line up to the first line matching ^^ is deleted (d), and on the matching line itself F prints the name of the current input file.
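With the two example files, this produces:
$ sed -s '1,/^^^/{/^^^/F;d}' 1.txt 2.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb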

How can I print only lines that are immediately preceded by an empty line in a file using sed?

I have a text file with the following structure:
bla1
bla2

bla3
bla4

bla5
So you can see that some lines of text are preceded by an empty line.
I understand that sed has the concept of two buffers, a pattern space buffer and a hold space buffer, so I'm guessing these need to come in to play here, but I'm unclear how to specify them to accomplish what I need.
In my contrived example above, I'd expect to see the following lines outputted:
bla3
bla5
sed is for doing s/old/new on individual lines, that is all. Any time you start talking about buffers or doing anything related to multi-line comparisons, you're using the wrong tool.
You could do this with awk:
$ awk -v RS= -F'\n' 'NR>1{print $1}' file
bla3
bla5
but it would fail to print the first non-empty line if the first line(s) in the file were empty. So this may be what you want if lines consisting entirely of space characters should be treated as empty:
$ awk 'NF && !p{print} {p=NF}' file
bla3
bla5
and this otherwise:
$ awk '($0!="") && (p==""){print} {p=$0}' file
bla3
bla5
All of the above will work even if there are multiple empty lines preceding any given non-empty line.
To see the difference between the 3 approaches (which you won't see given the sample input in the question):
PS1> printf '\nfoo\n \nbar\n\netc\n' | cat -E
$
foo$
$
bar$
$
etc$
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk -v RS= -F'\n' 'NR>1{print $1}'
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk 'NF && !p{print} {p=NF}'
foo
bar
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk '($0!="") && (p==""){print} {p=$0}'
foo
etc
You can use the hold buffer easily to print the line before the blank like this:
sed -n -e '/^$/{x; p;}' -e h input
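With the example file, that prints the lines before the blanks:
bla2
bla4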
But I don't see an easy way to use it for your use case. For your case, instead of using the hold buffer, you could do:
sed -n -e '/^$/ba' -e d -e :a -e n -e p input
(On a blank line, branch to label a; any other line is deleted. At a, n loads the next line into the pattern space and p prints it.)
But I would do this with awk.
The paragraph-mode awk again, with RS and FS passed as trailing assignments:
awk 'NR!=1{print $1}' RS= FS=\\n input-file
awk 'p;{p=/^$/}' file
The above command does the following for each line:
if p is 1, print the line;
set p to 1 if the line is empty, 0 otherwise.
if lines consisting of one or more spaces are also considered empty:
awk 'p;{p=!NF}' file
to print non-empty lines each coming right after an empty line, you can use this:
awk 'p*!(p=/^$/)' file
if p is 1 and this line is not empty (1*!(0) = 1*1 = 1), print this line;
otherwise (1*!(1) = 1*0 = 0, 0*anything = 0), don't print anything.
Note that this one may not work with all awks; a portable version of it would look like:
awk 'p*(/./);{p=/^$/}' file
if lines consisting of one or more spaces are also considered empty:
awk 'p*NF;{p=!NF}' file
If sed/awk is not mandatory, you can do it with grep:
grep -A 1 '^$' input.txt | grep -v -E '^$|--'
The second grep drops the blank lines themselves and the -- group separators that grep -A inserts.
You can use sed to match a range of lines and do sub-matches inside the matches, like so:
# - use the "-n" option to omit printing of lines
# - match lines between a blank line (/^$/) and a non-blank one (/^./),
# then print only the line that contains at least a character,
# i.e, the non-blank line.
sed -ne '
/^$/,/^./ {
/^./{ p; }
}' input.txt
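With the example file this prints:
bla3
bla5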
Tested with GNU sed, with your data in file a:
$ sed -nE '/^$/{N;s/\n(.+)/\1/p}' a
bla3
bla5
Add the -i option (before -n) to edit the file for real.

Why does awk not filter the first column in the first line of my files?

I've got a file with following records:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt;2;CLI001
depots/import/HDN1YYAA_20102018.txt;32;CLI001
depots/import/HDN1YYAA_25102018.txt;1;CAB001
depots/import/HDN1YYAA_50102018.txt;1;CAB001
depots/import/HDN1YYAA_65102018.txt;1;CAB001
depots/import/HDN1YYAA_80102018.txt;2;CLI001
depots/import/HDN1YYAA_93102018.txt;2;CLI001
When I execute the following awk one-liner:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR==1){print $1}}END {}'
the output is not what I expected:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
whereas I was supposed to get only the first column.
If I run it through all the records:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR>0){print $1}}END {}'
then it starts splitting only from the second line on, and I get the following output:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_25102018.txt
depots/import/HDN1YYAA_50102018.txt
depots/import/HDN1YYAA_65102018.txt
depots/import/HDN1YYAA_80102018.txt
depots/import/HDN1YYAA_93102018.txt
Does anybody know why awk skips the first line only?
I tried deleting the first record, but the behaviour is the same: it still skips the first line.
First, it should be
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}END {}' filename
In your version, FS=";" is an expression used as a pattern; it is evaluated after each record has already been read and split, so the new separator only applies from the second record on. That is why only the first line comes out unsplit.
You can omit the END block if it is empty:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}' filename
You can use the -F command line argument to set the field delimiter:
awk -F';' '{if(NR==1){print $1}}' filename
Furthermore, awk programs consist of a sequence of CONDITION [{ACTIONS}] elements, so you can omit the if:
awk -F';' 'NR==1 {print $1}' filename
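With the sample file, this prints just the first field of the first record:
$ awk -F';' 'NR==1 {print $1}' lignes_en_erreur.txt
depots/import/HDN1YYAA_15102018.txt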
You need to specify the delimiter either in a BEGIN block or as a command-line option:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}'
awk -F ';' '{ if(NR==1){print $1}}'
cut might be better suited here. For all lines:
$ cut -d';' -f1 file
to skip the first line
$ sed 1d file | cut -d';' -f1
to get the first line only
$ sed 1q file | cut -d';' -f1
However, at this point it's better to switch to awk. If you have a large file and are only interested in the first line, it's best to exit early:
$ awk -F';' '{print $1; exit}' file

awk field separator not working for first line

echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'FS="_length" {print $1}'
Obtained output:
NODE_1_length_317516_cov_18.568_ID_4005
Expected output:
NODE_1
How is that possible? I must be missing something.
When Awk goes through lines, the field separator is interpreted before the record is processed: Awk reads the record according to the current values of FS and RS and then performs the operations you ask for.
This means that if you set the value of FS while reading a record, it won't take effect for that specific record. Instead, FS takes effect when reading the next one, and so on.
So if you have a file like this:
$ cat file
1,2 3,4
5,6 7,8
And you set the field separator while reading one record, it takes effect from the next line:
$ awk '{FS=","} {print $1}' file
1,2 # FS is still the space!
5
So what you want to do is to set the FS before starting to read the file. That is, set it in the BEGIN block or via parameter:
$ awk 'BEGIN{FS=","} {print $1}' file
1,2 # now, FS is the comma
5
$ awk -F, '{print $1}' file
1
5
There is also another way: make Awk recompute the full record with {$0=$0}. With this, Awk will take into account the current FS and act accordingly:
$ awk '{FS=","} {$0=$0;print $1}' file
1
5
The awk statement is used incorrectly. The correct way is:
awk 'BEGIN { FS = "#{delimiter}" } ; { print $1 }'
In your case you can use
awk 'BEGIN { FS = "_length" } ; { print $1 }'
Built-in variables like FS, ORS, etc. must be set within a context, i.e. in one of the following blocks: BEGIN, a condition block, or END.
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'BEGIN{FS="_length"} {print $1}'
NODE_1
$
You can also pass the delimiter using -F switch like this:
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk -F "_length" '{print $1}'
NODE_1
$