How do I extract specific lines based on a comparison of two files with sed and/or awk? - awk

I need to extract all the lines from file2.txt that do not match the string until the first dot in any line in file1.txt. I am interested in a solution that stays as close to my current approach as possible so it is easy for me to understand, and uses only sed and/or awk in linux bash.
file1.h
apple.sweet banana
apple.tasty banana
apple.brown banana
orange_mvp.rainy day.here
orange_mvp.ear nose.png
lemon_mvp.ear ring
tarte_mvp_rainy day.here
file2.h
orange_mvp
lemon_mvp
lemon_mvp
tarte_mvp
cake_mvp
result desired
tarte_mvp
cake_mvp
current wrong approach
$ awk '
NR==FNR { sub(/mvp(\..*)$/,""); a[$0]; next }
{ f=$0; sub(/mvp(\..*)$/,"", f) }
!(f in a)
' file2.h file1.h
apple.sweet banana
apple.tasty banana
apple.brown banana
orange_mvp.rainy day.here
orange_mvp.ear nose.png
lemon_mvp.ear ring
tarte_mvp_rainy day.here

Using awk
$ awk -F. 'NR==FNR {a[$1]=$1;next} a[$1] != $0' file1.h file2.h
tarte_mvp
cake_mvp

The answer by @HatLess is very nice and idiomatic. If you find it a bit cryptic, you can also consider this one, saved as program.awk:
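BEGIN {
    # Read file1.txt up front; "> 0" ends the loop on EOF and on read errors.
    while ((getline prefix < "file1.txt") > 0) {
        sub(/[.].*/, "", prefix)    # keep only the text before the first dot
        ignore[prefix]              # store the prefix as an array key
    }
}
!($0 in ignore) {
    print($0)
}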
Invoked with awk -f program.awk file2.txt.
In the BEGIN block we read all the lines from file1.txt and store the prefixes we want to ignore as keys of a hash table (an awk associative array).
Then we process file2.txt and print all the lines that are selected (not ignored).
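The same idea can also be written with the common two-file idiom, where NR==FNR is true only while the first file is being read. This is a minimal sketch, not the answerer's own code; the temporary directory and the shortened sample data are assumptions for illustration.

```shell
# Recreate a small sample in a temporary directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 'apple.sweet banana' 'orange_mvp.rainy day.here' \
    'lemon_mvp.ear ring' 'tarte_mvp_rainy day.here' > "$tmp/file1.txt"
printf '%s\n' orange_mvp lemon_mvp lemon_mvp tarte_mvp cake_mvp > "$tmp/file2.txt"

result=$(awk '
    NR == FNR {             # first file: remember the text before the first dot
        sub(/[.].*/, "")
        ignore[$0]
        next
    }
    !($0 in ignore)         # second file: print lines that are not a stored prefix
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat of this idiom: if the first file is empty, NR==FNR is also true for the second file, so nothing is printed.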

Related

awk remove endings of words ending with patterns

I have a large dataset and am trying to lemmatize a column ($14) with awk; that is, I need to remove 'ing', 'ed', or 's' from any word that ends with one of those patterns. So asked, asks, and asking would all become 'ask'.
Let's say I have this dataset (the column I want to modify is $2):
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
With that, expected output is:
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
I have tried the following regex with awk, but it didn't work.
awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt
#this replaces some of the words with ing and ed, not all, words ending with s stays the same (which I dont want)
Please help, I'm new to awk and still exploring it.
Using GNU awk for gensub() and \> for word boundaries:
$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
Using any awk with gsub you could do:
awk -F'\t' -v OFS="\t" '
{ gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
}1
' file
Example Input File
$ cat file
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
four Just a normal sentence.
Example Use/Output
$ awk -F'\t' -v OFS="\t" '
> { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
> match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
> }1
> ' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
four Just a normal sentence.
(note: last line added as example of a sentence unchanged)
If you use GNU awk you were not far from it; the alternation needs plain (unescaped) parentheses so that \> applies to all three suffixes:
$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
Note -v OFS='\t' to use tabs also as output field separators.
But if your awk uses the kind of obsolete regular expressions without word boundaries (like the default awk that comes with macOS) things are more complicated. One option is to iteratively use match and substr. Example:
# foo.awk
BEGIN {
    n = split(prefix, word, /,/)
    for(i = 1; i <= n; i++) {
        len[i] = length(word[i])
    }
}
{
    for(i = 1; i <= n; i++) {
        re = word[i] "[^[:alnum:]]"
        while(m = match($2, re)) {
            if(m == 1) {
                $2 = substr($2, len[i]+1, length($2))
            } else {
                $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
            }
        }
    }
    print
}
And then:
$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
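Another portable option is to handle the column word by word. The following is only a sketch under stated assumptions: words in $2 are separated by spaces, trailing punctuation should be kept, and at most one suffix is stripped per word. The names w, punct and out are made up for this example.

```shell
tmp=$(mktemp)
printf 'onething\tThis is a string that is tested multiple times.\n' > "$tmp"

result=$(awk 'BEGIN { FS = OFS = "\t" }
{
    n = split($2, w, " ")
    out = ""
    for (i = 1; i <= n; i++) {
        punct = ""
        if (match(w[i], /[^a-zA-Z0-9]+$/)) {    # set aside trailing punctuation
            punct = substr(w[i], RSTART)
            w[i] = substr(w[i], 1, RSTART - 1)
        }
        sub(/(ing|ed|s)$/, "", w[i])            # strip one suffix at the end of the word
        out = out (i > 1 ? " " : "") w[i] punct
    }
    $2 = out
    print
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the suffix is anchored with $ on each word, no word-boundary operator is needed, so this should run on awks that lack \> and \b.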

Find line in file and replace it with line from another file

My goal is to find a string within a file (file1) and replace its whole line with the content of a specific line (in this example line 3) from another file (file2). As I understand, I need to use RegEx to do the first part and then use a second sed command to store the contents of file2. sed is definitely not my strong suit, so I hope someone here can help a rookie out!
So far I have:
sed -i '/^matching.string.here*/s' <(sed '3!d' file2) file1
Edit
Example file1:
string one
string two
matching.string.here.
string three
Example file2:
alt string one
alt string two
alt string three
Expected Result in file1:
string one
string two
alt string three
string three
Your sed attempt contains several unexplained errors; it's actually hard to see what you are in fact trying to do.
You probably want to do something along the lines of
sed '3!d;s%.*%s/^matching\.string\.here.*/&/%' file2 |
sed -f - -i file1
It's unclear what you hope for the /s to mean; does your sed have a flag with this name?
This creates a sed script from the third line of file2; take out the pipeline to sed -f - to see what the generated script looks like. (If your sed does not allow you to pass in a script on standard input, you will have to write it to a temporary file, and then pass that to the second sed.)
Anyway, this is probably both simpler and more robust using Awk.
awk 'NR==FNR { if (FNR==3) keep=$0; next }
/^matching\.string\.here/ { $0 = keep } 1' file2 file1
This writes the new content to standard output. If you have GNU Awk, you can explore its -i inplace option; otherwise, you will need to write the result to a file, then move it back to file1.
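The write-to-a-file-then-move step can be sketched as follows. The temporary directory and the .new suffix are assumptions for illustration; the sample files are the ones from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'string one' 'string two' 'matching.string.here.' 'string three' > "$tmp/file1"
printf '%s\n' 'alt string one' 'alt string two' 'alt string three' > "$tmp/file2"

# Read file2 first and keep only its third line, then rewrite file1.
# The mv runs only if awk exits successfully, so a failed run leaves file1 intact.
awk 'NR == FNR { if (FNR == 3) keep = $0; next }
     /^matching\.string\.here/ { $0 = keep }
     1' "$tmp/file2" "$tmp/file1" > "$tmp/file1.new" &&
    mv "$tmp/file1.new" "$tmp/file1"

result=$(cat "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```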
This might work for you (GNU sed):
sed -n '3s#.*#sed "/matching\\.string\\.here\\./c&" file1#ep' file2
Focus on line 3 of file2.
Manufacture a sed script which changes a matching line in file1 to the contents of the line in focus and print the result.
N.B. The periods in the match must be escaped twice so as not to match an arbitrary character.
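This is a tailor-made job for awk, and as a bonus you can avoid regexes entirely by matching with index(). Note that this replaces the matching line with the line at the same line number in file2, which is line 3 in this example: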
awk -v s='matching.string.here' 'FNR == NR {a[FNR] = $0; next} index($0, s) {$0 = a[FNR]} 1' file2 file1
string one
string two
alt string three
string three
A more readable version:
awk -v s='matching.string.here' '
FNR == NR {
a[FNR] = $0
next
}
index($0, s) {
$0 = a[FNR]
} 1' file2 file1
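The script above takes the replacement from a[FNR], that is, from the line of file2 with the same number as the matching line in file1. If the source line should be fixed regardless of where the match occurs, a variant along these lines may help. It is a sketch only; the variable n and the shortened file1 are assumptions for illustration.

```shell
tmp=$(mktemp -d)
# Here the match sits on line 2 of file1, but line 3 of file2 is still wanted.
printf '%s\n' 'string one' 'matching.string.here.' 'string three' > "$tmp/file1"
printf '%s\n' 'alt string one' 'alt string two' 'alt string three' > "$tmp/file2"

result=$(awk -v s='matching.string.here' -v n=3 '
    FNR == NR {                 # first file (file2): remember only line n
        if (FNR == n) keep = $0
        next
    }
    index($0, s) { $0 = keep }  # literal substring match, no regex involved
    1                           # print every line of file1
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```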

Replace a letter with another from the last word from the last two lines of a text file

How could I possibly replace a character with another, selecting the last word from the last two lines of a text file in shell, using only a single command? In my case, replacing every occurrence of a with E from the last word only.
Like, from a text file containing this:
tree;apple;another
mango.banana.half
monkey.shelf.karma
to this:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
I tried using sed -n 'tail -2 'mytext.txt' -r 's/[a]+/E/*$//' but it doesn't work (my error: sed expression #1, char 10: unknown option to 's).
Could you please try following, tac + awk solution. Completely based on OP's samples only.
tac Input_file |
awk 'FNR<=2{if(/;/){FS=OFS=";"};if(/\./){FS=OFS="."};$0=$0;gsub(/a/,"E",$NF)} 1' |
tac
Output with shown samples is:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
NOTE: Change gsub to sub in case you want to substitute only very first occurrence of character a in last field.
This might work for you (GNU sed):
sed -E 'N;${:a;s/a([^a.]*)$/E\1/mg;ta};P;D' file
Open a two line window throughout the length of the file by using the N to append the next line to the previous and the P and D commands to print then delete the first of these. Thus at the end of the file, signified by the $ address the last two lines will be present in the pattern space.
Using the m multiline flag on the substitution command, as well as the g global flag and a loop between :a and ta, replace any a in the last word (delimited by .) by an E.
Thus the first pass of the substitution command will replace the a in half and the last a in karma. The next pass will match nothing in the penultimate line and replace the a in karmE. The third pass will match nothing, so the ta command will not branch and the last two lines will be printed with the required changes.
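The two-line window can be seen in isolation with a minimal sketch that does nothing but print the last two lines. It uses only the standard N and D commands, so no GNU extensions are assumed; the temporary file is an assumption for illustration.

```shell
tmp=$(mktemp)
printf '%s\n' 'tree;apple;another' 'mango.banana.half' 'monkey.shelf.karma' > "$tmp"

# $!N  append the next line to the pattern space, except on the last line
# $!D  unless the last line has been read, drop the oldest line and restart
#      the cycle without reading new input
# At end of input the pattern space holds the final two lines, which sed prints.
result=$(sed '$!N;$!D' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```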
If you want to use Sed, here's a solution:
tac input_file | sed -E '1,2{h;s/.*[^a-zA-Z]([a-zA-Z]+)/\1/;s/a/E/;x;s/(.*[^a-zA-Z]).*/\1/;G;s/\n//}' | tac
One tiny detail: your question says you want to replace a letter, but your example turns karma into kErmE, which replaces every a. If you meant kErma, the command above works as is; if you meant kErmE, change it slightly: s/a/E/ should become s/a/E/g.
With tac+perl
$ tac ip.txt | perl -pe 's/\w+\W*$/$&=~tr|a|E|r/e if $.<=2' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
\w+\W*$ match last word in the line, \W* allows any possible trailing non-word characters to be matched as well. Change \w and \W accordingly if numbers and underscores shouldn't be considered as word characters - for ex: [a-zA-Z]+[^a-zA-Z]*$
$&=~tr|a|E|r change all a to E only for the matched portion
e flag to enable use of Perl code in replacement section instead of string
To do it in one command, you can slurp the entire input as single string (assuming this'll fit available memory):
perl -0777 -pe 's/\w+\W*$(?=(\n.*)?\n\z)/$&=~tr|a|E|r/gme'
Using GNU awk for the 4th arg to split(), since according to the comments on another solution the field delimiter is any sequence of non-alphanumeric characters:
$ gawk '
BEGIN {
pc=2 # previous counter, ie how many are affected
}
{
for(i=pc;i>=1;i--) # buffer to p hash, a FIFO
if(i==pc && (i in p)) # when full, output
print p[i]
else if(i in p) # and keep filling
p[i+1]=p[i] # above could be done using mod also
p[1]=$0
}
END {
for(i=pc;i>=1;i--) {
n=split(p[i],t,/[^a-zA-Z0-9\r]+/,seps) # split on non alnum
gsub(/a/,"E",t[n]) # replace
for(j=1;j<=n;j++) {
p[i]=(j==1?"":p[i] seps[j-1]) t[j] # pack it up
}
print p[i] # output
}
}' file
Output:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
Would this help you ? on GNU awk
$ cat file
tree;apple;another
mango.banana.half
monkey.shelf.karma
$ tac file | awk 'NR<=2{s=gensub(/(.*)([.;])(.*)$/,"\\3",1);gsub(/a/,"E",s); print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;next}1' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
Better Readable version :
$ tac file | awk 'NR<=2{
s=gensub(/(.*)([.;])(.*)$/,"\\3",1);
gsub(/a/,"E",s);
print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;
next
}1' | tac
With GNU awk you can set FS with the two separators, then gsub for the replacement in $3, the third field, if NR>1
awk -v FS=";|[.]" 'NR>1 {gsub("a", "E",$3)}1' OFS="." file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
With GNU awk for the 3rd arg to match() and gensub():
$ awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
    # Walk the ring buffer starting at the oldest buffered line.
    for (i=NR+1; i<=NR+n; i++) {
        match(p[i%n],/(.*[^[:alnum:]])(.*)/,a)
        print a[1] gensub(/a/,"E","g",a[2])
    }
}
' file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
or with any awk:
awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
    # Walk the ring buffer starting at the oldest buffered line.
    for (i=NR+1; i<=NR+n; i++) {
        line = p[i%n]
        match(line,/.*[^[:alnum:]]/)
        lastWord = substr(line,1+RLENGTH)
        gsub(/a/,"E",lastWord)
        print substr(line,1,RLENGTH) lastWord
    }
}
' file
If you want to do it for the last 50 lines of a file instead of the last 2 lines just change -v n=2 to -v n=50.
The above assumes there are at least n lines in your input.
You can let sed repeat changing an a into E only for the last word with a label.
tac mytext.txt| sed -r ':a; 1,2s/a(\w*)$/E\1/; ta' | tac

Using file redirects to input a variable search pattern to awk

I'm attempting to write a small script in bash. The script's purpose is to pull out a search pattern from file1.txt and to print the line number of the matching search from file2.txt. I know the exact place of the pattern that I want in file1.txt, and I can pull that out quite easily with sed and awk e.g.
sed -n 3p file1.txt | awk '{print $4}'
The part that I'm having trouble with is passing that information again to awk to use as a search pattern in file2.txt. Something along the lines of:
awk '/search_pattern/{print NR}' file2.txt
I was able to get this code working in two lines of code by storing the output of the first line as a variable, and passing that variable to awk in the second line,
myVariable=`sed -n 3p file1.txt | awk '{print $4}'`
awk '/'"$myVariable"'/{print NR}' file2.txt
but this seems "inelegant". I was hoping there was a way to do this in one line of code using file redirects (or something similar?). Any help is greatly appreciated!
You can avoid sed | awk with
awk 'NR==3{print $4; exit 0}' file1.txt
You can do your search with:
search=$(awk 'NR==3{print $4; exit 0}' file1.txt)
awk -v search="$search" '$0 ~ search { print NR }' file2.txt
You could even write that all on one line, but I don't recommend that; clarity is more important than brevity.
In principle, you could use:
awk 'NR==3{search = $4; next} FNR!=NR && $0 ~ search {print FNR}' file1.txt file2.txt
This scans file1.txt and finds the search pattern; then it scans file2.txt and finds the lines that match. One line — even moderately clear. There'll be lots of matches if there isn't a column 4 on line 3 of file1.txt.
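To avoid the flood of matches when the pattern turns out to be empty, the second rule can be guarded. This is a sketch under assumptions: the sample files are invented, and an empty pattern is taken to mean "print nothing".

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a b c d' 'e f g h' 'one two three needle' > "$tmp/file1.txt"
printf '%s\n' 'hay' 'a needle here' 'hay' 'needle' > "$tmp/file2.txt"

result=$(awk '
    NR == FNR { if (FNR == 3) search = $4; next }   # first file: field 4 of line 3
    search != "" && $0 ~ search { print FNR }       # second file: matching line numbers
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

FNR is printed rather than NR so that the numbers refer to lines of file2.txt alone.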

Modifying a number value in text

I have a text coming in as
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
I wonder how to change it as
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
using Awk. First problem is multiple delimiter. Second is, how to get the C-Value and Decrement by 2.
awk solution:
$ awk -F"." '{$2=substr($2,1,1)""substr($2,2)-2;}1' OFS="." file
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
I was wondering whether an awk regexp would do the job, but apparently awk cannot capture patterns. This is why I suggest a Perl solution:
$ cat data.txt
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
$ perl -pe 's/C([0-9]+)/"C" . ($1-2)/ge;' data.txt
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
Admittedly, I probably would have done this using the substr() function like Guru has shown:
awk 'BEGIN { FS=OFS="." } { $2 = substr($2,1,1) substr($2,2) - 2 }1' file
I do also like Aif's answer using Perl, probably just a little more. Shorter is sweeter, isn't it? However, GNU awk can capture patterns. Here's how:
awk 'BEGIN { FS=OFS="." } match($2, /(C)(.*)/, a) { $2 = a[1] a[2] - 2}1' file
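Without GNU awk's array argument to match(), the same effect can be had from RSTART and RLENGTH, which match() sets in any POSIX awk. This is a minimal sketch; it assumes each line has at most one C followed by digits, and the variable name num is made up for the example.

```shell
tmp=$(mktemp)
printf '%s\n' 'A1:B2.C3.D4.E5' 'A2:B7.C10.D0.E9' 'A0:B1.C9.D4.E8' > "$tmp"

result=$(awk 'match($0, /C[0-9]+/) {
    num = substr($0, RSTART + 1, RLENGTH - 1)    # digits following the C
    # Rebuild the line: everything up to and including C, the new number, the rest.
    $0 = substr($0, 1, RSTART) (num - 2) substr($0, RSTART + RLENGTH)
} 1' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Since it works on the whole line, this version does not depend on the field separators or on C being the second field.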