How to search for words from one file in another file and display the matching words in each line - awk

I have an annoying problem. I have two files.
$ cat file1
Sam
Tom
$ cat file2
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
file1 is a word-list file, whereas file2 is a file containing a varying number of columns. I want to perform a search using file1 against file2 and, for each line of file2, display every matching word, ordered by where it first appears in the line. Thus the result needs to be the following:
Sam (line 1 match)
Tom (line 2 match)
Tom (line 3 match)
If file2 is the following,
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
I am Tom. Sam I am.
I am Sam. Tom I am.
I am Sammy.
It needs to display the following:
Sam (1st line match)
Tom (2nd line match)
Tom (3rd line match)
Tom (4th line match)
Sam (4th line match)
Sam (5th line match)
Tom (5th line match)
Sam (6th line match)
I think I need an awk solution since the command "grep -f file1 file2" won't work.

Seems like you want first match from each line:
$ cat f1
Sam
Tom
$ cat f2
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
I am Tom. Sam I am.
I am Sam. Tom I am.
$ grep -Fnof f1 f2 | sort -t: -u -k1,1n
1:Sam
2:Tom
3:Tom
4:Tom
5:Sam
-n displays the line number, which is later used to remove duplicates
-F matches the search terms literally, not as regexes
-o displays only the matching terms
Pipe the output to cut -d: --complement -f1 to remove the first column of line numbers.
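Putting the whole pipeline together (a sketch assuming GNU grep, sort, and cut; f1 and f2 are the sample files shown above):

```shell
# Recreate the sample files from the answer
printf '%s\n' 'Sam' 'Tom' > f1
printf '%s\n' 'I am Sam. Sam I am.' 'Tom' 'I am Tom. Tom I am.' \
              'I am Tom. Sam I am.' 'I am Sam. Tom I am.' > f2

# -F literal match, -n line numbers, -o matched text only;
# sort -u keeps one match per line number; cut drops the numbers
grep -Fnof f1 f2 | sort -t: -u -k1,1n | cut -d: --complement -f1
# Sam
# Tom
# Tom
# Tom
# Sam
```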

With GNU awk for sorted_in:
$ cat tst.awk
BEGIN { PROCINFO["sorted_in"] = "@val_num_asc" }
NR==FNR { res[$0]; next }
{
    delete found
    for ( re in res ) {
        if ( !(re in found) ) {
            if ( match($0,re) ) {
                found[re] = RSTART
            }
        }
    }
    for ( re in found ) {
        printf "%s (line #%d match)\n", re, FNR
    }
}
$ awk -f tst.awk file1 file2
Sam (line #1 match)
Tom (line #2 match)
Tom (line #3 match)
Tom (line #4 match)
Sam (line #4 match)
Sam (line #5 match)
Tom (line #5 match)
Sam (line #6 match)

Could you please try the following and let me know if it helps.
awk -F"[. ]" 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++){if($i in a){print $i;next}}}' Input_file1 Input_file2

Seems grep could be made to work
$ grep -nof f1 f2 | sort -u
1:Sam
2:Tom
3:Tom
4:Sam
4:Tom
5:Sam
5:Tom
6:Sam


Add file name as a new column with awk

First of all, existing questions didn't solve my problem; that's why I am asking again.
I have two txt files temp.txt
adam 12
george 15
thomas 20
and demo.txt
mark 8
richard 11
james 18
I want to combine them and add a 3rd column as their file names without extension, like this:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
I used this script:
for i in $(ls); do name=$(basename -s .txt $i)| awk '{OFS="\t";print $0, $name} ' $i; done
But it yields following table:
mark 8 mark 8
richard 11 richard 11
james 18 james 18
adam 12 adam 12
george 15 george 15
thomas 20 thomas 20
I don't understand why it gives the name variable as the whole table.
Thanks in advance.
Awk has no access to Bash's variables, and vice versa. Inside the Awk script, name is undefined (empty, which is treated as 0), so $name gets interpreted as $0.
Also, don't use ls in scripts, and quote your shell variables.
Finally, the assignment of name does not print anything, so piping its output to Awk makes no sense.
for i in ./*; do
name=$(basename -s .txt "$i")
awk -v name="$name" '{OFS="\t";print $0, $name}' "$i"
done
In fact, the basename calculation could easily be performed natively in Awk, but I leave that as an exercise. (Hint: sub(regex, "", FILENAME))
awk has a FILENAME variable whose value is the path of the file being processed, and a FNR variable whose value is the current line number in the file;
so, at FNR == 1 you can process FILENAME and store the result in a variable that you'll use afterwards:
awk -v OFS='\t' '
FNR == 1 {
basename = FILENAME
sub(".*/", "", basename) # strip from the start up to the last "/"
sub(/\.[^.]*$/, "", basename) # strip from the last "." up to the end
}
{ print $0, basename }
' ./path/temp.txt ./path/demo.txt
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
Using BASH:
for i in temp.txt demo.txt ; do while read -r a b ; do printf "%s\t%s\t%s\n" "$a" "$b" "${i%%.*}" ; done <"$i" ; done
Output:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
For each source file, read each line and use printf to output tab-delimited columns, including the current source file name without its extension via bash parameter expansion (${i%%.*}).
First, you need to unmask $name, which is inside the single quotes and so does not get replaced by the shell. After you do that, you need to add double quotes around $name so that awk sees it as a string (note this breaks if the file name contains spaces or quote characters):
for i in $(ls); do name=$(basename -s .txt $i); awk '{OFS="\t";print $0, "'$name'"} ' $i; done

How to extract multiple strings with single regex expression in Awk

I have the following strings:
Mike has XXX cats and XXXXX dogs.
MikehasXXXcatsandXXXXXdogs
I would like to replace each run of Xs with the number of Xs in that run:
I tried:
awk '{ match($0, /[X]+/);
a = length(substr($0, RSTART, RLENGTH));
gsub(/[X]+/, a) }1'
But it captures only the first match.
Expected output:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
With your shown samples, could you please try the following. Written and tested in GNU awk (should work in any awk).
awk '{for(i=1;i<=NF;i++){if($i~/^X+$/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
Sample output will be:
Mike has 3 cats and 5 dogs.
Explanation: go through all the (space-delimited) fields and check whether a field consists of Xs from start to end. If so, globally substitute each X with itself; gsub returns the number of substitutions, and that count is saved back into the field. The trailing 1 then prints the current line.
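The trick rests on gsub()'s return value, which is the number of substitutions performed; replacing each X with & (the matched text itself) changes nothing but yields the count. A minimal illustration:

```shell
# gsub() returns the substitution count; "&" is the matched text,
# so the record is unchanged and only the count comes back
echo 'XXXXX' | awk '{ print gsub(/X/, "&") }'
# 5
```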
NOTE: As per Ed sir's comment (under the question section), in case your fields may have values other than X too, then try this (it will even cover a value like XXX456 in any column):
awk '{for(i=1;i<=NF;i++){if($i~/X/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
EDIT: Since OP's samples changed, adding this solution here, written and tested with GNU awk.
awk -v RS='X+' '{ORS=(RT ? gsub(/./,"",RT) : "")} 1' Input_file
OR
awk -v RS='X+' '{ORS=(RT ? length(RT) : "")} 1' Input_file
Output will be as follows for above code:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
another awk
$ awk '{for(i=1;i<=NF;i++) if($i~/^X+$/) $i=length($i)}1' file
Mike has 3 cats and 5 dogs.
$ awk '{while( match($0,/X+/) ) $0=substr($0,1,RSTART-1) RLENGTH substr($0,RSTART+RLENGTH)} 1' file
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
If Perl is okay:
$ perl -pe 's/X+/length $&/ge' ip.txt
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
The e flag allows Perl code in replacement section. $& will have the matched portion.
Here's the cleanest awk-based solution I can think of:
{mawk/mawk2/gawk} 'BEGIN { FS = "^$" } /X/ {
    while (match($0, /[X]+/)) { sub(/[X]+/, RLENGTH) } } 1'
The downside of this is having to use the regex engine twice for every replacement; the upside is that it avoids a bunch of substr() ops.

awk: find out how many times columns two and three equal a specific word

Let's say I have a names.txt file with the following:
Bob Billy Billy
Bob Billy Joe
Bob Billy Billy
Joe Billy Billy
and using awk I want to find out how many times $2 = Billy while $3 = Billy. In this case my desired output would be 3 times.
Also, I'm testing this on a mac if that matters.
You can first test $2==$3, then test that one of those equals "Billy"; increment a counter and print the result at the end:
$ awk '$2==$3 && $2=="Billy"{cnt++} END{print cnt+0}' names.txt
3
Or, you could almost write just what you said:
$ awk '$2=="Billy" && $3=="Billy" {cnt++} END{print cnt+0}' names.txt
3
And if you want to use a variable so you don't need to type it several times:
$ awk -v name='Billy' '$2==name && $3==name {cnt++}
END{printf "Found \"%s\" %d times\n", name, cnt+0}' names.txt
Found "Billy" 3 times
Or, you could collect them all up and report what was found:
$ awk '{cnts[$2 "," $3]++}
END{for (e in cnts) print e ": " cnts[e]}' names.txt
Billy,Billy: 3
Billy,Joe: 1
You may also consider using grep to do that:
$ grep -c "\sBilly\sBilly" name.txt
3
-c: print a count of matching lines
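One caveat since the question mentions macOS: \s is a GNU/PCRE extension, and support in the stock BSD grep can vary. A sketch of the same idea with the POSIX character class, which works in both:

```shell
# Sample data from the question
printf '%s\n' 'Bob Billy Billy' 'Bob Billy Joe' \
              'Bob Billy Billy' 'Joe Billy Billy' > names.txt

# [[:space:]] is the portable equivalent of \s
grep -c '[[:space:]]Billy[[:space:]]Billy' names.txt
# 3
```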

AWK: Find a sentence in one file and replace it with a sentence from another file

I have two files: file1 and file2
file1:
alpha
bravo
charlie //comment 1
delta
victor //comment 2
zulu
.
.
file2:
kirk
mike //new comment 1
some
phil //new comment 2
.
.
.
How can I replace comment 1 in file1 with new comment 1 in file2
and comment 2 with new comment 2 and so on.
Note : Number of comment lines are equal on both the files.
Note 2: Using awk.
Note 3: comment is a string. It can be anything.
eg: charlie //likethis
What I did:
I was just starting, So, I was trying to first achieve it with files having single comment.
awk -F\/\/ '{sub("//.*", "(cat file2 | grep "//")", $0); print $0}' file1
Desired Output:
alpha
bravo
charlie //new comment 1
delta
victor //new comment 2
zulu
.
.
If the matching cannot be based on the content of the comments but on the number of the comment within the file (i.e., first comment, second comment...) and if only the comment has to be replaced, try this
awk -F"//" '(NR==FNR && NF>1){a[++i]=$2}
(NR!=FNR){if(NF>1){print $1 FS a[++j]}
else{print $0}}' file2 file1
Note the order of arguments: file2 before file1.
This will give the desired output:
alpha
bravo
charlie //new comment 1
delta
victor //new comment 2
zulu
Update: Here is a more "bullet-proof" version which allows comments within comments (e.g., charlie // comments start with // and are followed by text)
awk 'BEGIN{FS=OFS="//"}(NR==FNR && NF>1){$1="";a[++i]=$0}
(NR!=FNR){if(NF>1){print $1 a[++j]}
else{print $0}}' file2 file1
Another solution:
awk 'NR==FNR{if($2){a[i++]=$2}next}
$2{$2=a[j++]}1' FS='//' OFS='//' file2 file1
Explanation
NR==FNR{if($2){a[i++]=$2}next}: This part takes care of file2; if the field that contains the comment has a value (if($2)), it is stored in an array (a), preserving order. The next statement skips the rest of the program for these rows.
$2{$2=a[j++]}1: processes file1 only when the 2nd field (the comment) has a value, replacing it with the one stored from file2, using the index in the same order.
Finally 1 is just a helper to print all file1 records (modified or not by the previous $2=a[j++]); in awk, when an expression evaluates to true (1 in this case), the default action is to print the affected record.
FS='//' and OFS='//' set the separators to the value of the comment marks.
Results
alpha
bravo
charlie //new comment 1
delta
victor //new comment 2
zulu
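The trailing 1 idiom used in the one-liner above can be seen in isolation:

```shell
# "1" is a pattern that is always true; with no action supplied,
# awk runs its default action, print $0, on every line
seq 3 | awk '1'
# 1
# 2
# 3
```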

delete line if string is matched and the next line contains another string

Got an annoying text manipulation problem: I need to delete a line in a file if it contains a string, but only if the next line also contains another string. For example, I have these lines:
john paul
george
john paul
12
john paul
I want to delete any line containing 'john paul' if it is immediately followed by a line that contains 'george', so it would return:
george
john paul
12
john paul
Not sure how to grep or sed this. If anyone could lend a hand that'd be great!
This might work for you (GNU sed):
sed '/john paul/{$!N;/\n.*george/!P;D}' file
If the line contains john paul, read the next line, and if it contains george, don't print the first line.
N.B. If the line containing george contains john paul it will be checked also.
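A quick run against the sample data (GNU sed assumed, as noted):

```shell
# Recreate the input from the question
printf '%s\n' 'john paul' 'george' 'john paul' '12' 'john paul' > file

# Delete a 'john paul' line when the following line contains 'george'
sed '/john paul/{$!N;/\n.*george/!P;D}' file
# george
# john paul
# 12
# john paul
```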
awk 'NR > 1 && !(/george/ && p ~ /john paul/) { print p } { p = $0 } END { print }' file
Output:
george
john paul
12
john paul
This awk should do:
cat file
john paul
george
john paul
12
john paul
hans
george
awk 'f~/john paul/ && /george/ {f=$0;next} NR>1 {print f} {f=$0} END {print}' file
george
john paul
12
john paul
hans
george
This will only delete the name above george if it is john paul.
Here is a more general version:
If the line matches a string and the previous line was exactly "john paul", then do nothing; otherwise, print the previous line. (Change the ^[a-zA-Z]+$ part to george if you only want george to be detected.)
awk '!(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
In your example:
$> echo 'john paul
george
john paul
12
john paul' |awk '!(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
george
john paul
12
john paul
If there are numbers in the line, it prints the previous line; otherwise it doesn't:
$> echo 'john paul
george 234
john paul
auie
john paul' |awk '!(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
john paul
george 234
auie
john paul
The sed solution is short: two commands and lots of comments ;)
/john paul/ {
# read the next line and append to pattern space
N
# and then if we find "george" in that next line,
# only retain the last line in the pattern space
s/.*\n\(.*george\)/\1/
# and finally print the pattern space,
# as we don't use the -n option
}
You put the above in some sedscript file and then run:
sed -f sedscript your_input_file
Just to throw some Perl into the mix:
perl -ne 'print $p unless /george/ && $p =~ /john paul/; $p = $_ }{ print $p' file
Print the previous line, unless the current line matches /george/ and the previous line $p matched /john paul/. Set $p to the value of the previous line. }{ effectively creates an END block, so the last line is also printed after the file has been read.
You might have to change the \r\n to \n or to \r; other than that, this should work:
<?php
$string = "john paul
george
john paul
12
john paul";
$string = preg_replace("#john paul\r\n(george)#i",'$1',$string);
echo $string;
?>
You could also read a file into the variable and then after overwrite the file.
With GNU awk for multi-char RS:
$ gawk -vRS='^$' '{gsub(/john paul\ngeorge/,"george")}1' file
george
john paul
12
john paul
or if there's more on each line than your sample input shows just change the RE to suit and use gensub():
$ gawk -vRS='^$' '{$0 = gensub(/[^\n]*john paul[^\n]*\n([^\n]*george[^\n]*)/,"\\1","")}1' file
george
john paul
12
john paul