Some good folk here on StackOverflow helped me find common lines in two files using awk:
awk 'NR==FNR{a[tolower($0)]; next} tolower($0) in a' 1.txt 2.txt
But how do I find common words in two files when one of the files has several words on the same line?
For example, let's say that I have 1.txt with these words:
apple
orange
butter
flower
And then 2.txt with these words:
dog cat Butter tower
How can I return butter (or Butter)? I just want to find the common words.
This grep should do the job:
grep -oiwFf 1.txt 2.txt
Butter
Alternatively, this simple GNU awk command also works:
awk -v RS='[[:space:]]+' 'NR==FNR {w[tolower($1)]; next} tolower($1) in w' 1.txt 2.txt
Butter
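Both commands can be checked end to end. A minimal sketch that recreates the question's files in a temp directory (note that a regexp RS, as used here, is a GNU awk extension):

```shell
# Recreate the question's sample files in a temp directory.
tmp=$(mktemp -d)
printf 'apple\norange\nbutter\nflower\n' > "$tmp/1.txt"
printf 'dog cat Butter tower\n' > "$tmp/2.txt"

# grep: -F fixed strings, -f patterns from 1.txt, -w whole words only,
# -i ignore case, -o print just the matching word
result_grep=$(grep -oiwFf "$tmp/1.txt" "$tmp/2.txt")
echo "$result_grep"    # Butter

# GNU awk: RS='[[:space:]]+' makes every word its own record,
# so the NR==FNR lookup works word by word
result_awk=$(awk -v RS='[[:space:]]+' \
    'NR==FNR {w[tolower($1)]; next} tolower($1) in w' \
    "$tmp/1.txt" "$tmp/2.txt")
echo "$result_awk"     # Butter

rm -r "$tmp"
```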
Given:
$ cat file1
apple
orange
butter
flower
$ cat file2
dog cat Butter tower
I would write it this way:
awk 'FNR==NR{for(i=1;i<=NF;i++) words[tolower($i)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
Note the field-by-field loop in the FNR==NR block, which handles files that may have more than one word per line. If you know that is not the case, you can simplify to:
awk 'FNR==NR{words[tolower($1)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
If this is not working on Windows, it may be an issue with \r\n line endings. If awk is using the default RS of \n, then the \r is left attached to the last word of each line, and butter\r does not match butter.
Try:
awk -v RS='[ \r\n\t]' 'FNR==NR{words[tolower($0)]; next}
tolower($0) in words' file1 file2
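To see the failure mode and the fix side by side, here is a sketch (GNU awk, since a regexp RS is a gawk extension) with a copy of file1 written with DOS \r\n endings:

```shell
# Demonstrate the CRLF problem: "butter\r" is not the same string as
# "butter", so the array lookup fails until the \r is dealt with.
tmp=$(mktemp -d)
printf 'apple\r\norange\r\nbutter\r\nflower\r\n' > "$tmp/file1"  # DOS endings
printf 'dog cat Butter tower\n' > "$tmp/file2"

# Default RS leaves \r attached to each file1 word -> no match
broken=$(awk 'FNR==NR{words[tolower($0)]; next}
    {for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}' \
    "$tmp/file1" "$tmp/file2")

# Treating \r as part of the record separator strips it -> match found
# (an empty record also lands in words[], harmless for this input)
fixed=$(awk -v RS='[ \r\n\t]' 'FNR==NR{words[tolower($0)]; next}
    tolower($0) in words' "$tmp/file1" "$tmp/file2")

echo "broken: '$broken'"   # broken: ''
echo "fixed:  '$fixed'"    # fixed:  'Butter'
rm -r "$tmp"
```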
Regarding your WSL comments in the linked thread: there are many workarounds for handling DOS files on Unix.
Create file1 with DOS line endings this way:
$ printf 'apple\r\norange\r\nbutter\r\nflower\r\n' >file1
Now you can verify that the file has those line endings with cat -v:
$ cat -v file1
apple^M
orange^M
butter^M
flower^M
You can also remove those line endings with sed, perl, awk, etc. Here is an awk command removing the \r from the files:
$ cat -v <(awk 1 RS='\r\n' ORS='\n' file1)
apple
orange
butter
flower
A sed and perl:
$ cat -v <(sed 's/\r$//' file1)
#same
or
$ cat -v <(perl -0777 -lpe 's/\r\n/\n/g' file1)
etc. Then use that same construct with awk-on-windows:
awk 'your_awk_program' <(awk 1 RS='\r\n' ORS='\n' file1) <(awk 1 RS='\r\n' ORS='\n' file2)
The downside: each input is still treated as a separate logical file, so the FNR==NR test keeps working, but the awk special variable FILENAME is lost in the process. If you want to keep FILENAME associated with the actual file, you need to preprocess the files before feeding them to awk, or deal with the \r inside your awk script.
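A quick way to see this (bash, since <( ) is process substitution): FILENAME reports the substituted file-descriptor path rather than the real name:

```shell
# FILENAME under process substitution is a /dev/fd/NN style path,
# not the real file name (the exact path varies by system).
printf 'apple\r\n' > file1_dos
fname=$(awk 'FNR==1{print FILENAME}' <(awk 1 RS='\r\n' ORS='\n' file1_dos))
echo "$fname"    # e.g. /dev/fd/63
rm file1_dos
```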
You need to loop over every field per line (of 2.txt) and check:
awk 'NR==FNR{a[tolower($0)];next}{for(i=1;i<=NF;i++){if(tolower($i) in a){print $i}}}' \
1.txt 2.txt
An alternative way to do this in awk would be to add whitespace to the input record separator when processing the 2nd file:
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' 1.txt RS="[\n ]" 2.txt
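This works because command-line assignments take effect when awk reaches them in the argument list, so RS is still a newline while 1.txt is read and only becomes the regexp [\n ] for 2.txt (a regexp RS is a GNU awk extension). A sketch:

```shell
tmp=$(mktemp -d)
printf 'apple\norange\nbutter\nflower\n' > "$tmp/1.txt"
printf 'dog cat Butter tower\n' > "$tmp/2.txt"

# RS changes only for the second file: 1.txt is read line by line,
# 2.txt one word per record
result=$(awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' \
    "$tmp/1.txt" RS='[\n ]' "$tmp/2.txt")
echo "$result"    # Butter
rm -r "$tmp"
```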
Trying to use awk to remove each line that has an _ in $5. The formatting of the file makes it look like it's $4, but neither works. I also tried sed '/_/d', but that removed all lines. Thank you :).
file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 114713789 114713988 NRAS_3
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
chr1 247423836 247425609 NLRP3_3
file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
awk
awk -F\t '$4 !~ /_/'
awk -F\t '$5 !~ /_/'
Could you please try the following, written and tested with the shown samples in GNU awk.
awk -F'[[:blank:]]+' '$5!~/_/' Input_file
Explanation: this simply uses the [[:blank:]]+ character class as the field separator for all lines of Input_file, then prints every line whose 5th field does NOT contain _ (no action is given, so the default action of printing the line applies). Caveat: on the 4-field lines shown (e.g. the NRAS_3 line), $5 is empty and therefore also passes the test, so this relies on the real file having five fields on every line.
2nd solution: or, if the field with the underscore is always the last field in your Input_file, then try the following.
awk '$NF!~/_/' Input_file
You may use $NF, since the field in question is the last one on every line:
awk -F '\t' '$NF !~ /_/' file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
Or you can avoid regex:
awk -F '\t' 'index($NF, "_") == 0' file
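The $NF approach can be checked against the sample as shown (here with the default field separator, assuming the data really is blank-separated as displayed):

```shell
tmp=$(mktemp -d)
cat > "$tmp/Input_file" <<'EOF'
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 114713789 114713988 NRAS_3
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
chr1 247423836 247425609 NLRP3_3
EOF

# keep only lines whose last field has no underscore
result=$(awk '$NF!~/_/' "$tmp/Input_file")
echo "$result"
rm -r "$tmp"
```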
I have a pretty large file from which I'd like to extract only the first line of each run of lines containing my match, repeating this until the end of the file. Example input and desired output below.
Input
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
S,23,4,7,78,8,9,6
S,3,5,67,8,54,56
S,4,8,9,54,3,4,52
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
S,5,6,4,5,67,87,88
S,4,23,5,8,5,7,3
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
Output
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
The match in my lines is S, which usually comes after a line starting with either E or C. What I'm struggling with is telling awk to print only the first S line after the ones with E or C; put another way, the first line of each bunch of lines containing S. Any ideas?
Does this one-liner help?
awk '/^S/&&!i{print;i=1}!/^S/{i=0}' file
(The flag i is set after printing an S line and cleared on any non-S line, so only the first S of each run is printed.)
or more "readable":
awk -v p=1 '/^S/&&p{print;p=0}!/^S/{p=1}' file
You can use sed, like this:
sed -rn '/^(E|C)/{:a;n;/^S/!ba;p}' file
Here's a multi-liner to put in a file (e.g. u.awk):
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
then run: awk -f u.awk inputdatafile
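For reference, a sketch that writes the script and the sample data to a temp directory and runs it:

```shell
tmp=$(mktemp -d)
# the three-rule script: any C/E line arms the flag, the first S line
# while armed is printed, and any line reaching the last rule disarms it
cat > "$tmp/u.awk" <<'EOF'
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
EOF
cat > "$tmp/inputdatafile" <<'EOF'
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
S,23,4,7,78,8,9,6
S,3,5,67,8,54,56
S,4,8,9,54,3,4,52
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
S,5,6,4,5,67,87,88
S,4,23,5,8,5,7,3
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
EOF

result=$(awk -f "$tmp/u.awk" "$tmp/inputdatafile")
echo "$result"
rm -r "$tmp"
```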
awk to the rescue!
$ awk '/^[CE]/{p=1} /^S/&&p{p=0;print}' file
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
This variant keeps every C and E line and prints only the first S of each run:
$ awk '/^S/{if (!f) print; f=1; next} {print; f=0}' file
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
I have 2 files, file1 and file2. I am trying to read one line from file1 and one line from file2 and insert HTML tags to make the result usable in an HTML file. I have been trying to work with awk with little success. Can someone please help?
File1:
SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem
SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes
File2:
FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt
FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt
Desired output:
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
Using bash (${!1} is indirect expansion: it expands the variable whose name is stored in $1):
printItem() { printf "<%s>%s</%s>\n" "$1" "${!1}" "$1"; }
paste file1 file2 |
while read workflow File; do
echo "<ParameterFile>"
printItem workflow
printItem File
done
With awk, it would be:
awk '
NR==FNR {workflow[FNR]=$1; next}
{
print "<ParameterFile>"
printf "<workflow>%s</workflow>\n", workflow[FNR]
printf "<File>%s</File>\n", $1
}
' file1 file2
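Run against the two files from the question, this pairs the lines by position (FNR) and produces the desired output; a sketch using a temp directory:

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem
SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes
EOF
cat > "$tmp/file2" <<'EOF'
FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt
FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt
EOF

# first pass stores file1 lines indexed by line number; second pass
# pairs each file2 line with the file1 line at the same position
result=$(awk '
    NR==FNR {workflow[FNR]=$1; next}
    {
        print "<ParameterFile>"
        printf "<workflow>%s</workflow>\n", workflow[FNR]
        printf "<File>%s</File>\n", $1
    }' "$tmp/file1" "$tmp/file2")
echo "$result"
rm -r "$tmp"
```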
Another approach that does not require storing the first file in memory (note that the getline return value is unchecked: if file2 runs out of lines before file1, $0 is left holding the current file1 line):
awk '{
print "<ParameterFile>"
print "<workflow>" $0 "</workflow>"
getline < "file2"
print "<File>" $0 "</File>"
}' file1
If you don't mind mixing in some shell:
$ paste -d$'\n' file1 file2 |
awk '{ printf (NR%2 ? "<ParameterFile>\n<workflow>%s</workflow>\n" : "<File>%s</File>\n"), $0 }'
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
Otherwise see @GlennJackman's solution above for the pure-awk way to do it.