Grep that tolerates mismatches to subset .fastq - awk

I am working with bash on a linux cluster. I am trying to extract reads from a .fastq file if they contain a match to a queried sequence. Below is an example .fastq file containing three reads.
$ cat example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
I would like to extract reads containing the sequence GAAATAATA. I can perform this extraction using grep as shown in the following command.
$ grep -F -B 1 -A 2 "GAAATAATA" example.fastq > MATCH.fastq
$ cat MATCH.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
However, this strategy does not tolerate any mismatches. For example, a read containing the sequence GAAATGATA will be ignored. I need this extraction to tolerate one mismatch at any position in the queried sequence. So my question is how can I achieve this? Is there a sequence alignment package available with similar functionality to grep? Are there any fastq subsetting packages available that perform this type of operation? One note is that speed is very important. Thanks for your guidance.

Here is a solution that uses agrep to get the line numbers of the matches and an awk script that prints those lines with some context (agrep lacks -A and -B):
$ agrep -1 -n "GAAATGATA" file |
awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - file
Output:
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

This should work, but I don't know whether the MATCH.fastq in your question is the expected output, or whether your sample input contains any case that actually exercises the mismatch handling, so I can't be sure it really works:
$ cat tst.awk
BEGIN {
for (i=1; i<=length(seq); i++) {
regexp = regexp sep substr(seq,1,i-1) "." substr(seq,i+1)
sep = "|"
}
}
{ rec = rec $0 ORS }
!(NR % 4) {
if (rec ~ regexp) {
printf "%s", rec
}
rec = ""
}
$ awk -v seq='GAAATAATA' -f tst.awk example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
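One thing to watch with whole-record matching is that the regexp is also tested against the header and quality lines. A minimal sketch that tests only the sequence line (line 2 of each record) might look like this. The function name fastq_mm1 is hypothetical, and the sketch assumes strictly 4-line records and substitutions only (no insertions or deletions):

```shell
# Hypothetical helper: print FASTQ records whose sequence line contains
# SEQ with at most one substituted base. Usage: fastq_mm1 SEQ FILE
fastq_mm1() {
    awk -v seq="$1" '
    BEGIN {
        # One alternative per position, with that position wildcarded.
        for (i = 1; i <= length(seq); i++) {
            re = re sep substr(seq, 1, i - 1) "." substr(seq, i + 1)
            sep = "|"
        }
    }
    NR % 4 == 1 { head = $0 }                   # header line
    NR % 4 == 2 { body = $0; hit = ($0 ~ re) }  # sequence line: the only one tested
    NR % 4 == 3 { plus = $0 }                   # separator line
    NR % 4 == 0 { if (hit) print head ORS body ORS plus ORS $0 }
    ' "$2"
}
```

Called as fastq_mm1 GAAATAATA example.fastq, it should print complete records only, so no grep-style context options are needed.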

You might try a file of patterns -
$: cat GAAATAATA
.AAATAATA
G.AATAATA
GA.ATAATA
GAA.TAATA
GAAA.AATA
GAAAT.ATA
GAAATA.TA
GAAATAA.A
GAAATAAT.
then
grep -B 1 -A 2 -f GAAATAATA example.fastq > MATCH.fastq
but it will probably slow the process down a bit to add both full regex parsing AND an alternate pattern for each possible single change...
responding to question in comments:
For a given value of $word, such as word=GAAATAATA,
awk '{
    n = split($0, tmp, "")
    for ( i=1; i<=n; i++ ) {
        # print the word with position i replaced by the "." wildcard
        for ( j=1; j<=n; j++ ) { printf "%s", (j == i ? "." : tmp[j]) }
        printf "\n"
    }
}' <<< "$word" > "$word"
This will create this specific file.
Hope that helps, but remember that this will be a lot slower since you are now using regexes instead of just matching plain strings, AND you are introducing a whole series of alternate patterns to match...

Related

AWK:Convert columns to rows with condition (create list ) [duplicate]

I have a tab-delimited file with three columns (excerpt):
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004 IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat
AC148152.3_FG001 IPR026961 PGG domain
and I'd like to get this using bash:
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat IPR026961 PGG domain
So if the ID in the first column is the same on several lines, the output should have one line per ID with the remaining parts of those lines joined. For the example above this gives a two-row file.
give this one-liner a try:
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x a[x]}' file
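Note that for (x in a) visits the keys in an unspecified order. If the IDs should come out in the order they first appear, a sketch along these lines may help. The name merge_by_key is made up for illustration, and it assumes the ID never contains a tab:

```shell
# Hypothetical helper: join all rows that share the first tab-separated
# field, keeping the order in which each key was first seen.
merge_by_key() {
    awk -F'\t' '
    {
        key = $1
        rest = substr($0, length(key) + 2)      # everything after "key<TAB>"
        if (!(key in merged)) order[++n] = key  # remember first-seen order
        merged[key] = merged[key] "\t" rest
    }
    END { for (i = 1; i <= n; i++) print order[i] merged[order[i]] }
    ' "$@"
}
```

Usage would be merge_by_key file, or piping the data in on standard input.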
For whatever reason, the awk solution does not work for me in cygwin, so I used Perl instead. It joins the values with a tab character and separates lines with \n.
cat FILENAME | perl -e 'foreach $Line (<STDIN>) { @Cols=($Line=~/^\s*(\d+)\s*(.*?)\s*$/); push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $List (values %Link) { print join("\t", @{$List})."\n"; }'
Which approach works best depends on the file size (and on awk's memory limits).
If the file is too big, sorting it first reduces what awk has to hold: only the current label stays in memory until it is printed.
A classic version that prints each group once it is complete, by modifying the whole line:
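sort YourFile \
| awk '
# same label as the previous line: drop the label and append the rest
Last == $1 { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next }
# new label: print the finished group, then start a new one
NR > 1 { print C }
{ Last = $1; C = $0 }
END { if (NR) print C }
'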
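Another version that works field by field and prints as it goes, but is less "human readable":
sort YourFile \
| awk '
# new label: end the previous output line (if any) and print the label
Last != $1 { printf( "%s%s", (NR > 1 ? "\n" : ""), Last = $1) }
{ for( i=2; i<=NF; i++) printf( " %s", $i) }
END { if (NR) printf( "\n") }
'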
A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.
All on one line:
{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv
Readable and commented equivalent:
{
# Define `merged` as an empty associative array.
declare -A merged
merged=()
# Read tab-separated lines. Any leftover fields also end up in `value`.
while IFS=$'\t' read -r key value
do
# Append to any value that's already there, separated by a tab.
merged[$key]="${merged[$key]}"$'\t'"$value"
done
# Loop over the input keys. Note that the order is arbitrary;
# pipe through `sort` if you want a predictable order.
for key in "${!merged[@]}"
do
# Each value is prefixed with a tab, so no need for a tab here.
echo "$key${merged[$key]}"
done
} < INPUT_FILE.tsv

AWK, Comma delimited fields enclosed in quotes [duplicate]

The intent of this question is to provide a canonical answer.
Given a CSV as might be generated by Excel or other tools with embedded newlines and/or double quotes and/or commas in fields, and empty fields like:
$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
What's the most robust and efficient way to use awk to identify the separate records and fields:
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
so it can be used as those records and fields internally by the rest of the awk script.
A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.
The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):
$ echo 'foo,"field,""with"",commas",bar' |
awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
or the equivalent using any awk:
$ echo 'foo,"field,""with"",commas",bar' |
awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
rec = $0
$0 = ""
i = 0
while ( (rec!="") && match(rec,fpat) ) {
$(++i) = substr(rec,RSTART,RLENGTH)
rec = substr(rec,RSTART+RLENGTH+1)
}
for (i=1; i<=NF;i++) print i " <" $i ">"
}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.
If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1 fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",
Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:
$ cat decsv.awk
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec $0
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
$0 = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( gsub(/^"|"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
printf "Record %d:\n", ++recNr
for (i=1;i<=NF;i++) {
# To replace newlines with blanks add gsub(/\n/," ",$i) here
printf " $%d=<%s>\n", i, $i
}
print "----"
}
$ awk -f decsv.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but doesn't have to be) is mid-field and so we keep building the current record but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.
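To see the quote-counting idea in isolation, here is a stripped-down sketch that only reassembles the records and does not split fields. It is not the decsv.awk script above; the name csv_join_records is hypothetical, and it assumes \n line endings and "" as the only quote escape:

```shell
# Hypothetical helper: print one logical CSV record per output line,
# showing newlines embedded in quoted fields as spaces.
csv_join_records() {
    awk '
    {
        rec = rec $0
        tmp = rec
        # gsub returns the number of replacements, i.e. the quote count.
        if (gsub(/"/, "", tmp) % 2) {
            rec = rec "\n"    # odd count: this newline is inside a quoted field
            next
        }
        gsub(/\n/, " ", rec)  # even count: the record is complete
        print rec
        rec = ""
    }
    ' "$@"
}
```

Run against file.csv from the question, this should print three lines, one per record.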
*I say "modern awk" above because there's apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too but if that's not an option then I expect you can work around that particular bug by changing this:
if ( gsub(/^"|"$/,"",fldStr) ) {
to this:
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:
@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.
Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
An improvement upon @EdMorton's FPAT solution, which should be able to handle double-quotes (") escaped by doubling ("" -- as allowed by the CSV standard).
gawk -v FPAT='[^,]*|("[^"]*")+' ...
This STILL isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files, and it assumes GNU awk (gawk); a standard awk won't do.
Example:
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
for(i=1; i<=NF;i++){
if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
print "<"$i">"
}
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.
Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.
csvquote file.csv | awk -f my_awk_script | csvquote -u
EDIT:
For a complete description on csvquote, see: How it works. this also explains the `` characters which are shown in places where there was a carriage return.
csvquote file.csv | awk -f decsv.awk | csvquote -u
(for the source of decsv.awk see answer from Ed Morton )
output:
Record 1:
$1=<rec1 fld1>
$2=<>
$3=<rec1","fld3.1",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3fld2">
$3=<>
----
I have found csvkit a really useful toolkit for handling CSV files on the command line.
line='test,t2,t3,"t5,"'
echo $line | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar' | csvcut -c 3
bar
It also contains csvstat, csvstack etc. tools which are also very handy.
cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
csvcut -c 1 file.csv
"rec1, fld1"
"rec2, fld1.1
fld1.2"
""""""
csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
Awk (gawk) actually provides extensions, one of which being csv processing, which is the most robust way to do so with gawk in my opinion. The extension takes care of many gotchas and parses the csv for you.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}' test.csv
The output is:
James T. Kirk
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print a[1] }, aka prints first csv field of the line.
If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.
frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).
GoAWK is a POSIX-compatible AWK implementation written in Go. Also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:
$ cat records.awk
{
printf "Record %d:\n", NR
for (i=1; i<=NF; i++)
printf " $%d=<%s>\n", i, $i
print "----"
}
Then use either frawk -i csv or goawk -i csv to get the expected output. For example:
$ frawk -i csv -f records.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
$ goawk -i csv -f records.awk file.csv
Record 1:
... same as above ...
----

AWK, exclude results from one file with regards to a second file

Using awk, I am able to get a list of URLs with a given status code:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Fine and dandy.
But we would like to refine it further by matching that result against a list of already known 404 URLs.
Example:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7} '| sort | uniq -c | sort -k 2 -r | awk '{print > "/mnt/tmp/404error.txt"}'
yield today :
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
the day after :
1 /going-out/ru/%d0%bd%d0%be%d1%87%d0%bd%d0%b0%d1%8f-%d0%b6%d0%b8%d0%b7%d0%bd%d1%8c-%d0%bd%d0%b0-%d0%bf%d1%85%d1%83%d0%ba%d0%b5%d1%82%d0%b5/%d1%81%d0%be%d0%b2%d0%b5%d1%82%d1%8b-%d0%bb%d1%8e%d0%b1%d0%b8%d1%82%d0%b5%d0%bb%d1%8f%d0%bc-%d0%bd%d0%be%d1%87%d0%bd%d1%8b%d1%85-%d1%80%d0%b0%d0%b7%d0%b2%d0%bb%d0%b5%d1%87%d0%b5%d0%bd%d0%b8%d0%b9/
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
1 /going-out/report-content/?cid=29968
1 /going-out/report-content/?cid=29823
The goal is to get only the new URLs.
At that point I am lost. I know I can load the first file into an array, and I presume I can do the same with the second file (in a second array), then maybe (I am not sure awk can do this) simply compare them and keep what does not match.
Any help will be greatly appreciated.
So you have a file whose $9 field may match /404/. If so, you want to store the 7th field. Then, count how many of them appeared in total, but just if they did not appear before in a file you have.
I think all of this can be done with this (untested, because I have no sample input data):
awk 'FNR==NR {seen[$2];next}
$9 ~ /404/ {if (!($7 in seen)) a[$7]++}
END {for (i in a) print a[i], i}' old_file log_file
This stores the 2nd column from the file with data into an array seen[]. Then, goes through the new file and stores the 7th column if it wasn't seen before. Finally, it prints the counters.
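The same two-file idea as a self-contained sketch. The function name new_404s is made up, the known-URL file is assumed to hold the URL in its last field, and the log is assumed to be in the default nginx format with the URL in $7 and the status in $9. It tests FILENAME == ARGV[1] rather than FNR == NR so that an empty first file does not confuse it:

```shell
# Hypothetical helper: new_404s KNOWN_URLS ACCESS_LOG
# Prints "count url" for 404 URLs that are not listed in KNOWN_URLS.
new_404s() {
    awk '
    FILENAME == ARGV[1] { seen[$NF]; next }      # first file: remember known URLs
    $9 == 404 && !($7 in seen) { count[$7]++ }   # log: count unseen 404 URLs
    END { for (url in count) print count[url], url }
    ' "$1" "$2" | sort -k2
}
```

The trailing sort only makes the output order predictable; drop it if order does not matter.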
Since it looks like you have an old awk version that does not support the syntax index in array, you can use this workaround for it:
$9 ~ /404/ {for (i in seen) {if (i==$7) next} a[$7]++}
Note you must be using a veeery old version, since this was introduced in 1987:
A.1 Major Changes Between V7 and SVR3.1
The awk language evolved considerably between the release of Version 7
Unix (1978) and the new version that was first made generally
available in System V Release 3.1 (1987). This section summarizes the
changes, with cross-references to further details:
The expression ‘indx in array’ outside of for statements (see
Reference to Elements)
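You can use grep --invert-match --fixed-strings --file=FILEALL FILENEW or comm -23 FILENEW FILEALL for this. FILEALL is the file containing the URLs already found; FILENEW contains the pages found today. grep needs --invert-match so that it prints only the lines of FILENEW that are not listed in FILEALL. For comm, both files must be sorted.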
http://www.gnu.org/software/gawk/manual/gawk.html#Other-Inherited-Files
http://linux.die.net/man/1/comm
I think comm is more efficient because it uses sorted files, but I did not test this.
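A small sketch of the comm variant that does the sorting itself. The name only_new is hypothetical, and it assumes both files hold one URL per line without the count column:

```shell
# Hypothetical helper: only_new NEWFILE OLDFILE
# Prints the lines that occur in NEWFILE but not in OLDFILE.
only_new() {
    new_sorted=$(mktemp) && old_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$new_sorted"
    LC_ALL=C sort -u "$2" > "$old_sorted"
    LC_ALL=C comm -23 "$new_sorted" "$old_sorted"   # -23: keep column 1 only
    rm -f "$new_sorted" "$old_sorted"
}
```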
I came up with the following :
awk 'BEGIN {
while ((getline < "/mnt/tmp/404error.txt") > 0) {
A[$1] = $1;
};
while ((getline < "/var/log/nginx/access.log") > 0) {
if( $9 ~ /404/)
{
{
exist[$7] = $7 ;
}
{
if ($7 in A) blah += 1; else new[$7];
}
}
}
{
asort(exist);
for(i in exist)
print exist[i] > "/mnt/tmp/404error.txt"
}
{
asorti(new);
for(i in new)
print new[i] > "/mnt/tmp/new404error.txt"
}
}
' | mutt -s "subject" -a /mnt/tmp/new404error.txt -- whoever@mail.net, whatever@mail.net
That seems to give me what I want (almost).
But I believe it is too verbose; maybe one of you geniuses can improve it.
Thanks

search for variable in multiple files within the same script

I have a script which reads every line of a file and prints output based on certain matches:
function tohyphen (o) {
split (o,a,"to[-_]")
split (a[2],b,"-")
if (b[1] ~ / /) { k=""; p=""; }
else { k=b[1]; p=b[2] }
if (p ~ / /) { p="" }
return k
}
print k, "is present in" , FILENAME
What I need to do is check whether the value of k is present in about 60 other files and print the names of those files, ignoring the file it was originally read from. I'm currently doing this with grep, but calling grep so many times drives the CPU usage up. Is there a way I can do this within the awk script itself?
You can try something like this with gnu awk.
gawk '/pattern to search/ { print FILENAME; nextfile }' *.files
You can replace your pipeline grep "$k" *.cfg | grep "something1" | grep "something2" | cut -d -f2,3,4 with the following single awk script:
awk -v k="$k" '$0~k&&/something1/&&/something2/{print $2,$3,$4}' *.cfg
You mention printing the filename in your question, in this case:
awk -v k="$k" '$0~k&&/something1/&&/something2/{print FILENAME;nextfile}' *.cfg
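To also skip the file the value came from, and without relying on nextfile, something like the following sketch could work. The name files_with is invented here; it treats the key as a literal string (not a regexp) and assumes the key and file names contain no backslashes, since awk -v interprets escape sequences:

```shell
# Hypothetical helper: files_with KEY SKIP FILE...
# Prints each FILE (other than SKIP) that contains KEY as a literal string.
files_with() {
    key=$1 skip=$2
    shift 2
    awk -v k="$key" -v skip="$skip" '
    FILENAME == skip { next }                 # ignore the originating file
    index($0, k) && !(FILENAME in hit) {      # first matching line of a file
        hit[FILENAME]
        print FILENAME
    }
    ' "$@"
}
```

This reads all the files in a single awk process, which is the point when grep was being started once per value.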

Best Awk Commands

I find AWK really useful. Here is a one liner I put together to manipulate data.
ls | awk '{ print "awk " "'"'"'" " {print $1,$2,$3} " "'"'"'" " " $1 ".old_ext > " $1 ".new_ext" }' > file.csh
I used this AWK to make a script file that would rename some files and only print out selective columns. Anyone know a better way to do this? What are your best AWK one-liners or clever manipulations?
The AWK book is full of great examples. They used to be collected for download from Kernighan's webpage (404s now).
You can find several nice one liners here.
I use this:
df -m | awk '{p+=$3}; END {print p}'
To total all disk space used on a system across filesystems.
Many years ago I wrote a tail script in awk:
#!/usr/bin/awk -f
BEGIN {
lines=10
}
{
high = NR % lines + 1
a[high] = $0
}
END {
for (i = 0; i < lines; i++) {
n = (i + high) % lines + 1
if (n in a) {
print a[n]
}
}
}
It's silly, I know, but that's what awk does to you. It's just very fun playing with it.
Henry Spencer wrote a fairly good implementation of nroff on awk. He called it "awf". He also claimed that if Larry Wall had known how powerful awk was, he wouldn't have needed to invent perl.
Here's a couple of awks that I used to use regularly ... note that you can use $1, $2, etc to get out the column you want. So, for manipulating a bunch of files, for example here's a stupid command you could use instead of mv ...
ls -1 *.mp3 | awk '{printf("mv %s newDir/%s\n",$1,$1)}' | /bin/sh
Or if you're looking at a set of processes maybe ...
ps -ef | grep -v username | awk '{printf("kill -9 %s\n",$2)}' | /bin/sh
Pretty trivial but you can see how that would get you quite a ways. =) Most of the stuff I used to do you can use xargs for, but hey, who needs them new fangled commands?
I use this script a lot for editing PATH and path-like environment variables.
Usage:
export PATH=$(clnpath /new/bin:/other/bin:$PATH /old/bin:/other/old/bin)
This command adds /new/bin and /other/bin in front of PATH, removes both /old/bin and /other/old/bin from PATH (if present - no error if absent), and removes duplicate directory entries on path.
: "@(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
chop="'"$chop"'";
if (chop != "") nr = split(chop, remove); else nr = 0;
for (i = 1; i <= nr; i++)
omit[remove[i]] = 1;
}
{
for (i = 1; i <= NF; i++)
{
x=$i;
if (x == "") x = ".";
if (omit[x] == 0 && path[x]++ == 0)
{
output = output pad x;
pad = ":";
}
}
print output;
}'
Count memory used by httpd
ps -ylC httpd | awk '/[0-9]/ {SUM += $8} END {print SUM/1024}'
Or any other process by replacing httpd. Dividing by 1024 to get output in MB.
I managed to build a DOS tree command emulator for UNIX ( find + awk ):
find . -type d -print 2>/dev/null|awk '{for (i=1;i< NF;i++)printf("%"length($i)"s","|");gsub(/[^\/]*\//,"--",$0);print $NF}' FS='/'
Print lines between two patterns:
awk '/END/{flag=0}flag;/START/{flag=1}' inputFile
Detailed explanation: http://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
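The order of the three rules decides whether the marker lines themselves are printed. A small sketch of both variants (the function names are made up, and START and END are the same placeholder patterns as above):

```shell
# Lines strictly between the markers; the marker lines are dropped
# because the flag is cleared before, and set after, the print rule.
between_exclusive() { awk '/END/ { flag = 0 } flag; /START/ { flag = 1 }' "$@"; }

# Same idea with the rule order swapped, so the marker lines are kept.
between_inclusive() { awk '/START/ { flag = 1 } flag; /END/ { flag = 0 }' "$@"; }
```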
A couple of favorites, essentially unrelated to each other. Read as 2 different, unconnected suggestions.
Identifying Column Numbers Easily:
For those that use awk frequently, as I do for log analysis at work, I often find myself needing to find out what the column numbers are for a file. So, if I am analyzing, say, Apache access files (some samples can be found here) I run the script below against the file:
NR == 1 {
for (i = 1 ; i <= NF ; i++)
{
print i "\t" $i
}
}
NR > 1 {
exit
}
I typically call it "cn.awk", for 'c'olumn 'n'umbers. Creative, eh? Anyway, the output looks like:
1 64.242.88.10
2 -
3 -
4 [07/Mar/2004:16:05:49
5 -0800]
6 "GET
7 /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables
8 HTTP/1.1"
9 401
10 12846
Very easy to tell what's what. I usually alias this on my servers and have it everywhere.
Referencing Fields by Name
Now, suppose your file has a header row and you'd rather use those names instead of field numbers. This allows you to do so:
NR == 1 {
for (i = 1 ; i <= NF ; i++)
{
field[$i] = i
}
}
Now, suppose I have this header row...
metric,time,val,location,http_status,http_request
...and I'd like to sum the val column. Instead of referring to $3, I can refer to it by name:
NR > 1 {
SUM += $field["val"]
}
The main benefit is making the script much more readable.
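Put together as one runnable sketch, assuming a plain comma-separated file with no quoted fields. The name sum_column is hypothetical:

```shell
# Hypothetical helper: sum_column NAME < file.csv
# Sums the column whose header is NAME.
sum_column() {
    awk -F, -v col="$1" '
    NR == 1 {
        for (i = 1; i <= NF; i++) field[$i] = i   # header name -> column number
        if (!(col in field)) { bad = 1; exit 1 }  # unknown column: give up
        next
    }
    { sum += $field[col] }
    END { if (!bad) print sum + 0 }
    '
}
```

With the header row shown above, sum_column val < file.csv would add up the third column.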
Printing fields is one of the first things mentioned in most AWK tutorials.
awk '{print $1,$3}' file
Lesser known but equally useful is excluding fields which is also possible:
awk '{$1=$3=""}1' file