How to count patterns in multiple files using awk

I have multiple log files and I need to count the number of occurrences of certain patterns in all those files.
#!/usr/bin/awk
match($0,/New connection from user \[[a-z_]*\] for company \[([a-z_]*)\]/, a)
{instance[a[1]]++}
END {
for(i in instance)
print i":"instance[i]
}
I am running this script like this:
awk -f script *
But it looks like the count is not correct. Is my approach above correct for handling multiple files?

Try moving the opening curly brace up onto the same line as the match() call. Otherwise the instance[a[1]]++ block will run for every line. Doesn't it also print out the full line of every match, to start with? A match() expression on a line by itself is a pattern with the default action, which is {print}.
#!/usr/bin/awk -f
match($0, /New connection from user \[[a-z_]*\] for company \[([a-z_]*)\]/, a) {   # 3-argument match() requires GNU awk
    instance[a[1]]++
}
END {
    for(i in instance)
        print i ":" instance[i]
}
Further details of how individual file names are read are available at the GNU awk site, but the behaviour applies generally: BEGIN runs before any file has been read, END runs after all of them, and variables keep their values across files, apart from a few built-ins (FNR, for example, is the record number within the current file).
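For example, a quick sketch of how those pieces behave when awk is given several files in one run (file1 and file2 are placeholder names):
$ awk 'FNR == 1 { print "-- starting", FILENAME }   # FNR resets to 1 for every input file
       { total++ }                                  # ordinary variables persist across files
       END { print total, "records across all files" }' file1 file2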

Related

Comparing a .TTL file to a CSV file and extract "similar" results into a new file

I have a large CSV file that is filled with millions of lines, each of which has the following format:
/resource/example
I also have a .TTL file in which a line may contain that exact same text. I want to extract every line from the .TTL file containing the same text as my current CSV file into a new CSV file.
I think this is possible using grep, but that is a Linux command and I am very, very inexperienced with it. Is it possible to do this on Windows? I could write a Python script that compares the two files, but since both files contain millions of lines, I think that would take days to execute. Could anyone point me in the right direction on how to do this?
Thanks in advance! :)
Edit:
Example line from .TTL file:
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Example line from current CSV file:
/resource/algoritme
So with these two example lines it should export the line from the .TTL file into a new CSV file.
Using GNU awk. First read the CSV and hash it into array a. Then compare each entry of a against every row of the TTL file:
$ awk 'BEGIN { IGNORECASE = 1 }    # ignore case (GNU awk)
NR==FNR { a[$1]; next }            # hash the CSV lines as keys of a
{
    for(i in a)                    # for each CSV entry
        if($0 ~ i) {               # test it against the current TTL record
            print                  # if it matches, output the TTL record
            next                   # and skip to the next TTL record
        }
}' file.csv file.ttl
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Depending on the sizes of the files it might be slow; it could perhaps be made faster, but not based on the information offered in the OP.
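If the CSV entries are plain strings rather than regular expressions, a literal substring test with index() may be faster and avoids surprises from regex metacharacters. A rough sketch along those lines, with the same placeholder file names and lower-casing both sides instead of relying on IGNORECASE:
$ awk 'NR==FNR { a[tolower($1)]; next }   # hash the lower-cased CSV lines
{
    line = tolower($0)
    for(i in a)
        if(index(line, i)) {              # literal substring test, no regex engine
            print
            next
        }
}' file.csv file.ttl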

Compare 2 files in awk and append Pass / Fail

Thank you all for the feedback. Apologies, I am new to coding and new to SO. Below is the code I have currently been running.
awk 'FNR==NR{a[$4,$5]=$0}{if(b=a[$4,$5]); print b, "PASS";next}else{if(b!=a[$4,$5]){print a, b, "FAIL";next}}'
This appends a PASS next to each line if it is the same, but it does not print FAIL if there are any inconsistencies in the line.
Trying to get myself more familiar with awk. Using FNR==NR I've been able to compare 2 files (line by line) and then print PASS at the end of the file. However, I cannot actually get it to properly fail the scenario and print FAIL if they do not match. Could anybody help a noobie out?
Here is an awk script to get you started.
$ awk 'NR==FNR{a[NR]=$0;next}                   # slurp the first file into a, keyed by line number
 {f=$0!=a[FNR]; delete a[FNR]}                  # f is 1 when the current lines differ
 f{c=FNR;exit}                                  # remember the first differing line and stop
 END{c=c?c:(FNR+1)                              # no difference seen: look one line past the end
     # the "c in a" test catches the case where the first file is longer
     print f||(c in a)?"FAIL on line "c:"PASS"}'
The additional complexity is because the files might have different lengths. Note also that there are existing tools (diff, comm, ...) that do this in a much more compact way.
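If what you are after is the per-line PASS/FAIL that your own attempt aims for, keyed on fields 4 and 5, a minimal sketch might look like this (the file names and the choice of fields are taken from your snippet, so treat them as assumptions):
$ awk 'FNR==NR { a[$4,$5]=$0; next }                    # first file: remember each line by fields 4 and 5
       { print $0, (($4,$5) in a ? "PASS" : "FAIL") }   # second file: append PASS or FAIL to each line
      ' file1 file2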

Buffering output with AWK

I have an input file which consists of three parts:
inputFirst
inputMiddle
inputLast
Currently I have an AWK script which with this input creates an output file which consists of two parts:
outputFirst
outputLast
where outputFirst and outputLast is generated (on the fly) from inputFirst and inputLast respectively. However, to calculate the outputMiddle part (which is only one line) I need to scan the entire input, so I store it in a variable. The problem is that the value of this variable should go in between outputFirst and outputLast in the output file.
Is there a way to solve this using a single portable AWK script that takes no arguments? Is there a portable way to create temporary files in an AWK script or should I store the output from outputFirst and outputLast in two variables? I suspect that using variables will be quite inefficient for large files.
All versions of AWK (since at least 1985) can do basic I/O redirection to files or pipelines, just like the shell can, as well as run external commands without I/O redirection.
So, there are any number of ways to approach your problem and solve it without having to read the entire input file into memory. The most optimal solution will depend on exactly what you're trying to do, and what constraints you must honour.
A simple approach to the more precise example problem you describe in your comment above might go something like this. In the BEGIN clause, form two unique filenames with rand() and define your variables. Then read and sum the first 50 numbers from standard input while also writing them to a temporary file, and continue by reading and summing the next 50 numbers while writing them to a second file. Finally, in an END clause, loop over the first temporary file with getline and write it to standard output, print the total sum, read the second temporary file the same way and write it to standard output, and call system("rm " file1 " " file2) to remove the temporary files.
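A rough sketch of that idea, reading the numbers from standard input and assuming exactly 100 of them, one per line (the 50/50 split, the sum, and the rand()-based names are only illustrations, not a robust way to create temporary files):
awk 'BEGIN { srand(); f1 = "tmp1." rand(); f2 = "tmp2." rand() }   # two ad-hoc temporary file names
     NR <= 50 { sum += $1; print > f1; next }                      # first half: sum it and buffer it in f1
              { sum += $1; print > f2 }                            # second half: sum it and buffer it in f2
     END { close(f1); close(f2)                                    # finish the writes so getline sees everything
           while ((getline line < f1) > 0) print line              # replay the first part
           print sum                                               # the middle line needs the whole input
           while ((getline line < f2) > 0) print line              # replay the last part
           close(f1); close(f2)
           system("rm " f1 " " f2)                                 # clean up the temporary files
         }'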
If the output file is not too large (whatever that is), saving outputLast in a variable is quite reasonable. The first part, outputFirst, can (as described) be generated on the fly. I tried this approach and it worked fine.
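A compact sketch of that variable-based variant, using the same toy problem as above (the 50-line split and the sum are placeholders, and the input again comes from standard input): outputFirst is printed as it is produced, outputLast is collected in a string, and the middle line is printed from END before the buffered tail.
awk 'NR <= 50 { sum += $1; print; next }          # outputFirst: printed on the fly
              { sum += $1; tail = tail $0 "\n" }  # outputLast: buffered in a variable
     END { print sum                              # outputMiddle: needs the whole input
           printf "%s", tail }'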
Print the "first" output while processing the file, then write the remainder to a temporary file until you have written the middle.
Here is a self-contained shell script which processes its input files and writes to standard output.
#!/bin/sh
t=$(mktemp -t middle.XXXXXXXXX) || exit 127
trap 'rm -f "$t"' EXIT
trap 'exit 126' HUP INT TERM
awk -v temp="$t" 'NR<500000 { print n+1 }        # the "first" output goes straight to stdout
{ s+=$1 }                                        # accumulate the value for the "middle" line
NR>=500000 { print n+1 >>temp }                  # the "last" output is buffered in the temp file
END { print s }' "$@"                            # the "middle" line, printed after the first part
cat "$t"
For illustration purposes, I used really big line numbers. I'm afraid your question is still too vague to really obtain a less general answer, but perhaps this can help you find the right direction.

how can I find the total number of lines in a file and also detect the empty lines by using CGI and Perl

I have a script which reads a text file and prints it. How can I detect the empty lines in the file and ignore them?
Is there any way to find out the total number of lines in the file without running
while (<$file>) {
    $linenumbers++;
}
To print the number of non-empty lines in a file:
perl -le 'print scalar(grep{/./}<>)'

correct way to write to the same file from multiple processes awk

The title says it all.
I have 4 awk processes logging to the same file, and output seems fine, not mangled, but I'm not sure that just redirecting print output like this: print "xxx" >> file in every process is the right way to do it.
There are many similar questions around the site, but this one is particularly about awk and a pragmatic, code-correct way to approach the problem.
EDIT
Sorry folks, of course I wasn't "just redirecting" like I wrote, I was appending.
No it is not safe.
In awk, print "foo" > "file" opens the file and overwrites its content, and keeps it open until the end of the script.
That is, if your 4 awk processes start writing to the same file at different times, they will overwrite each other's results.
To reproduce it, you could start two (or more) awk like this:
awk '{while(++i<9){system("sleep 2");print "p1">"file"}}' <<<"" &
awk '{while(++i<9){system("sleep 2");print "p2">"file"}}' <<<"" &
If you monitor the content of file at the same time, you will see that in the end there are not exactly 8 "p1" and 8 "p2" entries.
Using >> avoids losing entries, but the order of entries coming from the 4 processes can still end up interleaved unpredictably.
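For comparison, the same reproduction with append mode, just a quick sketch to see the difference:
awk '{while(++i<9){system("sleep 2");print "p1">>"file"}}' <<<"" &
awk '{while(++i<9){system("sleep 2");print "p2">>"file"}}' <<<"" &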
EDIT
Ok, the > was a typo.
I don't know why you really need 4 processes writing into the same file. As I said, with >> the entries won't get lost (if your awk scripts work correctly). However, personally I wouldn't do it this way; if I had to have 4 processes, I would write to different files. I don't know your requirements, I'm just speaking in general.
Outputting to different files makes testing and debugging easier; imagine one of your processes has a problem and you want to track it down, and so on.
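A sketch of that separate-files approach (the input files and the "worker" prefixes are made up for illustration): each process appends to its own log, and the logs are combined once everything has finished.
awk '{ print "worker1:", $0 >> "file.1" }' input1.log &
awk '{ print "worker2:", $0 >> "file.2" }' input2.log &
awk '{ print "worker3:", $0 >> "file.3" }' input3.log &
awk '{ print "worker4:", $0 >> "file.4" }' input4.log &
wait                                    # let all four processes finish
cat file.1 file.2 file.3 file.4 > file  # merge into the single file afterwards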
I think using print with append is safe. In effect it adds the string you provide to the file's write buffer, and the system manages the actual writing of the data to disc; if another process wants to use the same file, the system sees that the resource is already claimed and waits for the first process to finish before letting the second one write to the buffer.