Linux: parsing space-delimited log files

I need to parse Apache access log files which have 16 space-delimited columns, that is,
xyz abc ... ... home?querystring
I need to count the total number of hits for each page in that file, that is, the total number of home page hits ignoring the query string.
For some lines the URL is column 16 and for others it is 14 or 15. Hence I need to parse each line in reverse order (get the last column, ignore the query string of the last column, aggregate page hits).
I am new to Linux and shell scripting. How do I approach this? Do I have to look into awk or shell scripting? Can you give a small code sample that would perform such a task?
ANSWER: a perl one-liner solved the problem
perl -lane | scalar array
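The full one-liner is not shown above; a minimal sketch of how a perl -lane one-liner could do this counting (access.log is just a placeholder filename here) might look like this:
perl -lane '($u = $F[-1]) =~ s/\?.*//;   # last field, query string stripped
            $hits{$u}++;                  # count hits per page
            END { print "$_ $hits{$_}" for sort keys %hits }' access.log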

Well for starters, if you are only interested in working on columns 14-16, I would start by running
cut -d\  -f14-16 <input_file.log> | awk '{ one = match($1,/www/)
  two = match($2,/www/)
  three = match($3,/www/)
  if (one)
    print $1
  else if (two)
    print $2
  else if (three)
    print $3
}'
Note: there are two spaces after the d\ (the backslash escapes the first space so that it becomes the delimiter; the second space separates it from -f).
You can then pretty easily just count up the URLs that you see. I also think this would be solved a lot more easily with a few lines of Python or Perl.
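To get the counts, one hedged sketch is to strip the query string in awk and pipe the URLs through sort and uniq -c (the /www/ test and the <input_file.log> placeholder come from the pipeline above; the query-string stripping is an addition):
cut -d\  -f14-16 <input_file.log> |
  awk '{ for (i = 1; i <= NF; i++)   # look at each of the (up to) three columns
           if ($i ~ /www/) {         # the one that looks like a URL
             sub(/\?.*/, "", $i)     # drop the query string
             print $i
           }
       }' |
  sort | uniq -c | sort -rn          # count identical URLs, most hits first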

You can read input line by line using the bash read command:
while read my_variable; do
echo "The text is: $my_variable"
done
To get input from a specific file, use the input redirect <:
while read my_variable; do
echo "The text is: $my_variable"
done < my_logfile
Now, to get the last column, you can use the ${var##* } construction. For example, if the variable my_var is the string some_file_name, then ${my_var##*_} is the same string, but with everything before (and including) the last _ deleted.
We come up with:
while read line; do
echo "The last column is: ${line##* }"
done < my_logfile
If you want to echo it to another file, use the >> redirect:
while read line; do
echo "The last column is: ${line##* }" >> another_file
done < my_logfile
Now, to take away the querystring, you can use the same technique:
while read line; do
last_column="${line##* }"
url="${last_column%%\?*}"
echo "The last column without querystring is: $url" >> another_file
done < my_logfile
This time, we have %%\?* instead of ##*\? because we want to delete everything from the first ? onward, instead of everything up to the last match. (Note that I have escaped the ? character, which is special to bash.) You can read all about it in the bash manual under Parameter Expansion.
I didn't understand where to get the page hits, but I think the main idea is there.
EDIT: Now the code works. I had forgotten the do bash keyword. Also, we need to use >> instead of > in order not to overwrite another_file every time we do echo "..." > another_file. By using >>, we append to the file. I have also corrected ## to %%.
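To also aggregate the page hits the question asks about, here is a minimal sketch building on the same loop (it assumes bash 4 associative arrays; my_logfile is the same placeholder filename as above):
declare -A hits                      # associative array: url -> count
while read -r line; do
  last_column="${line##* }"          # keep only the last column
  url="${last_column%%\?*}"          # strip the query string
  hits["$url"]=$(( ${hits["$url"]:-0} + 1 ))
done < my_logfile
for url in "${!hits[@]}"; do
  echo "$url ${hits[$url]}"          # page and its total number of hits
done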

It's hard to say without a few lines of concrete sample input and expected output, but it sounds like all you need is:
awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
For example:
$ cat file
xyz abc ... ... http://www.google.com?querystring
xyz abc ... ... some other http://www.google.com?querystring1
xyz abc ... some stuff we ignore http://yahoo.com?querystring1
$
$ awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
http://www.google.com 2
http://yahoo.com 1

Related

filter unique parameters from file

I have a file containing URLs plus params like the following:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF
and I need to filter only the URLs with unique params.
The desired output:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
I managed to filter only the URLs with params with grep:
grep -E '(\?[a-zA-Z0-9]{1,9}\=)'
but I need to filter the params at the same time, so I tried awk with the same regex,
but it gives an error:
awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'
Update
I am sorry for editing the desired output, but when I tried the scripts I figured out that there is a lot of garbage in my file that needs to be filtered too.
I tried @James Brown's answer with some editing and it looks good, but unfortunately it does not filter the last line:
awk -F '?|&' '$2&&!a[$2]++'
And to be more clear about why that output is good for me:
it chose the 1st line because it has at least param1,
the 2nd line because it has at least param3,
the 3rd line because it has at least param2.
The comparison method here is to choose just the unique parameter, whether or not it is concatenated with others via the & char.
Edited version after the requirements changed some:
$ awk -F? '{ # ? as field delimiter
split($2,b,/&/) # split at & to get what's between ? and &
if(b[1]!=""&&!a[b[1]]++) # no ? means no $2
print
}' file
Output as expected. Original answer was:
A short one:
$ awk -F? '$2&&!a[$2]++' file
Explained: Split records at ? (-F?) and if there is a second field ($2) and (&&) it is unique this far by counting the instances of the parameters in the array a (!a[$2]++), output it.
EDIT: The following solution may help when the query string has ? as well as & present in it and we want to consider both of them for removing duplicates.
awk '
/\?/{
match($0,/\?[^&]*/)
val=substr($0,RSTART,RLENGTH)
match($0,/&.*/)
if(!seen[val]++ && !seen[substr($0,RSTART,RLENGTH)]++){
print
}
}' Input_file
2nd solution (the following may help when we don't have & parameters in the query string): with your shown samples, please try the following awk program.
awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file
OR the above could be shortened as follows (as per Ed sir's suggestion):
awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file
Explanation: the simple explanation would be: use awk's match function, which matches everything from the ? to the end of the line. Then add an AND condition to it to make sure we only get unique values out of all matched values across all lines.
With GNU awk, you could also match the URL up to the first occurrence of the question mark, and then capture what follows using your initial pattern for the first parameter ([a-zA-Z0-9]{1,9}=[^&]+), followed by matching any character except the &.
Then you can use the !seen[$0]++ part with the value of capture group 1.
awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file
Output
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
Using awk you can check that the string starts with the protocol and contains a question mark.
Then to get the first parameter only, you can split on ? and & and use the second part of the split for seen
awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file

How to use awk to count the occurrence of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for the little snapshot, it should be 2. After the RQ the numbers differ, so I want a count of anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
Once Field 2 starts with RQ, increment cnt and once the file is processed print cnt.
You did
{if(match("^RQ$",$2))print}
but the compulsory arguments to the match function are string, regexp (in that order). Also, do not use $ if you are interested in finding strings that start with something, as $ anchors the end. After fixing those issues, the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer solely describes how to fix the problems in your current code; it does not contain any ways to otherwise improve it.
Also apart from the reversed parameters for match, the file ID.txt should come right after the closing single quote.
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt

Comparing column of two files

I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second, and I want a printout on any mismatch, regardless of whether the value was present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one Syntax Error thanks to comment by RavinderSingh13]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I don't get any printout. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
For a line-by-line comparison, it's easier to match the records first:
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
will print values for which the first fields don't agree.
With your input files, this will give
2: 4238 4237
3: 4239 4238
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
does not produce an array entry per line, but an array entry per distinct ID. Hence, such an array is no basis for a general strict comparison of the files. To remedy this, I wrote the following, which works on the example but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1

Add a field to the current record before processing in awk

I want to use a large awk script that was designed to take a particular input. For example "city zipcode street housenumber", so $2 is zipcode, etc...
Now the input is provided to me in a new format. In my example "city" is now missing. The new file is "zipcode street housenumber" (not for real, just trying to make the example simple)
but I happen to know that the city is a constant for that input (which is why it's not in the dataset). So if I run it through the original script $2 is now street, and everything is one field off.
I could first process the input file to prepend the city name to each line (using awk, sed, or whatever), then run it through the original script, but I would prefer to run only one script that supports both formats. I could add a command-line option that tells it the city, but I don't know how to insert it in front of the current record at the top of the script so that the rest of the script can be unchanged. It looks like I can change a field but what I want to do is "shift" the fields right so I can modify $1.
Did I mention I am a complete awk novice? (Perl is my poison.)
I think I fixed my own problem; I'm doing the following (I haven't figured out how to do this conditionally based on a command-line option, but it should be easy to find tutorials for that):
NF+=1;
for(i=NF; i>1; --i) $(i)=$(i-1);
$1="Vancouver";
I had the loop wrong in my comment above, but the basic idea of manipulating NF and copying fields into each other seems to work.
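For the conditional, command-line part, a minimal sketch (city, Vancouver, input_file and the print $2 stand-in for the original script are all placeholders) could pass the constant with -v and only shift when it is set:
awk -v city="Vancouver" '
city != "" {                       # only shift when a city was passed with -v
    for (i = NF + 1; i > 1; i--)   # move every field one position to the right
        $i = $(i - 1)
    $1 = city                      # put the constant city in front
}
{ print $2 }                       # ...rest of the original script, e.g. $2 = zipcode
' input_file
Leaving out the -v assignment (or passing an empty string) leaves the records untouched, so the same script still handles the original format.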
Something along these lines should do it. First, some test data:
$ cat file
1 2 3 4
2 3 4
The awk:
$ awk -v c=V '{ # define external var
if(NF==3) # if record has only three fields
$0=c FS $0 # prepend the var to the record
print $1 # print first field
}' file
Output:
1
V

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
If you're fine with zero-based indices, this may be simpler:
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
This assumes you have the header row as posted; if not, remove the sed command. Note that this assumes single-byte chars as shown. For extended charsets it won't be the char position but the byte offset.