Apache log file: show IPs that access us 24/7 - apache

In my quest to catch robots, I would like to find a way to extract IP addresses from our Apache access logs which have requests in every hour of a 24 hour period.
For example this will show me a list and a count of requests per IPs that requested /bla/
find . -name logfile.*00 -print0 | xargs -0 fgrep -h '"GET /bla/' | awk '{print $NF}' | sort | uniq -c
Would there be some kind of awk that would tell me which IPs are present in all hours of the day?
Alternatively some free weblog analyser that could do the same.
Some information about the situation:
Our robot.txt is blocking all non-static files but is ignored by the majority of our abusers
At the moment I am only interested in some tool or awk statement that can give me a list of IPs that access us 24/7 since a normal user will access us 6-9 hours per day but from different timezones.
we already have several methods to detect and blacklist IPs and ranges of IPs but I want to see how the spread are on robots that are just turned on and runs non-stop.
The awk statement above is providing
Req IP
3234 111.222.333.444
234 222.222.333.444
5234 333.222.333.444
and I am looking for
IP Hrs
111.222.333.444 24
222.222.333.444 24
333.222.333.444 24
or better:
IP Hrs Req
111.222.333.444 24 3234
222.222.333.444 24 234
333.222.333.444 24 5234

I keep recommending Piwik found at http://piwik.org/. It is one of the best log file analysis tools out there and it is free! This is an excellent tool!!
You will find that is it not access over 24 hours. Google and Bing fall into these categories. You will want to be looking for a few things:
Did they access an image?
Did they access robots.txt?
Are they accessing your site at a reasonable speed? Is it human or machine?
Are they accessing a normal/reasonable amount of pages? Is it human or machine?
There are more factors, but this is enough. You will be able to quickly determine human from machine very quickly with just these factors.
What you do not want to do is make too many assumptions. For example, throw out the concept of using any agent name as some kind of indication. It is junk data.
What you need to do is research domain names and IP addresses and begin to know who these people (lack of a better term) are. Of course there are the obvious such as Google, Bing, Yandex, and Baidu. Some will be legitimate SEO sites such as MOZ, Ahrefs, or Open Site Explorer. You may want to allow accesses for these. However, there are lot of SEO scraper sites and content scraper sites. You will find some accesses from China, Russia, even Poland. These are often junk. You may even see competitors using Screaming Frog to discover how your site competes with keyword usage. And of course, let us not forget the script-kiddies that are trying to landscape and hack your site. It will take time of course to get a feel for who is good for our site. However, the discovery of abusers will never end.
You will want to block bad accesses as far away from your web server as possible. This means using a hardware firewall. But if that is not an option for you, you will want to explore the various software firewall options and possibly use ModSecurity and other tools to protect your site. Of course there is always the .htaccess file (assuming Apache).
This is a common practice that webmasters have to perform daily. It is just the reality of things. If you are not sure of an IP address or domain name or even an access pattern, just post it here and I will evaluate it for you and try and help. I study these things as a smaller subject area of my research. I do not always have answers, but most of the time I have one at hand ready to go and of course I can always do some research.

I decided to solve it brute force - there are simply too many IPs to handle.
Just ran this in 24 steps and extracted the IPs that were in all files.
Added bonus I got to see how many requests they managed to do a day.
find . -name access_log.*00 -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:00" | awk '{print $NF}' | sort | uniq -c > $HOME/unique13_00.txt;
find . -name access_log.*00 -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:01" | awk '{print $NF}' | sort | uniq -c > $HOME/unique13_01.txt;
find . -name access_log.*00 -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:02" | awk '{print $NF}' | sort | uniq -c > $HOME/unique13_02.txt;
.
.

Related

Capture and parse output of Whateverable bots

Since that is the standard way to present output in the Perl 6 documentation, I'm using the whateverable bots to evaluate expressions via the #perl6 IRC channel or the #whateverable channel. Produced output is something like this:
10:28:19 jmerelo | p6: say 333444777 ~~ /(3+)/ │
10:28:19 evalable6 | jmerelo, rakudo-moar 5ce24929f: OUTPUT: «「333」␤ 0 => 「333」␤»
(in the WeeChat console program). From that output, I cut and paste to the document, erasing the parts I'm not interested in.
I was wondering if there was some easy way to parse and save that output directly, either server-based (some Whateverable bots save to gists, for instance), or client-based via scriptint the irssi or weechat platform.
I think the most convenient solution in this case would be to bypass irc bots and define a bash function. Something like this:
d6() { echo -n '# OUTPUT: «'; perl6 -e "$1" | sed -z 's/\n/␤/g'; echo '»'; }
Then you can use it like this:
d6 'say 42'
Which will produce this output:
# OUTPUT: «42␤»
Of course, you'd need a different solution for other operating systems.
As a bonus, you can also put it into the clipboard automatically:
d6 'say 42' | tee >(xclip -selection clipboard)

NSLookup Script to update hosts file once a week.

First time poster so my apologies if this has been covered in a previous topic that I was unable to locate. Basically I'm tasked with creating a script to perform a NSLookup on 50 domain names, format the results and pass them to the hosts file. I'll worry about checking and overwriting duplicate entries later.
Example:
Input: nslookup www.cbc.ca
Result:
Name: a1849.gc.akamai.net
Addresses: 184.50.238.64, 184.50.238.89
Aliases: www.cbc.ca, www.cbc.ca.edgesuite.net
Eventual Output: #184.50.238.64 www.cbc.ca a1849.gc.akamai.net
I figured this was possible with grep, awk and sed but have been messing about with switches and haven't gotten the right combination (mostly cause I'm not the most learned when it comes to regular expressions.) I'm partial to vbs, batch, cmd suggestions as well.
Thanks in advance for the time and effort! :)
nslookup $NAME | awk -v name="$NAME" 'BEGIN{hit=0; addr=""; alias=""} /answer:/{hit=1} /^Address:/{if (hit == 1 && "" == addr) addr=$2} /^Name:/{alias=alias " " $2} END{print(addr, name, alias)}'
Only one address and would not solve multiple identical names like nslookup google.com...

How can I get the disk IO trace with actual input values?

I want to generate some trace file from disk IO, but the problem is I need the actual input data along with timestamp, logical address and access block size, etc.
I've been trying to solve the problem by using the "blktrace | blkparse" with "iozone" on the ubuntu VirtualBox environment, but it seems not working.
There is an option in blkparse for setting the output format to show the packet data, -f "%P", but it dose not print anything.
below is the command that i use:
$> sudo blktrace -a issue -d /dev/sda -o - | blkparse -i - -o ./temp/blktrace.sda.iozone -f "%-12C\t\t%p\t%d\t%S:%n:%N\t\t%P\n"
$> iozone -w -e -s 16M -f ./mnt/iozone.dummy -i 0
In the printing format "%-12C\t\t%p\t%d\t%S:%n:%N\t\t%P\n", all other things are printed well, but the "%P" is not printed at all.
Is there anyone who knows why the packet data is not displayed?
OR anyone who knows other way to get the disk IO packet data with actual input value?
As far as I know blktrace does not capture the actual data. It just capture the metadata. One way to capture real data is to write your own kernel module. Some students at FIU.edu did that in this paper:
"I/O deduplication: Utilizing content similarity to ..."
I would ask this question in linux-btrace mailing list as well:
http://vger.kernel.org/majordomo-info.html

What is the easiest way to find the biggest objects in Redis?

I have a 20GB+ rdb dump in production.
I suspect there's a specific set of keys bloating it.
I'd like to have a way to always spot the first 100 biggest objects from static dump analysis or ask it to the server itself, which by the way has ove 7M objects.
Dump analysis tools like rdbtools are not helpful in this (I think) really common use case!
I was thinking to write a script and iterate the whole keyset with "redis-cli debug object", but I have the feeling there must be some tool I'm missing.
An option was added to redis-cli: redis-cli --bigkeys
Sample output based on https://gist.github.com/michael-grunder/9257326
$ ./redis-cli --bigkeys
# Press ctrl+c when you have had enough of it... :)
# You can use -i 0.1 to sleep 0.1 sec every 100 sampled keys
# in order to reduce server load (usually not needed).
Biggest string so far: day:uv:483:1201737600, size: 2
Biggest string so far: day:pv:2013:1315267200, size: 3
Biggest string so far: day:pv:3:1290297600, size: 5
Biggest zset so far: day:topref:2734:1289433600, size: 3
Biggest zset so far: day:topkw:2236:1318723200, size: 7
Biggest zset so far: day:topref:651:1320364800, size: 20
Biggest string so far: uid:3467:auth, size: 32
Biggest set so far: uid:3029:allowed, size: 1
Biggest list so far: last:175, size: 51
-------- summary -------
Sampled 329 keys in the keyspace!
Total key length in bytes is 15172 (avg len 46.12)
Biggest list found 'day:uv:483:1201737600' has 5235597 items
Biggest set found 'day:uvx:555:1201737600' has 47 members
Biggest hash found 'day:uvy:131:1201737600' has 2888 fields
Biggest zset found 'day:uvz:777:1201737600' has 1000 members
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
19 lists with 5236744 items (05.78% of keys, avg size 275618.11)
50 sets with 112 members (15.20% of keys, avg size 2.24)
250 hashs with 6915 fields (75.99% of keys, avg size 27.66)
10 zsets with 1294 members (03.04% of keys, avg size 129.40)
redis-rdb-tools does have a memory report that does exactly what you need. It generates a CSV file with memory used by every key. You can then sort it and find the Top x keys.
There is also an experimental memory profiler that started to do what you need. Its not yet complete, and so isn't documented. But you can try it - https://github.com/sripathikrishnan/redis-rdb-tools/tree/master/rdbtools/cli. And of course, I'd encourage you to contribute as well!
Disclaimer: I am the author of this tool.
I am pretty new to bash scripting. I came out with this:
for line in $(redis-cli keys '*' | awk '{print $1}'); do echo `redis-cli DEBUG OBJECT $line | awk '{print $5}' | sed 's/serializedlength://g'` $line; done; | sort -h
This script
Lists all the key with redis-cli keys "*"
Gets size with redis-cli DEBUG OBJECT
sorts the script based on the name prepend with the size
This may be very slow due to the fact that bash is looping through every single redis key. You have 7m keys you may need to cache the out put of the keys to a file.
If you have keys that follow this pattern "A:B" or "A:B:*", I wrote a tool that analyzes both existing content as well as monitors for things such as hit rate, number of gets/sets, network traffic, lifetime, etc. The output is similar to the one below.
https://github.com/alexdicianu/redis_toolkit
$ ./redis-toolkit report -type memory -name NAME
+----------------------------------------+----------+-----------+----------+
| KEY | NR KEYS | SIZE (MB) | SIZE (%) |
+----------------------------------------+----------+-----------+----------+
| posts:* | 500 | 0.56 | 2.79 |
| post_meta:* | 440 | 18.48 | 92.78 |
| terms:* | 192 | 0.12 | 0.63 |
| options:* | 109 | 0.52 | 2.59 |
Try redis-memory-analyzer - a console tool to scan Redis key space in real time and aggregate memory usage statistic by key patterns. You may use this tools without maintenance on production servers. It shows you detailed statistics about each key pattern in your Redis serve.
Also you can scan Redis db by all or selected Redis types such as "string", "hash", "list", "set", "zset". Matching pattern also supported.
RMA also try to discern key names by patterns, for example if you have keys like 'user:100' and 'user:101' application would pick out common pattern 'user:*' in output so you can analyze most memory distressed data in your instance.

Nano hacks: most useful tiny programs you've coded or come across

It's the first great virtue of programmers. All of us have, at one time or another automated a task with a bit of throw-away code. Sometimes it takes a couple seconds tapping out a one-liner, sometimes we spend an exorbitant amount of time automating away a two-second task and then never use it again.
What tiny hack have you found useful enough to reuse? To make go so far as to make an alias for?
Note: before answering, please check to make sure it's not already on favourite command-line tricks using BASH or perl/ruby one-liner questions.
i found this on dotfiles.org just today. it's very simple, but clever. i felt stupid for not having thought of it myself.
###
### Handy Extract Program
###
extract () {
if [ -f $1 ] ; then
case $1 in
*.tar.bz2) tar xvjf $1 ;;
*.tar.gz) tar xvzf $1 ;;
*.bz2) bunzip2 $1 ;;
*.rar) unrar x $1 ;;
*.gz) gunzip $1 ;;
*.tar) tar xvf $1 ;;
*.tbz2) tar xvjf $1 ;;
*.tgz) tar xvzf $1 ;;
*.zip) unzip $1 ;;
*.Z) uncompress $1 ;;
*.7z) 7z x $1 ;;
*) echo "'$1' cannot be extracted via >extract<" ;;
esac
else
echo "'$1' is not a valid file"
fi
}
Here's a filter that puts commas in the middle of any large numbers in standard input.
$ cat ~/bin/comma
#!/usr/bin/perl -p
s/(\d{4,})/commify($1)/ge;
sub commify {
local $_ = shift;
1 while s/^([ -+]?\d+)(\d{3})/$1,$2/;
return $_;
}
I usually wind up using it for long output lists of big numbers, and I tire of counting decimal places. Now instead of seeing
-rw-r--r-- 1 alester alester 2244487404 Oct 6 15:38 listdetail.sql
I can run that as ls -l | comma and see
-rw-r--r-- 1 alester alester 2,244,487,404 Oct 6 15:38 listdetail.sql
This script saved my career!
Quite a few years ago, i was working remotely on a client database. I updated a shipment to change its status. But I forgot the where clause.
I'll never forget the feeling in the pit of my stomach when I saw (6834 rows affected). I basically spent the entire night going through event logs and figuring out the proper status on all those shipments. Crap!
So I wrote a script (originally in awk) that would start a transaction for any updates, and check the rows affected before committing. This prevented any surprises.
So now I never do updates from command line without going through a script like this. Here it is (now in Python):
import sys
import subprocess as sp
pgm = "isql"
if len(sys.argv) == 1:
print "Usage: \nsql sql-string [rows-affected]"
sys.exit()
sql_str = sys.argv[1].upper()
max_rows_affected = 3
if len(sys.argv) > 2:
max_rows_affected = int(sys.argv[2])
if sql_str.startswith("UPDATE"):
sql_str = "BEGIN TRANSACTION\\n" + sql_str
p1 = sp.Popen([pgm, sql_str],stdout=sp.PIPE,
shell=True)
(stdout, stderr) = p1.communicate()
print stdout
# example -> (33 rows affected)
affected = stdout.splitlines()[-1]
affected = affected.split()[0].lstrip('(')
num_affected = int(affected)
if num_affected > max_rows_affected:
print "WARNING! ", num_affected,"rows were affected, rolling back..."
sql_str = "ROLLBACK TRANSACTION"
ret_code = sp.call([pgm, sql_str], shell=True)
else:
sql_str = "COMMIT TRANSACTION"
ret_code = sp.call([pgm, sql_str], shell=True)
else:
ret_code = sp.call([pgm, sql_str], shell=True)
I use this script under assorted linuxes to check whether a directory copy between machines (or to CD/DVD) worked or whether copying (e.g. ext3 utf8 filenames -> fusebl
k) has mangled special characters in the filenames.
#!/bin/bash
## dsum Do checksums recursively over a directory.
## Typical usage: dsum <directory> > outfile
export LC_ALL=C # Optional - use sort order across different locales
if [ $# != 1 ]; then echo "Usage: ${0/*\//} <directory>" 1>&2; exit; fi
cd $1 1>&2 || exit
#findargs=-follow # Uncomment to follow symbolic links
find . $findargs -type f | sort | xargs -d'\n' cksum
Sorry, don't have the exact code handy, but I coded a regular expression for searching source code in VS.Net that allowed me to search anything not in comments. It came in very useful in a particular project I was working on, where people insisted that commenting out code was good practice, in case you wanted to go back and see what the code used to do.
I have two ruby scripts that I modify regularly to download all of various webcomics. Extremely handy! Note: They require wget, so probably linux. Note2: read these before you try them, they need a little bit of modification for each site.
Date based downloader:
#!/usr/bin/ruby -w
Day = 60 * 60 * 24
Fromat = "hjlsdahjsd/comics/st%Y%m%d.gif"
t = Time.local(2005, 2, 5)
MWF = [1,3,5]
until t == Time.local(2007, 7, 9)
if MWF.include? t.wday
`wget #{t.strftime(Fromat)}`
sleep 3
end
t += Day
end
Or you can use the number based one:
#!/usr/bin/ruby -w
Fromat = "http://fdsafdsa/comics/%08d.gif"
1.upto(986) do |i|
`wget #{sprintf(Fromat, i)}`
sleep 1
end
Instead of having to repeatedly open files in SQL Query Analyser and run them, I found the syntax needed to make a batch file, and could then run 100 at once. Oh the sweet sweet joy! I've used this ever since.
isqlw -S servername -d dbname -E -i F:\blah\whatever.sql -o F:\results.txt
This goes back to my COBOL days but I had two generic COBOL programs, one batch and one online (mainframe folks will know what these are). They were shells of a program that could take any set of parameters and/or files and be run, batch or executed in an IMS test region. I had them set up so that depending on the parameters I could access files, databases(DB2 or IMS DB) and or just manipulate working storage or whatever.
It was great because I could test that date function without guessing or test why there was truncation or why there was a database ABEND. The programs grew in size as time went on to include all sorts of tests and become a staple of the development group. Everyone knew where the code resided and included them in their unit testing as well. Those programs got so large (most of the code were commented out tests) and it was all contributed by people through the years. They saved so much time and settled so many disagreements!
I coded a Perl script to map dependencies, without going into an endless loop, For a legacy C program I inherited .... that also had a diamond dependency problem.
I wrote small program that e-mailed me when I received e-mails from friends, on an rarely used e-mail account.
I wrote another small program that sent me text messages if my home IP changes.
To name a few.
Years ago I built a suite of applications on a custom web application platform in PERL.
One cool feature was to convert SQL query strings into human readable sentences that described what the results were.
The code was relatively short but the end effect was nice.
I've got a little app that you run and it dumps a GUID into the clipboard. You can run it /noui or not. With UI, its a single button that drops a new GUID every time you click it. Without it drops a new one and then exits.
I mostly use it from within VS. I have it as an external app and mapped to a shortcut. I'm writing an app that relies heavily on xaml and guids, so I always find I need to paste a new guid into xaml...
Any time I write a clever list comprehension or use of map/reduce in python. There was one like this:
if reduce(lambda x, c: locks[x] and c, locknames, True):
print "Sub-threads terminated!"
The reason I remember that is that I came up with it myself, then saw the exact same code on somebody else's website. Now-adays it'd probably be done like:
if all(map(lambda z: locks[z], locknames)):
print "ya trik"
I've got 20 or 30 of these things lying around because once I coded up the framework for my standard console app in windows I can pretty much drop in any logic I want, so I got a lot of these little things that solve specific problems.
I guess the ones I'm using a lot right now is a console app that takes stdin and colorizes the output based on xml profiles that match regular expressions to colors. I use it for watching my log files from builds. The other one is a command line launcher so I don't pollute my PATH env var and it would exceed the limit on some systems anyway, namely win2k.
I'm constantly connecting to various linux servers from my own desktop throughout my workday, so I created a few aliases that will launch an xterm on those machines and set the title, background color, and other tweaks:
alias x="xterm" # local
alias xd="ssh -Xf me#development_host xterm -bg aliceblue -ls -sb -bc -geometry 100x30 -title Development"
alias xp="ssh -Xf me#production_host xterm -bg thistle1 ..."
I have a bunch of servers I frequently connect to, as well, but they're all on my local network. This Ruby script prints out the command to create aliases for any machine with ssh open:
#!/usr/bin/env ruby
require 'rubygems'
require 'dnssd'
handle = DNSSD.browse('_ssh._tcp') do |reply|
print "alias #{reply.name}='ssh #{reply.name}.#{reply.domain}';"
end
sleep 1
handle.stop
Use it like this in your .bash_profile:
eval `ruby ~/.alias_shares`