Logstash count by unique IP - apache

I'm trying to do some log analysis with Logstash.
I need to count unique IPs from an Apache access log, then match them with a count filter to determine whether an email should be sent.
Something like this:
If 10+ accesses from a unique IP are found in a 5-minute interval, then I need to send an email with that IP in it.
What would be the best solution for this?

Doing this is surprisingly hard: you need to create a meter per IP address. Once you have a meter per IP address, you then need to look at its rate_5m and decide whether it's over your threshold (note that rate_5m is the per-second rate over the last 5 minutes, so 10 hits in 5 minutes works out to roughly 0.033). Once you've decided that you need to send off the alert, you'll probably want to include the IP address in it (so we need to extract that using a ruby filter). All in all, I'm not sure I'd ever use something like this in production, because it would likely chew up memory like crazy (one meter per IP address).
filter {
  metrics {
    meter => "%{ip}"
    add_tag => ["metric"]
  }
  ruby {
    code => '
      ip = nil
      if event["tags"].include? "metric"
        event.to_hash.each do |key, value|
          if key.end_with?(".rate_5m") and value > 0.2
            ip = key[0..-9]
          end
        end
      end
      if ip
        event["ip"] = ip
        event["tags"] = ["alert"]
      end
    '
  }
}
output {
  if "alert" in [tags] {
    email { ... }
  }
}
You could probably write a custom filter that is smarter about this, using something like a trending algorithm to find IP addresses that are trending higher in count.
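Alternatively, a single ruby filter can keep a per-IP sliding window itself. This is only a rough sketch under assumptions: the clientip field name, a 300-second window, the old event["field"] API, and no thread-safety guard (with multiple pipeline workers you would need one).

filter {
  ruby {
    init => '
      # ip => array of hit timestamps seen inside the current window
      @hits = Hash.new { |h, k| h[k] = [] }
    '
    code => '
      ip = event["clientip"]
      if ip
        now = Time.now.to_i
        @hits[ip] << now
        @hits[ip].reject! { |t| now - t > 300 }   # keep only the last 5 minutes
        if @hits[ip].size >= 10
          event["tags"] = (event["tags"] || []) + ["alert"]
          @hits[ip].clear                          # do not re-alert on every following hit
        end
      end
    '
  }
}

The same output block as above (if "alert" in [tags]) would then pick these events up. Memory still grows with the number of distinct IPs, so the production caveat applies here as well.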

Splunk Host header overrides host key from log messages

How can I stop Splunk from treating the built-in "host" field as more important than the "host" key from my log messages?
Let's suppose that I have the following logs:
color = red ; host = localhost
color = blue ; host = newhost
The following query works fine:
index=myindex | stats count by color
but the following doesn't:
index=myindex | stats count by host
because instead of treating "host" as the key from the log, it takes the Host header as "host".
How can I deal with this?
When there are two fields with the same name, one of them has to "win". In this case, it's the one Splunk defines before it processes the event itself. As you probably know, every event is given four fields at input time: index, host, source, and sourcetype. Data from the event won't override these unless specifically told to do so in the config files.
To override the settings, put this in your transforms.conf file
[sethost]
REGEX = host\s*=\s*(\w+)
DEST_KEY = MetaData:Host
FORMAT = host::$1
You'll also need to reference the transform in your props.conf file
[mysourcetype]
TRANSFORMS-host = sethost
I would have thought this solution would be more prominent, but I found it buried deep in the Splunk docs.
https://docs.splunk.com/Documentation/Splunk/8.2.6/Metrics/Search
You can use reserved fields such as "source", "sourcetype", or "host" as dimensions. However, when extracted dimension names are reserved names, the name is prefixed with "extracted_" to avoid name collision. For example, if a dimension name is "host", search for "extracted_host" to find it.
So, in your case:
index=myindex | stats count by extracted_host

List of addresses for gambling in Bitcoin

I'd like to analyze the gambling activities in Bitcoin.
Does anyone has a list of addresses for gambling services such as SatoshiDICE and LuckyBit?
For example, I found addresses of SatoshiDICE here.
https://www.satoshidice.com/Bets.php
My suggestion would be to go and look for a list of popular addresses, i.e., addresses that received and/or sent a lot of transactions. Most gambling sites will use vanity addresses that include part of the site's name in the address, so you might also just search in the addresses for similar patterns.
It's rather easy to build such a list using Rusty Russell's bitcoin-iterate if you have a synced full node:
bitcoin-iterate --output "%os" -q > outputscripts.csv
This will get you a list of all output scripts in confirmed transactions in the blockchain. The output scripts include the pubkey hash that is also encoded in the address.
Let's keep only the P2PKH scripts of the form 76a914<pubkey-hash>88ac
grep -E '^76a914.*88ac$' outputscripts.csv > p2pkhoutputs.csv
Just for reference, 90.03% (484715631/538368714) of outputs are P2PKH scripts, so we should be getting pretty accurate results.
Now let's count how many times each output script occurs:
sort p2pkhoutputs.csv | uniq -c | sort -g > uniqoutputscripts.csv
And finally let's convert the scripts to addresses. We'll need to do the Base58Check encoding, and I chose the Python base58 library:
from base58 import b58encode_check

def script2address(s):
    # Python 2 style hex decoding: keep only the 20-byte pubkey hash by dropping the
    # 3-byte OP_DUP OP_HASH160 <push> prefix and the trailing OP_EQUALVERIFY OP_CHECKSIG
    h = s.decode('hex')[3:23]
    # prepend the 0x00 version byte used for mainnet P2PKH addresses
    h = chr(0) + h
    return b58encode_check(h)
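To run that over the counted scripts, a small driver along these lines should work (the two-column count/script layout produced by uniq -c is assumed):

# print "count, address" for every line of uniqoutputscripts.csv
with open("uniqoutputscripts.csv") as f:
    for line in f:
        count, script = line.split()
        print("{}, {}".format(count, script2address(script)))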
For details on how addresses are generated please refer to the Bitcoin wiki. And here we have the top 10 addresses sorted by incoming transactions:
1880739, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4
1601154, 1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp
1194169, 1LuckyR1fFHEsXYyx5QK4UFzv3PEAepPMK
1105378, 1dice97ECuByXAvqXpaYzSaQuPVvrtmz6
595846, 1dice9wcMu5hLF4g81u8nioL5mmSHTApw
437631, 1dice7fUkz5h4z2wPc1wLMPWgB5mDwKDx
405960, 1MPxhNkSzeTNTHSZAibMaS8HS1esmUL1ne
395661, 1dice7W2AicHosf5EL3GFDUVga7TgtPFn
383849, 1LuckyY9fRzcJre7aou7ZhWVXktxjjBb9S
As you can see, SatoshiDice and LuckyBit are very much present in the set. Grepping for the vanity address prefixes unearths a lot more addresses too.
I would suggest using the usual chain-analysis approach: send money to these services and note the addresses they use. Then compute transitive, symmetric, etc. closures over those addresses in the blockchain transaction graph to get all the addresses in their wallet.
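One common way to approximate that closure is the common-input-ownership heuristic: addresses spent together as inputs of the same transaction are clustered into one wallet. A minimal union-find sketch (the transaction input format here is an assumption, not any real API):

# union-find over addresses: inputs spent together in one transaction
# are assumed to belong to the same wallet
parent = {}

def find(a):
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path compression
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

def cluster(transactions):
    # transactions: iterable of per-transaction lists of input addresses
    for inputs in transactions:
        for addr in inputs[1:]:
            union(inputs[0], addr)

# after cluster(...), find(known_gambling_address) labels the whole cluster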
That said, no technique can determine the addresses in a wallet if the user is intelligent enough to mix properly.

Forward only a single group using the WirePattern helper when config.group is true

I'm trying to use the WirePattern helper to perform some synchronisation within my graph. I'm setting config.group to true so I can ensure that only packets received with the same group are collected and handled within this component.
For the sake of argument, here is an example packet from the first in-port:
<my-group>
123
</my-group>
And the second in-port:
<my-group>
456
</my-group>
Because config.group is set to true, these 2 packets will match by group and I can do something with them in my component. So far so good.
The problem lies in that I want to wrap the output with the same group that the 2 in-ports were matched by. This is what the out packet group should look like:
<my-group>
123456
</my-group>
I assumed config.group would do this by default, but it doesn't; it just sends the output with no group:
123456
I tried setting config.forwardGroups to various values in an effort to forward the group from only one of the in-ports (seeing as they're identical). Regardless of whether this is set to true, "portname" or ["portname"], it double-wraps the out packet:
<my-group>
<my-group>
123456
</my-group>
</my-group>
This causes headaches further down the line as the grouping has changed and no longer matches up with the other components. I could manually remove one of the groups using another component, but I shouldn't have to do that.
How can I set up the WirePattern to continue matching by group (using config.group) but only forward a single group to the out port?
I don't mind doing it manually for now if this is something that the WirePattern doesn't support. I just need to know whether I'm doing something wrong, or whether it's just not possible in NoFlo yet.
Here's my config for reference:
var config = {
  in: ["in", "value"],
  params: ["property"],
  out: "out",
  // This doesn't forward the group
  group: true, // Wait for packets of same group
  // This duplicates groups when group: true
  forwardGroups: ["value"],
  arrayPolicy: {
    in: "all", // Wait for all indexes
    params: "all" // Wait for all indexes
  }
};
This looks like a bug to me, as I remember enforcing group uniqueness upon forwarding. I've opened https://github.com/noflo/noflo/issues/269 and will fix it by the next NoFlo release.
For now, another workaround would be: don't use the forwardGroups feature, but rather send groups manually to the output inside the process handler (which is absolutely legal when using WirePattern too):
out.beginGroup(groups[0]);
out.send(input.in + input.value);
out.endGroup();
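For completeness, here is roughly how that handler could sit inside a component. The port declarations and the noflo.helpers.WirePattern call are assumptions based on the 0.5.x-era API, and the config mirrors the one from the question minus forwardGroups:

var noflo = require('noflo');

exports.getComponent = function () {
  var c = new noflo.Component();
  c.inPorts.add('in', { datatype: 'all' });
  c.inPorts.add('value', { datatype: 'all' });
  c.inPorts.add('property', { datatype: 'all' });
  c.outPorts.add('out', { datatype: 'all' });

  var config = {
    in: ['in', 'value'],
    params: ['property'],
    out: 'out',
    group: true // match packets by group, forward the group manually below
  };

  return noflo.helpers.WirePattern(c, config, function (input, groups, out) {
    // re-wrap the output in the group the two inputs were matched on
    out.beginGroup(groups[0]);
    out.send(input.in + input.value);
    out.endGroup();
  });
};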

Resetting a Bacon property on value changed to empty

TL;DR
How can I reset emailProfile/aliasProfile when email/alias is cleared after having a value?
Slightly longer version
I have a form that has inputs for email and alias. Neither is mandatory. But, if you fill in the alias field, it might require the email as well, if the alias is reserved.
So far so good, I have the pipe setup from an empty form, up until checking if an alias is reserved and whether the given email matches up. This works correctly and reliably.
Where my setup falters is when, after filling in a correct e-mail, I clear the email field. The value of emailProfile stays as it was (the last server response).
What I want to achieve is to clear emailProfile when email has no value (or actually when validEmail is false), but in all other cases return the last server response.
The only direct way I can think of to tackle the problem would be to drop the filter and return null from the lookup function when validation fails, but there has to be a better way, right?
// Functions that can be assumed to work as they should (they do):
// form.findInput, validAlias, validEmail, compose,
// fetchProfilesByAlias, fetchProfileByEmail
var alias = Bacon.fromEventTarget(form.findInput("alias"), "change").
  merge(
    Bacon.fromEventTarget(form.findInput("alias"), "keyup")
  ).
  map(".target").
  map(".value").
  skipDuplicates().
  toProperty(form.findInput("alias").value);

var email = Bacon.fromEventTarget(form.findInput("email"), "change").
  merge(
    Bacon.fromEventTarget(form.findInput("email"), "keyup")
  ).
  map(".target").
  map(".value").
  skipDuplicates().
  toProperty(form.findInput("email").value);

var aliasProfiles = alias.
  debounce(600).
  filter(validAlias).
  flatMapLatest(compose(Bacon.fromPromise.bind(Bacon), fetchProfilesByAlias)).
  toProperty(null);

var emailProfile = email.
  debounce(600).
  filter(validEmail).
  flatMapLatest(compose(Bacon.fromPromise.bind(Bacon), fetchProfileByEmail)).
  toProperty(null);
This is the most straightforward way I can think of.
var emailProfile = email.
  debounce(600).
  flatMapLatest(function(email) {
    if (validEmail(email)) {
      return Bacon.fromPromise(fetchProfileByEmail(email))
    } else {
      return null
    }
  }).
  toProperty(null)
Pretty much the same as what you already discovered, except the if is not in the lookup function :)

Get ALL tweets, not just recent ones via twitter API (Using twitter4j - Java)

I've built an app using twitter4j which pulls in a bunch of tweets when I enter a keyword, takes the geolocation out of the tweet (or falls back to the profile location), then maps them using ammaps. The problem is that I'm only getting a small portion of tweets; is there some kind of limit here? I've got a DB collecting the tweet data, so soon enough it will have a decent amount, but I'm curious as to why I'm only getting tweets from within the last 12 hours or so.
For example if I search by my username I only get one tweet, that I sent today.
Thanks for any info!
EDIT: I understand Twitter doesn't allow public access to the firehose... it's more that I'm wondering why I'm limited to finding only recent tweets.
You need to keep redoing the query, resetting the maxId every time, until you get nothing back. You can also use setSince and setUntil.
An example:
Query query = new Query();
query.setCount(DEFAULT_QUERY_COUNT);
query.setLang("en");
// set the bounding dates (the search API expects YYYY-MM-DD strings)
query.setSince(sdf.format(startDate));
query.setUntil(sdf.format(endDate));
QueryResult result = searchWithRetry(twitter, query); // searchWithRetry is my function that deals with rate limits
while (result.getTweets().size() != 0) {
    List<Status> tweets = result.getTweets();
    System.out.print("# Tweets:\t" + tweets.size());
    Long minId = Long.MAX_VALUE;
    for (Status tweet : tweets) {
        // do stuff here
        if (tweet.getId() < minId)
            minId = tweet.getId();
    }
    // page backwards: only ask for tweets older than the oldest one seen so far
    query.setMaxId(minId - 1);
    result = searchWithRetry(twitter, query);
}
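searchWithRetry is not shown above; a possible shape for such a rate-limit-aware wrapper is sketched below (purely illustrative, not the original helper):

static QueryResult searchWithRetry(Twitter twitter, Query query)
        throws TwitterException, InterruptedException {
    while (true) {
        try {
            return twitter.search(query);
        } catch (TwitterException e) {
            if (!e.exceededRateLimitation()) {
                throw e; // not a rate-limit problem, give up
            }
            // sleep until the rate-limit window resets, then retry
            int wait = e.getRateLimitStatus() != null
                    ? e.getRateLimitStatus().getSecondsUntilReset() + 1
                    : 60;
            Thread.sleep(wait * 1000L);
        }
    }
}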
Really it depends on which API you are using, the Streaming API or the Search API. In the Search API there is an optional parameter, result_type, which can take the following values:
* mixed: include both popular and real-time results in the response.
* recent: return only the most recent results in the response.
* popular: return only the most popular results in the response.
The default is mixed.
As far as I understand, you are using the recent one, which is why you are getting only the recent set of tweets. Another issue is the low volume of tweets that carry geolocation information: because very few users add location information to their profile, you will get very few such tweets.
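If you are going through twitter4j, the result type is set on the Query object. A minimal sketch (the keyword is just a placeholder):

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class ResultTypeExample {
    public static void main(String[] args) throws Exception {
        Twitter twitter = TwitterFactory.getSingleton();

        Query query = new Query("some keyword");
        // mixed is the default; recent and popular narrow the result set
        query.setResultType(Query.ResultType.mixed);

        QueryResult result = twitter.search(query);
        for (Status tweet : result.getTweets()) {
            System.out.println(tweet.getUser().getScreenName() + ": " + tweet.getText());
        }
    }
}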