Store total online users in RRDtool given login/logout (check-in/check-out) log files

Given a log file with clear check-in and check-out messages per user, how can I feed this data into RRDtool to track the total users logged in to the site? (At this time, I do not care about unique users, but that would be nice too of course!)
I read about the DERIVE data source type. How do I get a hypothetical INTEGRAL type instead? Can this be done straightforwardly?
Of course, I could process the file myself, updating a GAUGE in the RRD and saving state somehow; however, I am hoping to avoid that if I can get away with it.

Since rrdtool deals in 'rates', it cannot really count the way you would like it to. The best you could do is use an ABSOLUTE type data source, but this would only work for something like a login/logout rate statistic: you could answer how many logins and how many logouts per hour you had.
You would then do an rrdtool update login.rrd N:1 whenever you have a login.
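For illustration, a minimal sketch of what that could look like (the data-source name, heartbeat, and RRA below are placeholder choices, assuming a 5-minute step):

    rrdtool create login.rrd --step 300 \
        DS:logins:ABSOLUTE:600:0:U \
        RRA:AVERAGE:0.5:1:288

    # run on every login event
    rrdtool update login.rrd N:1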


How to get number of Instagram followers on a specified date like minter.io does?

From the picture, you can see how follower statistics look on minter.io.
The only way I can imagine counting the change in followers: download the list of all the followers every day via the Instagram API into my DB, and with that history I can then calculate any change.
But on minter.io you can get such a graph a few minutes after registration... How???
They are probably storing this information on a daily basis and hence are able to keep a historical trend.
If you go to the minter.io website, they mention at the bottom that they have collected data for close to 198 million accounts. I guess you were one of those.
You don't need to get the list of all followers just to show the absolute change in the numbers. The Instagram API gives that directly when you query any of the endpoints giving user information.
I know how it works at smartmetrics.co.
Smartmetrics collects information about all followers of tracked accounts and builds history based on this data. So if you followed someone who is already tracked, you can get history for your account.
But minter makes a fake linear graph, according to some tests: How to Get Historical Data from Instagram API
Crowdbabble and Minter re-use Twitter tokens, which allows them to collect data on millions of accounts. This gives you the historical data that you want -- change in followers over time. As an individual, you aren't able to access the Twitter API and aggregate data like that for storage as easily. You don't have thousands of people giving you tokens that you can then scrape and store on a regular basis.
Crowdbabble has a free 14 day trial with no payment info required. If you don't want in-depth analytics, Twittercounter will give you your follower numbers over the past 30 days -- you can view each day separately.

Storing Online Users in Elixir

I am working on a chatroom (all-to-all) application in Elixir using an OTP GenServer. In the first phase, users register with their names via messages from a JS client. I am not sure what the best approach is to store these names on my Elixir server and send the client regular updates with the list of online users, or whether to use database storage. Please suggest the best approach.
I agree with bitwalker that ETS is a good fit.
Here's a short summary of what I did in production. It wasn't a chat server, but a server push with a couple of thousand users connecting via long polling. Pushed data was divided into some 50 categories, and users were able to choose which ones they wanted. At peak times the server pushed new messages every 2 seconds and processed > 2000 reqs/sec.
Essentially, I kept a gen_server for each user, where I held pending messages and the user's configuration (basically a list of selected channels). This was beneficial with long polling, since the user's data is decoupled from the user's request, so the data remains while requests are transient. However, I think this approach is also good for permanent connections, such as websockets, since there might still be occasional disconnections, and keeping more stable user data gives you a chance of resuming after a reconnect.
Obviously, when a request arrives, you need to find the user-specific process, and for this ETS is a good fit, since you don't have a single-process bottleneck. Instead of manually working with ETS, I'd recommend using gproc in conjunction with via tuples. Basically, when starting a user's gen_server, you can provide name: {:via, :gproc, {:n, :l, key}} where key is some custom key (an arbitrary term) you build from your internal user id (:n and :l indicate a unique name on the local node). You can then use that same via tuple when issuing calls/casts, and gen_server will use gproc to find the corresponding process.
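A minimal sketch of that registration, assuming gproc is available as a dependency (the module and key names here are illustrative, not from the original question):

    defmodule Chat.UserSession do
      use GenServer

      # Start the per-user process, registered under a gproc via tuple.
      def start_link(user_id) do
        GenServer.start_link(__MODULE__, user_id, name: via(user_id))
      end

      # Callers reach the process through the same via tuple.
      def get_name(user_id), do: GenServer.call(via(user_id), :name)

      defp via(user_id), do: {:via, :gproc, {:n, :l, {:user_session, user_id}}}

      def init(user_id), do: {:ok, %{id: user_id}}

      def handle_call(:name, _from, state), do: {:reply, state.id, state}
    end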
Finally, you need to have some timeout/disconnect logic to clean up user processes. In my case, I simply terminated a user's process if there was no activity from the web layer (no end user came for data in some time). Gproc will automatically remove entries for terminated processes from its internal ETS table. It's probably best to supervise user processes with a temporary restart strategy.
I realize all of this is still a bit vague, but I hope it makes some sense. Keep in mind that this is not the ultimate pattern (there's no such thing of course), but I think it's a reasonable first attempt.
You may also want to take a look at the Phoenix web framework, which has an interesting pub-sub facility in the form of Topics. I didn't try this out myself yet, but it seems interesting, and it may even simplify some of the stuff I discussed above, or at least help with pushing notifications from the chatroom to all users.
Sounds like a good use case for ETS.
A simpler approach might be to use an Agent to store the online-users information, but it depends quite a lot on what you need from the storage mechanism you choose.
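For instance, a minimal Agent-based sketch (the :online_users name and the MapSet choice are just illustrative):

    # Start a named Agent holding the set of online user names.
    {:ok, _pid} = Agent.start_link(fn -> MapSet.new() end, name: :online_users)

    # Add and remove users as they connect and disconnect.
    Agent.update(:online_users, &MapSet.put(&1, "alice"))
    Agent.update(:online_users, &MapSet.delete(&1, "alice"))

    # Fetch the current list to push to clients.
    Agent.get(:online_users, &MapSet.to_list/1)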

QuickBooks API - Retrieve only data that has changed

I am building an app that accesses the QuickBooks API v2.
I am looking for a way to retrieve only data that has changed.
For example, from time to time I want to be able to check whether there have been any changes to the chart of accounts in the QuickBooks data. Is there a quick way to do this without parsing a large response body? Maybe something like requesting and comparing just a checksum, and then requesting the whole chart of accounts to compare and update only if there is a change? Or even just requesting the changes that occurred after a certain date?
This need is not limited to the chart of accounts. For example, I may want to update historic transaction data, but only with the changes (e.g., a change to an old transaction), not the entire DB, which can be quite large.
Answer
On further reading of the API docs, it looks like I should be able to filter the response using the created_at and updated_at metadata.
The filter is called Change Data Capture (CDC):
https://developer.intuit.com/docs/0025_quickbooksapi/0050_data_services/v2/0500_quickbooks_windows/0100_calling_data_services/0015_retrieving_objects
<ItemReceiptQuery xmlns='http://www.intuit.com/sb/cdm/v2'>
  <CDCAsOf>2010-12-04T09:30:47.0Z</CDCAsOf>
</ItemReceiptQuery>
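As a rough illustration only, posting such a query body with Python might look like the sketch below; the endpoint URL is a made-up placeholder and the required OAuth signing is omitted, so consult the linked docs for the real request format:

    import requests

    # Placeholder endpoint; the real data-services URL and OAuth headers come from the docs above.
    URL = "https://example.invalid/quickbooks/v2/itemreceipts"

    BODY = """<ItemReceiptQuery xmlns='http://www.intuit.com/sb/cdm/v2'>
      <CDCAsOf>2010-12-04T09:30:47.0Z</CDCAsOf>
    </ItemReceiptQuery>"""

    resp = requests.post(URL, data=BODY, headers={"Content-Type": "application/xml"})
    print(resp.status_code)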
thanks
Jarred

Apply pattern recognition to user authentication for malicious attempts

To strengthen the authentication mechanism (web), I would like to log a user fingerprint for every attempt and apply pattern recognition to distinguish malicious attempts. For example if the user always logs in from European computers and there is an attempt made from China, the user is blocked until the user confirms (via email, for example) to allow logins from China.
I have only very, very little knowledge of pattern recognition from a university course, and I cannot recall enough of it to start developing this service. What I know is that you should look at various features such as these:
Browser agent string, resulting in:
  Operating system
  Browser vendor
IP address, resulting in:
  Location
Time stamp of login
Number of (failed) attempts (within session, or total)
You search for patterns, and any extraordinary attempt is flagged because it does not follow the average pattern. You will probably apply a threshold, so if a user logs in at night or has a new PC, it still works.
There are also a few requirements: first, the check of an attempt must be made in real time. You cannot block access after 2 minutes if the credentials were OK but you found out later that the attempt could have been malicious. Furthermore, all our apps are written in PHP, but PHP is probably too slow for this; I would prefer to use Python, but that then also requires a binding between PHP and Python.
So the question is: where to start? What is the best approach to accomplish this? I can log all data in a key-value store like Redis or a document store like Mongo. How would I design a service that can validate a new attempt, with certain features, against a bulk of known other attempts, and return whether the attempt matches the average in a timely fashion, say 250 ms?
What you want to do is called anomaly detection; Wikipedia is a good place to start. As a first stab, you might want to try clustering:
You will need a data set. The good news is that clustering is unsupervised, so you will not have to mark up a ton of login attempts as regular or malicious.
For a given user, keep a history of their past N logins (big brother warning!) and features of those logins. The features you have listed are a good start, but you can think of more.
Apply a clustering algorithm to estimate what the average login is like. For every new attempt you can calculate the distance from the average and decide if it looks malicious or not.
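A minimal sketch of the distance-from-average idea, assuming numeric feature vectors per login (the features, sample values, and threshold below are placeholders):

    import numpy as np

    # history: one row per past login, columns are numeric features
    # (e.g. hour of day, distance from usual location in km, failed-attempt count).
    history = np.array([
        [22.0, 5.0, 0.0],
        [23.0, 3.0, 1.0],
        [21.5, 4.0, 0.0],
    ])

    def looks_anomalous(attempt, history, threshold=3.0):
        """Flag an attempt whose z-score distance from the user's average is large."""
        mean = history.mean(axis=0)
        std = history.std(axis=0) + 1e-9      # avoid division by zero
        z = np.abs((np.asarray(attempt) - mean) / std)
        return z.max() > threshold            # any single feature far off the norm

    # A login at 3am from far away should be flagged.
    print(looks_anomalous([3.0, 8000.0, 0.0], history))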
As a side note, you can go a long way without learning. My intuition is that the location of the login and the number of failed attempts will get you most of the way there. A simple if-else might be good enough.

How to skip known entries when syncing with Google Reader?

For writing an offline client for the Google Reader service, I would like to know how best to sync with the service.
There doesn't seem to be official documentation yet and the best source I found so far is this: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
Now consider this: with the information from above I can download all unread items, I can specify how many items to download, and using the atom ID I can detect duplicate entries that I already downloaded.
What's missing for me is a way to specify that I just want the updates since my last sync.
I can say: give me the 10 latest entries (parameters n=10 and r=d). If I specify the parameter r=o (date ascending) then I can also specify the parameter ot=[last time of sync], but only then, and the ascending order doesn't make any sense when I just want to read some items rather than all items.
Any idea how to solve that without downloading all items again and just rejecting duplicates? That is not a very economical way of polling.
Someone proposed that I could specify that I only want the unread entries. But to make that solution work in such a way that Google Reader will not offer these entries again, I would need to mark them as read. In turn, that would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.
Cheers,
Mariano
To get the latest entries, use the standard from-newest-date-descending download, which will start from the latest entries. You will receive a "continuation" token in the XML result, looking something like this:
<gr:continuation>CArhxxjRmNsC</gr:continuation>
Scan through the results, pulling out anything new to you. You should find that either all results are new, or everything up to a point is new, and all after that are already known to you.
In the latter case, you're done, but in the former you need to find the new stuff older than what you've already retrieved. Do this by using the continuation to get the results starting from just after the last result in the set you just retrieved by passing it in the GET request as the c parameter, e.g.:
http://www.google.com/reader/atom/user/-/state/com.google/reading-list?c=CArhxxjRmNsC
Continue this way until you have everything.
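A rough sketch of that loop in Python, assuming an already-authenticated session (the gr: namespace URI and the minimal XML handling here are assumptions; a real client also needs the Reader auth token):

    import requests
    import xml.etree.ElementTree as ET

    BASE = "http://www.google.com/reader/atom/user/-/state/com.google/reading-list"
    known_ids = set()   # atom IDs you have already stored locally

    def fetch(continuation=None, count=20):
        params = {"n": count}
        if continuation:
            params["c"] = continuation
        return requests.get(BASE, params=params).text

    continuation = None
    while True:
        root = ET.fromstring(fetch(continuation))
        ns = {"atom": "http://www.w3.org/2005/Atom",
              "gr": "http://www.google.com/schemas/reader/atom/"}   # assumed namespace URI
        ids = [e.text for e in root.findall("atom:entry/atom:id", ns)]
        new_ids = [i for i in ids if i not in known_ids]
        known_ids.update(new_ids)
        cont = root.find("gr:continuation", ns)
        # Stop when we hit items we already know, or there is no continuation token.
        if len(new_ids) < len(ids) or cont is None:
            break
        continuation = cont.text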
The n parameter, which is a count of the number of items to retrieve, works well with this, and you can change it as you go. If the frequency of checking is user-set, and thus could be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and your processing load. Initially request a small number of the latest entries, say five (add n=5 to the URL of your GET request). If all are new, then in the next request, where you use the continuation, ask for a larger number, say 20. If those are still all new, either the feed has a lot of updates or it's been a while, so continue on in groups of 100 or whatever.
However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" due to the person reading it using the Google Reader interface.
One approach to this would be:
1. Update the status on Google of any items that have been read locally.
2. Check and save the unread count for the feed. (You want to do this before the next step, so that you guarantee that new items have not arrived between your download of the newest items and the time you check the read count.)
3. Download the latest items.
4. Calculate your read count, and compare that to Google's. If the feed has a higher read count than you calculated, you know that something's been read on Google.
5. If something has been read on Google, start downloading read items and comparing them with your database of unread items. You'll find some items that Google says are read that your database claims are unread; update these. Continue doing so until you've found a number of these items equal to the difference between your read count and Google's, or until the downloads get unreasonable (a rough sketch of this reconciliation follows the list).
6. If you didn't find all of the read items, c'est la vie; record the number remaining as an "unfound unread" total, which you also need to include in your next calculation of the local number you think are unread.
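A minimal sketch of steps 4-6, assuming a local set of unread item IDs and a hypothetical fetch_read_items() generator that yields Google-side read item IDs, newest first:

    def reconcile_read_state(local_unread, google_unread_count, fetch_read_items):
        # Items Google considers read that we still think are unread.
        missing = len(local_unread) - google_unread_count
        found = 0
        for item_id in fetch_read_items():    # hypothetical: streams read items, newest first
            if item_id in local_unread:
                local_unread.discard(item_id) # mark as read locally
                found += 1
                if found == missing:
                    break
        return missing - found                # "unfound unread" to carry into the next sync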
If the user subscribes to a lot of different blogs, it's also likely he labels them extensively, so you can do this whole thing on a per-label basis rather than for the entire feed, which should help keep the amount of data down, since you won't need to do any transfers for labels where the user didn't read anything new on Google Reader.
This whole scheme can be applied to other statuses, such as starred or unstarred, as well.
Now, as you say, this
...would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.
True enough. Neither keeping a local read/unread state (since you're keeping a database of all of the items anyway) nor marking items read in Google (which the API supports) seems very difficult, so why doesn't this work for you?
There is one further hitch, however: the user may mark something that was read as unread on Google. This throws a bit of a wrench into the system. My suggestion there, if you really want to try to take care of this, is to assume that the user in general will be touching only more recent stuff, and download the latest couple hundred or so items every time, checking the status on all of them. (This isn't all that bad; downloading 100 items took me anywhere from 0.3s for 300KB to 2.5s for 2.5MB, albeit on a very fast broadband connection.)
Again, if the user has a large number of subscriptions, he's also probably got a reasonably large number of labels, so doing this on a per-label basis will speed things up. I'd suggest, actually, that not only do you check on a per-label basis, but you also spread out the checks, checking a single label each minute rather than everything once every twenty minutes. You can also do this "big check" for status changes on older items less often than you do a "new stuff" check, perhaps once every few hours, if you want to keep bandwidth down.
This is a bit of a bandwidth hog, mainly because you need to download the full article from Google merely to check the status. Unfortunately, I can't see any way around that in the API docs that we have available to us. My only real advice is to minimize the checking of status on non-new items.
The Google API hasn't been released yet; when it is, this answer may change.
Currently, you would have to call the API and disregard items already downloaded, which as you said isn't terribly efficient, as you will be re-downloading items every time, even if you already have them.