Logging Apache2 to MongoDB: Apache hook? Something out there?

I just found a great blog post at http://simonwillison.net/2009/Aug/26/logging/ stating the following: "MongoDB is fantastic for logging".
Sounds tempting... high performance inserts, JSON structured records and capped collections if you only want to keep the past X entries. If you care about older historic data but still want to preserve space you could run periodic jobs to roll up log entries into summarised records. It shouldn't be too hard to write a command-line script that hooks into Apache's logging directive and writes records to MongoDB.
Is there anything out there already? Is anyone already using Apache logging with MongoDB?

A simple solution is to configure Apache to pipe its access log to a Perl script, which then does the needed work: parsing each line, inserting it into Mongo, and so on.
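For illustration, a minimal sketch of such a piped log handler is below, written in Python with pymongo instead of Perl (the script path, database and collection names are assumptions). Apache would be pointed at it with a piped log directive such as CustomLog "|/usr/local/bin/mongo_access_log.py" combined.

#!/usr/bin/env python3
# Minimal sketch of a piped access-log handler that writes each log line to
# MongoDB. Database and collection names are placeholders.
import sys
from datetime import datetime
from pymongo import MongoClient  # assumption: pymongo is installed

collection = MongoClient()["logs"]["apache_access"]

for line in sys.stdin:
    # Apache writes one access-log line per request to this process's stdin.
    # A real script would parse the combined log format into separate fields.
    collection.insert_one({"raw": line.rstrip("\n"), "received_at": datetime.utcnow()})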
@Alexander, you don't need to have Apache block on I/O. Write your logger/Perl script so it uses a message queue plus threading. Apache sends the log line to the Perl script, which inserts the message into a queue held in memory; another thread reads from the queue and does the actual work. We do this on our 1 billion+ views/month cache servers and it works without fail.
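A rough sketch of that queue-plus-worker-thread pattern, again in Python rather than Perl and with placeholder names, might look like this:

#!/usr/bin/env python3
# Sketch of the non-blocking variant: the main loop only enqueues lines, and a
# background thread does the MongoDB inserts, so Apache never waits on I/O.
import sys
import queue
import threading
from pymongo import MongoClient  # assumption: pymongo is installed

log_queue = queue.Queue(maxsize=100000)
collection = MongoClient()["logs"]["apache_access"]

def worker():
    while True:
        doc = {"raw": log_queue.get()}
        collection.insert_one(doc)
        log_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

for line in sys.stdin:
    try:
        log_queue.put_nowait(line.rstrip("\n"))
    except queue.Full:
        pass  # drop the line rather than let Apache block

log_queue.join()  # flush any queued lines before exiting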

A relatively recent option is to use Flume to collect the logs and use the MongoDB sink plugin for Flume to write the events to MongoDB.

Related

Apache beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is the Kafka partition itself (ZooKeeper is no longer used from Kafka 0.9 onward). With this setup, when I restart the job/pipeline, there are issues with duplicate and missing records.
From what I have read, the best way to handle this is to manage offsets in an external data store. Is it possible to do this with the current version of Apache Beam and KafkaIO? I am using version 2.2.0 right now.
Also, after reading from Kafka, I will write the data to BigQuery. Is there a setting in KafkaIO where I can commit a message only after I have inserted it into BigQuery? I can only find the auto-commit setting right now.
In Dataflow, you can update a job rather than restarting it from scratch. The new job resumes from the last checkpointed state, ensuring exactly-once processing. This works for the KafkaIO source as well. The auto-commit option in the Kafka consumer configuration helps, but it is not atomic with Dataflow's internal state, which means a restarted job might see a small fraction of duplicate or missing messages.
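For reference, an update is requested simply by relaunching the pipeline with the update option set; a hedged sketch of the relevant Dataflow options in Python (project, region and job name are placeholders) would be:

# Sketch only: relaunching a pipeline with --update so Dataflow resumes from
# the previous job's checkpointed state instead of starting from scratch.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",          # placeholder project id
    "--region=us-central1",          # placeholder region
    "--job_name=kafka-to-bigquery",  # must match the name of the running job
    "--update",
])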

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted round for how to format a GET request into the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside map-reduce jobs. The reason is that when running on the cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make this situation even worse: when you swamp the external system with requests, your tasks on the cluster will be mostly sleeping, waiting for a reply from the server. The resource manager will see that the tasks are not using CPU and will schedule more of your tasks to run, which will make even more requests to the external system, swamping it even further. I've seen a modest 100-machine cluster put out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
As for your original question, I don't think Pig Latin provides such a service, but it can easily be done with UDFs in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about six lines of code (see the sketch below). A Java UDF will be a little more verbose, but nothing terrible by Java standards.
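As a rough illustration of the Python route (assuming Pig's CPython streaming_python UDF support so that the requests library is actually usable; the file name, function name and schema are made up), the UDF could be as small as:

# metoffice_udf.py -- hypothetical UDF that fetches a URL and returns the body.
import requests  # assumption: requests is installed on the worker nodes
from pig_util import outputSchema

@outputSchema("response:chararray")
def fetch(url):
    # Issue the GET request and hand the raw response body back to Pig.
    return requests.get(url, timeout=30).text

It would then be registered in the Pig script with something along the lines of REGISTER 'metoffice_udf.py' USING streaming_python AS metoffice; and called as metoffice.fetch(url). The caveat above about hammering the external service from many parallel tasks still applies.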
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig loads from HDFS storage, unless you write your own loader.
And I share the same view as @Vlad that this is not a good idea. There are many other components meant for data ingestion, but this is not a use case for Pig!

Is Redis data volatile?

I am trying to figure out something and I've been searching for a while with no results.
What happens if a Redis server loses power or gets shut down or something that would wipe the RAM? Does it keep a backup somewhere?
I want to use Redis for a SaaS-style app, so if I go to app.com/usernamesapp it would use Redis to verify that usernamesapp exists and get its ID... At which point it would use MySQL for all the rest of the stuff. The reason is that I want to begin showing the page ASAP, and most of the work is JavaScript, so all the MySQL queries would happen after the fact.
Thanks
Redis can be configured to write to disk at regular intervals, so if the server fails you won't lose your data.
http://redis.io/topics/persistence
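As a concrete illustration using the redis-py client (the intervals are just example values; normally you would set these in redis.conf), the persistence options described on that page can also be applied at runtime:

import redis  # assumption: the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

# RDB snapshotting: dump to disk if at least 1 key changed in 900 seconds or
# 10 keys in 300 seconds (equivalent to "save 900 1 300 10" in redis.conf).
r.config_set("save", "900 1 300 10")

# AOF: append every write operation to a log file for stronger durability.
r.config_set("appendonly", "yes")

print(r.config_get("save"), r.config_get("appendonly"))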
From the Redis FAQ
Redis is an in-memory but persistent on disk database
So a critical failure should not result in data loss. Read more at http://redis.io/topics/faq

The fastest method to move Redis data to MySQL

We have a big shopping and product-dealing system. We faced lots of problems with MySQL, so after some R&D we planned to use Redis and started integrating it into our system.
The following data, which previously hit the database directly, has now been moved to Redis:
User shopping cart details
Affiliate click tracking records
Product dealing user data
Other site stats.
I am not only storing the data in Redis; I have written crons which move the Redis data into MySQL at intervals. This is the main point where I am facing issues.
Below are the points I am looking for a solution to:
Is there any other way to dump big data from Redis to MySQL?
When Redis fails, our data is stored in a file; so is it possible to store that data directly in the MySQL database?
Does Redis have any trigger system that I can use to avoid the crons, like a queue system?
Is there any other way to dump big data from Redis to MySQL?
Redis has the ability (using BGSAVE) to generate a dump of the data in a non-blocking and consistent way.
https://github.com/sripathikrishnan/redis-rdb-tools
You could use Sripathi Krishnan's well-known package to parse a redis dump file (RDB) in Python, and populate the MySQL instance offline. Or you can convert the Redis dump to JSON format, and write scripts in any language you want to populate MySQL.
This solution is only interesting if you want to copy the complete data of the Redis instance into MySQL.
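A rough sketch of the first approach, loosely based on the package's RdbParser/RdbCallback API (the MySQL table, columns and credentials are made up), could look like this:

# Sketch: stream string keys out of an RDB dump into MySQL with redis-rdb-tools.
from rdbtools import RdbParser, RdbCallback
import mysql.connector  # assumption: MySQL Connector/Python is installed

db = mysql.connector.connect(user="app", password="secret", database="shop")
cursor = db.cursor()

class MySQLLoader(RdbCallback):
    def __init__(self):
        super(MySQLLoader, self).__init__(string_escape=None)

    def set(self, key, value, expiry, info):
        # Called once for every plain string key found in the dump file.
        cursor.execute("INSERT INTO redis_strings (k, v) VALUES (%s, %s)", (key, value))

parser = RdbParser(MySQLLoader())
parser.parse("/var/redis/dump.rdb")
db.commit()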
Does Redis have any trigger system that I can use to avoid the crons, like a queue system?
Redis has no trigger concept, but nothing prevents you from posting events to Redis queues each time something must be copied to MySQL. For instance, instead of:
# Add an item to a user shopping cart
RPUSH user:<id>:cart <item>
you could execute:
# Add an item to a user shopping cart
MULTI
RPUSH user:<id>:cart <item>
RPUSH cart_to_mysql <id>:<item>
EXEC
The MULTI/EXEC block makes it atomic and consistent. Then you just have to write a little daemon waiting on items of the cart_to_mysql queue (using BLPOP commands). For each dequeued item, the daemon has to fetch the relevant data from Redis, and populate the MySQL instance.
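A minimal sketch of that daemon in Python, using redis-py and MySQL Connector/Python (the table, columns and credentials are made up; the queue name matches the example above), might be:

# Sketch of the daemon described above: block on the cart_to_mysql queue and
# replay each item into MySQL.
import redis             # assumption: redis-py is installed
import mysql.connector   # assumption: MySQL Connector/Python is installed

r = redis.Redis()
db = mysql.connector.connect(user="app", password="secret", database="shop")

while True:
    _, raw = r.blpop("cart_to_mysql")           # blocks until an item arrives
    user_id, item = raw.decode().split(":", 1)  # "<id>:<item>" pushed by the MULTI/EXEC block
    cursor = db.cursor()
    cursor.execute("INSERT INTO cart_items (user_id, item) VALUES (%s, %s)", (user_id, item))
    db.commit()
    cursor.close()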
When Redis fails, our data is stored in a file; so is it possible to store that data directly in the MySQL database?
I'm not sure I understand the question here. But if you use the above solution, the latency between Redis updates and MySQL updates will be quite limited, so if Redis fails you will only lose the very last operations (contrary to a solution based on cron jobs). It is of course not possible to have 100% consistency in the propagation of the data, though.

Top & httpd - demystifying what is actually running

I often use the "top" command to see what is taking up resources. Mostly it comes up with a long list of Apache httpd processes, which is not very useful. Is there any way to see a similar list, but such that I could see which PHP scripts etc. those httpd processes are actually running?
If you're concerned about long-running processes (i.e. requests that take more than a second or two to execute), you'll be able to get an idea of them using Apache's mod_status. See the documentation, and an example of the output (from www.apache.org). This isn't unique to PHP; it applies to anything running inside an Apache process.
Note that the www.apache.org status output is publicly available presumably for demonstration purposes -- you'd want to restrict access to yours so that not everyone can see it.
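If you want to watch it programmatically, mod_status also exposes a machine-readable view at /server-status?auto. A quick sketch (assuming the requests library and a locally reachable status URL):

# Sketch: poll mod_status's machine-readable output and print a few counters.
import requests  # assumption: requests is installed

text = requests.get("http://localhost/server-status?auto", timeout=5).text
stats = dict(line.split(": ", 1) for line in text.splitlines() if ": " in line)

print("Busy workers:", stats.get("BusyWorkers"))
print("Idle workers:", stats.get("IdleWorkers"))
print("Requests/sec:", stats.get("ReqPerSec"))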
There's a top-like, ncurses-based utility called apachetop which provides real-time log analysis for Apache. Unfortunately, the project has been abandoned and the code suffers from some bugs; however, it's still very much usable. Just don't run it as root: run it as any user with access to the web server log files and you should be fine.
The PHP scripts happen so fast that top wouldn't show you very much, or it would zip by quite quickly; most web requests are quite quick.
I think your best bet would be some type of real-time log processor that keeps an eye on your access logs and updates stats for you: average run time, memory usage, and things like that.
You could make your PHP pages time themselves and write their path and execution time to a file or database. Note that this would slow everything down while you were monitoring, but it would serve as a good measuring method.
It wouldn't be very interactive, though: you'd be able to get daily or weekly results from it, but it would be hard to see something meaningful within minutes or hours.