Monitoring Amazon S3 logs with Splunk? - amazon-s3

We have a large extended network of users that we track using badges. The total traffic is in the neighborhood of 60 Million impressions a month. We are currently considering switching from a fairly slow, database-based logging solution (custom-built on PHP—messy...) to a simple log-based alternative that relies on Amazon S3 logs and Splunk.
After using Splunk for some other analyisis tasks, I really like it. But it's not clear how to set up a source like S3 with the system. It seems that remote sources require the Universal Forwarder installed, which is not an option there.
Any ideas on this?

Very late answer but I was looking for the same thing and found a Splunk app that does what you want, http://apps.splunk.com/app/1137/. I have yet not tried it though.

I would suggest logging j-son preprocessed data to a documentdb database. For example, using azure queues or simmilar service bus messaging technologies that fit your scenario in combination with azure documentdb.
So I'll keep your database based approach and modify it to be a schemaless easy to scale document based DB.

I use http://www.insight4storage.com/ from AWS Marketplace to track my AWS S3 storage usage totals by prefix, bucket or storage class over time; plus it shows me the previous versions storage by prefix and per bucket. It has a setting to save the S3 data as splunk format logs that might work for your use case, in addition to its UI and webservice API.

You use Splunk Add-On for AWS.
This is what I understand,
Create a Splunk instance. Use the website version or the on-premise
AMI of splunk to create an EC2 where splunk is running.
Install Splunk Add-On for AWS application on the EC2.
Based on the input logs type (e.g. Cloudtrail logs, Config logs, generic logs, etc) configure the Add-On and supply AWS account id or IAM Role, etc parameters.
The Add-On will automatically ping AWS S3 source and fetch the latest logs after specified amount of time (default to 30 seconds).
For generic use case (like ours), you can try and configure Generic S3 input for Splunk

Related

Looking for a solution to ingest Pega cloud service logs to Splunk using Splunk addons for AWS

I am looking for a solution to ingest pega cloud service logs to Splunk. I cam across to approaches push and pull. With Push option splunk has blueprint lamdbda which can be used to push events to Splunk HTTP Event Collector (HEC). I am not finding any clear solution for pull approach. Can some one summarize for which scenarios pull will work and for which scenarios push will work.
Splunk and Pega Cloud services are on different vpc , how we can have secure data transfer between them.
A fairly simple solution would be to set up your own client application to pull all relevant notifications. (https://community.pega.com/knowledgebase/articles/pega-predictive-diagnostic-cloud/subscribing-pega-predictive-diagnostic-cloud-notifications-using-rest-api)
You can run your application from your own secure server and have it write the info into Splunk. (https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtogetdatapython/#To-add-data-directly-to-an-index)

Cloud and local application sync ideas

I've a situation where my central MySQL db and file system (S3) runs on a EC2.
But one of my application runs locally at my client site on a PI-3 device, which needs to look up data and files from both the DB and file system on cloud. The application generates transactional records in turn and need to upload the DB and FS (may be at day end).
The irony is that sometimes the cloud may not be available due to connectivity issues (being in a remote area).
What could be the best strategies to accommodate this kind of a scenario?
Can AWS Greengrass help in here?
How to keep the Lookup data (DB and FS)in sync with the local devices?
How to update/sync the transactional data generated by the local devices?
And finally, what could be the risks in such a deployment model?
Appreciate some help/suggestions.
How to keep the Lookup data (DB and FS)in sync with the local devices?
You can have a Greengrass Group and includes all of the devices in the that group. Make the devices subscribe to a topic e.g. DB/Cloud/update. Once device received the message on that topic, trigger a on-demand lambda to download the latest information from the Cloud. To make sure the device do not miss any update when offline, you can use persistent session, it will make sure device will receive all the missing message when it is back online.
How to update/sync the transactional data generated by the local devices?
You may try with the Stream Manager. https://docs.aws.amazon.com/greengrass/latest/developerguide/stream-manager.html
Right now, it is allowed you to add a local use lambda to pre-process the data and sync it up with the cloud

retrieving Apache log files from AWS Beanstalk

I know that Beanstalk's Snapshot Logs can give you a recent overview of the httpd/access_log files from among the EC2 instances under the ELB for that environment. But does anyone know a good way to get all the logs?
It's a production environment, so I want to do the processing elsewhere. But I don't want to (for obvious reasons) configure root sftp and go around collecting the files manually.
I think I had read something about configuring logging to S3?
In the "Configuration" tab for an Environment, under "Software Configuration", there is a checkbox for enabling log file rotation to S3. These are stored in an S3 bucket used specifically for Elastic Beanstalk.
You can feed your current logs to aws cloudwatch logs.
AWS cloudwatch logs will centralise all logs of your infrastructure with a neat solution to search an process them as well as creating metrix and alarm based on your logs.
I have a guide on how to Store aws beanstalk symfony and apache logs in cloudwatch logs. This will help you to get up and running fast, and then you can tweak it.

Amazon web services: Where to start

I am a recent grad and wanted to learn about doing web application using AWS. I have gone through the documentation and ran their sample Travel Log application Successfully.
But still I am not clear about the terminologies used. can anyone explain me the difference between Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), Amazon SimpleDB in simple words.
I am looking to come up with a web app that has a signin page and people posting some text there. may i know what services of amazon would be required for me to build this app.
Thanks
Amazon Simple Storage Service (S3) is for load static content , maybe images, videos, or something you want to save, You could think of it like a hard drive for storage.
Amazon Elastic Compute Cloud: ( EC2) basically is your Virtual Operative System, you can install whatever OS you want (Debian, Ubuntu, Fedora, Centos, Windows Server, Suse enterprise). ( if your application uses server side processing this will be its home)
Amazon Simple DB, is a no-sql database system, that you could use for your aplications, and Amazon gives you as a service, but if you want to use something more, you could install yours on EC2, or use RDS for Database server (MySql for example)
If you want to know more, there are some books, like: "programming Amazon EC2" or see Amazon screencast at http://www.youtube.com/user/AmazonWebServices or its presentation on http://www.slideshare.net/AmazonWebServices
Amazon Simple Storage Service (Amazon S3)
Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-based service designed for online backup and archiving of data and application programs. It allows to upload, store, and download any type of files up to 5 TB in size. This service allows the subscribers to access the same systems that Amazon uses to run its own web sites. The subscriber has control over the accessibility of data, i.e. privately/publicly accessible.
Amazon Elastic Compute Cloud (Amazon EC2)
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.
Amazon SimpleDB
Amazon SimpleDB is a highly available NoSQL data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest.
Unbound by the strict requirements of a relational database, Amazon SimpleDB is optimized to provide high availability and flexibility, with little or no administrative burden. Behind the scenes, Amazon SimpleDB creates and manages multiple geographically distributed replicas of your data automatically to enable high availability and data durability. The service charges you only for the resources actually consumed in storing your data and serving your requests. You can change your data model on the fly, and data is automatically indexed for you. With Amazon SimpleDB, you can focus on application development without worrying about infrastructure provisioning, high availability, software maintenance, schema and index management, or performance tuning.
For more information, go through these:
https://aws.amazon.com/simpledb/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
https://www.tutorialspoint.com/amazon_web_services/amazon_web_services_s3.htm
Amazon S3 is used for storage of files. It is basically like the hard drives like on your system you use C or D your files. If you are developing any application you can use S3 for storing the static files or any backup files.
Amazon EC2 is exactly like your physical machine. Only difference is EC2 is on cloud. You can install and run software, applications store files exactly you do on your physical machines.
Amazon Simple DB is a a database on cloud. you can integrate it with your application and make queries.

How to replicate Amazon EBS to S3?

We have a site where users upload files, some of them quite large. We've got multiple EC2 instances and would like to load balance them. Currently, we store the files on an EBS volume for fast access. What's the best way to replicate the files so they can be available on more than one instance?
My thought is that some automatic replication process that uploads the files to S3, and then automatically downloads them to other EC2 instances would be ideal.
EBS snapshots won't work because they replicate the entire volume, and we need to be able to replicate the directories of individual customers on demand.
You could write a shell script that would spawn s3cmd to sync your local filesystem with a S3 bucket whenever a new file is uploaded (or deleted). It would look something like:
s3cmd sync ./ s3://your-bucket/
Depends on what OS you are running on your EC2 instances:
There isn't really any need to add S3 to the mix unless you want to store them there for some other reason (like backup).
If you are running *nix the classic choice might be to run rsync and just sync between instances.
On Windows you could still use rsync or else SyncToy from Microsoft is a simple free option. Otherwise there are probably hundreds of commercial applications in this space...
If you do want to sync to S3 then I would suggest one of the S3 client apps like CloudBerry or JungleDisk, which both have sync functionality...
If you are running Windows it's also worth considering DFS (Distributed File System) which provides replication and is part of Windows Server...
The best way is to use the Amazon Cloud Front service. All of the replication is managed as part of the AWS. Content is served from several different availability zones, but does not require you to have EBS volumes in those zones.
Amazon CloudFront delivers your static and streaming content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance.
http://aws.amazon.com/cloudfront/
Two ways:
Forget EBS, transfer the files to S3 and use S3 as your file-manager than EBS, add cloudfront and use the common-link everywhere.
Mount S3 bucket on any machines.
1. Amazon CloudFront is a web service for content delivery. It delivers your static and streaming content using a global network of edge locations.
http://aws.amazon.com/cloudfront/
2. You can mount S3 bucket on your linux machine. See below:
s3fs -
http://code.google.com/p/s3fs/wiki/InstallationNotes
- this did work for me. It uses FUSE file-system + rsync to sync the files
in S3. It kepes a copy of all
filenames in the local system & make
it look like a FILE/FOLDER.
That way you can share the S3 bucket on different machines.