I'm new to Neo4j and am testing it on an EC2 server to see if we could use it to store our ~1.5 nodes and their connections (we currently use Redshift).
I want to load all of that data from Redshift into the Neo4j DB. I also work a lot with EMR and usually store most of my data on S3.
Is there any way to include AWS authentication information when importing data into Neo4j from S3 if the S3 location isn't public? If not, is there another way to do this?
Thanks
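For reference, this is the kind of approach I was hoping for (an untested sketch; the bucket, key, Cypher query, and connection details are all placeholders, and it assumes LOAD CSV can read a pre-signed HTTPS URL and that boto3 picks up credentials from the environment):

```python
# Untested sketch: generate a temporary pre-signed URL for a private S3 object,
# then hand it to Neo4j's LOAD CSV, which only needs plain HTTPS access.
import boto3
from neo4j import GraphDatabase  # official neo4j Python driver

s3 = boto3.client("s3")  # picks up credentials from the environment / instance role
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-export-bucket", "Key": "nodes/nodes.csv"},  # placeholders
    ExpiresIn=3600,  # URL stays valid for one hour
)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # LOAD CSV fetches the file over HTTPS, so no AWS credentials reach Neo4j itself.
    session.run(
        "LOAD CSV WITH HEADERS FROM $url AS row "
        "CREATE (:Node {id: row.id, name: row.name})",
        url=url,
    )
driver.close()
```

The idea would be that Neo4j itself never needs AWS credentials; it just fetches a time-limited HTTPS URL.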
Related
I have a large amount of historical daily data (over 1 TB across many files) that I need to transfer from an AWS S3 storage location to an internally hosted server. I have SFTP access to the internal server. Could you please advise on an efficient way to copy the data?
I am using a Mac and have used Cyberduck to download files from S3 and upload them to the internally hosted server. I have also tried using FileZilla to move files directly from one server to the other without an intermediate copy, but I would like to see if there is a way to do the move from the CLI without much manual intervention. I am new to the AWS CLI and would appreciate the community's input. Thank you!
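In case it helps frame an answer, this is roughly the scripted flow I would like to end up with (an untested sketch; the bucket, prefix, host, and credentials are placeholders, and it assumes boto3 and paramiko are available):

```python
# Rough sketch (untested): stream objects from S3 straight to the SFTP server
# without keeping a full local copy on the Mac in between.
import boto3
import paramiko

BUCKET = "my-historical-data"                         # placeholder
PREFIX = "daily/"                                     # placeholder
SFTP_HOST, SFTP_USER = "internal.example.com", "me"   # placeholders

s3 = boto3.client("s3")

transport = paramiko.Transport((SFTP_HOST, 22))
transport.connect(username=SFTP_USER, password="***")  # or key-based auth
sftp = paramiko.SFTPClient.from_transport(transport)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Writes into the SFTP user's home directory; recreate subdirectories if needed.
        with sftp.open(key.split("/")[-1], "wb") as remote_file:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
            # Copy the object in 8 MB chunks instead of loading it all into memory.
            for chunk in iter(lambda: body.read(8 * 1024 * 1024), b""):
                remote_file.write(chunk)

sftp.close()
transport.close()
```

Presumably the same thing could be done by syncing to an intermediate EC2 box with the AWS CLI and then SFTPing from there, but streaming would avoid the extra copy.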
I am trying to use Auto Loader where AWS S3 is the source and the Delta lake is in Azure Data Lake Gen. When I try to read files, it gives me the following error:
Writing to Delta table on AWS from non-AWS is unsafe in terms of providing transactional guarantees. If you can guarantee that no one else will be concurrently modifying the same Delta table, you may turn this check off by setting the SparkConf: "spark.databricks.delta.logStore.crossCloud.fatal" to false when launching your cluster.
I tried setting this at the cluster level and it works fine. My question is: is there any way we can ensure transactional guarantees while loading data from AWS S3 into Azure Data Lake (the Data Lake is the backing storage for our Delta lake)? We don't want to set "spark.databricks.delta.logStore.crossCloud.fatal" at the cluster level. Will there be any issue if we do, and would that be a good solution for a production ETL pipeline?
This warning appears when Databricks detects that you're doing multi-cloud work.
But this warning is for the case when you're writing into AWS S3 using Delta, because S3 doesn't have an atomic write operation (such as put-if-absent), so it requires a coordinator process that is only available on AWS.
In your case you can ignore this message, because you're only reading from AWS S3 and writing into a Delta table that is on Azure Data Lake.
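For illustration, the scenario looks roughly like this (a sketch, assuming a Databricks notebook where spark is predefined; the paths and source format are placeholders):

```python
# Sketch of the scenario above: Auto Loader reads from S3, the Delta table lives on
# Azure Data Lake (abfss). Assumes the cluster can reach both storage accounts.

# The cross-cloud check can be relaxed with this flag. Note the error text says to set
# it "when launching your cluster", so a session-level set may not be honoured and the
# cluster's Spark config may be required after all.
spark.conf.set("spark.databricks.delta.logStore.crossCloud.fatal", "false")

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")   # placeholder source format
    .load("s3://source-bucket/raw/")       # placeholder S3 path
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_chk/raw")
    .start("abfss://lake@account.dfs.core.windows.net/delta/raw")  # placeholder ADLS path
)
```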
I am working on a project that requires storing scientific data on AWS S3 as the raw layer at the start of a data lake. We are planning to use JSON for application data and S3 object metadata to persist application metadata (JSON schema) and process metadata. At the moment, S3 is the only AWS cloud service we have available to us on site.
The client would like a publish environment where they can get the raw data back as files. We would like to avoid building custom catalog and security infrastructure.
I don't see anything about Apache Atlas connecting directly to AWS S3, but we could put Apache Hive on top of AWS S3 and then put Apache Atlas and Ranger on top of that. I'm not sure whether that is a way to publish the raw data from S3, or whether it even works, since Hive is more of a processing environment.
Is it possible to use Apache Atlas and Ranger directly on top of AWS S3?
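For what it's worth, the "Hive on top of S3" part would look roughly like this (a sketch, assuming a Hive-enabled Spark session, s3a credentials already configured, and a JSON SerDe jar on the classpath; the bucket, database, and columns are made up):

```python
from pyspark.sql import SparkSession

# Sketch only: expose raw JSON files in S3 as a Hive external table, so Atlas can
# catalog it and Ranger can authorize access through Hive rather than S3 directly.
spark = SparkSession.builder.appName("raw-publish").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS raw_lake")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_lake.scientific_data (
        record_id STRING,
        payload   STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3a://my-raw-bucket/scientific-data/'
""")
```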
We have a large extended network of users that we track using badges. The total traffic is in the neighborhood of 60 million impressions a month. We are currently considering switching from a fairly slow, database-based logging solution (custom-built on PHP, and messy...) to a simple log-based alternative that relies on Amazon S3 logs and Splunk.
After using Splunk for some other analysis tasks, I really like it, but it's not clear how to set up a source like S3 with the system. It seems that remote sources require the Universal Forwarder to be installed, which is not an option here.
Any ideas on this?
Very late answer, but I was looking for the same thing and found a Splunk app that does what you want: http://apps.splunk.com/app/1137/. I have not tried it yet, though.
I would suggest logging preprocessed JSON data to a DocumentDB database, for example using Azure Queues or similar service-bus messaging technologies that fit your scenario, in combination with Azure DocumentDB.
So I would keep your database-based approach, but change it to a schemaless, easy-to-scale, document-based DB.
I use http://www.insight4storage.com/ from the AWS Marketplace to track my AWS S3 storage usage totals by prefix, bucket, or storage class over time; it also shows me previous versions' storage by prefix and per bucket. In addition to its UI and web-service API, it has a setting to save the S3 data as Splunk-format logs, which might work for your use case.
You can use the Splunk Add-on for AWS.
This is what I understand:
1. Create a Splunk instance. Use the website version of Splunk, or use the on-premise AMI to create an EC2 instance where Splunk is running.
2. Install the Splunk Add-on for AWS application on that EC2 instance.
3. Based on the type of input logs (e.g. CloudTrail logs, Config logs, generic logs, etc.), configure the Add-on and supply parameters such as the AWS account ID or IAM role.
4. The Add-on will automatically poll the AWS S3 source and fetch the latest logs after the specified amount of time (defaults to 30 seconds).
For a generic use case (like ours), you can configure the Generic S3 input for Splunk.
I am working on a Java MapReduce app that has to provide an upload service for pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The copyFromLocalFile(..) method needs a path on the local machine, which in this case will be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
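To make that concrete, one way to submit an S3DistCp step to an existing cluster (a sketch; the region, cluster ID, bucket, and paths are placeholders, and it assumes an EMR release where s3-dist-cp is available via command-runner.jar):

```python
import boto3

# Sketch: submit an S3DistCp step to an existing EMR cluster to copy data
# from S3 into the cluster's HDFS.
emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        {
            "Name": "copy-pictures-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs s3-dist-cp installed on the cluster
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-upload-bucket/pictures/",
                    "--dest", "hdfs:///input/pictures/",
                ],
            },
        }
    ],
)
```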
You will need to get the files from the user's machine to at least one node before you will be able to use them through MapReduce.
The FileSystem and FileUtil functions refer to paths either on HDFS or on the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe if you did some SSH setup... maybe?)