Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried running "mlcp" locally and then uploading the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 for backups. Is there a way to extract a specific database to AWS S3?
It would be fine for me to end up with one big file containing one JSON document per line.
Thanks,
Romain.

I don't know about getting it to S3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
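For illustration, here is a rough client-side sketch of that "one JSON document per line" idea, written in Python against MarkLogic's REST API rather than CORB2 itself; the host, port, credentials, bucket name and paging strategy below are all assumptions, not something the question specifies:

    # Hypothetical sketch: page through MarkLogic via its REST API, write
    # one JSON document per line (NDJSON), then push the single file to S3.
    # Host, port, credentials, bucket name and page size are placeholders.
    import json

    import boto3
    import requests
    from requests.auth import HTTPDigestAuth

    ML_BASE = "http://marklogic-host:8000"      # assumed REST instance
    AUTH = HTTPDigestAuth("admin", "admin")     # placeholder credentials
    PAGE_SIZE = 100

    def iter_uris():
        """Yield document URIs by paging /v1/search (response shape assumed)."""
        start = 1
        while True:
            r = requests.get(
                f"{ML_BASE}/v1/search",
                params={"format": "json", "start": start, "pageLength": PAGE_SIZE},
                auth=AUTH,
            )
            r.raise_for_status()
            results = r.json().get("results", [])
            if not results:
                break
            for res in results:
                yield res["uri"]
            start += PAGE_SIZE

    with open("export.ndjson", "w", encoding="utf-8") as out:
        for uri in iter_uris():
            doc = requests.get(
                f"{ML_BASE}/v1/documents", params={"uri": uri}, auth=AUTH
            )
            doc.raise_for_status()
            # Re-serialize so each document occupies exactly one line.
            out.write(json.dumps(doc.json()) + "\n")

    # boto3 handles multipart uploads automatically for large files.
    boto3.client("s3").upload_file("export.ndjson", "my-bucket", "export/export.ndjson")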

s3:// is a natively supported file path scheme in MarkLogic, so you can also iterate through all your documents and export them with xdmp:save("s3://...").
If you want to make aggregates, you may want to combine this idea with Sam's suggestion of CORB2 to control the process and help group your whole database into multiple manageable aggregate documents, then use a post-batch task to run xdmp:save.

Thanks guys for your answers. I didn't know about CORB2; it's a great solution! But unfortunately, due to poor I/O, I'd prefer a solution that writes directly to S3.
I can use a basic ML query and dump to s3:// with the native connector, but I always hit memory errors, even when launching it with the "spawn" function to run it as a background process.
Do you have any XQuery example that extracts each document to S3 one by one without running into memory errors?
Thanks

Related

Flink: Is it possible in the same Flink job to read data from a Kafka topic (file names) and then read the files' content from Amazon S3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
The S3 files are constantly added to the bucket.
Each time a file is added to the bucket, its full path is published to a Kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (an unbounded stream).
2. An evaluator that receives the file name, reads its content from S3 (a second source) and creates a DataStream.
3. Process the DataStream (applying some logic to each row).
4. Sink to file.
I managed to do the first, third and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
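As a hedged sketch of that second approach (sitting outside of Flink itself): a small Python driver that consumes file paths from Kafka and submits one bounded job per file through the flink CLI. The topic name, bootstrap servers, job jar and --input parameter are placeholders, not anything Flink prescribes.

    # Hypothetical driver: read S3 paths from Kafka and submit one bounded
    # Flink job per file. Topic, servers, jar path and job arguments are
    # placeholders; the Flink job itself must accept an --input parameter.
    import subprocess

    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "new-s3-files",                        # assumed topic carrying file paths
        bootstrap_servers="localhost:9092",
        group_id="file-job-launcher",
        value_deserializer=lambda b: b.decode("utf-8"),
    )

    for message in consumer:
        s3_path = message.value               # e.g. "s3://my-bucket/data/part-0001.csv"
        # Submit a bounded job that reads exactly this file and sinks to a local file.
        subprocess.run(
            ["flink", "run", "/opt/jobs/process-file.jar", "--input", s3_path],
            check=True,
        )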
This is possible to implement in general, but as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach could be writing the pipeline in Apache Beam, which already supports this and can use Flink as a runner (which is proof that this can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.

AWS Glue: check file contents for correctness

I have a project in AWS to insert data from some files, which will be in S3, into Redshift. The point is that the ETL has to be scheduled each day to find new files in S3 and then check whether those files are correct. However, this has to be done with custom code, as the files can have different formats depending on their kind, provider, etc.
I see that AWS Glue allows you to schedule, crawl and do the ETL. However, I'm lost on how one can write custom code for the ETL and parse the files to check their correctness before finally doing the copy from S3 to Redshift. Do you know if that can be done, and how?
Another issue is that if the file is correct, the system should upload the data from S3 to a web application via some API. But if it's not, the file should be left on an FTP and an email sent. Here again, do you know if that can also be done with AWS Glue, and how?
many thanks!
You can write your Glue/Spark code, upload it to S3, and create a Glue job that refers to this script/library. Anything you want to write in Python can be done in Glue; it's just a wrapper around Spark that you drive with Python (PySpark).
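As a hedged illustration of what such a Glue script might look like (the bucket, expected columns, validation rule and Redshift connection name below are all made up):

    # Minimal Glue job sketch: read new files from S3, run a custom
    # correctness check, and only load valid rows into Redshift.
    # Bucket, schema, connection name and the check itself are placeholders.
    import sys

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the day's files (format and path are assumptions; they could also
    # come from a crawler-populated catalog table instead).
    df = spark.read.option("header", "true").csv("s3://my-bucket/incoming/")

    # Custom "correctness" check: here, simply require the expected columns
    # and non-null ids; replace with provider-specific parsing as needed.
    expected = {"id", "event_date", "amount"}
    if not expected.issubset(set(df.columns)):
        raise ValueError(f"Unexpected file layout, columns = {df.columns}")
    valid_df = df.filter(df["id"].isNotNull())

    # Write the validated rows to Redshift through a Glue connection.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=DynamicFrame.fromDF(valid_df, glue_context, "valid"),
        catalog_connection="my-redshift-connection",   # assumed Glue connection
        connection_options={"dbtable": "public.events", "database": "analytics"},
        redshift_tmp_dir="s3://my-bucket/tmp/",
    )

    job.commit()

You would then point a scheduled Glue job (or a trigger) at this script in S3, which covers the "scheduled each day" requirement from the question.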

How to search a string in Amazon S3 files?

I have all types of files and documents stored in Amazon S3.
How can I perform a search on those documents using a keyword or string (full-text search, if possible)?
Is there any document-search system built on top of it?
The list of matching documents containing the search string should be displayed to the user for download.
Any help please?
Searching documents in S3 is not possible.
S3 is not a document database. It is an object store, designed for storing data but inferring no "meaning" from the data -- essentially a key/value store supporting very large values. It has no sense of context. It doesn't index the content of the objects, or even the object metadata. The only way to "find" an object in S3 is to already know its key.
It is excellent for highly available and highly reliable storage, but searching is not part of its design.
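To make that concrete, the closest thing to a native "search" is listing keys by prefix, for example with boto3 (the bucket and prefix below are placeholders):

    # You can only enumerate keys (optionally under a prefix); S3 never looks
    # inside the objects themselves. Bucket name and prefix are placeholders.
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket", Prefix="reports/2023/"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])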
The solution depends on how structured your S3 file data is.
If it is structured or semi-structured, such as CSV or JSON in a columnar-like format, AWS Athena will be the best choice. With just a few clicks, you're ready to query your S3 files.
Otherwise, if the data is totally unstructured, you may want to use Elasticsearch or something similar.
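As a rough sketch of the Athena route with boto3 (the database, table, column and output location are placeholders, and the table is assumed to already be defined over your S3 files, e.g. by a Glue crawler):

    # Hypothetical "search" over structured S3 data via Athena: run a LIKE
    # query against a table already defined on top of the bucket.
    # Database, table, column and output location are placeholders.
    import time

    import boto3

    athena = boto3.client("athena")

    query = "SELECT doc_key FROM documents WHERE body LIKE '%invoice%'"
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then read back the matching keys.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows[1:]:  # the first row is the column header
            print(row["Data"][0]["VarCharValue"])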
You can't search the way you want in Amazon S3.
But there is an alternative solution for this: I am using the S3 Browser software.
Here is the link to download it: http://s3browser.com/
Download it and you will have the same kind of access as the Amazon S3 web console; you can also perform searches and other operations.

Moving files >5 GB to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3, or has to do what it does now and then use multipart uploading -- in fact, I would think the code would just write directly to S3 and not worry about uploading at all.
Can anyone tell me whether Pipelines can use multipart uploading, and if not, can you suggest whether the correct approach is to have the program write directly to S3, or to continue writing to local storage and then perhaps have a separate program invoked within the same Pipeline to do the multipart upload?
The answer, based on AWS support, is that files over 5 GB indeed can't be uploaded directly to S3 in a single operation. And there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4 GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5 GB limit imposed by S3 on each PUT of a file part.
You need to write your own script wrapping the AWS CLI or s3cmd (older). This script can be executed as a ShellCommandActivity.
Writing directly to S3 may be an issue, as S3 does not support append operations -- unless you can somehow write multiple smaller objects into a folder.
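A minimal sketch of such a script, using boto3 instead of the raw CLI (file path, bucket, key and part sizes are placeholders): boto3's transfer manager switches to multipart automatically above the configured threshold, so it can handle files well beyond 5 GB.

    # Hypothetical ShellCommandActivity payload: upload a large local file to S3,
    # letting boto3's transfer manager do the multipart upload automatically.
    # File path, bucket, key and chunk sizes are placeholders.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts, well under the 5 GB per-part cap
        max_concurrency=8,
    )

    boto3.client("s3").upload_file(
        "/mnt/output/bigfile.dat",              # file produced by the Java code
        "my-bucket",
        "exports/bigfile.dat",
        Config=config,
    )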

Amazon MapReduce input splitting and downloading

I'm new to EMR and just have a few questions I have been struggling with the past few days. The first is that the logs I want to process are already compressed as .gz, and I was wondering whether these types of files can be split by EMR so that more than one mapper will work on a file. Also, I have been reading that input files will not be split unless they are 5 GB; my files are not that large, so does that mean they will only be processed by one instance?
My other question might seem relatively dumb, but is it possible to use EMR + streaming with an input somewhere other than S3? It seems redundant to download the logs from the CDN and then upload them to my S3 bucket to run MapReduce on them. Right now I have them downloading onto my server, and then my server uploads them to S3. Is there a way to cut out the middleman and have them go straight to S3, or run the inputs off my server?
"...are already compressed as .gz and I was wondering whether these types of files can be split by EMR so that more than one mapper will work on a file"
Alas, no, straight gzip files are not splittable. One option is to just roll your log files more frequently; this very simple solution works for some people though it's a bit clumsy.
"Also, I have been reading that input files will not be split unless they are 5 GB"
This is definitely not the case. If a file is splittable, you have lots of options for how you want to split it, e.g. configuring mapred.max.split.size. I found [1] to be a good description of the options available.
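For example, here is a hedged sketch using the present-day boto3 EMR API (the cluster id, scripts, buckets and the 64 MB value are placeholders) to submit a streaming step with an explicit max split size:

    # Hypothetical: submit a streaming step with an explicit max split size so a
    # splittable input is processed by many mappers. Cluster id, script locations
    # and buckets are placeholders.
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "process-logs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-D", "mapred.max.split.size=67108864",   # 64 MB splits
                    "-files", "s3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-bucket/logs/",
                    "-output", "s3://my-bucket/output/",
                ],
            },
        }],
    )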
"Is it possible to use EMR + streaming with an input somewhere other than S3?"
Yes. Elastic MapReduce now supports VPC, so you could connect directly to your CDN [2].
[1] http://www.scribd.com/doc/23046928/Hadoop-Performance-Tuning
[2] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/EnvironmentConfig_VPC.html?r=146