Looking for examples of the Airflow GCSToS3Operator (amazon-s3)

I am trying to send a file from a GCS bucket to an S3 bucket using Airflow. I came across this article, https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2, but I am looking for specific code implementations and examples that also explain the requirements for this. I am new to Airflow and GCP.

Astronomer is a good place to start; see their doc for GCSToS3Operator.
It covers the dependencies, an explanation of each parameter, and links to examples.
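To make that concrete, here is a minimal DAG sketch (an illustration under assumptions, not a definitive implementation). It assumes the apache-airflow-providers-amazon package is installed and that Airflow connections named google_cloud_default and aws_default already exist; the bucket names and prefixes are placeholders, and parameter names can differ slightly between Airflow and provider versions.

```python
# Minimal sketch of a GCS -> S3 transfer DAG. Bucket names, prefixes and
# connection IDs are placeholders; adjust to your environment.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator

with DAG(
    dag_id="gcs_to_s3_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,            # Airflow < 2.4 uses schedule_interval=None instead
    catchup=False,
) as dag:
    copy_gcs_to_s3 = GCSToS3Operator(
        task_id="copy_gcs_to_s3",
        bucket="my-source-gcs-bucket",          # GCS bucket to read from
                                                # (newer provider versions may call this gcs_bucket)
        prefix="exports/",                      # only copy objects under this prefix
        dest_s3_key="s3://my-dest-s3-bucket/exports/",  # destination bucket/prefix
        gcp_conn_id="google_cloud_default",     # Airflow connection to GCP
        dest_aws_conn_id="aws_default",         # Airflow connection to AWS
        replace=False,                          # skip keys that already exist in S3
    )
```

While testing, you could trigger it manually from the UI or with `airflow dags trigger gcs_to_s3_example`.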

Related

How to code/implement AWS Batch?

I am pretty new to AWS Batch, but I have prior experience with batch implementations.
I went through this link and understood how to configure a batch job on AWS.
I have to implement a simple batch job that reads an incoming pipe-separated file, extracts the data from it, performs some transformation on that data, and then saves each line as a separate file in S3.
But I couldn't find any document or example that covers the implementation part; all, or at least most, documents talk only about AWS Batch configuration.
Any ideas on the coding/implementation side? I would be using Java for the implementation.
This might help you, though the code is in Python:
awslabs/aws-batch-genomics
AWS Batch runs your Docker containers as jobs, so the implementation is not limited to any particular language.
For Java, you could copy your jars into the Docker image and provide an ENTRYPOINT or CMD to start your code when the job is started in AWS Batch.
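To illustrate the "container as job" model described above, here is a hedged Python (boto3) sketch that submits such a containerized job. The job queue, job definition, and S3 paths are placeholders, and the actual file parsing and transformation would live inside the Docker image (your Java code), not in this script.

```python
# Sketch only: submits a job to AWS Batch with boto3, assuming a job queue and
# a job definition (pointing at your Docker image with its Java ENTRYPOINT/CMD)
# have already been created. Names and S3 paths below are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="pipe-file-transform",
    jobQueue="my-batch-queue",               # existing job queue (placeholder)
    jobDefinition="my-java-transform:1",     # job definition wrapping your image
    containerOverrides={
        # Passed to your container's ENTRYPOINT/CMD; your Java code reads these.
        "command": ["s3://my-input-bucket/incoming/file.psv",
                    "s3://my-output-bucket/output/"],
        "environment": [
            {"name": "DELIMITER", "value": "|"},
        ],
    },
)

print("Submitted job:", response["jobId"])
```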

Cosmos on Wirecloud

Taking the public documentation as a reference (https://wirecloud.conwet.etsiinf.upm.es/slides/1.2_Integration%20with%20other%20GEs.html#slide16), I wonder whether there has been any progress on connecting WireCloud and Cosmos in order to retrieve historical data and visualise it in mashup setups.
If not, could you give me some direction so I can try implementing something around this?
Note: I have already checked some of the available documentation, and it looks to me as if my desired feature could be tackled by a simple Python implementation that retrieves HDFS files and converts them to the appropriate NGSI format. Is that right?
Nevertheless, I believe it would be a dirty mechanism. What would be the recommended way?
I honestly hope I am not cheating by answering my own question and marking it as correct, but I would like to leave a record of a solution for those folks who might be experiencing the same troubles as me.
I have developed a quick and dirty mechanism to retrieve HDFS files in NGSI format, so we can retrieve historical data like we do with Orion widgets.
https://github.com/netzahdzc/cloudCos
Please note that this is very much a work in progress, so there are some hardcoded values that I hope to eventually fix.
Official Cosmos-WireCloud integration is not currently available, although there are third-party widgets using Cosmos out there.
In my opinion, the best option for accessing the HDFS filesystem is WebHDFS (you will need to add a FIWARE token to the request for authentication).
It should also be possible to connect to Hive (see this ticket for more info).
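As a rough sketch of that WebHDFS route (assumptions: an HttpFS endpoint on port 14000, a valid FIWARE OAuth token, and placeholder host, user, and path values), reading a file could look roughly like this in Python:

```python
# Minimal sketch: read a file from Cosmos HDFS through WebHDFS, passing a
# FIWARE OAuth token in the X-Auth-Token header. Host, user, path and token
# are placeholders; adjust them to your Cosmos instance.
import requests

COSMOS_HOST = "cosmos.lab.fiware.org"    # placeholder
HDFS_USER = "myuser"                     # placeholder
HDFS_PATH = f"/user/{HDFS_USER}/mydata/part-00000"
TOKEN = "my-fiware-oauth-token"          # placeholder

url = f"http://{COSMOS_HOST}:14000/webhdfs/v1{HDFS_PATH}"
params = {"op": "OPEN", "user.name": HDFS_USER}
headers = {"X-Auth-Token": TOKEN}

resp = requests.get(url, params=params, headers=headers, allow_redirects=True)
resp.raise_for_status()

# Each line of the HDFS file could then be mapped to the NGSI-like structure
# your widget or operator expects.
for line in resp.text.splitlines():
    print(line)
```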

Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6, and we have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried to run "mlcp" locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to export a specific database to AWS S3?
It would be OK for me to end up with one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to S3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
s3:// is a native file scheme in MarkLogic, so you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to build aggregates, you may want to marry this idea with Sam's suggestion of CORB2 to control the process and help group your whole database into multiple manageable aggregate documents, then use a post-batch task to run xdmp:save.
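Just to illustrate the end goal (one newline-delimited JSON file in S3), here is a hedged Python sketch that takes a different route from xdmp:save/CORB2: it pages through documents with the MarkLogic REST API and uploads the result to S3 with boto3. The host, credentials, bucket, and key are placeholders, the REST app server is assumed to target the right database, and a real export of a large database would need streaming or multipart upload rather than an in-memory buffer.

```python
# Alternative sketch (not the xdmp:save/CORB2 approach above): page through
# documents with the MarkLogic REST API and upload them to S3 as one
# newline-delimited JSON file. All names and credentials are placeholders.
import io
import json

import boto3
import requests
from requests.auth import HTTPDigestAuth

ML_BASE = "http://marklogic-host:8000"        # placeholder REST app server
AUTH = HTTPDigestAuth("admin", "password")    # placeholder credentials
PAGE = 100

buffer = io.StringIO()
start = 1
while True:
    # Page through search results to collect document URIs.
    r = requests.get(f"{ML_BASE}/v1/search",
                     params={"format": "json", "start": start, "pageLength": PAGE},
                     auth=AUTH)
    r.raise_for_status()
    results = r.json().get("results", [])
    if not results:
        break
    for hit in results:
        doc = requests.get(f"{ML_BASE}/v1/documents",
                           params={"uri": hit["uri"]},
                           auth=AUTH)
        doc.raise_for_status()
        # One JSON document per line, as requested.
        buffer.write(json.dumps(doc.json()) + "\n")
    start += PAGE

s3 = boto3.client("s3")
s3.put_object(Bucket="my-export-bucket",      # placeholder bucket
              Key="marklogic-export.ndjson",
              Body=buffer.getvalue().encode("utf-8"))
```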
Thanks, guys, for your answers. I didn't know about CORB2; it's a great solution! Unfortunately, due to bad I/O, I would prefer a solution that writes directly to S3.
I can use a basic MarkLogic query and dump to s3:// with the native connector, but I always hit memory errors, even when launching with the "spawn" function to generate a background process.
Do you have any XQuery example that extracts each document to S3 one by one without running out of memory?
Thanks

Read messages from SQS into Dataflow

I've got a bunch of data being generated in AWS S3, with PUT notifications being sent to SQS whenever a new file arrives in S3. I'd like to load the contents of these files into BigQuery, so I'm working on setting up a simple ETL in Google Dataflow. However, I can't figure out how to integrate Dataflow with any service that it doesn't already support out of the box (Pubsub, Google Cloud Storage, etc.).
The GDF docs say:
In the initial release of Cloud Dataflow, extensibility for Read and Write transforms has not been implemented.
I think I can confirm this, as I tried to write a Read transform and wasn't able to figure out how to make it work (I tried to base an SqsIO class on the provided PubsubIO class).
So I've been looking at writing a custom source for Dataflow, but I can't wrap my head around how to adapt a Source to poll SQS for changes. It doesn't really seem like the right abstraction anyway, but I wouldn't really care as long as I could get it working.
Additionally, it looks like I'd have to do some work to download the S3 files (I tried creating a Reader for that as well, with no luck because of the reason mentioned above).
Basically, I'm stuck. Any suggestions for integrating SQS and S3 with Dataflow would be very appreciated.
The Dataflow Java SDK now includes an API for defining custom unbounded sources:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/UnboundedSource.java
This can be used to implement a custom SQS Source.

Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink:
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here.
For more information about Apache Nutch, please see the Nutch wiki.
In this example, I would like to get the sentence containing the link, plus the sentence before and after it. Is there any way to do this efficiently? Are there any methods I can invoke to get something like the position of the link within the fetched content? Or even a part of the Nutch code I can modify to do this? Thanks!
What you want to do is web scraping. Python and Hadoop offer tools for that; to achieve it, you can use selectors (see the sketch after the links below).
Here you can find some examples of how to do that using Python Scrapy:
Selectors
Scrapy Tutorial
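As a small illustration of the selector idea, here is a hedged Python sketch (not Nutch-specific code, and not something Scrapy provides out of the box): it uses Scrapy's Selector on an already-fetched HTML string to locate a given outlink, then pulls out the sentence containing it plus its neighbours with a deliberately naive regex splitter. The HTML and target URL are placeholders.

```python
# Sketch: find the sentence containing a given outlink plus its neighbours,
# using a Scrapy Selector on already-fetched HTML. The HTML and target URL
# are placeholders; the sentence splitting is deliberately naive.
import re

from scrapy.selector import Selector

html = """
<p>Nutch can run on a single machine, but gains a lot of its strength from
running in a Hadoop cluster. You can download <a href="http://nutch.apache.org/">Nutch here</a>.
For more information about Apache Nutch, please see the Nutch wiki.</p>
"""
target_url = "http://nutch.apache.org/"

sel = Selector(text=html)
# Anchor text of the outlink we care about.
anchor_text = sel.xpath(f'//a[@href="{target_url}"]/text()').get()

# Flatten the page to plain text and split it into sentences (naively).
page_text = " ".join(sel.xpath("//body//text()").getall())
page_text = re.sub(r"\s+", " ", page_text).strip()
sentences = re.split(r"(?<=[.!?])\s+", page_text)

# Print the sentence containing the link plus one sentence on either side.
for i, sentence in enumerate(sentences):
    if anchor_text and anchor_text in sentence:
        context = sentences[max(0, i - 1): i + 2]
        print(" ".join(context))
        break
```

If you want this inside Nutch itself, an HtmlParseFilter plugin is probably the closest extension point, since it runs over the parsed document where both the text and the outlinks are available.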
On Hadoop, the best way to go is to implement the crawl using selectors:
Web crawl with Hadoop
HiveQL
Cascading can be used to address the URLs you specify:
Hadoop and Cascading
Once you have the data, you can also use R to refine the analysis:
R and Hadoop
Enabling R on Hadoop
If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at HUE Beeswax, an interactive tool that is very useful for data analysis.