Cascading Tap Configuration for Amazon S3 to Local File System

So I'm trying out the Cascading framework and I was able to run apps locally. As a next step I want to use Cascading to download files from S3 to the local file system, and I'm planning to use FileTap. When I googled, I found S3FS, but it looks like it's deprecated. Can someone point me in the right direction? In other words, what is the right way to create a FileTap for Amazon S3 with an access ID, secret key, and the like?

Though this code is written for the HadoopFlowConnector, it will also work with the LocalFlowConnector if you use a FileTap.
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main {

    public void run(String[] args) {
        Properties properties = new Properties();
        String accessKey = args[0];
        String secretKey = args[1];
        // Better: put these keys in a Hadoop XML config file.
        // For the S3 block file system:
        properties.setProperty("fs.s3.awsAccessKeyId", accessKey);
        properties.setProperty("fs.s3.awsSecretAccessKey", secretKey);
        // For the S3 native file system:
        // properties.setProperty("fs.s3n.awsAccessKeyId", accessKey);
        // properties.setProperty("fs.s3n.awsSecretAccessKey", secretKey);
        // properties.setProperty("fs.defaultFS", "hdfs://localhost:8020/");
        // properties.setProperty("fs.permissions.umask-mode", "007");
        AppProps.setApplicationJarClass(properties, Main.class);
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

        String input = "s3://my-bucket/my-log.csv";
        // If using the native S3 file system:
        // String input = "s3n://my-bucket/my-log.csv";
        Tap inTap = new Hfs(new TextDelimited(false, ";"), input);
        Pipe copyPipe = new Pipe("copy");
        Tap outTap = new Hfs(new TextDelimited(false, ";"), "data/output");

        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap);
        flowConnector.connect(flowDef).complete();
    }

    public static void main(String[] args) {
        new Main().run(args);
    }
}
Code courtesy of http://whiteboxdeveloper.blogspot.com/2015/01/processing-data-from-aws-s3-using.html.
I just added the commented-out S3N input line. Since I was using S3N-based access, I uncommented the S3N properties and used those.
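If you want to stay entirely in Cascading local mode, a minimal sketch with FileTap and LocalFlowConnector looks like the following (assuming the cascading-local artifact is on the classpath). Note that FileTap only reads the local file system and cannot open s3:// URLs itself, so this sketch assumes the S3 object has already been fetched to a local path, e.g. with aws s3 cp; the file paths here are illustrative.

import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;

public class LocalCopy {
    public static void main(String[] args) {
        // FileTap reads the local file system only; both paths are assumptions.
        Tap inTap = new FileTap(new TextDelimited(false, ";"), "data/input/my-log.csv");
        Tap outTap = new FileTap(new TextDelimited(false, ";"), "data/output/my-log-copy.csv");
        Pipe copyPipe = new Pipe("copy");
        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap);
        new LocalFlowConnector().connect(flowDef).complete();
    }
}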

Related

Copy Data From Azure Blob Storage to AWS S3

I am new to Azure Data Factory and have an interesting requirement.
I need to move files from Azure Blob Storage to Amazon S3, ideally using Azure Data Factory.
However, S3 isn't supported as a sink:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
I also understand, from a variety of comments I've read on here, that you cannot directly copy from Blob Storage to S3 - you would need to download the file locally and then upload it to S3.
Does anyone know of any examples, in Data Factory, SSIS or Azure Runbook, that can do such a thing? I suppose an option would be to write an Azure Logic App or Function that is called from Data Factory.
I managed to get something working on this - it might be useful for someone else.
I decided to write an Azure Function that uses an HTTP request as a trigger.
These two posts helped me a lot:
How can I use NuGet packages in my Azure Functions?
Copy from Azure Blob to AWS S3 using C#
Please note my answer on the NuGet packages if you are using Azure Functions 2.x.
Here is the code - you can modify the basis of this to your needs.
I return a JSON-serialized object because Azure Data Factory requires this as the response to an HTTP request sent from a pipeline:
#r "Microsoft.WindowsAzure.Storage"
#r "Newtonsoft.Json"
#r "System.Net.Http"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
using Microsoft.WindowsAzure.Storage.Blob;
using System.Net.Http;
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.S3.Transfer;
using Amazon.S3.Util;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
log.LogInformation("Example Function has recieved a HTTP Request");
// get Params from query string
string blobUri = req.Query["blobUri"];
string bucketName = req.Query["bucketName"];
// Validate query string
if (String.IsNullOrEmpty(blobUri) || String.IsNullOrEmpty(bucketName)) {
Result outcome = new Result("Invalid Parameters Passed to Function",false,"blobUri or bucketName is null or empty");
return new BadRequestObjectResult(outcome.ConvertResultToJson());
}
// cast the blob to its type
Uri blobAbsoluteUri = new Uri(blobUri);
CloudBlockBlob blob = new CloudBlockBlob(blobAbsoluteUri);
// Do the Copy
bool resultBool = await CopyBlob(blob, bucketName, log);
if (resultBool) {
Result outcome = new Result("Copy Completed",true,"Blob: " + blobUri + " Copied to Bucket: " + bucketName);
return (ActionResult)new OkObjectResult(outcome.ConvertResultToJson());
}
else {
Result outcome = new Result("ERROR",false,"Copy was not successful Please review Application Logs");
return new BadRequestObjectResult(outcome.ConvertResultToJson());
}
}
static async Task<bool> CopyBlob(CloudBlockBlob blob, string existingBucket, ILogger log) {
var accessKey = "myAwsKey";
var secretKey = "myAwsSecret";
var keyName = blob.Name;
// Make the client
AmazonS3Client myClient = new AmazonS3Client(accessKey, secretKey, Amazon.RegionEndpoint.EUWest1);
// Check the Target Bucket Exists;
bool bucketExists = await AmazonS3Util.DoesS3BucketExistAsync (myClient,existingBucket);
if (!bucketExists) {
log.LogInformation("Bucket: " + existingBucket + " does not exist or is inaccessible to the application");
return false;
}
// Set up the Transfer Utility
TransferUtility fileTransferUtility = new TransferUtility(myClient);
// Stream the file
try {
log.LogInformation("Starting Copy");
using (var stream = await blob.OpenReadAsync()) {
// Note: You need permissions to not be private on the source blob
log.LogInformation("Streaming");
await fileTransferUtility.UploadAsync(stream,existingBucket,keyName);
log.LogInformation("Streaming Done");
}
log.LogInformation("Copy completed");
}
catch (AmazonS3Exception e) {
log.LogInformation("Error encountered on server. Message:'{0}' when writing an object", e.Message);
}
catch (Exception e) {
log.LogInformation("Unknown encountered on server. Message:'{0}' when writing an object", e.Message);
return false;
}
return true;
}
public class Result {
public string result;
public bool outcome;
public string UTCtime;
public string details;
public Result(string msg, bool outcomeBool, string fullMsg){
result=msg;
UTCtime=DateTime.Now.ToString("yyyy-MM-dd h:mm:ss tt");
outcome=outcomeBool;
details=fullMsg;
}
public string ConvertResultToJson() {
return JsonConvert.SerializeObject(this);
}
}
You can use Skyplane to copy data across clouds (it claims a 110X speedup over CLI tools, with automatic compression to save on egress). To transfer from Azure Blob Storage to S3 you can call one of these commands:
skyplane cp -r az://azure-bucket-name/ s3://aws-bucket-name/
skyplane sync -r az://azure-bucket-name/ s3://aws-bucket-name/
ADF now includes SFTP as a sink. From the same link provided in the question (sink support is in the far-right column):
Using the AWS Transfer Family you can set up an SFTP server and add a user with an SSH public key, then use that configuration to set up an SFTP connection from ADF that will write directly to an S3 bucket.
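For reference, a rough sketch of scripting that AWS side with the AWS SDK for Java v2 Transfer Family client. The role ARN, bucket, user name, and key below are placeholders, and the builder names should be checked against the SDK docs; the same setup can be done entirely in the AWS console.

import software.amazon.awssdk.services.transfer.TransferClient;
import software.amazon.awssdk.services.transfer.model.*;

public class SftpToS3Setup {
    public static void main(String[] args) {
        TransferClient transfer = TransferClient.create();
        // Create an SFTP server with service-managed users.
        CreateServerResponse server = transfer.createServer(CreateServerRequest.builder()
                .protocols(Protocol.SFTP)
                .identityProviderType(IdentityProviderType.SERVICE_MANAGED)
                .build());
        // Add a user whose home directory maps to the target S3 bucket;
        // the IAM role must grant that user access to the bucket.
        transfer.createUser(CreateUserRequest.builder()
                .serverId(server.serverId())
                .userName("adf-user")
                .role("arn:aws:iam::123456789012:role/transfer-s3-access")
                .homeDirectory("/aws-bucket-name")
                .sshPublicKeyBody("ssh-rsa AAAA...")
                .build());
    }
}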
Download the files from Azure Storage using AzCopy into a temporary local directory.
You can download the files from Azure cloud storage to your local system with the command below; use the recursive flag to copy all the files:
azcopy /Source:[source_container_url] /Dest:[local_file_path] /SourceKey:[source_storage_account_access_key] /S
Then upload the local files to Amazon S3 using the aws s3 cp command:
aws s3 cp local_file_path s3://my-bucket/ --recursive

Sensenet: Upload Files through Sensenet Client API and Set Modified User

I have a requirement that consists of uploading files from another system to sensenet.
I'm trying to use the sensenet Client API to upload files, but I'm having difficulty using the examples documented at the following links:
Client Library (the code runs fine but the file doesn't appear in sensenet)
Common API Calls (I'm having trouble compiling the code... instantiating the BinaryData object)
Besides this, for each uploaded file I need to set "Modified By" to a user that I specify in my code, not the user I authenticate with in the API.
I think rewriting the ModifiedBy field is an edge case (or a small hack), but it is possible without any magic (see the code). The easiest way is a POST followed by a PATCH, which is perfectly manageable with SenseNet.Client (the code uses a local demo site):
static void Main(string[] args)
{
    ClientContext.Initialize(new[]
        { new ServerContext { Url = "http://localhost", Username = "admin", Password = "admin" } });

    var localFilePath = @"D:\Projects\ConsoleApplication70\TestFileFromConsole1.txt";
    var parentPath = "/Root/Sites/Default_Site/workspaces/Document/londondocumentworkspace/Document_Library";
    var fileName = "TestFileFromConsole1.txt";
    var path = parentPath + "/" + fileName;
    var userPath = "/Root/IMS/BuiltIn/Demo/ProjectManagers/alba";

    using (var stream = new FileStream(localFilePath, FileMode.Open))
        Content.UploadAsync(parentPath, fileName, stream).Wait();
    Console.WriteLine("Uploaded");

    Modify(path, userPath).Wait();
    Console.WriteLine("Modified");

    Console.Write("Press <enter> to exit...");
    Console.ReadLine();
}

// Rewrites the ModifiedBy field.
private static async Task Modify(string path, string userPath)
{
    var content = await Content.LoadAsync(path);
    content["ModifiedBy"] = userPath;
    await content.SaveAsync();
}

How to read an external file in Amazon Elastic MapReduce

Hi, I'm new to using Amazon EMR and Hadoop. I was wondering how to read an external file (stored in S3) from an EMR job. For example, I have a file containing a long list of blacklisted strings. When my EMR job is processing my input, how do I get the job to read in this list of blacklisted strings beforehand in order to use it during processing?
I tried using a regular Java Scanner class and hardcoding the S3 path to the file, but that didn't seem to work, although I could just be doing it wrong...
I'd do something like this (sorry, the code is Scala, not Java, but it's the same idea).
Pass the path in as an argument to your main method.
Set that as a property in your configuration:
val conf = new Configuration()
conf.set("blacklist.file", args(0))
In the mapper's setup method, read the file:
var blacklist: List[String] = List()

override def setup(context: Context) {
  val path = new Path(context.getConfiguration.get("blacklist.file"))
  val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
  blacklist = scala.io.Source.fromInputStream(fileSystem.open(path)).getLines.toList
}
It would be better to add this file to the distributed cache, as follows:
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    // Read the file and build a HashMap/List or something which can be accessed from the map/reduce methods as desired.
    ...
}

context path for file upload without HttpRequest in REST application

I am building a REST application. I want to upload a file and save it, for example, in /WEB-INF/resource/uploads.
How can I get the path to this directory? My controller looks like this:
#RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, #RequestParam("file") MultipartFile file, Model model) {
try {
String fileName = null;
InputStream inputStream = null;
OutputStream outputStream = null;
if (file.getSize() > 0) {
inputStream = file.getInputStream();
fileName = "D:/" + file.getOriginalFilename();
outputStream = new FileOutputStream(fileName);
int readBytes = 0;
byte[] buffer = new byte[10000];
while ((readBytes = inputStream.read(buffer, 0, 10000)) != -1) {
outputStream.write(buffer, 0, readBytes);
}
outputStream.close();
inputStream.close();
}
} catch(Exception ex) {
ex.printStackTrace();
}
model.addAttribute("step", 3);
this.houseDao.update(house);
return "houseAdmin";
}
Second question: what is the best place to store user-uploaded files?
/WEB-INF is a bad place to try to store file uploads. There's no guarantee that this is an actual directory on the disk, and even if it is, the appserver may forbid write access to it.
Where you should store your files depends on what you want to do with them and what operating system you're running on. Just pick somewhere outside of the webapp itself, is my advice. Perhaps create a dedicated directory, as sketched below.
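For example, a minimal sketch of that dedicated-directory approach (the upload.dir property name and the default path are illustrative assumptions, not an established convention):

import java.io.File;

// Resolve a dedicated upload directory outside the webapp.
File uploadDir = new File(System.getProperty("upload.dir", "/var/myapp/uploads"));
if (!uploadDir.exists() && !uploadDir.mkdirs()) {
    throw new IllegalStateException("Cannot create upload directory: " + uploadDir);
}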
Also, the process of transferring the MultipartFile to another location is much simpler than you're making it out to be:
#RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, #RequestParam("file") MultipartFile srcFile, Model model) throws IOException {
File destFile = new File("/path/to/the/target/file");
srcFile.transferTo(destFile); // easy!
model.addAttribute("step", 3);
this.houseDao.update(house);
return "houseAdmin";
}
You shouldn't store files in /WEB-INF/resource/uploads. This directory is either inside your WAR (if packaged) or exploded somewhere inside the servlet container. The first destination is read-only and the latter should not be used for user files.
There are usually two places to consider when storing uploaded files:
A dedicated folder. Make sure users cannot access this directory directly (e.g. an anonymous FTP folder). Note that once your application runs on more than one machine you won't have access to this folder, so consider some form of network synchronization or a shared network drive.
A database. This is controversial, since binary files tend to occupy a lot of space. But this approach is a bit simpler when distributing your application; a minimal sketch follows.
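A minimal JDBC sketch of the database option (the uploads table and its columns are illustrative assumptions, not an established schema):

import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Stores an uploaded file as a BLOB; assumes a table like
//   CREATE TABLE uploads (name VARCHAR(255), content BLOB)
public void saveUpload(Connection conn, String name, InputStream in, long size) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO uploads (name, content) VALUES (?, ?)")) {
        ps.setString(1, name);
        ps.setBinaryStream(2, in, size);
        ps.executeUpdate();
    }
}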

How to write a string to Amazon S3 bucket?

How can I add a string as a file on Amazon S3? From whatever I searched, I got to know that we can upload a file to S3. What is the best way to upload data without creating a file?
There is an overload of the AmazonS3.putObject method that accepts the bucket string, a key string, and a string of text content. I hadn't seen it mentioned on Stack Overflow, so I'm putting this here. It's going to be similar to @Jonik's answer, but without the additional dependency.
AmazonS3 s3client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
s3client.putObject(bucket, key, contents);
It doesn't look as nice, but here is how you can do it using Amazon's Java client, which is probably what JetS3t does behind the scenes anyway.
private boolean putArtistPage(AmazonS3 s3, String bucketName, String key, String webpage)
{
    try
    {
        byte[] contentAsBytes = webpage.getBytes("UTF-8");
        ByteArrayInputStream contentsAsStream = new ByteArrayInputStream(contentAsBytes);
        ObjectMetadata md = new ObjectMetadata();
        md.setContentLength(contentAsBytes.length);
        s3.putObject(new PutObjectRequest(bucketName, key, contentsAsStream, md));
        return true;
    }
    catch (AmazonServiceException e)
    {
        log.log(Level.SEVERE, e.getMessage(), e);
        return false;
    }
    catch (Exception ex)
    {
        log.log(Level.SEVERE, ex.getMessage(), ex);
        return false;
    }
}
What is the best way to upload data without creating a file?
If you meant without creating a file on S3, well, you can't really do that. On Amazon S3, the only way to store data is as files, or, using more accurate terminology, objects. An object can contain from zero bytes to 5 terabytes of data, and is stored in a bucket. Amazon's S3 homepage lays out the basic facts quite clearly. (For other data-storing options on AWS, you might want to read e.g. about SimpleDB.)
If you meant without creating a local temporary file, then the answer depends on what library/tool you are using. (As RickMeasham suggested, please add more details!) With the s3cmd tool, for example, you can't skip creating a temp file, while with the JetS3t Java library uploading a String directly would be easy:
// (First init s3Service and testBucket)
S3Object stringObject = new S3Object("HelloWorld.txt", "Hello World!");
s3Service.putObject(testBucket, stringObject);
There is a simple way to do it with PHP: send the string as the body of the object, specifying the name of the new file in the key:
$s3->putObject(array(
    'Bucket'      => [Bucket name],
    'Key'         => [path/to/file.ext],
    'Body'        => [Your string goes here],
    'ContentType' => [specify mimetype if you want],
));
This will create a new file with the specified key, whose content is the given string.
If you're using Java, check out https://ivan-site.com/2015/11/interact-with-s3-without-temp-files/
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.*;
import java.nio.charset.StandardCharsets;

class S3StreamJacksonTest {
    private static final String S3_BUCKET_NAME = "bucket";
    private static final String S3_KEY_NAME = "key";
    private static final String CONTENT_TYPE = "application/json";
    private static final AmazonS3 AMAZON_S3 = new AmazonS3Client();
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
    private static final TestObject TEST_OBJECT = new TestObject("test", 123, 456L);

    public void testUploadWithStream() throws JsonProcessingException {
        String fileContentString = OBJECT_MAPPER.writeValueAsString(TEST_OBJECT);
        byte[] fileContentBytes = fileContentString.getBytes(StandardCharsets.UTF_8);
        InputStream fileInputStream = new ByteArrayInputStream(fileContentBytes);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType(CONTENT_TYPE);
        metadata.setContentLength(fileContentBytes.length);
        PutObjectRequest putObjectRequest = new PutObjectRequest(
                S3_BUCKET_NAME, S3_KEY_NAME, fileInputStream, metadata);
        AMAZON_S3.putObject(putObjectRequest);
    }
}
This works for me:
public static PutObjectResult WriteString(String bucket, String key, String stringToWrite, AmazonS3Client s3Client) {
    // Encode once so the MD5 and Content-Length match the bytes actually sent.
    byte[] contentBytes = stringToWrite.getBytes(StandardCharsets.UTF_8);
    ObjectMetadata meta = new ObjectMetadata();
    meta.setContentMD5(new String(com.amazonaws.util.Base64.encode(DigestUtils.md5(contentBytes))));
    meta.setContentLength(contentBytes.length); // byte length, not String#length()
    InputStream stream = new ByteArrayInputStream(contentBytes);
    return s3Client.putObject(bucket, key, stream, meta);
}
The sample code at https://docs.aws.amazon.com/AmazonS3/latest/dev/UploadObjSingleOpJava.html works for me.
s3Client.putObject(bucketName, stringObjKeyName, "Uploaded String Object");
It looks like this overload was added around SDK version 1.11.20, so make sure you are using that or a newer version.
https://javadoc.io/doc/com.amazonaws/aws-java-sdk-s3/1.11.20/com/amazonaws/services/s3/AmazonS3.html#putObject-java.lang.String-java.lang.String-java.lang.String-