Unable to use OrcFile.createReader to read from S3 - amazon-s3

I am having trouble reading an ORC file from S3 with OrcFile.createReader. I am using hive-exec-2.2.0.jar at the moment and am wondering whether this is supported at all. Am I missing any configuration settings? See the code below. Any help would be appreciated.
String accessKey = "***";
String secretKey = "***";
Configuration configuration = new Configuration();
configuration.set("fs.s3.awsAccessKeyId", accessKey);
configuration.set("fs.s3.awsSecretAccessKey", secretKey);
configuration.set("fs.defaultFS", "s3://<bucket>");
//configuration.set("fs.default.name", "s3://<bucket>");
//configuration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
FileSystem fs = FileSystem.get(configuration);
Reader reader = OrcFile.createReader(new Path("/some/path/file.orc"), OrcFile.readerOptions(configuration).filesystem(fs));
Exception - java.io.IOException: No such file.
The ReaderImpl seems to require either the fileMetadata or the OrcTail (both of which are null). Am I missing anything?
Update: I managed to get past the file-not-found exception by creating the S3 object with additional metadata (and also fixing the key):
--metadata="fs=Hadoop,fs-type=block,fs-version=1". See --metadata.
It seems wrong/weird that the ORC file on S3 has to carry those values for the metadata to be retrieved successfully.
Of course, after this it fails when reading the data, probably because the file formats differ(?).
In INode of the org.apache.hadoop.fs.s3 package:
public static INode deserialize(InputStream in) throws IOException {
if (in == null) {
return null;
} else {
DataInputStream dataIn = new DataInputStream(in);
INode.FileType fileType = FILE_TYPES[dataIn.readByte()];
The dataIn.readByte() call returns a value larger than expected (FILE_TYPES is an array of size 2).
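For what it's worth, the legacy s3:// scheme here is Hadoop's block filesystem (S3FileSystem), which stores data as its own INode/block objects rather than as plain files; that would explain both the missing-metadata behaviour and the FILE_TYPES index going out of range for an ordinary ORC file. Below is a minimal sketch (not a verified fix) of reading the same object through the s3a connector instead, assuming hadoop-aws and its dependencies are on the classpath and using the same hive-exec ORC classes as above; the bucket and key are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class OrcS3aReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3a credentials (same keys as above, different property names)
        conf.set("fs.s3a.access.key", "***");
        conf.set("fs.s3a.secret.key", "***");

        // s3a reads the object as an ordinary file, so no special
        // --metadata is needed when uploading it.
        Path orcPath = new Path("s3a://<bucket>/some/path/file.orc"); // placeholder
        Reader reader = OrcFile.createReader(orcPath, OrcFile.readerOptions(conf));
        System.out.println("rows = " + reader.getNumberOfRows());
    }
}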

Related

Cascading Tap Configuration for Amazon S3 to Local file System

So I'm trying out the Cascading framework and I was able to run apps in local mode. As a next step, I want to use Cascading to download files from S3 to the local file system. I'm planning to use FileTap. When I googled, I found S3FS, which looks like it's deprecated. Can someone point me in the right direction? In other words, what is the right way to create a FileTap for Amazon S3 with an access ID and secret key?
Though this code is for the HadoopFlowConnector, it will also work with the local flow connector if you use a FileTap.
public class Main {

    public void run(String[] args) {
        Properties properties = new Properties();
        String accessKey = args[0];
        String secretKey = args[1];

        // better put these keys to hadoop xml file
        // for block file system
        properties.setProperty("fs.s3.awsAccessKeyId", accessKey);
        properties.setProperty("fs.s3.awsSecretAccessKey", secretKey);

        // for s3 native file system
        // properties.setProperty("fs.s3n.awsAccessKeyId", accessKey);
        // properties.setProperty("fs.s3n.awsSecretAccessKey", secretKey);
        // properties.setProperty("fs.defaultFS", "hdfs://localhost:8020/");
        // properties.setProperty("fs.permissions.umask-mode", "007");

        AppProps.setApplicationJarClass(properties, Main.class);
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

        String input = "s3://my-bucket/my-log.csv";
        // If using the native S3
        // String input = "s3n://my-bucket/my-log.csv";

        Tap inTap = new Hfs(new TextDelimited(false, ";"), input);
        Pipe copyPipe = new Pipe("copy");
        Tap outTap = new Hfs(new TextDelimited(false, ";"), "data/output");

        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap);

        flowConnector.connect(flowDef).complete();
    }

    public static void main(String[] args) {
        new Main().run(args);
    }
}
Code courtesy http://whiteboxdeveloper.blogspot.com/2015/01/processing-data-from-aws-s3-using.html.
I just added the commented-out S3N input string line. Since I was using S3N-based access, I uncommented the S3N properties and used them.

TransferManager vs PutRequestObject in AmazonS3

I am confused about single-object upload and multipart upload. Both take the same amount of time. My code is below:
File file = new File("D:\\AmazonS3\\aws-java-sdk-1.8.3\\lib\\aws-java-sdk-1.8.3-javadoc.jar");
FileInputStream fis = new FileInputStream(file);
String keyName = System.currentTimeMillis()+"_aws-java-sdk-1.8.3-javadoc.jar";
ObjectMetadata metaData = new ObjectMetadata();
metaData.addUserMetadata("test","TEST");
//Object single upload
PutObjectRequest putObjectRequest = new PutObjectRequest(BUCKET_NAME, keyName, fis, metaData);
putObjectRequest.setMetadata(metaData);
s3client.putObject(putObjectRequest);
//Object multipart upload
TransferManagerConfiguration configuration = new TransferManagerConfiguration();
configuration.setMultipartUploadThreshold(5*com.amazonaws.services.s3.internal.Constants.MB);
TransferManager transferManager = new TransferManager(s3client);
transferManager.setConfiguration(configuration);
Upload upload = transferManager.upload(BUCKET_NAME, keyName, fis,metaData);
upload.waitForCompletion();
transferManager.shutdownNow();
Please help me: is there any problem in my code?
I've faced the same problem and found (by inspecting the SDK code) that TransferManager will use parallel part uploads only if you pass it a File, not an InputStream.
Look at the decision code from the SDK (version 1.8.9):
if (TransferManagerUtils.isUploadParallelizable(putObjectRequest, isUsingEncryption)) {
captureUploadStateIfPossible();
uploadPartsInParallel(requestFactory, multipartUploadId);
return null;
} else {
return uploadPartsInSeries(requestFactory);
}
where isUploadParallelizable is:
// Each uploaded part in an encrypted upload depends on the encryption context
// from the previous upload, so we cannot parallelize encrypted upload parts.
if (isUsingEncryption) return false;
// Otherwise, if there's a file, we can process the uploads concurrently.
return (getRequestFile(putObjectRequest) != null);
So, if you want to take advantage of parallel part uploads, pass a File to TransferManager.
When uploading options from a stream, callers must supply the size of
options in the stream through the content length field in the
ObjectMetadata parameter. If no content length is specified for the
input stream, then TransferManager will attempt to buffer all the
stream contents in memory and upload the options as a traditional,
single part upload. Because the entire stream contents must be
buffered in memory, this can be very expensive, and should be avoided
whenever possible.
source: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#upload-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-
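For comparison, here is a minimal sketch of the File-based variant, reusing s3client, BUCKET_NAME and keyName from the question; this is the case where isUploadParallelizable can return true and parts are uploaded concurrently.
// Sketch only: the File overload (unlike the InputStream one) lets
// TransferManager upload parts in parallel once the threshold is crossed.
File file = new File("D:\\AmazonS3\\aws-java-sdk-1.8.3\\lib\\aws-java-sdk-1.8.3-javadoc.jar");

TransferManagerConfiguration configuration = new TransferManagerConfiguration();
configuration.setMultipartUploadThreshold(5 * com.amazonaws.services.s3.internal.Constants.MB);

TransferManager transferManager = new TransferManager(s3client);
transferManager.setConfiguration(configuration);

Upload upload = transferManager.upload(BUCKET_NAME, keyName, file); // pass the File, not a stream
upload.waitForCompletion();
transferManager.shutdownNow();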

How to read an external file in Amazon Elastic MapReduce

Hi, I'm new to using Amazon EMR and Hadoop. I was wondering how to read an external file (stored in S3) from an EMR job. For example, I have a file containing a long list of blacklisted strings. When my EMR job is processing my input, how do I get the job to read in this list of blacklisted strings beforehand in order to use it during processing?
I tried using a regular Java Scanner class and hardcoding the S3 path to the file but that didn't seem to work, although I could just be doing it wrong...
I'd do something like this (sorry, the code is Scala not Java, but it's essentially the same):
Pass the path in as an argument to your main method.
Set that as a property in your configuration:
val conf = new Configuration()
conf.set("blacklist.file", args(0))
In the mapper's setup method, read the file:
var blacklist: List[String] = List()
override def setup(context: Context) {
val path = new Path(context.getConfiguration.get("blacklist.file"))
val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
blacklist = scala.io.Source.fromInputStream(fileSystem.open(path)).getLines.toList
}
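If you'd rather keep it in Java, a rough equivalent of that setup method might look like the sketch below (assumes the new-API org.apache.hadoop.mapreduce.Mapper; the input/output types are illustrative and the blacklist file is plain text with one entry per line):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BlacklistMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String> blacklist = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the path that the driver stored in the job configuration
        Path path = new Path(context.getConfiguration().get("blacklist.file"));
        FileSystem fs = FileSystem.get(path.toUri(), context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                blacklist.add(line);
            }
        }
    }

    // map(...) can consult the blacklist as needed
}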
It would be better to add this file to the distributed cache, as follows:
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    // Read the file and build a HashMap/List or something which can be accessed from map/reduce methods as desired.
    ...
}

context path for file upload without HttpRequest in REST application

I am building a REST application. I want to upload a file and save it, for example, in /WEB-INF/resource/uploads.
How can I get the path to this directory? My controller looks like this:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile file, Model model) {
    try {
        String fileName = null;
        InputStream inputStream = null;
        OutputStream outputStream = null;
        if (file.getSize() > 0) {
            inputStream = file.getInputStream();
            fileName = "D:/" + file.getOriginalFilename();
            outputStream = new FileOutputStream(fileName);
            int readBytes = 0;
            byte[] buffer = new byte[10000];
            while ((readBytes = inputStream.read(buffer, 0, 10000)) != -1) {
                outputStream.write(buffer, 0, readBytes);
            }
            outputStream.close();
            inputStream.close();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    model.addAttribute("step", 3);
    this.houseDao.update(house);
    return "houseAdmin";
}
Second question: what is the best place to store uploaded user files?
/WEB-INF is a bad place to try to store file uploads. There's no guarantee that this is an actual directory on the disk, and even if it is, the appserver may forbid write access to it.
Where you should store your files depends on what you want to do with them, and what operating system you're running on. Just pick somewhere outside of the webapp itself, is my advice. Perhaps create a dedicated directory.
Also, the process of transferring the MultipartFile to another location is much simpler than you're making it out to be:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile srcFile, Model model) throws IOException {
    File destFile = new File("/path/to/the/target/file");
    srcFile.transferTo(destFile); // easy!
    model.addAttribute("step", 3);
    this.houseDao.update(house);
    return "houseAdmin";
}
You shouldn't store files in /WEB-INF/resource/uploads. This directory is either inside your WAR (if packaged) or exploded somewhere inside the servlet container. The first destination is read-only and the latter should not be used for user files.
There are usually two places considered when storing uploaded files:
Some dedicated folder. Make sure users cannot access this directory directly (e.g. an anonymous FTP folder). Note that once your application runs on more than one machine you won't have shared access to this folder, so consider some form of network synchronization or a shared network drive (see the sketch after this list).
Database. This is controversial since binary files tend to occupy a lot of space. But this approach is a bit simpler when distributing your application.
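For the dedicated-folder option, here is a minimal sketch, assuming a Spring setup where the target directory comes from an externally configured property; the upload.dir property name and FileStorageService class are hypothetical, not part of any framework.
import java.io.File;
import java.io.IOException;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

@Service
public class FileStorageService {

    // Hypothetical property, e.g. upload.dir=/var/myapp/uploads (outside the webapp)
    @Value("${upload.dir}")
    private String uploadDir;

    public File store(MultipartFile file) throws IOException {
        File dir = new File(uploadDir);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("Could not create upload directory " + dir);
        }
        // In real code, sanitize the original filename before using it in a path
        File dest = new File(dir, file.getOriginalFilename());
        file.transferTo(dest); // same transferTo as in the controller example above
        return dest;
    }
}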

How to write a string to Amazon S3 bucket?

How can I add a string as a file on Amazon S3? From whatever I searched, I got to know that we can upload a file to S3. What is the best way to upload data without creating a file?
There is an overload of the AmazonS3.putObject method that accepts the bucket string, a key string, and a string of text content. I hadn't seen it mentioned on Stack Overflow, so I'm putting it here. It's similar to @Jonik's answer, but without the additional dependency.
AmazonS3 s3client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
s3client.putObject(bucket, key, contents);
It doesn't look as nice, but here is how you can do it using Amazon's Java client; it's probably what JetS3t does behind the scenes anyway.
private boolean putArtistPage(AmazonS3 s3, String bucketName, String key, String webpage)
{
    try
    {
        byte[] contentAsBytes = webpage.getBytes("UTF-8");
        ByteArrayInputStream contentsAsStream = new ByteArrayInputStream(contentAsBytes);
        ObjectMetadata md = new ObjectMetadata();
        md.setContentLength(contentAsBytes.length);
        s3.putObject(new PutObjectRequest(bucketName, key, contentsAsStream, md));
        return true;
    }
    catch (AmazonServiceException e)
    {
        log.log(Level.SEVERE, e.getMessage(), e);
        return false;
    }
    catch (Exception ex)
    {
        log.log(Level.SEVERE, ex.getMessage(), ex);
        return false;
    }
}
What is the best way to upload data without creating a file?
If you meant without creating a file on S3, well, you can't really do that. On Amazon S3, the only way to store data is as files, or, using more accurate terminology, objects. An object can contain from zero bytes to 5 terabytes of data, and is stored in a bucket. Amazon's S3 homepage lays out the basic facts quite clearly. (For other data-storing options on AWS, you might want to read e.g. about SimpleDB.)
If you meant without creating a local temporary file, then the answer depends on what library/tool you are using. (As RickMeasham suggested, please add more details!) With the s3cmd tool, for example, you can't skip creating a temp file, while with the JetS3t Java library uploading a String directly is easy:
// (First init s3Service and testBucket)
S3Object stringObject = new S3Object("HelloWorld.txt", "Hello World!");
s3Service.putObject(testBucket, stringObject);
There is a simple way to do it with PHP: simply send the string as the body of the object, specifying the name of the new file in the key:
$s3->putObject(array(
'Bucket' => [Bucket name],
'Key' => [path/to/file.ext],
'Body' => [Your string goes here],
'ContentType' => [specify mimetype if you want],
));
This will create a new file at the specified key, whose content is the provided string.
If you're using Java, check out https://ivan-site.com/2015/11/interact-with-s3-without-temp-files/
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.*;
import java.nio.charset.StandardCharsets;
class S3StreamJacksonTest {
    private static final String S3_BUCKET_NAME = "bucket";
    private static final String S3_KEY_NAME = "key";
    private static final String CONTENT_TYPE = "application/json";
    private static final AmazonS3 AMAZON_S3 = new AmazonS3Client();
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
    private static final TestObject TEST_OBJECT = new TestObject("test", 123, 456L);

    public void testUploadWithStream() throws JsonProcessingException {
        String fileContentString = OBJECT_MAPPER.writeValueAsString(TEST_OBJECT);
        byte[] fileContentBytes = fileContentString.getBytes(StandardCharsets.UTF_8);
        InputStream fileInputStream = new ByteArrayInputStream(fileContentBytes);

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType(CONTENT_TYPE);
        metadata.setContentLength(fileContentBytes.length);

        PutObjectRequest putObjectRequest = new PutObjectRequest(
                S3_BUCKET_NAME, S3_KEY_NAME, fileInputStream, metadata);
        AMAZON_S3.putObject(putObjectRequest);
    }
}
This works for me:
public static PutObjectResult WriteString(String bucket, String key, String stringToWrite, AmazonS3Client s3Client) {
    // Work with the UTF-8 bytes so the MD5 and content length match what is actually sent
    byte[] contentBytes = stringToWrite.getBytes(StandardCharsets.UTF_8);
    ObjectMetadata meta = new ObjectMetadata();
    meta.setContentMD5(new String(com.amazonaws.util.Base64.encode(DigestUtils.md5(contentBytes))));
    meta.setContentLength(contentBytes.length); // byte length, not character count
    InputStream stream = new ByteArrayInputStream(contentBytes);
    return s3Client.putObject(bucket, key, stream, meta);
}
The sample code at https://docs.aws.amazon.com/AmazonS3/latest/dev/UploadObjSingleOpJava.html works for me.
s3Client.putObject(bucketName, stringObjKeyName, "Uploaded String Object");
Looks like this was added around 1.11.20, so make sure you are using that or a newer version of the SDK.
https://javadoc.io/doc/com.amazonaws/aws-java-sdk-s3/1.11.20/com/amazonaws/services/s3/AmazonS3.html#putObject-java.lang.String-java.lang.String-java.lang.String-