Hi, I'm new to Amazon EMR and Hadoop. I was wondering how to read an external file (stored in S3) from an EMR job. For example, I have a file containing a long list of blacklisted strings. When my EMR job is processing my input, how do I get the job to read in this list of blacklisted strings beforehand in order to use it during processing?
I tried using a regular Java Scanner class and hardcoding the S3 path to the file but that didn't seem to work, although I could just be doing it wrong...
I'd do something like this (sorry, the code is Scala rather than Java, but the approach is the same):
Pass the path in as an argument to your main method.
Set that as a property in your configuration:
val conf = new Configuration()
conf.set("blacklist.file", args(0))
In the mapper's setup method, read the file:
var blacklist: List[String] = List()
override def setup(context: Context) {
val path = new Path(context.getConfiguration.get("blacklist.file"))
val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
blacklist = scala.io.Source.fromInputStream(fileSystem.open(path)).getLines.toList
}
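If you'd rather stay in Java, a rough equivalent of that setup() might look like the following. This is a sketch, assuming the new org.apache.hadoop.mapreduce Mapper API and a UTF-8 encoded, one-entry-per-line blacklist file; it belongs inside your Mapper subclass.
private List<String> blacklist = new ArrayList<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // "blacklist.file" is the property set on the Configuration in the driver
    Path path = new Path(context.getConfiguration().get("blacklist.file"));
    FileSystem fileSystem = FileSystem.get(path.toUri(), context.getConfiguration());
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fileSystem.open(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            blacklist.add(line);
        }
    }
}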
It would be better to add this file to the distributed cache, as follows:
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;
@Override
public void configure(JobConf job) {
s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
FileInputStream fstream = new FileInputStream(s3FilePath.toString());
// Read the file and build a HashMap/List or something which can be accessed from map/reduce methods as desired.
...
}
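For completeness, here is a sketch of what a full configure() could look like with the old mapred API, building a HashSet for fast lookups. The blacklist field name and the assumption that the cached file is plain text with one entry per line are mine, not from the original code.
private Set<String> blacklist = new HashSet<String>();

@Override
public void configure(JobConf job) {
    try {
        // The file added via DistributedCache.addCacheFile() is available locally on each node
        Path cachedFile = DistributedCache.getLocalCacheFiles(job)[0];
        BufferedReader reader = new BufferedReader(new FileReader(cachedFile.toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            blacklist.add(line.trim());
        }
        reader.close();
    } catch (IOException e) {
        throw new RuntimeException("Could not read blacklist file from distributed cache", e);
    }
}
Your map()/reduce() methods can then simply call blacklist.contains(...) on whatever values they need to check.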
How would one write a Lucene 8.11 ByteBuffersDirectory to disk?
Something similar to Lucene 2.9.4's Directory.copy(directory, FSDirectory.open(indexPath), true).
You can use the copyFrom method to do this.
For example:
You are using a ByteBuffersDirectory:
final Directory dir = new ByteBuffersDirectory();
Assuming you are not concurrently writing any new data to that dir, you can declare a target where you want to write the data - for example, an FSDirectory (a file system directory):
Directory to = FSDirectory.open(Paths.get(OUT_DIR_PATH));
Use whatever string you want for the OUT_DIR_PATH location.
Then you can iterate over all the files in the original dir object, writing them to this new to location:
IOContext ctx = new IOContext();
for (String file : dir.listAll()) {
System.out.println(file); // just for testing
to.copyFrom(dir, file, file, ctx);
}
This will create the new OUT_DIR_PATH dir and populate it with files, such as:
_0.cfe
_0.cfs
_0.si
segments_1
... or whatever files you happen to have in your dir.
Caveat:
I have only used this with a default IOContext object. There are other constructors for the context - not sure what they do. I assume they give you more control over how the write is performed.
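For what it's worth, you can also skip constructing a context yourself and use the predefined constant, which should behave the same for this kind of copy (at least in Lucene 8.x):
to.copyFrom(dir, file, file, IOContext.DEFAULT);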
Meanwhile I figured it out myself and created a straightforward method for it:
@SneakyThrows
public static void copyIndex(ByteBuffersDirectory ramDirectory, Path destination) {
FSDirectory fsDirectory = FSDirectory.open(destination);
Arrays.stream(ramDirectory.listAll())
.forEach(fileName -> {
try {
// IOContext is null because it is in fact not used (at least for the moment)
fsDirectory.copyFrom(ramDirectory, fileName, fileName, null);
} catch (IOException e) {
log.error(e.getMessage(), e);
}
});
}
I am having trouble reading an ORC file from S3 with the OrcFile.createReader option. I am using hive-exec-2.2.0.jar at the moment and am wondering whether this is supported at all. Am I missing any configuration settings? See the code below. Any help will be appreciated.
String accessKey = "***";
String secretKey = "***";
Configuration configuration = new Configuration();
configuration.set("fs.s3.awsAccessKeyId", accessKey);
configuration.set("fs.s3.awsSecretAccessKey", secretKey);
configuration.set("fs.defaultFS", "s3://<bucket>");
//configuration.set("fs.default.name", "s3://<bucket>");
//configuration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
FileSystem fs = FileSystem.get(configuration);
Reader reader = OrcFile.createReader(new Path("/some/path/file.orc"), OrcFile.readerOptions(configuration).filesystem(fs));
Exception - java.io.IOException: No such file.
The ReaderImpl seems to require either the fileMetadata or the OrcTail (both of which are null). Is there anything that I might be missing?
Update: I managed to get past the file-not-found exception by creating the S3 object with additional metadata (and also fixing the key):
--metadata="fs=Hadoop,fs-type=block,fs-version=1". See --metadata.
It seems wrong/weird that an ORC file on S3 has to carry those metadata values for the reader to successfully retrieve its metadata.
Of course, after this it fails while reading the data, probably because the file formats differ(?).
In INode of the org.apache.hadoop.fs.s3 package:
public static INode deserialize(InputStream in) throws IOException {
if (in == null) {
return null;
} else {
DataInputStream dataIn = new DataInputStream(in);
INode.FileType fileType = FILE_TYPES[dataIn.readByte()];
The dataIn.readByte() call returns a value that is out of range (FILE_TYPES is an array of size 2).
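That readByte() failure is consistent with the legacy s3:// scheme being a block filesystem: it expects objects written in its own INode/block format rather than plain files uploaded to S3. As a hedged sketch (not something from the original post), the same read against a plain ORC object could go through the S3A connector instead, assuming hadoop-aws and a matching AWS SDK are on the classpath; the fs.s3a.* property names are from Hadoop 2.7+.
Configuration configuration = new Configuration();
// S3A reads plain objects directly; no block-format metadata is required
configuration.set("fs.s3a.access.key", accessKey);
configuration.set("fs.s3a.secret.key", secretKey);
Path orcPath = new Path("s3a://<bucket>/some/path/file.orc");
FileSystem fs = FileSystem.get(orcPath.toUri(), configuration);
Reader reader = OrcFile.createReader(orcPath,
        OrcFile.readerOptions(configuration).filesystem(fs));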
I am currently trying to use Apache Camel with an S3 bucket on AWS. We currently have one bucket for each environment of our code base (for example: a dev bucket, an e2e bucket, and a prod bucket).
We would like to use one bucket with multiple directories inside to avoid the messiness and consolidate the space. We could then point Camel at a folder within the bucket for each endpoint to listen to (which seems feasible). This has turned out to be extremely problematic to configure: Camel seems to expect a separate bucket for each endpoint it connects to.
Can someone shed light on whether this is possible, or whether it is necessary to have one bucket per endpoint? The Camel documentation does not say.
Indeed, as Alex Nano wrote, S3 "folders" are just prefixes of the object key (the "filename"), so if you want to set up multiple routes monitoring different S3 "folders" you can use the prefix option of the aws-s3 endpoint.
Prefixes work the same way they do with the AWS S3 CLI,
e.g.
aws s3 ls s3://mybucket/myfolderInsideBucket/
A JUnit example:
public class CamelAwsS3Test extends CamelTestSupport {
@EndpointInject(uri = "mock:result")
private MockEndpoint resultEndpoint;
@Before
public void setup() throws Exception {
context.start();
}
@Test
public void foo() throws Exception {
int expectedFileCount = 3;
resultEndpoint.expectedMessageCount(expectedFileCount);
resultEndpoint.assertIsSatisfied();
}
@Override
protected JndiRegistry createRegistry() throws Exception {
JndiRegistry registry = super.createRegistry();
final String accessKey = "ACCESSKEY";
final String secretKey = "SECRETKEY";
AmazonS3 client = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
registry.bind("amazonS3Client", client);
return registry;
}
@Override
protected RoutesBuilder createRouteBuilder() throws Exception {
return new RouteBuilder() {
@Override
public void configure() throws Exception {
from("aws-s3://{BUCKET}?amazonS3Client=#amazonS3Client&prefix=${FOLDER}")
.log("hey ${body}").to("mock:result");
}
};
}
}
A few thoughts:
S3 only looks like it has folders; in reality, each object is a single name:value pair (bucket:key).
That is, only the top-level buckets are real containers; what looks like a path is just part of the key.
So just use one bucket, and then filter on the key with a predicate (http://camel.apache.org/predicate.html) in a choice/case statement to process what you need; see the sketch after this list.
I'm sure there are other ways, but this seems straightforward.
Also note the AWS S3 limits that justify your impulse to consolidate:
only 100 buckets per AWS account (by default)
bucket names must be globally unique (i.e. across all AWS accounts)
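As a sketch of the "filter on the key" idea (the bucket name, prefixes, and target endpoints below are placeholders, not anything from the original setup):
new RouteBuilder() {
    @Override
    public void configure() {
        from("aws-s3://my-bucket?amazonS3Client=#amazonS3Client")
            .choice()
                // CamelAwsS3Key holds the full object key, including the "folder" prefix
                .when(header("CamelAwsS3Key").startsWith("dev/"))
                    .to("direct:dev")
                .when(header("CamelAwsS3Key").startsWith("prod/"))
                    .to("direct:prod")
                .otherwise()
                    .to("direct:other");
    }
};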
So I'm trying out the Cascading framework and I was able to run apps in local mode. As a next step I want to use Cascading to download files from S3 to the local file system. I'm planning to use FileTap. When I googled, I found S3FS, but it looks like it's deprecated. Can someone point me in the right direction? In other words, what is the right way to create a FileTap for Amazon S3 with an access key ID and secret key?
Though this code uses the HadoopFlowConnector, it will also work with the LocalFlowConnector if you use a FileTap.
public class Main {
public void run(String[] args) {
Properties properties = new Properties();
String accessKey = args[0];
String secretKey = args[1];
// better to put these keys in a Hadoop XML config file
// for block file system
properties.setProperty("fs.s3.awsAccessKeyId", accessKey);
properties.setProperty("fs.s3.awsSecretAccessKey", secretKey);
// for s3 native file system
// properties.setProperty("fs.s3n.awsAccessKeyId", accessKey);
// properties.setProperty("fs.s3n.awsSecretAccessKey", secretKey);
// properties.setProperty("fs.defaultFS", "hdfs://localhost:8020/");
// properties.setProperty("fs.permissions.umask-mode", "007");
AppProps.setApplicationJarClass(properties, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(
properties);
String input = "s3://my-bucket/my-log.csv";
// If using the native S3
// String input = "s3n://my-bucket/my-log.csv";
Tap inTap = new Hfs(new TextDelimited(false, ";"), input);
Pipe copyPipe = new Pipe("copy");
Tap outTap = new Hfs(new TextDelimited(false, ";"),
"data/output");
FlowDef flowDef = FlowDef.flowDef()
.addSource(copyPipe, inTap)
.addTailSink(copyPipe, outTap);
flowConnector.connect(flowDef).complete();
}
public static void main(String[] args) {
new Main().run(args);
}
}
Code courtesy of http://whiteboxdeveloper.blogspot.com/2015/01/processing-data-from-aws-s3-using.html.
I just added the commented-out S3N input line. Since I was using S3N-based access, I uncommented the S3N properties and used those.
I am building a REST application. I want to upload a file and save it, for example, in /WEB-INF/resource/uploads.
How can I get the path to this directory? My controller looks like this:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile file, Model model) {
try {
String fileName = null;
InputStream inputStream = null;
OutputStream outputStream = null;
if (file.getSize() > 0) {
inputStream = file.getInputStream();
fileName = "D:/" + file.getOriginalFilename();
outputStream = new FileOutputStream(fileName);
int readBytes = 0;
byte[] buffer = new byte[10000];
while ((readBytes = inputStream.read(buffer, 0, 10000)) != -1) {
outputStream.write(buffer, 0, readBytes);
}
outputStream.close();
inputStream.close();
}
} catch(Exception ex) {
ex.printStackTrace();
}
model.addAttribute("step", 3);
this.houseDao.update(house);
return "houseAdmin";
}
Second question: what is the best place to store uploaded user files?
/WEB-INF is a bad place to try to store file uploads. There's no guarantee that this is an actual directory on the disk, and even if it is, the appserver may forbid write access to it.
Where you should store your files depends on what you want to do with them and what operating system you're running on. My advice is to just pick somewhere outside of the webapp itself, and perhaps create a dedicated directory for uploads.
Also, the process of transferring the MultipartFile to another location is much simpler than you're making it out to be:
@RequestMapping(value = "/admin/house/update", method = RequestMethod.POST)
public String updateHouse(House house, @RequestParam("file") MultipartFile srcFile, Model model) throws IOException {
File destFile = new File("/path/to/the/target/file");
srcFile.transferTo(destFile); // easy!
model.addAttribute("step", 3);
this.houseDao.update(house);
return "houseAdmin";
}
You shouldn't store files in /WEB-INF/resource/uploads. This directory is either inside your WAR (if packaged) or exploded somewhere inside the servlet container. The first destination is read-only and the latter should not be used for user files.
There are usually two places considered when storing uploaded files:
Some dedicated folder. Make sure users cannot access this directory directly (e.g. via an anonymous FTP folder). Note that once your application runs on more than one machine you won't have access to this folder, so consider some form of network synchronization or a shared network drive. A small sketch follows after this list.
Database. This is controversial since binary files tend to occupy a lot of space. But this approach is a bit simpler when distributing your application.
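For option 1, a minimal sketch assuming the target directory is configurable; the upload.dir property name and the default path are just examples, and in a real application you would also sanitize getOriginalFilename() before using it in a path.
// Resolve a dedicated upload directory outside the webapp
File uploadDir = new File(System.getProperty("upload.dir", "/var/myapp/uploads"));
if (!uploadDir.exists() && !uploadDir.mkdirs()) {
    throw new IOException("Could not create upload directory " + uploadDir);
}
File destFile = new File(uploadDir, file.getOriginalFilename());
file.transferTo(destFile);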