Issues uploading/downloading files in akka-http/akka-streams - amazon-s3

I'm trying to use akka-streams, akka-http, and the Alpakka library to upload/download files to Amazon S3. I am seeing two issues which might be related:
I can only download very small files, the largest being about 8 KB.
I can't upload larger files; the upload fails with the message:
Error during processing of request: 'Substream Source has not been
materialized in 5000 milliseconds'. Completing with 500 Internal
Server Error response. To change default exception handling behavior,
provide a custom ExceptionHandler.
akka.stream.impl.SubscriptionTimeoutException:
Substream Source has not been materialized in 5000 milliseconds
Here are my routes:
pathEnd {
  post {
    fileUpload("attachment") {
      case (metadata, byteSource) =>
        val writeResult: Future[MultipartUploadResult] =
          byteSource.runWith(client.multipartUpload("bucketname", key))
        onSuccess(writeResult) { result =>
          complete(result.location.toString())
        }
    }
  }
} ~
path("key" / Segment) { key =>
  get {
    val result: Future[ByteString] =
      client.download("bucketname", key).runWith(Sink.head)
    onSuccess(result) {
      complete(_)
    }
  }
}
Trying to download a file of, say, 100 KB ends up fetching a truncated version, usually around 16-25 KB.
Any help is appreciated.
Edit: For the download issue, I took Stefano's suggestion and got
[error] found : akka.stream.scaladsl.Source[akka.util.ByteString,akka.NotUsed]
[error] required: akka.http.scaladsl.marshalling.ToResponseMarshallable
This made it work:
complete(HttpEntity(ContentTypes.`application/octet-stream`, client.download("bucketname", key)))

1) On the download issue: by calling
val result: Future[ByteString] =
client.download("bucketname", key).runWith(Sink.head)
you are collapsing the stream with Sink.head, which keeps only the first ByteString element emitted by the S3 source in memory; that is why larger downloads come back truncated.
Akka HTTP has streaming support that allows you to stream bytes straight from a source, without buffering them all in memory. More info on this can be found in the docs. Practically, this means the complete directive can take a Source[ByteString, _], as in:
...
get {
complete(client.download("bucketname", key))
}
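Note that complete does not accept a bare Source[ByteString, _] directly (hence the marshalling error mentioned in the question's edit); wrapping the source in an HttpEntity with an explicit content type works. A minimal sketch of the streaming download route, reusing the bucket name and key binding from the question:
path("key" / Segment) { key =>
  get {
    // Stream the S3 object straight into the response entity, with no in-memory buffering
    complete(HttpEntity(ContentTypes.`application/octet-stream`, client.download("bucketname", key)))
  }
}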
2) On the upload issue: you can try to tweak Akka HTTP's akka.http.server.parsing.max-content-length setting:
# Default maximum content length which should not be exceeded by incoming request entities.
# Can be changed at runtime (to a higher or lower value) via the `HttpEntity::withSizeLimit` method.
# Note that it is not necessarily a problem to set this to a high value as all stream operations
# are always properly backpressured.
# Nevertheless you might want to apply some limit in order to prevent a single client from consuming
# an excessive amount of server resources.
#
# Set to `infinite` to completely disable entity length checks. (Even then you can still apply one
# programmatically via `withSizeLimit`.)
max-content-length = 8m
Resulting code to test this would be something along the lines of:
withoutSizeLimit {
fileUpload("attachment") {
...
}
}
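Putting this together with the upload route from the question, the handler would look roughly like the following sketch (bucket name and key exactly as in the original code):
withoutSizeLimit {
  fileUpload("attachment") {
    case (metadata, byteSource) =>
      // Stream the incoming multipart bytes directly into an S3 multipart upload
      val writeResult: Future[MultipartUploadResult] =
        byteSource.runWith(client.multipartUpload("bucketname", key))
      onSuccess(writeResult) { result =>
        complete(result.location.toString())
      }
  }
}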

Related

Flux File upload validation - file type

I am new to reactive programming. I am using Flux for file upload. I need to make sure that all the uploaded files are of a specific type; if not, I need to fail the request.
files.flatMap(input -> validateFile(input))
     .flatMap(output -> uploadToAzure(output))
My problem is that by the time the second file turns out to be of an unacceptable type, the first file has already been processed. I want validateFile to scan all the files first and only then continue with further processing.
Basically, if you want to validate all the files at once rather than one by one, you should first collect them, since you're dealing with a Flux. You can achieve that with collectList() on your Flux.
Then, having the List of your files, you can validate and process them. Here is an example of doing the validation with handle():
On your Flux of files:
.collectList() // collect all files to List
.handle((files, sink) -> {
// validate all your files here, for example, using regular Stream API with allMatch()
...
if (allValid) {
// return these files if all files are valid
sink.next(files);
} else {
// throw Exception if some files are not valid
sink.error(new Exception("Some files are not valid"));
}
})
...
// further processing
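For reference, here is a self-contained sketch of the same idea with Reactor, using plain Strings as stand-ins for the uploaded files and a simple extension check in place of the question's validateFile():
import java.util.List;
import reactor.core.publisher.Flux;

public class UploadValidation {
    public static void main(String[] args) {
        // Stand-ins for the uploaded files; in a real handler these would be file parts.
        Flux<String> files = Flux.just("a.pdf", "b.pdf", "c.txt");

        files.collectList()                                  // gather every file before validating
             .<List<String>>handle((all, sink) -> {
                 boolean allValid = all.stream().allMatch(name -> name.endsWith(".pdf"));
                 if (allValid) {
                     sink.next(all);                         // emit the whole batch downstream
                 } else {
                     sink.error(new IllegalArgumentException("Some files are not valid"));
                 }
             })
             .flatMapMany(Flux::fromIterable)                // fan back out for per-file processing
             .subscribe(
                 name -> System.out.println("uploading " + name),
                 err -> System.err.println("rejected: " + err.getMessage()));
    }
}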
This is one of many possible ways to achieve what you want.
P.S. Actually, you should have provided more code and formatted it properly.

stream data into my vega view progressively

I am using Papaparse to parse the CSV and, on each chunk, I run an insert into the view, like so:
Papa.parse(createReadStream('geo.csv'), {
header: true,
chunk(data) {
console.log('chunk: ', data.data.length)
// data.data.length > 0 && tally.push(...data.data)
view.insert('test1', data.data)
},
complete() {
view.data('test1').length // this will return 0
console.log('memory:', process.memoryUsage().heapUsed / 1024 / 1024, ` == time: ${Date.now() - start}`)
},
})
The only way to keep inserting new data is to either:
call run() after each insert, i.e. view.insert('test1', data.data).run(), to "commit" it; but I do not need it to run yet, not until I have all of the data (which is why I call run() in the complete() callback), or
parse everything at once in memory and then pass it using data('test1', allRows), which I think will use a lot more memory.
How do I progressively stream data into my Vega view? Note that I am running this inside a web worker; as far as I know, the Vega loader does not support the browser's File instance (only URLs in the browser environment), which is why I'm using Papaparse.
You need to run runAsync and await it before inserting more data into the view, or otherwise updates may get lost. See https://github.com/vega/vega/issues/2513 for more information on this.
If you don't care about intermediate updates while more data comes in, I would recommend collecting all the data you want to insert and then adding it at once. Memory won't be an issue, since Vega keeps the full dataset in memory anyway.
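If you do want to keep inserting chunk by chunk, here is a rough sketch of that approach (pausing Papaparse between chunks so each insert's runAsync() can be awaited; untested):
Papa.parse(createReadStream('geo.csv'), {
  header: true,
  chunk(results, parser) {
    parser.pause()                 // hold parsing until the view has processed this chunk
    view.insert('test1', results.data)
      .runAsync()                  // propagate the inserted tuples through the dataflow
      .then(() => parser.resume())
  },
  complete() {
    view.runAsync().then(() => console.log('rows:', view.data('test1').length))
  },
})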

How to use select_object_content via rusoto / rust?

The following code tries to select some data from a file stored on S3:
let client = S3Client::new(Region::default());
let source = ... object providing bucket and key ...;
let r = SelectObjectContentRequest {
bucket: source.bucket,
key: source.key,
expression: "select id from S3Object[*].id".to_string(),
expression_type: "SQL".to_string(),
input_serialization: InputSerialization {
json: Some(JSONInput { type_: Some("LINES".to_string()) }),
..Default::default()
},
output_serialization: OutputSerialization {
json: Some(JSONOutput { record_delimiter: Some("\n".to_string()) }),
..Default::default()
},
..Default::default()
};
It causes the following error:
The specified method is not allowed against this
resource.POST
The example is a 1:1 port of a working Python/boto3 example, so I'm quite sure it should work. I found this issue, which is a few months old, and its status is not clear to me. How do I get this working with Rust?
Unfortunately, S3 Select still doesn't work on the latest rusoto_s3-0.40.0. The issue you linked has all the answers. The problem is twofold.
First, right now the S3 Select request rusoto sends out has a bogus query string. It should be /ObjectName?select&select-type=2, but rusoto encodes it as /ObjectName?select%26select-type=2. That's the error you saw.
To verify, run your project like so:
$ RUST_LOG=rusoto,hyper=debug cargo run
You will see logs from rusoto and hyper. Sure enough it emits an incorrect URI. One can even dig into the code responsible:
let mut params = Params::new();
params.put("select&select-type", "2");
request.set_params(params);
It is supposed to be:
let mut params = Params::new();
params.put("select-type", "2");
params.put("select", "");
request.set_params(params);
Although the fix seems trivial, remember that this is glue code generated from the AWS botocore service manifests, not written by hand, so incorporating the fix is not that straightforward.
Second, the bigger problem: the AWS S3 Select response uses a customized binary format, and rusoto simply doesn't have a deserializer for it yet.

Specifying run time for Web Polygraph load tool

I am using the Web Polygraph load-testing tool to make rapid HTTP requests, as it is reliable, has low resource consumption, and has good reporting. However, I cannot find any setting that tells Web Polygraph to run for a certain amount of time. I want accurate reporting instead of potential misses caused by sending a kill signal to the process.
I have been reading through Web Polygraph's help pages and can see that the requests per second are configurable, but I am not seeing support for specifying the run duration.
I have the config file as such (I think this is where the option would go, likely in the Robot configuration):
Content SimpleContent = {
size = exp(1KB); // response sizes distributed exponentially
cachable = 100%;
};
Server S1 = {
kind = "S101";
contents = [ SimpleContent ];
direct_access = contents;
addresses = ['X.X.X.X' ];
};
// a primitive robot
Robot R1 = {
kind = "R101";
req_rate = 100/sec;
interests = [ "foreign" ];
foreign_trace = "/home/x/trace.urls";
pop_model = { pop_distr = popUnif(); };
recurrence = 100% / SimpleContent.cachable;
origins = S1.addresses;
addresses = ['X.X.X.X' ];
};
I am expecting to be able to set some duration, say 40 minutes, and have the R1 robot request 100 pages per second for that whole time.
I got an answer from the Web Polygraph support. For future reference, this can be accomplished through the Phase and Goal objects, as well as by using the schedule() function with them. Here is a snippet of the email I got back:
See the goal field inside the Phase object:
http://www.web-polygraph.org/docs/reference/pgl/types.html#type:docs/reference/pgl/types/Goal
http://www.web-polygraph.org/docs/reference/pgl/types.html#type:docs/reference/pgl/types/Phase
Do not forget to schedule() your phases:
http://www.web-polygraph.org/docs/reference/pgl/calls.html
Many workloads that are distributed with Polygraph include Phase
schedules. To see examples, search for "goal" in workloads/
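As a rough sketch (based on the Goal and Phase references above; field names not verified against a running Polygraph install), a 40-minute run could be expressed as:
Phase phMeasure = {
    name = "measure";
    goal.duration = 40min;
};
schedule(phMeasure);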

Running Google Dataflow pipeline from a Google App Engine app?

I am creating a Dataflow job using DataflowPipelineRunner. I tried the following scenarios:
Without specifying any machineType
With the g1-small machine type
With n1-highmem-2
In all of the above scenarios, the input is a very small file (KB size) from GCS and the output is a BigQuery table.
I got an out-of-memory error in all of the scenarios.
The size of my compiled code is 94 MB. I am trying only the word count example, and it does not read any input (it fails before the job starts). Please help me understand why I am getting this error.
Note: I am using App Engine to start the job.
Note: The same code works with beta version 0.4.150414.
EDIT 1
As per the suggestions in the answer, I tried the following:
Switched from automatic scaling to basic scaling.
Used machine type B2, which provides 256 MB of memory.
After this configuration, the Java heap memory problem is solved, but the job then tries to upload a jar of more than 10 MB to the staging location, and hence fails.
It logs the following exception:
com.google.api.client.http.HttpRequest execute: exception thrown while executing request
com.google.appengine.api.urlfetch.RequestPayloadTooLargeException: The request to https://www.googleapis.com/upload/storage/v1/b/pwccloudedw-stagging-bucket/o?name=appengine-api-L4wtoWwoElWmstI1Ia93cg.jar&uploadType=resumable&upload_id=AEnB2Uo6HCfw6Usa3aXlcOzg0g3RawrvuAxWuOUtQxwQdxoyA0cf22LKqno0Gu-hjKGLqXIo8MF2FHR63zTxrSmQ9Yk9HdCdZQ exceeded the 10 MiB limit.
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:157)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:543)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:422)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:275)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
at java.util.concurrent.FutureTask.run(FutureTask.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1168)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:605)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:1152)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:1146)
at java.lang.Thread.run(Thread.java:745)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1195)
I tried directly uploading the jar file appengine-api-1.0-sdk-1.9.20.jar, but it still tries to upload the jar appengine-api-L4wtoWwoElWmstI1Ia93cg.jar, which I don't recognize. Any idea what this jar is would be appreciated.
Please help me fix this issue.
The short answer is that if you use AppEngine on a Managed VM you will not encounter the AppEngine sandbox limits (OOM when using an F1 or B1 instance class, execution time limit issues, whitelisted JRE classes). If you really want to run within the App Engine sandbox, then your use of the Dataflow SDK must conform to the limits of the AppEngine sandbox. Below I explain common issues and what people have done to conform to those limits.
The Dataflow SDK requires an AppEngine instance class that has enough memory to run the user's application in order to construct the pipeline, stage any resources, and send the job description to the Dataflow service. Typically we have seen users need an instance class with more than 128 MB of memory to avoid OOM errors.
Constructing a pipeline and submitting it to the Dataflow service typically takes less than a couple of seconds if the required resources for your application are already staged. Uploading your JARs and any other resources to GCS can take longer than 60 seconds. This can be solved manually by pre-staging your JARs in GCS beforehand (the Dataflow SDK will skip staging them again if it detects they are already there) or by using a task queue to get a 10-minute limit (note that for large applications, 10 minutes may not be enough to stage all your resources).
Finally, within the AppEngine sandbox environment, you and all your dependencies are limited to using only whitelisted classes within the JRE or you'll get an exception like:
java.lang.SecurityException:
java.lang.IllegalAccessException: YYY is not allowed on ZZZ
...
EDIT 1
We compute a hash of the contents of the jars on the classpath and upload them to GCS with a modified filename. AppEngine runs a sandboxed environment with its own JARs; appengine-api-L4wtoWwoElWmstI1Ia93cg.jar refers to appengine-api.jar, a jar that the sandboxed environment adds. You can see from our PackageUtil#getUniqueContentName(...) that we just append -$HASH before .jar.
We are working to resolve why you are seeing the RequestPayloadTooLarge exception, and in the meantime it is recommended that you set the filesToStage option and filter out the jars not required to execute your Dataflow job to get around the issue you are facing. You can see how we build the files to stage with DataflowPipelineRunner#detectClassPathResourcesToStage(...).
I had the same problem with the 10 MB limit. What I did was filter out the JAR files bigger than that limit (instead of filtering out specific files), and then set the remaining files in the DataflowPipelineOptions with setFilesToStage.
So I just copied the method detectClassPathResourcesToStage from the Dataflow SDK and changed it slightly:
private static final long FILE_BYTES_THRESHOLD = 10 * 1024 * 1024; // 10 MB
protected static List<String> detectClassPathResourcesToStage(ClassLoader classLoader) {
if (!(classLoader instanceof URLClassLoader)) {
String message = String.format("Unable to use ClassLoader to detect classpath elements. "
+ "Current ClassLoader is %s, only URLClassLoaders are supported.", classLoader);
throw new IllegalArgumentException(message);
}
List<String> files = new ArrayList<>();
for (URL url : ((URLClassLoader) classLoader).getURLs()) {
try {
File file = new File(url.toURI());
if (file.length() < FILE_BYTES_THRESHOLD) {
files.add(file.getAbsolutePath());
}
} catch (IllegalArgumentException | URISyntaxException e) {
String message = String.format("Unable to convert url (%s) to file.", url);
throw new IllegalArgumentException(message, e);
}
}
return files;
}
And then when I'm creating the DataflowPipelineOptions:
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
dataflowOptions.setFilesToStage(detectClassPathResourcesToStage(DataflowPipelineRunner.class.getClassLoader()));
Here's a version of Helder's 10MB-filtering solution that will adapt to the default file-staging behavior of DataflowPipelineOptions even if it changes in a future version of the SDK.
Instead of duplicating the logic, it passes a throwaway copy of the DataflowPipelineOptions to DataflowPipelineRunner to see which files it would have staged, then removes any that are too big.
Note that this code assumes that you've defined a custom PipelineOptions class named MyOptions, along with a java.util.Logger field named logger.
// The largest file size that can be staged to the dataflow service.
private static final long MAX_STAGED_FILE_SIZE_BYTES = 10 * 1024 * 1024;
/**
* Returns the list of .jar/etc files to stage based on the
* Options, filtering out any files that are too large for
* DataflowPipelineRunner.
*
* <p>If this accidentally filters out a necessary file, it should
* be obvious when the pipeline fails with a runtime link error.
*/
private static ImmutableList<String> getFilesToStage(MyOptions options) {
// Construct a throw-away runner with a copy of the Options to see
// which files it would have wanted to stage. This could be an
// explicitly-specified list of files from the MyOptions param, or
// the default list of files determined by DataflowPipelineRunner.
List<String> baseFiles;
{
DataflowPipelineOptions tmpOptions =
options.cloneAs(DataflowPipelineOptions.class);
// Ignore the result; we only care about how fromOptions()
// modifies its parameter.
DataflowPipelineRunner.fromOptions(tmpOptions);
baseFiles = tmpOptions.getFilesToStage();
// Some value should have been set.
Preconditions.checkNotNull(baseFiles);
}
// Filter out any files that are too large to stage.
ImmutableList.Builder<String> filteredFiles = ImmutableList.builder();
for (String file : baseFiles) {
long size = new File(file).length();
if (size < MAX_STAGED_FILE_SIZE_BYTES) {
filteredFiles.add(file);
} else {
logger.info("Not staging large file " + file + ": length " + size
+ " >= max length " + MAX_STAGED_FILE_SIZE_BYTES);
}
}
return filteredFiles.build();
}
/** Runs the processing pipeline with given options. */
public void runPipeline(MyOptions options)
throws IOException, InterruptedException {
// DataflowPipelineRunner can't stage large files;
// remove any from the list.
DataflowPipelineOptions dpOpts =
options.as(DataflowPipelineOptions.class);
dpOpts.setFilesToStage(getFilesToStage(options));
// Run the pipeline as usual using "options".
// ...
}