Upload large files Spring Webflux - spring-webflux

I want to upload files to a minio file container.
Smaller files work as expected with this code:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
log.info("About to save database to minio container...");
Mono<Boolean> result = Mono.from(
filePart.content().flatMap(dataBuffer -> {
var bytes = dataBuffer.asByteBuffer().array();
dataBuffer.read(bytes);
DataBufferUtils.release(dataBuffer);
return Flux.just(bytes);
})
.flatMap(databaseFileService::write)
.then(Mono.just(true))
.onErrorMap(throwable -> {
log.error(throwable.getMessage(), throwable);
return throwable;
}));
log.info("Successfully saved database to minio container...");
return result;
}
I need to provide a byte[] for my minio service to be uploaded.
Smaller files work as expected (will be stored to the container). But larger files (12 MB in my test) don´t work.
I get this exception:
java.lang.IndexOutOfBoundsException: readPosition 0 and length 1024 should be smaller than writePosition 808

I´ve tried a suggestion DataBufferUtils.join from another SO post. This is kind of odd but i think the following code does the job:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
var result = DataBufferUtils.join(filePart.content()).map(dataBuffer -> {
var bytes = dataBuffer.asByteBuffer().array();
dataBuffer.read(bytes);
DataBufferUtils.release(dataBuffer);
return bytes;
}).map(databaseFileService::write).then(Mono.just(true))
.onErrorMap(throwable -> {
log.error(throwable.getMessage(), throwable);
return throwable;
});
log.info("Successfully saved database to minio container...");
return result;
}
Especially this line seems to do the trick:
DataBufferUtils.join
I don´t know why but it seems to work.
EDIT:
Looking up the above static join sets a parameter maxByteCount to -1. Maybe the other (not working function for larger files) sets some limits. But i don´t know.

Related

Very slow AWS presignedURL download. How to speed it up?

I am reading 20 - 30 different objects in varying size using the S3 IAM and a unique presignedURL for each file. The download of all files occur at once. Each phase occurs in sequence. Unfortunatly the S3Client is not thread safe so we cannot use async operations. Some files transfer rapidly while others lag. The total operation can take between 7 to > 15 seconds. I expected greater performance from S3 since AWS advertises that it has high throughput.
I see several posts that are unanswered about the download performance from S3. However the problem seems to have increased once we introduced link ambiguation using the IAM and presignedURL.
FYI my internet connection is broadband. It is unlikely the cause of the performance issue.
The tests are performed only a few hundred miles from S3 storage and eliminates distance as a factor of performance issues.
There is no server between the client and the S3 for downloading objects and is not the cause of performance issues.
One caveat. We tried using async forAllChunked using the Rice.edu habenero api. When we did not have any errors due to threading problems, the download performance was still very slow. This seemingly should eleminate the idea that download performance is slow due to it's serialization in the for loop. Albiet performance should be far better if we can download files simultainiously.
Code attatched.
public void cloudGetMedia(ArrayList<MediaSyncObj> mediaObjs, ArrayList<String> signedUrls) {
long getTime = System.currentTimeMillis();
// Ensure media directory exists or create it
String toDiskDir = DirectoryMgr.getMediaPath('M');
File diskFile = new File(toDiskDir);
FileOpsUtil.folderExists(diskFile);
// Process signedURLs
for(String signedurl : signedUrls) {
LOGGER.debug("cloudGetMedia called. signedURL is null: {}", signedurl == null);
URI fileToBeDownloaded = null;
try {
fileToBeDownloaded = new URI(signedurl);
} catch (URISyntaxException e) {
e.printStackTrace();
}
// get the file name from the presignedURL
AmazonS3URI s3URI = new AmazonS3URI(fileToBeDownloaded);
String localURL = toDiskDir + "/" + s3URI.getKey();
File file = new File(localURL);
AmazonS3 client = AmazonS3ClientBuilder.standard()
.withRegion(s3URI.getRegion())
.build();
try{
URL url = new URL(signedurl);
PresignedUrlDownloadRequest req = new PresignedUrlDownloadRequest(url);
client.download(req, file);
}
catch (MalformedURLException e) {
LOGGER.warn(e.getMessage());
e.printStackTrace();
}
}
getTime = (System.currentTimeMillis() - getTime);
LOGGER.debug("Total get time in syncCloudMediaAction: {} milliseconds, numElement: {}", getTime, signedUrls.size());
}

AmazonS3: Getting warning: S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection

Here's the warning that I am getting:
S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
I tried using try with resources but S3ObjectInputStream doesn't seem to close via this method.
try (S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(s3ObjectInputStream, StandardCharsets.UTF_8));
){
//some code here blah blah blah
}
I also tried below code and explicitly closing but that doesn't work either:
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(s3ObjectInputStream, StandardCharsets.UTF_8));
){
//some code here blah blah
s3ObjectInputStream.close();
s3object.close();
}
Any help would be appreciated.
PS: I am only reading two lines of the file from S3 and the file has more data.
Got the answer via other medium. Sharing it here:
The warning indicates that you called close() without reading the whole file. This is problematic because S3 is still trying to send the data and you're leaving the connection in a sad state.
There's two options here:
Read the rest of the data from the input stream so the connection can be reused.
Call s3ObjectInputStream.abort() to close the connection without reading the data. The connection won't be reused, so you take some performance hit with the next request to re-create the connection. This may be worth it if it's going to take a long time to read the rest of the file.
Following option #1 of Chirag Sejpal's answer I used the below statement to drain the S3AbortableInputStream to ensure the connection can be reused:
com.amazonaws.util.IOUtils.drainInputStream(s3ObjectInputStream);
I ran into the same problem and the following class helped me
#Data
#AllArgsConstructor
public class S3ObjectClosable implements Closeable {
private final S3Object s3Object;
#Override
public void close() throws IOException {
s3Object.getObjectContent().abort();
s3Object.close();
}
}
and now you can use without warning
try (final var s3ObjectClosable = new S3ObjectClosable(s3Client.getObject(bucket, key))) {
//same code
}
To add an example to Chirag Sejpal's answer (elaborating on option #1), the following can be used to read the rest of the data from the input stream before closing it:
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
try (S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent()) {
try {
// Read from stream as necessary
} catch (Exception e) {
// Handle exceptions as necessary
} finally {
while (s3ObjectInputStream != null && s3ObjectInputStream.read() != -1) {
// Read the rest of the stream
}
}
// The stream will be closed automatically by the try-with-resources statement
}
I ran into the same error.
As others have pointed out, the /tmp space in lambda is limited to 512 MB.
And if the lambda context is re-used for a new invocation, then the /tmp space is already half-full.
So, when reading the S3 objects and writing all the files to the /tmp directory (as I was doing),
I ran out of disk space somewhere in between.
Lambda exited with error, but NOT all bytes from the S3ObjectInputStream were read.
So, two things one need to keep in mind:
1) If the first execution causes the problem, be stingy with your /tmp space.
We have only 512 MB
2) If the second execution causes the problem, then this could be resolved by attacking the root problem.
Its not possible to delete the /tmp folder.
So, delete all the files in the /tmp folder after the execution is finished.
In java, here is what I did, which successfully resolved the problem.
public String handleRequest(Map < String, String > keyValuePairs, Context lambdaContext) {
try {
// All work here
} catch (Exception e) {
logger.error("Error {}", e.toString());
return "Error";
} finally {
deleteAllFilesInTmpDir();
}
}
private void deleteAllFilesInTmpDir() {
Path path = java.nio.file.Paths.get(File.separator, "tmp", File.separator);
try {
if (Files.exists(path)) {
deleteDir(path.toFile());
logger.info("Successfully cleaned up the tmp directory");
}
} catch (Exception ex) {
logger.error("Unable to clean up the tmp directory");
}
}
public void deleteDir(File dir) {
File[] files = dir.listFiles();
if (files != null) {
for (final File file: files) {
deleteDir(file);
}
}
dir.delete();
}
This is my solution. I'm using spring boot 2.4.3
Create an amazon s3 client
AmazonS3 amazonS3Client = AmazonS3ClientBuilder
.standard()
.withRegion("your-region")
.withCredentials(
new AWSStaticCredentialsProvider(
new BasicAWSCredentials("your-access-key", "your-secret-access-key")))
.build();
Create an amazon transfer client.
TransferManager transferManagerClient = TransferManagerBuilder.standard()
.withS3Client(amazonS3Client)
.build();
Create a temporary file in /tmp/{your-s3-key} so that we can put the file we download in this file.
File file = new File(System.getProperty("java.io.tmpdir"), "your-s3-key");
try {
file.createNewFile(); // Create temporary file
} catch (IOException e) {
e.printStackTrace();
}
file.mkdirs(); // Create the directory of the temporary file
Then, we download the file from s3 using transfer manager client
// Note that in this line the s3 file downloaded has been transferred in to the temporary file that we created
Download download = transferManagerClient.download(
new GetObjectRequest("your-s3-bucket-name", "your-s3-key"), file);
// This line blocks the thread until the download is finished
download.waitForCompletion();
Now that the s3 file has been successfully transferred into the temporary file that we created. We can get the InputStream of the temporary file.
InputStream input = new DataInputStream(new FileInputStream(file));
Because the temporary file is not needed anymore, we just delete it.
file.delete();

Windows Azure Storage Blobs to zip file with Express

I am trying to use this pluggin (express-zip). At the Azure Storage size we have getBlobToStream which give us the file into a specific Stream. What i do now is getting image from blob and saving it inside the server, and then res.zip it. Is somehow possible to create writeStream which will write inside readStream?
Edit: The question has been edited to ask about doing this in express from Node.js. I'm leaving the original answer below in case anyone was interested in a C# solution.
For Node, You could use a strategy similar to what express-zip uses, but instead of passing a file read stream in this line, pass in a blob read stream obtained using createReadStream.
Solution using C#:
If you don't mind caching everything locally while you build the zip, the way you are doing it is fine. You can use a tool such as AzCopy to rapidly download an entire container from storage.
To avoid caching locally, you could use the ZipArchive class, such as the following C# code:
internal static void ArchiveBlobs(CloudBlockBlob destinationBlob, IEnumerable<CloudBlob> sourceBlobs)
{
using (Stream blobWriteStream = destinationBlob.OpenWrite())
{
using (ZipArchive archive = new ZipArchive(blobWriteStream, ZipArchiveMode.Create))
{
foreach (CloudBlob sourceBlob in sourceBlobs)
{
ZipArchiveEntry archiveEntry = archive.CreateEntry(sourceBlob.Name);
using (Stream archiveWriteStream = archiveEntry.Open())
{
sourceBlob.DownloadToStream(archiveWriteStream);
}
}
}
}
}
This creates a zip archive in Azure storage that contains multiple blobs without writing anything to disk locally.
I'm the author of express-zip. What you are trying to do should be possible. If you look under the covers, you'll see I am in fact adding streams into the zip:
https://github.com/thrackle/express-zip/blob/master/lib/express-zip.js#L55
So something like this should work for you (prior to me adding support for this in the interface of the package itself):
var zip = zipstream(exports.options);
zip.pipe(express.response || http.ServerResponse.prototype); // res is a writable stream
var addFile = function(file, cb) {
zip.entry(getBlobToStream(), { name: file.name }, cb);
};
async.forEachSeries(files, addFile, function(err) {
if (err) return cb(err);
zip.finalize(function(bytesZipped) {
cb(null, bytesZipped);
});
});
Apologize if I've made horrible errors above; I haven't been on this for a bit.

Uploading PDF documents to a SQL Server

We have a scanner that is going to scan documents into a specific directory.
I need ideas on where to start on an application/service that I can schedule to run every hour or so and upload the documents to a Cloud DB.
Every single example I can find involves the uploader that requires a user to browse/select the file; I don't want that.
The documents will be stored as binary if that matters.
Maybe this will help you get started. Build a console app that does something along these lines
void Main()
{
while(true)
{
var files = Directory.GetFiles("c:\\your\\path\\here\\","*.pdf");
foreach (var file in files)
{
UploadToCloudDB(file);
//possibly delete the file now
}
Thread.Sleep(30000); //pause for 30 seconds between scans
}
}
static void UploadToCloudDB(string path)
{
var bytes = File.ReadAllBytes(path);
//upload those bytes to the cloud DB ....
}

Azure storage: Uploaded files with size zero bytes

When I upload an image file to a blob, the image is uploaded apparently successfully (no errors). When I go to cloud storage studio, the file is there, but with a size of 0 (zero) bytes.
The following is the code that I am using:
// These two methods belong to the ContentService class used to upload
// files in the storage.
public void SetContent(HttpPostedFileBase file, string filename, bool overwrite)
{
CloudBlobContainer blobContainer = GetContainer();
var blob = blobContainer.GetBlobReference(filename);
if (file != null)
{
blob.Properties.ContentType = file.ContentType;
blob.UploadFromStream(file.InputStream);
}
else
{
blob.Properties.ContentType = "application/octet-stream";
blob.UploadByteArray(new byte[1]);
}
}
public string UploadFile(HttpPostedFileBase file, string uploadPath)
{
if (file.ContentLength == 0)
{
return null;
}
string filename;
int indexBar = file.FileName.LastIndexOf('\\');
if (indexBar > -1)
{
filename = DateTime.UtcNow.Ticks + file.FileName.Substring(indexBar + 1);
}
else
{
filename = DateTime.UtcNow.Ticks + file.FileName;
}
ContentService.Instance.SetContent(file, Helper.CombinePath(uploadPath, filename), true);
return filename;
}
// The above code is called by this code.
HttpPostedFileBase newFile = Request.Files["newFile"] as HttpPostedFileBase;
ContentService service = new ContentService();
blog.Image = service.UploadFile(newFile, string.Format("{0}{1}", Constants.Paths.BlogImages, blog.RowKey));
Before the image file is uploaded to the storage, the Property InputStream from the HttpPostedFileBase appears to be fine (the size of the of image corresponds to what is expected! And no exceptions are thrown).
And the really strange thing is that this works perfectly in other cases (uploading Power Points or even other images from the Worker role). The code that calls the SetContent method seems to be exactly the same and file seems to be correct since a new file with zero bytes is created at the correct location.
Does any one have any suggestion please? I debugged this code dozens of times and I cannot see the problem. Any suggestions are welcome!
Thanks
The Position property of the InputStream of the HttpPostedFileBase had the same value as the Length property (probably because I had another file previous to this one - stupid I think!).
All I had to do was to set the Position property back to 0 (zero)!
I hope this helps somebody in the future.
Thanks Fabio for bringing this up and solving your own question. I just want to add code to whatever you have said. Your suggestion worked perfectly for me.
var memoryStream = new MemoryStream();
// "upload" is the object returned by fine uploader
upload.InputStream.CopyTo(memoryStream);
memoryStream.ToArray();
// After copying the contents to stream, initialize it's position
// back to zeroth location
memoryStream.Seek(0, SeekOrigin.Begin);
And now you are ready to upload memoryStream using:
blockBlob.UploadFromStream(memoryStream);