I can't get AmazonS3Client.GetObjectAsync to download files in parallel. The code is as follows:
public async Task<string> ReadFile(string filename)
{
    string filePath = config.RootFolderPath + filename;
    var sw = Stopwatch.StartNew();
    Console.WriteLine(filePath + " - start");

    using (var response = await s3Client.GetObjectAsync(config.Bucket, filePath))
    {
        Console.WriteLine(filePath + " - request - " + sw.ElapsedMilliseconds);
        using (var reader = new StreamReader(response.ResponseStream))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
This is called like this:
var tasks = (from file in files select ReadFile(file)).ToArray();
await Task.WhenAll(tasks);
The result is that the requests come back one after another (though not in order). I'm reading about 50 tiny files, and the whole batch takes about 25 seconds, with the last read hanging in GetObjectAsync for nearly that entire time. Instead, I had hoped to read the 50 files in 2-3 seconds.
I've already verified:
I'm on the thread pool, so the synchronization context isn't in the mix. I also added ConfigureAwait(false) to the tasks, but as expected that made no difference.
I've tried various settings on the AmazonS3Client, such as using the HTTP protocol or changing the buffer size, without success.
I added a stopwatch to verify whether the problem is in reading the response stream; when I don't read the response stream at all, the whole method returns quickly.
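One possibility I haven't ruled out yet (an assumption on my part; it applies to .NET Framework, where outgoing HTTP connections to a single host are capped at 2 by default) is that the connection limit is serializing the requests. A minimal sketch of raising it once at startup, before the client is created:

// Assumption: .NET Framework, where ServicePointManager caps concurrent
// connections per host at 2 by default (the cap works differently on .NET Core).
System.Net.ServicePointManager.DefaultConnectionLimit = 50;

// ...create s3Client afterwards, then fan out the reads as before:
var tasks = (from file in files select ReadFile(file)).ToArray();
await Task.WhenAll(tasks);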
I am reading 20-30 objects of varying size from S3, using IAM and a unique presigned URL for each file. All of the downloads are kicked off at once, but each one runs in sequence. Unfortunately (as far as we can tell) the S3 client is not thread safe, so we cannot use async operations. Some files transfer rapidly while others lag, and the total operation takes anywhere from 7 to more than 15 seconds. I expected greater performance from S3, since AWS advertises high throughput.
I see several unanswered posts about download performance from S3. However, the problem seems to have gotten worse once we introduced link obfuscation using IAM and presigned URLs.
FYI, my internet connection is broadband, so it is unlikely to be the cause of the performance issue.
The tests are performed only a few hundred miles from the S3 storage, which eliminates distance as a factor.
There is no server between the client and S3 for downloading objects, so that is not the cause either.
One caveat: we tried async forAllChunked from the Rice.edu Habanero API. When we did not hit any errors due to threading problems, download performance was still very slow. That would seem to eliminate the idea that downloads are slow because they are serialized in the for loop, although performance should be far better if we could download files simultaneously.
Code attached.
public void cloudGetMedia(ArrayList<MediaSyncObj> mediaObjs, ArrayList<String> signedUrls) {
    long getTime = System.currentTimeMillis();
    // Ensure the media directory exists or create it
    String toDiskDir = DirectoryMgr.getMediaPath('M');
    File diskFile = new File(toDiskDir);
    FileOpsUtil.folderExists(diskFile);

    // Process the signed URLs
    for (String signedurl : signedUrls) {
        LOGGER.debug("cloudGetMedia called. signedURL is null: {}", signedurl == null);

        URI fileToBeDownloaded = null;
        try {
            fileToBeDownloaded = new URI(signedurl);
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }

        // Get the file name from the presigned URL
        AmazonS3URI s3URI = new AmazonS3URI(fileToBeDownloaded);
        String localURL = toDiskDir + "/" + s3URI.getKey();
        File file = new File(localURL);

        // Note: building a new client on every iteration is expensive;
        // the client could be created once (per region) and reused.
        AmazonS3 client = AmazonS3ClientBuilder.standard()
                .withRegion(s3URI.getRegion())
                .build();
        try {
            URL url = new URL(signedurl);
            PresignedUrlDownloadRequest req = new PresignedUrlDownloadRequest(url);
            client.download(req, file);
        } catch (MalformedURLException e) {
            LOGGER.warn(e.getMessage());
            e.printStackTrace();
        }
    }
    getTime = (System.currentTimeMillis() - getTime);
    LOGGER.debug("Total get time in syncCloudMediaAction: {} milliseconds, numElement: {}", getTime, signedUrls.size());
}
I have a .NET Core API that I'd like to extend to save uploaded images asynchronously.
Using ImageSharp, I should be able to check uploads and resize them if predefined size limits are exceeded. However, I can't get a simple async save working.
A simple (non-async) save to file works without problem:
My Controller extracts the IFormFile from the upload and calls the following method without any problem.
public static void Save(IFormFile image, string imagesFolder)
{
    var fileName = Path.Combine(imagesFolder, image.FileName);
    using (var stream = image.OpenReadStream())
    using (var imgIS = Image.Load(stream, out IImageFormat format))
    {
        imgIS.Save(fileName);
    }
}
ImageSharp is currently lacking async methods so a workaround is necessary.
The updated code below saves the uploaded file but the format is incorrect - when viewing the file I get the message "It appears we don't support this file format".
The format is extracted by the ImageSharp Load method and used when saving to the MemoryStream.
MemoryStream's CopyToAsync method is then used to write to the FileStream, making the save asynchronous.
public static async void Save(IFormFile image, string imagesFolder)
{
    var fileName = Path.Combine(imagesFolder, image.FileName);
    using (var stream = image.OpenReadStream())
    using (var imgIS = Image.Load(stream, out IImageFormat format))
    using (var memoryStream = new MemoryStream())
    using (var fileStream = new FileStream(fileName, FileMode.OpenOrCreate))
    {
        imgIS.Save(memoryStream, format);
        await memoryStream.CopyToAsync(fileStream).ConfigureAwait(false);
        fileStream.Flush();
        memoryStream.Close();
        fileStream.Close();
    }
}
I can't work out whether the issue is with ImageSharp's Save to the MemoryStream or with MemoryStream.CopyToAsync.
I'm currently getting a 404 on the SixLabors docs - hopefully not an indication that the project has folded.
How can I make the upload async and save the file in the correct format?
CopyToAsync copies a stream starting at its current position. You must move memoryStream's position back to the start before copying:
// ...
memoryStream.Seek(0, SeekOrigin.Begin);
await memoryStream.CopyToAsync(fileStream).ConfigureAwait(false);
// ...
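For completeness, the corrected method might look roughly like this (a sketch: besides the Seek fix, it uses async Task instead of async void so the caller can await it and observe exceptions, and FileMode.Create so an existing longer file is truncated rather than left with stale trailing bytes):

public static async Task Save(IFormFile image, string imagesFolder)
{
    var fileName = Path.Combine(imagesFolder, image.FileName);
    using (var stream = image.OpenReadStream())
    using (var imgIS = Image.Load(stream, out IImageFormat format))
    using (var memoryStream = new MemoryStream())
    using (var fileStream = new FileStream(fileName, FileMode.Create))
    {
        imgIS.Save(memoryStream, format);
        // Rewind before copying, per the fix above.
        memoryStream.Seek(0, SeekOrigin.Begin);
        await memoryStream.CopyToAsync(fileStream).ConfigureAwait(false);
    }
}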
I'm struggling to provide the ability in my ASP.NET Core 2.2 app to upload and download large files, up to 50 GB. Currently, for testing purposes, I'm saving the files to local storage, but in the future I will move them to a cloud storage provider.
Files will be sent by another server written in Java; more specifically, it will be a Jenkins plugin that sends project builds to my ASP.NET Core server using this library.
Currently I use a classic Controller class with HttpPost to upload the files, but this doesn't seem like the best solution for my purposes, since I won't use any web page to attach files from the client.
[HttpPost]
[RequestFormLimits(MultipartBodyLengthLimit = 50000000000)]
[RequestSizeLimit(50000000000)]
[AllowAnonymous]
[Route("[controller]/upload")]
public async Task<IActionResult> Upload()
{
    var files = Request.Form.Files;
    SetProgress(HttpContext.Session, 0);
    long totalBytes = files.Sum(f => f.Length);

    if (!IsMultipartContentType(HttpContext.Request.ContentType))
        return StatusCode(415);

    foreach (IFormFile file in files)
    {
        ContentDispositionHeaderValue contentDispositionHeaderValue =
            ContentDispositionHeaderValue.Parse(file.ContentDisposition);
        string filename = contentDispositionHeaderValue.FileName.Trim().ToString();

        byte[] buffer = new byte[16 * 1024];
        using (FileStream output = System.IO.File.Create(GetPathAndFilename(filename)))
        {
            using (Stream input = file.OpenReadStream())
            {
                long totalReadBytes = 0;
                int readBytes;
                while ((readBytes = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    await output.WriteAsync(buffer, 0, readBytes);
                    totalReadBytes += readBytes;
                    int progress = (int)((float)totalReadBytes / (float)totalBytes * 100.0);
                    SetProgress(HttpContext.Session, progress);
                    Log($"SetProgress: {progress}", @"\LogSet.txt");
                    // Note: this delay runs once per 16 KB chunk, which by itself
                    // adds minutes to a multi-hundred-MB upload.
                    await Task.Delay(100);
                }
            }
        }
    }
    return Content("success");
}
I'm using this code now to upload files, but for larger files (>300 MB) it takes ages for the upload to even start.
I tried looking at many articles on how to achieve this, such as the official docs or Stack Overflow.
But none of those solutions seem to work for me: the upload still takes ages, and I also noticed that for files around 200 MB (the largest I could upload so far), the more data is uploaded, the slower my PC gets.
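From those articles, my understanding is that IFormFile model binding buffers the entire request before the action runs, which would explain the long delay before the upload even starts, and that streaming the multipart body with MultipartReader avoids that buffering. A rough sketch of what I believe that looks like, adapted from the official docs and untested (types come from Microsoft.AspNetCore.WebUtilities and Microsoft.Net.Http.Headers; GetPathAndFilename and IsMultipartContentType are my same helpers as above; the docs additionally disable form value model binding with a filter, omitted here):

[HttpPost]
[DisableRequestSizeLimit]
[AllowAnonymous]
[Route("[controller]/upload-streamed")]
public async Task<IActionResult> UploadStreamed()
{
    if (!IsMultipartContentType(Request.ContentType))
        return StatusCode(415);

    // Read the multipart boundary from the Content-Type header.
    var boundary = HeaderUtilities.RemoveQuotes(
        MediaTypeHeaderValue.Parse(Request.ContentType).Boundary).Value;
    var reader = new MultipartReader(boundary, Request.Body);

    // Copy each file section to disk as it arrives, without buffering the request.
    MultipartSection section;
    while ((section = await reader.ReadNextSectionAsync()) != null)
    {
        if (ContentDispositionHeaderValue.TryParse(
                section.ContentDisposition, out var contentDisposition)
            && contentDisposition.FileName.HasValue)
        {
            string filename = Path.GetFileName(contentDisposition.FileName.Value);
            using (FileStream output = System.IO.File.Create(GetPathAndFilename(filename)))
            {
                await section.Body.CopyToAsync(output);
            }
        }
    }
    return Content("success");
}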
I need some advice on whether I'm following the right path or whether I should change my approach. Thank you.
I know there are tons of examples of multipart form data uploading in ASP.NET. However, all of them just upload files to the server and use System.IO to write them to server disk space. Also, the client-side implementations seem to handle only the files being uploaded, so I can't really use existing upload plugins.
What if I have an existing record and I want to upload images and associate them with the record? Would I need to write database access code in the upload (Api) function, and if so, how do I pass that record's PK with the upload request? Do I instead upload the files in that one request, obtain the file names generated by the server, and then make separate API calls to associate the files with the record?
While at it, does anyone know how YouTube uploading works? From a user's perspective, it seems like we can upload a video, and while uploading, we can set title, description, tags, etc, and even save the record. Is a record for the video immediately created before the API request to upload, which is why we can save info even before upload completes?
Again, I'm not asking HOW to upload files. I'm asking how to associate uploaded files with an existing record and which API calls are involved, and also which API calls to make at which point in the user experience, given that the user also inputs information about what they're uploading.
I'm assuming you're using an API call to get the initial data for displaying a list of files or an individual file. You would have to do this anyway in order to pass the ID back to the PUT method that updates the file.
Here's a sample of the GET method:
[HttpGet]
public IEnumerable<FileMetaData> Get()
{
    var allFiles = MyEntities.Files.Select(f => new FileMetaData()
    {
        Name = f.Name,
        FileName = f.FileName,
        Description = f.Description,
        FileId = f.Id,
        ContentType = f.ContentType,
        Tags = f.Tags,
        NumberOfKB = f.NumberOfKB
    });
    return allFiles;
}
Here's a sample of the POST method, which you can adapt to be a PUT (update) instead:
[HttpPost]
[ValidateMimeMultipartContentFilter]
public async Task<IHttpActionResult> PutFile()
{
    try
    {
        var streamProvider =
            await Request.Content.ReadAsMultipartAsync(new InMemoryMultipartFormDataStreamProvider());

        // We only allow one file
        var thisFile = streamProvider.Files[0];

        // For a PUT version, you would grab the file from the database based on
        // the id included in the form data, instead of creating a new file
        var file = new File()
        {
            FileName = thisFile.FileName,
            ContentType = thisFile.ContentType,
            NumberOfKB = thisFile.ContentLength
        };

        // This is the file metadata that your client would pass in as formData
        // on the PUT / POST.
        var formData = streamProvider.FormData;
        if (formData != null && formData.Count > 0)
        {
            file.Id = formData["id"];
            file.Description = formData["description"];
            file.Name = formData["name"] ?? string.Empty;
            file.Tags = formData["tags"];
        }
        file.Resource = thisFile.Data;

        // For your PUT, change this to an update.
        MyEntities.Entry(file).State = EntityState.Detached;
        MyEntities.Files.Add(file);
        await MyEntities.SaveChangesAsync();

        // Return the ID
        return Ok(file.Id.ToString());
    }
    catch (Exception ex)
    {
        return InternalServerError(ex);
    }
}
I got the InMemoryMultipartFormDataStreamProvider from this article:
https://conficient.wordpress.com/2013/07/22/async-file-uploads-with-mvc-webapi-and-bootstrap/
And adapted it to fit my needs for the form data I was returning.
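If your client is also .NET, a hypothetical upload call (inside an async method; the URL, field names, and values below are placeholders, not a real API) shows how the existing record's PK travels with the file as plain form data:

using (var client = new HttpClient())
using (var form = new MultipartFormDataContent())
using (var fileStream = System.IO.File.OpenRead(@"C:\temp\photo.jpg"))
{
    // The record's PK rides along as an ordinary form field.
    form.Add(new StringContent("42"), "id");
    form.Add(new StringContent("Holiday photo"), "name");
    form.Add(new StringContent("A photo from my trip"), "description");
    form.Add(new StreamContent(fileStream), "file", "photo.jpg");

    var response = await client.PostAsync("https://example.com/api/files", form);
    // The action above returns the file's ID in the response body.
    var fileId = await response.Content.ReadAsStringAsync();
}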
I wrote a function to download a web page; it looks like this:
public string GetWebPage(string sURL)
{
    System.Net.WebResponse objResponse = null;
    System.Net.WebRequest objRequest = null;
    System.IO.StreamReader objStreamReader = null;
    string sResultPage = null;
    try
    {
        objRequest = System.Net.WebRequest.Create(sURL);
        // GetResponse() and ReadToEnd() both block the calling thread
        // until the entire page has been downloaded.
        objResponse = objRequest.GetResponse();
        objStreamReader = new System.IO.StreamReader(objResponse.GetResponseStream());
        sResultPage = objStreamReader.ReadToEnd();
        return sResultPage;
    }
    catch (Exception ex)
    {
        return "";
    }
    finally
    {
        if (objStreamReader != null) objStreamReader.Dispose();
        if (objResponse != null) objResponse.Close();
    }
}
But my problem is that while this function is running, the application freezes (stops responding) and the user can't do anything. How can I solve this so that the user can keep working in my application while the download is in progress?
Welcome to the world of blocking IO.
Consider the following:
You want your program to download a web page and then return the first 10 letters it finds in the source html. Your code might look like this:
...
string page = GetWebPage("http://example.com"); // download web page
page = page.Substring(0, 10);
Console.WriteLine(page);
....
When your program calls GetWebPage(), it must WAIT for the web page to be fully downloaded before it can possibly try to call Substring() - else it may try to get the substring before it actually downloads the letters.
Now consider your program. You've got lots of code - maybe a GUI interface running - and it's all executing line by line one instruction at a time. When your code calls GetWebPage(), it can't possibly continue executing additional code until that request is fully finished. Your entire program is waiting on that request to finish.
The problem can be solved in a few different ways, and the best solution depends on exactly what you're doing with your code. Ideally, your code needs to execute asynchronously. C# has methods that can handle a lot of this for you, but one way or another you're going to want to start some work (downloading the web page, in your case) and then continue executing code until your main thread is notified that the web page is fully downloaded. Then your main thread can begin parsing the return value.
I'm assuming that since you've asked this question, you are very new to threads and concurrency in general, so you have some work ahead of you. Here are some resources to read up on threading and implementing concurrency in C#:
C# Thread Introduction
.NET Asynchronous IO Design
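In modern C#, the simplest non-blocking version uses HttpClient with async/await. A minimal sketch (the method and field names are mine, not from your code):

// HttpClient is intended to be created once and reused, not per request.
private static readonly HttpClient httpClient = new HttpClient();

public async Task<string> GetWebPageAsync(string sURL)
{
    // The await hands control back to the caller (keeping a GUI responsive)
    // until the download completes.
    return await httpClient.GetStringAsync(sURL);
}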
The best way is to use a thread:
new Thread(download).Start(url);
(here download is a method that takes a single object parameter). And if the page you are downloading is large, use chunked reads:
HttpWebRequest ObjHttpWebRequest = (HttpWebRequest)WebRequest.Create(Convert.ToString(url));
// Request bytes starting from this offset (a partial/resumable download).
ObjHttpWebRequest.AddRange(99204);
ObjHttpWebRequest.Timeout = Timeout.Infinite;
ObjHttpWebRequest.Method = "GET";
HttpWebResponse ObjHttpWebResponse = (HttpWebResponse)ObjHttpWebRequest.GetResponse();
Stream ObjStream = ObjHttpWebResponse.GetResponseStream();
string downloaddata = "";
byte[] buffer = new byte[1224];
int length = 0;
while ((length = ObjStream.Read(buffer, 0, buffer.Length)) > 0)
{
    // Decode only the bytes actually read in this chunk (code page 936 = GBK).
    downloaddata += Encoding.GetEncoding(936).GetString(buffer, 0, length);
}
ObjHttpWebResponse.Close();