How to limit blob storage file size in ASA output - azure-stream-analytics

I'm working with an Azure solution where an ASA job writes its output to blob storage. The output files land in a folder tree structured as yyyy/mm/dd/hh (e.g. 2017/10/26/07). Sometimes files are still being written to an hour folder after that hour has passed and, as a result, the files can get very big. Is there a way to limit the size of those files from ASA?

There is no way to limit the size today; the only cap is the blob size limit itself. However, ASA will create a new folder for every hour if your path pattern is yyyy/mm/dd/hh. Please note that this is based on the System.Timestamp column, not wall-clock time.

You can limit the file size and roll over to a new file once the existing file reaches the limit by checking the Length property below.
namespace Microsoft.Azure.Management.DataLake.Store.Models {
    ...
    // Summary:
    //     Gets the number of bytes in a file.
    [JsonProperty(PropertyName = "length")]
    public long? Length { get; }
    ...
}
Below is an example scenario: if the file size exceeds 256 MB (268435456 bytes), create a new file; otherwise keep writing to the existing file.
Create a function that determines the file path; below is a sample code snippet for such a function.
Code Snippet:
public static async Task<string> GetFilePath(DataLakeStoreClient client, string path) {
    var createNewFile = false;
    ......
    // Roll over to a new file once the current one reaches 256 MB.
    if (await client.GetFileSize(returnValue) >= 256 * 1024 * 1024)
    {
        returnValue = GetFilePath(path);
        createNewFile = true;
    }
    ......
}

public async Task<long?> GetFileSize(string filepath) {
    return (await this._client.FileSystem.GetFileStatusAsync(_connectionString.AccountName, filepath)).FileStatus.Length;
}
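For completeness, here is a minimal sketch of how that size check could drive the rollover to a new file. The SizeLimitedFilePathProvider class and its ticks-based naming scheme are illustrative assumptions, not part of the original answer; only the GetFileSize call on the DataLakeStoreClient wrapper mirrors the snippet above.
using System;
using System.Threading.Tasks;

// Hypothetical helper: keeps handing out the same Data Lake file path until
// that file reaches 256 MB, then switches to a freshly named file.
public class SizeLimitedFilePathProvider
{
    private const long MaxFileSizeBytes = 256L * 1024 * 1024; // 256 MB

    private readonly DataLakeStoreClient _client; // wrapper from the answer above
    private string _currentPath;

    public SizeLimitedFilePathProvider(DataLakeStoreClient client, string basePath)
    {
        _client = client;
        _currentPath = NewPath(basePath);
    }

    public async Task<string> GetWritablePathAsync(string basePath)
    {
        // Assumes the current file already exists; GetFileSize wraps
        // GetFileStatusAsync as shown in the answer above.
        long? size = await _client.GetFileSize(_currentPath);
        if (size.HasValue && size.Value >= MaxFileSizeBytes)
        {
            // Current file is full: roll over to a new, uniquely named file.
            _currentPath = NewPath(basePath);
        }
        return _currentPath;
    }

    // Illustrative naming scheme only, e.g. "/output/data-636453158400000000.bin".
    private static string NewPath(string basePath) =>
        $"{basePath}-{DateTime.UtcNow.Ticks}.bin";
}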

Related

Upload large files with Spring WebFlux

I want to upload files to a minio file container.
Smaller files work as expected with this code:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
    log.info("About to save database to minio container...");
    Mono<Boolean> result = Mono.from(
        filePart.content().flatMap(dataBuffer -> {
            var bytes = dataBuffer.asByteBuffer().array();
            dataBuffer.read(bytes);
            DataBufferUtils.release(dataBuffer);
            return Flux.just(bytes);
        })
        .flatMap(databaseFileService::write)
        .then(Mono.just(true))
        .onErrorMap(throwable -> {
            log.error(throwable.getMessage(), throwable);
            return throwable;
        }));
    log.info("Successfully saved database to minio container...");
    return result;
}
I need to provide a byte[] to my minio service for the upload.
Smaller files work as expected (they are stored in the container), but larger files (12 MB in my test) don't work.
I get this exception:
java.lang.IndexOutOfBoundsException: readPosition 0 and length 1024 should be smaller than writePosition 808
I've tried the DataBufferUtils.join suggestion from another SO post. It's kind of odd, but I think the following code does the job:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
    var result = DataBufferUtils.join(filePart.content()).map(dataBuffer -> {
            var bytes = dataBuffer.asByteBuffer().array();
            dataBuffer.read(bytes);
            DataBufferUtils.release(dataBuffer);
            return bytes;
        }).map(databaseFileService::write).then(Mono.just(true))
        .onErrorMap(throwable -> {
            log.error(throwable.getMessage(), throwable);
            return throwable;
        });
    log.info("Successfully saved database to minio container...");
    return result;
}
Especially this line seems to do the trick:
DataBufferUtils.join
I don't know why, but it seems to work.
EDIT:
Looking at the static join above, it delegates with the maxByteCount parameter set to -1. Maybe the other approach (the one that doesn't work for larger files) imposes some limit, but I don't know.

How to enable parallelism for a custom U-SQL Extractor

I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in the "Atomic" mode:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor
If I switch off the "Atomic" mode, it looks like U-SQL splits the file at an arbitrary place (I guess into roughly 250MB chunks). This is not acceptable for me. The file format has a special row delimiter. Can I define a custom row delimiter in my Extractor and enable parallelism for it? Technically I could change our row delimiter to a new one if that would help.
Could anyone help me with this question?
The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation defined and may change for performance reasons).
If the file is indeed row delimited, and assuming your raw input data for the row is less than 4MB, you can use the input.Split() function inside your UDO to do the splitting into rows. The call will automatically handle the case if the raw input data spans the chunk boundary (assuming it is less than 4MB).
Here is an example:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // this._row_delim = this._encoding.GetBytes(row_delim); in class ctor
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            int num = 0;
            string[] array = streamReader.ReadToEnd().Split(new string[] { this._col_delim }, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++)
            {
                // DO YOUR PROCESSING
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
Please note that you cannot read across chunk boundaries yourself and you should make sure your data is indeed splittable into rows.
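To actually run in the parallel (chunked) mode that this splitting logic relies on, the extractor is declared as non-atomic; a minimal sketch, reusing the BinaryExtractor class name from the question:
// Non-atomic mode lets U-SQL feed the extractor one chunk of the file at a time;
// input.Split(row_delim) then handles rows that straddle a chunk boundary
// (as long as a single row stays under the 4MB limit mentioned above).
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class BinaryExtractor : IExtractor
{
    // Extract(...) implementation as shown above.
}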

Uploading PDF documents to a SQL Server

We have a scanner that is going to scan documents into a specific directory.
I need ideas on where to start on an application/service that I can schedule to run every hour or so and upload the documents to a Cloud DB.
Every single example I can find involves the uploader that requires a user to browse/select the file; I don't want that.
The documents will be stored as binary if that matters.
Maybe this will help you get started. Build a console app that does something along these lines:
void Main()
{
    while (true)
    {
        var files = Directory.GetFiles("c:\\your\\path\\here\\", "*.pdf");
        foreach (var file in files)
        {
            UploadToCloudDB(file);
            // possibly delete the file now
        }
        Thread.Sleep(30000); // pause for 30 seconds between scans
    }
}

static void UploadToCloudDB(string path)
{
    var bytes = File.ReadAllBytes(path);
    // upload those bytes to the cloud DB ....
}
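If the cloud DB is an Azure SQL database, the UploadToCloudDB stub could be fleshed out along these lines; the Documents table, its columns, and the connection string are assumptions made purely for illustration.
using System.Data.SqlClient;
using System.IO;

static void UploadToCloudDB(string path)
{
    var bytes = File.ReadAllBytes(path);

    // Hypothetical table: Documents(FileName nvarchar(260), Content varbinary(max))
    var connectionString = "Server=tcp:yourserver.database.windows.net;Database=Docs;...";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "INSERT INTO Documents (FileName, Content) VALUES (@name, @content)", connection))
    {
        command.Parameters.AddWithValue("@name", Path.GetFileName(path));
        command.Parameters.AddWithValue("@content", bytes);

        connection.Open();
        command.ExecuteNonQuery();
    }
}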

How can I check file size using ContentLength before saving it on the server (MVC4)

My error is "Maximum request length exceeded".
I want the uploaded file to be smaller than 2 MB. Please help me fix the code below, thanks.
My controller:
public ActionResult Index()
{
    var path = "~/Images/upload/";
    // Note: ContentLength is in bytes, so 2048 here is 2 KB, not 2 MB.
    if (Request.Files["UpFile"] != null && Request.Files["UpFile"].ContentLength < 2048)
    {
        var upload = Request.Files["UpFile"];
        upload.SaveAs(Server.MapPath(path + upload.FileName));
    }
    else
    {
        ModelState.AddModelError("", "The size of file too big");
    }
    return View();
}
Try to manage your maximum request length to keep these errors to a minimum:
Maximum request length exceeded
I think it's good practice to use try..catch when working with file uploads, even if you have a global exception handler.
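A minimal sketch combining both suggestions (the size comparison in bytes plus a try..catch around the save). Note that the "Maximum request length exceeded" error itself is governed by the httpRuntime maxRequestLength setting (in KB) in web.config, so this check only runs once the request has been allowed through:
public ActionResult Index()
{
    const int maxUploadBytes = 2 * 1024 * 1024; // 2 MB expressed in bytes
    var path = "~/Images/upload/";
    var upload = Request.Files["UpFile"];

    try
    {
        if (upload != null && upload.ContentLength > 0 && upload.ContentLength < maxUploadBytes)
        {
            upload.SaveAs(Server.MapPath(path + upload.FileName));
        }
        else
        {
            ModelState.AddModelError("", "The file must be smaller than 2 MB.");
        }
    }
    catch (Exception ex)
    {
        // Even with a global exception handler, surface upload failures here.
        ModelState.AddModelError("", "Upload failed: " + ex.Message);
    }

    return View();
}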

Azure storage: Uploaded files with size zero bytes

When I upload an image file to a blob, the image is uploaded apparently successfully (no errors). When I go to cloud storage studio, the file is there, but with a size of 0 (zero) bytes.
The following is the code that I am using:
// These two methods belong to the ContentService class used to upload
// files in the storage.
public void SetContent(HttpPostedFileBase file, string filename, bool overwrite)
{
    CloudBlobContainer blobContainer = GetContainer();
    var blob = blobContainer.GetBlobReference(filename);
    if (file != null)
    {
        blob.Properties.ContentType = file.ContentType;
        blob.UploadFromStream(file.InputStream);
    }
    else
    {
        blob.Properties.ContentType = "application/octet-stream";
        blob.UploadByteArray(new byte[1]);
    }
}

public string UploadFile(HttpPostedFileBase file, string uploadPath)
{
    if (file.ContentLength == 0)
    {
        return null;
    }

    string filename;
    int indexBar = file.FileName.LastIndexOf('\\');
    if (indexBar > -1)
    {
        filename = DateTime.UtcNow.Ticks + file.FileName.Substring(indexBar + 1);
    }
    else
    {
        filename = DateTime.UtcNow.Ticks + file.FileName;
    }

    ContentService.Instance.SetContent(file, Helper.CombinePath(uploadPath, filename), true);
    return filename;
}
// The above code is called by this code.
HttpPostedFileBase newFile = Request.Files["newFile"] as HttpPostedFileBase;
ContentService service = new ContentService();
blog.Image = service.UploadFile(newFile, string.Format("{0}{1}", Constants.Paths.BlogImages, blog.RowKey));
Before the image file is uploaded to storage, the InputStream property of the HttpPostedFileBase appears to be fine (the size of the image corresponds to what is expected, and no exceptions are thrown).
And the really strange thing is that this works perfectly in other cases (uploading PowerPoints or even other images from the Worker role). The code that calls the SetContent method seems to be exactly the same, and the file path seems to be correct, since a new file with zero bytes is created at the correct location.
Does anyone have any suggestion please? I debugged this code dozens of times and I cannot see the problem. Any suggestions are welcome!
Thanks
The Position property of the InputStream of the HttpPostedFileBase had the same value as the Length property (probably because I had processed another file before this one; a bit silly, I think!).
All I had to do was to set the Position property back to 0 (zero)!
I hope this helps somebody in the future.
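In code, the fix boils down to one line before the upload; a minimal sketch of the relevant branch of the SetContent method from the question:
if (file != null)
{
    blob.Properties.ContentType = file.ContentType;

    // The stream may already have been read to its end by earlier code,
    // so rewind it before handing it to the storage client.
    file.InputStream.Position = 0;

    blob.UploadFromStream(file.InputStream);
}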
Thanks Fabio for bringing this up and answering your own question. I just want to add some code to go with what you said. Your suggestion worked perfectly for me.
var memoryStream = new MemoryStream();

// "upload" is the object returned by fine uploader
upload.InputStream.CopyTo(memoryStream);
memoryStream.ToArray();

// After copying the contents to the stream, reset its position
// back to the beginning (zero)
memoryStream.Seek(0, SeekOrigin.Begin);
And now you are ready to upload memoryStream using:
blockBlob.UploadFromStream(memoryStream);