Uploading PDF documents to a SQL Server

We have a scanner that is going to scan documents into a specific directory.
I need ideas on where to start on an application/service that I can schedule to run every hour or so and upload the documents to a Cloud DB.
Every example I can find involves an uploader that requires a user to browse to and select the file; I don't want that.
The documents will be stored as binary if that matters.

Maybe this will help you get started. Build a console app that does something along these lines:
// requires: using System.IO; and using System.Threading;
static void Main()
{
    while (true)
    {
        var files = Directory.GetFiles("c:\\your\\path\\here\\", "*.pdf");
        foreach (var file in files)
        {
            UploadToCloudDB(file);
            // possibly delete (or move) the file now so it isn't uploaded again on the next pass
        }
        Thread.Sleep(30000); // pause for 30 seconds between scans
    }
}

static void UploadToCloudDB(string path)
{
    var bytes = File.ReadAllBytes(path);
    //upload those bytes to the cloud DB ....
}
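If the target is the SQL Server mentioned in the title, UploadToCloudDB could be fleshed out roughly like this; the table and column names below are made up for illustration, and the connection string is a placeholder.
// Hypothetical body for UploadToCloudDB.
// Requires: using System.Data; using System.Data.SqlClient; (or Microsoft.Data.SqlClient)
// Assumes a table such as:
//   CREATE TABLE ScannedDocuments (Id INT IDENTITY PRIMARY KEY, FileName NVARCHAR(260), Content VARBINARY(MAX))
static void UploadToCloudDB(string path)
{
    var bytes = File.ReadAllBytes(path);
    using (var connection = new SqlConnection("<your connection string>"))
    using (var command = new SqlCommand(
        "INSERT INTO ScannedDocuments (FileName, Content) VALUES (@name, @content)", connection))
    {
        command.Parameters.AddWithValue("@name", Path.GetFileName(path));
        command.Parameters.Add("@content", SqlDbType.VarBinary, -1).Value = bytes;
        connection.Open();
        command.ExecuteNonQuery();
    }
}
If you'd rather not keep a process running with Thread.Sleep, the same console app can do a single pass over the directory and exit, and be scheduled with Windows Task Scheduler, which matches the "every hour or so" requirement.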

Related

Upload large files with Spring WebFlux

I want to upload files to a minio file container.
Smaller files work as expected with this code:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
    log.info("About to save database to minio container...");
    Mono<Boolean> result = Mono.from(
        filePart.content().flatMap(dataBuffer -> {
            var bytes = dataBuffer.asByteBuffer().array();
            dataBuffer.read(bytes);
            DataBufferUtils.release(dataBuffer);
            return Flux.just(bytes);
        })
        .flatMap(databaseFileService::write)
        .then(Mono.just(true))
        .onErrorMap(throwable -> {
            log.error(throwable.getMessage(), throwable);
            return throwable;
        }));
    log.info("Successfully saved database to minio container...");
    return result;
}
I need to provide a byte[] for my minio service to upload.
Smaller files work as expected (they are stored in the container), but larger files (12 MB in my test) don't work.
I get this exception:
java.lang.IndexOutOfBoundsException: readPosition 0 and length 1024 should be smaller than writePosition 808
I've tried the DataBufferUtils.join suggestion from another SO post. It's kind of odd, but I think the following code does the job:
private Mono<Boolean> saveFileToMinio(FilePart filePart) {
    var result = DataBufferUtils.join(filePart.content()).map(dataBuffer -> {
        var bytes = dataBuffer.asByteBuffer().array();
        dataBuffer.read(bytes);
        DataBufferUtils.release(dataBuffer);
        return bytes;
    }).map(databaseFileService::write).then(Mono.just(true))
        .onErrorMap(throwable -> {
            log.error(throwable.getMessage(), throwable);
            return throwable;
        });
    log.info("Successfully saved database to minio container...");
    return result;
}
This line in particular seems to do the trick:
DataBufferUtils.join
I don't know exactly why, but it seems to work.
EDIT:
Looking at the source, the join overload used above passes -1 for its maxByteCount parameter, i.e. it joins the buffers without a size limit. Maybe the other approach (the one that does not work for larger files) runs into some limit, but I don't know.

Uploading and downloading large files (~50 GB) using an ASP.NET Core 2.2 API

I'm struggling to provide the ability in my ASP.NET Core 2.2 app to upload and download large files, up to 50 GB. Currently, for testing purposes, I'm saving the files to local storage, but in the future I will move them to some cloud storage provider.
The files will be sent by another server written in Java; more specifically, it will be a Jenkins plugin that sends project builds to my ASP.NET Core server using this library.
Currently, I use a classic controller class with HttpPost to upload the files, but this doesn't seem like the best solution for my purposes, since I won't use any web page to attach files from the client.
[HttpPost]
[RequestFormLimits(MultipartBodyLengthLimit = 50000000000)]
[RequestSizeLimit(50000000000)]
[AllowAnonymous]
[Route("[controller]/upload")]
public async Task<IActionResult> Upload()
{
    var files = Request.Form.Files;
    SetProgress(HttpContext.Session, 0);
    long totalBytes = files.Sum(f => f.Length);
    if (!IsMultipartContentType(HttpContext.Request.ContentType))
        return StatusCode(415);
    foreach (IFormFile file in files)
    {
        ContentDispositionHeaderValue contentDispositionHeaderValue =
            ContentDispositionHeaderValue.Parse(file.ContentDisposition);
        string filename = contentDispositionHeaderValue.FileName.Trim().ToString();
        byte[] buffer = new byte[16 * 1024];
        using (FileStream output = System.IO.File.Create(GetPathAndFilename(filename)))
        {
            using (Stream input = file.OpenReadStream())
            {
                long totalReadBytes = 0;
                int readBytes;
                while ((readBytes = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    await output.WriteAsync(buffer, 0, readBytes);
                    totalReadBytes += readBytes;
                    int progress = (int)((float)totalReadBytes / (float)totalBytes * 100.0);
                    SetProgress(HttpContext.Session, progress);
                    Log($"SetProgress: {progress}", @"\LogSet.txt");
                    await Task.Delay(100);
                }
            }
        }
    }
    return Content("success");
}
I'm using this code now to upload files, but for larger files (>300 MB) it takes ages for the upload to even start.
I tried looking at many articles on how to achieve this, such as:
Official docs
or
Stack
But none of the solutions seem to work for me: the upload takes ages, and I also noticed that for files around 200 MB (the largest file I could upload so far) my PC gets slower the more data is uploaded.
I need a piece of advice on whether I am following the right path or should change my approach. Thank you.
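Two things stand out in the snippet above: accessing Request.Form.Files makes ASP.NET Core buffer the entire multipart body (to memory or temp files) before the loop even starts, which explains why large uploads take so long to begin, and the await Task.Delay(100) inside the copy loop adds 100 ms per 16 KB chunk, which is roughly half an hour of artificial delay for a 300 MB file. The streaming approach from the official docs linked above avoids the buffering by reading the multipart sections directly. Below is a rough sketch along those lines, reusing the IsMultipartContentType and GetPathAndFilename helpers from the question; the [DisableFormValueModelBinding] filter is the small custom attribute from that docs sample, not a built-in.
// Assumes: using Microsoft.AspNetCore.WebUtilities; using Microsoft.Net.Http.Headers;
[HttpPost]
[DisableFormValueModelBinding]   // from the official docs sample; prevents the form from being buffered
[RequestSizeLimit(50000000000)]
[AllowAnonymous]
[Route("[controller]/uploadstream")]
public async Task<IActionResult> UploadStream()
{
    if (!IsMultipartContentType(Request.ContentType))
        return StatusCode(415);

    var boundary = HeaderUtilities.RemoveQuotes(
        MediaTypeHeaderValue.Parse(Request.ContentType).Boundary).Value;
    var reader = new MultipartReader(boundary, Request.Body);

    var section = await reader.ReadNextSectionAsync();
    while (section != null)
    {
        if (ContentDispositionHeaderValue.TryParse(section.ContentDisposition, out var disposition))
        {
            var fileName = HeaderUtilities.RemoveQuotes(disposition.FileName).Value;
            if (!string.IsNullOrEmpty(fileName))
            {
                using (var output = System.IO.File.Create(GetPathAndFilename(fileName)))
                {
                    // Stream the section straight to disk; only the copy buffer lives in memory.
                    await section.Body.CopyToAsync(output);
                }
            }
        }
        section = await reader.ReadNextSectionAsync();
    }
    return Content("success");
}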

How to limit blob storage file size in ASA output

I'm working with an Azure solution where there is an output to blob storage in ASA (Azure Stream Analytics). I'm getting output files in a folder tree structure like this: yyyy/mm/dd/hh (e.g. 2017/10/26/07). Sometimes files are still being written to an hour's folder after that hour has passed, and as a result the files can get very big. Is there a way to limit the size of those files from ASA?
There is no way to limit the size today; the only size limitation is the blob's own limit. However, ASA will create a new folder for every hour if your path pattern is yyyy/mm/dd/hh. Please note that this is based on the System.Timestamp column, not wall-clock time.
You can limit the file size and create a new file once the existing file reaches the limit by using the Length property below.
namespace Microsoft.Azure.Management.DataLake.Store.Models {
    ...
    // Summary:
    //     Gets the number of bytes in a file.
    [JsonProperty(PropertyName = "length")]
    public long? Length { get; }
    ...
}
Below is an example with a scenario:
Scenario: if the file size exceeds 256 MB (268435456 bytes), create a new file; otherwise use the existing file.
Create a function to determine the file path; below is a sample code snippet for that function.
Code snippet:
public static async Task<string> GetFilePath(DataLakeStoreClient client, string path) {
    var createNewFile = false;
    ......
    if (await client.GetFileSize(returnValue) >= 256 * 1024 * 1024)
    {
        returnValue = GetFilePath(path);
        createNewFile = true;
    }
    ......
}

public async Task<long?> GetFileSize(string filepath) {
    return (await this._client.FileSystem.GetFileStatusAsync(_connectionString.AccountName, filepath)).FileStatus.Length;
}
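To make that scenario concrete, a hypothetical helper along the following lines could decide which file to write to. The data_<n>.json naming scheme and the assumption that GetFileSize returns null (or is wrapped to do so) for a file that does not exist yet are illustrative assumptions, not part of the snippet above.
private const long MaxFileSizeBytes = 256 * 1024 * 1024; // 268435456 bytes

// Hypothetical rollover: keep appending to data_<n>.json until it reaches 256 MB,
// then move on to data_<n+1>.json.
public static async Task<string> GetWritableFilePath(DataLakeStoreClient client, string folder)
{
    var index = 0;
    while (true)
    {
        var candidate = $"{folder}/data_{index}.json";
        long? size = await client.GetFileSize(candidate);   // assumed to be null if the file does not exist yet
        if (size == null || size < MaxFileSizeBytes)
            return candidate;                                // still room in this file (or it is new)
        index++;                                             // file is full: roll over to the next one
    }
}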

How to merge 10000 PDFs into one using PDFBox in the most effective way

The PDFBox API works fine for a small number of files, but I need to merge 10000 PDF files into one, and when I pass 10000 files (about 5 GB) it takes 5 GB of RAM and finally runs out of memory.
Is there an implementation for such a requirement in PDFBox?
To tune it, I tried AutoClosedInputStream, which gets closed automatically after reading, but the result is still the same.
I have a similar scenario here, but I need to merge only 1000 documents into a single one.
I tried to use the PDFMergerUtility class, but I was getting an OutOfMemoryError. So I refactored my code to read each document, load its first page (my source documents have one page only), and then import that page, instead of using PDFMergerUtility. Now it works fine, with no more OutOfMemoryError.
// setupTempFileOnly() is a static import of MemoryUsageSetting.setupTempFileOnly(),
// which keeps PDFBox's working buffers on disk instead of in memory.
public void merge(final List<Path> sources, final Path target) {
    final int firstPage = 0;
    try (PDDocument doc = new PDDocument()) {
        for (final Path source : sources) {
            try (final PDDocument sdoc = PDDocument.load(source.toFile(), setupTempFileOnly())) {
                final PDPage spage = sdoc.getPage(firstPage);
                doc.importPage(spage);
            }
        }
        doc.save(target.toAbsolutePath().toString());
    } catch (final IOException e) {
        throw new IllegalStateException(e);
    }
}

Windows Azure Storage Blobs to zip file with Express

I am trying to use this plugin (express-zip). On the Azure Storage side we have getBlobToStream, which gives us the file via a specific stream. What I do now is get the image from the blob, save it on the server, and then res.zip it. Is it somehow possible to create a write stream which will write into a read stream?
Edit: The question has been edited to ask about doing this in express from Node.js. I'm leaving the original answer below in case anyone was interested in a C# solution.
For Node, you could use a strategy similar to what express-zip uses, but instead of passing a file read stream in this line, pass in a blob read stream obtained using createReadStream.
Solution using C#:
If you don't mind caching everything locally while you build the zip, the way you are doing it is fine. You can use a tool such as AzCopy to rapidly download an entire container from storage.
To avoid caching locally, you could use the ZipArchive class, such as the following C# code:
internal static void ArchiveBlobs(CloudBlockBlob destinationBlob, IEnumerable<CloudBlob> sourceBlobs)
{
    using (Stream blobWriteStream = destinationBlob.OpenWrite())
    {
        using (ZipArchive archive = new ZipArchive(blobWriteStream, ZipArchiveMode.Create))
        {
            foreach (CloudBlob sourceBlob in sourceBlobs)
            {
                ZipArchiveEntry archiveEntry = archive.CreateEntry(sourceBlob.Name);
                using (Stream archiveWriteStream = archiveEntry.Open())
                {
                    sourceBlob.DownloadToStream(archiveWriteStream);
                }
            }
        }
    }
}
This creates a zip archive in Azure storage that contains multiple blobs without writing anything to disk locally.
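Calling it might look roughly like this, assuming the classic WindowsAzure.Storage SDK that the snippet above appears to use; the connection string, container name, and archive name are placeholders.
// Assumes: using System.Linq; using Microsoft.WindowsAzure.Storage; using Microsoft.WindowsAzure.Storage.Blob;
CloudStorageAccount account = CloudStorageAccount.Parse("<storage connection string>");
CloudBlobClient blobClient = account.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("images");

// Zip every blob in the container into archive.zip, written back to the same container.
// (If archive.zip already exists in the container, you may want to filter it out of the listing.)
var sourceBlobs = container.ListBlobs(useFlatBlobListing: true).OfType<CloudBlob>();
CloudBlockBlob destination = container.GetBlockBlobReference("archive.zip");
ArchiveBlobs(destination, sourceBlobs);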
I'm the author of express-zip. What you are trying to do should be possible. If you look under the covers, you'll see I am in fact adding streams into the zip:
https://github.com/thrackle/express-zip/blob/master/lib/express-zip.js#L55
So something like this should work for you (prior to me adding support for this in the interface of the package itself):
var zip = zipstream(exports.options);
zip.pipe(express.response || http.ServerResponse.prototype); // res is a writable stream

var addFile = function(file, cb) {
    zip.entry(getBlobToStream(), { name: file.name }, cb);
};

async.forEachSeries(files, addFile, function(err) {
    if (err) return cb(err);
    zip.finalize(function(bytesZipped) {
        cb(null, bytesZipped);
    });
});
Apologies if I've made horrible errors above; I haven't worked on this for a bit.