I am trying to upload a PDF to FastAPI. After turning the PDF into a base64 blob and storing it in a .txt file, I POST this file to FastAPI using Postman.
This is my server-side code:
from fastapi import FastAPI, File, UploadFile
import base64

app = FastAPI()

@app.post("/uploadfile/")
async def create_upload_file(file: UploadFile = File(...)):
    contents = await file.read()
    blob = base64.b64decode(contents)
    pdf = open('result.pdf', 'wb')
    pdf.write(blob)
    pdf.close()
    return {"filename": file.filename}
This procedure works fine for a single-page PDF document of 279 KB (blob size: 372 KB), but it doesn't for a multi-page document of 1.8 MB (blob size: 2.4 MB).
When I try, I get the following warning and a 400 Bad Request response (along with the response detail "There was an error parsing the body"):
"Did not find boundary character 55 at index 2"
I'm sure there must be an explanation for this behavior? Maybe it has something to do with async?
This is most likely an issue with how the file is saved using open(). If the file object is never properly closed, buffered data may not be flushed to disk, which can leave a large file truncated or corrupted. To ensure the whole file is written and the handle is closed even if an error occurs, use a with block, such as this:
with open('result.pdf', 'wb') as outfile:
    outfile.write(blob)
With a with block you do not need to call close() after writing; it is also considered better practice than keeping the file object in a local variable and closing it manually.
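Putting it together, the endpoint from the question could look like this (a minimal sketch based on the code above; note that FastAPI needs the python-multipart package installed to parse form uploads):

from fastapi import FastAPI, File, UploadFile
import base64

app = FastAPI()

@app.post("/uploadfile/")
async def create_upload_file(file: UploadFile = File(...)):
    contents = await file.read()
    blob = base64.b64decode(contents)
    # The with block guarantees the data is flushed and the file closed.
    with open('result.pdf', 'wb') as outfile:
        outfile.write(blob)
    return {"filename": file.filename}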
I am trying to convert an MS Word (.docx) file to PDF format using the Graph API. The file is stored in SharePoint (Office 365). I am using the code below, which works:
var httpClient = await CreateAuthorizedHttpClient();
string path = $"{GraphEndpoint}sites/{SiteId}/drive/items/";
string requestUrl = $"{path}{fileId}/content?format={targetFormat}";
var response = await httpClient.GetAsync(requestUrl);
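For reference, the same documented conversion endpoint (driveItem content with format=pdf) can be sketched in Python with requests; the site id, file id, and token below are placeholders:

import requests

GRAPH_ENDPOINT = "https://graph.microsoft.com/v1.0/"
site_id = "<site-id>"        # placeholder
file_id = "<file-id>"        # placeholder
access_token = "<token>"     # placeholder OAuth bearer token

url = f"{GRAPH_ENDPOINT}sites/{site_id}/drive/items/{file_id}/content?format=pdf"
# Graph responds with a 302 redirect to a pre-authenticated download URL;
# requests follows the redirect by default for GET.
response = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
response.raise_for_status()
with open("converted.pdf", "wb") as f:
    f.write(response.content)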
However, converting fails when we try to convert a .docx file which contains HTML that was added using the code below:
string altChunkId = "myId123";

//Create an alternative format import part on the MainDocumentPart
AlternativeFormatImportPart altformatImportPart = wordDoc.MainDocumentPart
    .AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, altChunkId);

using (MemoryStream htmlMemoryStream = new MemoryStream(Encoding.UTF8.GetBytes($"<html><head></head><body>{value}</body></html>")))
{
    //Add the HTML data into the alternative format import part
    altformatImportPart.FeedData(htmlMemoryStream);

    //Create a new altChunk and link it to the id of the AlternativeFormatImportPart
    AltChunk altChunk = new AltChunk();
    altChunk.Id = altChunkId;
    //p.InsertAfterSelf(altChunk);
    documentBody.Append(altChunk);
    break;
}
I get a 406 Not Acceptable error when we try to convert the file using the Graph API. I also see that the file is not editable in the browser and opens in compatibility mode. If I try to open the document in edit mode, I get the error:
Sorry, this document can't be opened because it contains objects that
Word doesn't support
I tried removing the HTML part of the document, pasting it into another document, and converting that document to PDF, which worked. When I looked at the XML of that document, I saw that the Word app had converted the HTML into Word-compatible XML tags.
Question 1: How can I convert the HTML to Word-compatible tags, so that I can convert the document to PDF?
Also, if I use "Download as PDF", the file is converted to PDF without any issue. This option uses the API call below:
https://word-view.officeapps.live.com/wv/WordViewer/request.pdf?WOPIsrc={SiteURL}%2F%5Fvti%5Fbin%2Fwopi%2Eashx%2Ffiles%2F{ID}&access_token=&access_token_ttl=&z=256&type=downloadpdf
Question 2: Is there a way I can use this API to convert a .docx file to PDF? I saw that the access token's audience value is "wopi/{TenantName}#{TenantID}". If I can get the correct access token, I think I will be able to use the above API.
I am trying to load DICOMs from a DICOM server. Loading a single file with the URL is working fine.
Now I want to load a whole series of DICOM data. I get the data from the server with an HTTP request as a zip archive.
I have tried to unzip the response with the zip.js library and pass the unzipped data to the loader.parse function, to load the DICOMs as in the "viewers_upload" example. But I get the error that the file could not be parsed.
Is there a way to load the data without the URL? Or how do I have to modify the example so that it will work for a zip archive?
This is the code for unzipping the file and passing it to the parser:
reader.getEntries(function (entries) {
    if (entries.length) {
        // getting one entry from the zip file
        entries[0].getData(new zip.ArrayBufferWriter(), function (dicom) {
            loader.parse({url: "dicomName", dicom});
        }, function (current, total) {
            // progress callback
        });
    }
});
The error message is:
"dicomParser.readFixedString: attempt to read past end of buffer"
"Uncaught (in promise) parsers.dicom could not parse the file"
I think the problem might be the data type returned from the zip file. Which type do I have to pass to the parse function? What structure does the parser expect the data to have, and what buffer length does it expect?
Is it possible to perform a file upload to DRF with HyperlinkedModelSerializer in a model which has a FileField?
I am using the coreapi File class from the utils package, and coreapi complains about the File object not being a "JSON primative" (sic).
Looking through the code it looks like the schema has to say the encoding must be multipart form.
Where can I find a working example for such a file upload to DRF into a model with a FileField?
So… reading through the code, I came across the encoding parameter of client.action.
If it is set to multipart/form-data, the file is correctly encoded as a body parameter instead of being validated as a JSON field.
with open('/Users/Jonathan/Desktop/test.png', 'rb') as f:
    client.action(schema, ['incidents', 'create'],
                  params={'file': utils.File('test.png', f)},
                  encoding="multipart/form-data")
Reading through transports/http.py and utils.py for the rest of the story….
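For completeness, the server side can be sketched roughly as below (the Incident model and the 'incidents' route are hypothetical names mirroring the client call above; DRF's default parsers already include MultiPartParser, so no extra parser configuration is needed):

from django.db import models
from rest_framework import routers, serializers, viewsets

class Incident(models.Model):
    file = models.FileField(upload_to='incidents/')

class IncidentSerializer(serializers.HyperlinkedModelSerializer):
    class Meta:
        model = Incident
        fields = ('url', 'file')

class IncidentViewSet(viewsets.ModelViewSet):
    queryset = Incident.objects.all()
    serializer_class = IncidentSerializer

# Hooked up via a router so that ['incidents', 'create'] exists in the schema.
router = routers.DefaultRouter()
router.register(r'incidents', IncidentViewSet)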
ASP.NET MVC - Is it possible to upload only the first 10 lines of a file? Basically, we have files that can range from 1-10GB, but the data we need is present only in the first 10 rows of the file. Using the typical web development approach, we'd upload the whole file to the server and then read the first 10 rows, but uploading a 10GB file just to read a few bytes of data seems like a big waste of resources. Is it possible to read such a file without uploading all of it to the web server?
Solution - the File API's slice function solved this problem (thanks to Chris below). The simplified code is below for anyone interested -
var sampleFile = document.getElementById('yourfileelement').files[0];
var reader = new FileReader();
var fileData = sampleFile.slice(0, 500000); // read the top 500,000 bytes
reader.onprogress = function (evt) { /* show progress bar etc. */ };
reader.onloadend = function (evt) { alert(evt.target.result); }; // evt.target.result contains the file data that was read
reader.readAsText(fileData);
No, but you may be able to accomplish it by using the File API client-side to read just the first 10 lines and send them to the server via AJAX. However, note that the File API is only supported in modern browsers, so this won't work in IE 9 or earlier. You might be able to build a more comprehensive solution using a Flash or Java applet, but ugh.
The Dropbox REST API's metadata function has a parameter named "hash": https://www.dropbox.com/developers/reference/api#metadata
Can I calculate this hash locally, without calling any remote REST API function?
I need to know this value to reduce upload bandwidth.
https://www.dropbox.com/developers/reference/content-hash explains how Dropbox computes their file hashes. A Python implementation of this is below:
import hashlib

DROPBOX_HASH_CHUNK_SIZE = 4 * 1024 * 1024

def compute_dropbox_hash(filename):
    with open(filename, 'rb') as f:
        block_hashes = b''
        while True:
            chunk = f.read(DROPBOX_HASH_CHUNK_SIZE)
            if not chunk:
                break
            # Hash each 4 MB block, then hash the concatenation of the block hashes.
            block_hashes += hashlib.sha256(chunk).digest()
        return hashlib.sha256(block_hashes).hexdigest()
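For example (assuming a local file hello.txt that also exists in your Dropbox), the value can be checked against what the v2 API reports:

local_hash = compute_dropbox_hash('hello.txt')
print(local_hash)

# With the v2 Python SDK, the server-side value is exposed as
# FileMetadata.content_hash and should match the locally computed hash:
#   md = dbx.files_get_metadata('/hello.txt')
#   assert md.content_hash == local_hash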
The "hash" parameter on the metadata call isn't actually the hash of the file, but a hash of the metadata. It's purpose is to save you having to re-download the metadata in your request if it hasn't changed by supplying it during the metadata request. It is not intended to be used as a file hash.
Unfortunately I don't see any way via the Dropbox API to get a hash of the file itself. I think your best bet for reducing your upload bandwidth would be to keep track of the hash's of your files locally and detect if they have changed when determining whether to upload them. Depending on your system you also likely want to keep track of the "rev" (revision) value returned on the metadata request so you can tell whether the version on Dropbox itself has changed.
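A minimal sketch of that bookkeeping (the cache file name and layout are hypothetical; compute_dropbox_hash is the helper from the answer above):

import json
import os

CACHE_FILE = '.dropbox_sync_cache.json'  # hypothetical cache: path -> {'hash': ..., 'rev': ...}

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def needs_upload(path, cache):
    # Upload only when the local content hash differs from the cached one.
    entry = cache.get(path)
    return entry is None or entry['hash'] != compute_dropbox_hash(path)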
This won't directly answer your question, but is meant more as a workaround: the Dropbox SDK ships a simple updown.py example that uses file size and modification time to check whether a file is current.
An abbreviated example taken from updown.py:
dbx = dropbox.Dropbox(api_token)
...
# returns a dictionary of name: FileMetaData
listing = list_folder(dbx, folder, subfolder)
# name is the name of the file
md = listing[name]
# fullname is the path of the local file
mtime = os.path.getmtime(fullname)
mtime_dt = datetime.datetime(*time.gmtime(mtime)[:6])
size = os.path.getsize(fullname)
if (isinstance(md, dropbox.files.FileMetadata) and
        mtime_dt == md.client_modified and size == md.size):
    print(name, 'is already synced [stats match]')
As far as I know, no, you can't.
The only way is to use the Dropbox API, which is explained here.
The rclone Go program from https://rclone.org has exactly what you want:
rclone hashsum dropbox localfile
rclone hashsum dropbox localdir
It can't take more than one path argument but I suspect that's something you can work with...
t0|todd@tlaptop/p8 ~/tmp|295$ echo "Hello, World!" > dropbox-hash-demo/hello.txt
t0|todd@tlaptop/p8 ~/tmp|296$ rclone copy dropbox-hash-demo/hello.txt dropbox-ttf:demo
t0|todd@tlaptop/p8 ~/tmp|297$ rclone hashsum dropbox dropbox-hash-demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt
t0|todd@tlaptop/p8 ~/tmp|298$ rclone hashsum dropbox dropbox-ttf:demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt