Export document to pdf file with separated pages - sap

I am using the RESTful API /documents/id to retrieve a PDF file as described here. But I get a single-page PDF even if the document is very large: it comes out as one big page. However, if I export to PDF from the business intelligence designer, I get the PDF in multi-page format. How can I export the document in multi-page format through the RESTful API? In the documentation I do not see an appropriate parameter, just dpi for PDF...

I found an answer. There is a separate API for this: /raylight/v1/documents/27592/pages, see section 8.1.13.2 of the User Guide.
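For illustration, here is a minimal Python sketch of that request. The server name, port, /biprws prefix and token value are hypothetical, the X-SAP-LogonToken header is what I would expect the platform to require, and it is an assumption that the pages endpoint accepts Accept: application/pdf; check the User Guide section for the exact media types.

```python
import urllib.request

def build_pages_pdf_request(base_url, doc_id, logon_token):
    """Build (but do not send) a GET for the paginated export of a document."""
    url = f"{base_url}/raylight/v1/documents/{doc_id}/pages"
    headers = {
        "Accept": "application/pdf",      # ask for PDF output (assumed to be accepted here)
        "X-SAP-LogonToken": logon_token,  # session token from a prior logon call (assumed header)
    }
    return urllib.request.Request(url, headers=headers, method="GET")

# Hypothetical server and token; the document id is the one from the answer.
req = build_pages_pdf_request("http://server:6405/biprws", 27592, '"token-value"')
print(req.get_method(), req.full_url)

# Sending it would look roughly like this:
#   with urllib.request.urlopen(req) as resp, open("report.pdf", "wb") as out:
#       out.write(resp.read())
```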

Related

Azure Computer Vision API - OCR to Text on PDF files

I'm attempting to leverage the Computer Vision API to OCR a PDF file that is a scanned document but is treated as an image PDF.
I've tested it and it tells me that the PDF is "InvalidImageFormat", "Input data is not a valid image". When I test it on a PNG, it works perfectly.
Is there any way to use the API against a PDF image, or is there an Azure API that I could use in conjunction to go PDF > PNG > Text?
Edit
Since I answered, additional services have become available. I have not personally tried all of them, but they may suit this purpose.
https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-intro
And, at some point in the future, once it goes GA:
https://aws.amazon.com/textract/
Original Answer
Unfortunately, Azure has no PDF integration for its Computer Vision API. To make use of Azure Computer Vision you would need to convert the PDF to an image (JPG, PNG, BMP, GIF) yourself.
Google do now offer pdf integration and I have been seeing some really good results from it from my testing so far.
This is done through the asyncBatchAnnotateFiles method of the Vision client (I have been using the Node.js variant of the API).
It can handle files of up to 2000 pages. Results are divided into 20-page segments and output to Google Cloud Storage.
https://cloud.google.com/vision/docs/pdf
The latest OCR service offered recently by Microsoft Azure is called Recognize Text, which significantly outperforms the previous OCR engine. Recognize Text can now be used with Read, which reads and digitizes PDF documents up to 200 pages.
There is a new cognitive service API called Azure Form Recognizer (currently in preview - November 2019) available, that should do the job:
https://azure.microsoft.com/en-gb/services/cognitive-services/form-recognizer/
It can process the file formats you wanted:
Format must be JPG, PNG, or PDF (text or scanned). Text-embedded PDFs
are best because there's no possibility of error in character
extraction and location.
https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/overview
Here is the link to the official Form Recognizer API documentation:
https://westus2.dev.cognitive.microsoft.com/docs/services/form-recognizer-api/operations/AnalyzeWithCustomModel
Note:
Form Recognizer is currently available in English, with additional language
availability growing (4.12.2019)
Form Recognizer is available in
the following Azure regions (4.12.2019):
Canada Central, North Europe, West Europe, UK South, Central US, East US, East US 2, South Central US, West US
https://azure.microsoft.com/en-in/global-infrastructure/services/?products=cognitive-services
Sorry, you have to break the PDF pages into images (JPG or PNG) and then send the images over to Computer Vision. Breaking it down is also a good idea because you don't have to OCR all pages, only the ones that matter.
There is a new Read API to work with PDF
https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text
Computer Vision’s Read API is Microsoft’s latest OCR technology that extracts
printed text (seven languages), handwritten text (English only), digits, and
currency symbols from images and multi-page PDF documents.
Read API reference: https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-ga/operations/5d986960601faab4bf452005
It works well enough, but does not have a lot of languages yet.
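As far as I understand the Read API, the call is asynchronous: you POST the file, the service answers 202 with an Operation-Location header, and you poll that URL until the status is "succeeded". Below is a minimal sketch of both ends in Python. The /vision/v3.0/read/analyze path and the JSON shape are my reading of the v3.0 reference, and the endpoint, key and sample payload are made up, so verify against the reference linked above.

```python
import json
import urllib.request

def build_read_request(endpoint, key, pdf_bytes):
    """Build the POST that submits a PDF to the Read API (v3.0 path assumed)."""
    url = endpoint.rstrip("/") + "/vision/v3.0/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/octet-stream",  # raw file bytes in the body
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")

def extract_lines(result):
    """Collect recognized text lines, page by page, from a finished Read result."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]

# Hand-written sample shaped like a finished result, not real service output.
sample = json.loads(
    '{"status": "succeeded", "analyzeResult": {"readResults": ['
    '{"page": 1, "lines": [{"text": "Invoice 42"}]},'
    '{"page": 2, "lines": [{"text": "Total due"}]}]}}'
)
print(extract_lines(sample))
```

The polling loop is left out on purpose; it is a plain GET on the Operation-Location URL with the same subscription key header.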
You can convert the pdf to images for each page using fitz.
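# import packages
import fitz  # PyMuPDF
import numpy as np
import cv2
# set path to pdf (placeholder, replace with a real path)
path2doc = "<path to pdf>"
# open pdf with fitz
doc = fitz.open(path2doc)
# determine number of pages (current PyMuPDF name; older releases used pageCount)
pagecount = doc.page_count
# loop over all pages and convert each one to an image (here jpeg)
images = []
for i in range(pagecount):
    page = doc[i]
    # render the page and encode it as JPEG bytes
    # (if your PyMuPDF version lacks JPEG output, use output="png")
    pix = page.get_pixmap().tobytes(output="jpeg")
    # decode the bytes into an OpenCV BGR array
    jpg_as_np = np.frombuffer(pix, dtype=np.uint8)
    image = cv2.imdecode(jpg_as_np, flags=1)
    images.append(image)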
Once this is done, you can send the images to the API.

SAP BI Open Doc URL for retrieving pdf

In a reporting application we use, we were using the BI 3.x API to produce Web reports. While migrating to version 4.x, we thought it would be fine to go with the open doc URL rather than generating the report through the API.
Many of the samples I have seen use the sIDType and iDocID parameters along with a token value to retrieve the document by constructing a URL like this: http://server:port/BOE/OpenDocument/opendoc/openDocument.jsp?token=[LogonToken]&iDocID=[XXXX]&sIDType=CUID
But all those URLs get an HTML page as the response from the BI 4.x SAP web service; the JavaScript in that HTML page then does the work of retrieving the PDF file.
I am just wondering if there is any way I could retrieve the PDF report directly as the response from the BI web service? Please assist me with this. Thanks.
You can if you use the REST SDK to retrieve the document, refresh it and then export it to PDF.
In short, these are the steps:
Logon: POST /biprws/logon/long
Get the doc's prompts (if any): GET /biprws/raylight/v1/documents/5690743/parameters
Pass the correct values for the prompts (if any) and refresh the document: PUT /biprws/raylight/v1/documents/5690743/parameters
Export as PDF: GET /biprws/raylight/v1/documents/5690743
That last step requires you to pass Accept: application/pdf in your HTTP headers to get the PDF version.
Detailed information on the REST SDK and the different steps listed above is available on help.sap.com (look for the manual SAP BusinessObjects RESTful Web Service SDK User Guide for Web Intelligence and the BI Semantic Layer).
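To make the sequence concrete, here is a rough Python sketch that only builds the four requests without sending them. The host, port, credentials, token and the empty prompt payload are placeholders, and the JSON logon body and the X-SAP-LogonToken header are what I would expect from the SDK rather than something quoted from the guide, so treat them as assumptions.

```python
import json
import urllib.request

BASE = "http://server:6405/biprws"  # hypothetical host and port

def logon_request(user, password, auth="secEnterprise"):
    """Step 1: POST credentials; the reply is expected to carry the logon token."""
    body = json.dumps({"userName": user, "password": password, "auth": auth}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(BASE + "/logon/long", data=body, headers=headers, method="POST")

def document_request(doc_id, token, method="GET", suffix="", accept="application/json", body=None):
    """Steps 2-4: build a request against /raylight/v1/documents/<id>[suffix]."""
    headers = {"Accept": accept, "X-SAP-LogonToken": token}  # assumed header name
    if body is not None:
        headers["Content-Type"] = "application/json"  # assumed payload type
    url = f"{BASE}/raylight/v1/documents/{doc_id}{suffix}"
    return urllib.request.Request(url, data=body, headers=headers, method=method)

token = '"hypothetical-token"'  # would come from the logon response
steps = [
    logon_request("user", "secret"),
    document_request(5690743, token, suffix="/parameters"),
    # b"{}" stands in for the real prompt answers
    document_request(5690743, token, method="PUT", suffix="/parameters", body=b"{}"),
    document_request(5690743, token, accept="application/pdf"),
]
for req in steps:
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Only the last request differs in its Accept header, which is what switches the response from a document description to the PDF bytes.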
Use sOutputFormat=P to always retrieve the PDF of the report when using the open doc URL.
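A small sketch of building such a URL in Python, assuming the same openDocument.jsp path and the token, iDocID and sIDType parameters from the question; the host, port, CUID and token values are placeholders.

```python
from urllib.parse import urlencode

def opendoc_pdf_url(base, doc_cuid, token):
    """Build an OpenDocument URL that asks for PDF output via sOutputFormat=P."""
    params = {
        "iDocID": doc_cuid,
        "sIDType": "CUID",
        "sOutputFormat": "P",  # P = PDF
        "token": token,
    }
    return base + "/BOE/OpenDocument/opendoc/openDocument.jsp?" + urlencode(params)

# Placeholder server, CUID and token.
url = opendoc_pdf_url("http://server:8080", "AbCd1234", "LOGONTOKEN")
print(url)
```

urlencode also takes care of escaping the token, which matters because real logon tokens tend to contain characters that are not URL-safe.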

Streaming large PDFs from SharePoint

I have a client that wants to store large PDFs (>700MB) on SharePoint 2013. The problem is that viewing a PDF currently requires the entire PDF to be downloaded before the first page is displayed. I need the browser to display each page of the PDF as it downloads, a feature I believe Adobe calls "Fast Web View" or "Byte Streaming". Here is what I know:
"Fast Web View" is enabled on the PDF document in the Document Properties window.
I can verify that the PDF is "Linearized" by reading the ASCII content.
I have checked the PDF reading options from the PDF Accessibility.
The client has SharePoint 2013 on premise installed.
SharePoint's File Handling is set to permissive.
I have verified that PDF is in the AllowedInlineDownloadedMimeTypes of the Web Application.
Anything else I should check or configure?
It is not enough if the PDF files are linearized (technical term in PDF parlance) or optimized for fast web view (marketing term for that feature).
Two conditions need to be met before the end user can benefit from fast web view:
The PDF viewer needs to be able to make use of the linearized/optimized PDF file features.
The PDF serving remote host (in this case SharePoint) needs to be properly configured to honor 'byte range requests' by the viewer, so that chunks of the PDF file can be delivered "out of order".
However,...
...I do not know if SharePoint servers in general do support the second requirement;
...if SharePoint is not the problem, you may want to check which PDF viewer is actually in use in that environment (test it with Adobe Reader -- that one takes advantage of linearized PDF features for sure).
See also this answer to a question from today, which gives a few more technical details:
How are PDF files able to be partially displayed while downloading?
A co-worker identified the problem after comparing the download from SharePoint to that of a working site using Wireshark. The SharePoint site didn't include "Byte ranging" in the response headers. In order to enable that feature in SharePoint, you have to enable BlobCache. Beware, BlobCache is not supported in SharePoint Foundation.
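If you want to check this without Wireshark, a range probe does the job: ask for the first kilobyte only and see whether the server answers 206 with a Content-Range header instead of 200 with the whole file. A minimal Python sketch follows; the helper names and the sample responses are made up for illustration.

```python
import urllib.request

def build_range_probe(url, first=0, last=1023):
    """Build a GET that asks only for bytes first..last of the resource."""
    headers = {"Range": f"bytes={first}-{last}"}
    return urllib.request.Request(url, headers=headers, method="GET")

def honors_byte_ranges(status, headers):
    """True if the reply to a range probe is a partial-content response."""
    names = {name.lower() for name in headers}
    return status == 206 and "content-range" in names

# Hand-written example responses, not captured from a real server.
print(honors_byte_ranges(206, {"Content-Range": "bytes 0-1023/734003200"}))
print(honors_byte_ranges(200, {"Content-Length": "734003200"}))

# Against a live site it would be used roughly like this:
#   with urllib.request.urlopen(build_range_probe(pdf_url)) as resp:
#       ok = honors_byte_ranges(resp.status, dict(resp.headers))
```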

PDF viewer without download option

I have some PDF files that I'd like to upload to my association's site, but I'd like visitors not to be able to download them, as they may contain some slightly sensitive information.
OK, they could still Ctrl+C, but not having the files locally would reduce the spread of the information.
PHP/JS, whatever.
Thanks.
Quoted answer from a similar question, if you're using the Adobe PDF viewer:
You can NOT prevent users from saving ANY TYPE of document from the web - PDF, HTML, JPEG, etc. It's a "feature" of the web.
What you CAN DO is prevent users from being able to use the PDF once it hits their own disk. To do this, you use powerful Digital Rights Management solutions...:
https://forums.adobe.com/message/5158866

PDF converter for Classic ASP Application?

I need my ASP page (report) to be converted to PDF format. Is there any free 3rd-party control available for Classic ASP?
Take a look at
http://code.google.com/p/wkhtmltopdf/
It uses webkit.
Basically it's an EXE. You can call it from ASP (by starting a process) and then send the output file in the response stream.
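The pattern is the same in any language: run the EXE with an input and an output path, then read the file it wrote. Here it is sketched in Python rather than Classic ASP, just to show the shape; the function names, report URL and output file name are hypothetical, and it assumes wkhtmltopdf is on the PATH.

```python
import subprocess

def wkhtmltopdf_command(source, output_pdf, exe="wkhtmltopdf"):
    """Command line for wkhtmltopdf: executable, input page or URL, output file."""
    return [exe, source, output_pdf]

def convert(source, output_pdf):
    """Run the converter and return the PDF bytes for the response stream."""
    subprocess.run(wkhtmltopdf_command(source, output_pdf), check=True)
    with open(output_pdf, "rb") as pdf_file:
        return pdf_file.read()

# Hypothetical report URL and output file name.
print(wkhtmltopdf_command("http://localhost/report.asp", "report.pdf"))
```

Pointing the converter at the report's URL rather than a saved HTML file means the ASP page renders normally first, so the PDF matches what the browser would show.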
ActivePDF and WebSupergoo are good choices.
Another option is the Adobe FDF toolkit. The fdf toolkit is cool because it lets you store pdf form data, can be saved in a db, and then it populates a pdf.