How can we show a PDF in QnA Maker as the answer to a question? - qnamaker

I have PDF documents that contain lots of screenshots and very little text. How can I show the PDF itself directly when a user asks a particular question?

Based on your description of the document in question it sounds like QnA Maker is the wrong tool for the job.
The following is from the official documentation:
QnA Maker imports your content into a knowledge base of question and answer pairs.
QnA Maker does not perform OCR on the ingested documents. You could use the Read API, which is part of the Computer Vision offering, to OCR the document and extract text from the images, then feed that text into QnA Maker.
Alternatively, you could manually update the QnA pair in question to include the link to the PDF using the markdown syntax outlined here.
e.g. For more information, see product PDF which is available [here](https://<your-url-here>).
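If the QnA pairs are prepared by a script rather than typed into the portal, the answer text with its markdown link can be assembled like this. This is only a sketch: the function name and the example URL are made up for illustration, and it only builds the string, it does not call any QnA Maker API.

```python
from urllib.parse import quote


def answer_with_pdf_link(answer_text: str, pdf_url: str, link_text: str = "here") -> str:
    """Build a markdown answer that ends with a link to a hosted PDF.

    Hypothetical helper for illustration; it only formats text.
    """
    if not pdf_url.lower().startswith(("http://", "https://")):
        raise ValueError("pdf_url must be an absolute http(s) URL")
    # Percent-encode spaces and parentheses so they cannot break the
    # [text](url) markdown syntax; leave the usual URL punctuation alone.
    safe_url = quote(pdf_url, safe=":/?#[]@!$&'*+,;=%")
    return f"{answer_text} [{link_text}]({safe_url})."


print(answer_with_pdf_link(
    "For more information, see the product PDF, which is available",
    "https://example.com/docs/user guide (v2).pdf",  # placeholder URL
))
```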

Well, I got the solution to this. There is a Link icon in the answers section on qnamaker.ai where we can provide a link to the PDF; the PDF can be stored in blob storage or any other public location.

Related

Extract Text from PDF and Save Extracted Text in Excel or Elsewhere

I am not a professional programmer. I would like a simple way to extract text from a PDF and save the text into Excel.
I think UiPath can extract text using OCR, but I don't think that is a very reliable way.
Can I use UiPath to do the text extraction in a more reliable way than OCR?
Can I use Python, R or other user-friendly software to extract the text from a PDF?
Thank you!
The UiPath OCR capabilities are very good when the PDF or image being processed is of high quality. However, it performs poorly on low-resolution text. If OCR is your only option on low-quality artifacts, you'll want to use a sophisticated AI offering such as Google Cloud Vision as your OCR tool of choice. I compared UiPath vs Cloud Vision and the difference was stark.
Tagged vs Untagged PDFs
Check to see if the PDF document you are using is tagged or not. You can view this by looking at the document properties, as in this example:
Better than OCR
If your PDF is tagged, you can use the UiPath Anchor Base activity to extract name-value pairs. And you can perform a structured UiPath data scrape to extract tabular list type data. The results from these extractions will be very high quality and much easier to work with than a full page scrape or OCR.
Save to Excel
As for the need to save to Excel, UiPath has many built in features for working with Excel, spreadsheets and CSV files in general. The basic process is this:
Scrape the data
Store the scraped text in a DataTable
Create an Excel Application Scope activity
Append the DataTable to the Excel file
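Since the question also asks about Python, the same four steps can be sketched there. This is not UiPath code. It assumes the text has already been extracted by some other tool, that the scraped text has one "name: value" pair per line (an assumption about your document), and it writes CSV, which Excel opens directly, instead of a native .xlsx file so that only the standard library is needed. The function names are made up for illustration.

```python
import csv
import io
from typing import TextIO


def parse_pairs(scraped_text: str) -> list[dict[str, str]]:
    """Turn 'name: value' lines into table rows; other lines are skipped."""
    rows = []
    for line in scraped_text.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip():
            rows.append({"name": name.strip(), "value": value.strip()})
    return rows


def write_rows(rows: list[dict[str, str]], stream: TextIO) -> None:
    """Write the rows as CSV with a header line to any text stream."""
    writer = csv.DictWriter(stream, fieldnames=["name", "value"])
    writer.writeheader()
    writer.writerows(rows)


scraped = "Invoice number: 42\nsome unrelated line\nTotal: 199.00"
buffer = io.StringIO(newline="")
write_rows(parse_pairs(scraped), buffer)
print(buffer.getvalue())
```

To write a real file, pass `open("output.csv", "w", newline="", encoding="utf-8")` as the stream instead of the in-memory buffer.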
Here's a simple example of a UiPath Studio project that does exactly that:
As you can see from the image above, the data is scraped, the DataTable is iterated over and finally UiPath saves to Excel:
OCR is the way to go when extracting text from a PDF file.
Answering #1: Simply use the Read PDF Files activity.
Answering #2: Sure, there are many ways to extract text from PDFs, and you can use any technology you want. But you won't have much success without OCR. UiPath is the easiest route because you already have precompiled activities to choose from freely.
And do not forget to play around with the different OCR engines: OCRTesseract, OCRMicrosoft and OCRGoogle.

How to make web apps to read and analyze pdf file?

I have a project from my lecture to create a web app that reads and analyzes a PDF file based on keywords. What programming language can I use?
Example: I need to find or check some keywords or data in the PDF file. If the keyword or data exists, the result is true.
I usually work in JavaScript, so I can answer in that. The conversation below was a great help to me and might help you too.
extract text from pdf in Javascript
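Whatever language does the extraction, the keyword check from the question is a small step once you have the page text as a string. Here is a rough sketch in Python; the function name is made up, and the matching is a plain case-insensitive substring test, so "art" would also match inside "start". Getting the text out of the PDF is assumed to happen elsewhere.

```python
def find_keywords(text: str, keywords: list[str]) -> dict[str, bool]:
    """Report, for each keyword, whether it occurs in the extracted text."""
    # Collapse runs of whitespace (PDF text often has odd line breaks)
    # and ignore case on both sides of the comparison.
    normalized = " ".join(text.split()).casefold()
    return {
        keyword: " ".join(keyword.split()).casefold() in normalized
        for keyword in keywords
    }


extracted = "Invoice   Number: 42\nTotal due on receipt"
print(find_keywords(extracted, ["invoice number", "refund"]))
```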

.Net Tool or Library to compare one PDF to another PDF [closed]

I am working on a project that currently takes a .tiff and compares the defined template document to the document in question. We are moving away from the .tiff format for a variety of reasons, but mainly because the new files will be coming in as PDFs.
I see two potential solutions to the issue. First convert the PDF to a tiff and use the existing code.
Or second, use a PDF library that will compare the template PDF to the PDF that is received.
Because the PDF that is received will basically come from an outside source we won’t know for sure if it is text based or image based so the library or tool will have to be able to compare both.
Any suggestions on tools/libraries you have found helpful would be great!
Thank you in advance!
dj
How about i-net PDFC - it does a full content comparison - text, images, lines, header/footer-detection and so on. You can use it either on command line or with a GUI (2.0, currently in public beta-phase) or via API (I think we have an internal version being a .NET library).
Disclaimer: Yep, I work for the company who made this - so feedback highly appreciated.
What we ended up doing was using the Aspose.Pdf library.
I ended up learning there are two types of PDFs:
Image based and
Text based
I did not have any issues comparing the text-based PDFs. However, when an image-based PDF was received, we ran into a problem converting the PDF to a .tiff so that we could use Microsoft's MODI to compare it against our specified template: the .tiff would be a blank image rather than the actual content of the PDF. The Aspose.Pdf library did cost some money; however, in the end the library did exactly what we needed and allowed us to meet our client's needs.
I think your method of comparing tiffs is the way to go, using ImageMagick or other library?
Converting PDF to images can also be done via ImageMagick with the help of Ghostscript.
http://www.imagemagick.org/script/compare.php
I have a C# wrapper for Ghostscript that may help; send me a mail (address on my profile) and I can send it to you.
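For reference, the Ghostscript plus ImageMagick route can be driven from a script roughly like this. It is a sketch in Python rather than the C# wrapper mentioned above, and it assumes `gs` and `compare` are on the PATH (on Windows the Ghostscript binary is usually `gswin64c`, and ImageMagick 7 prefers `magick compare`). The file names, the 150 dpi resolution and the helper names are placeholders.

```python
import subprocess


def ghostscript_tiff_command(pdf_path: str, output_pattern: str, dpi: int = 150) -> list[str]:
    """Arguments that render each PDF page to a 24-bit TIFF file."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=tiff24nc", f"-r{dpi}",
        f"-sOutputFile={output_pattern}",  # e.g. page-%03d.tif, one file per page
        pdf_path,
    ]


def compare_command(image_a: str, image_b: str, diff_path: str) -> list[str]:
    """Arguments for ImageMagick compare using the absolute-error pixel count."""
    return ["compare", "-metric", "AE", image_a, image_b, diff_path]


def images_match(image_a: str, image_b: str, diff_path: str) -> bool:
    """Run compare; per its documentation it exits 0 if the images are
    similar, 1 if they are dissimilar, and 2 on error."""
    result = subprocess.run(
        compare_command(image_a, image_b, diff_path),
        capture_output=True, text=True,
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return result.returncode == 0


print(" ".join(ghostscript_tiff_command("received.pdf", "page-%03d.tif")))
```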
As far as I can see from your question, you want visual comparison of 2 PDFs, not structural comparison. (Because I can create you a thousand different PDF pages, which will have different internal structures and PDF source code, but will render identically on screen or on paper.)
In this case any comparison software will have to transform the 2 PDFs into raster images and compare those.
But since you have your own code already to do that for TIFFs, you can as well re-use it for PDFs (like you are considering already) which you convert to TIFFs.
Unless you find another, external tool that is better, faster, more precise, more funky, less resource-hungry... than your own solution! -- But that one will not be able to avoid converting the PDF pages to some sort of raster image before it can start the real visual comparison. (This may happen internally and unnoticeable for the user, but nevertheless it will have to take place...)
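To make "compare the raster images" concrete, here is a minimal sketch of the comparison step itself. It assumes both pages have already been rendered to grayscale rasters of the same size and loaded as lists of rows of 0-255 values; the rendering and loading are not shown, and the function name and the tolerance parameter are made up for illustration.

```python
def differing_fraction(page_a: list[list[int]], page_b: list[list[int]], tolerance: int = 0) -> float:
    """Fraction of pixels whose gray values differ by more than `tolerance`."""
    same_shape = len(page_a) == len(page_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(page_a, page_b)
    )
    if not same_shape:
        raise ValueError("pages must have the same dimensions")
    total = sum(len(row) for row in page_a)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(page_a, page_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if abs(pixel_a - pixel_b) > tolerance
    )
    return differing / total


template = [[0, 0], [255, 255]]
received = [[0, 10], [255, 0]]
print(differing_fraction(template, received))                # 0.5
print(differing_fraction(template, received, tolerance=10))  # 0.25
```

A small tolerance helps when the two PDFs were rendered with slightly different anti-aliasing; what fraction counts as "the same document" is a threshold you would have to pick for your own templates.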
Docotic.Pdf library can compare PDF documents for you.
Please have a look at Check that two PDF documents are equal sample.
We use this feature for regression testing of the library itself (yes, I am part of the library's dev team).

What is "Tagged PDF"?

Can someone please explain what a "Tagged PDF" is, and how it differs from regular, non-tagged PDF?
Will tagged PDFs contain special content, such as XML, Rich Media, Javascript, or the like?
Which TeX-toolchains generate Tagged PDFs?
Tagged PDF is a PDF file that contains meta-information around certain groups of PDF instructions inside a page content. This meta-information has many use cases: Text-extraction, content-reflow, document accessibility, geographic information in PDF containing maps, etc.
If you need to know more details about this topic I would recommend reading Chapter 10 - Document Interchange of Adobe PDF Reference version 1.7.
The main reason it is used is for accessibility. With the correct tags, a screen reader (for a blind person) can understand where headings fall, what is a table/footnote/graphic and so on. Also there is a feature called PDF Article Threading which is useful for magazine or newspaper layouts where an article is split across boxes/pages.
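On the file level, a Tagged PDF declares itself in the document catalog with a `/MarkInfo` dictionary whose `/Marked` entry is `true` (alongside a `/StructTreeRoot` entry for the structure tree). A crude way to check for that from Python is sketched below. It is a heuristic only: it scans the raw bytes, so it will miss files where the catalog sits inside a compressed object stream or where `/MarkInfo` is an indirect reference. The function name is made up.

```python
import re

# Matches an inline dictionary such as: /MarkInfo << /Marked true >>
_MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an inline /MarkInfo
    dictionary with /Marked true. False does not prove the file is untagged."""
    return _MARKED.search(pdf_bytes) is not None


tagged_catalog = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
plain_catalog = b"<< /Type /Catalog /Pages 2 0 R >>"
print(looks_tagged(tagged_catalog), looks_tagged(plain_catalog))
```

For a real file you would pass `open(path, "rb").read()`; a proper PDF library is the more reliable choice for anything beyond a quick look.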

How does Google make those awesome PDF reports in Analytics and when you print a Google Doc etc? [closed]

When you print from Google Docs (using the "print" link, not File/Print) you end up printing a nicely formatted PDF file instead of relying on the print engine of the browser. The same is true for some of the reports in Google Analytics . . . the printed reports as PDFs are beautiful. How do they do that? I can't imagine they use something like Adobe Acrobat to facilitate it, but maybe they do. I've seen some expensive HTML to PDF converters online from time to time but have never tried one. Any thoughts?
If you are specifically asking how Google does it: according to the PDF Properties page, they use Prince 6.0 (see princexml.com).
There are lots of other PDF generators out there. I've had great success with PDFlib for tricky jobs.
iTextSharp and iText are opensource and free PDF generation libraries for .NET and Java respectively.
I've used them to generate report PDF's before and was quite happy with the results.
http://itextsharp.sourceforge.net/
http://www.lowagie.com/iText/
Great free alternative to PrinceXML: wkhtmltopdf. There are plenty of wrapper libraries for various languages, but I've only used the Ruby ones. However, the product itself is on par with PrinceXML IMHO.
I have had success with pd4ml. It has a tag library, so you can turn any existing HTML into PDF by
<pd4ml:transform>
<!-- Your HTML is here -->
<c:import url="/page.html" />
</pd4ml:transform>
Well, I doubt it's as easy as generating HTML . . . I mean, first of all, PDF is not a human-readable format and it's not plain text (like SVG). In fact, I would compare an SVG file to a PDF file in that with both you have precise control over the layout on a printed page. But SVG is different in that it's XML (and also in that it's not supported completely in the browser . . . still looking into SVG too). Come to think of it, SVG will probably be my next question.
I know Google doesn't use .NET and I doubt they use Java, so there must be some other libraries they use for generating the PDF files. More importantly, how do they create the PDFs without having to rewrite everything as a PDF instead of as HTML? I mean, there has to be some code shared between generating the HTML view and generating the PDF view. Come to think of it, maybe the PDF view and the HTML view are completely separate and they just have two views, which is why the MVC development style seems to be the way to go.
Rendering a PDF is a hard, complex problem. Generating one, however, is not. Simply make up some entities and write them out. It's about the same problem domain as generating HTML for a web page vs. displaying (rendering) it.
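To illustrate how little is needed on the generating side, here is a sketch that writes a complete one-page PDF by hand with nothing but the Python standard library. It has nothing to do with how Google or Prince do it; the function name, page size, font and text position are arbitrary choices for the example, and the text must fit in Latin-1.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF showing `text` in 24pt Helvetica."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
    # Raises UnicodeEncodeError for characters outside Latin-1.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # The "entities": catalog, page tree, page, content stream, font.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf_bytes = build_minimal_pdf("Hello, PDF")
print(len(pdf_bytes), "bytes")
```

Writing `pdf_bytes` to a file with `open("hello.pdf", "wb")` should give something a PDF viewer can open. Real reports need fonts, layout and pagination on top of this, which is what the libraries mentioned above provide.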