How do I know if the PdfTextExtractor produced reliable results? - pdf

I am using the code below to extract text from PDFs for bookkeeping purposes.
How can I tell whether a PDF was "well readable" and produced accurate results, or whether it produced "garbage" output that would require an OCR solution?
Currently I have to inspect each result manually and see whether it came out as
"Iin voicE #Ajk 932 2"
or
"Invoice #8793201".
Using nReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(fileName)
    For page As Integer = 1 To nReader.NumberOfPages
        Dim strategy As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
        Dim currentText As String = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(nReader, page, strategy)
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
        sb.Append(currentText)
    Next
    nReader.Close()
End Using
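One way to automate the manual inspection is a plausibility check on the extracted string. The sketch below is only a heuristic, not part of iTextSharp: the module name, the `InvoiceNumber` pattern and the 0.95 threshold are assumptions made for illustration and would need tuning per document type. It combines the share of ordinary characters with a test for a field that every invoice is expected to contain.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module ExtractionCheck
    ' Hypothetical pattern for a field expected on every invoice; adjust per document type.
    Private ReadOnly InvoiceNumber As New Regex("Invoice\s*#\s*\d{5,}")

    ' Share of characters that are letters, digits, whitespace, punctuation or symbols.
    ' Replacement characters (U+FFFD), control codes and private-use glyphs lower the score.
    Function ReadableCharRatio(text As String) As Double
        If String.IsNullOrEmpty(text) Then Return 0.0
        Dim readable As Integer = text.Count(
            Function(c) c <> ChrW(&HFFFD) AndAlso
                (Char.IsLetterOrDigit(c) OrElse Char.IsWhiteSpace(c) OrElse
                 Char.IsPunctuation(c) OrElse Char.IsSymbol(c)))
        Return readable / text.Length
    End Function

    ' True when the text is mostly ordinary characters AND the expected field is present.
    Function LooksReliable(text As String, Optional minRatio As Double = 0.95) As Boolean
        Return ReadableCharRatio(text) >= minRatio AndAlso InvoiceNumber.IsMatch(text)
    End Function

    Sub Main()
        Console.WriteLine(LooksReliable("Invoice #8793201"))     ' True
        Console.WriteLine(LooksReliable("Iin voicE #Ajk 932 2")) ' False
    End Sub
End Module
```

Pages that fail the check could then be routed to OCR. The character ratio alone does not catch the second sample, because its characters are all ordinary; it is the missing expected field that flags it.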

Related

Stamp rotated text using itext7 in vb.net

I'm trying to convert some iTextSharp code to iText 7. The code stamps text, rotated by 90 degrees, on each page of a PDF. Unfortunately all the examples I can find are in C#, and while I can use an online translator, I'm having difficulties with this one.
The code below stamps my text at the specified coordinates on each page of a given PDF:
Shared Sub itext7_stamp_text_on_pdf(mypdfname As String, myfoldername As String)
    Dim src As String = myfoldername & "\" & mypdfname
    Dim dest As String = myfoldername & "\Stamped " & mypdfname
    Dim pdfDoc As PdfDocument = New PdfDocument(New PdfReader(src), New PdfWriter(dest))
    Dim document As Document = New Document(pdfDoc)
    Dim canvas As PdfCanvas
    Dim n As Integer = pdfDoc.GetNumberOfPages()
    For i = 1 To n
        Dim page As PdfPage = pdfDoc.GetPage(i)
        canvas = New PdfCanvas(page)
        With canvas
            .SetFontAndSize(PdfFontFactory.CreateFont(StandardFonts.HELVETICA), 12)
            .BeginText()
            .MoveText(100, 100)
            .ShowText("SAMPLE TEXT 100,100")
            .EndText()
        End With
    Next
    pdfDoc.Close()
End Sub
... but I can't see a way of rotating it to 90 degrees.
There's an example here if you use a paragraph:
https://kb.itextpdf.com/home/it7kb/examples/itext-7-building-blocks-chapter-2-rootelement-examples#iText7BuildingBlocksChapter2:RootElementexamples-c02e14_showtextaligned
... but I can't seem to translate this to vb.net. I can specify where the errors I get are, but I thought I'd be better asking this general question first in case there's a way to do this without using a paragraph.
Can anyone help please?
Thanks!
Well, after some more digging this code seems to work OK on the rotation part:
Dim pdf As New PdfDocument(New PdfReader(inpdf), New PdfWriter(outpdf))
Dim document As New Document(pdf)
document.ShowTextAligned("This is some test text", 400, 750, TextAlignment.CENTER, VerticalAlignment.MIDDLE, 0.5F * CSng(Math.PI))
document.Close()
End Sub
... but the text gets hidden behind existing content, so I need a way to make sure it is drawn over the content.
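A possible route without a paragraph is to keep the low-level `PdfCanvas` from the first snippet and rotate through the text matrix. This is an untested sketch that assumes the itext7 NuGet package; `StampRotated` and the variable names are made up for illustration. The matrix `(0, 1, -1, 0, x, y)` is a 90 degree counter-clockwise rotation with its origin at `x, y`, and a `PdfCanvas` created from a page appends its output after the page's existing content stream, so the stamp should end up on top.

```vbnet
Imports iText.IO.Font.Constants
Imports iText.Kernel.Font
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas

Module RotatedStamp
    ' Hypothetical helper: stamps rotated text at (100, 100) on every page.
    Sub StampRotated(src As String, dest As String)
        Dim pdfDoc As New PdfDocument(New PdfReader(src), New PdfWriter(dest))
        Dim font As PdfFont = PdfFontFactory.CreateFont(StandardFonts.HELVETICA)
        For i As Integer = 1 To pdfDoc.GetNumberOfPages()
            ' Drawing goes into a new content stream after the existing page content.
            Dim over As New PdfCanvas(pdfDoc.GetPage(i))
            over.SaveState()
            over.BeginText()
            over.SetFontAndSize(font, 12)
            ' a, b, c, d = cos, sin, -sin, cos of 90 degrees; the last two values are the position.
            over.SetTextMatrix(0, 1, -1, 0, 100, 100)
            over.ShowText("SAMPLE TEXT 100,100")
            over.EndText()
            over.RestoreState()
        Next
        pdfDoc.Close()
    End Sub
End Module
```

The text then runs upwards from the anchor point. If it is still invisible, the page may be drawing an opaque image or fill in an annotation layer, which sits above page content regardless of stream order.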

Comparing files not working as intended

Hi guys, could someone explain to me why this does not work?
I basically have two text files called Books and NewBooks...
The text files are populated from a web request and the info is then parsed into the text files... When I start the program, Books and NewBooks are identical, pretty much a copy of each other.
More web requests are done to update the NewBooks text file, and when I compare them, any line in NewBooks that is not in Books gets added to a third text file called myNewBooks. My initial code, shown here, works as I expected:
Dim InitialBooks = File.ReadAllLines("Books.json")
Dim TW As System.IO.TextWriter
'Create a Text file and load it into the TextWriter
TW = System.IO.File.CreateText("myNewBooks.JSON")
Dim NewBooks = String.Empty
Using reader = New StreamReader("NewBooks.json")
    Do Until reader.EndOfStream
        Dim current = reader.ReadLine
        If Not InitialBooks.Contains(current) Then
            NewBooks = current & Environment.NewLine
            TW.WriteLine(NewBooks)
            TW.Flush()
            'Close the File
        End If
    Loop
End Using
TW.Close() : TW.Dispose()
But part of the string in each line of my text files is a URL, and sometimes I find the same book with a different URL... I was getting duplicate entries of books because the URL was the only difference. So I thought that I would split the string before the URL so that I just compare the title, description and region. FYI, a line in my text files looks similar to this:
{ "Title": "My Title Here", "Description": "My Description Here", "Region": "My Region Here", "Url": "My Url Here", "Image": "My Image Here" };
So a fellow today helped me figure out how to split my line so it looks more like this:
{ "Title": "My Title Here", "Description": "My Description Here", "Region": "My Region Here", "Url"
which is great but now when I compare it does not see that the first line contains the split line and I don't understand why... here is the code after it was modified.
Dim InitialBooks = File.ReadAllLines("Books.json")
Dim TW As System.IO.TextWriter
'Create a Text file and load it into the TextWriter
TW = System.IO.File.CreateText("myNewBooks.JSON")
Dim NewBooks = String.Empty
Using reader = New StreamReader("NewBooks.json")
    Do Until reader.EndOfStream
        Dim current = reader.ReadLine
        Dim splitAt As String = """Url"""
        Dim index As Integer = current.IndexOf(splitAt)
        Dim output As String = current.Substring(0, index + splitAt.Length)
        If Not InitialBooks.Contains(output) Then
            NewBooks = current & Environment.NewLine
            TW.WriteLine(NewBooks)
            TW.Flush()
            'Close the File
        End If
    Loop
End Using
TW.Close() : TW.Dispose()
Your wisdom would be appreciated!!
Your OP is confusing.
If I understood correctly:
You have 3 files Books, NewBooks and MyBooks.
You download data from web, if that data is not located in Books, you add it to NewBooks, otherwise to MyBooks(duplicates).
Seeing that you are working with JSON, I would do it the following way.
Load the Books, when downloading data check it and compare it with Books. Then write to proper file.
Imports System.Web.Script.Serialization ' for reading of JSON (+add the reference to System.Web.Extensions library)
Dim JSONBooks = New JavaScriptSerializer().DeserializeObject(Books_string)
Inspect JSONBooks with breakpoint. You will see how it looks.
When downloading data you can simply check whether a book already exists in it, by title, URL or whatever you want.
Since you have shown only one book:
Debug.Print(JSONBooks("Title")) 'returns >>>My Title Here
When you have more
JSONBooks(x)("Title") 'where x is book number.
So you can loop over all books and check what you need.
JSON array looks like this (if you need to construct it)
[{book1},{book2},...]
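As for why the modified code in the question stops matching: `InitialBooks.Contains(output)` on a string array asks whether some whole line equals `output`. It is not a substring search, so a truncated line never equals a full one. A minimal sketch of one way around that, staying with the line-based files, is to truncate both sides before comparing. The names `KeyOf` and `WriteNewBooks` are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq

Module BookDiff
    ' Everything before the "Url" property: title, description and region.
    ' Lines without a "Url" property are compared whole.
    Function KeyOf(line As String) As String
        Dim index As Integer = line.IndexOf("""Url""", StringComparison.Ordinal)
        Return If(index < 0, line, line.Substring(0, index))
    End Function

    Sub WriteNewBooks(booksPath As String, newBooksPath As String, outputPath As String)
        ' Truncate the known lines once, so both sides of the comparison have the same shape.
        Dim known As New HashSet(Of String)(File.ReadLines(booksPath).Select(Function(l) KeyOf(l)))
        Using writer As New StreamWriter(outputPath)
            For Each current In File.ReadLines(newBooksPath)
                ' Add returns False when the key is already present, which also
                ' filters repeats inside the new file itself.
                If known.Add(KeyOf(current)) Then writer.WriteLine(current)
            Next
        End Using
    End Sub
End Module
```

It would be called as `WriteNewBooks("Books.json", "NewBooks.json", "myNewBooks.JSON")`. The full original line, URL included, is what gets written; only the comparison ignores the URL.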

bitmap image printing using axiohm usbcomm dll

I am using an Axiohm thermal printer to print POS receipts (USBCOMM.dll for communication). Currently, I am able to print the required details along with an image (.bmp file). Now I need to use a new image instead of the existing one. The new image contains a barcode.
When I try printing the new image, all I get is garbage. Below is the code that I use. The same code works with the old image but not with the new one. Is there a particular image format that I need to follow?
Dim filepath As String = AppDomain.CurrentDomain.BaseDirectory + "Resources\PrinterDlls\unnamed.bmp"
Using fs = New FileStream(filepath, FileMode.Open, FileAccess.Read, FileShare.Read)
    Dim inpt As Byte() = New Byte(fs.Length) {}
    inpt(0) = &H1F
    fs.Read(inpt, 1, CInt(fs.Length))
    Dim ok As Boolean = Usb_WritePort(True, inpt, inpt.Length, written, IntPtr.Zero)
    If Not ok OrElse written <> inpt.Length Then
        Throw New Exception("USB write failed")
    End If
End Using
Well, it is embarrassing to be answering my own question. I searched for some time before raising the question. Soon after, I came across this YouTube video that explains how to create a bitmap image for thermal printing:
https://www.youtube.com/watch?v=LdB33eWLjgU
Basically, you need to ensure 3 things while creating the image:
1. 8-bit
2. Greyscale
3. Save as .bmp
And the new image will work like a charm while printing. Also ensure the width is less than the paper width.
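To check a file against the list above before sending it to the printer, the bit depth and width can be read straight from the BMP header. This is a sketch that assumes the common BITMAPINFOHEADER layout (width at byte offset 18, bits per pixel at offset 28); `DescribeBmp` is a made-up name, and the sketch does not verify that the palette is actually greyscale.

```vbnet
Imports System
Imports System.IO

Module BmpCheck
    ' Reads the header fields of a .bmp file that matter for the checklist.
    Function DescribeBmp(path As String) As String
        Using reader As New BinaryReader(File.OpenRead(path))
            ' Every BMP file starts with the two bytes "BM".
            If reader.ReadByte() <> &H42 OrElse reader.ReadByte() <> &H4D Then
                Return "not a BMP file"
            End If
            reader.BaseStream.Seek(18, SeekOrigin.Begin)
            Dim width As Integer = reader.ReadInt32()          ' pixels, little-endian
            reader.BaseStream.Seek(28, SeekOrigin.Begin)
            Dim bitsPerPixel As Integer = reader.ReadUInt16()  ' 8 for an 8-bit image
            Return String.Format("{0} px wide, {1} bits per pixel", width, bitsPerPixel)
        End Using
    End Function
End Module
```

A 24-bit or 32-bit result for the new image would explain why the same printing code that worked for the old image now produces garbage.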

How can I test if a PDF document is PDF/A compliant using iTextSharp?

I have an existing PDF file, and with iTextSharp I want to test whether it is PDF/A compliant.
I don't want to convert or create a file, just read it and check whether it is PDF/A.
I have not tried anything because I did not find any method or property of iTextSharp's PdfReader class that says whether the PDF is PDF/A. For now it would be enough to know how to verify that the document claims to be PDF/A compatible.
Thanks
Antonio
After a long search I tried this approach, and it seems to work:
Dim reader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sFilePdf)
Dim yMetadata As Byte() = reader.Metadata()
reader.Close()
Dim bPDFA As Boolean = False
If yMetadata IsNot Nothing Then
    ' Load from a stream so the XML parser detects the packet's encoding itself
    Dim xmlDoc As Xml.XmlDocument = New Xml.XmlDocument()
    xmlDoc.Load(New IO.MemoryStream(yMetadata))
    Dim nodes As Xml.XmlNodeList = xmlDoc.GetElementsByTagName("pdfaid:conformance")
    ' Guard against metadata that has no pdfaid:conformance element.
    ' Note: this only matches conformance level A; levels B and U are also PDF/A.
    If nodes.Count > 0 AndAlso nodes.Item(0).InnerText.Trim().ToUpper() = "A" Then
        bPDFA = True
    End If
End If
Return bPDFA
I also found some references to the XmpReader class, but not enough to do what I wanted.
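Two cases slip past a lookup of `pdfaid:conformance` elements: files that claim level B or U, and XMP packets that store the PDF/A identification as attributes of `rdf:Description` rather than as child elements. A sketch that handles both, working on the XMP packet as a string, is below. `ClaimedPdfA` and `FindValue` are made-up names. As in the question, this only reports what the document claims and does not validate it.

```vbnet
Imports System
Imports System.Xml

Module PdfAClaim
    ' Namespace of the PDF/A identification schema (the usual "pdfaid" prefix).
    Private Const PdfAIdNs As String = "http://www.aiim.org/pdfa/ns/id/"

    ' Returns e.g. "1B" or "2U" when the XMP packet claims PDF/A, otherwise Nothing.
    Function ClaimedPdfA(xmp As String) As String
        Dim doc As New XmlDocument()
        doc.LoadXml(xmp)
        Dim part As String = FindValue(doc, "part")
        If String.IsNullOrEmpty(part) Then Return Nothing
        Return part & FindValue(doc, "conformance").ToUpperInvariant()
    End Function

    Private Function FindValue(doc As XmlDocument, localName As String) As String
        ' Element form: <pdfaid:part>1</pdfaid:part>
        Dim elements As XmlNodeList = doc.GetElementsByTagName(localName, PdfAIdNs)
        If elements.Count > 0 Then Return elements(0).InnerText.Trim()
        ' Attribute form: <rdf:Description pdfaid:part="1" ...>
        For Each node As XmlNode In doc.GetElementsByTagName("*")
            Dim attr As XmlNode = node.Attributes.GetNamedItem(localName, PdfAIdNs)
            If attr IsNot Nothing Then Return attr.Value.Trim()
        Next
        Return ""
    End Function
End Module
```

Matching on the namespace URI instead of the literal `pdfaid:` prefix also covers packets that bind the same namespace to a different prefix.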

Text from webpage

I need to get some text from this web page. I want to use the trade feed for my program to analyse the sentiment of the markets.
I used the browser control and the get element command but it's not working. The problem is that whenever my browser starts to open the page I get JavaScript errors.
I tried with the DOM but it seems that I don't quite understand what I need to do :)
Here is the code:
Dim code As String
Using client As New WebClient
    code = client.DownloadString("http://openbook.etoro.com/ahanit/#/profile/Trades/")
End Using
Dim htmlDocument As IHTMLDocument2 = New HTMLDocument(code)
htmlDocument.write(htmlDocument)
Dim allElements As IHTMLElementCollection = htmlDocument.body.all
Dim allid As IHTMLElementCollection = allElements.tags("id")
Dim element As IHTMLElement
For Each element In allid
    element.title = element.innerText
    MsgBox(element.innerText)
Next
Update: So I tried the HTML Agility pack, as suggested in the comments, and I am stuck again on this code
Dim plain As String = String.Empty
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim goodnods As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("THE PROBLEM")
For Each node In goodnods
    TextBox1.Text = htmldoc.DocumentNode.InnerText
Next
Any advice on what to do now?
OK, I think I know what the problem is: somehow the div that I need is hidden, and it's not loaded when I load the web page, just the source code. Does anyone know how to load all the hidden divs?
Here is my new code
Dim doc As New HtmlAgilityPack.HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim nodes As HtmlNode = doc.GetElementbyId("feed-items")
Dim id As String = nodes.WriteTo()
TextBox1.Text = TextBox1.Text & vbCrLf & id
user1336635,
Welcome to SO! Something you might try is to check out his source code, figure out which JavaScript function is populating the field you want (using Firebug; I assume it's the one with "trades result in profit" next to it), and then embed that script into a web page that your WebBrowser control loads. That's where I'd try to start. I checked his source code and searched for "trades result in profit" and didn't find anything, which leads me to believe hunting for the element 'might' not be possible. Just a starting place until someone with more experience with this chimes in!! Best!
-sf
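One detail that fits the "not in the source" observation: everything after the `#` in that URL is a fragment, which stays in the browser and is never sent to the server. `WebClient` and `HtmlWeb` therefore only ever receive the base page, and the trades feed is presumably filled in later by script. A small sketch using only `System.Uri` makes the split visible:

```vbnet
Imports System

Module FragmentDemo
    Sub Main()
        Dim address As New Uri("http://openbook.etoro.com/ahanit/#/profile/Trades/")
        ' The part that an HTTP client actually requests from the server.
        Console.WriteLine(address.GetLeftPart(UriPartial.Query)) ' http://openbook.etoro.com/ahanit/
        ' The part that only client-side script ever sees.
        Console.WriteLine(address.Fragment)                      ' #/profile/Trades/
    End Sub
End Module
```

So parsing the downloaded HTML cannot find the feed; the options are to let a real browser engine run the scripts first, or to find the request the page's script makes for the data and call that directly.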