How to find xy position of text in pdf or image - vb.net

I want my code to find the x/y position of text in a PDF or image so that I can crop that region out. The goal is to include any diagrams that belong to a question; each question is an image with text placed on top of it. I am currently using EJ2.PdfViewer from Syncfusion, but I am happy to switch to another package if it suits this purpose better.
My test code, for reference:
Imports System
Imports System.Collections.Generic
Imports Syncfusion.EJ2.PdfViewer
Module Program
    Sub Main(args As String())
        Dim extraction As PdfRenderer = New PdfRenderer()
        extraction.Load("C:\math.pdf")
        Dim textCollection As List(Of TextData) = New List(Of TextData)
        Dim text As String = extraction.ExtractText(44, textCollection)
        Console.WriteLine(text)
    End Sub
End Module

To get the position of text in a PDF, you can use one of these libraries:
iText7: https://itextpdf.com/resources/api-documentation
Spire PDF: https://www.e-iceblue.com/Introduce/pdf-for-net-introduce.html
To get the position of text in an image:
Google Vision API: https://cloud.google.com/vision/docs/ocr
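As a rough illustration of the iText7 route, the sketch below walks one page and prints each text chunk with the position of its baseline. It assumes the iText 7 for .NET package is referenced; TextPositionListener is a made-up class name, page 1 is chosen only for illustration, and the file path is the one from the question. Treat it as a starting point rather than a tested solution.

```vbnet
Imports System
Imports System.Collections.Generic
Imports iText.Kernel.Geom
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser
Imports iText.Kernel.Pdf.Canvas.Parser.Data
Imports iText.Kernel.Pdf.Canvas.Parser.Listener

' Hypothetical listener: prints every text chunk together with its baseline position.
Class TextPositionListener
    Implements IEventListener

    Public Sub EventOccurred(data As IEventData, kind As EventType) _
        Implements IEventListener.EventOccurred
        If kind <> EventType.RENDER_TEXT Then Return
        Dim info As TextRenderInfo = DirectCast(data, TextRenderInfo)
        ' Coordinates are in PDF user space: points, with the origin at the bottom-left.
        Dim box As Rectangle = info.GetBaseline().GetBoundingRectangle()
        Console.WriteLine("{0} @ x={1}, y={2}", info.GetText(), box.GetX(), box.GetY())
    End Sub

    Public Function GetSupportedEvents() As ICollection(Of EventType) _
        Implements IEventListener.GetSupportedEvents
        Return Nothing ' Nothing means "send every event type"
    End Function
End Class

Module Program
    Sub Main()
        Using pdf As New PdfDocument(New PdfReader("C:\math.pdf"))
            Dim processor As New PdfCanvasProcessor(New TextPositionListener())
            processor.ProcessPageContent(pdf.GetPage(1)) ' page 1 is an arbitrary choice
        End Using
    End Sub
End Module
```

The baseline only gives the start point and width of each chunk. If a full bounding box is needed for cropping, TextRenderInfo also exposes GetAscentLine() and GetDescentLine(), which can be combined to estimate the height. Because the y axis points upward in PDF space, the values usually have to be flipped before they are used against a rendered bitmap.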

Related

How to make invisible text visible with iText 4 in an existing pdf

I have a PDF which is created by scanning software. One image per page and hidden OCR'ed text.
I want to remove the images and make the text visible.
I found information on how to remove images (by replacing them with another image), but found no way to make the invisible text visible.
Sample PDF with image and hidden text
I tried the method below, but it does not work:
Public Shared Sub UnhideText(ByVal strFileName As String)
    Dim pdf As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(strFileName)
    Dim stp As iTextSharp.text.pdf.PdfStamper = New iTextSharp.text.pdf.PdfStamper(pdf, New IO.FileStream("e:\out.pdf", IO.FileMode.Create))
    'This does not work, the text remains invisible. I guess SetTextRenderingMode applies only to newly added text.
    For pageNumber As Integer = 1 To pdf.NumberOfPages
        Dim cb As iTextSharp.text.pdf.PdfContentByte = stp.GetOverContent(pageNumber)
        cb.SetTextRenderingMode(iTextSharp.text.pdf.PdfContentByte.TEXT_RENDER_MODE_FILL)
    Next
    stp.Close()
End Sub

Is it possible to view multi page .Tif files in vb.net application?

I am hoping to view .tif files in my VB.NET application. Is there a component that can do that? I've been experimenting with Tiff Viewer, but the file is displayed so stretched that you can barely make out what it says. Adjusting the width of the viewer did not help much; the image is still stretched horizontally.
Based on https://stackoverflow.com/a/401579/741136, this will load a multiframe tif into a collection of single frames:
Function LoadTif(filename As String) As List(Of Image)
    Dim lstTif As New List(Of Image)
    ' Dispose the source bitmap once every frame has been copied; this also releases the file.
    Using bmp As Bitmap = DirectCast(Image.FromFile(filename), Bitmap)
        For i As Integer = 0 To bmp.GetFrameCount(Imaging.FrameDimension.Page) - 1
            bmp.SelectActiveFrame(Imaging.FrameDimension.Page, i)
            Dim ms As New System.IO.MemoryStream
            bmp.Save(ms, Imaging.ImageFormat.Tiff)
            ms.Position = 0
            ' Do not dispose ms here: GDI+ needs the stream to stay open for the lifetime of the image.
            lstTif.Add(Image.FromStream(ms))
        Next i
    End Using
    Return lstTif
End Function
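To show the frames without distortion, one option is a PictureBox with SizeMode set to Zoom, which scales the image to fit while keeping its aspect ratio. The form below is a minimal sketch: TifViewerForm is a hypothetical name, and clicking the picture simply cycles to the next page.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Windows.Forms

' Hypothetical viewer form: shows one frame at a time, click to advance.
Public Class TifViewerForm
    Inherits Form

    Private ReadOnly frames As List(Of Image)
    Private ReadOnly viewer As New PictureBox()
    Private current As Integer = 0

    Public Sub New(pages As List(Of Image))
        frames = pages
        viewer.Dock = DockStyle.Fill
        ' Zoom fits the image inside the control and preserves the aspect ratio.
        viewer.SizeMode = PictureBoxSizeMode.Zoom
        Controls.Add(viewer)
        AddHandler viewer.Click, AddressOf ShowNextPage
        If frames.Count > 0 Then viewer.Image = frames(0)
    End Sub

    Private Sub ShowNextPage(sender As Object, e As EventArgs)
        If frames.Count = 0 Then Return
        current = (current + 1) Mod frames.Count
        viewer.Image = frames(current)
    End Sub
End Class
```

It could be started with something like Application.Run(New TifViewerForm(LoadTif(path))), where path points to the .tif file. If the file stores different horizontal and vertical resolutions, as fax-style TIFFs often do, the picture may still look stretched, because Zoom works on pixel dimensions only.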

How to save data to text file and retrieve

I'm using VB.NET. I am able to load the pics from a folder into a flowlayoutpanel. And then load the clicked picture into a separate picturebox and display the picture's filepath in a label.
Now I want to be able to add a rating and a description to each of the images in the FlowLayoutPanel and save them to a text file in the folder from which the pictures were loaded. The app should be able to load the rating and description on the next launch or when the selected image changes. How do I accomplish this?
You should probably look at accessing the metadata of the picture, so that the info you want travels with the file. In System.Drawing this is exposed through the Image.PropertyItems property, which returns an array of PropertyItem objects.
Here's a link to an answered question about adding a comment to a jpg. Hope this helps.
Here's an untested conversion of that code to VB.NET. You'll probably have to add a reference or two and import a couple of namespaces, but syntactically this is correct as near as I can tell.
' The Jpeg* and Bitmap* types below are WPF imaging classes (System.Windows.Media.Imaging).
Public Function SetImageComment(input As Image, comment As String) As Image
    Using memStream As New IO.MemoryStream()
        input.Save(memStream, Imaging.ImageFormat.Jpeg)
        memStream.Position = 0
        Dim decoder As New JpegBitmapDecoder(memStream, BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad)
        Dim metadata As BitmapMetadata
        If decoder.Metadata Is Nothing Then
            metadata = New BitmapMetadata("jpg")
        Else
            metadata = decoder.Metadata
        End If
        metadata.Comment = comment
        ' Named "frame" so it does not shadow the BitmapFrame type (VB is case-insensitive).
        Dim frame As BitmapFrame = decoder.Frames(0)
        Dim encoder As BitmapEncoder = New JpegBitmapEncoder()
        encoder.Frames.Add(BitmapFrame.Create(frame, frame.Thumbnail, metadata, frame.ColorContexts))
        Dim imageStream As New IO.MemoryStream
        encoder.Save(imageStream)
        imageStream.Position = 0
        ' Note: this disposes the image that the caller passed in.
        input.Dispose()
        input = Nothing
        Return Image.FromStream(imageStream)
    End Using
End Function

How to absolute position the image in existing pdf using itextsharp

Here is the code I have so far:
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports System.IO
Module Module1
    Sub Main()
        AddjImage("C:\test.png", "c:\pdfTemplate.pdf", "C:\output.pdf")
    End Sub
    Private Function AddjImage(ByVal strImageFileName As String, ByVal pdfTemplateFile As String, ByVal outputPdf As String) As Boolean
        Try
            Dim iPdfReader As PdfReader = New PdfReader(pdfTemplateFile)
            Dim iPdfStamper As PdfStamper = New PdfStamper(iPdfReader, New FileStream(outputPdf, FileMode.Create))
            Dim imgjImage As iTextSharp.text.Image
            Dim bytContent As PdfContentByte
            'Insert Image
            imgjImage = iTextSharp.text.Image.GetInstance(strImageFileName)
            imgjImage.Alignment = iTextSharp.text.Image.ALIGN_TOP
            imgjImage.ScalePercent(78)
            imgjImage.SetAbsolutePosition(445, 0)
            bytContent = iPdfStamper.GetOverContent(1)
            bytContent.AddImage(imgjImage)
            iPdfStamper.FormFlattening = True
            iPdfStamper.Close()
            Return True
        Catch ex As Exception
            Return False
        End Try
    End Function
End Module
The PDF is in landscape layout and the page size is A4. I am trying to insert the image on the right side of the page, aligned at x=445 and y=0.
I have a couple of images in two sizes:
image 1 with width=500px; height=910px;
image 2 with width=500px; height=400px;
The problem is that both images are aligned to the bottom instead of the top. Because of that, the top portion of image 1 is cut off.
I tried your code (with modifications) in a button click event in a WPF app. The y value of 0 is measured from the bottom of the page, because PDF coordinates start at the bottom-left corner, so the line below has to be altered to move the image up.
imgjImage.SetAbsolutePosition(445, 0)
to be altered to
imgjImage.SetAbsolutePosition(445, 200)
The 200 is only an example value; it has to be readjusted for the actual (scaled) size of your image.
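Rather than hard-coding the offset, the y value can be derived from the page height and the scaled image height. The sketch below assumes iTextSharp 5.x; AddImageAtTop is a hypothetical helper name, while the x position of 445 and the 78% scale are taken from the question.

```vbnet
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module TopAlignSketch
    ' Hypothetical helper: stamps an image so that its top edge meets the top of page 1.
    Sub AddImageAtTop(imagePath As String, templatePath As String, outputPath As String)
        Dim reader As New PdfReader(templatePath)
        Using output As New FileStream(outputPath, FileMode.Create)
            Dim stamper As New PdfStamper(reader, output)
            ' Page size as displayed (page rotation applied), in points.
            Dim page As Rectangle = reader.GetPageSizeWithRotation(1)
            Dim img As Image = Image.GetInstance(imagePath)
            img.ScalePercent(78)
            ' Shrink further if the scaled image is still taller than the page.
            If img.ScaledHeight > page.Height Then
                img.ScaleToFit(img.ScaledWidth, page.Height)
            End If
            ' SetAbsolutePosition takes the lower-left corner of the image,
            ' so subtract the image height from the page height.
            img.SetAbsolutePosition(445, page.Height - img.ScaledHeight)
            stamper.GetOverContent(1).AddImage(img)
            stamper.Close()
        End Using
        reader.Close()
    End Sub
End Module
```

The extra fit step matters for image 1: if it is placed at the default 72 dpi, 910 px at 78% is about 710 points, which is taller than a landscape A4 page (842 x 595 points), so part of it would be cut off wherever it is positioned.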

Is there some kind of display control in .NET that can handle color codes?

It seems the only options available to do multi-color on a string is either a bunch of label controls cleverly grouped together or to use a RichTextBox and play with the font properties as text is added to the control.
What I am looking for instead is some kind of control that can render some style of control codes out as color. Consider bash codes:
NORMAL='\e[0m'
GREEN='\e[0;32m'
BLUE='\e[0;34m'
echo -e "This text is ${GREEN}green${NORMAL} and this text is ${BLUE}blue${NORMAL}"
In the above, the words 'green' and 'blue' will be colorized with their respective colors. I was wondering if there was a control with some kind of feature like this, or will I have to code something myself?
Note, I only have the Express copy of VB 2010, and I would very much like to avoid third-party controls.
Are you specifically looking for something that understands ANSI control codes, or just something that accepts markup? If you just want something that accepts markup, you can use the RichTextBox.Rtf property to set all the control codes and text with a single string.
See http://msdn.microsoft.com/en-us/library/aa140277(v=office.10).aspx for the RTF specifications.
I would recommend programmatically generating a sample document, then reading the Rtf property and using the resulting RTF code as a template for what you should generate. For reference, here's a simple RTF document that has two colors of text (plus the default) in Consolas (which falls back to Courier New):
{\rtf1\deff0{\fonttbl{\f0\fmodern\fcharset0 Consolas {\*\falt Courier New};}}
{\colortbl ;\red255\green0\blue0;\red0\green176\blue80;}
\cf1 Hello\cf0 , \cf2 world\cf0 .
}
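If the input really does contain ANSI codes like the bash example, a small translation step can turn them into RTF before assigning the result to RichTextBox.Rtf. The following is only a sketch: AnsiToRtf is a hypothetical helper, it knows just the green and blue codes from the question, and every other code is treated as a reset to the default color.

```vbnet
Imports System.Text
Imports System.Text.RegularExpressions

Module AnsiRtfSketch
    ' Hypothetical helper: maps a few ANSI color codes to RTF \cfN switches.
    Function AnsiToRtf(input As String) As String
        ' Escape RTF's special characters before any control words are added.
        Dim escaped As String = input.Replace("\", "\\").Replace("{", "\{").Replace("}", "\}")
        ' Replace each "ESC [ ... m" sequence with an index into the color table below.
        Dim body As String = Regex.Replace(escaped, "\x1B\[([0-9;]*)m",
            Function(m As Match) As String
                Select Case m.Groups(1).Value
                    Case "0;32" : Return "\cf1 "   ' green
                    Case "0;34" : Return "\cf2 "   ' blue
                    Case Else : Return "\cf0 "     ' anything else: back to the default color
                End Select
            End Function)
        Dim sb As New StringBuilder()
        sb.Append("{\rtf1\ansi\deff0{\fonttbl{\f0\fmodern Consolas;}}")
        sb.Append("{\colortbl ;\red0\green128\blue0;\red0\green0\blue255;}")
        sb.Append(body).Append("}")
        Return sb.ToString()
    End Function
End Module
```

VB string literals have no escape sequences, so the escape character has to be built with ChrW(27), for example RichTextBox1.Rtf = AnsiToRtf("This text is " & ChrW(27) & "[0;32mgreen" & ChrW(27) & "[0m."). Line breaks and non-ASCII characters are not handled here; they would need \par and \u escapes respectively.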
There are a couple of other options. First, you can paint the text yourself using the Graphics object and its DrawString method, with any color, font and style you want; this, however, can be a pain. The easiest way is to use a WebBrowser control and plain old HTML.
If you don't want to use RTF, I wrote this little sample, which lets you use RGB codes. It is not a complete solution, as you would have to figure out a way to delimit the control characters. To test it, create a form and drop a button and a RichTextBox on it.
Imports System.Drawing
Imports System.Text.RegularExpressions
Imports System.Windows.Forms
Public Class Form1
    Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
        Dim str As String = "This text is {#00FF00}green{#000000} and this text is {#0000FF}blue{#000000}"
        PrintToRtf(str, RichTextBox1)
    End Sub
    ' Appends text at the end of the box in the given color.
    Private Shared Sub AppendColored(RTB As RichTextBox, text As String, clr As Color)
        RTB.Select(RTB.TextLength, 0)
        RTB.SelectionColor = clr
        RTB.SelectedText = text
    End Sub
    Private Shared Sub PrintToRtf(Str As String, RTB As RichTextBox)
        Dim mc As MatchCollection = Regex.Matches(Str, "\{\#(?<Red>[0-9A-Fa-f]{2})(?<Green>[0-9A-Fa-f]{2})(?<Blue>[0-9A-Fa-f]{2})\}")
        Dim lp As Int32 = 0
        ' Color in effect before the first marker.
        Dim clr As Color = RTB.ForeColor
        For Each mtc As Match In mc
            ' Write the text that precedes this marker in the color currently in effect.
            AppendColored(RTB, Str.Substring(lp, mtc.Index - lp), clr)
            Dim R, G, B As Byte
            R = Byte.Parse(mtc.Groups("Red").Value, Globalization.NumberStyles.AllowHexSpecifier)
            G = Byte.Parse(mtc.Groups("Green").Value, Globalization.NumberStyles.AllowHexSpecifier)
            B = Byte.Parse(mtc.Groups("Blue").Value, Globalization.NumberStyles.AllowHexSpecifier)
            ' The marker sets the color for the text that follows it.
            clr = Color.FromArgb(255, R, G, B)
            lp = mtc.Index + mtc.Length
        Next
        ' Write whatever follows the last marker (previously this tail was dropped).
        AppendColored(RTB, Str.Substring(lp), clr)
    End Sub
End Class