How to check if lines in string are separated by space? - vb.net

I'm building a program that gets the publisher of the book by scanning its title page and using OCR … since publishers are always at the bottom of the title page I'm thinking that a detecting lines that are separated by space is a solution but I don't know how to test for that. Here is my code:
Dim builder As New StringBuilder()
Dim reader As New StringReader(txtOCR.Text)
Dim iCounter As Integer = 0
While True
Dim line As String = reader.ReadLine()
If line Is Nothing Then Exit While
'i want to put the condition here
End While
txtPublisher.Text = builder.ToString()

Do you mean empty lines? Then you can do this:
Dim bEmpty As Boolean
And then inside the loop:
If line.Trim().Length = 0 Then
bEmpty = True
Else
If bEmpty Then
'...
End If
bEmpty = False
End If

Why not do the following: from the bottom, go up until you find the first non-empty line (no idea how the OCR works … maybe the bottom-most line is always non-empty, in which case this is redundant). In the next step, go up until the first empty line. The text in the middle is the publisher.
You don’t need the StringReader for that:
Dim lines As String() = txtOCR.Text.Split(Environment.NewLine)
Dim bottom As Integer = lines.Length - 1
' Find bottom-most non-empty line.
Do While String.IsNullOrWhitespace(lines(bottom))
bottom -= 1
Loop
' Find empty line above that
Dim top As Integer = bottom - 1
Do Until String.IsNullOrWhitespace(lines(top))
top -= 1
Loop
Dim publisherSubset As New String(bottom - top)()
Array.Copy(lines, top + 1, publisherSubset, 0, bottom - top)
Dim publisher As String = String.Join(Environment.NewLine, publisherSubset)
But to be honest I don’t think this is a particularly good approach. It’s inflexible and doesn’t cope well with unexpected formatting. I’d instead use a regular expression to describe what the publisher string (and its context) looks like. And maybe even that isn’t enough and you have to put some thought into describing the whole page to extrapolate which of the bits is the publisher.

Assuming the publisher is always on the last line and always comes after an empty line. Then perhaps the following?
Dim Lines as New List(Of String)
Dim currentLine as String = ""
Dim previousLine as String = ""
Using reader As StreamReader = New StreamReader(txtOCR.Txt)
currentLine = reader.ReadLine
If String.IsNullOrWhiteSpace(previousLine) then lines.Add(currentLine)
previousLine = currentLine
End Using
txtPublisher.Text = lines.LastOrDefault()
To ignore if the previous line is blank/empty:
Dim Lines as New List(Of String)
Using reader As StreamReader = New StreamReader(txtOCR.Txt)
lines.Add(reader.ReadLine)
End Using
txtPublisher.Text = lines.LastOrDefault()

Related

Through lines textbox startwith and endswith something

in my File Text I have the following things:
I have to make it start from somewhere and stop at a certain point.
but he only starts from that point, but he does not know how to stop at one point.
[Letters]
A
B
C
D
E
[Loop]
[Words]
Fish
Facebook
Google
Youtube
I should display Expected Output:
A
B
C
D
E
Then I should make it display
Fish
Facebook
Google
Youtube
but it shows me:
[Letters]
A
B
C
D
E
[Loop]
[Words]
Fish
Facebook
Google
Youtube
Code
Dim line As String
Using reader As StreamReader = New StreamReader(My.Application.Info.DirectoryPath & "\TestReader.txt")
line = reader.ReadLine
Dim sb As New StringBuilder
Do
Do
If reader.Peek < 0 Then 'Check that you haven't reached the end
Exit Do
End If
line = reader.ReadLine
If line.StartsWith("[Letters]") AndAlso line.EndsWith("[Loop]") Then 'Check if we have reached another check box.
Exit Do
End If
sb.AppendLine(line)
Loop
TextBox1.Text = sb.ToString
sb.Clear()
Loop Until reader.Peek < 0
End Using
This assumes that startPrefix and endPrefix will always be present in the file:
Dim startPrefix As String 'Set as required
Dim endPrefix As String 'Set as required
Dim lines As New List(Of String)
Using reader As New StreamReader("file path here")
Dim line As String
'Skip lines up to the first starting with the specified prefix.
Do
line = reader.ReadLine()
Loop Until line.StartsWith(startPrefix)
line = reader.ReadLine()
Do Until line.StartsWith(endPrefix)
lines.Add(line)
line = reader.ReadLine()
Loop
End Using
'Use lines here.
Are you really sure that you want to look for lines that start with those markers though? Wouldn't you really prefer to look for lines that are equal to those markers?
EDIT:
You could - and probably should - encapsulate that functionality in a method:
Private Function GetLinesBetween(filePath As String, startPrefix As String, endPrefix As String) As String()
Dim lines As New List(Of String)
Using reader As New StreamReader(filePath)
Dim line As String
'Skip lines up to the first starting with the specified prefix.
Do
line = reader.ReadLine()
Loop Until line.StartsWith(startPrefix)
line = reader.ReadLine()
'Take lines up to the first starting with the specified prefix.
Do Until line.StartsWith(endPrefix)
lines.Add(line)
line = reader.ReadLine()
Loop
End Using
Return lines.ToArray()
End Function

Creating multiple .txt files while restricting size of each

In my program, I collect bits of information on a massive scale, hundreds of thousands to millions of lines each. I am trying to limit each file I create to a certain size in order to be able to quickly open it and read the data. I am using a HashSet to collect all the data without duplicates.
Here's my code so far:
Dim Founds As HashSet(Of String)
Dim filename As String = (Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "\Sorted_byKING\sorted" + Label4.Text + ".txt")
Using writer As New System.IO.StreamWriter(filename)
For Each line As String In Founds
writer.WriteLine(line)
Next
Label4.Text = Label4.Text + 1 'Increments sorted1.txt, sorted2.txt etc
End Using
So, my question is:
How do I go about saving, let's say 250,000 lines in a text file before moving to another one and adding the next 250,000?
First of all, do not use Labels to simply store values. You should use variables instead, that's what variables are for.
Another advice, always use Path.Combine to concatenate paths, that way you don't have to worry about if each part of the path ends with a separator character or not.
Now, to answer your question:
If you'd like to insert the text line by line, you can use something like:
Sub SplitAndWriteLineByLine()
Dim Founds As HashSet(Of String) 'Don't forget to initialize and fill your HashSet
Dim maxLinesPerFile As Integer = 250000
Dim fileNum As Integer = 0
Dim counter As Integer = 0
Dim filename As String = String.Empty
Dim writer As IO.StreamWriter = Nothing
For Each line As String In Founds
If counter Mod maxLinesPerFile = 0 Then
fileNum += 1
filename = IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
$"Sorted_byKING\sorted{fileNum.ToString}.txt")
If writer IsNot Nothing Then writer.Close()
writer = New IO.StreamWriter(filename)
End If
writer.WriteLine(line)
counter += 1
Next
writer.Dispose()
End Sub
However, if you will be inserting the text from the HashSet as is, you probably don't need to write line by line, instead you can write each "bunch" of lines at once. You could use something like the following:
Sub SplitAndWriteAll()
Dim Founds As HashSet(Of String) 'Don't forget to initialize and fill your HashSet
Dim maxLinesPerFile As Integer = 250000
Dim fileNum As Integer = 0
Dim filename As String = String.Empty
For i = 0 To Founds.Count - 1 Step maxLinesPerFile
fileNum += 1
filename = IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
$"Sorted_byKING\sorted{fileNum.ToString}.txt")
IO.File.WriteAllLines(filename, Founds.Skip(i).Take(maxLinesPerFile))
Next
End Sub

VB.Net Find and Replace Line By Line Cycling Through File Names from a Datagridview

I was kindly helped before with this code by a guy name Steven, however I have hit yet another stumbling block, I keep messing my loops up.
I have some code that cycles through over 100 ish k of Find and Replaces and i have added into it the ability to cycle through a number of
files to Find and Replace items within said files, i have used ReadLine and WriteLine otherwise I have memory issues and it bombs out.
With the code below,i can get it to either cycle through and not find and replace or find and replace one of the files and not cycle
through the rest of the files. Where am i going wrong with the loops, im going insane. Many thanks, VBvirg
'The files to cycle through
For Each row2 As DataGridViewRow In DataGridView2.Rows
If Not row2.IsNewRow Then
Dim DateFile As String = row2.Cells(1).Value.ToString
Dim DateFileConv As String = row2.Cells(2).Value.ToString
Dim fName As String = DateFile
Dim wrtFile As String = DateFileConv
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
While True
'Thesre are the find and replaces
Dim line As String = strRead.ReadLine()
If line IsNot Nothing Then
For Each row As DataGridViewRow In DataGridView1.Rows
If Not row.IsNewRow Then
Dim Find1 As String = row.Cells(0).Value.ToString
Dim Replace1 As String = row.Cells(1).Value.ToString
line = line.Replace(Find1, Replace1)
End If
Next
strWrite.WriteLine(line)
End If
End While
End If
Next
Cursor.Current = Cursors.Default
MessageBox.Show("Finished Replacing")

how to highlight a text or word in a pdf file using iTextsharp?

I need to search a word in a existing pdf file and i want to highlight the text or word
and save the pdf file
I have an idea using PdfAnnotation.CreateMarkup we could find the position of the text and we can add bgcolor to it...but i dont know how to implement it :(
Please help me out
This is one of those "sounds easy but is actually really complicated" things. See Mark's posts here and here. Ultimately you'll probably be pointed to LocationTextExtractionStrategy. Good luck! If you actually find out how to do it post it here, there several people wondering exactly what you are wondering!
I've found how to do this, just in case someone needs to get words or sentences with locations (coordinates) from a PDF document you'll find this example Project
HERE
, I used VB.NET 2010 for this. Remember to add a reference to your iTextSharp DLL in this Project.
I added my own TextExtraction Strategy Class, based on Class LocationTextExtractionStrategy. I focused on TextChunks, because they already have these coordinates.
There are some known limitations like:
No multiple line searches (phrases), just char/s or word's or a one line sentence are allowed.
It Won't work with rotated text.
I didn't test on PDFs with landscape page orientation but i assume some modifications may be required for this.
In case you need to draw this HighLight/rectangles over a watermark you'll need to add/modify some code, but just code in the Form, this is not related to the text/locations extraction proccess.
#Jcis, I actually managed a workaround for handling multiple searches using your example as a starting point. I use your project as a reference in a c# project, and altered what it does. Instead of just highlighting I actually have it drawing a white rectangle around the search term, and then using the rectangle coordinates, place a form field. I also had to swap the contentbyte writing mode to getovercontent so that I block out the searched text entirely. What I actually did was to create a string array of search terms, and then using a for loop, I create as many different text fields as I need.
Test.Form1 formBuilder = new Test.Form1();
string[] fields = new string[] { "%AccountNumber%", "%MeterNumber%", "%EmailFieldHolder%", "%AddressFieldHolder%", "%EmptyFieldHolder%", "%CityStateZipFieldHolder%", "%emptyFieldHolder1%", "%emptyFieldHolder2%", "%emptyFieldHolder3%", "%emptyFieldHolder4%", "%emptyFieldHolder5%", "%emptyFieldHolder6%", "%emptyFieldHolder7%", "%emptyFieldHolder8%", "%SiteNameFieldHolder%", "%SiteNameFieldHolderWithExtraSpace%" };
//int a = 0;
for (int a = 0; a < fields.Length; )
{
string[] fieldNames = fields[a].Split('%');
string[] fieldName = Regex.Split(fieldNames[1], "Field");
formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, htmlToPdf, finalhtmlToPdf, fieldName[0]);
File.Delete(htmlToPdf);
System.Array.Clear(fieldNames, 0, 2);
System.Array.Clear(fieldName, 0, 1);
a++;
if (a == fields.Length)
{
break;
}
string[] fieldNames1 = fields[a].Split('%');
string[] fieldName1 = Regex.Split(fieldNames1[1], "Field");
formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, finalhtmlToPdf, htmlToPdf, fieldName1[0]);
File.Delete(finalhtmlToPdf);
System.Array.Clear(fieldNames1, 0, 2);
System.Array.Clear(fieldName1, 0, 1);
a++;
}
It bounces the PDFTextGetter function in your example back and forth between two files until I achieve the finished product. It works really well, and it would not have been possible without your initial project, so thank you for that. I also altered your VB to do the text field mapping like so;
For Each rect As iTextSharp.text.Rectangle In MatchesFound
cb.Rectangle(rect.Left, rect.Bottom + 1, rect.Width, rect.Height + 4)
Dim field As New TextField(stamper.Writer, rect, FieldName & Fields)
Dim form = stamper.AcroFields
Dim fieldKeys = form.Fields.Keys
stamper.AddAnnotation(field.GetTextField(), page)
Fields += 1
Next
Just figured I would share what I managed to do with your project as a backbone. It even increments the field names as I need them to. I also had to add a new parameter to your function, but that's not worth listing here. Thank you again for this great head start.
Thanks Jcis!
After a couple of hours of research and thinking, i found your solution, which helped me to solve my Problems.
there were 2 little bugs.
first: the stamper needs to be closed before the reader, otherwise it throws an exception.
Public Sub PDFTextGetter(ByVal pSearch As String, ByVal SC As StringComparison, ByVal SourceFile As String, ByVal DestinationFile As String)
Dim stamper As iTextSharp.text.pdf.PdfStamper = Nothing
Dim cb As iTextSharp.text.pdf.PdfContentByte = Nothing
Me.Cursor = Cursors.WaitCursor
If File.Exists(SourceFile) Then
Dim pReader As New PdfReader(SourceFile)
stamper = New iTextSharp.text.pdf.PdfStamper(pReader, New System.IO.FileStream(DestinationFile, FileMode.Create))
PB.Value = 0 : PB.Maximum = pReader.NumberOfPages
For page As Integer = 1 To pReader.NumberOfPages
Dim strategy As myLocationTextExtractionStrategy = New myLocationTextExtractionStrategy
'cb = stamper.GetUnderContent(page)
cb = stamper.GetOverContent(page)
Dim state As New PdfGState()
state.FillOpacity = 0.3F
cb.SetGState(state)
'Send some data contained in PdfContentByte, looks like the first is always cero for me and the second 100, but i'm not sure if this could change in some cases
strategy.UndercontentCharacterSpacing = cb.CharacterSpacing
strategy.UndercontentHorizontalScaling = cb.HorizontalScaling
'It's not really needed to get the text back, but we have to call this line ALWAYS,
'because it triggers the process that will get all chunks from PDF into our strategy Object
Dim currentText As String = PdfTextExtractor.GetTextFromPage(pReader, page, strategy)
'The real getter process starts in the following line
Dim MatchesFound As List(Of iTextSharp.text.Rectangle) = strategy.GetTextLocations(pSearch, SC)
'Set the fill color of the shapes, I don't use a border because it would make the rect bigger
'but maybe using a thin border could be a solution if you see the currect rect is not big enough to cover all the text it should cover
cb.SetColorFill(BaseColor.PINK)
'MatchesFound contains all text with locations, so do whatever you want with it, this highlights them using PINK color:
For Each rect As iTextSharp.text.Rectangle In MatchesFound
' cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
cb.SaveState()
cb.SetColorFill(BaseColor.YELLOW)
cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
cb.Fill()
cb.RestoreState()
Next
'cb.Fill()
PB.Value = PB.Value + 1
Next
stamper.Close()
pReader.Close()
End If
Me.Cursor = Cursors.Default
End Sub
second: your solution dont work, when the searched text is in the last line of the extraced text.
Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
Dim sb As New StringBuilder()
Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
Dim bStart As Boolean, bEnd As Boolean
Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
Dim sTextInUsedChunks As String = vbNullString
' For Each chunk As TextChunk In locationalResult
For j As Integer = 0 To locationalResult.Count - 1
Dim chunk As TextChunk = locationalResult(j)
If chunk.text.Contains(pSearchString) Then
Thread.Sleep(1)
End If
If ThisLineChunks.Count > 0 AndAlso (Not chunk.SameLine(ThisLineChunks.Last) Or j = locationalResult.Count - 1) Then
If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
Dim sLine As String = sb.ToString
'Check how many times the Search String is present in this line:
Dim iCount As Integer = 0
Dim lPos As Integer
lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
Do While lPos > -1
iCount += 1
If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
Loop
'Process each match found in this Text line:
Dim curPos As Integer = 0
For i As Integer = 1 To iCount
Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer
iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
curPos = iFromChar
iToChar = iFromChar + pSearchString.Length - 1
sCurrentText = vbNullString
sTextInUsedChunks = vbNullString
FirstChunk = Nothing
LastChunk = Nothing
'Get first and last Chunks corresponding to this match found, from all Chunks in this line
For Each chk As TextChunk In ThisLineChunks
sCurrentText = sCurrentText & chk.text
'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
FirstChunk = chk
bStart = True
End If
'Keep getting Text from Chunks while we are in the part where the matching String had been found
If bStart And Not bEnd Then
sTextInUsedChunks = sTextInUsedChunks & chk.text
End If
'If we get out the matching String part then get this Chunk (last Chunk)
If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
LastChunk = chk
bEnd = True
End If
'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
If bStart And bEnd Then
FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
curPos = curPos + pSearchString.Length
bStart = False : bEnd = False
Exit For
End If
Next
Next
End If
sb.Clear()
ThisLineChunks.Clear()
End If
ThisLineChunks.Add(chunk)
sb.Append(chunk.text)
Next
Return FoundMatches
End Function

StreamReader - Reading from lines

I have the following code;
Public Sub writetofile()
' 1: Append playername
Using writer As StreamWriter = New StreamWriter("highscores.txt", True)
writer.WriteLine(PlayerName)
End Using
' 2: Append score
Using writer As StreamWriter = New StreamWriter("highscores.txt", True)
writer.WriteLine(Score)
End Using
End Sub
What I now want to do is read all the odd lines of the file (the player names) and the even lines into two separate list boxes, how would I go about that??
I need to modify;
Using reader As StreamReader = New StreamReader("file.txt")
' Read one line from file
line = reader.ReadLine
End Using
I have used one of the following solutions but cannot get it working :(
Public Sub readfromfile()
Using reader As New StreamReader("scores.txt", True)
Dim line As Integer = 0
While Not reader.EndOfStream
If line Mod 2 = 0 Then
frmHighScores.lstScore.Items.Add(line)
Else
frmHighScores.lstScore.Items.Add(line)
End If
line += 1
End While
End Using
End Sub
You can use the Mod operator for this:
Using reader As New StreamReader("highscores.txt", True)
Dim line As Integer = 0
Dim text As String
Do
text = reader.ReadLine()
If text = Nothing Then
Exit Do
End If
If line Mod 2 = 0 Then
''# even line
Else
''# odd line
End If
line += 1
Loop
End Using
This approach also works for cases when it's not an even/odd pattern, but another number of repetions. Say you have 3 lines for each player:
player name 1
score 1
avatar url 1
player name 2
score 2
avatar url 2
...
Then you can get this pattern by using Mod with 3
Dim subLine As Integer = line Mod 3
If subLine = 0 Then
''# player name
ElseIf subLine = 1 Then
''# score
Else
''# avatar url
End If
line += 1
If you can reliably expect there to be an even number of lines in the file, then you can simplify this by reading two at a time.
Using reader As StreamReader = New StreamReader("file.txt")
While Not reader.EndOfStream
Dim player as String = reader.ReadLine()
Dim otherInfo as String = reader.ReadLine()
'Do whatever you like with player and otherInfo
End While
End Using
I don't really know the syntax of VB but something like this:
dim odd as boolean = True
Using reader As StreamReader = New StreamReader("file.txt")
line = reader.ReadLine
if odd then
' add to list A
else
' add to list B
end
odd = Not Odd
End Using
Another possibility is to just read the whole file as a block into a single string and the SPLIT the string on vbcrlf
Dim buf = My.Computer.FileSystem.ReadAllText(Filename)
Dim Lines() = Split(Buf, vbcrlf)
Then, lines will contain all the lines from the file, indexed.
So you could step through them to get each player and his other info.
For x = 0 to ubound(Lines)
'do whatever with each line
next
If the file was HUGE, you wouldn't necessarily want to do it this way, but for small files, it's a quick and easy way to handle it.