Inconsistent page count of a PDF document - pdf

I'm trying to get the number of pages in the PDF document. Some of my PDFs are created in Word (saved as PDF), some of them are Xeroxed into the directory (not sure if this matters).
After hours of research I've found that this is easier said than done. The page count rarely comes back correct, even though most of the PDFs do in fact contain /Count in their binary content.
For example I've used the following code; it is supposed to open the document in Binary Mode, look for /Count or /N and get the number next to it which is supposed to give me the page count.
Public Sub pagecount(sfilename As String)
On Error GoTo a
Dim nFileNum As Integer
Dim s As String
Dim c As Integer
Dim pos, pos1 As Integer
pos = 0
pos1 = 0
c = 0
' Get an available file number from the system
nFileNum = FreeFile
'OPEN the PDF file in Binary mode
Open sfilename For Binary Lock Read Write As #nFileNum
' Get the data from the file
Do Until EOF(nFileNum)
Input #1, s
c = c + 1
If c <= 10 Then
pos = InStr(s, "/N")
End If
pos1 = InStr(s, "/count")
If pos > 0 Or pos1 > 0 Then
Close #nFileNum
s = Trim(Mid(s, pos, 10))
s = Replace(s, "/N", "")
s = Replace(s, "/count", "")
s = Replace(s, " ", "")
s = Replace(s, "/", "")
For i = 65 To 125
s = Replace(s, Chr(i), "")
Next
pages = Val(Trim(s))
If pages < 0 Then
pages = 1
End If
Close #nFileNum
Exit Sub
End If
'imp only 1000 lines searches
If c >= 1000 Then
GoTo a
End If
Loop
Close #nFileNum
Exit Sub
a:
Close #nFileNum
pages = 1
Exit Sub
End Sub
However, most of the time, it defaults to pages = 1 (under a:).
I've also updated this to 10000 to be sure that it hits the /Count line, yet it still does not give me the correct count.
If c >= 10000 Then
GoTo a
End If
I also came across this reddit
Is there another way to do this, something I can utilize in my app?
Any help is greatly appreciated.
Background:
This is for a legacy VB6 app where I'm attempting to let the user manipulate PDF files. I added a ListBox that displays all PDF documents in a particular directory. When the user double-clicks on one of the files, I display it in a WebBrowser component inside my application.
EDIT: Image containing the BinaryMode line Count for 3 different documents:
I double checked the page count, and /Count displays the correct page count for each of the three documents.

Regular expressions have limits, but I prefer to use them for searching for strings and I think this would be a good place to use one. You may want to play with the pattern because I did this relatively quickly with only a little testing.
Add a reference to Microsoft VBScript Regular Expressions 5.5 to your project. Then you can try the sample code below.
Private Sub Command1_Click()
Dim oRegEx As RegExp
Dim fHndl As Integer
Dim sContents As String
Dim oMatches As MatchCollection
On Error GoTo ErrCommand1_Click
'Open and read in the file
fHndl = FreeFile
Open "some pdf file" For Binary Access Read As #fHndl 'Replace with the path to your PDF
sContents = String(LOF(fHndl), vbNullChar) 'Pre-size the buffer so Get reads the whole file
Get #fHndl, 1, sContents
Close #fHndl 'We have the file contents so close it
fHndl = 0
'Instantiate and configure the RegEx
Set oRegEx = New RegExp
oRegEx.Global = True
oRegEx.Pattern = "(?:/Count )(\d+)" 'Only the digits are captured, so SubMatches(0) is the number
Set oMatches = oRegEx.Execute(sContents)
'Look for a match
If oMatches.Count > 0 Then
If oMatches(0).SubMatches.Count > 0 Then
MsgBox CStr(oMatches(0).SubMatches(0)) & " Pages"
End If
End If
Exit Sub
ErrCommand1_Click:
Debug.Print "Error: " & CStr(Err.Number) & ", " & Err.Description
If Not oRegEx Is Nothing Then Set oRegEx = Nothing
If Not oMatches Is Nothing Then Set oMatches = Nothing
End Sub
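An explanation of the RegEx pattern:
() creates a group
?: inside the parentheses makes the group non-capturing
/Count followed by a space is a literal string
\d+ greedy quantifier, match digits 1 or more times
The remaining pieces apply if you adapt the pattern to read /N from the header of a linearized PDF instead:
<</Linearized is a literal string
.* greedy quantifier, match any character 0 or more times
/N literal string
>> literal string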
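As a language-neutral illustration of the same idea, here is a minimal Python sketch (not the VB6 code above). It assumes that the largest /Count value visible in the raw bytes belongs to the root of the page tree, which is only a heuristic: nested page nodes and outlines also carry /Count, and PDFs that keep their page tree inside compressed object streams will not expose it as plain text at all. The names guess_page_count and COUNT_PATTERN are made up for this example.

```python
import re

# Matches "/Count 12" in the raw bytes of a PDF and captures the digits.
COUNT_PATTERN = re.compile(rb"/Count\s+(\d+)")


def guess_page_count(pdf_bytes: bytes) -> int:
    """Return the largest /Count value found, or 0 if none is visible.

    Heuristic only: this assumes the biggest /Count is the root page
    tree node, which holds the total number of pages.
    """
    counts = [int(digits) for digits in COUNT_PATTERN.findall(pdf_bytes)]
    return max(counts, default=0)


def guess_page_count_from_file(path: str) -> int:
    """Read the whole file in binary mode and apply the heuristic."""
    with open(path, "rb") as handle:
        return guess_page_count(handle.read())
```

If this returns 0 or an implausible number for a given file, a real PDF library is the more reliable route.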

Related

Fastest Method to (read, remove, write) to a Text File

I coded a simple program that reads a text file line by line; if the current line contains letters (a-z, A-Z), it writes that line to another text file.
If the current line contains no letters, it does not write that line to the new file.
I created this because members register on my website and some of them use only numbers as a username. I want to filter those out and keep only the names that contain letters. (Please focus on this project; I know I could just use PHP.)
That already works, but reading line by line and writing to the other text file is slow (about 150 KB written in 1 minute, and it is not my drive; I have a fast SSD).
So I wonder if there is a faster way. I could use ReadAllLines first, but on large files it freezes my program, so I don't know if that would work either (I want to handle large files of 1 GB and more).
This is my code so far:
If System.IO.File.Exists(FILE_NAME) = True Then
Dim objReader As New System.IO.StreamReader(FILE_NAME)
Do While objReader.Peek() <> -1
Dim myFile As New FileInfo(output)
Dim sizeInBytes As Long = myFile.Length
If sizeInBytes > splitvalue Then
outcount += 1
output = outputold + outcount.ToString + ".txt"
File.Create(output).Dispose()
End If
count += 1
TextLine = objReader.ReadLine() & vbNewLine
Console.WriteLine(TextLine)
If CheckForAlphaCharacters(TextLine) Then
File.AppendAllText(output, TextLine)
Else
found += 1
Label2.Text = "Removed: " + found.ToString
TextBox1.Text = TextLine
End If
Label1.Text = "Checked: " + count.ToString
Loop
MessageBox.Show("Finish!")
End If
First of all, as hinted by @Sean Skelly, repeatedly updating UI controls is an expensive operation.
But your bigger problem is File.AppendAllText:
If CheckForAlphaCharacters(TextLine) Then
File.AppendAllText(output, TextLine)
Else
found += 1
Label2.Text = "Removed: " + found.ToString
TextBox1.Text = TextLine
End If
AppendAllText(String, String)
Opens a file, appends the specified string to the file, and then
closes the file. If the file does not exist, this method creates a
file, writes the specified string to the file, then closes the file.
Source
You are repeatedly opening and closing a file, causing overhead. AppendAllText is a convenience method since it performs several operations in one single call but you can now see why it's not performing well in a big loop.
The fix is easy. Open the file once before your loop starts and close it at the end. Make sure that you always close the file properly, even when an exception occurs. For that, you can either call Close in a Finally block or keep your file write operations within a Using block, which disposes the writer automatically.
You could also remove the print to the console; display management has a cost too. Alternatively, print a status update only every 10K lines or so.
When you've done all that, you should notice improved performance.
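The open-once pattern described above can be sketched as follows. This is a Python illustration, not the VB.NET code from the question; lines_with_letters and filter_file are hypothetical names, and the "contains a letter" check is an assumption about what CheckForAlphaCharacters does.

```python
import re
from typing import Iterable, Iterator

# Assumed equivalent of CheckForAlphaCharacters: at least one ASCII letter.
HAS_LETTER = re.compile(r"[A-Za-z]")


def lines_with_letters(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain at least one letter."""
    for line in lines:
        if HAS_LETTER.search(line):
            yield line


def filter_file(src_path: str, dst_path: str) -> int:
    """Stream src_path into dst_path, opening each file exactly once.

    Returns the number of lines written. The with-block closes both
    files even if an exception is raised part-way through.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in lines_with_letters(src):
            dst.write(line)
            written += 1
    return written
```

Because the input is iterated lazily, memory use stays flat regardless of file size, and the output handle is opened and closed once rather than once per line.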
My final code. It works a lot faster now (500 MB in 1 minute):
Using sw As StreamWriter = File.CreateText(output)
For Each oneLine As String In File.ReadLines(FILE_NAME)
Try
If changeme = True Then
changeme = False
GoTo Again2
End If
If oneLine.Contains(":") Then
Dim TestString = oneLine.Substring(0, oneLine.IndexOf(":")).Trim()
Dim TestString2 = oneLine.Substring(oneLine.IndexOf(":")).Trim()
If CheckForAlphaCharacters(TestString) = False And CheckForAlphaCharacters(TestString2) = False Then
sw.WriteLine(oneLine)
Else
found += 1
End If
ElseIf oneLine.Contains(";") Or oneLine.Contains("|") Or oneLine.Contains(" ") Then
Dim oneLineReplac As String = oneLine.Replace(" ", ":")
Dim oneLineReplace As String = oneLineReplac.Replace("|", ":")
Dim oneLineReplaced As String = oneLineReplace.Replace(";", ":")
If oneLineReplaced.Contains(":") Then
Dim TestString3 = oneLineReplaced.Substring(0, oneLineReplaced.IndexOf(":")).Trim()
Dim TestString4 = oneLineReplaced.Substring(oneLineReplaced.IndexOf(":")).Trim()
If CheckForAlphaCharacters(TestString3) = False And CheckForAlphaCharacters(TestString4) = False Then
sw.WriteLine(oneLineReplaced)
Else
found += 1
End If
Else
errors += 1
textstring = oneLine
End If
Else
errors += 1
textstring = oneLine
End If
count += 1
Catch
errors += 1
textstring = oneLine
End Try
Next
End Using

Trouble with interrupting a batch file started with VBA

I have an Excel file with a lot of PC names on a server. I want to execute the "systeminfo" command for each one and extract the OS from its output. The OS should then be put into an Excel cell automatically. To do so, I used the following code, in the VBA file and the batch file respectively.
However, whenever the server can't reach a PC, the cmd window is stuck until I manually close it. Since the list is actually 148 names long, a way to automatically close those windows after, say, 8 seconds would be really helpful.
I looked for a way to multi-thread VBA, only to find out that it is a single-threaded language. I then tried to start another batch file alongside the one I'm actually using, so as to forcefully kill it after a set time, but it seems that the second batch file starts only after the first has terminated, making it useless.
VBA
Sub Test()
'
' Test Macro
' I'm not an expert in VBA, I just picked it up for this task, so a lot of code will result redundant. Bear with me
'
'
Dim i As Integer
'a is basically i-1.
a = 1
' I needed 148 cells for the project
Dim models(1 To 147) As String
For i = 2 To 148
models(a) = Cells(i, 3).Value
a = a + 1
Next i
a = 1
For i = 2 To 148
'not totally sure what the next five lines actually do, but "metodo" is the name of the batch file.
Dim strShellCommand As String
strShellCommand = "C:\Users\Administrator\Desktop\metodo.bat " + models(a)
Set oSh = CreateObject("WScript.Shell")
Set oEx = oSh.Exec(strShellCommand)
strBuf = oEx.StdOut.readAll
'I took out of the string everything that wasn't purely the OS name
Dim FinalString As String
FinalString = Right(strBuf, 26)
FinalString = Left(FinalString, 25)
'this is the line that prints the OS names into Excel cells
ActiveSheet.Cells(i, 10) = FinalString
a = a + 1
Next i
End Sub
Then there is the batch file:
set nome=%1
shift
systeminfo /s %nome% |findstr /c:"Microsoft Windows "
You can add a control loop after the Set oEx = oSh.Exec(strShellCommand) line, like this:
Set oEx = oSh.Exec(strShellCommand)
TimeOut = 8 'Seconds to wait before giving up
LoopCount = 0
Do 'Control loop
Application.Wait Now + TimeValue("0:00:01") 'Pause 1 second (WScript.Sleep only exists under Windows Script Host, not in VBA)
LoopCount = LoopCount + 1
Loop Until (oEx.Status <> 0) Or (LoopCount >= TimeOut)
If oEx.Status = 0 Then 'Timeout occurred
oEx.Terminate
ReturnValue = "[Process terminated after timeout!]" & vbCrLf
Else
ReturnValue = "[Process completed]" & vbCrLf
End If
Each pass through the loop takes 1 second, and the LoopCount >= TimeOut check limits the total wait to 8 seconds. Run the loop before reading oEx.StdOut, because ReadAll blocks until the process ends.
Good luck.
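For comparison, the same "give the command a deadline, then kill it" idea can be sketched in Python, where the standard library handles the waiting. This is an illustration, not a drop-in replacement for the VBA macro; run_with_timeout, FAST_COMMAND and SLOW_COMMAND are hypothetical names, and the two commands are stand-ins for a reachable and an unreachable PC.

```python
import subprocess
import sys

# Stand-ins for the real batch call: one finishes at once, one hangs.
FAST_COMMAND = [sys.executable, "-c", "print('Microsoft Windows 10 Pro')"]
SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_timeout(command, timeout_seconds):
    """Run command and return its stdout, or None if it timed out.

    subprocess.run kills the child process itself when the timeout
    expires, so no manual polling loop is needed.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.stdout
```

A caller would loop over the PC names, pass each command with a limit of 8 seconds, and write a placeholder into the sheet whenever None comes back.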

Reading text files with specific prefix

I have a folder with lots of text files, each containing the following lines (in random order):
A = ...
B = ...
C = ...
Now I would like to import these text files into an Excel spreadsheet,
where each prefix gets its own column and each file gets its own row.
Example: 2 files
File 1:
A = 1
B = 2
C = 3
File 2:
A = 4
B = 5
C = 6
I would like the Excel sheet to look like this:
NR / A / B / C
1 / 1 / 2 / 3
2 / 4 / 5 / 6
I am still learning VB, and this is just a bit over the top for me.
I have found a macro like this:
Sub Read_Text_Files()
Dim sPath As String, sLine As String
Dim oPath As Object, oFile As Object, oFSO As Object
Dim r As Long
'Files location
sPath = "C:\Test\"
r = 1
Set oFSO = CreateObject( _
"Scripting.FileSystemObject")
Set oPath = oFSO.GetFolder(sPath)
Application.ScreenUpdating = False
For Each oFile In oPath.Files
If LCase(Right(oFile.Name, 4)) = ".txt" Then
Open oFile For Input As #1 ' Open file for input.
Do While Not EOF(1) ' Loop until end of file.
Input #1, sLine ' Read data
If Left(sLine, 1) = "A=" Then 'Now i need to write this to the first column of that row
If Left(sLine, 1) = "B=" Then 'For the second column.
Range("A" & r).Formula = sLine ' Write data line
r = r + 1
Loop
Close #1 ' Close file.
End If
Next oFile
Application.ScreenUpdating = True
End Sub
Do you know how to open files in VBA for reading using syntax like Open and Line Input?
If not, read this: https://stackoverflow.com/a/11528932/2832561
I found this by googling for "VBA open file read"
Do you know how to work with and parse strings (and arrays) using functions like Mid, Left, Right, Split and Join?
If not, try reading this: http://www.exceltrick.com/formulas_macros/vba-split-function/
I found this by googling for "VBA String functions parse text"
Do you know how to work with Workbook and Worksheet objects and assign values to Range objects in Excel?
If not, try reading this: http://www.anthony-vba.kefra.com/vba/vbabasic2.htm
I found this by googling for "Workbook Worksheet Range VBA"
Once you have had a chance to try putting together a solution using these pieces, you can post specific questions on any issues you run into.
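To show how those three pieces (read the lines, split on the prefix, place each value in its column) fit together, here is a small Python sketch of the logic only. It is not VBA, and it works on file contents passed in as strings; parse_pairs and build_rows are hypothetical names, and the "NR" numbering simply follows the order in which the files are supplied.

```python
def parse_pairs(text: str) -> dict:
    """Turn lines such as 'A = 1' into {'A': '1'}, in any order."""
    pairs = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:  # skip lines that contain no '='
            pairs[key.strip()] = value.strip()
    return pairs


def build_rows(file_texts, columns=("A", "B", "C")):
    """Build a header row plus one row per file, one column per prefix."""
    rows = [["NR", *columns]]
    for number, text in enumerate(file_texts, start=1):
        pairs = parse_pairs(text)
        # A missing prefix becomes an empty cell rather than an error.
        rows.append([number, *(pairs.get(name, "") for name in columns)])
    return rows
```

In VBA the same shape applies: Split each line on "=", pick the column from the prefix, and advance the row counter once per file rather than once per line.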

Reading the next line from a text file

I'm working on an RPG type game for a project and I am stuck.
Basically, this code searches for a name in a text file (structure: odd lines are names and even lines are levels). It then needs to output the next line, which is the level they were on. I have a counter (variable "count") that gives the number of the line on which the level is written, but I cannot use that count to read that line (using "FileSystem.LineInput(count)").
Here is my full code:
Sub LoadGame()
Dim filename, filepath, searchitem, question, read As String
Dim found As Boolean
Dim count As Integer = 1
filename = "save.txt"
filepath = CurDir() & "\" & filename
searchitem = name
FileOpen(1, filename, OpenMode.Input)
Do While Not EOF(1)
read = LineInput(1)
If read = searchitem Then
found = True
Exit Do
Else
found = False
End If
count = count + 1
Loop
If found = True Then
If count >= 3 Then
count = count + 1
End If
question = FileSystem.LineInput(count) ' This bit is broken
Console.WriteLine("Found save game... Loading: " & question)
Console.ReadLine()
Console.BackgroundColor = ConsoleColor.Black
Console.ForegroundColor = ConsoleColor.Red
Console.Clear()
Race(question)
Else
Console.WriteLine("No save game...")
Console.ReadLine()
End If
FileClose(1)
End Sub
I am not sure what is wrong but any help would be greatly appreciated (using VB 2010)
LineInput reads the next line of the specified file (parameter FileNumber).
Your file has the FileNumber 1, and after the loop the file pointer already points to the desired line. Therefore, it should be sufficient to call:
question = FileSystem.LineInput(1)
In my opinion, you should avoid this kind of file access (by FileNumber) in .NET. It is just an old relic from VB6 times. In .NET you have easy-to-use classes such as StreamReader for that purpose. But if you want to do it the old-fashioned way, at least use the FreeFile function to obtain the file number.
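The underlying point, that a sequential reader is already positioned on the next line once the name has been found, can be sketched in Python. This is an illustration rather than VB.NET; find_level is a hypothetical name, and the sketch assumes the save file strictly alternates name lines and level lines as described in the question.

```python
from typing import Iterable, Optional


def find_level(lines: Iterable[str], name: str) -> Optional[str]:
    """Return the line that follows the matching name line, if any.

    Lines are consumed in pairs (name, level), so a level that happens
    to look like a name is never mistaken for one.
    """
    iterator = iter(lines)
    for name_line in iterator:
        # The reader now sits on the level line; read it straight away.
        level_line = next(iterator, None)
        if name_line.rstrip("\r\n") == name:
            return None if level_line is None else level_line.rstrip("\r\n")
    return None
```

With a real file this would be called as find_level(open("save.txt"), name); no line counter is needed because the iterator keeps the position.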

Start reading massive text file from the end

I would like to ask for some alternatives for my problem.
Basically, I'm reading a .txt log file averaging 8 million lines, around 600 MB of raw text.
I'm currently using StreamReader to do 2 passes over those 8 million lines, sorting and filtering the important parts of the log file, but my computer takes ~50 seconds for 1 complete run.
One way I could optimize this is to make the first pass start reading at the end, because the most important data is located in approximately the final 200k lines. Unfortunately, from what I found, StreamReader can't do this. Any ideas on how to do this?
Some general restrictions:
# of lines varies
size of file varies
location of important data varies but approx at the final 200k line
Here's the loop code for the first pass of the log file just to give you an idea
Do Until sr.EndOfStream = True 'Read whole File
Dim streambuff As String = sr.ReadLine 'Array to Store CombatLogNames
Dim CombatLogNames() As String
Dim searcher As String
If streambuff.Contains("CombatLogNames flags:0x1") Then 'Keyword to Filter CombatLogNames Packets in the .txt
Dim check As String = streambuff 'Duplicate of the Line being read
Dim index1 As Char = check.Substring(check.IndexOf("(") + 1) '
Dim index2 As Char = check.Substring(check.IndexOf("(") + 2) 'Used to bypass the first CombatLogNames packet that contain only 1 entry
If (check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " ") Then 'Stricter Filters for CombatLogNames
Dim endCLN As Integer = 0 'Signifies the end of CombatLogNames Packet
Dim x As Integer = 0 'Counter for array
While (endCLN = 0 And streambuff <> "---- CNETMsg_Tick") 'Loops until the end keyword for CombatLogNames is seen
streambuff = sr.ReadLine 'Reads a new line to flush out "CombatLogNames flags:0x1" which is unneeded
If ((streambuff.Contains("---- CNETMsg_Tick") = True) Or (streambuff.Contains("ResponseKeys flags:0x0 ") = True)) Then
endCLN = 1 'Value change to determine end of CombatLogName packet
Else
ReDim Preserve CombatLogNames(x) 'Resizes the array while preserving the values
searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'")) 'Additional filtering to get only valuable data
CombatLogNames(x) = search(searcher)
x += 1 '+1 to Array counter
End If
End While
Else
'MsgBox("Something went wrong, Flame the coder of this program!!") 'Bug Testing code that is disabled
End If
Else
End If
If (sr.EndOfStream = True) Then
ReDim GlobalArr(CombatLogNames.Length - 1) 'Resizing the Global array to prime it for copying data
Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length) 'Just copying the array to make it global
End If
Loop
You CAN set the BaseStream to the desired reading position; you just can't set it to a specific LINE (because counting lines requires reading the complete file).
Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
For i = 1 To 100
sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
Next
End Using
Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
sr.BaseStream.Seek(-100, SeekOrigin.End)
Dim garbage = sr.ReadLine ' can not use, because very likely not a COMPLETE line
While Not sr.EndOfStream
Dim line = sr.ReadLine
Console.WriteLine(line)
End While
End Using
For any later read of the same file, you could simply save the final position (of the BaseStream) and, on the next read, advance to that position before you start reading lines.
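The seek-then-discard technique from the answer above looks like this as a Python sketch. It takes any seekable binary stream; read_tail_lines is a hypothetical name, and the ASCII decoding mirrors the encoding used in the VB.NET sample rather than anything known about the real log file.

```python
import io


def read_tail_lines(stream: io.BufferedIOBase, tail_bytes: int) -> list:
    """Return the complete lines found in the last tail_bytes of stream.

    The first line after the seek is thrown away unless we started at
    byte 0, because the seek almost certainly landed mid-line.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    start = max(0, size - tail_bytes)
    stream.seek(start)
    if start > 0:
        stream.readline()  # partial line, not usable
    return [
        raw.decode("ascii", errors="replace").rstrip("\r\n")
        for raw in stream
    ]
```

For a file on disk, pass open(path, "rb"). Since the number of lines varies, tail_bytes has to be estimated from the average line length, for example a little more than 200,000 times that average.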
What worked for me was skipping the first 4M lines (a simple if counter > 4M check surrounding everything inside the loop) and then adding background workers that did the filtering and, if a line was important, added it to an array, while the main thread continued reading lines. In the end this saved about a third of the time.