PDF to plain text, Some difficult pages were encountered Adobe Acrobat XI - vba

Basic Problem:
For this PDF: https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=xSu2ZR
I am unable to convert/Save manually as plain text using Adobe Acrobat XI standard or the batch conversion script (below). The generated file is blank.
Full problem:
As part of my attempts to batch convert PDFs to text, I have run into a strange error where acrobat XI returns the following:
Disappointingly clicking ok generates the text file blank.
The following script to loop through PDF files and convert them to text files using acrobat: It works fine for most PDFs except ones with figures like above.
Sub LoopThroughFiles()
Dim StrFile As String
Dim pdfPath As String
StrFile = Dir("C:\temp\PDFs\")
fileRoot = "C:\temp\PDFs\"
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Do While Len(StrFile) > 0
Debug.Print StrFile
pdfPath = fileRoot & StrFile
Debug.Print pdfPath
success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".txt")
StrFile = Dir
On Error Resume Next
Loop
End Sub
'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object, success As Boolean
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
If success Then
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
End If
AcroXApp.Hide
AcroXApp.Exit
ConvertPdf2 = success 'report success/failure
End Function
The error appears to be jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text" If instead I use jsObj.SaveAs textPath, "com.adobe.acrobat.accesstext" the text file is generated but for my needs it is important the file generates is in the plain text format.
The reason for this can be seen below in a different PDF. These are the different types of text files generated:
Plain text (extends as sentences in the horizontal direction - this is required):
Access Text: (creates more of a body of text - this separated sentences by carriage return and is problematic)
I reckon this is a lost cause for these sorts of PDFs; disappointing, though, as many of the PDFs I need to convert are in this format. Appear to have been plagued with issues trying to solve this one.
Anyway just wondered if it may be possible to disable the popup message, and maybe this will allow the plain-text write to occur?
Alternatively can't think of much else.

It looks like your Acrobat version 11 has issues since "Works for Me" but using older version Reader 9, however its textport as plain text, is goingt to be what you get from pdftotext e.g. left aligned single lines, unsure if a 10 Pro or 20## might be good enough, when did Adobe massage the natural pdf output to richer ?
Reader 9 export as plain text
Opening in other viewers works well enough to save as word or wordpad
Or edit the PDF before save as Docx or convert to text
Using pdftotext will result in a layout reflecting the true output of characters on the page (I call that Plain Text). However your desire is to remove single line feeds (and possibly EOL hyphens). SO that can be done by any Find And Replace Text processing after extraction. Here I outline a possible method.
txt2par.cmd
#echo off
if not exist "%~dpn1.txt" goto help
REM because of method we need to append an extra new line to input (some cases may need two?)
echo/&echo Preparing files
echo/>temp_nl.txt&copy /b "%~dpn1.txt"+temp_nl.txt temp_out.txt >nul:
REM tool will not replace files in binary mode unless it sees there is a dummy backup to use !
echo temp_nl.txt >temp_out.txt.bak
echo/&echo Processing ...&echo/
REM 1st pass ensure binary line feeds are converted to some plain text
fart.exe -q -b --binary --c-style temp_out.txt "\x0D\x0A" "<NL>" >nul: 2>&1
REM 2nd pass ensure double "<NL><NL>" are converted back to single new line
fart.exe --c-style temp_out.txt "<NL><NL>" "\x0D\x0A\x0D\x0A"
echo/&echo de-hypenating line ends&echo/
REM 3rd pass remove hyphenation (Caution that may not always be desirable
fart.exe --c-style temp_out.txt "\x2D<NL>" "\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20\x20\x20\x20\x20" "\x20\x20\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "\x20\x20\x20" "\x20\x20"
REM 4th pass ensure remaining line markers are converted to single with little leading space
fart.exe --c-style temp_out.txt "<NL>\x20\x20" "<NL>\x20"
REM 5th pass ensure remaining line markers are converted to single space
fart.exe --c-style temp_out.txt "<NL>" "\x20"
echo/
echo Done
pause
goto eof
:help
echo/
echo Input must be a filename.txt accepts drag and drop
echo/
echo Usage txt2par filename.txt
echo/
echo Will convert single line feeds to space and
echo convert double line feeds to single line gap
echo/
pause
That may be good enough for some sources however needs more consideration for your complex templated layouts. Possibly by not using whitespaces of two or more space bars (easiest done in more powerful string editor or else you jump unknown loops).

From: Plain Text From PDF without inserting line breaks but retaining carriage returns using VBA. Working solution but requires improvement
Change: Encoding:=1252 to 65001 for unusual characters.
Sub LoopThroughFiles()
Dim StrFile As String
Dim pdfPath As String
StrFile = Dir("C:\temp\PDFs\")
fileRoot = "C:\temp\PDFs\"
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Do While Len(StrFile) > 0
Debug.Print StrFile
n = StrFile
pdfPath = fileRoot & StrFile
Debug.Print pdfPath
'Convert to WordDoc
success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".doc")
StrFile = Dir
On Error Resume Next
oWd.Quit
'Convert to PlainText
Debug.Print pdfPath & ".doc"
success2 = GetTextFromWord(pdfPath & ".doc", n)
Loop
End Sub
'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object, success As Boolean
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
If success Then
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
AcroXAVDoc.Close False
End If
AcroXApp.Hide
AcroXApp.Exit
ConvertPdf2 = success 'report success/failure
End Function
Function GetTextFromWord(DocStr As String, n)
Dim filePath As String
Dim fso As FileSystemObject
Dim fileStream As TextStream
Dim oWd As Object, oDoc As Object, fileRoot As String
Const wdFormatText As Long = 2, wdCRLF As Long = 0
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
fileRoot = "C:\temp\PDFs" 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(DocStr)
On Error GoTo 0 'stop ignoring errors
Debug.Print n
If Not oDoc Is Nothing Then
filePath = fileRoot & n & ".txt" 'filename
Debug.Print filePath
oDoc.SaveAs2 Filename:=filePath, _
FileFormat:=wdFormatText, LockComments:=False, Password:="", _
AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
:=False, SaveAsAOCELetter:=False, Encoding:=1252, InsertLineBreaks:=False _
, AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
oDoc.Close False
End If
oWd.Quit
GetTextFromWord = success2
End Function

Related

Extracting images from Word document using VBA

I need to loop over some word documents, and extract images from a word document and save them in a separate folder.
I've tried the method of saving them as an HTML document, but it is not a good fit for my requirement.
Now, I'm looping through the images using inlineshapes object and then copy-pasting them on a publisher document and then saving them as an image. However, I'm facing a Runtime Automation error when I'm running the script.
For using the Publisher runtime library I've tried both early and late binding but I'm facing the error on both of them.
Can anyone please let me know what is the problem? Also, if anyone can explain why I'm facing this error, that'd be great. As per my understanding, it is due to memory allocation, but I'm not sure.
Here is the code block that I've been working on (fp, dp are folder paths, while filename is the word document name. I'm calling this sub in another sub that is looping over all the files in a folder):
Sub test(ByVal fp As String, ByVal dp As String, ByVal filename As String)
Dim doc As Document
Dim pubdoc As New Publisher.Document
Dim shp As InlineShape
'Application.Screenupdating = False
'Dim pubdoc As Object
'Set pubdoc = CreateObject("Publisher.Document")
Set doc = Documents.Open(fp)
With doc
i = .InlineShapes.Count
Debug.Print i
End With
For j = 1 To i
Set shp = doc.InlineShapes(j)
shp.Select
Selection.CopyAsPicture
pubdoc.Pages(1).Shapes.Paste
pubdoc.Pages(1).Shapes(1).SaveAsPicture (dp & Application.PathSeparator & j & ".jpg")
pubdoc.Pages(1).Shapes(1).Delete
Next
doc.Close (wdDoNotSaveChanges)
pubdoc.Close
'Application.Screenupdating = True
End Sub
Apart from this, if anyone has any suggestions to make this faster, I'm all ears. Thanks in advance!
Just add .zip to the end of the file name, expand the file and look in the word/media folder. All the files will be there, no programming necessary.
Extracting the pictures from a Filtered HTML document that was created from your original source document would be faster. However, you said that was not a good fit for you needs so ... here is example code that will locate pictures in your source document and paste them into a second document.
The speed problem of this type of code is caused by the CopyPicture working from a Selection command, so I recommend using a range instead. Of course the For/Next loop that is required is slower no matter what.
Sub CopyPasteAsPicture()
Dim doc As Word.Document, iShp As Word.InlineShape, shp As Word.Shape
Dim i As Integer, nDoc As Word.Document, rng As Word.Range
Set doc = ActiveDocument
If doc.Shapes.Count > 0 Then
For i = 1 To doc.Shapes.Count
Set shp = doc.Shapes(i)
If shp.Type = msoLinkedPicture Or shp.Type = msoPicture Then
'if you want only pictures extracted then you have
'to specify the type
shp.ConvertToInlineShape
'if you want all extracted pictures to be in the sequence
'they appear in the document then you have to convert
'floating shapes to inline shapes
End If
Next
End If
If doc.Content.InlineShapes.Count > 0 Then
Set nDoc = Word.Documents.Add
Set rng = nDoc.Content
For i = 1 To doc.Content.InlineShapes.Count
doc.Content.InlineShapes(i).Range.CopyAsPicture
rng.Paste
rng.Collapse Word.WdCollapseDirection.wdCollapseEnd
rng.Paragraphs.Add
rng.Collapse Word.WdCollapseDirection.wdCollapseEnd
Next
End If
End Sub
If you want to place all shapes (floating or inline) into a folder as image files, then the best way is to save the source document as a filtered HTML document. Here is the command:
htmDoc.SaveAs2 FileName:=LGPWorking & strFileName, AddToRecentFiles:=False, FileFormat:=Word.WdSaveFormat.wdFormatFilteredHTML
In the above the active document is assigned to the variable htmDoc. I am giving this new document a specific name and location. The output from this is not only the HTML file but also a directory by the same name with an appended "_Files" label. In the "x_Files" directory are all the image files.
If you only want selective images pulled from your original source document, or if you want images pulled from multiple source documents ... then you need to use the above code that I shared for placing only the images you want from one or more source document into a new Word document and then save that new document as an Filtered HTML.
When your routine is done, you can Kill the HTML document and only leave the Files directory.
I had to change a few things around, but this will allow to save a single image on a word document and go through a couple of cycles before it turns into a jpg on the other side, without any white space
filename = ActiveDocument.FullName
saveLocaton = "z:\temp\"
FolderName = "test"
On Error Resume Next
Kill "z:\temp\test_files\*" 'Delete all files
RmDir "z:\temp\test_files" 'Delete folder
ActiveDocument.SaveAs2 filename:="z:\temp\test.html", FileFormat:=wdFormatHTML
ActiveDocument.Close
Kill saveLocaton & FolderName & ".html"
Kill saveLocaton & FolderName & "_files\*.xml"
Kill saveLocaton & FolderName & "_files\*.html"
Kill saveLocaton & FolderName & "_files\*.thmx"
Name saveLocaton & FolderName & "_files\image00" & 1 & ".png" As saveLocaton & FolderName & "_files\" & test2 & "_00" & x & ".jpg"
Word.Application.Visible = True
Word.Application.Activate

VBA search and replace script produce long delays

[[UPDATE: Apologies...I forgot the code block. Edited.]]
This is a question from one of my faculty, so I'll quote him directly, and preface this by saying I don't know VB (just the conduit here):
"In order to help an international student, I need to convert Youtube transcripts and Closed Captionings into readable text.
I wrote this routine for the purpose of eliminating time code marks from Youtube transcripts by invoking Visual Basic from within Microsoft Word. Since time codes are always on a separate line and since time codes must contain “:”, I am searching for the character “:” and then deleting the whole line in which it occurs. For some reason, the routine I have written, when completed, forces Word into a long unresponsive period (around 60-100 seconds) after which it works perfectly well. Any suggestions for avoiding that delay or for solving this problem within Word, without writing macros?"
Sub Deleteyt()
Dim oRng As Word.Range
Dim oRngDelete As Word.Range
Set oRng = ActiveDocument.Range
With oRng.Find
.Text = ":"
While .Execute
oRng.Select
Set oRngDelete = ActiveDocument.Bookmarks("\Line").Range
oRngDelete.Delete
Wend
End With
End Sub
So the code works, there's just the long delay up front. Any ideas?
You don't need a macro for this - all you need is a wildcard Find/Replace, with:
Find = <[0-9]#:[0-9]#>*^13
Replace = nothing
Why not modifying the text file which contains the transcript? I would expect you end up with a .txt file.
See example below, highly insipred from: Text file in VBA: Open/Find Replace/SaveAs/Close File
Sub CleanFile()
Dim sBuf As String
Dim sTemp As String
Dim iFileNum As Integer
Dim sFileName As String
' Edit as needed
sFileName = application.GetOpenFilename
iFileNum = FreeFile
Open sFileName For Input As iFileNum
Do Until EOF(iFileNum)
Line Input #iFileNum, sBuf
'If the line does not contain a ':', then include the line in the memory sTemp
if If InStr(sBuf, ":") = 0 Then sTemp = sTemp & sBuf & vbCrLf
Loop
Close iFileNum
iFileNum = FreeFile
sFileName = Application.GetSaveAsFilename()
Open sFileName For Output As iFileNum
Print #iFileNum, sTemp
Close iFileNum
End Sub

How to count number of rows and to move files automatically with VBA macros?

My goal is to write a VBA macros that will allow:
to choose a folder with files to open
then to count number of rows in each file (each file contain only 1 sheet).
to move to another folder all the files that contain more than 1 row
I'm very new in VBA, so what i found is how to count number of rows from active worksheet, but i still can't manage automatically files opening and moving to another folder:
Sub RowCount()
Dim iAreaCount As Integer
Dim i As Integer
Worksheets("Sheet1").Activate
iAreaCount = Selection.Areas.Count
If iAreaCount <= 1 Then
MsgBox "The selection contains " & Selection.Rows.Count & " rows."
Else
For i = 1 To iAreaCount
MsgBox "Area " & i & " of the selection contains " & _
Selection.Areas(i).Rows.Count & " rows."
Next i
End If
End Sub
Could someone help with this, please?
This is actually easy. Really easy. :)
First, code to choose a folder to look into for Excel files. Used Google and searched for excel vba select folder dialog. First result yields this code:
Function GetFolder(strPath As String) As String
Dim fldr As FileDialog
Dim sItem As String
Set fldr = Application.FileDialog(msoFileDialogFolderPicker)
With fldr
.Title = "Select a Folder"
.AllowMultiSelect = False
.InitialFileName = strPath
If .Show <> -1 Then GoTo NextCode
sItem = .SelectedItems(1)
End With
NextCode:
GetFolder = sItem
Set fldr = Nothing
End Function
We'll get to using it for later. Next, we need a loop to count how many rows there are in each file/sheet. However, we can't count them without these files open. So, let's look for a code that opens workbooks in a loop. Googling excel vba open excel files in folder, we get the second result. First result is a deprecated method in Excel 2007 and up. I will be assuming you're running 2007 and up. Here's the code, applying the proper correction detailed by Siddharth Rout.
Sub OpenFiles()
Dim MyFolder As String
Dim MyFile As String
MyFolder = "Blah blah blah"
MyFile = Dir(MyFolder & "\*.xlsx")
Do While MyFile <> ""
Workbooks.Open Filename:=MyFolder & "\" & MyFile
MyFile = Dir
Loop
End Sub
Now, some semi-advanced best practices. Rather than opening each workbook/worksheet/file and counting the rows in each of the opened files (which is highly counter-intuitive), let's modify the above code to count the rows in each file as well, then move them to another folder if they have more than one (1) used row. We'll also change the above code to take into consideration as well the first function to get the folder we want to apply the second code to.
Sub OpenFiles()
Dim MyFolder As String
Dim MyFile As String
MyFolder = GetFolder("C:\users\yourname\Desktop" 'Modify as needed.
MyFile = Dir(MyFolder & "\*.xlsx") 'Modify as needed.
Do While MyFile <> ""
Workbooks.Open Filename:=MyFolder & "\" & MyFile
MyFile = Dir
Loop
End Sub
See what happened there? We called the GetFolder function and assigned it to MyFolder. We then concatenate MyFolder and a wildcarded string, then pass it to Dir so we can loop over the files. What's the remaining two things? Right, count the used rows AND moving the files. For the used rows, I'll hack a simple function to check the workbook's only sheet to see if the row is 2 or greater.
Function CountUsedRows(Wbk As Workbook) As Long
Dim WS As Worksheet
Set WS = Wbk.Sheets(1)
CountUsedRows = WS.Range("A" & Rows.Count).End(xlUp).Row 'Modify as necessary.
End Function
Now that's simple enough. Next, let's write a simple code to move the files. For personal purposes, I'll write a code to copy instead. It'll be up to you to modify it for moving, as that's a rather sensitive operation and if it messes up... well. Hmm. But something here tells me that there's a much better option. Copying can cause all manners of error from permission denial to erroneous copying. Since we've got the file open, why not just save them instead to the new folder?
Now, let's tie them all together neatly.
Sub OpenFiles()
Dim MyFolder As String
Dim MyFile As String
Dim TargetWB As Workbook
MyFolder = GetFolder("C:\Users\yourname\Desktop") 'Modify as needed.
MyFile = Dir(MyFolder & "\*.xlsx") 'Modify as needed.
Application.ScreenUpdating = False
Application.DisplayAlerts = False
Do While MyFile <> ""
Set TargetWB = Workbooks.Open(Filename:=MyFolder & "\" & MyFile)
With TargetWB
If CountUsedRows(TargetWB) > 1 Then
.SaveAs "C:\Users\yourname\Desktop\Blah\CopyOf" & MyFile 'Modify as needed.
End If
.Close
End With
MyFile = Dir
Loop
Shell "explorer.exe C:\Users\yourname\Desktop\Blah", vbMaximizedFocus 'Open the folder.
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
Function GetFolder(strPath As String) As String
Dim fldr As FileDialog
Dim sItem As String
Set fldr = Application.FileDialog(msoFileDialogFolderPicker)
With fldr
.Title = "Select a Folder"
.AllowMultiSelect = False
.InitialFileName = strPath
If .Show <> -1 Then GoTo NextCode
sItem = .SelectedItems(1)
End With
NextCode:
GetFolder = sItem
Set fldr = Nothing
End Function
Function CountUsedRows(Wbk As Workbook) As Long
Dim WS As Worksheet
Set WS = Wbk.Sheets(1)
CountUsedRows = WS.Range("A" & Rows.Count).End(xlUp).Row 'Modify as necessary.
End Function
Tried and tested. Let us know if this works for you.
Nice answer from Manhattan: that's exactly how I use Excel's built-in functionality to select a folder and fetch a set of file names.
However, there's an interesting side-question in there:
Are those single-sheet Excel files workbooks, of just .csv text files?
If they have a .csv extension, you don't need to open them in Excel to count the rows!
Here's the code to do it:
Fast VBA for Counting Rows in a CSV file
Public Function FileRowCount(FilePath As String, Optional RowDelimiter As String = vbCr) As Long
' Returns the row count of a text file, including the header row
' Returns - 1 on error
' Unicode-compliant, works on UTF-8, UTF-16, ASCII, with or without a Byte order Marker.
' Reads a typical 30Mb file over the network in 200-300ms. Hint: always copy to a local folder.
' If you're scanning files for use with a SQL driver, use basSQL.TableRowCount: it's 20x slower,
' but it returns a proper test of the file's usability as a SQL 'table'
' Nigel Heffernan Excellerando.Blogspot.com 2015
' Unit test:
' s=Timer : for i = 0 to 99 : n=FileRowCount("C:\Temp\MyFile.csv") : Next i : Print Format(n,"&num;,&num;&num;0") & " rows in " & FORMAT((Timer-s)/i,"0.000") & " sec"
' Network performance on a good day: reads ~ 150 MB/second, plus an overhead of 70 ms for each file
' Local-drive performance: ~ 4.5 GB/second, plus an overhead of 4 ms for each file
On Error Resume Next
Dim hndFile As Long
Dim lngRowCount As Long
Dim lngOffset As Long
Dim lngFileLen As Long
Const CHUNK_SIZE As Long = 8192
Dim strChunk As String * CHUNK_SIZE
If Len(Dir(FilePath, vbNormal)) &LT; 1 Then
FileRowCount = -1
Exit Function
End If
' trap the error of a folder path without a filename:
If FileName(FilePath) = "" Then
FileRowCount = -1
Exit Function
End If
hndFile = FreeFile
Open FilePath For Binary Access Read Shared As &num;hndFile
lngFileLen = LOF(hndFile)
lngOffset = 1
Do Until EOF(hndFile)
Get &num;hndFile, , strChunk
FileRowCount = FileRowCount + UBound(Split(strChunk, RowDelimiter))
Loop
Close &num;hndFile
Erase arrBytes
End Function
Public Function FileName(Path As String) As String
' Strip the folder and path from a file's path string, leaving only the file name
' This does not check for the existence or accessibility of the file:
' all we're doing here is string-handling
' Nigel Heffernan Excellerando.Blogspot.com 2011
Dim strPath As String
Dim arrPath() As String
Const BACKSLASH As String * 1 = "\"
strPath = Trim(Path)
If strPath = "" Then Exit Function
If Right$(strPath, 1) = BACKSLASH Then Exit Function
arrPath = Split(strPath, BACKSLASH)
If UBound(arrPath) = 0 Then ' does not contain "\"
FileName = Path
Else
FileName = arrPath(UBound(arrPath))
End If
Erase arrPath
End Function
Note the use of the Split function to count the row separators: VBA's string-handling is generally slow, especially when you concatenate strings, but there are a couple of places where VBA can perform a string manipulation without internal allocation and deallocation; if you know where they are, you'll find that parts of your code run as fast as a 'C' developer's best work.
Warning: Horrible Hack
Strictly speaking, I should declare Dim arrBytes(CHUNK_SIZE) As Byte and use this Byte array instead of strChunk to receive the Get from a file opened for binary read.
There are two reasons for not doing it the 'right' way:The last Get, which will set end-of-file TRUE, will extract less data from the file than the full 'chunk'. What happens next is that these last few bytes of the file are written into the array without clearing out the data from the previous 'Get'. So you have to do additional plumbing, counting bytes off against LOF(#hwndFile) to detect the 'Last Get' and branching into a statement that clears the buffer, or allocates a smaller byte array and uses that instead;The code will only cope with UTF-8 2-byte encoded character sets, or with single-byte encoded ASCII 'Latin' text if you do a bit of byte-array substitution around your row delimiters.The VBA String type is a byte array with a wrapper that allows your code (or rather, the compiler) to handle all that complexity in the background.
However, it's much faster to go back into the primordial C, using old-school Get statements, than using later libraries like Scripting.FileSystemObject. Also, you have some ability to examine the incoming data at the byte level, to debug issues where you're getting '???????' characters instead of the text you were expecting.
Anyway: this is late to the game, as StackOverflow answers go, and it's an answer to the less-interesting part of your question. But it's going to be interesting to people who need a quick rowcount in their data files, and your question comes at the top of the list when they search for that.

VBA Scripting Runtime Library with Unix file - LF as End Of Line instead of CRLF issue

I'm writing a very simple parser to read text files into Excel.
Files are exceeding available lines in Excel 2012, so I have to go for a line by line approach.
I've tested Microsoft Scripting Runtime Library, TextStream Object and ReadLine method.
It works fine as far as I have Windows files with CRLF as end of line, while it fails when only LF is marking end of line.
I've seen lots of solutions outside VBA, is there any viable solution within VBA?
thanks in advance
Start by making with a test file that uses only LFs as line endings in c:\temp\lfs.txt
This will read the file into a string:
Dim FileNum As Integer
Dim sBuf As String
Dim sTemp As String
FileNum = FreeFile
Open "c:\temp\lfs.txt" For Input As FileNum
While Not EOF(FileNum)
Line Input #FileNum, sBuf
sTemp = sTemp & sBuf
Wend
Close FileNum
' Now, what do we have? First the string itself:
Debug.Print sTemp
' Are there any CRs in it?
Debug.Print Replace(sTemp, vbCr, "CR")
' LFs?
Debug.Print Replace(sTemp, vbLf, "LF")
' Replace the LFs with CRLFs:
Debug.Print Replace(sTemp, vbLf, vbCrLf)
Now if you write it back out to file, you should be able to use it
I used the following very simple text file separated by LFs. The following vba code workes for me. May be you could post short sample of your text file data where the TextStream.ReadLine method fails with?
Private Const TEXT_FILE_PATH As String = "C:\Temp\VBA\textFileForUnix.txt"
Public Sub test()
Dim targetSheet As Worksheet
Set targetSheet = ThisWorkbook.ActiveSheet
ReadText sheet:=targetSheet, textFilePath:=TEXT_FILE_PATH
End Sub
Private Sub ReadText(ByRef sheet As Worksheet, ByRef textFilePath As String)
Dim scriptingFileSystem As Scripting.FileSystemObject
Dim scriptingFile As Scripting.File
Dim scriptingStream As Scripting.TextStream
Set scriptingFileSystem = New Scripting.FileSystemObject
Set scriptingFile = scriptingFileSystem.GetFile(textFilePath)
Set scriptingStream = scriptingFile.OpenAsTextStream(ForReading)
Dim r As Long
Dim c As Byte
With scriptingStream
r = 1
c = 1
Do While Not .AtEndOfStream
sheet.Cells(r, c).Value = .ReadLine
r = r + 1
Loop
.Close
End With
Set scriptingFile = Nothing
Set scriptingFileSystem = Nothing
End Sub
This one-liner will fix any "unusual" end-of-line I've ever came across to a proper CRLF. In particular it will fix either single LF's or single CR's or LF-CR's.
result$ = Replace(Replace(Replace(Replace(Replace(txt$, vbLf & vbCr, vbLf), vbCrLf, vbLf), vbLf, vbCrLf), vbCr, vbCrLf), vbCrLf & vbLf, vbCrLf)

Use VBS to copy from Notepad to Word

I'm trying to create a script to convert PDF to plain text, then copy the plain text into Word. (We do a lot of reformatting corrupt documents from scratch where I work.) I have a script that's working perfectly except for one thing: when pasting into Word, it doesn't paste the whole file. With longer files, I'm only getting part of the text.
'string to hold file path
Dim strDMM
strDMM = "[path]"
'make this directory if it doesn't exits
On Error Resume Next
MkDir strDMM
On Error GoTo 0
'get the file name to process
Dim TheFile
TheFile = InputBox("What is the file name?" & chr(13) & chr(13) & "(Example: [name].pdf)", "Name of File")
'declare some acrobat variables
Dim AcroXApp
Dim AcroXAVDoc
Dim AcroXPDDoc
'open acrobat
Set AcroXApp = CreateObject("AcroExch.App")
AcroXApp.Hide
'open the document we want
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open "[path to desktop]" & TheFile, "Acrobat" 'users are instructed to save to the Desktop for ease of access here
'make sure the acrobat window is active
AcroXAVDoc.BringToFront
'I don't know what this does. I copied it from code online.
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
'activate JavaScript commands w/Acrobat
Dim jsObj
Set jsObj = AcroXPDDoc.GetJSObject
'save the file as plain text
jsObj.SaveAs strDMM & "pdf-plain-text.txt", "com.adobe.acrobat.plain-text"
'close the file and exit acrobat
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit
'declare constants for manipulating the text files
Const ForReading = 1
Const ForWriting = 2
'Create a File System Object
Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
'read file and get text
dim objFile
set objFile=objFSO.OpenTextFile( strDMM & TheFile, ForReading)
Dim strText
strText=objFile.ReadAll
'Create a Word Object
Dim objWord
set objWord = CreateObject("Word.Application")
'make Word visible
With objWord
.Visible = True
End With
'Add method used to create a blank document
Dim objDoc
Set objDoc=objWord.Documents.Add()
'create a shorter variable to pass commands to Word
Dim objSelection
set objSelection=objWord.Selection
'type the read text into Word; this is the part that's failing
objSelection.TypeText strText
objFile.Close
I've tried multiple files with the same result. The funny thing is, it pastes the same material from file A each time, but when copying from file B, it pastes a different amount of material. In other words, if A gives me 8 pages of 60 on the first run, I get those same 8 pages each time. File B might give me 14 pages of 60, then it gives me the same 14 pages each time. This only changes if I delete material from the .txt file. If I delete several paragraphs from A, then run the script, I might get 12 pages. Then I get those same 12 every time. But there's no pattern (that I can discern) to predict where it cuts off.
I can't find any EOF characters, and when I read from notepad and write to notepad, the whole thing is copied perfectly. The problem is somewhere in the transfer to Word.
Is there something I'm missing? Is there a limit to the size of a string that Word can write with TypeText? (I would think that if that were the case, I wouldn't get documents of varying length, right? Shouldn't they all stop at n characters if that's the limit?)
I've read about additional libraries that let VBS work with the clipboard, but I'm a total noob and don't know if that's a more elegant solution or how to make it work. I'm also not sure that on my work computer I have the necessary access to install those libraries.
Any help is appreciated!
There is no need to read a file into Word, you can insert a text file from disk
Dim objWord
'Dim objDoc
Set objWord = CreateObject("Word.Application")
'make Word visible
With objWord
.Visible = True
'Add method used to create a blank document
.Documents.Add
.Selection.InsertFile FileNameAndPath
End With
The basic problem, which you hinted at, is that the String data type is limited to 65,400 characters. With an unknown file length, it is better to read in one line at a time and write it to Word. There is a good discussion of something similar here. The following code should help you get where you wan to go:
'read file and get text
dim objFile
set objFile=objFSO.OpenTextFile( strDMM & TheFile, ForReading)
'Don't do this!
'Dim strText
'strText=objFile.ReadAll
'Create a Word Object
Dim objWord
set objWord = CreateObject("Word.Application")
'make Word visible
With objWord
.Visible = True
End With
'Add method used to create a blank document
Dim objDoc
Set objDoc=objWord.Documents.Add()
'create a shorter variable to pass commands to Word
Dim objSelection
set objSelection=objWord.Selection
'Read one line at a time from the text file and
'type that line into Word until the end of the file is reached
Dim strLine
Do Until objFile.AtEndOfStream
strLine = objFile.ReadLine
objSelection.TypeText strLine
Loop
objFile.Close
Hope that helps!