Limiting the amount of files grabbed from system.io.directory.getfiles - vb.net

I've got a folder browser dialog populating the directory location (path) of a System.IO.Directory.GetFiles call. The issue is that if you accidentally select a folder with hundreds or thousands of files (which you would never need to do in this app), it locks up the app while it grabs all the files. All I'm grabbing are the file locations as strings, and I want to put a limit on the number of files that can be grabbed. Here's my current code that isn't working.
If JigFolderBrowse.ShowDialog = DialogResult.OK Then
    Dim dirs(50) As String
    dirs = System.IO.Directory.GetFiles(JigFolderBrowse.SelectedPath.ToString, "*", System.IO.SearchOption.AllDirectories)
    If dirs.Length > 50 Then
        MsgBox("Too Many Files Selected" + vbNewLine + "Select A Smaller Folder To Be Organized")
        Exit Sub
    End If
    'Separate Each File By Type
    For i = 0 To dirs.Length - 1
        If Not dirs(i).Contains("~$") Then
            If dirs(i).Contains(".SLDPRT") Or dirs(i).Contains(".sldprt") Then
                PartsListBx.Items.Add(dirs(i))
            ElseIf dirs(i).Contains(".SLDASM") Or dirs(i).Contains(".sldasm") Then
                AssemListBx.Items.Add(dirs(i))
            ElseIf dirs(i).Contains(".SLDDRW") Or dirs(i).Contains(".slddrw") Then
                DrawingListBx.Items.Add(dirs(i))
            ElseIf dirs(i).Contains(".pdf") Or dirs(i).Contains(".PDF") Then
                PDFsListBx.Items.Add(dirs(i))
            ElseIf dirs(i).Contains(".DXF") Or dirs(i).Contains(".dxf") Then
                DXFsListBx.Items.Add(dirs(i))
            ElseIf Not dirs(i).Contains(".db") Then
                OtherFilesListBx.Items.Add(dirs(i))
            End If
        End If
    Next
End If

The Directory.GetFiles method always retrieves the full list of matching files before returning. There is no way to limit it (outside of specifying a more narrow search pattern, that is). There is, however, the Directory.EnumerateFiles method which does what you need. From the MSDN article:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
So, for instance, you could do something like this:
dirs = Directory.
    EnumerateFiles(
        JigFolderBrowse.SelectedPath.ToString(),
        "*",
        SearchOption.AllDirectories).
    Take(50).
    ToArray()
Take is a LINQ extension method which returns only the first x-number of items from any IEnumerable(Of T) list. So, in order for that line to work, you'll need to import the System.Linq namespace. If you can't, or don't want to, use LINQ, you can just implement your own method that does the same sort of thing (iterates an IEnumerable list in a for loop and returns after reading only the first 50 items).
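If you go the non-LINQ route, a hand-rolled version could look roughly like the sketch below. The names EnumerationHelpers and TakeFirst are made up for illustration; they are not part of the framework.

```vbnet
Imports System.Collections.Generic

Module EnumerationHelpers
    ' Hypothetical helper (not a framework method): returns at most `limit`
    ' items from `source` and stops enumerating once the limit is reached.
    Public Function TakeFirst(Of T)(source As IEnumerable(Of T), limit As Integer) As List(Of T)
        Dim result As New List(Of T)()
        If limit <= 0 Then Return result
        For Each item As T In source
            result.Add(item)
            ' Leave the loop early so the rest of the sequence is never read.
            If result.Count >= limit Then Exit For
        Next
        Return result
    End Function
End Module
```

Passing the result of Directory.EnumerateFiles to TakeFirst with a limit of 51 (one more than the cap) would let you keep your "Too Many Files Selected" check: if the returned list has more than 50 items, the folder is too big, and the scan stopped after the 51st file either way.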
Side Note 1: Unused Array
Also, it's worth mentioning, in your code, you initialize your dirs variable to point to a 50-element string array. You then, in the very next line, set it to point to a whole new array (the one returned by the Directory.GetFiles method). While it's not breaking functionality, it is unnecessarily inefficient. You're creating that extra array, just giving the garbage collector extra work to do, for no reason. You never use that first array. It just gets dereferenced and discarded in the very next line. It would be better to create the array variable as null:
Dim dirs() As String
Or
Dim dirs() As String = Nothing
Or, better yet:
Dim dirs() As String = Directory.
    EnumerateFiles(
        JigFolderBrowse.SelectedPath.ToString(),
        "*",
        SearchOption.AllDirectories).
    Take(50).
    ToArray()
Side Note 2: File Extension Comparisons
Also, it looks like you are trying to compare the file extensions in a case-insensitive way. There are two problems with the way you are doing it. First, you are only comparing against two values: all lowercase (e.g. ".pdf") and all uppercase (e.g. ".PDF"). That won't work with mixed case (e.g. ".Pdf").
It is admittedly annoying that the String.Contains method does not have a case-sensitivity option. So, while it's a little hokey, the best option would be to make use of the String.IndexOf method, which does have a case-insensitive option:
If dirs(i).IndexOf(".pdf", StringComparison.CurrentCultureIgnoreCase) <> -1 Then
However, the second problem, which invalidates my last point of advice, is that you are checking whether the string contains the particular file extension rather than whether it ends with it. So, for instance, a file name like "My.pdf.zip" will still match, even though its extension is ".zip" rather than ".pdf". Perhaps this was your intent, but, if not, I would recommend using the Path.GetExtension method to get the actual extension of the file name and then compare that. For instance:
Dim ext As String = Path.GetExtension(dirs(i)) ' includes the leading dot, e.g. ".pdf"
If ext.Equals(".pdf", StringComparison.CurrentCultureIgnoreCase) Then
' ...
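Applied to the whole If/ElseIf chain from the question, the extension check might be sketched as below. FileSorting and GetCategory are hypothetical names, the returned strings simply stand in for the six list boxes, and treating "~$" as a file-name prefix (rather than matching it anywhere in the path) is an assumption about what that check was meant to do.

```vbnet
Imports System
Imports System.IO

Module FileSorting
    ' Hypothetical helper: returns the name of the list a file belongs in,
    ' or Nothing for files that should be skipped.
    Public Function GetCategory(filePath As String) As String
        ' Assumption: "~$" marks temporary files and appears at the start of the name.
        If Path.GetFileName(filePath).StartsWith("~$", StringComparison.Ordinal) Then
            Return Nothing
        End If

        ' GetExtension includes the leading dot; upper-casing makes the match case-insensitive.
        Select Case Path.GetExtension(filePath).ToUpperInvariant()
            Case ".SLDPRT"
                Return "Parts"
            Case ".SLDASM"
                Return "Assemblies"
            Case ".SLDDRW"
                Return "Drawings"
            Case ".PDF"
                Return "PDFs"
            Case ".DXF"
                Return "DXFs"
            Case ".DB"
                Return Nothing
            Case Else
                Return "Other"
        End Select
    End Function
End Module
```

With this, "My.pdf.zip" lands in "Other" and "Plan.Pdf" lands in "PDFs", which is the behavior the Contains-based checks could not give you.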

Related

How can I split names of directories in certain path from its full path (VB.NET)

Hello world!
I've run into a problem. I am getting the directories contained in a certain path, and I need to reduce the full paths VB.NET gives me, like this:
"D:\ApplicationFolder\Addons\Pack_1",
"D:\ApplicationFolder\Addons\Pack_2" ...
Only into this:
"Pack_1", "Pack_2"
So far I've tried this, but I can't get into a solution, I am lost...
Dim ADDONPACKS_DIRECTORIES As String() = Directory.GetDirectories(ADDONS_PATH) ' GETTING ALL DIRECTORIES (PATHS) IN THIS PATH
For Each ADDONPACKS_DIRECTORY In ADDONPACKS_DIRECTORIES ' TRYING TO SPLIT FULL PATH OF THESE DIRECTORIES TO GET ONLY THE NAME OF THESE DIRECTORIES
ADDONPACKS_DIRECTORY.Split()
Dim ADDONPACKS_LENGTH As Integer = ADDONPACKS_DIRECTORY.Length()
MsgBox(ADDONPACKS_DIRECTORY(2))
Next
' Here I want to assign names of these directories onto a label. But the fields only show letters instead of the path segments.
Addonpack1.Text = ADDONPACKS_DIRECTORIES(0)
Addonpack2.Text = ADDONPACKS_DIRECTORIES(1)
Addonpack3.Text = ADDONPACKS_DIRECTORIES(2)
Addonpack4.Text = ADDONPACKS_DIRECTORIES(3)
Addonpack5.Text = ADDONPACKS_DIRECTORIES(4)
'Addonpack6.Text = ADDONPACKS_DIRECTORY(5)
Any ideas? I really appreciate further help.
string.Split() is a Function: it returns a value.
Here: ADDONPACKS_DIRECTORY.Split(), you are splitting the string using the default separator (a white space) but the result is not assigned to anything, so it's lost (but it wouldn't be useful anyway).
This: MsgBox(ADDONPACKS_DIRECTORY(2)), will show only one char of the current Directory path. A string is a collection (an array) of chars. You're asking to show the 3rd.
If you think you won't need the complete directory listing anymore, you could Split the initial collection directly:
Dim ADDONPACKS_DIRECTORIES As String() = Directory.GetDirectories(ADDONS_PATH).
Select(Function(d) d.Split("\"c).Last()).ToArray()
Addonpack1.Text = ADDONPACKS_DIRECTORIES(0)
'(...)
If you instead are going to use that collection of Paths later, you could Split each path and assign the result to each TextBox.Text property, leaving the original collection untouched:
Addonpack1.Text = ADDONPACKS_DIRECTORIES(0).Split("\"c).Last()
Addonpack2.Text = ADDONPACKS_DIRECTORIES(1).Split("\"c).Last()
'(...)
Do you know beforehand how many Addons you will have?
If not, a TextBox for each path might not be the right object to use as the output.
Maybe you could use a single multiline TextBox. Its Lines property will hold the array of all the sub-paths you appended.
Using the first snippet, it could be something like this:
For Each subpath As String In ADDONPACKS_DIRECTORIES
TextBox1.AppendText(subpath & Environment.NewLine)
Next
Note:
As LarsTech noted in the comments, you could use Path.GetFileName() instead of splitting the path using the path separator.
It works for both file names and directory names, because Path.GetFileName simply returns the characters that follow the last directory separator in the string, whether that last segment names a file or a folder. (If the path ends with a separator, it returns an empty string.)
Addonpack1.Text = Path.GetFileName(ADDONPACKS_DIRECTORIES(0))
'(...)

Getting a count of "find" from multiple .txt files

I am currently running a find and replace through hundreds of .txt files at a time. I am looking for a way to pull a count of the number of finds for my value.
Here is the code I am currently running, hoping to be able to add to or modify this code.
Dim flatfiles As String() = IO.Directory.GetFiles("C:\DATA\TEST\", "*.txt").Where(Function(x) File.ReadAllText(x).Contains("Bob")).ToArray
For Each f As String In flatfiles
Dim contents As String = File.ReadAllText(f)
File.WriteAllText(f, contents.Replace("Bob", "Bill"))
Next
One (inefficient) way to do this would be to include a counter outside of the For Each loop (Dim itemsFound as Integer = 0), then increment it by the count of find in each file, using something like:
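itemsFound = itemsFound + (Regex.Split(contents, Regex.Escape(find)).Length - 1)
Regex.Split splits the string wherever it finds find, which means the count you're looking for is one less than the number of items in the returned array. Regex.Escape makes sure the search text is treated literally rather than as a regular-expression pattern.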
I would say as well, that you're calling File.ReadAllText twice in your code, so you could improve it by getting rid of the Where code, and just check in your For each loop (seeing as you're now counting the number of instances in the file anyway, it's easy enough to check for 0 occurrences). Alternatively, you could replace the .Where code to store the contents of the file in an array rather than the name of the files (although this can be dangerous if the files are large); or you could even just do it all in Linq if you want some obfuscated code...
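As a rough sketch of the single-read approach, counting with String.IndexOf instead of Regex.Split (a deliberate swap, to avoid building an array just to count it): the module and function names here are invented for the example, and the match is case-sensitive, like String.Replace.

```vbnet
Imports System
Imports System.IO

Module FindAndReplace
    ' Counts non-overlapping, case-sensitive occurrences of `find` in `text`.
    Public Function CountOccurrences(text As String, find As String) As Integer
        If String.IsNullOrEmpty(find) Then Return 0
        Dim count As Integer = 0
        Dim index As Integer = text.IndexOf(find, StringComparison.Ordinal)
        While index >= 0
            count += 1
            ' Continue searching just past the match that was found.
            index = text.IndexOf(find, index + find.Length, StringComparison.Ordinal)
        End While
        Return count
    End Function

    ' Reads each .txt file in `folder` once, replaces `find` with `replacement`
    ' where it occurs, and returns the total number of occurrences found.
    Public Function ReplaceInFolder(folder As String, find As String, replacement As String) As Integer
        Dim total As Integer = 0
        For Each f As String In Directory.GetFiles(folder, "*.txt")
            Dim contents As String = File.ReadAllText(f)
            Dim found As Integer = CountOccurrences(contents, find)
            If found > 0 Then
                File.WriteAllText(f, contents.Replace(find, replacement))
                total += found
            End If
        Next
        Return total
    End Function
End Module
```

Calling ReplaceInFolder("C:\DATA\TEST\", "Bob", "Bill") would then do the replacement from the question and hand back the overall count, and files with no match are never rewritten.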

VB.NET (2013) - Check string against huge file

I have a text file that is 125Mb in size, it contains 2.2 million records. I have another text file which doesn't match the original but I need to find out where it differs. Normally, with a smaller file I would read each line and process it in some way, or read the whole file into a string and do likewise, however the two files are too big for that and so I would like to create something to achieve my goal. Here's what I currently have.. excuse the mess of it.
Private Sub refUpdateBtn_Click(sender As Object, e As EventArgs) Handles refUpdateBtn.Click
Dim refOrig As String = refOriginalText.Text 'Original Reference File
Dim refLatest As String = refLatestText.Text 'Latest Reference
Dim srOriginal As StreamReader = New StreamReader(refOrig) 'start stream of original file
Dim srLatest As StreamReader = New StreamReader(refLatest) 'start stream of latest file
Dim recOrig, recLatest, baseDIR, parentDIR, recOutFile As String
baseDIR = vb.Left(refOrig, InStrRev(refOrig, ".ref") - 1) 'find parent folder
parentDIR = Path.GetDirectoryName(baseDIR) & "\"
recOutFile = parentDIR & "Updated.ref"
Me.Text = "Processing Reference File..." 'update the application
Update()
If Not File.Exists(recOutFile) Then
FileOpen(55, recOutFile, OpenMode.Append)
FileClose(55)
End If
Dim x As Integer = 0
Do While srLatest.Peek() > -1
Application.DoEvents()
recLatest = srLatest.ReadLine
recOrig = srOriginal.ReadLine ' check the original reference file
Do
If Not recLatest.Equals(recOrig) Then
recOrig = srOriginal.ReadLine
Else
FileOpen(55, recOutFile, OpenMode.Append)
Print(55, recLatest & Environment.NewLine)
FileClose(55)
x += 1
count.Text = "Record No: " & x
count.Refresh()
srOriginal.BaseStream.Seek(0, SeekOrigin.Begin)
GoTo 1
End If
Loop
1:
Loop
srLatest.Close()
srOriginal.Close()
FileClose(55)
End Sub
It's got poor programming and scary loops, but that's because I'm not a professional coder, just a guy trying to make his life easier.
Currently, this uses a form to insert the original file and the latest file and outputs each line that matches into a new file. This is less than perfect, but I don't know how to cope with the large file sizes as streamreader.readtoend crashes the program. I also don't need the output to be a copy of the latest input, but I don't know how to only output the records it doesn't find. Here's a sample of the records each file has:
doc:ARCHIVE.346CCBD3B06711E0B40E00163505A2EF
doc:ARCHIVE.346CE683B29811E0A06200163505A2EF
doc:ARCHIVE.346CEB15A91711E09E8900163505A2EF
doc:ARCHIVE.346CEC6AAA6411E0BEBB00163505A2EF
The program I have currently works... after a fashion. However, I know there are better ways of doing it, and I'm sure much better ways of using the CPU and memory, but I don't know this level of programming. All I would like is for you to take a look and offer your best answers to all or some of the code. Tell me what you think will make it better, what will help with one line, or all of it. I have no time limit on this because the code works, albeit slowly. I would just like someone to tell me where my code could be better and what I could do to get round the huge file sizes.
Your code is slow because it is doing a lot of file IO. You're on the right track by reading one line at a time, but this can be improved.
Firstly, I've created some test files based off the data that you provided. Those files contain three million lines and are about 130 MB in size (2.2 million records was less than 100 MB so I've increased the number of lines to get to the file size that you state).
Reading the entire file into a single string uses up about 600 MB of memory. Do this with two files (which I assume you were doing) and you have over 1 GB of memory used, which may have been causing the crash (you don't say what error was shown, if any, when the crash occurred, so I can only assume that it was an OutOfMemoryException).
Here are a few tips before I go through your code:
Use Using Blocks
This won't help with performance, but it does make your code cleaner and easier to read.
Whenever you're dealing with a file (or anything that implements the IDisposable interface), it's always a good idea to use a Using statement. This will automatically dispose of the file (which closes the file), even if an error happens.
Don't use FileOpen
The FileOpen method is outdated (and even stated as being slow in its documentation). There are better alternatives that you are already (almost) using: StreamWriter (the cousin of StreamReader).
Opening and closing a file two million times (like you are doing inside your loop) won't be fast. This can be improved by opening the file once outside the loop.
DoEvents() is evil!
DoEvents is a legacy method from back in the VB6 days, and it's something that you really want to avoid, especially when you're calling it two million times in a loop!
The alternative is to perform all of your file processing on a separate thread so that your UI is still responsive.
Using a separate thread here is probably overkill, and there are a number of intricacies that you need to be aware of, so I have not used a separate thread in the code below.
So let's look at each part of your code and see what we can improve.
Creating the output file
You're almost right here, but you're doing some things that you don't need to do. GetDirectoryName works with file names, so there's no need to remove the extension from the original file name first. You can also use the Path.Combine method to combine a directory and file name.
recOutFile = Path.Combine(Path.GetDirectoryName(refOrig), "Updated.ref")
Reading the files
Since you're looping through each line in the "latest" file and finding a match in the "original" file, you can continue to read one line at a time from the "latest" file.
But instead of reading a line at a time from the "original" file, then seeking back to the start when you find a match, you will be better off reading all of those lines into memory.
Now, instead of reading the entire file into memory (which took up 600 MB as I mentioned earlier), you can read each line of the file into an array. This will use up less memory, and is quite easy to do thanks to the File class.
originalLines = File.ReadAllLines(refOrig)
This reads all of the lines from the file and returns a String array. Searching through this array for matches will be slow, so instead of reading into an array, we can read into a HashSet(Of String). This will use up a bit more memory, but it will be much faster to search through.
originalLines = New HashSet(Of String)(File.ReadAllLines(refOrig))
Searching for matches
Since we now have all of the lines from the "original" file in an array or HashSet, searching for a line is very easy.
originalLines.Contains(recLatest)
Putting it all together
So let's put all of this together:
Private Sub refUpdateBtn_Click(sender As Object, e As EventArgs) Handles refUpdateBtn.Click
    Dim refOrig As String
    Dim refLatest As String
    Dim recOutFile As String
    Dim originalLines As HashSet(Of String)
    refOrig = refOriginalText.Text 'Original Reference File
    refLatest = refLatestText.Text 'Latest Reference
    recOutFile = Path.Combine(Path.GetDirectoryName(refOrig), "Updated.ref")
    Me.Text = "Processing Reference File..." 'update the application
    Update()
    originalLines = New HashSet(Of String)(File.ReadAllLines(refOrig))
    ' Both files are opened once and closed automatically at End Using.
    Using latest As New StreamReader(refLatest),
          updated As New StreamWriter(recOutFile, True)
        Do
            Dim line As String
            line = latest.ReadLine()
            ' ReadLine returns Nothing when it reaches the end of the file.
            If line Is Nothing Then
                Exit Do
            End If
            If originalLines.Contains(line) Then
                updated.WriteLine(line)
            End If
        Loop
    End Using
End Sub
This uses around 400 MB of memory and takes about 4 seconds to run.
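You also mentioned wanting to output only the records that are not found in the original file. One possible way to do that is to invert the Contains check, sketched here as a standalone function. ReferenceDiff and WriteNewLines are illustrative names only, the sketch uses File.ReadLines instead of ReadAllLines so the lines are streamed into the HashSet, and it overwrites the output file rather than appending to it (both are my choices, not something taken from your code). I have not timed this variant.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module ReferenceDiff
    ' Writes every line of latestPath that does not occur in originalPath
    ' to outputPath and returns how many lines were written.
    Public Function WriteNewLines(originalPath As String, latestPath As String, outputPath As String) As Integer
        ' File.ReadLines streams the file, so only the HashSet holds all the lines.
        Dim originalLines As New HashSet(Of String)(File.ReadLines(originalPath))
        Dim written As Integer = 0
        ' False = overwrite any existing output file instead of appending to it.
        Using updated As New StreamWriter(outputPath, False)
            For Each line As String In File.ReadLines(latestPath)
                If Not originalLines.Contains(line) Then
                    updated.WriteLine(line)
                    written += 1
                End If
            Next
        End Using
        Return written
    End Function
End Module
```

In the click handler this would be called as WriteNewLines(refOrig, refLatest, recOutFile).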

Best Way to Sort GetFiles

I am trying to sort the following files in this order:
TMP_SDF_1180741.PDF
TMP_SDF_1179715.PDF
TMP_SDF_1162371.PDF
TMP_SDF_1141511.PDF
TMP_SDF_1131750.PDF
TMP_SDF_1117362.PDF
TMP_SDF_1104199.PDF
TMP_SDF_1082698.PDF
TMP_SDF_1062921.PDF
TMP_SDF_1043875.PDF
TMP_SDF_991514.PDF
TMP_SDF_970621.PDF
TMP_SDF_963154.PDF
TMP_SDF_952954.PDF
TMP_SDF_948067.PDF
TMP_SDF_917669.PDF
TMP_SDF_904315.PDF
TMP_SDF_899902.PDF
TMP_SDF_892398.PDF
TMP_SDF_882024.PDF
But the actual output is this:
TMP_SDF_991514.PDF
TMP_SDF_970621.PDF
TMP_SDF_963154.PDF
TMP_SDF_952954.PDF
TMP_SDF_948067.PDF
TMP_SDF_917669.PDF
TMP_SDF_904315.PDF
TMP_SDF_899902.PDF
TMP_SDF_892398.PDF
TMP_SDF_882024.PDF
TMP_SDF_1180741.PDF
TMP_SDF_1179715.PDF
TMP_SDF_1162371.PDF
TMP_SDF_1141511.PDF
TMP_SDF_1131750.PDF
TMP_SDF_1117362.PDF
TMP_SDF_1104199.PDF
TMP_SDF_1082698.PDF
TMP_SDF_1062921.PDF
TMP_SDF_1043875.PDF
I have tried researching sort methods for GetFiles, but when I apply them, I get errors about system collections not being able to bind to a 1-dimensional array, and it is frustrating. Here is my code:
Dim di As New IO.DirectoryInfo("C:\temp")
Dim aryFi As IO.FileInfo() = di.GetFiles("*.PDF")
Dim fi As IO.FileInfo
For Each fi In aryFi
My.Computer.FileSystem.RenameFile("C:\TEMP\" & fi.Name, listBox1.SelectedItem.ToString & ".pdf")
listBox1.SelectedIndex = listBox1.SelectedIndex - 1
Next
I am renaming the files to a1, a2, a3, etc. so that when I combine them into one PDF they are in chronological order. The order I want would put them in chronological order. I am sure there is an easier way. As you can tell, the higher the number in the PDF file name (1180741), the more recent the content of the file, while 882024 would be the oldest content.
As has been stated in the comments, you need to sort them numerically rather than alphabetically. I don't know the specific sorting algorithm that is used by Windows Explorer, or if it's possible to use the same library, but it's certainly possible to write your own algorithm that sorts however you want.
The first step in doing that is to extract just the numeric part that you want to use as the sort key. Without knowing more details, it's hard to say what the best option for that would be. If you know that the number always starts at a particular character position in the string, you could simply use String.Substring. If it's always delimited by "_" and ".", you could use String.Split. If you need something more complex, or if you need the parsing rules to be configurable, you may want to consider using Regex. Here's a simple example method that uses String.Split:
Public Function GetSortKey(fileName As String) As Integer
Return Integer.Parse(fileName.Split({"_"c, "."c})(2))
End Function
Once you have a method that extracts the sort key for a given file name, you can use it to sort them like this (use OrderByDescending instead if you want the highest number first, as in your desired listing):
di.GetFiles("*.PDF").OrderBy(Function(x) GetSortKey(x.Name))
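To see the difference from an alphabetical sort without touching the disk, here is a small sketch that works on plain file-name strings. NumericSort and SortNewestFirst are names I made up; GetSortKey is the same Split-based idea as above and assumes every name has the TMP_SDF_number.PDF shape.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module NumericSort
    ' Extracts the number between the second "_" and the "." as an Integer.
    ' Assumes names shaped like "TMP_SDF_1180741.PDF"; other shapes will throw.
    Public Function GetSortKey(fileName As String) As Integer
        Return Integer.Parse(fileName.Split({"_"c, "."c})(2))
    End Function

    ' Orders the names by their numeric part, highest (most recent) first.
    Public Function SortNewestFirst(fileNames As IEnumerable(Of String)) As List(Of String)
        Return fileNames.OrderByDescending(Function(n) GetSortKey(n)).ToList()
    End Function
End Module
```

Because the keys are compared as integers, TMP_SDF_1043875.PDF sorts ahead of TMP_SDF_991514.PDF, whereas a plain string comparison puts "9..." after "1...".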
Perhaps you could take advantage of some tools that you have at your hands
Sub Main()
    Dim reg As New Regex("\d+")
    Dim ordered As New List(Of OrderedFiles)()
    For Each s In Directory.GetFiles("C:\temp", "*.PDF")
        Dim aFile As New OrderedFiles()
        aFile.FileName = s
        ' Match against the file name only, so digits elsewhere in the path are ignored
        aFile.Sequence = Convert.ToInt32(reg.Match(Path.GetFileName(s)).Value)
        ordered.Add(aFile)
    Next
    For Each aFile In ordered.OrderByDescending(Function(x) x.Sequence)
        Console.WriteLine(Path.GetFileName(aFile.FileName))
    Next
End Sub

Class OrderedFiles
    Public FileName As String
    Public Sequence As Integer
End Class
In this example you have a custom class that holds the file name and the numeric part you want to sort by. A regular expression that matches the first run of digits is applied to each file name to build an instance of the class with the name and the numeric part. After the loop, just call the LINQ method that orders your list in descending order.

Populate a userform Combobox with a list of subdirectory names in a defined directory

I apologize if this question was answered previously on this board. My searches didn't turn up what I'm looking for. I am a VBA novice and would like to know if there is a way to populate a userform combobox with the names of all subdirectories contained within a predefined directory (I need the list to be updated every time the userform is launched). I've seen some code that does this on other websites but they were for earlier versions of Excel and I could not get them to work. I am using Excel 2007. I appreciate any help you may be able to provide.
Option Explicit

Private Sub UserForm_Initialize()
    Dim name
    For Each name In ListDirectory(Path:="C:\", AttrInclude:=vbDirectory, AttrExclude:=vbSystem Or vbHidden)
        Me.ComboBox1.AddItem name
    Next name
End Sub

Function ListDirectory(Path As String, AttrInclude As VbFileAttribute, Optional AttrExclude As VbFileAttribute = False) As Collection
    Dim Filename As String
    Dim Attribs As VbFileAttribute
    Set ListDirectory = New Collection
    ' first call to Dir() initializes the list; Path must end with a backslash
    Filename = Dir(Path, AttrInclude)
    While Filename <> ""
        ' skip the "." and ".." entries that Dir() returns for non-root folders
        If Filename <> "." And Filename <> ".." Then
            Attribs = GetAttr(Path & Filename)
            ' to be added, a file must have an included attribute and no excluded one.
            ' Compare against 0: Not is bit-wise, so Not (Attribs And AttrExclude)
            ' stays non-zero even when an excluded bit is set.
            If (Attribs And AttrInclude) <> 0 And (Attribs And AttrExclude) = 0 Then
                ListDirectory.Add Filename, Path & Filename
            End If
        End If
        ' fetch next filename
        Filename = Dir
    Wend
End Function
A few notes, since you said you had little experience with VBA.
Always have Option Explicit in effect. No excuses.
Dir() is used in VB to list files.
Collections are a lot more convenient than arrays in VBA.
There are named parameters available in function calls (name:=value). You don't have to use them, but they help to make sense of long argument lists. Argument order is irrelevant if you use named parameters. You can mix the two styles only if the unnamed (positional) arguments come first.
You can have optional arguments with default values.
Note that assigning to the function name (ListDirectory in this case) sets the result of a function. You can therefore use the function name directly as a variable inside that function.
Set AttrInclude to -1 if you want to return all types of files. Conveniently, -1 is the numerical value of True, i.e. ListDirectory("C:\", True).
Set AttrExclude to 0 if you want to exclude no files. Conveniently, 0 is the numerical value of False, i.e. ListDirectory("C:\", True, False), which also is the default.
All logical operators in VB 6.0 are bit-wise, hence you can check whether a file is a directory by using If Attribs And VbDirectory Then ...
You can combine multiple bit values with Or, e.g. vbSystem Or vbHidden.
Consequently, you can filter directories with a simple bit-wise logic check.
Use the Object Browser (hit F2) to inspect available Functions, Types and Constants, for example the constants in the VbFileAttribute enum.