File.OpenRead is too slow - vb.net

I have an endpoint written in VB.NET that returns a Stream (it lets the caller download a zip file). The problem is that when the file is quite large (> 300 MB), the "File.OpenRead" call takes a few tens of seconds (the file is on the network). Is there anything I can do to speed up reading the file?
Private Function DownloadZipFile() As Stream Implements ILUpdateWS.DownloadZipFile
    ...
    Dim fstream = File.OpenRead(downloadFilePath)
    Return fstream
    ...
End Function

Related

Subtleties in Reading/Writing Binary Data with Java

One phenomenon I've noticed with Java file reads using a byte-array buffer is that, just like with C's fread(), if I don't account for the length of the final read and the total size of the data is not a multiple of the buffer size, excess garbage data can end up in the output file. When performing binary I/O this way, some copied files would come out somewhat corrupted.
The garbage values are presumably values left in the buffer from an earlier read that were not overwritten, since the final read does not fill the whole buffer.
While looking over various tutorials, all the methods for reading binary data were similar to the code below:
InputStream inputStream = new FileInputStream("prev_font.ttf");;
OutputStream outputStream = new FileOutputStream("font.ttf");
byte buffer[] = new byte[512];
while((read = inputStream.read(buffer)) != -1)
{
    outputStream.write(buffer, 0, read);
}
outputStream.close();
inputStream.close();
But when reading from an input stream for a file packaged inside a JAR, I couldn't make a proper copy of the file. The output would be an invalid file of that type.
Since I was quite new to JAR access, I could not pinpoint whether the issue was with my resource file pathing or something else. So it took quite a bit of time to realize what was going on.
All the code I came across was missing a vital detail: the amount written should not be the whole buffer, but only the number of bytes actually read:
InputStream inputStream = new FileInputStream("prev_font.ttf");
OutputStream outputStream = new FileOutputStream(font.ttf");
byte dataBuffer[] = new byte[512];
int read;
while((read = inputStream.read(dataBuffer)) != -1)
{
    outputStream.write(dataBuffer, 0, read);
}
outputStream.close();
inputStream.close();
Now that's all fine, but why was something so major not mentioned in any of the tutorials? Did I simply look at bad tutorials, or was Java supposed to handle the overflow reads and my implementation was off somehow? It was simply unexpected.
Please correct me if any of my statements were wrong, and kindly provide alternative solutions to handling the issue if there are any.
There isn't much difference between the code blocks you've provided except for minor typos which mean that they won't compile. The buffer is not corrupted by read, but the output file is corrupted if the number of bytes read is not provided to the writer for each iteration of the loop.
To copy a file - say src -> dst - just use try-with-resources and the built-in transferTo:
Path src = Path.of("prev_font.ttf");
Path dst = Path.of("font.ttf");
try (InputStream in = Files.newInputStream(src);
     OutputStream out = Files.newOutputStream(dst)) {
    in.transferTo(out);
}
Or call one of the built-in methods of Files:
Files.copy(src, dst);
// or
Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);

FileStream faster way to read and write big file

I have a speed and memory-efficiency problem. I'm reading and cutting a big chunk of bytes from the .bin files and then writing it to another file; the problem is that to read the file I need to create a huge byte array for it:
Dim data3(endOFfile) As Byte ' endOFfile is around 270 MB here
Using fs As New FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None)
    fs.Seek(startOFfile, SeekOrigin.Begin)
    fs.Read(data3, 0, endOFfile)
End Using
Using vFs As New FileStream(Environment.GetFolderPath(Environment.SpecialFolder.Desktop) & "\test.bin", FileMode.Create) 'save
    vFs.Write(data3, 0, endOFfile)
End Using
so the procedure takes a long time. What's a more efficient way to do it?
Can I somehow read and write in the same file stream without using a bytes array?
I've never done it this way, but I would think that the Stream.CopyTo method would be the easiest option and as quick as anything.
Using inputStream As New FileStream(...),
      outputStream As New FileStream(...)
    inputStream.CopyTo(outputStream)
End Using
I'm not sure whether that overload will read all the data in one go or use a default buffer size. If it's the former or you want to specify a buffer size other than the default, there's an overload for that:
inputStream.CopyTo(outputStream, bufferSize)
You can experiment with different buffer sizes to see whether it makes a difference to performance. Smaller is better for memory usage but I would expect bigger to be faster, at least up to a point.
Note that the CopyTo method requires at least .NET Framework 4.0. If you're executing this code on the UI thread, you might like to call CopyToAsync instead, to avoid freezing the UI. The same two overloads are available, plus a third that accepts a CancellationToken. I'm not going to teach you how to use Async/Await here, so research that yourself if you want to go that way. Note that CopyToAsync requires at least .NET Framework 4.5.
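For illustration, here is a rough C# sketch (it translates readily to VB.NET) that combines the Seek from the question with CopyToAsync. The paths, the startOffset parameter and the 81920-byte buffer (the framework's default CopyTo buffer size) are assumptions, and note that CopyTo copies from the current position to the end of the stream, so if you only need part of the file you would still need a bounded read loop:
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Copy everything from startOffset to the end of the source file into a new
// file, without allocating one huge byte array.
static async Task CopyChunkAsync(string sourcePath, string destPath,
                                 long startOffset, CancellationToken token)
{
    const int bufferSize = 81920; // framework default; experiment and measure

    using (var input = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
    using (var output = new FileStream(destPath, FileMode.Create, FileAccess.Write))
    {
        input.Seek(startOffset, SeekOrigin.Begin); // skip the leading part you don't want
        await input.CopyToAsync(output, bufferSize, token); // streams in bufferSize chunks
    }
}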

Reading a CSV file with 50M lines, how to improve performance

I have a data file in CSV (Comma-Separated-Value) format that has about 50 million lines in it.
Each line is read into a string, parsed, and then used to fill in the fields of an object of type FOO. The object then gets added to a List(of FOO) that ultimately has 50 million items.
That all works, and fits in memory (at least on an x64 machine), but it's SLOW. It takes about 5 minutes every time I load and parse the file into the list. I would like to make it faster. How can I make it faster?
The important parts of the code are shown below.
Public Sub LoadCsvFile(ByVal FilePath As String)
    Dim s As IO.StreamReader = My.Computer.FileSystem.OpenTextFileReader(FilePath)
    'Find header line
    Dim L As String
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Exit While
    End While
    'Parse data lines
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Dim T As FOO = FOO.FromCSV(L)
        Add(T)
    End While
    s.Close()
End Sub
Public Class FOO
    Public time As Date
    Public ID As UInt64
    Public A As Double
    Public B As Double
    Public C As Double
    Public Shared Function FromCSV(ByVal X As String) As FOO
        Dim T As New FOO
        Dim tokens As String() = X.Split(",")
        If Not DateTime.TryParse(tokens(0), T.time) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ISO 8601 timestamp")
        End If
        If Not UInt64.TryParse(tokens(1), T.ID) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ID")
        End If
        If Not Double.TryParse(tokens(2), T.A) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for A")
        End If
        If Not Double.TryParse(tokens(3), T.B) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for B")
        End If
        If Not Double.TryParse(tokens(4), T.C) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for C")
        End If
        Return T
    End Function
End Class
I did some benchmarking and here are the results.
The complete algorithm above took 314 seconds to load the whole file and put the objects into the list.
With the body of FromCSV() reduced to just returning a new object of type FOO with default field values, the whole process took 84 seconds. Therefore it appears that processing the line of text into the object fields is taking 230 seconds (73% of the total time).
Doing everything but parsing the ISO 8601 date string takes 175 seconds. Therefore it appears that processing the date string takes 139 seconds, which is 60% of the text processing time, just for that one field.
Just reading the lines in the file without any processing or object creating takes 41 seconds.
Using StreamReader.ReadBlock to read the whole file in chunks of about 1KB takes 24s, but it's a minor improvement in the grand scheme of things and probably not worth the added complexity. In order to use TryParse I would then need to create the temporary strings manually rather than using String.Split().
At this point the only path I see is to just display status to the user every few seconds so they don't wonder if the program is frozen or something.
UPDATE
I created two new functions. One can save the dataset from memory into a binary file using System.IO.BinaryWriter. The other function can load that binary file back into memory using System.IO.BinaryReader. The binary versions were considerably faster than CSV versions, and the binary files take up much less space.
Here are the benchmark results (same dataset for all tests):
LOAD CSV: 340s
SAVE CSV: 312s
SAVE BIN: 29s
LOAD BIN: 41s
CSV FILE SIZE: 3.86GB
BIN FILE SIZE: 1.63GB
I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV, that libraries attempt to handle, is dealing with fields that have embedded commas, or newlines, which require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.
As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse. My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting to string first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
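For reference, a minimal sketch of the TryParseExact idea; the exact format string is an assumption and has to match the timestamps actually present in the file:
using System;
using System.Globalization;

// Parsing with an explicit format skips the general format detection that
// DateTime.TryParse performs for every row. The "yyyy-MM-ddTHH:mm:ss" format
// is assumed here, based on the ISO 8601 timestamps mentioned in the question.
static bool TryParseTimestamp(string token, out DateTime value)
{
    return DateTime.TryParseExact(token, "yyyy-MM-ddTHH:mm:ss",
        CultureInfo.InvariantCulture, DateTimeStyles.None, out value);
}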
Here is an example of using my library, Sylvan.Data.Csv, to process the CSV in C#.
static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;
    var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    var csvReader = CsvDataReader.Create(textReader, csvOpts);
    // estimate number of rows to avoid growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);
    while (csvReader.Read())
    {
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}
I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core. Running on .NET Framework, the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or minutes, to read the whole file.
Given that, my advice would be to abandon CSV altogether, which means you can abandon parsing, which is what is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property, in that they are fixed width: each record contains 5 fields that are 8 bytes (64-bit) wide, so each record requires exactly 40 bytes in binary form. 50m x 40 = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value is essentially copied directly to memory.
Here are some examples of how to do this in C# (don't know VB very well, sorry).
static List<Foo> Read(string file)
{
    using (var stream = File.OpenRead(file))
    using (var br = new BinaryReader(stream))
    {
        // the exact number of records can be determined by looking at the length of the file.
        var recordCount = (int)(stream.Length / 40);
        var data = new List<Foo>(recordCount);
        for (int i = 0; i < recordCount; i++)
        {
            var ticks = br.ReadInt64();
            var id = br.ReadInt64();
            var a = br.ReadDouble();
            var b = br.ReadDouble();
            var c = br.ReadDouble();
            var f = new Foo()
            {
                time = new DateTime(ticks),
                ID = id,
                A = a,
                B = b,
                C = c,
            };
            data.Add(f);
        }
        return data;
    }
}
static void Write(List<Foo> data, string file)
{
    using (var stream = File.Create(file))
    using (var bw = new BinaryWriter(stream))
    {
        foreach (var item in data)
        {
            bw.Write(item.time.Ticks);
            bw.Write(item.ID);
            bw.Write(item.A);
            bw.Write(item.B);
            bw.Write(item.C);
        }
    }
}
This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason that you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change every time, or will it only be appended to with new data? If it is appended to, I would investigate a solution where, each time the app starts, you convert only the newly appended CSV data and add it to a binary file that you then load everything from. Then you only have to pay the cost of processing the new CSV data each time, and you will load everything quickly from the binary form.
This could be made even faster by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This can be done because all of your data elements are "blittable". This would be the absolute fastest way to load this data into your program. Start with the BinaryReader/BinaryWriter approach, and if you find that still isn't fast enough, then investigate this.
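For illustration, a minimal sketch of that span-based approach, assuming a hypothetical FooRecord struct that matches the 40-byte layout described above (this needs .NET Core / modern .NET for Stream.Read(Span<byte>)):
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical fixed-layout record: 5 fields x 8 bytes = 40 bytes, matching
// the binary format written by the BinaryWriter example above.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord
{
    public long Ticks; // DateTime.Ticks for the time field
    public long Id;
    public double A;
    public double B;
    public double C;
}

static FooRecord[] ReadAllRecords(string file)
{
    using var stream = File.OpenRead(file);
    var records = new FooRecord[stream.Length / 40];

    // Reinterpret the record array as raw bytes and fill it straight from the
    // file, so no per-field parsing or copying is needed.
    Span<byte> bytes = MemoryMarshal.AsBytes(records.AsSpan());
    while (bytes.Length > 0)
    {
        int read = stream.Read(bytes);
        if (read == 0)
            throw new EndOfStreamException("File is shorter than expected.");
        bytes = bytes.Slice(read);
    }
    return records;
}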
If you find this solution to work, I'd love to hear the results.

TcpClient maximum packet size for sending data

I am building a communication library based on the Net.Sockets.TcpClient class. During some unit tests I wanted to see how large a data packet could be before running into problems. My theory was that the actual size would not matter, because the TcpClient would split the data into parts because of its internal send buffer. But the actual size did matter, because somewhere around 600 KB I discovered loss of data.
What my test does is create a local server and a local client that connect with each other. It then sends a specific (large) packet in a loop to test whether the server receives it correctly. I left out all the checks, but after sending the data a check runs to verify that the data is exactly the same as what was sent, before looping again. So there is never more data in the pipe than the size I specified. This code is the client part of my unit test. The server part is a whole library, so I cannot post that.
Using myClient As New Net.Sockets.TcpClient()
    myClient.BeginConnect("127.0.0.1", ServerPort, Nothing, Nothing)
    'Code that checks if the connection has been made
    'Create a large string
    Dim SendString As String = StrDup(1024000, "A")
    Dim SendBytes As Byte() = Text.Encoding.ASCII.GetBytes(SendString)
    'Loop the test
    For i As Int32 = 1 To 10000
        Wait.Reset()
        myClient.Client.Send(SendBytes)
        Wait.WaitOne(10000) 'Wait for the server to acknowledge
        'Run checks to make sure the data is good, otherwise end loop
    Next
    myClient.Close()
End Using
What happens is that at some random point the server does not receive all the data. Sending 1,024,000 bytes of data works most of the time, but not always. The iteration at which it fails is random, but a loop never finishes 10,000 iterations successfully. I tested the loop with 512,000 bytes and that works. I also tested 600,000 bytes and that failed. I do not know the exact size at which it starts failing, because it does not seem to be a hard limit. I cannot figure out the problem. Is the TcpClient somehow limited, or am I exceeding an internal buffer of some kind? I checked the SendBufferSize of the TcpClient and it was 65536. I have no idea whether that has anything to do with it. Packets larger than that buffer seem to send just fine.
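The server code isn't shown, so this is only a guess, but one common pitfall here is that TCP is a stream protocol: a single Send is allowed to arrive split across several Receive calls, so a receiver that does one Receive per message will occasionally see only part of the data. A hypothetical C# receive loop that accounts for this might look like:
using System.Net.Sockets;

// Keep reading until the expected number of bytes has arrived, because one
// Send on the client can be delivered as several smaller chunks.
static byte[] ReceiveExactly(Socket socket, int expectedLength)
{
    var buffer = new byte[expectedLength];
    int received = 0;
    while (received < expectedLength)
    {
        int n = socket.Receive(buffer, received, expectedLength - received, SocketFlags.None);
        if (n == 0)
            throw new SocketException((int)SocketError.ConnectionReset); // peer closed the connection
        received += n;
    }
    return buffer;
}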

File Compressed by GZIP grows instead of shrinking

I used the code below to compress files, and they keep growing instead of shrinking. I compressed a 4 KB file and it became 6 KB. That is understandable for a small file because of the compression overhead. I tried a 400 MB file and it became 628 MB after compressing. What is wrong? See the code. (.NET 2.0)
Public Sub Compress(ByVal infile As String, ByVal outfile As String)
    Dim sourceFile As FileStream = File.OpenRead(infile)
    Dim destFile As FileStream = File.Create(outfile)
    Dim compStream As New GZipStream(destFile, CompressionMode.Compress)
    Dim myByte As Integer = sourceFile.ReadByte()
    While myByte <> -1
        compStream.WriteByte(CType(myByte, Byte))
        myByte = sourceFile.ReadByte()
    End While
    sourceFile.Close()
    destFile.Close()
End Sub
If the underlying file is itself highly unpredictable (already compressed or largely random data), then attempting to compress it will cause the file to become bigger.
Going from 400 MB to 628 MB sounds highly improbable as an expansion factor, since the deflate algorithm (used for GZip) tends towards a maximum expansion of about 0.03%. The overhead of the GZip header should be negligible.
Edit: The .NET 4.0 release indicates that the compression libraries have been improved to not cause significant expansion of uncompressable data. This suggests that they were not implementing the "fall back to the raw stream blocks" mode. Try SharpZipLib's library as a quick test; that should provide you with close to identical performance when the stream is incompressible by deflate. If it does, consider moving to that, or waiting for the 4.0 release for a more performant BCL implementation. Note that the lack of compression you are getting strongly suggests that there is no point in attempting to compress further anyway.
Are you sure that writing byte by byte to the stream is a really good idea? It will certainly not have ideal performance characteristics and maybe that's what confuses the gzip compressing algorithm too.
Also, it might happen that the data you are trying to compress just isn't very compressible. If I were you, I would try your code with a text document of the same size, as text documents tend to compress much better than random binary data.
Also, you could try using a pure DeflateStream as opposed to a GZipStream as they both use the same compression algorithm (deflate), the only difference is that gzip adds some additional data (like error checking) so a DeflateStream might yield smaller results.
My VB.NET is a bit rusty, so I'd rather not try to write a code example in VB.NET. Instead, here's how you should do it in C#; it should be relatively straightforward to translate it to VB.NET for someone with a bit of experience (or maybe someone who is good at VB.NET could edit my post and translate it to VB.NET):
// Open the files as in the question, then copy in 64 KB chunks instead of byte by byte.
FileStream sourceFile = File.OpenRead(infile);
GZipStream compStream = new GZipStream(File.Create(outfile), CompressionMode.Compress);
byte[] buffer = new byte[65536];
int bytesRead;
while ((bytesRead = sourceFile.Read(buffer, 0, 65536)) > 0)
{
    compStream.Write(buffer, 0, bytesRead);
}
compStream.Close(); // also flushes and closes the underlying output file
sourceFile.Close();
This is a known anomaly with the built-in GZipStream (And DeflateStream).
I can think of two workarounds:
use an alternative compressor.
build some logic that examines the size of the "compressed" output and compares it to the size of the input. If larger, chuck the output and just store the data.
DotNetZip includes a "fixed" GZipStream based on a managed port of zlib. (It takes approach #1 from above). The Ionic.Zlib.GZipStream can replace the built-in GZipStream in your apps with a simple namespace swap.
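As a rough sketch of the second workaround (compress into memory first and keep whichever form is smaller; a real file format would also need a header flag so the reader knows which case it is):
using System.IO;
using System.IO.Compression;

// Compress to a MemoryStream, then keep the compressed bytes only if they are
// actually smaller than the input; otherwise store the raw data unchanged.
// Note: this reads the whole input into memory, so it suits small-to-medium files.
static void CompressOrStore(string inFile, string outFile)
{
    byte[] input = File.ReadAllBytes(inFile);

    byte[] compressed;
    using (MemoryStream ms = new MemoryStream())
    {
        using (GZipStream gz = new GZipStream(ms, CompressionMode.Compress))
        {
            gz.Write(input, 0, input.Length);
        } // closing the GZipStream flushes the final compressed block
        compressed = ms.ToArray();
    }

    File.WriteAllBytes(outFile, compressed.Length < input.Length ? compressed : input);
}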
Thank you all for the good answers. Earlier on I had tried to compress .wmv files and one text file. I changed the code to use DeflateStream and it seems to work now. Cheers.