Reading a CSV file with 50M lines, how to improve performance - vb.net

I have a data file in CSV (Comma-Separated-Value) format that has about 50 million lines in it.
Each line is read into a string, parsed, and then used to fill in the fields of an object of type FOO. The object then gets added to a List(of FOO) that ultimately has 50 million items.
That all works, and fits in memory (at least on an x64 machine), but it's SLOW. It takes about 5 minutes every time I load and parse the file into the list. I would like to make it faster. How can I make it faster?
The important parts of the code are shown below.
Public Sub LoadCsvFile(ByVal FilePath As String)
    Dim s As IO.StreamReader = My.Computer.FileSystem.OpenTextFileReader(FilePath)
    'Find header line
    Dim L As String
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Exit While
    End While
    'Parse data lines
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Dim T As FOO = FOO.FromCSV(L)
        Add(T)
    End While
    s.Close()
End Sub
Public Class FOO
    Public time As Date
    Public ID As UInt64
    Public A As Double
    Public B As Double
    Public C As Double

    Public Shared Function FromCSV(ByVal X As String) As FOO
        Dim T As New FOO
        Dim tokens As String() = X.Split(",")
        If Not DateTime.TryParse(tokens(0), T.time) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ISO 8601 timestamp")
        End If
        If Not UInt64.TryParse(tokens(1), T.ID) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ID")
        End If
        If Not Double.TryParse(tokens(2), T.A) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for A")
        End If
        If Not Double.TryParse(tokens(3), T.B) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for B")
        End If
        If Not Double.TryParse(tokens(4), T.C) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for C")
        End If
        Return T
    End Function
End Class
I did some benchmarking and here are the results.
The complete algorithm above took 314 seconds to load the whole file and put the objects into the list.
With the body of FromCSV() reduced to just returning a new object of type FOO with default field values, the whole process took 84 seconds. Therefore it appears that processing the line of text into the object fields is taking 230 seconds (73% of the total time).
Doing everything but parsing the ISO 8601 date string takes 175 seconds. Therefore it appears that processing the date string takes 139 seconds, which is 60% of the text processing time, just for that one field.
Just reading the lines in the file, without any processing or object creation, takes 41 seconds.
Using StreamReader.ReadBlock to read the whole file in chunks of about 1KB takes 24 seconds, but it's a minor improvement in the grand scheme of things and probably not worth the added complexity: in order to use TryParse I would then need to build the temporary strings manually rather than using String.Split().
At this point the only path I see is to just display status to the user every few seconds so they don't wonder if the program is frozen or something.
UPDATE
I created two new functions. One saves the dataset from memory into a binary file using System.IO.BinaryWriter. The other loads that binary file back into memory using System.IO.BinaryReader. The binary versions were considerably faster than the CSV versions, and the binary file takes up much less space.
Here are the benchmark results (same dataset for all tests):
LOAD CSV: 340s
SAVE CSV: 312s
SAVE BIN: 29s
LOAD BIN: 41s
CSV FILE SIZE: 3.86GB
BIN FILE SIZE: 1.63GB

I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV, which libraries attempt to handle, is dealing with fields that contain embedded commas or newlines, which require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.
As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse (a rough sketch of that follows). My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting them to strings first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
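To illustrate the TryParseExact idea in C# (the format string below is an assumption on my part and has to match the actual layout of your timestamps; ParseTimestamp is just an illustrative helper, not part of any library):
using System.Globalization;

// Exact-format parsing skips the format detection that DateTime.TryParse repeats on every row.
static DateTime ParseTimestamp(string token)
{
    // "yyyy-MM-ddTHH:mm:ss" is assumed; adjust it to whatever your file actually contains.
    if (!DateTime.TryParseExact(token, "yyyy-MM-ddTHH:mm:ss",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out var time))
    {
        throw new FormatException("Invalid ISO 8601 timestamp: " + token);
    }
    return time;
}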
Here is an example of using my library, Sylvan.Data.Csv, to process the CSV in C#.
static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;
    var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    var csvReader = CsvDataReader.Create(textReader, csvOpts);
    // estimate number of rows to avoid growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);
    while (csvReader.Read())
    {
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}
I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core; on .NET Framework the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or even minutes, to read the whole file.
Given that, my advice would be to abandon CSV altogether, which means you can abandon the parsing that is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property in that they are fixed width: each record contains 5 fields that are 8 bytes (64 bits) wide, so each record requires exactly 40 bytes in binary form. 50M x 40 = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value is essentially copied directly into memory.
Here are some examples of how to do this in C# (I don't know VB very well, sorry).
static List<Foo> Read(string file)
{
    using (var stream = File.OpenRead(file))
    using (var br = new BinaryReader(stream))
    {
        // the exact number of records can be determined by looking at the length of the file.
        var recordCount = (int)(stream.Length / 40);
        var data = new List<Foo>(recordCount);
        for (int i = 0; i < recordCount; i++)
        {
            var ticks = br.ReadInt64();
            var id = br.ReadUInt64();
            var a = br.ReadDouble();
            var b = br.ReadDouble();
            var c = br.ReadDouble();
            var f = new Foo()
            {
                time = new DateTime(ticks),
                ID = id,
                A = a,
                B = b,
                C = c,
            };
            data.Add(f);
        }
        return data;
    }
}
static void Write(List<Foo> data, string file)
{
    using (var stream = File.Create(file))
    using (var bw = new BinaryWriter(stream))
    {
        foreach (var item in data)
        {
            bw.Write(item.time.Ticks);
            bw.Write(item.ID);
            bw.Write(item.A);
            bw.Write(item.B);
            bw.Write(item.C);
        }
    }
}
This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change completely every time, or will it only be appended to with new data? If it is only appended to, I would investigate a solution where, each time the app starts, you convert just the newly appended CSV rows and add them to a binary file, then load everything from that binary file (a sketch of that follows). That way you only pay the cost of parsing the new CSV data on each run, and load everything else quickly from the binary form.
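Here is a rough sketch of that incremental idea, under a couple of assumptions: the CSV is strictly append-only, the binary file uses the 40-byte layout from the Write example above, and FooFromCsv is a hypothetical name standing in for whatever parsing routine you already have. Old rows are still read and skipped, but only the new rows pay the parsing cost.
static void AppendNewRows(string csvFile, string binFile)
{
    // Infer how many records were converted on previous runs from the binary file's size.
    long alreadyConverted = File.Exists(binFile) ? new FileInfo(binFile).Length / 40 : 0;

    using var reader = new StreamReader(csvFile);
    using var stream = new FileStream(binFile, FileMode.Append, FileAccess.Write);
    using var bw = new BinaryWriter(stream);

    reader.ReadLine(); // skip the header line
    long row = 0;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0) continue;          // discard blank lines
        if (row++ < alreadyConverted) continue;  // already converted on a previous run
        var f = FooFromCsv(line);                // hypothetical: your existing parse routine
        bw.Write(f.time.Ticks);
        bw.Write(f.ID);
        bw.Write(f.A);
        bw.Write(f.B);
        bw.Write(f.C);
    }
}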
Loading the binary file could be made even faster still by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This can be done because all of your data elements are "blittable". It would be the absolute fastest way to load this data into your program; a rough sketch follows. Start with the BinaryReader/BinaryWriter approach and, if you find that still isn't fast enough, investigate this.
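This is only a sketch of what I mean, not tested against your data; it assumes .NET Core / modern .NET (for the Span-based Stream.Read overload), and FooRecord is an illustrative struct that mirrors the 40-byte record layout written above.
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord // illustrative name: long ticks, ulong id, then three doubles = 40 bytes
{
    public long TimeTicks;
    public ulong ID;
    public double A, B, C;
}

static FooRecord[] ReadAll(string file)
{
    using var stream = File.OpenRead(file);
    var records = new FooRecord[stream.Length / 40];
    // View the struct array as raw bytes and fill it straight from the file:
    // no per-field reads, no parsing, just bulk copies into the array's memory.
    Span<byte> bytes = MemoryMarshal.AsBytes<FooRecord>(records);
    while (bytes.Length > 0)
    {
        int read = stream.Read(bytes);
        if (read == 0) break; // unexpected end of file
        bytes = bytes.Slice(read);
    }
    return records;
}
You would convert TimeTicks to a DateTime only when a record is actually used, which keeps the load itself down to raw IO.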
If you find that this solution works for you, I'd love to hear the results.

Related

get every other index in for loop

I have an interesting issue. I have a string that's in HTML, and I need to parse a table so that I can get the data I need out of it and present it in a way that looks good on a mobile device. So I use regex, and it works just fine, but now I'm porting my code to Kotlin and my solution isn't porting over well. Here is what the solution currently looks like:
var pointsParsing = Regex.Matches(htmlBody, "<td.*?>(.*?)</td>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
var pointsSb = new StringBuilder();
for (var i = 0; i < pointsParsing.Count; i += 2)
{
    var pointsTitle = pointsParsing[i].Groups[1].Value.Replace("&amp;", "&");
    var pointsValue = pointsParsing[i + 1].Groups[1].Value;
    pointsSb.Append($"{pointsTitle} {pointsValue} {pointsVerbiage}\n");
}
return pointsSb.ToString();
As you can see, on each pass through the loop I consume two results from the regex search, so I tell the for loop to increment by two to avoid reading the same match twice.
However, I don't seem to have this ability in Kotlin. I know how to get the index in a for loop, but I have no idea how to tell it to step by 2 so I don't accidentally process something I already handled on the previous iteration.
How would I tell the for loop to work the way I need it to in Kotlin?
You might be looking for chunked which lets you split an iterable into chunks of e.g. 2 elements:
ptsListResults.chunked(2).forEach { data -> // data is a list of (up to) two elements
    val pointsTitle = data[0].groups[1]!!.value
    val pointsValue = data[1].groups[1]!!.value
    // etc
}
so that's more explicit about breaking your list up into meaningful chunks and operating within the structure of those chunks, rather than manipulating indices.
There's also windowed, which is a bit more complex and gives you more options, one of which is disallowing partial windows (i.e. chunks at the end that don't have the required number of elements). It probably doesn't apply here, but just so you know!
I found a solution that looks to work and thought I'd share it.
Thanks to this SO answer, I see how you can step over the indices.
val pointsListSearch = "<td.*?>(.*?)</td>".toRegex()
val pointsListSearchResults = pointsListSearch.findAll(htmlBody)
val pointsSb = StringBuilder()
val ptsListResults = pointsListSearchResults.toList()
for (i in ptsListResults.indices step 2)
{
    val pointsTitle = ptsListResults[i].groups[1]!!.value
    val pointsValue = ptsListResults[i + 1].groups[1]!!.value
    pointsSb.append("${pointsTitle}: ${pointsValue}")
}

File.OpenRead is too slow

I have an endpoint written in VB.NET that returns a Stream (it allows you to download a zip file). The problem is that when the file is quite large (> 300 MB), the File.OpenRead call takes a few tens of seconds (the file is on the network). Is there anything that would let me speed up reading the file?
Private Function DownloadZipFile() As Stream Implements ILUpdateWS.DownloadZipFile
    ...
    Dim fstream = File.OpenRead(downloadFilePath)
    Return fstream
    ...
End Function

Why did this hashmap stop working out of no where?

I have used this HashMap for a few days now with no problems at all. Now I get an error mentioning FloatingDecimal.parseDouble, ReadWrite, FileReadWrite, and a looping error.
The last thing I did to the program was add a "$%.2f".format to the second element of my products. It ran a few times, I left to eat, and came back to this!
I was able to narrow it down to the point where the data is pulled from the file and converted into the HashMap.
Data in the file example 111,shoes,59.00
val fileName = "src/products.txt"
var products = HashMap<Int, Pair<String, Double>>()
var inputFD = File(fileName).forEachLine {
    var pieces = it.split(",")
    // println(pieces)
    products[pieces[0].toInt()] = Pair(pieces[1].trim(), pieces[2].toDouble())
}
The data in the file had been altered when it was written back out, which caused the whole program to crash when reading it back in. I wanted my double to be, for example, 9.99, but when saving the file I added a $ sign that was meant only for display. When the program was looking for a double (9.99), all it found was ($9.99), causing the error.

Byte InputRange from file

How can I easily construct a raw byte-by-byte InputRange/ForwardRange/RandomAccessRange from a file?
file.byChunk(4096).joiner
This reads a file in 4096-byte chunks and lazily joins the chunks together into a single ubyte input range.
joiner is from std.algorithm, so you'll have to import it first.
The easiest way to make a raw byte range from a file is to just read it all right into memory:
import std.file;
auto data = cast(ubyte[]) read("filename");
// data is a full-featured random access range of the contents
If the file is too large for that to be reasonable, you could try a memory-mapped file (http://dlang.org/phobos/std_mmfile.html) and use opSlice to get an array from it. Since it is an array, you get full range features, but since it is memory-mapped by the operating system, you get lazy reading as you touch the file.
For a simple InputRange, there's LockingTextReader (undocumented) in Phobos, or you could construct one yourself over byChunk or even fgetc, the C function. fgetc would be the easiest to write:
import core.stdc.stdio; // FILE, fgetc, feof

struct FileByByte {
    ubyte front;
    void popFront() { front = cast(ubyte) fgetc(fp); }
    bool empty() { return feof(fp) != 0; }
    FILE* fp;
    this(FILE* fp) { this.fp = fp; popFront(); /* prime it */ }
}
I haven't actually tested that, but I'm pretty sure it'd work. (By the way, the file open and close are separate from this, because ranges are supposed to be just views into data, not managed containers. You wouldn't want the file closed just because you passed this range into a function.)
This is neither a forward nor a random access range, though. Those are trickier to do on streams without a lot of buffering code, and I think that'd be a mistake to try to write; generally, ranges should be cheap, not emulating features the underlying container doesn't natively support.
EDIT: The other answer has a non-buffering way! https://stackoverflow.com/a/30278933/1457000 That's awesome.

How to create a lazy-evaluated range from a file?

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
File file("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazily evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that lets the data be processed like in the first example?
You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.