Iterating over files, splitting by pattern - kotlin

I'm trying to wrap my head around file stream processing. Got input looking like this:
bla
blubb
blubber
testcode
There are several files, all looking like the above. Right now, I'm using a single-file approach that reads the whole file into memory and splits it:
Files.newBufferedReader(Paths.get("myfile")).use { f ->
    f.readText().splitToSequence("\n\n").forEach {
        // do my stuff
    }
}
Now, I'm trying to generalize this to larger inputs (making it impractical to hold the file in memory) and several files. Ideally, I'd treat a whole directory of input files as a single stream of lines I split on \n\n and work on the parts. How would I do this?

You can read the file as a sequence of text lines and then regroup those lines, taking an empty line as the delimiter:
File("myfile").useLines { lines ->
val lineBlocks: Sequence<List<String>> = buildSequence {
val block = mutableListOf<String>()
for (line in lines) {
when {
line.isNotEmpty() -> block.add(line)
block.isNotEmpty() -> {
yield(block.toList())
block.clear()
}
}
}
if (block.isNotEmpty()) yield(block.toList())
}
lineBlocks.forEach {
println(it.joinToString())
}
}
Here you get the result in lineBlocks, which is a sequence where each element is a list of lines in a single block.
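To cover the multi-file part of the question, here is a minimal sketch of the same grouping applied to every file in a directory. The function name blocksFromDirectory and the choice to flush the current block at each file boundary are my own assumptions, not part of the answer above:

import java.io.File

// Sketch: lazily walk a directory and emit blank-line-delimited blocks from each file.
// Blocks are flushed at the end of each file, so a block never spans two files.
fun blocksFromDirectory(dir: File): Sequence<List<String>> = sequence {
    val files = dir.walkTopDown().filter { it.isFile }.sortedBy { it.name }
    for (file in files) {
        file.useLines { lines ->
            val block = mutableListOf<String>()
            for (line in lines) {
                when {
                    line.isNotEmpty() -> block.add(line)
                    block.isNotEmpty() -> {
                        yield(block.toList())
                        block.clear()
                    }
                }
            }
            if (block.isNotEmpty()) yield(block.toList())
        }
    }
}

// Usage ("inputDir" is a placeholder directory):
// blocksFromDirectory(File("inputDir")).forEach { println(it.joinToString()) }

Because both the directory walk and the per-file line reading are lazy, only one block is held in memory at a time.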

Related

Need to filter lines from a sequence but keep it as a single sequence

I need to read a file from disk, filter out some rows based on conditions, then return the result as a single stream/sequence, not a sequence of strings. The file is too large to hold in memory all at once, so it must be treated as a Stream/Sequence throughout processing. This is what I tried.
File(filename).bufferedReader()
    // break into lines
    .lineSequence()
    // filter each line based on condition
    .filter { meetsSomeCondition(it) }
    // add newline back in
    .map { (it + "\n").byteInputStream() }
    // reduce back into a single stream with Java's SequenceInputStream
    .reduce<InputStream, ByteArrayInputStream> { acc, i -> SequenceInputStream(acc, i) }
This works when testing on a small file, but with a large file it fails with a StackOverflowError. It seems that Java's SequenceInputStream can't handle being nested repeatedly the way my reduce call nests it.
I see that SequenceInputStream also has a way of accepting an Enumeration argument that takes a List of elements. But that's the problem, as far as I can tell, it doesn't seem to accept a Stream.
Your code does not really do what you think it does. reduce() is a terminal operation, meaning that it consumes all elements of the sequence; by the time the reduce() line finishes, the whole file has already been read. Also, it is not that SequenceInputStream doesn't support such a reduce operation: you created a very long chain of nested objects, and each SequenceInputStream has no idea it is part of that chain, so there is nothing it can do about it.
Instead, you need to keep the sequence "alive" and create an InputStream that reads from the source sequence whenever required. I don't think there is a utility like this in the stdlib; it is a very specialized requirement.
The easiest approach is to create a sequence of bytes and then provide an InputStream which reads from it:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .flatMap { "$it\n".toByteArray().asIterable() }
    .asInputStream()

fun Sequence<Byte>.asInputStream() = object : InputStream() {
    val iter = iterator()

    override fun read() = if (iter.hasNext()) iter.next().toUByte().toInt() else -1
}
However, a sequence of single bytes isn't great for performance. We can optimize it by working line by line, creating a sequence of byte arrays instead:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .map { "$it\n".toByteArray() }
    .asInputStream()

fun Sequence<ByteArray>.asInputStream() = object : InputStream() {
    val iter = iterator()
    var curr = iter.next()
    var pos = 0

    override fun read(): Int {
        return when {
            pos < curr.size -> curr[pos++].toUByte().toInt()
            !iter.hasNext() -> -1
            else -> {
                curr = iter.next()
                pos = 0
                read()
            }
        }
    }
}
(Note this implementation of asInputStream() will fail for an empty sequence.)
Still, there is much room for improvement regarding performance. We read from the sequence line by line, but we still read from the InputStream byte by byte. To improve it further, we would need to implement more methods of InputStream so that data is read in bigger chunks. If you really care about performance, I suggest looking into the BufferedInputStream implementation and trying to reuse some of its code.
Also, remember to close the file reader that was created in the first step. It won't be closed automatically when the InputStream is closed.
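As one possible direction, here is a rough sketch of such a chunk-aware variant. The names asChunkedInputStream, onClose, and linesAsStream are my own, not from the answer above; it overrides read(b, off, len) so callers can pull many bytes per call, tolerates an empty sequence, and takes a callback so the underlying reader can be closed together with the stream:

import java.io.File
import java.io.InputStream

fun Sequence<ByteArray>.asChunkedInputStream(onClose: () -> Unit = {}): InputStream =
    object : InputStream() {
        val iter = iterator()
        var curr = ByteArray(0)
        var pos = 0

        // Advance to the next non-empty chunk; return false when the sequence is exhausted.
        private fun ensureChunk(): Boolean {
            while (pos >= curr.size) {
                if (!iter.hasNext()) return false
                curr = iter.next()
                pos = 0
            }
            return true
        }

        override fun read(): Int =
            if (ensureChunk()) curr[pos++].toUByte().toInt() else -1

        override fun read(b: ByteArray, off: Int, len: Int): Int {
            if (len == 0) return 0
            if (!ensureChunk()) return -1
            val n = minOf(len, curr.size - pos)
            System.arraycopy(curr, pos, b, off, n)
            pos += n
            return n
        }

        override fun close() = onClose()
    }

// Usage sketch: keep a handle on the reader so it can be closed with the stream.
fun linesAsStream(filename: String, meetsSomeCondition: (String) -> Boolean): InputStream {
    val reader = File(filename).bufferedReader()
    return reader.lineSequence()
        .filter { meetsSomeCondition(it) }
        .map { "$it\n".toByteArray() }
        .asChunkedInputStream(onClose = { reader.close() })
}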

Fastest way to read huge text file (6 GB) line by line [duplicate]

I have a large txt file with 100,000 lines.
I need to start n threads and give every thread a unique line from this file.
What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?
You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
    // your code here
});
After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary(), the results seem to support @dtb's answer above that the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, both appear to use only about two of my eight cores.
Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
I also tried a producer / consumer or MapReduce style pattern where one thread was used to read the data and a second thread was used to process it. This also did not seem to outperform the simple pattern above.
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

var readLines = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    inputLines.CompleteAdding();
});

var processLines = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);
    });
});

Task.WaitAll(readLines, processLines);
I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.
Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set max size of the queue to prevent out of memory errors.
Something like:
public class ParallelReadExample
{
    public static IEnumerable<string> LineGenerator(StreamReader sr)
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }

    static void Main()
    {
        StreamReader sr = new StreamReader("yourfile.txt");

        Parallel.ForEach(LineGenerator(sr), currentLine =>
        {
            // Do your thing with currentLine here...
        });

        sr.Close();
    }
}
I think it would work. (No C# compiler/IDE here.)
If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n):
string filename = "C:\\TEST\\TEST.DATA";
int n = 5;

File.ReadLines(filename)
    .AsParallel()
    .WithDegreeOfParallelism(n)
    .ForAll(line =>
    {
        // Process line.
    });
As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to:
1) read the file into an array with File.ReadAllLines(), then
2) use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);

Parallel.For(0, AllLines.Length, x =>
{
    DoStuff(AllLines[x]);
    // whatever you need to do
});
With the introduction of bigger array sizes in .NET 4, as long as you have plenty of memory, this shouldn't be an issue.

When performing a collections operation, is it possible to modify the underlying collection?

For example, I have the following code to recursively copy a directory's contents.
private fun copyContentDirectory(directory: File): List<File> {
    val files = directory.listFiles().toList()
    val filesToTransform = mutableListOf<File>()

    // Add each file + directory. Then, recursively add the files in each directory.
    files
        .onEach { filesToTransform += it }
        .filter { it.isDirectory }
        .forEach { filesToTransform += copyContentDirectory(it) }

    return filesToTransform
}
Is it possible to have something like the following? If not, why not?
private fun copyContentDirectory(directory: File): List<File> {
    return directory.listFiles().toList()
        .filter { it.isDirectory }
        .onEach { <thisList> += copyContentDirectory(it) }
}
Where thisList is some symbol that allows me to reference the underlying list. Does such a thing exist?
As per comments, your intentions aren't very clear.
Looking at the second example, the obvious answer would seem to be to replace this line:
.onEach { <thisList> += copyContentDirectory(it) }
with one using flatMap(), e.g.:
.flatMap{ copyContentDirectory(it) }
That collects together the results from all the recursive calls, and returns them as a single list — which I think is what you want.
However, that just reveals deeper problems:
Despite the name, the method isn't actually copying anything, just collecting together a list.
The list will always be empty — it recurses over directories, but never returns any files, so it will only ever be combining empty lists.
Here's a version which addresses the second problem. I've also renamed it, recast it as an extension function, and used partition() to avoid filtering twice.  (The first result is those files matching the predicate, i.e. directories, over which it recurses; the second is files not matching, i.e. non-directories, which it includes directly.)  And because listFiles() can return null in some circumstances, it has to handle that too.
private fun File.listContents(): List<File> =
    listFiles()
        ?.partition { it.isDirectory }
        ?.let { it.first.flatMap { it.listContents() } + it.second }
        ?: listOf()
(That doesn't address the copying, but the question doesn't indicate how you plan to approach that.)
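A quick usage sketch of listContents() (the directory path below is only a placeholder):

import java.io.File

fun main() {
    // Hypothetical content root; substitute your own directory.
    val contents = File("/path/to/content").listContents()
    contents.forEach { println(it) }
}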

File I/O in Kotlin with (potentially) unknown encoding

This is my first attempt to learn and use Kotlin. I have a simple task: read a file line by line, preprocess each line and put a specific portion into a map. Each line is tab-separated.
When trying to preprocess, things start going horribly wrong. When I debug, instead of the normal characters I see a strange-looking block with horizontal lines in between every two adjacent readable characters.
Here is the code I wrote:
fun mapUserToId(path: String): MutableMap<String, Int> {
    val user2id = mutableMapOf<String, Int>()
    val bufferedReader = File(path).bufferedReader()
    bufferedReader.useLines { lines ->
        lines.drop(1).forEach { // dropping the first line with column headers
            val components: List<String> = it.trim().split("\t") // split by tab delimiter
            val user: String = components[2]
            println(user.length) // length is nearly double due to the strange block-like characters
            val id: String = components[3]
            user2id[user] = id.toInt() // fails due to number format exception, because of those block-like characters
        }
    }
    return user2id
}
This looks like a charset issue, but I can't figure out what the charset could be, and how to specify that charset in the above code. Opening the file in vim looks perfectly normal (as in, one would suspect that this file has UTF-8 encoding).
This is, indeed, an encoding issue. The problem is resolved by specifying the encoding while creating the buffered reader as follows:
val bufferedReader: BufferedReader = File(path).bufferedReader(Charsets.UTF_16)
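If the encoding isn't known ahead of time, one hedged option (not from the original answer; the helper name detectCharsetByBom is my own) is to peek at the file's byte-order mark, if present, and pick a charset from that, falling back to UTF-8:

import java.io.File
import java.nio.charset.Charset

// Best-effort charset detection: it only looks for a BOM, so files without one
// (or in other encodings) still need the charset supplied explicitly.
fun detectCharsetByBom(file: File): Charset {
    val bom = ByteArray(3)
    val read = file.inputStream().use { it.read(bom) }
    return when {
        read >= 3 && bom[0] == 0xEF.toByte() && bom[1] == 0xBB.toByte() && bom[2] == 0xBF.toByte() -> Charsets.UTF_8
        read >= 2 && bom[0] == 0xFE.toByte() && bom[1] == 0xFF.toByte() -> Charsets.UTF_16BE
        read >= 2 && bom[0] == 0xFF.toByte() && bom[1] == 0xFE.toByte() -> Charsets.UTF_16LE
        else -> Charsets.UTF_8 // assumption: default to UTF-8 when no BOM is present
    }
}

// Usage: val bufferedReader = File(path).bufferedReader(detectCharsetByBom(File(path)))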

Why won't Kotlin print the string I select from a .txt file unless it's the last line?

I'm trying to open a text file, randomly select a line, and format a string that includes the randomly selected line. The string is then printed to the console, but for some reason it won't work unless the last line of the file gets randomly selected.
Text file:
Neversummer
Abelhaven
Phandoril
Tampa
Sanortih
Trell
Zan'tro
Hermi Hermi
Curlthistle Forest
Code:
import java.io.File

fun main() {
    var string = File("data/towns.txt")
        .readText()
        .split("\n")
        .shuffled()
        .first()

    println("$string has printed")
}
Output when last line is selected:
Curlthistle Forest has printed
Output when any other line is selected:
has printed
As suggested by dyukha in the comments, it is indeed a platform-specific issue: the file has Windows-style \r\n line endings, so splitting on "\n" leaves a trailing carriage return on every line except the last, and when printed that \r moves the cursor back to the start of the line so " has printed" overwrites the town name. I prefer the solution they provided using readLines(), since you can condense two function calls into one.
However, should you ever need to check for the line delimiter in a platform-independent manner, you can use the built-in System.lineSeparator() method (available since Java 7).
import java.io.File

fun main() {
    var string = File("data/towns.txt")
        .readText()
        .split(System.lineSeparator())
        .shuffled()
        .first()

    println("$string has printed")
}
...
Still, I do recommend that you use readLines() since it packages the functionality of both .readText() and .split(System.lineSeparator()).
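For reference, a minimal sketch of that readLines() version (readLines() splits on \n, \r\n, and \r, so no explicit separator is needed):

import java.io.File

fun main() {
    val town = File("data/towns.txt")
        .readLines()   // handles \n, \r\n, and \r line endings
        .shuffled()
        .first()

    println("$town has printed")
}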