How to use the same iterator twice, once for counting and once for iteration? - iterator

It seems that an iterator is consumed when counting. How can I use the same iterator for counting and then iterate on it?
I'm trying to count the lines in a file and then print them. I am able to read the file content, I'm able to count the lines count, but then I'm no longer able to iterate over the lines as if the internal cursor was at the end of the iterator.
use std::fs::File;
use std::io::prelude::*;
fn main() {
let log_file_name = "/home/myuser/test.log";
let mut log_file = File::open(log_file_name).unwrap();
let mut log_content: String = String::from("");
//Reads the log file.
log_file.read_to_string(&mut log_content).unwrap();
//Gets all the lines in a Lines struct.
let mut lines = log_content.lines();
//Uses by_ref() in order to not take ownership
let count = lines.by_ref().count();
println!("{} lines", count); //Prints the count
//Doesn't enter in the loop
for value in lines {
println!("{}", value);
}
}
Iterator doesn't have a reset method, but it seems the internal cursor is at the end of the iterator after the count. Is it mandatory to create a new Lines by calling log_content.lines() again or can I reset the internal cursor?
For now, the workaround that I found is create a new iterator:
use std::fs::File;
use std::io::prelude::*;
fn main() {
let log_file_name = "/home/myuser/test.log";
let mut log_file = File::open(log_file_name).unwrap();
let mut log_content: String = String::from("");
//Reads the log file.
log_file.read_to_string(&mut log_content).unwrap();
//Counts all and consume the iterator
let count = log_content.lines().count();
println!("{} lines", count);
//Creates a pretty new iterator
let lines = log_content.lines();
for value in lines {
println!("{}", value);
}
}

Calling count consumes the iterator, because it actually iterates until it is done (i.e. next() returns None).
You can prevent consuming the iterator by using by_ref, but the iterator is still driven to its completion (by_ref actually just returns the mutable reference to the iterator, and Iterator is also implemented for the mutable reference: impl<'a, I> Iterator for &'a mut I).
This still can be useful if the iterator contains other state you want to reuse after it is done, but not in this case.
You could simply try forking the iterator (they often implement Clone if they don't have side effects), although in this case recreating it is just as good (most of the time creating an iterator is cheap; the real work is usually only done when you drive it by calling next directly or indirectly).
So no, (in this case) you can't reset it, and yes, you need to create a new one (or clone it before using it).

The other answers have already well-explained that you can either recreate your iterator or clone it.
If the act of iteration is overly expensive or it's impossible to do multiple times (such as reading from a network socket), an alternative solution is to create a collection of the iterator's values that will allow you to get the length and the values.
This does require storing every value from the iterator; there's no such thing as a free lunch!
use std::fs;
fn main() {
let log_content = fs::read_to_string("/home/myuser/test.log").unwrap();
let lines: Vec<_> = log_content.lines().collect();
println!("{} lines", lines.len());
for value in lines {
println!("{}", value);
}
}

Iterators can generally not be iterated twice because there might be a cost to their iteration. In the case of str::lines, each iteration needs to find the next end of line, which means scanning through the string, which has some cost. You could argue that the iterator could save those positions for later reuse, but the cost of storing them would be even bigger.
Some Iterators are even more expensive to iterate, so you really don't want to do it twice.
Many iterators can be recreated easily (here calling str::lines a second time) or be cloned. Whichever way you recreate an iterator, the two iterators are generally completely independent, so iterating will mean you'll pay the price twice.
In your specific case, it is probably fine to just iterate the string twice as strings that fit in memory shouldn't be so long that merely counting lines would be a very expensive operation. If you believe this is the case, first benchmark it, second, write your own algorithm as Lines::count is probably not optimized as much as it could since the primary goal of Lines is to iterate lines.

Related

Need to filter lines from a sequence but keep it as a single sequence

I need to read a file from disk, filter out some rows based on conditions, then return the result as a single stream/sequence, not a sequence of strings. The file is too large to hold in memory all at once, so it must be treated as a Stream/Sequence throughout processing. This is what I tried.
File(filename).bufferedReader()
// break into lines
.lineSequence()
// filter each line based on condition
.filter{meetsSomeCondition(it)}
// add newline back in
.map{(it+"\n").byteInputStream()}
// reduce back into a single stream with Java's SequenceInputStream
.reduce<InputStream, ByteArrayInputStream> { acc, i -> SequenceInputStream(acc, i) }
This works when testing on a small file, but when using a large file it errors with a StackOverFlow exception. It seems that Java's SequenceInputStream can't handle repeatedly nesting itself like I do with the reduce call.
I see that SequenceInputStream also has a way of accepting an Enumeration argument that takes a List of elements. But that's the problem, as far as I can tell, it doesn't seem to accept a Stream.
Your code does not really do what you think it does. reduce() is a terminal operation, meaning that it consumes all elements in the sequence. After reduce() line the whole file has been read already. Also, it is not SequenceInputStream that does not support such reduce operation. You created a very long chain of objects. SequenceInputStream objects does not really know they were chained like this and they can't do too much about it.
Instead, you need to keep a sequence "alive" and create an InputStream that will read from the source sequence whenever required. I don't think there is an utility like this in the stdlib. It is a very specialized requirement.
The easiest is to create a sequence of bytes and then provide InputStream which reads from it:
File(filename).bufferedReader()
.lineSequence()
.filter{meetsSomeCondition(it)}
.flatMap { "$it\n".toByteArray().asIterable() }
.asInputStream()
fun Sequence<Byte>.asInputStream() = object : InputStream() {
val iter = iterator()
override fun read() = if (iter.hasNext()) iter.next().toUByte().toInt() else -1
}
However, sequence of bytes isn't really the best for performance. We can optimize it by reading line by line, so creating a sequence of strings or byte arrays:
File(filename).bufferedReader()
.lineSequence()
.filter{meetsSomeCondition(it)}
.map { "$it\n".toByteArray() }
.asInputStream()
fun Sequence<ByteArray>.asInputStream() = object : InputStream() {
val iter = iterator()
var curr = iter.next()
var pos = 0
override fun read(): Int {
return when {
pos < curr.size -> curr[pos++].toUByte().toInt()
!iter.hasNext() -> -1
else -> {
curr = iter.next()
pos = 0
read()
}
}
}
}
(Note this implementation of asInputStream() will fail for empty sequence)
Still, there is much room for improvement regarding the performance. We read from sequence line by line, but we still read from InputStream byte by byte. To improve it further we would need to implement more methods of InputStream to read in bigger chunks. If you really care about the performance then I suggest looking into BufferedInputStream implementation and try to re-use some of its codebase.
Also, remember to close the file reader that was created in the first step. It won't close automatically when InputStream will be closed.

Why can't I use map on Kotlins Regex Result sequence

I worked with Kotlin's Regex API to get occurences of some regular expression. I wanted to convert the finding directly into another object so I intuitively used map() on the result sequence.
I was very surprised that the map function is never called but forEach is working. This example should make it clear:
val regex = "a.".toRegex()
val txt = "abacad"
var counter = 0
regex.findAll(txt).forEach { counter++ }
println(counter) // 3
regex.findAll(txt).map { counter++ }
println(counter) // still 3 since map is not called
regex.findAll(txt).forEach { counter++ }
println(counter) // 6
My question is why? Did I oversee it in the documentation?
(tested on Kotlin 1.5.30)
findAll() returns a Sequence<MatchResult>. Operations on Sequence are classified either as intermediate or terminal. The documentation for the functions declares which type they are. map and onEach are intermediate. Their action is deferred until a terminal operation is made. forEach is terminal.
Manipulating a Sequence with map returns a new Sequence that will perform the mapping function only when it is actually iterated, such as by a call to forEach or using it in a for loop.
This is the purpose of Sequence, to defer mutating functional calls. It can reduce allocations of intermediate Lists, or in some cases avoid applying the mutations on every single item, such as if the terminal call in the chain is a find() call.

Rust Inspect Iterator: cannot borrow `*` as immutable because it is also borrowed as mutable

Why can't I push to this vector during inspect and do contains on it during skip_while?
I've implemented my own iterator for my own struct Chain like this:
struct Chain {
n: u32,
}
impl Chain {
fn new(start: u32) -> Chain {
Chain { n: start }
}
}
impl Iterator for Chain {
type Item = u32;
fn next(&mut self) -> Option<u32> {
self.n = digit_factorial_sum(self.n);
Some(self.n)
}
}
Now what I'd like to do it take while the iterator is producing unique values. So I'm inspect-ing the chain and pushing to a vector and then checking it in a take_while scope:
let mut v = Vec::with_capacity(terms);
Chain::new(i)
.inspect(|&x| {
v.push(x)
})
.skip_while(|&x| {
return v.contains(&x);
})
However, the Rust compile spits out this error:
error: cannot borrow `v` as immutable because it is also borrowed as mutable [E0502]
...
borrow occurs due to use of `v` in closure
return v.contains(&x);
^
previous borrow of `v` occurs here due to use in closure; the mutable borrow prevents subsequent moves, borrows, or modification of `v` until the borrow ends
.inspect(|&x| {
v.push(x)
})
Obviously I don't understand the concept of "borrowing". What am I doing wrong?
The problem here is that you're attempting to create both a mutable and an immutable reference to the same variable, which is a violation of Rust borrowing rules. And rustc actually does say this to you very clearly.
let mut v = Vec::with_capacity(terms);
Chain::new(i)
.inspect(|&x| {
v.push(x)
})
.skip_while(|&x| {
return v.contains(&x);
})
Here you're trying to use v in two closures, first in inspect() argument, second in skip_while() argument. Non-move closures capture their environment by reference, so the environment of the first closure contains &mut v, and that of the second closure contains &v. Closures are created in the same expression, so even if it was guaranteed that inspect() ran and dropped the borrow before skip_while() (which I is not the actual case, because these are iterator adapters and they won't be run at all until the iterator is consumed), due to lexical borrowing rules this is prohibited.
Unfortunately, this is one of those examples when the borrow checker is overly strict. What you can do is to use RefCell, which allows mutation through a shared reference but introduces some run-time cost:
use std::cell::RefCell;
let mut v = RefCell::new(Vec::with_capacity(terms));
Chain::new(i)
.inspect(|x| v.borrow_mut().push(*x))
.skip_while(|x| v.borrow().contains(x))
I think it may be possible to avoid runtime penalty of RefCell and use UnsafeCell instead, because when the iterator is consumed, these closures will only run one after another, not at the same time, so there should never be a mutable and an immutable references outstanding at the same time. It could look like this:
use std::cell::UnsafeCell;
let mut v = UnsafeCell::new(Vec::with_capacity(terms));
Chain::new(i)
.inspect(|x| unsafe { (&mut *v.get()).push(*x) })
.skip_while(|x| unsafe { (&*v.get()).contains(x) })
But I may be wrong, and anyway, the overhead of RefCell is not that high unless this code is running in a really tight loop, so you should only use UnsafeCell as a last resort, only when nothing else works, and exercise extreme caution when working with it.

How to rewrite this in terms of R.compose

var take = R.curry(function take(count, o) {
return R.pick(R.take(count, R.keys(o)), o);
});
This function takes count keys from an object, in the order, in which they appear. I use it to limit a dataset which was grouped.
I understand that there are placeholder arguments, like R.__, but I can't wrap my head around this particular case.
This is possible thanks to R.converge, but I don't recommend going point-free in this case.
// take :: Number -> Object -> Object
var take = R.curryN(2,
R.converge(R.pick,
R.converge(R.take,
R.nthArg(0),
R.pipe(R.nthArg(1),
R.keys)),
R.nthArg(1)));
One thing to note is that the behaviour of this function is undefined since the order of the list returned by R.keys is undefined.
I agree with #davidchambers that it is probably better not to do this points-free. This solution is a bit cleaner than that one, but is still not to my mind as nice as your original:
// take :: Number -> Object -> Object
var take = R.converge(
R.pick,
R.useWith(R.take, R.identity, R.keys),
R.nthArg(1)
);
useWith and converge are similar in that they accept a number of function parameters and pass the result of calling all but the first one into that first one. The difference is that converge passes all the parameters it receives to each one, and useWith splits them up, passing one to each function. This is the first time I've seen a use for combining them, but it seems to make sense here.
That property ordering issue is supposed to be resolved in ES6 (final draft now out!) but it's still controversial.
Update
You mention that it will take some time to figure this out. This should help at least show how it's equivalent to your original function, if not how to derive it:
var take = R.converge(
R.pick,
R.useWith(R.take, R.identity, R.keys),
R.nthArg(1)
);
// definition of `converge`
(count, obj) => R.pick(R.useWith(R.take, R.identity, R.keys)(count, obj),
R.nthArg(1)(count, obj));
// definition of `nthArg`
(count, obj) => R.pick(R.useWith(R.take, R.identity, R.keys)(count, obj), obj);
// definition of `useWith`
(count, obj) => R.pick(R.take(R.identity(count), R.keys(obj)), obj);
// definition of `identity`
(count, obj) => R.pick(R.take(count, R.keys(obj)), obj);
Update 2
As of version 18, both converge and useWith have changed to become binary. Each takes a target function and a list of helper functions. That would change the above slightly to this:
// take :: Number -> Object -> Object
var take = R.converge(R.pick, [
R.useWith(R.take, [R.identity, R.keys]),
R.nthArg(1)
]);

Modifying self in `iter_mut().map(..)`, aka mutable functional collection operations

How do I convert something like this:
let mut a = vec![1, 2, 3, 4i32];
for i in a.iter_mut() {
*i += 1;
}
to a one line operation using map and a closure?
I tried:
a.iter_mut().map(|i| *i + 1).collect::<Vec<i32>>();
The above only works if I reassign it to a. Why is this? Is map getting a copy of a instead of a mutable reference? If so, how can I get a mutable reference?
Your code dereferences the variable (*i) then adds one to it. Nowhere in there does the original value get changed.
The best way to do what you asked is to use Iterator::for_each:
a.iter_mut().for_each(|i| *i += 1);
This gets an iterator of mutable references to the numbers in your vector. For each item, it dereferences the reference and then increments it.
You could use map and collect, but doing so is non-idiomatic and potentially wasteful. This uses map for the side-effect of mutating the original value. The "return value" of assignment is the unit type () - an empty tuple. We use collect::<Vec<()>> to force the Iterator adapter to iterate. This last bit ::<...> is called the turbofish and allows us to provide a type parameter to the collect call, informing it what type to use, as nothing else would constrain the return type.:
let _ = a.iter_mut().map(|i| *i += 1).collect::<Vec<()>>();
You could also use something like Iterator::count, which is lighter than creating a Vec, but still ultimately unneeded:
a.iter_mut().map(|i| *i += 1).count();
As Ry- says, using a for loop is more idiomatic:
for i in &mut a {
*i += 1;
}