Why are the strings in my iterator being concatenated? - iterator

My original goal is to fetch a list of words, one on each line, and to put them in a HashSet, while discarding comment lines and raising I/O errors properly. Given the file "stopwords.txt":
a
# this is actually a comment
of
the
this
I managed to make the code compile like this:
fn stopword_set() -> io::Result<HashSet<String>> {
let words = Result::from_iter(
BufReader::new(File::open("stopwords.txt")?)
.lines()
.filter(|r| match r {
&Ok(ref l) => !l.starts_with('#'),
_ => true
}));
Ok(HashSet::from_iter(words))
}
fn main() {
let set = stopword_set().unwrap();
println!("{:?}", set);
assert_eq!(set.len(), 4);
}
Here's a playground that also creates the file above.
I would expect to have a set of 4 strings at the end of the program. To my surprise, the function actually returns a set containing a single string with all words concatenated:
{"aofthethis"}
thread 'main' panicked at 'assertion failed: `(left == right)` (left: `1`, right: `4`)'
Led by a piece of advice in the docs for FromIterator, I got rid of all calls to from_iter and used collect instead (Playground), which has indeed solved the problem.
fn stopword_set() -> io::Result<HashSet<String>> {
BufReader::new(File::open("stopwords.txt")?)
.lines()
.filter(|r| match r {
&Ok(ref l) => !l.starts_with('#'),
_ => true
}).collect()
}
Why are the previous calls to from_iter leading to unexpected inferences, while collect() works just as intended?

A simpler reproduction:
use std::collections::HashSet;
use std::iter::FromIterator;
fn stopword_set() -> Result<HashSet<String>, u8> {
let input: Vec<Result<_, u8>> = vec![Ok("foo".to_string()), Ok("bar".to_string())];
let words = Result::from_iter(input.into_iter());
Ok(HashSet::from_iter(words))
}
fn main() {
let set = stopword_set().unwrap();
println!("{:?}", set);
assert_eq!(set.len(), 2);
}
The problem is that here, we are collecting from the iterator twice. The type of words is Result<_, u8>. However, Result also implements Iterator itself, so when we call from_iter on that at the end, the compiler sees that the Ok type must be String due to the method signature. Working backwards, you can construct a String from an iterator of Strings, so that's what the compiler picks.
Removing the second from_iter would solve it:
fn stopword_set() -> Result<HashSet<String>, u8> {
let input: Vec<Result<_, u8>> = vec![Ok("foo".to_string()), Ok("bar".to_string())];
Result::from_iter(input.into_iter())
}
Or for your original:
fn stopword_set() -> io::Result<HashSet<String>> {
Result::from_iter(
BufReader::new(File::open("stopwords.txt")?)
.lines()
.filter(|r| match r {
&Ok(ref l) => !l.starts_with('#'),
_ => true
}))
}
Of course, I'd normally recommend using collect instead, as I prefer the chaining:
fn stopword_set() -> io::Result<HashSet<String>> {
BufReader::new(File::open("stopwords.txt")?)
.lines()
.filter(|r| match r {
&Ok(ref l) => !l.starts_with('#'),
_ => true,
})
.collect()
}

Related

Conditionally Acting on Specific Error Using and_then in Rust

How do I know what error happened when piping Results via and_then in Rust? For example using nested matches I could do this:
use std::num::ParseIntError;
fn get_num(n: u8) -> Result<u8, ParseIntError> {
// Force a ParseIntError
"foo ".parse()
}
fn add_one(n: u8) -> Result<u8, ParseIntError> {
Ok(n + 1)
}
fn main() {
match get_num(1) {
Ok(n) => {
match add_one(n) {
Ok(n) => println!("adding one gives us: {}", n),
Err(why) => println!("Error adding: {}", why)
}
},
Err(why) => println!("Error getting number: {}", why),
}
}
Or I can use and_then to pipe the result of the first function to the second and avoid a nested match:
match get_num(1).and_then(add_one) {
Ok(n) => println!("Adding one gives us: {}", n),
Err(why) => println!("Some error: {}.", why)
}
How do I conditionally and idiomatically determine the error in the second form, using and_then? Do I have to type check the error? Sure I can display the error like I did above, but let's say I want to preface it with something related to the kind of error it is?
The normal way would be to use map_err to convert the initial errors to something richer you control e.g.
enum MyError {
Parsing,
Adding
}
get_num(1).map_err(|_| MyError::Parsing)
.and_then(|v| add_one(v).map_err(|_| MyError::Adding)
sadly as you can see this requires using a lambda for the and_then callback: and_then must return the same error type as the input, so here it has to yield a Result<_, MyError>, wjhich is not the return type of add_one.
And while in other situations we could recover somewhat with ?, here the same original error type is split into two different values.

Why does .flat_map() with .chars() not work with std::io::Lines, but does with a vector of Strings?

I am trying to iterate over characters in stdin. The Read.chars() method achieves this goal, but is unstable. The obvious alternative is to use Read.lines() with a flat_map to convert it to a character iterator.
This seems like it should work, but doesn't, resulting in borrowed value does not live long enough errors.
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
let mut lines = stdin.lock().lines();
let mut chars = lines.flat_map(|x| x.unwrap().chars());
}
This is mentioned in Read file character-by-character in Rust, but it does't really explain why.
What I am particularly confused about is how this differs from the example in the documentation for flat_map, which uses flat_map to apply .chars() to a vector of strings. I don't really see how that should be any different. The main difference I see is that my code needs to call unwrap() as well, but changing the last line to the following does not work either:
let mut chars = lines.map(|x| x.unwrap());
let mut chars = chars.flat_map(|x| x.chars());
It fails on the second line, so the issue doesn't appear to be the unwrap.
Why does this last line not work, when the very similar line in the documentation doesn't? Is there any way to get this to work?
Start by figuring out what the type of the closure's variable is:
let mut chars = lines.flat_map(|x| {
let () = x;
x.unwrap().chars()
});
This shows it's a Result<String, io::Error>. After unwrapping it, it will be a String.
Next, look at str::chars:
fn chars(&self) -> Chars
And the definition of Chars:
pub struct Chars<'a> {
// some fields omitted
}
From that, we can tell that calling chars on a string returns an iterator that has a reference to the string.
Whenever we have a reference, we know that the reference cannot outlive the thing that it is borrowed from. In this case, x.unwrap() is the owner. The next thing to check is where that ownership ends. In this case, the closure owns the String, so at the end of the closure, the value is dropped and any references are invalidated.
Except the code tried to return a Chars that still referred to the string. Oops. Thanks to Rust, the code didn't segfault!
The difference with the example that works is all in the ownership. In that case, the strings are owned by a vector outside of the loop and they do not get dropped before the iterator is consumed. Thus there are no lifetime issues.
What this code really wants is an into_chars method on String. That iterator could take ownership of the value and return characters.
Not the maximum efficiency, but a good start:
struct IntoChars {
s: String,
offset: usize,
}
impl IntoChars {
fn new(s: String) -> Self {
IntoChars { s: s, offset: 0 }
}
}
impl Iterator for IntoChars {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
let remaining = &self.s[self.offset..];
match remaining.chars().next() {
Some(c) => {
self.offset += c.len_utf8();
Some(c)
}
None => None,
}
}
}
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
let lines = stdin.lock().lines();
let chars = lines.flat_map(|x| IntoChars::new(x.unwrap()));
for c in chars {
println!("{}", c);
}
}
See also:
How can I store a Chars iterator in the same struct as the String it is iterating on?
Is there an owned version of String::chars?

Simplification possible in example using print! and flush?

I started programming Rust a couple of days ago by working through the official documentation. Now I'm trying to challenge my understanding of Rust by working through the book "Exercises for Programmers" by Brian P. Hogan (The Pragmatic Programmers).
The first exercise is to write a program that asks the user for a name and prints out a greeting using that name. Input, string concatenation and output should be done in three distinct steps.
What is your name? Patrick
Hello, Patrick, nice to meet you.
The name will be entered at the same line as the initial prompt. Here's my solution:
use std::io;
use std::io::Write;
fn main() {
print!("What is your name? ");
match io::stdout().flush() {
Ok(_) => print!(""),
Err(error) => println!("{}", error),
}
let mut name = String::new();
match io::stdin().read_line(&mut name) {
Ok(_) => {
name = name.trim().to_string();
if name.len() > 0 {
let greeting = "Hello, ".to_string() + &name + &", nice to meet you!".to_string();
println!("{}", greeting);
} else {
println!("No name entered, goodbye.");
}
}
Err(error) => println!("{}", error),
}
}
The print! macro doesn't actually output the prompt until I call flush. flush needs error handling, so I need both to handle the Ok and the Err case. In case of Ok, there's nothing useful to do, so I just print! an empty string.
Is there a shorter way to handle this? Maybe the error handling can be skipped or simplified somehow, or the whole print!/flush approach is the wrong one. (Everything works fine, but I could write this shorter in C, after all...)
As other people have said, make sure to read the error handling chapter.
In most cases, you don't want to use println! to report errors. Either you should return the error from your function and let the caller deal with it, or you should use panic! to abort that thread and potentially the process.
match io::stdout().flush() {
Ok(_) => print!(""),
Err(error) => println!("{}", error),
}
Instead of printing nothing (which is inefficient), you can just... do nothing:
match io::stdout().flush() {
Ok(_) => (),
Err(error) => println!("{}", error),
}
Since you don't care about the success case, you can use an if let:
if let Err(error) = io::stdout().flush() {
println!("{}", error);
}
Replacing the println with a panic! would be even better:
if let Err(error) = io::stdout().flush() {
panic!("{}", error);
}
This is almost exactly what Option::unwrap does (source), except it also returns the successful value when present:
pub fn unwrap(self) -> T {
match self {
Some(val) => val,
None => panic!("called `Option::unwrap()` on a `None` value"),
}
}
However, it's even better to use Option::expect which allows you to specify an additional error message:
io::stdout().flush().expect("Unable to flush stdout");
Applying that twice:
use std::io::{self, Write};
fn main() {
print!("What is your name? ");
io::stdout().flush().expect("Unable to flush stdout");
let mut name = String::new();
io::stdin()
.read_line(&mut name)
.expect("Unable to read the line");
let name = name.trim();
if !name.is_empty() {
println!("Hello, {}, nice to meet you!", name);
} else {
println!("No name entered, goodbye.");
}
}
Note that there's no need to re-allocate a String, you can just shadow name, and there's no need to use format just to print out stuff.
Since Rust 1.26.0, you could also choose to return a Result from main:
use std::io::{self, Write};
fn main() -> Result<(), io::Error> {
print!("What is your name? ");
io::stdout().flush()?;
let mut name = String::new();
io::stdin().read_line(&mut name)?;
let name = name.trim();
if !name.is_empty() {
println!("Hello, {}, nice to meet you!", name);
} else {
println!("No name entered, goodbye.");
}
Ok(())
}
but I could write this shorter in C, after all...
I would encourage / challenge you to attempt this. Note that every memory allocation in this program is checked, as is every failure case dealing with the standard output. Many people are not aware that C's printf returns an error code that you should be checking. Try outputting to a pipe that has been closed for an example.

How to compose mutable Iterators?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I would like to make an iterator that generates a stream of prime numbers. My general thought process was to wrap an iterator with successive filters so for example you start with
let mut n = (2..N)
Then for each prime number you mutate the iterator and add on a filter
let p1 = n.next()
n = n.filter(|&x| x%p1 !=0)
let p2 = n.next()
n = n.filter(|&x| x%p2 !=0)
I am trying to use the following code, but I can not seem to get it to work
struct Primes {
base: Iterator<Item = u64>,
}
impl<'a> Iterator for Primes<'a> {
type Item = u64;
fn next(&mut self) -> Option<u64> {
let p = self.base.next();
match p {
Some(n) => {
let prime = n.clone();
let step = self.base.filter(move |&: &x| {x%prime!=0});
self.base = &step as &Iterator<Item = u64>;
Some(n)
},
_ => None
}
}
}
I have toyed with variations of this, but I can't seem to get lifetimes and types to match up. Right now the compiler is telling me
I can't mutate self.base
the variable prime doesn't live long enough
Here is the error I am getting
solution.rs:16:17: 16:26 error: cannot borrow immutable borrowed content `*self.base` as mutable
solution.rs:16 let p = self.base.next();
^~~~~~~~~
solution.rs:20:28: 20:37 error: cannot borrow immutable borrowed content `*self.base` as mutable
solution.rs:20 let step = self.base.filter(move |&: &x| {x%prime!=0});
^~~~~~~~~
solution.rs:21:30: 21:34 error: `step` does not live long enough
solution.rs:21 self.base = &step as &Iterator<Item = u64>;
^~~~
solution.rs:15:39: 26:6 note: reference must be valid for the lifetime 'a as defined on the block at 15:38...
solution.rs:15 fn next(&mut self) -> Option<u64> {
solution.rs:16 let p = self.base.next();
solution.rs:17 match p {
solution.rs:18 Some(n) => {
solution.rs:19 let prime = n.clone();
solution.rs:20 let step = self.base.filter(move |&: &x| {x%prime!=0});
...
solution.rs:20:71: 23:14 note: ...but borrowed value is only valid for the block suffix following statement 1 at 20:70
solution.rs:20 let step = self.base.filter(move |&: &x| {x%prime!=0});
solution.rs:21 self.base = &step as &Iterator<Item = u64>;
solution.rs:22 Some(n)
solution.rs:23 },
error: aborting due to 3 previous errors
Why won't Rust let me do this?
Here is a working version:
struct Primes<'a> {
base: Option<Box<Iterator<Item = u64> + 'a>>,
}
impl<'a> Iterator for Primes<'a> {
type Item = u64;
fn next(&mut self) -> Option<u64> {
let p = self.base.as_mut().unwrap().next();
p.map(|n| {
let base = self.base.take();
let step = base.unwrap().filter(move |x| x % n != 0);
self.base = Some(Box::new(step));
n
})
}
}
impl<'a> Primes<'a> {
#[inline]
pub fn new<I: Iterator<Item = u64> + 'a>(r: I) -> Primes<'a> {
Primes {
base: Some(Box::new(r)),
}
}
}
fn main() {
for p in Primes::new(2..).take(32) {
print!("{} ", p);
}
println!("");
}
I'm using a Box<Iterator> trait object. Boxing is unavoidable because the internal iterator must be stored somewhere between next() calls, and there is nowhere you can store reference trait objects.
I made the internal iterator an Option. This is necessary because you need to replace it with a value which consumes it, so it is possible that the internal iterator may be "absent" from the structure for a short time. Rust models absence with Option. Option::take replaces the value it is called on with None and returns whatever was there. This is useful when shuffling non-copyable objects around.
Note, however, that this sieve implementation is going to be both memory and computationally inefficient - for each prime you're creating an additional layer of iterators which takes heap space. Also the depth of stack when calling next() grows linearly with the number of primes, so you will get a stack overflow on a sufficiently large number:
fn main() {
println!("{}", Primes::new(2..).nth(10000).unwrap());
}
Running it:
% ./test1
thread '<main>' has overflowed its stack
zsh: illegal hardware instruction (core dumped) ./test1

Implementing Decodable for a wrapper around a fixed size vector

Background: the serialize crate is undocumented, deriving Decodable doesn't work. I've also looked at existing implementations for other types and find the code difficult to follow.
How does the decoding process work, and how do I implement Decodable for this struct?
pub struct Grid<A> {
data: [[A,..GRIDW],..GRIDH]
}
The reason why #[deriving(Decodable)] doesn't work is that [A,..GRIDW] doesn't implement Decodable, and it's impossible to implement a trait for a type when both are defined outside of this crate, which is the case here. So the only solution I can see is to manually implement Decodable for Grid.
And this is as far as I've gotten
impl <A: Decodable<D, E>, D: Decoder<E>, E> Decodable<D, E> for Grid<A> {
fn decode(decoder: &mut D) -> Result<Grid<A>, E> {
decoder.read_struct("Grid", 1u, ref |d| Ok(Grid {
data: match d.read_struct_field("data", 0u, ref |d| Decodable::decode(d)) {
Ok(e) => e,
Err(e) => return Err(e)
},
}))
}
}
Which gives an error at Decodable::decode(d)
error: failed to find an implementation of trait
serialize::serialize::Decodable for [[A, .. 20], .. 20]
It's not really possible to do this nicely at the moment for a variety of reasons:
We can't be generic over the length of a fixed length array (the fundamental issue)
The current trait coherence restrictions means we can't write a custom trait MyDecodable<D, E> { ... } with impl MyDecodable<D, E> for [A, .. GRIDW] (and one for GRIDH) and a blanket implementation impl<A: Decodable<D, E>> MyDecodable<D, E> for A. This forces a trait-based solution into using an intermediary type, which then makes the compiler's type inference rather unhappy and AFAICT impossible to satisfy.
We don't have associated types (aka "output types"), which I think would allow the type inference to be slightly sane.
Thus, for now, we're left with a manual implementation. :(
extern crate serialize;
use std::default::Default;
use serialize::{Decoder, Decodable};
static GRIDW: uint = 10;
static GRIDH: uint = 5;
fn decode_grid<E, D: Decoder<E>,
A: Copy + Default + Decodable<D, E>>(d: &mut D)
-> Result<Grid<A>, E> {
// mirror the Vec implementation: try to read a sequence
d.read_seq(|d, len| {
// check it's the required length
if len != GRIDH {
return Err(
d.error(format!("expecting length {} but found {}",
GRIDH, len).as_slice()));
}
// create the array with empty values ...
let mut array: [[A, .. GRIDW], .. GRIDH]
= [[Default::default(), .. GRIDW], .. GRIDH];
// ... and fill it in progressively ...
for (i, outer) in array.mut_iter().enumerate() {
// ... by reading each outer element ...
try!(d.read_seq_elt(i, |d| {
// ... as a sequence ...
d.read_seq(|d, len| {
// ... of the right length,
if len != GRIDW { return Err(d.error("...")) }
// and then read each element of that sequence as the
// elements of the grid.
for (j, inner) in outer.mut_iter().enumerate() {
*inner = try!(d.read_seq_elt(j, Decodable::decode));
}
Ok(())
})
}));
}
// all done successfully!
Ok(Grid { data: array })
})
}
pub struct Grid<A> {
data: [[A,..GRIDW],..GRIDH]
}
impl<E, D: Decoder<E>, A: Copy + Default + Decodable<D, E>>
Decodable<D, E> for Grid<A> {
fn decode(d: &mut D) -> Result<Grid<A>, E> {
d.read_struct("Grid", 1, |d| {
d.read_struct_field("data", 0, decode_grid)
})
}
}
fn main() {}
playpen.
It's also possible to write a more "generic" [T, .. n] decoder by using macros to instantiate each version, with special control over how the recursive decoding is handled to allow nested fixed-length arrays to be handled (as required for Grid); this requires somewhat less code (especially with more layers, or a variety of different lengths), but the macro solution:
may be harder to understand, and
the one I give there may be less efficient (there's a new array variable created for every fixed length array, including new Defaults, while the non-macro solution above just uses a single array and thus only calls Default::default once for each element in the grid). It may be possible to expand to a similar set of recursive loops, but I'm not sure.