Return lazy iterator that depends on data allocated within the function

I am new to Rust and reading The Rust Programming Language, and in the Error Handling section there is a "case study" describing a program to read data from a CSV file using the csv and rustc-serialize libraries (using getopts for argument parsing).
The author writes a function search that steps through the rows of the CSV file using a csv::Reader object, collects those entries whose 'city' field matches a specified value into a vector, and returns it. I've taken a slightly different approach than the author, but this should not affect my question. My (working) function looks like this:
extern crate csv;
extern crate rustc_serialize;
use std::path::Path;
use std::fs::File;
fn search<P>(data_path: P, city: &str) -> Vec<DataRow>
where P: AsRef<Path>
{
let file = File::open(data_path).expect("Opening file failed!");
let mut reader = csv::Reader::from_reader(file).has_headers(true);
reader.decode()
.map(|row| row.expect("Failed decoding row"))
.filter(|row: &DataRow| row.city == city)
.collect()
}
where the DataRow type is just a record,
#[derive(Debug, RustcDecodable)]
struct DataRow {
country: String,
city: String,
accent_city: String,
region: String,
population: Option<u64>,
latitude: Option<f64>,
longitude: Option<f64>
}
Now, the author poses, as the dreaded "exercise to the reader", the problem of modifying this function to return an iterator instead of a vector (eliminating the call to collect). My question is: How can this be done at all, and what are the most concise and idiomatic ways of doing it?
A simple attempt that I think gets the type signature right is
fn search_iter<'a,P>(data_path: P, city: &'a str)
-> Box<Iterator<Item=DataRow> + 'a>
where P: AsRef<Path>
{
let file = File::open(data_path).expect("Opening file failed!");
let mut reader = csv::Reader::from_reader(file).has_headers(true);
Box::new(reader.decode()
.map(|row| row.expect("Failed decoding row"))
.filter(|row: &DataRow| row.city == city))
}
I return a trait object of type Box<Iterator<Item=DataRow> + 'a> so as not to have to expose the internal Filter type, and where the lifetime 'a is introduced just to avoid having to make a local clone of city. But this fails to compile because reader does not live long enough; it's allocated on the stack and so is deallocated when the function returns.
I guess this means that reader has to be allocated on the heap (i.e. boxed) from the beginning, or somehow moved off the stack before the function ends. If I were returning a closure, this is exactly the problem that would be solved by making it a move closure. But I don't know how to do something similar when I'm not returning a function. I've tried defining a custom iterator type containing the needed data, but I couldn't get it to work, and it kept getting uglier and more contrived (don't make too much of this code, I'm only including it to show the general direction of my attempts):
fn search_iter<'a,P>(data_path: P, city: &'a str)
-> Box<Iterator<Item=DataRow> + 'a>
where P: AsRef<Path>
{
struct ResultIter<'a> {
reader: csv::Reader<File>,
wrapped_iterator: Option<Box<Iterator<Item=DataRow> + 'a>>
}
impl<'a> Iterator for ResultIter<'a> {
type Item = DataRow;
fn next(&mut self) -> Option<DataRow>
{ self.wrapped_iterator.unwrap().next() }
}
let file = File::open(data_path).expect("Opening file failed!");
// Incrementally initialise
let mut result_iter = ResultIter {
reader: csv::Reader::from_reader(file).has_headers(true),
wrapped_iterator: None // Uninitialised
};
result_iter.wrapped_iterator =
Some(Box::new(result_iter.reader
.decode()
.map(|row| row.expect("Failed decoding row"))
.filter(|&row: &DataRow| row.city == city)));
Box::new(result_iter)
}
This question seems to concern the same problem, but the author of the answer solves it by making the concerned data static, which I don't think is an alternative for this question.
I am using Rust 1.10.0, the current stable version from the Arch Linux package rust.

CSV 1.0
As I alluded to in the answer for older versions of the crate, the best way of solving this is for the CSV crate to have an owning iterator, which it now does: DeserializeRecordsIntoIter
use csv::ReaderBuilder; // 1.1.1
use serde::Deserialize; // 1.0.104
use std::{fs::File, path::Path};
#[derive(Debug, Deserialize)]
struct DataRow {
country: String,
city: String,
accent_city: String,
region: String,
population: Option<u64>,
latitude: Option<f64>,
longitude: Option<f64>,
}
fn search_iter(data_path: impl AsRef<Path>, city: &str) -> impl Iterator<Item = DataRow> + '_ {
let file = File::open(data_path).expect("Opening file failed");
ReaderBuilder::new()
.has_headers(true)
.from_reader(file)
.into_deserialize::<DataRow>()
.map(|row| row.expect("Failed decoding row"))
.filter(move |row| row.city == city)
}
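The same owning-iterator idea can be sketched with only the standard library. Here BufRead::lines plays the role of into_deserialize: it consumes the reader, so the returned iterator owns everything it needs and nothing borrows from the function's stack frame. The search_lines function and the "city,population" line layout are assumptions made up for this illustration; they are not part of the csv crate.

```rust
use std::io::{BufRead, BufReader, Cursor};

// Hypothetical stand-in for the CSV search: each line is "city,population".
fn search_lines<'a>(data: String, city: &'a str) -> impl Iterator<Item = String> + 'a {
    // The reader is created inside the function ...
    let reader = BufReader::new(Cursor::new(data));
    // ... but `lines()` takes it by value, so the iterator owns the reader.
    reader
        .lines()
        .map(|line| line.expect("Failed reading line"))
        // `move` copies the `&str` into the closure instead of borrowing the local.
        .filter(move |line| line.split(',').next() == Some(city))
}

fn main() {
    let data = String::from("Oslo,1\nBergen,2\nOslo,3\n");
    let hits: Vec<String> = search_lines(data, "Oslo").collect();
    println!("{:?}", hits); // ["Oslo,1", "Oslo,3"]
}
```

The only borrow left in the return type is the city string, which is why the signature still carries a lifetime.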
Before version 1.0
The most direct way to convert the original function would be to simply wrap the iterator. However, doing so directly will lead to problems because you cannot return an object that refers to itself, and the result of decode refers to the Reader. Even if you could surmount that, you cannot have an iterator return references to itself.
One solution is to simply re-create the DecodedRecords iterator for each call to your new iterator:
fn search_iter<'a, P>(data_path: P, city: &'a str) -> MyIter<'a>
where
P: AsRef<Path>,
{
let file = File::open(data_path).expect("Opening file failed!");
MyIter {
reader: csv::Reader::from_reader(file).has_headers(true),
city: city,
}
}
struct MyIter<'a> {
reader: csv::Reader<File>,
city: &'a str,
}
impl<'a> Iterator for MyIter<'a> {
type Item = DataRow;
fn next(&mut self) -> Option<Self::Item> {
let city = self.city;
self.reader
.decode()
.map(|row| row.expect("Failed decoding row"))
.filter(|row: &DataRow| row.city == city)
.next()
}
}
This could have overhead associated with it, depending on the implementation of decode. Additionally, this might "rewind" back to the beginning of the input — if you substituted a Vec instead of a csv::Reader, you would see this. However, it happens to work in this case.
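The "rewind" behaviour mentioned above is easy to reproduce with a Vec. This is a minimal sketch; the Rewinding type is hypothetical and exists only to show what happens when the inner iterator is rebuilt on every call to next.

```rust
// Hypothetical iterator that re-creates its inner iterator on every call.
struct Rewinding {
    data: Vec<u32>,
}

impl Iterator for Rewinding {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        // A fresh slice iterator always starts at index 0, so no progress is kept.
        self.data.iter().copied().next()
    }
}

fn main() {
    let it = Rewinding { data: vec![1, 2, 3] };
    // `take` is needed because this iterator never ends.
    let seen: Vec<u32> = it.take(3).collect();
    println!("{:?}", seen); // [1, 1, 1]
}
```

A reader over a file behaves differently because the underlying file position advances as a side effect, which is the reason the approach happens to work for csv::Reader.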
Beyond that, I'd normally open the file and create the csv::Reader outside of the function and pass in the DecodedRecords iterator and transform it, returning a newtype / box / type alias around the underlying iterator. I prefer this because the structure of your code mirrors the lifetimes of the objects.
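As a rough illustration of that structure, the sketch below lets the caller own the reader and only transforms an iterator that is passed in. It uses the standard library's line iterator instead of DecodedRecords; filter_city and the "city,population" layout are assumptions for this example.

```rust
use std::io::{self, BufRead, Cursor};

// Hypothetical transformer: takes any iterator of rows and returns a filtered one.
fn filter_city<'a, I>(rows: I, city: &'a str) -> impl Iterator<Item = String> + 'a
where
    I: Iterator<Item = io::Result<String>> + 'a,
{
    rows.map(|row| row.expect("Failed reading row"))
        .filter(move |row| row.split(',').next() == Some(city))
}

fn main() {
    // The caller creates the reader, so its lifetime is visible at the call site.
    let reader = Cursor::new("Oslo,1\nBergen,2\n");
    let hits: Vec<String> = filter_city(reader.lines(), "Oslo").collect();
    println!("{:?}", hits); // ["Oslo,1"]
}
```

With this shape the function never creates the data it iterates over, so the question of returning a reference to a local does not arise.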
I'm a little surprised that there isn't an implementation of IntoIterator for csv::Reader, which would also solve the problem because there would not be any references.
See also:
How can I store a Chars iterator in the same struct as the String it is iterating on?
Is there an owned version of String::chars?
What is the correct way to return an Iterator (or any other trait)?

Related

Testing serialize/deserialize functions for serde "with" attribute

Serde derive macros come with the ability to control how a field is serialized/deserialized through the #[serde(with = "module")] field attribute. The "module" should have serialize and deserialize functions with the right arguments and return types.
An example that unfortunately got a bit too contrived:
use serde::{Deserialize, Serialize};
#[derive(Debug, Default, PartialEq, Eq)]
pub struct StringPair(String, String);
mod stringpair_serde {
pub fn serialize<S>(sp: &super::StringPair, ser: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
ser.serialize_str(format!("{}:{}", sp.0, sp.1).as_str())
}
pub fn deserialize<'de, D>(d: D) -> Result<super::StringPair, D::Error>
where
D: serde::Deserializer<'de>,
{
d.deserialize_str(Visitor)
}
struct Visitor;
impl<'de> serde::de::Visitor<'de> for Visitor {
type Value = super::StringPair;
fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, "a pair of strings separated by colon (:)")
}
fn visit_str<E>(self, s: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Ok(s.split_once(":")
.map(|tup| super::StringPair(tup.0.to_string(), tup.1.to_string()))
.unwrap_or(Default::default()))
}
}
}
#[derive(Serialize, Deserialize)]
struct UsesStringPair {
// Other fields ...
#[serde(with = "stringpair_serde")]
pub stringpair: StringPair,
}
fn main() {
let usp = UsesStringPair {
stringpair: StringPair("foo".to_string(), "bar".to_string()),
};
assert_eq!(
serde_json::json!(&usp).to_string(),
r#"{"stringpair":"foo:bar"}"#
);
let usp: UsesStringPair = serde_json::from_str(r#"{"stringpair":"baz:qux"}"#).unwrap();
assert_eq!(
usp.stringpair,
StringPair("baz".to_string(), "qux".to_string())
)
}
Testing the derived serialization for UsesStringPair is trivial with simple assertions. I have also looked at the serde_test example, which makes sense to me too.
However, I want to be able to independently test the stringpair_serde::{serialize, deserialize} functions (e.g. if my crate provides just mycrate::StringPair and mycrate::stringpair_serde, and UsesStringPair is for the crate users to implement).
One way I've looked into is creating a serde_json::Serializer (using new, which requires an io::Write implementation that I couldn't figure out how to create and use trivially, but that's a separate question) and calling serialize with the created Serializer, then making assertions on the result as before. However, that does not test any/all implementations of serde::Serializer, just the one provided in serde_json.
I'm wondering if there's a method like in the serde_test example that works for ser/deser functions provided by a module.

Why does .flat_map() with .chars() not work with std::io::Lines, but does with a vector of Strings?

I am trying to iterate over characters in stdin. The Read::chars() method achieves this goal, but is unstable. The obvious alternative is to use BufRead::lines() with a flat_map to convert it to a character iterator.
This seems like it should work, but doesn't, resulting in borrowed value does not live long enough errors.
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
let mut lines = stdin.lock().lines();
let mut chars = lines.flat_map(|x| x.unwrap().chars());
}
This is mentioned in Read file character-by-character in Rust, but it doesn't really explain why.
What I am particularly confused about is how this differs from the example in the documentation for flat_map, which uses flat_map to apply .chars() to a vector of strings. I don't really see how that should be any different. The main difference I see is that my code needs to call unwrap() as well, but changing the last line to the following does not work either:
let mut chars = lines.map(|x| x.unwrap());
let mut chars = chars.flat_map(|x| x.chars());
It fails on the second line, so the issue doesn't appear to be the unwrap.
Why does this last line not work, when the very similar line in the documentation does? Is there any way to get this to work?
Start by figuring out what the type of the closure's variable is:
let mut chars = lines.flat_map(|x| {
let () = x;
x.unwrap().chars()
});
This shows it's a Result<String, io::Error>. After unwrapping it, it will be a String.
Next, look at str::chars:
fn chars(&self) -> Chars
And the definition of Chars:
pub struct Chars<'a> {
// some fields omitted
}
From that, we can tell that calling chars on a string returns an iterator that has a reference to the string.
Whenever we have a reference, we know that the reference cannot outlive the thing that it is borrowed from. In this case, x.unwrap() is the owner. The next thing to check is where that ownership ends. In this case, the closure owns the String, so at the end of the closure, the value is dropped and any references are invalidated.
Except the code tried to return a Chars that still referred to the string. Oops. Thanks to Rust, the code didn't segfault!
The difference with the example that works is all in the ownership. In that case, the strings are owned by a vector outside of the loop and they do not get dropped before the iterator is consumed. Thus there are no lifetime issues.
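For contrast, here is a minimal sketch of the case that does compile. The strings live in a slice owned by the caller, so every Chars iterator borrows from data that outlives the whole chain. The all_chars helper is a made-up name for this illustration.

```rust
// Hypothetical helper: the strings are owned outside the iterator chain.
fn all_chars(words: &[String]) -> Vec<char> {
    // `iter()` yields `&String`, so `chars()` borrows from `words`,
    // which is still alive after the closure returns.
    words.iter().flat_map(|s| s.chars()).collect()
}

fn main() {
    let words = vec![String::from("ab"), String::from("cd")];
    println!("{:?}", all_chars(&words)); // ['a', 'b', 'c', 'd']
}
```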
What this code really wants is an into_chars method on String. That iterator could take ownership of the value and return characters.
Not the maximum efficiency, but a good start:
struct IntoChars {
s: String,
offset: usize,
}
impl IntoChars {
fn new(s: String) -> Self {
IntoChars { s: s, offset: 0 }
}
}
impl Iterator for IntoChars {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
let remaining = &self.s[self.offset..];
match remaining.chars().next() {
Some(c) => {
self.offset += c.len_utf8();
Some(c)
}
None => None,
}
}
}
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
let lines = stdin.lock().lines();
let chars = lines.flat_map(|x| IntoChars::new(x.unwrap()));
for c in chars {
println!("{}", c);
}
}
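A shorter alternative, sketched here under the assumption that one temporary Vec<char> per line is acceptable, is to collect each line's characters into an owned vector inside the closure. A Vec<char> is itself IntoIterator, so flat_map can consume it without any borrow of the dropped String. The chars_from function is hypothetical, and a Cursor stands in for stdin so the example is self-contained.

```rust
use std::io::{BufRead, Cursor};

// Hypothetical helper: yields every character of every line of `reader`.
fn chars_from<R: BufRead>(reader: R) -> impl Iterator<Item = char> {
    reader.lines().flat_map(|line| {
        // Collecting copies the chars out, so nothing refers to `line` afterwards.
        line.expect("Failed reading line")
            .chars()
            .collect::<Vec<char>>()
    })
}

fn main() {
    // `lines()` strips the line terminators, so they do not appear in the output.
    let text: String = chars_from(Cursor::new("ab\ncd\n")).collect();
    println!("{}", text); // abcd
}
```

This trades an extra allocation per line for not having to write a custom iterator type.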
See also:
How can I store a Chars iterator in the same struct as the String it is iterating on?
Is there an owned version of String::chars?

Convert vector of enum values into an another vector

I have the following code which generates a vector of bytes from the passed vector of enum values:
#[derive(Debug, PartialEq)]
pub enum BertType {
SmallInteger(u8),
Integer(i32),
Float(f64),
String(String),
Boolean(bool),
Tuple(BertTuple),
}
#[derive(Debug, PartialEq)]
pub struct BertTuple {
pub values: Vec<BertType>
}
pub struct Serializer;
pub trait Serialize<T> {
fn to_bert(&self, data: T) -> Vec<u8>;
}
impl Serializer {
fn enum_value_to_binary(&self, enum_value: BertType) -> Vec<u8> {
match enum_value {
BertType::SmallInteger(value_u8) => self.to_bert(value_u8),
BertType::Integer(value_i32) => self.to_bert(value_i32),
BertType::Float(value_f64) => self.to_bert(value_f64),
BertType::String(string) => self.to_bert(string),
BertType::Boolean(boolean) => self.to_bert(boolean),
BertType::Tuple(tuple) => self.to_bert(tuple),
}
}
}
// some functions for serialize bool/integer/etc. into Vec<u8>
// ...
impl Serialize<BertTuple> for Serializer {
fn to_bert(&self, data: BertTuple) -> Vec<u8> {
let mut binary: Vec<u8> = data.values
.iter()
.map(|&item| self.enum_value_to_binary(item)) // <-- what is the issue here?
.collect();
let arity = data.values.len();
match arity {
0...255 => self.get_small_tuple(arity as u8, binary),
_ => self.get_large_tuple(arity as i32, binary),
}
}
}
But when compiling, I receive an error with iterating around map:
error: the trait bound `std::vec::Vec<u8>: std::iter::FromIterator<std::vec::Vec<u8>>` is not satisfied [E0277]
.collect();
^~~~~~~
help: run `rustc --explain E0277` to see a detailed explanation
note: a collection of type `std::vec::Vec<u8>` cannot be built from an iterator over elements of type `std::vec::Vec<u8>`
error: aborting due to previous error
error: Could not compile `bert-rs`.
How can I fix this issue with std::iter::FromIterator?
The problem is that enum_value_to_binary returns a Vec<u8> for each element in values. So you end up with an Iterator<Item=Vec<u8>> and you call collect::<Vec<u8>>() on that, but it doesn't know how to flatten the nested vectors. If you want all the values to be flattened into one Vec<u8>, then you should use flat_map instead of map:
let mut binary: Vec<u8> = data.values
.iter()
.flat_map(|item| self.enum_value_to_binary(item).into_iter())
.collect();
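A self-contained sketch of the flattening may make the difference from map clearer. The Value enum is a cut-down, hypothetical stand-in for BertType, the tag bytes are invented for the example, and encode takes its argument by reference so that it can be used with iter().

```rust
// Hypothetical, cut-down stand-in for BertType.
enum Value {
    Small(u8),
    Flag(bool),
}

// Takes the value by reference so it works with `iter()`.
// The tag bytes 97 and 100 are made up for this sketch.
fn encode(value: &Value) -> Vec<u8> {
    match value {
        Value::Small(n) => vec![97, *n],
        Value::Flag(b) => vec![100, *b as u8],
    }
}

fn encode_all(values: &[Value]) -> Vec<u8> {
    // `map` would yield one Vec<u8> per element; `flat_map` splices them together.
    values.iter().flat_map(|v| encode(v)).collect()
}

fn main() {
    let values = [Value::Small(7), Value::Flag(true)];
    println!("{:?}", encode_all(&values)); // [97, 7, 100, 1]
}
```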
Or, slightly more idiomatic and performant, you can just have enum_value_to_binary return an iterator directly.
Also, the iter method returns an Iterator<Item=&'a T>, which means you are just borrowing the elements, but self.enum_value_to_binary wants to take ownership of the value. There are a couple of ways to fix that. One option would be to use into_iter instead of iter, which will give you the elements by value. If you do that, you'll need to move the arity variable up to before the binary variable, since creating the binary variable will take ownership of (move) data.values.
The other option would be to change self.enum_value_to_binary to take its argument by reference.
It's also possible that you meant for the type of binary to actually be Vec<Vec<u8>>.
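The into_iter option, including the reordering of arity, could look roughly like the following. Value, encode and encode_tuple are hypothetical simplifications of the question's types, and the single length byte is an assumption that only holds for fewer than 256 elements.

```rust
// Hypothetical, cut-down stand-in for BertType.
enum Value {
    Small(u8),
    Text(String),
}

// Takes ownership of the value, like enum_value_to_binary in the question.
fn encode(value: Value) -> Vec<u8> {
    match value {
        Value::Small(n) => vec![n],
        Value::Text(s) => s.into_bytes(),
    }
}

fn encode_tuple(values: Vec<Value>) -> Vec<u8> {
    // Read the length first: `into_iter` below moves `values`.
    let arity = values.len();
    let body = values.into_iter().flat_map(encode);
    // Assumption for this sketch: arity fits in one byte.
    let mut out = vec![arity as u8];
    out.extend(body);
    out
}

fn main() {
    let bytes = encode_tuple(vec![Value::Small(1), Value::Text("hi".to_string())]);
    println!("{:?}", bytes); // [2, 1, 104, 105]
}
```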

Implementing a "cautious" take_while using Peekable

I'd like to use Peekable as the basis for a new cautious_take_while operation that acts like take_while from IteratorExt but without consuming the first failed item. (There's a side question of whether this is a good idea, and whether there are better ways to accomplish this goal in Rust -- I'd be happy for hints in that direction, but mostly I'm trying to understand where my code is breaking).
The API I'm trying to enable is basically:
let mut chars = "abcdefg.".chars().peekable();
let abc : String = chars.by_ref().cautious_take_while(|&x| x != 'd');
let defg : String = chars.by_ref().cautious_take_while(|&x| x != '.');
// yielding (abc = "abc", defg = "defg")
I've taken a crack at creating a MCVE here, but I'm getting:
<anon>:10:5: 10:19 error: cannot move out of borrowed content
<anon>:10 chars.by_ref().cautious_take_while(|&x| x != '.');
As far as I can tell, I'm following the same pattern as Rust's own TakeWhile in terms of my function signatures, but I'm seeing different behavior from the borrow checker. Can someone point out what I'm doing wrong?
The funny thing with by_ref() is that it returns a mutable reference to itself:
pub trait IteratorExt: Iterator + Sized {
fn by_ref(&mut self) -> &mut Self { self }
}
It works because the Iterator trait is implemented for the mutable pointer to Iterator type. Smart!
impl<'a, I> Iterator for &'a mut I where I: Iterator, I: ?Sized { ... }
The standard take_while function works because it uses the trait Iterator, that is automatically resolved to &mut Peekable<T>.
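A small sketch of by_ref with the standard take_while shows both points at once: the adaptor runs on a &mut borrow, so the original iterator is usable afterwards, and the first item that fails the predicate is consumed, which is the behaviour the question wants to avoid. The head_and_rest function is a made-up name for this example.

```rust
// Hypothetical helper returning what take_while yielded and what was left over.
fn head_and_rest() -> (Vec<u32>, Vec<u32>) {
    let mut numbers = 1..=6;
    // `by_ref()` gives `&mut RangeInclusive<u32>`, which is itself an Iterator.
    let head: Vec<u32> = numbers.by_ref().take_while(|&x| x < 3).collect();
    // `numbers` was only borrowed, so it can still be used here.
    let rest: Vec<u32> = numbers.collect();
    (head, rest)
}

fn main() {
    let (head, rest) = head_and_rest();
    // 3 was pulled out to test the predicate and is lost.
    println!("{:?} {:?}", head, rest); // [1, 2] [4, 5, 6]
}
```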
But your code does not work because Peekable is a struct, not a trait, so your CautiousTakeWhileable must specify the type, and you are trying to take ownership of it, but you cannot, because you have a mutable pointer.
The solution: do not take a Peekable<T> but a &mut Peekable<T>. You will need to specify the lifetime too:
impl <'a, T: Iterator, P> Iterator for CautiousTakeWhile<&'a mut Peekable<T>, P>
where P: FnMut(&T::Item) -> bool {
//...
}
impl <'a, T: Iterator> CautiousTakeWhileable for &'a mut Peekable<T> {
fn cautious_take_while<P>(self, f: P) -> CautiousTakeWhile<&'a mut Peekable<T>, P>
where P: FnMut(&T::Item) -> bool {
CautiousTakeWhile{inner: self, condition: f,}
}
}
A curious side effect of this solution is that now by_ref is not needed, because cautious_take_while() takes a mutable reference, so it does not steal ownership. The by_ref() call is needed for take_while() because it can take either Peekable<T> or &mut Peekable<T>, and it defaults to the first one. With the by_ref() call it will resolve to the second one.
And now that I finally understand it, I think it might be a good idea to change the definition of struct CautiousTakeWhile to include the peekable bit into the struct itself. The difficulty is that the lifetime has to be specified manually, if I'm right. Something like:
struct CautiousTakeWhile<'a, T: Iterator + 'a, P>
where T::Item : 'a {
inner: &'a mut Peekable<T>,
condition: P,
}
trait CautiousTakeWhileable<'a, T>: Iterator {
fn cautious_take_while<P>(self, P) -> CautiousTakeWhile<'a, T, P> where
P: FnMut(&Self::Item) -> bool;
}
and the rest is more or less straightforward.
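For readers on current Rust, one way to fill in "the rest" is sketched below. It is not the code from the question's MCVE: the trait name CautiousTakeWhileExt is an assumption, the method takes &mut self so by_ref is unnecessary, and the elided lifetime '_ replaces the explicit 'a in the method signature.

```rust
use std::iter::Peekable;

struct CautiousTakeWhile<'a, I: Iterator, P> {
    inner: &'a mut Peekable<I>,
    condition: P,
}

impl<'a, I, P> Iterator for CautiousTakeWhile<'a, I, P>
where
    I: Iterator,
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Peek first, so a failing item stays in the underlying iterator.
        let keep = match self.inner.peek() {
            Some(item) => (self.condition)(item),
            None => false,
        };
        if keep { self.inner.next() } else { None }
    }
}

// Hypothetical extension trait; the name is not from the original post.
trait CautiousTakeWhileExt<I: Iterator> {
    fn cautious_take_while<P>(&mut self, condition: P) -> CautiousTakeWhile<'_, I, P>
    where
        P: FnMut(&I::Item) -> bool;
}

impl<I: Iterator> CautiousTakeWhileExt<I> for Peekable<I> {
    fn cautious_take_while<P>(&mut self, condition: P) -> CautiousTakeWhile<'_, I, P>
    where
        P: FnMut(&I::Item) -> bool,
    {
        CautiousTakeWhile { inner: self, condition }
    }
}

fn main() {
    let mut chars = "abcdefg.".chars().peekable();
    let abc: String = chars.cautious_take_while(|&x| x != 'd').collect();
    let defg: String = chars.cautious_take_while(|&x| x != '.').collect();
    println!("{}, {}", abc, defg); // abc, defg
}
```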
This was a tricky one! I'll lead with the meat of the code, then attempt to explain it (if I understand it...). It's also the ugly, unsugared version, as I wanted to reduce incidental complexity.
use std::iter::Peekable;
fn main() {
let mut chars = "abcdefg.".chars().peekable();
let abc: String = CautiousTakeWhile{inner: chars.by_ref(), condition: |&x| x != 'd'}.collect();
let defg: String = CautiousTakeWhile{inner: chars.by_ref(), condition: |&x| x != '.'}.collect();
println!("{}, {}", abc, defg);
}
struct CautiousTakeWhile<'a, I, P> //'
where I::Item: 'a, //'
I: Iterator + 'a, //'
P: FnMut(&I::Item) -> bool,
{
inner: &'a mut Peekable<I>, //'
condition: P,
}
impl<'a, I, P> Iterator for CautiousTakeWhile<'a, I, P>
where I::Item: 'a, //'
I: Iterator + 'a, //'
P: FnMut(&I::Item) -> bool
{
type Item = I::Item;
fn next(&mut self) -> Option<I::Item> {
let return_next =
match self.inner.peek() {
Some(ref v) => (self.condition)(v),
_ => false,
};
if return_next { self.inner.next() } else { None }
}
}
Actually, Rodrigo seems to have a good explanation, so I'll defer to that, unless you'd like me to explain something specific.
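On the side question of whether there is a simpler route: newer standard libraries provide Peekable::next_if, which only advances when a predicate holds. Combined with std::iter::from_fn it gives the same non-consuming behaviour without a custom adaptor. The sketch below assumes a toolchain recent enough to have next_if; take_until is a hypothetical helper name.

```rust
use std::iter::Peekable;

// Hypothetical helper: collects chars up to, but not including, `stop`.
fn take_until<I>(iter: &mut Peekable<I>, stop: char) -> String
where
    I: Iterator<Item = char>,
{
    // `next_if` returns None without consuming when the predicate fails,
    // which ends the `from_fn` iterator and leaves `stop` in place.
    std::iter::from_fn(|| iter.next_if(|&c| c != stop)).collect()
}

fn main() {
    let mut chars = "abcdefg.".chars().peekable();
    let abc = take_until(&mut chars, 'd');
    let defg = take_until(&mut chars, '.');
    println!("{}, {}", abc, defg); // abc, defg
}
```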

Implementing Decodable for a wrapper around a fixed size vector

Background: the serialize crate is undocumented, deriving Decodable doesn't work. I've also looked at existing implementations for other types and find the code difficult to follow.
How does the decoding process work, and how do I implement Decodable for this struct?
pub struct Grid<A> {
data: [[A,..GRIDW],..GRIDH]
}
The reason why #[deriving(Decodable)] doesn't work is that [A,..GRIDW] doesn't implement Decodable, and it's impossible to implement a trait for a type when both are defined outside of this crate, which is the case here. So the only solution I can see is to manually implement Decodable for Grid.
And this is as far as I've gotten
impl <A: Decodable<D, E>, D: Decoder<E>, E> Decodable<D, E> for Grid<A> {
fn decode(decoder: &mut D) -> Result<Grid<A>, E> {
decoder.read_struct("Grid", 1u, ref |d| Ok(Grid {
data: match d.read_struct_field("data", 0u, ref |d| Decodable::decode(d)) {
Ok(e) => e,
Err(e) => return Err(e)
},
}))
}
}
Which gives an error at Decodable::decode(d)
error: failed to find an implementation of trait
serialize::serialize::Decodable for [[A, .. 20], .. 20]
It's not really possible to do this nicely at the moment for a variety of reasons:
We can't be generic over the length of a fixed length array (the fundamental issue)
The current trait coherence restrictions mean we can't write a custom trait MyDecodable<D, E> { ... } with impl MyDecodable<D, E> for [A, .. GRIDW] (and one for GRIDH) and a blanket implementation impl<A: Decodable<D, E>> MyDecodable<D, E> for A. This forces a trait-based solution into using an intermediary type, which then makes the compiler's type inference rather unhappy and AFAICT impossible to satisfy.
We don't have associated types (aka "output types"), which I think would allow the type inference to be slightly sane.
Thus, for now, we're left with a manual implementation. :(
extern crate serialize;
use std::default::Default;
use serialize::{Decoder, Decodable};
static GRIDW: uint = 10;
static GRIDH: uint = 5;
fn decode_grid<E, D: Decoder<E>,
A: Copy + Default + Decodable<D, E>>(d: &mut D)
-> Result<Grid<A>, E> {
// mirror the Vec implementation: try to read a sequence
d.read_seq(|d, len| {
// check it's the required length
if len != GRIDH {
return Err(
d.error(format!("expecting length {} but found {}",
GRIDH, len).as_slice()));
}
// create the array with empty values ...
let mut array: [[A, .. GRIDW], .. GRIDH]
= [[Default::default(), .. GRIDW], .. GRIDH];
// ... and fill it in progressively ...
for (i, outer) in array.mut_iter().enumerate() {
// ... by reading each outer element ...
try!(d.read_seq_elt(i, |d| {
// ... as a sequence ...
d.read_seq(|d, len| {
// ... of the right length,
if len != GRIDW { return Err(d.error("...")) }
// and then read each element of that sequence as the
// elements of the grid.
for (j, inner) in outer.mut_iter().enumerate() {
*inner = try!(d.read_seq_elt(j, Decodable::decode));
}
Ok(())
})
}));
}
// all done successfully!
Ok(Grid { data: array })
})
}
pub struct Grid<A> {
data: [[A,..GRIDW],..GRIDH]
}
impl<E, D: Decoder<E>, A: Copy + Default + Decodable<D, E>>
Decodable<D, E> for Grid<A> {
fn decode(d: &mut D) -> Result<Grid<A>, E> {
d.read_struct("Grid", 1, |d| {
d.read_struct_field("data", 0, decode_grid)
})
}
}
fn main() {}
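The code above targets a pre-1.0 compiler. In current Rust, const generics remove the first limitation listed earlier, so the length-checked fill can be written once for any width and height. The sketch below mirrors only the shape of decode_grid (check the outer length, then each inner length, then copy into a default-initialised array); it reads from nested Vecs rather than a Decoder, and grid_from_rows is a hypothetical name.

```rust
// Hypothetical helper: builds a fixed-size grid from rows of dynamic length.
fn grid_from_rows<A, const W: usize, const H: usize>(
    rows: &[Vec<A>],
) -> Result<[[A; W]; H], String>
where
    A: Copy + Default,
{
    // Check the outer sequence has the required length.
    if rows.len() != H {
        return Err(format!("expecting length {} but found {}", H, rows.len()));
    }
    // Create the array with default values and fill it in progressively.
    let mut grid = [[A::default(); W]; H];
    for (outer, row) in grid.iter_mut().zip(rows) {
        if row.len() != W {
            return Err(format!("expecting length {} but found {}", W, row.len()));
        }
        outer.copy_from_slice(row);
    }
    Ok(grid)
}

fn main() {
    let rows: Vec<Vec<u8>> = vec![vec![1, 2], vec![3, 4]];
    let grid = grid_from_rows::<u8, 2, 2>(&rows);
    println!("{:?}", grid); // Ok([[1, 2], [3, 4]])
}
```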
It's also possible to write a more "generic" [T, .. n] decoder by using macros to instantiate each version, with special control over how the recursive decoding is handled to allow nested fixed-length arrays to be handled (as required for Grid); this requires somewhat less code (especially with more layers, or a variety of different lengths), but the macro solution:
may be harder to understand, and
the one I give there may be less efficient (there's a new array variable created for every fixed length array, including new Defaults, while the non-macro solution above just uses a single array and thus only calls Default::default once for each element in the grid). It may be possible to expand to a similar set of recursive loops, but I'm not sure.