Why does map throw away type information when operating on DataArrays?

I'm trying to understand why this is happening even given the limitations of DataArrays. Suppose you want to map over a DataArray of Int64s:
da = DataArray([1,2,3,4])
println(typeof(da))
println(typeof(map(a -> a^2, da))) # Returns an int for this input
println(typeof(map(a -> int(a^2), da))) # Cast the piecewise result to int
println(typeof(int(map(a -> a^2, da)))) # Cast the output DataArray{Any,1} to int
which results in
DataArray{Int64,1}
DataArray{Any,1}
DataArray{Any,1}
Array{Int64,1}
For a plain array, a = [1,2,3,4], map(a -> a^2, a) returns an Array of Int64s as expected. What is it about map and/or DataArrays that causes type information to be lost here? Is there any solution that preserves type information when you're working with a type that doesn't have a constructor converting DataArray{Any,1} to DataArray{ThatType,1}, like Dates.DateTime?
Edit: convert works fine to turn a DataArray{Any,1} into a DataArray{ThatType,1} (well, at least for DateTime).

@which map(a -> a^2, da::DataArray{Int64, 1})
map(f::Function,dv::DataArray{T,1}) at /home/omer/.julia/v0.3/DataArrays/src/datavector.jl:114
Checking the source:
https://github.com/JuliaStats/DataArrays.jl/blob/master/src/datavector.jl
# TODO: should this be an AbstractDataVector, so it works with PDV's?
function Base.map(f::Function, dv::DataVector)
    n = length(dv)
    res = DataArray(Any, n)
    for i in 1:n
        res[i] = f(dv[i])
    end
    return res
end
It creates a DataArray{Any,1} to hold the result:
res = DataArray(Any, n)
You can check the answer given by James Fairbanks (1 Apr 04:12 2015)
http://blog.gmane.org/gmane.comp.lang.julia.user/month=20150401
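Based on the edit in the question (convert can turn the untyped result back into a concrete element type), here is a minimal workaround sketch, reusing da from above and assuming DataArrays on Julia 0.3:
using DataArrays
da = DataArray([1,2,3,4])
squared = map(a -> a^2, da)                    # DataArray{Any,1}
squared = convert(DataArray{Int64,1}, squared) # back to a concrete element type
println(typeof(squared))                       # DataArray{Int64,1}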

Related

Implementing map & min that takes the tables.keys iterator as argument in Nim

I would like to define overloads of map and min/max (as originally defined in sequtils) that work for tables.keys. Specifically, I want to be able to write something like the following:
import sequtils, sugar, tables
# A mapping from coordinates (x, y) to values.
var locations = initTable[(int, int), int]()
# Put in some random values.
locations[(1, 2)] = 1
locations[(2, 1)] = 2
locations[(-2, 5)] = 3
# Get the minimum X coordinate.
let minX = locations.keys.map(xy => xy[0]).min
echo minX
Now this fails with:
/usercode/in.nim(12, 24) Error: type mismatch: got <iterable[lent (int, int)], proc (xy: GenericParam): untyped>
but expected one of:
proc map[T, S](s: openArray[T]; op: proc (x: T): S {.closure.}): seq[S]
first type mismatch at position: 1
required type for s: openArray[T]
but expression 'keys(locations)' is of type: iterable[lent (int, int)]
expression: map(keys(locations), proc (xy: auto): auto = xy[0])
Below are my three attempts at writing a map that works (code on Nim playground: https://play.nim-lang.org/#ix=3Heq). Attempts 1 & 2 failed and attempt 3 succeeded. Similarly, I implemented min using both attempt 1 & attempt 2, and attempt 1 failed while attempt 2 succeeded.
However, I'm confused as to why the previous attempts fail, and what the best practice is:
Why does attempt 1 fail when the actual return type of the iterators is iterable[T]?
Why does attempt 2 fail for tables.keys? Is tables.keys implemented differently?
Is attempt 2 the canonical way of taking iterators / iterables as function arguments? Are there alternatives to this?
Attempt 1: Function that takes an iterable[T].
Since the Nim manual seems to imply that the result type of calling an iterator is iterable[T], I tried defining map for iterable[T] like this:
iterator map[A, B](iter: iterable[A], fn: A -> B): B =
  for x in iter:
    yield fn(x)
But it failed with a pretty long and confusing message:
/usercode/in.nim(16, 24) template/generic instantiation of `map` from here
/usercode/in.nim(11, 12) Error: type mismatch: got <iterable[(int, int)]>
but expected one of:
iterator items(a: cstring): char
first type mismatch at position: 1
required type for a: cstring
but expression 'iter' is of type: iterable[(int, int)]
... (more output like this)
From my understanding it seems to say that items is not defined for iterable[T], which seems weird to me, because I think items is exactly what's needed for an object to be iterable?
Attempt 2: Function that returns an iterator.
I basically copied the implementation in def-/nim-itertools and defined a map function that takes an iterator and returns a new closure iterator:
type Iterable[T] = (iterator: T)

func map[A, B](iter: Iterable[A], fn: A -> B): iterator: B =
  (iterator: B =
    for x in iter():
      yield fn(x))
but this failed with:
/usercode/in.nim(25, 24) Error: type mismatch: got <iterable[lent (int, int)], proc (xy: GenericParam): untyped>
but expected one of:
func map[A, B](iter: Iterable[A]; fn: A -> B): B
first type mismatch at position: 1
required type for iter: Iterable[map.A]
but expression 'keys(locations)' is of type: iterable[lent (int, int)]
proc map[T, S](s: openArray[T]; op: proc (x: T): S {.closure.}): seq[S]
first type mismatch at position: 1
required type for s: openArray[T]
but expression 'keys(locations)' is of type: iterable[lent (int, int)]
expression: map(keys(locations), proc (xy: auto): auto = xy[0])
which hints that maybe tables.keys doesn't return an iterator?
Attempt 3: Rewrite keys using attempt 2.
This replaces tables.keys with a custom myKeys implemented in a similar fashion to the map in attempt 2. Combined with the map from attempt 2, this works:
func myKeys[K, V](table: Table[K, V]): iterator: K =
  (iterator: K =
    for x in table.keys:
      yield x)
Explanation of errors in first attempts
which hints that maybe tables.keys doesn't return an iterator
You are right. It does not return an iterator; it is an iterator that returns elements of the type of your Table's keys. Unlike in Python 3, there seems to be no difference between type(locations.keys) and type(locations.keys()): they both return (int, int).
Here is the prototype of keys:
iterator keys[A, B](t: Table[A, B]): lent A
The lent keyword avoids copying the Table's elements.
Hence you get a type mismatch in your first and second attempts:
locations.keys.map(xy => xy[0]) has an incorrect first parameter, since map receives an (int, int) element where an iterable[A] is expected.
Proposals
As for a solution, you can first convert your keys to a sequence (which is heavy), as hola suggested; a sketch of that route follows below.
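A minimal sketch of that first route, using sequtils.toSeq to materialize the keys (the table setup is repeated from the question):
import sequtils, sugar, tables

var locations = initTable[(int, int), int]()
locations[(1, 2)] = 1
locations[(2, 1)] = 2
locations[(-2, 5)] = 3

# toSeq turns the keys iterator into a seq, so the openArray-based
# map/min apply (simple, but copies all the keys).
let minX = toSeq(locations.keys).map(xy => xy[0]).min
echo minX  # -2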
Alternatively, you can write a procedure tailored to your specific application, folding the copy into the sequence together with your operation and gaining a bit of performance:
import tables

# A mapping from coordinates (x, y) to values.
var locations = initTable[(int, int), int]()
# Put in some random values.
locations[(1, 2)] = 1
locations[(2, 1)] = 2
locations[(-2, 5)] = 3

func firstCoordinate[X, Y, V](table: Table[(X, Y), V]): seq[X] =
  result = @[]
  for x in table.keys:
    result.add(x[0])

let minX = locations.firstCoordinate.min
echo minX
This does not strictly adhere to your API, but it should be more efficient.

How to optimize this code for speed in F#, and why is part of it executed twice?

The code is used to pack historical financial data in 16 bytes:
open System

type PackedCandle =
    struct
        val H: single
        val L: single
        val C: single
        val V: int

        new(h: single, l: single, c: single, v: int) = { H = h; L = l; C = c; V = v }

        member this.ToByteArray =
            let a = Array.create 16 (byte 0)
            let h = BitConverter.GetBytes(this.H)
            let l = BitConverter.GetBytes(this.L)
            let c = BitConverter.GetBytes(this.C)
            let v = BitConverter.GetBytes(this.V)
            a.[00] <- h.[0]; a.[01] <- h.[1]; a.[02] <- h.[2]; a.[03] <- h.[3]
            a.[04] <- l.[0]; a.[05] <- l.[1]; a.[06] <- l.[2]; a.[07] <- l.[3]
            a.[08] <- c.[0]; a.[09] <- c.[1]; a.[10] <- c.[2]; a.[11] <- c.[3]
            a.[12] <- v.[0]; a.[13] <- v.[1]; a.[14] <- v.[2]; a.[15] <- v.[3]
            printfn "!!" // <- for the second part of the question
            a
    end
Arrays of these are sent across the network, so I need the data to be as small as possible, but since this is tracking about 80 tradable instruments at the same time, performance matters as well.
A tradeoff was made where clients do not get historical data followed by updates, but instead get chunks of the last 3 days minute by minute, so the same data is sent over and over to simplify the client logic... and I inherit the problem of making this inefficient design as efficient as possible. This is also done over REST polling, which I'm converting to sockets right now to keep everything binary.
So my first question is:
How can I make this faster? In C, where you can cast anything into anything, I could just take a float and write it straight into the array, so nothing would be faster; but in F# it looks like I need to jump through hoops, getting the bytes and then copying them one by one instead of 4 at a time, etc. Is there a better way?
My second question is that, since this is only meant to be evaluated once, I made ToByteArray a property. I'm doing some tests with random values in a Jupyter notebook, but then I see that:
the property seems to be executed twice (indicated by the two "!!" lines). Why is that?
Assuming you have an array to write to (generally you should use a buffer for reading and writing when working with sockets), you can use System.Runtime.CompilerServices.Unsafe.As<TFrom, TTo> to reinterpret memory of one type as another (the same thing you can do in C/C++):
open System.Runtime.CompilerServices

type PackedCandle =
    // omitting fields & constructor
    override c.ToString() = $"%f{c.H} %f{c.L} %f{c.C} %d{c.V}" // debug purpose

    static member ReadFrom(array: byte[], offset) =
        // get a managed(!) pointer and cast it to another type;
        // same as *(PackedCandle*)(&array[offset]) but safe from GC
        Unsafe.As<byte, PackedCandle> &array.[offset]

    member c.WriteTo(array: byte[], offset: int) =
        Unsafe.As<byte, PackedCandle> &array.[offset] <- c
Usage
let byteArray = Array.zeroCreate<byte> 100 // assume the array comes from a different function

// writing
let mutable offset = 0
for i = 0 to 5 do
    let candle = PackedCandle(float32 i, float32 i, float32 i, i)
    candle.WriteTo(byteArray, offset)
    offset <- offset + Unsafe.SizeOf<PackedCandle>() // "increment pointer"

// reading
let mutable offset = 0
for i = 0 to 5 do
    let candle = PackedCandle.ReadFrom(byteArray, offset)
    printfn "%O" candle
    offset <- offset + Unsafe.SizeOf<PackedCandle>()
But do you really want to mess with pointers (even managed ones)? Have you measured that this code is the bottleneck?
Update
It's better to use MemoryMarshal instead of raw Unsafe, because the former checks for out-of-range access and enforces the use of unmanaged types (see here or here) at runtime:
member c.WriteTo (array: byte[], offset: int) =
    MemoryMarshal.Write(array.AsSpan(offset), &Unsafe.AsRef(&c))

static member ReadFrom (array: byte[], offset: int) =
    MemoryMarshal.Read<PackedCandle>(ReadOnlySpan(array).Slice(offset))
My first question would be, why do you need the ToByteArray operation? In the comments, you say that you are sending arrays of these values over the network, so I assume you plan to convert the data to a byte array so that you can write it to a network stream.
I think it would be more efficient (and easier) to instead have a method that takes a StreamWriter and writes the data to the stream directly:
open System.IO

type PackedCandle =
    struct
        val H: single
        val L: single
        val C: single
        val V: int

        new(h: single, l: single, c: single, v: int) = { H = h; L = l; C = c; V = v }

        member this.WriteTo(sw: StreamWriter) =
            sw.Write(this.H)
            sw.Write(this.L)
            sw.Write(this.C)
            sw.Write(this.V)
    end
If you now have some code for the network communication, it will expose a stream and you'll need to write to that stream. Assuming this stream is called stream, you can just do:
use writer = new StreamWriter(stream)
for a in packedCandles do a.WriteTo(writer)
Regarding your second question, I think this cannot be answered without a more complete code sample.

Using FsCheck I get different results on tests, once 100% passed and the other time error

I created a generator that generates pairs of int lists of the same length, in order to test the zip/unzip property.
Running the test, I once in a while get the error
Error: System.ArgumentException: list2 is 1 element shorter than list1
which shouldn't happen, given my generator.
Three times in a row the test passed 100%, and then I got the error above. Why?
It seems my generator is not working properly.
let samelength (x, y) =
    List.length x = List.length y

let arbMyGen2 = Arb.filter samelength Arb.from<int list * int list>

type MyGenZ =
    static member genZip() =
        { new Arbitrary<int list * int list>() with
            override x.Generator = arbMyGen2 |> Arb.toGen
            override x.Shrinker t = Seq.empty }

let _ = Arb.register<MyGenZ>()

let pro_zip (xs: int list, ys: int list) =
    (xs, ys) = List.unzip(List.zip xs ys)
    |> Prop.collect (List.length xs = List.length ys)

do Check.Quick pro_zip
Your code, as written, works for me. So I'm not sure what exactly is wrong, but I can give you a few helpful (hopefully!) hints.
In the first instance, try not using the registration mechanism, but instead using Prop.forAll, as follows:
let pro_zip =
    Prop.forAll arbMyGen2 (fun (xs,ys) ->
        (xs, ys) = List.unzip(List.zip xs ys)
        |> Prop.collect (List.length xs))

do Check.Quick pro_zip
Note I've also changed your Prop.collect call to collect the length of the list(s), which gives somewhat more interesting output. In fact your property already checks that the lists are the same length (albeit implicitly) so the test will fail with a counterexample if they are not.
Arb.filter transforms an existing Arbitrary (i.e. generator and filter) to a new Arbitrary. In other words, arbMyGen2 has a shrinking function that'll work (i.e. only returns smaller pairs of lists that are of equal length), while in genZip() you throw the shrinker away. It would be fine to simply write
type MyGenZ =
    static member genZip() = arbMyGen2
instead.
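If you do want to keep the registration route, that simplified genZip slots straight into the question's code; a sketch reusing arbMyGen2 and pro_zip from above:
type MyGenZ =
    static member genZip() = arbMyGen2

// register the Arbitrary (keeping its filter-derived shrinker) and run the original property
let _ = Arb.register<MyGenZ>()
do Check.Quick pro_zip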

Why is Deedle casting a DataFrame boolean column into a float Series?

When I run the code below I get a DataFrame with one bool column and two double columns. However, when I extract the bool column as a Series, the result is a Series object with types DateTime and float.
It looks like Deedle "cast" the column to another type.
Why is this happening?
open System
open Deedle

let dates =
    [ DateTime(2013,1,1);
      DateTime(2013,1,4);
      DateTime(2013,1,8) ]
let values = [ 10.0; 20.0; 30.0 ]
let values2 = [ 0.0; -1.0; 1.0 ]

let first = Series(dates, values)
let second = Series(dates, values2)
let third: Series<DateTime,bool> = Series.map (fun k v -> v > 0.0) second

let df1 = Frame(["first"; "second"; "third"], [first; second; third])
let sb = df1.["third"]
df1;;
val it : Frame<DateTime,string> =
Deedle.Frame`2[System.DateTime,System.String]
{ColumnCount = 3;
ColumnIndex = Deedle.Indices.Linear.LinearIndex`1[System.String];
ColumnKeys = seq ["first"; "second"; "third"];
ColumnTypes = seq [System.Double; System.Double; System.Boolean];
...
sb;;
val it : Series<DateTime,float> = ...
As the existing answer points out, GetColumn is the way to go. You can specify the generic parameter directly when calling GetColumn and avoid the type annotation to make the code nicer:
let sb = df1.GetColumn<bool>("third")
Deedle frame does not statically keep track of the types of the columns, so when you want to get a column as a typed series, you need to specify the type in some way.
We did not want to force people to write type annotations, because they tend to be quite long and ugly, so the primary way of getting a column is GetColumn where you can specify the type argument as in the above example.
The other ways of accessing column such as df?third and df.["third"] are shorthands that assume the column type to be float because that happens to be quite common scenario (at least for the most common uses of Deedle in finance), so these two notations give you a simpler way that "often works nicely".
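To make the difference concrete, here is a small sketch reusing df1 from the question (the names asFloat and asBool are just for illustration):
// the shorthand indexer assumes float...
let asFloat : Series<DateTime, float> = df1.["third"]
// ...while GetColumn lets you state the element type yourself
let asBool : Series<DateTime, bool> = df1.GetColumn<bool>("third")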
You can use .GetColumn to extract the Series as a bool:
let sb':(Series<DateTime,bool>) = df1.GetColumn("third")
//val sb' : Series<DateTime,bool> =
//series [ 2013/01/01 0:00:00 => False; 2013/01/04 0:00:00 => False; 2013/01/08 0:00:00 => True]
As to your question of why, I haven't looked at the source, but I assume the indexer you used either returns an obj that Deedle then tries to cast to something, or it simply tries to cast everything to float.

How to display OK results

I am playing with the Elm examples, and I noticed the field example gives Result types. After getting stuck, I came up with this simplified case:
import Html exposing (text)
import String

f : Int -> Int
f x = x + 1

g : Result String Int -> Result String Int
g x = (Result.map f) x

main =
    text (toString (g (String.toInt "5")))
The result displays Ok 6 and I would rather it display just 6 -- I know that toString takes any type and returns a string representation of it. So maybe I can modify toString:
if the result is Ok then I can print the numerical result
if the result is Err then I would like to show some custom error message
Possibly this is the reason for andThen, since the + 1 operation can fail.
andThen : Result e a -> (a -> Result e b) -> Result e b
andThen result callback =
    case result of
        Ok value -> callback value
        Err msg -> Err msg
The definition of andThen spells out exactly what it does... and it is just an instance of case.
Either with andThen or a plain old case, how do I fix my example? Even if I fix it myself, it might not be the most Elm-like solution with good error handling, so I am posting the question.
When a function returns a Result, you have a choice - you can also return a Result, in which case you can return Err(something) or Ok(something). This percolates your errors up to the calling function, which can decide what to do. The other way is you can return something that isn't a result, like a String or Html. If you go this second route, then you need to handle both possibilities of the Result and still return your String or Html.
So for example this function takes a result and returns a string. It handles both possibilities, returning a string even if the result was an Err.
foo : Result String String -> String
foo myres =
    case myres of
        Ok str -> str
        Err e -> "there was an error! uh oh"
It's kind of a question of how far up the hierarchy you want to go with your Result. Do you want the errors to percolate all the way up to the top level? Maybe your top-level function looks like this:
view : Model -> Html
view model =
    case makeMyHtml model of
        Ok htm -> htm
        Err e -> renderSpecialErrorHtmlPage e
At any rate, to get rid of the 'Ok' in this case you can do this:
main =
    let
        res = g (String.toInt "5")
    in
        text (Result.withDefault "g returned an error!" (Result.map toString res))
If g returns Ok 6 then you get "6", but if it returns an Err you get "g returned an error!".
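Since the question also asks about a plain old case, here is a sketch of that variant, reusing the same f and g from above (the wording of the error branch is just an example):
main =
    case g (String.toInt "5") of
        Ok n ->
            text (toString n)

        Err e ->
            text ("there was an error: " ++ e)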