Why is Deedle casting a DataFrame boolean column into a float Series?

When I run the code below I get a DataFrame with one bool column and two double columns. However, when I extract the bool column as a Series, the result is a Series object with key type DateTime and value type float.
It looks like Deedle "cast" the column to another type.
Why is this happening?
open System
open Deedle
let dates =
    [ DateTime(2013,1,1);
      DateTime(2013,1,4);
      DateTime(2013,1,8) ]
let values = [ 10.0; 20.0; 30.0 ]
let values2 = [ 0.0; -1.0; 1.0 ]
let first = Series(dates, values)
let second = Series(dates, values2)
let third: Series<DateTime,bool> = Series.map (fun k v -> v > 0.0) second
let df1 = Frame(["first"; "second"; "third"], [first; second; third])
let sb = df1.["third"]
df1;;
val it : Frame<DateTime,string> =
Deedle.Frame`2[System.DateTime,System.String]
{ColumnCount = 3;
ColumnIndex = Deedle.Indices.Linear.LinearIndex`1[System.String];
ColumnKeys = seq ["first"; "second"; "third"];
ColumnTypes = seq [System.Double; System.Double; System.Boolean];
...
sb;;
val it : Series<DateTime,float> = ...

As the existing answer points out, GetColumn is the way to go. You can specify the generic parameter directly when calling GetColumn and avoid the type annotation to make the code nicer:
let sb = df1.GetColumn<bool>("third")
A Deedle frame does not statically keep track of the types of its columns, so when you want to get a column as a typed series, you need to specify the type in some way.
We did not want to force people to write type annotations, because they tend to be quite long and ugly, so the primary way of getting a column is GetColumn, where you can specify the type argument as in the example above.
The other ways of accessing a column, such as df?third and df.["third"], are shorthands that assume the column type to be float, because that happens to be a quite common scenario (at least for the most common uses of Deedle in finance). These two notations give you a simpler way that "often works nicely".

You can use .GetColumn to extract the Series as a bool:
let sb':(Series<DateTime,bool>) = df1.GetColumn("third")
//val sb' : Series<DateTime,bool> =
//series [ 2013/01/01 0:00:00 => False; 2013/01/04 0:00:00 => False; 2013/01/08 0:00:00 => True]
As to your question of why: I haven't looked at the source, but I assume the indexer you use returns an obj, which Deedle then tries to cast to something, or perhaps it tries to cast everything to float.


How to optimize this code for speed, in F# and also why is a part executed twice?

The code is used to pack historical financial data in 16 bytes:
open System
type PackedCandle =
    struct
        val H: single
        val L: single
        val C: single
        val V: int
        new(h: single, l: single, c: single, v: int) = { H = h; L = l; C = c; V = v }
        member this.ToByteArray =
            let a = Array.create 16 (byte 0)
            let h = BitConverter.GetBytes(this.H)
            let l = BitConverter.GetBytes(this.L)
            let c = BitConverter.GetBytes(this.C)
            let v = BitConverter.GetBytes(this.V)
            a.[00] <- h.[0]; a.[01] <- h.[1]; a.[02] <- h.[2]; a.[03] <- h.[3]
            a.[04] <- l.[0]; a.[05] <- l.[1]; a.[06] <- l.[2]; a.[07] <- l.[3]
            a.[08] <- c.[0]; a.[09] <- c.[1]; a.[10] <- c.[2]; a.[11] <- c.[3]
            a.[12] <- v.[0]; a.[13] <- v.[1]; a.[14] <- v.[2]; a.[15] <- v.[3]
            printfn "!!" // <- for the second part of the question
            a
    end
Arrays of these are sent across the network, so I need the data to be as small as possible, but since this is tracking about 80 tradable instruments at the same time, performance matters as well.
A tradeoff was made where clients do not get historical data followed by updates; instead they get chunks of the last 3 days, minute by minute, so the same data is sent over and over to simplify the client logic. I have inherited the problem of making this inefficient design as efficient as possible. This is also done over REST polling, which I'm converting to sockets right now to keep everything binary.
So my first question is:
How can I make this faster? In C, where you can cast anything into anything, I can just take a float and write it straight into the array, so there is nothing faster. In F# it looks like I need to jump through hoops: getting the bytes and then copying them one by one instead of 4 by 4, etc. Is there a better way?
My second question: since this was to be evaluated once, I made ToByteArray a property. I'm doing some tests with random values in a Jupyter Notebook, and I see that the property seems to be executed twice (indicated by the two "!!" lines). Why is that?
Assuming you have an array to write to (generally you should use a buffer for reading and writing when working with sockets), you can use System.Runtime.CompilerServices.Unsafe.As<TFrom, TTo> to cast memory from one type to another (the same thing you can do in C/C++):
open System.Runtime.CompilerServices
type PackedCandle =
    // omitting fields & constructor
    override c.ToString() = $"%f{c.H} %f{c.L} %f{c.C} %d{c.V}" // for debugging
    static member ReadFrom(array: byte[], offset) =
        // get a managed(!) pointer
        // cast the pointer to another type
        // same as *(PackedCandle*)(&array[offset]) but safe from GC
        Unsafe.As<byte, PackedCandle> &array.[offset]
    member c.WriteTo(array: byte[], offset: int) =
        Unsafe.As<byte, PackedCandle> &array.[offset] <- c
Usage
let byteArray = Array.zeroCreate<byte> 100 // assume the array comes from a different function
// writing
let mutable offset = 0
for i = 0 to 5 do
    let candle = PackedCandle(float32 i, float32 i, float32 i, i)
    candle.WriteTo(byteArray, offset)
    offset <- offset + Unsafe.SizeOf<PackedCandle>() // "increment pointer"
// reading
offset <- 0 // rewind to the start of the array
for i = 0 to 5 do
    let candle = PackedCandle.ReadFrom(byteArray, offset)
    printfn "%O" candle
    offset <- offset + Unsafe.SizeOf<PackedCandle>()
But do you really want to mess with pointers (even managed ones)? Have you measured that this code is a bottleneck?
Update
It's better to use MemoryMarshal instead of raw Unsafe, because the former checks for out-of-range access and enforces the use of unmanaged types at runtime:
member c.WriteTo (array: byte[], offset: int) =
    MemoryMarshal.Write(array.AsSpan(offset), &Unsafe.AsRef(&c))
static member ReadFrom (array: byte[], offset: int) =
    MemoryMarshal.Read<PackedCandle>(ReadOnlySpan(array).Slice(offset))
My first question would be, why do you need the ToByteArray operation? In the comments, you say that you are sending arrays of these values over network, so I assume you plan to convert the data to a byte array so that you can write it to network stream.
I think it would be more efficient (and easier) to instead have a method that takes a BinaryWriter and writes the binary data to the stream directly:
open System.IO
type PackedCandle =
    struct
        val H: single
        val L: single
        val C: single
        val V: int
        new(h: single, l: single, c: single, v: int) = { H = h; L = l; C = c; V = v }
        // BinaryWriter emits 4 bytes per field, i.e. 16 bytes per candle
        member this.WriteTo(bw: BinaryWriter) =
            bw.Write(this.H)
            bw.Write(this.L)
            bw.Write(this.C)
            bw.Write(this.V)
    end
If you now have some code for the network communication, it will expose a stream that you need to write to. Assuming it is named stream, you can do just:
use writer = new BinaryWriter(stream)
for a in packedCandles do a.WriteTo(writer)
Regarding your second question, I think this cannot be answered without a more complete code sample.

How to create a variable that can take strings and functions in Kotlin?

Is there a way to create a variable to store strings and functions? Like var x:dynamic where x can be any type or a function: x="foo"; x= {print (...)}
dynamic isn't a type (it just turns off type checking) and works only in kotlin.js (JavaScript). Is there a type that includes function types and Any?
I tried this code and it works fine.
The variable x is of type Any, so it can hold any kind of non-null data, including a function. To hold nullable data, use Any?.
var x: Any = "foo"
println( x )
x = { println("") }
x.invoke()
The IDE smart-casts the variable, but you can also make the cast explicit:
(x as ()->Unit).invoke()
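When the variable may hold either a string or a function at a given point, a type check with `when` avoids the unchecked cast. The following is a minimal sketch, not part of the original answer; `describe` is a hypothetical helper name, and it only handles zero-argument functions via `Function0<*>`.

```kotlin
// Hypothetical helper: reports what a value of type Any? currently holds.
fun describe(x: Any?): String = when (x) {
    null -> "null"
    is String -> "string: $x"
    // Function0<*> matches any zero-argument lambda; invoke() returns Any?.
    is Function0<*> -> "function returning: ${x.invoke()}"
    else -> "other: $x"
}

fun main() {
    var x: Any = "foo"
    println(describe(x)) // string: foo
    x = { 42 }
    println(describe(x)) // function returning: 42
}
```

Functions with parameters would need their own branches (for example `Function1<*, *>`), so this approach is only practical when the set of shapes is small.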

Difference between fold and reduce in Kotlin, When to use which?

I am pretty confused with both functions fold() and reduce() in Kotlin, can anyone give me a concrete example that distinguishes both of them?
fold takes an initial value, and the first invocation of the lambda you pass to it will receive that initial value and the first element of the collection as parameters.
For example, take the following code that calculates the sum of a list of integers:
listOf(1, 2, 3).fold(0) { sum, element -> sum + element }
The first call to the lambda will be with parameters 0 and 1.
Having the ability to pass in an initial value is useful if you have to provide some sort of default value or parameter for your operation. For example, if you were looking for the maximum value inside a list, but for some reason want to return at least 10, you could do the following:
listOf(1, 6, 4).fold(10) { max, element ->
    if (element > max) element else max
}
reduce doesn't take an initial value, but instead starts with the first element of the collection as the accumulator (called sum in the following example).
For example, let's do a sum of integers again:
listOf(1, 2, 3).reduce { sum, element -> sum + element }
The first call to the lambda here will be with parameters 1 and 2.
You can use reduce when your operation does not depend on any values other than those in the collection you're applying it to.
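The call order described above can be observed directly by recording the arguments of every lambda invocation. This is a small sketch; `foldCalls` and `reduceCalls` are hypothetical helper names introduced only for the illustration.

```kotlin
// Hypothetical helpers that record the (accumulator, element) pair of every lambda call.
fun foldCalls(items: List<Int>, initial: Int): List<Pair<Int, Int>> {
    val calls = mutableListOf<Pair<Int, Int>>()
    items.fold(initial) { acc, element ->
        calls.add(acc to element)
        acc + element
    }
    return calls
}

fun reduceCalls(items: List<Int>): List<Pair<Int, Int>> {
    val calls = mutableListOf<Pair<Int, Int>>()
    items.reduce { acc, element ->
        calls.add(acc to element)
        acc + element
    }
    return calls
}

fun main() {
    println(foldCalls(listOf(1, 2, 3), 0)) // [(0, 1), (1, 2), (3, 3)]
    println(reduceCalls(listOf(1, 2, 3)))  // [(1, 2), (3, 3)]
}
```

fold makes one call per element, while reduce makes one call fewer because the first element is consumed as the starting accumulator.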
The major functional difference I would call out (which is mentioned in the comments on the other answer, but may be hard to understand) is that reduce will throw an exception if performed on an empty collection.
listOf<Int>().reduce { x, y -> x + y }
// java.lang.UnsupportedOperationException: Empty collection can't be reduced.
This is because .reduce doesn't know what value to return in the event of "no data".
Contrast this with .fold, which requires you to provide a "starting value", which will be the default value in the event of an empty collection:
val result = listOf<Int>().fold(0) { x, y -> x + y }
assertEquals(0, result)
So, even if you don't need to aggregate your collection down to a single value of a different (unrelated) type (which only .fold lets you do), if your starting collection may be empty you must either check its size first and then call .reduce, or just use .fold:
val collection: List<Int> = // collection of unknown size
val result1 = if (collection.isEmpty()) 0
else collection.reduce { x, y -> x + y }
val result2 = collection.fold(0) { x, y -> x + y }
assertEquals(result1, result2)
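A third option, not mentioned in the answer above, is `reduceOrNull` from the standard library (assuming Kotlin 1.4 or later), which returns null for an empty collection instead of throwing. A minimal sketch, with `sumWithReduce` and `sumWithFold` as hypothetical names:

```kotlin
// Assumes Kotlin 1.4+ for reduceOrNull.
fun sumWithReduce(xs: List<Int>): Int =
    xs.reduceOrNull { a, b -> a + b } ?: 0 // null means "empty", so fall back to 0

fun sumWithFold(xs: List<Int>): Int =
    xs.fold(0) { a, b -> a + b }

fun main() {
    println(sumWithReduce(emptyList()))     // 0
    println(sumWithReduce(listOf(1, 2, 3))) // 6
    println(sumWithFold(emptyList()))       // 0
}
```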
Another difference that none of the other answers mentioned is the following:
The result of a reduce operation will always be of the same type (or a super type) as the data that is being reduced.
We can see that from the definition of the reduce method:
public inline fun <S, T : S> Iterable<T>.reduce(operation: (acc: S, T) -> S): S {
    val iterator = this.iterator()
    if (!iterator.hasNext()) throw UnsupportedOperationException("Empty collection can't be reduced.")
    var accumulator: S = iterator.next()
    while (iterator.hasNext()) {
        accumulator = operation(accumulator, iterator.next())
    }
    return accumulator
}
On the other hand, the result of a fold operation can be anything, because there are no restrictions when it comes to setting up the initial value.
So, for example, let us say that we have a string that contains letters and digits. We want to calculate the sum of all the digits.
We can easily do that with fold:
val string = "1a2b3"
val result: Int = string.fold(0, { currentSum: Int, char: Char ->
    if (char.isDigit())
        currentSum + Character.getNumericValue(char)
    else currentSum
})
//result is equal to 6
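The same idea extends to accumulators that are not even scalar. As a further sketch under the same reasoning, the fold below turns a String into a Map, a result type that reduce could not produce from Char elements; `letterCounts` is a hypothetical name.

```kotlin
// Hypothetical helper: folds the characters of a string into a letter -> count map.
fun letterCounts(text: String): Map<Char, Int> =
    text.fold(emptyMap<Char, Int>()) { counts, ch ->
        // Non-letters leave the accumulator unchanged.
        if (ch.isLetter()) counts + (ch to (counts[ch] ?: 0) + 1) else counts
    }

fun main() {
    println(letterCounts("abca1")) // {a=2, b=1, c=1}
}
```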
reduce - The reduce() method transforms a given collection into a single result.
val numbers: List<Int> = listOf(1, 2, 3)
val sum: Int = numbers.reduce { acc, next -> acc + next }
//sum is 6 now.
fold - What would happen in the previous case with an empty list? There's no right value to return, so reduce() throws an UnsupportedOperationException (a RuntimeException).
In this case, fold is a handy tool. You can supply an initial value with it:
val sum: Int = numbers.fold(0, { acc, next -> acc + next })
Here, we've provided an initial value. In contrast to reduce(), if the collection is empty, the initial value is returned, which avoids the exception.
Simple Answer
The result of both reduce and fold is that a list of items is transformed into a single item.
In the case of fold, we provide one extra parameter apart from the list, but in the case of reduce, only the items in the list are considered.
Fold
listOf("AC","Fridge").fold("stabilizer") { freeGift, itemBought -> freeGift + itemBought }
//output: stabilizerACFridge
In the case above, think of the AC and fridge as bought from a store, which gives a stabilizer as a free gift (this is the parameter passed to fold). So you get all 3 items together. Note that freeGift holds the initial value only on the first iteration; afterwards it holds the accumulated result.
Reduce
In the case of reduce, we get the items in the list as parameters and can perform the required transformations on them.
listOf("AC","Fridge").reduce { itemBought1, itemBought2 -> itemBought1 + itemBought2 }
//output: ACFridge
The difference between the two functions is that fold() takes an initial value and uses it as the accumulated value on the first step, whereas the first step of reduce() uses the first and second elements as the operation arguments.

Why does map throw away type information when operating on DataArrays?

I'm trying to understand why this is happening even given the limitations of DataArrays. Suppose you want to map over a DataArray of Int64s:
da = DataArray([1,2,3,4])
println(typeof(da))
println(typeof(map(a -> a^2, da))) # Returns an int for this input
println(typeof(map(a -> int(a^2), da))) # Cast the piecewise result to int
println(typeof(int(map(a -> a^2, da)))) # Cast the output DataArray{Any,1} to int
which results in
DataArray{Int64,1}
DataArray{Any,1}
DataArray{Any,1}
Array{Int64,1}
For an array, a = [1,2,3,4], map(x -> x^2, a) returns an Array of Int64s as expected. What is it about map and/or DataArrays that's causing type information to be lost here? Is there any solution to preserve type information when you're working with a type which doesn't have a constructor that converts DataArray{Any,1} to DataArray{ThatType,1}, like Dates.DateTime?
Edit: convert works fine to make a DataArray{Any,1} a DataArray{ThatType,1} (well at least for DateTime).
@which map(a -> a^2, da::DataArray{Int64, 1})
map(f::Function,dv::DataArray{T,1}) at /home/omer/.julia/v0.3/DataArrays/src/datavector.jl:114
Checking the source:
https://github.com/JuliaStats/DataArrays.jl/blob/master/src/datavector.jl
# TODO: should this be an AbstractDataVector, so it works with PDV's?
function Base.map(f::Function, dv::DataVector)
    n = length(dv)
    res = DataArray(Any, n)
    for i in 1:n
        res[i] = f(dv[i])
    end
    return res
end
It creates a DataArray{Any,1} to return, regardless of what f returns:
res = DataArray(Any, n)
You can check the answer given by James Fairbanks (1 Apr 04:12 2015)
http://blog.gmane.org/gmane.comp.lang.julia.user/month=20150401

Is there a way to get a Curried form of the binary operators in SML/NJ?

For example, instead of
- op =;
val it = fn : ''a * ''a -> bool
I would rather have
- op =;
val it = fn : ''a -> ''a -> bool
for use in
val x = getX()
val l = getList()
val l' = if List.exists ((op =) x) l then l else x::l
Obviously I can do this on my own, for example,
val l' = if List.exists (fn y => x = y) l then l else x::l
but I want to make sure I'm not missing a more elegant way.
You could write a helper function that curries a function:
fun curry f x y = f (x, y)
Then you can do something like
val curried_equals = curry (op =)
val l' = if List.exists (curried_equals x) l then l else x::l
My knowledge of SML is scant, but I looked through the Ullman book and couldn't find an easy way to convert a function that accepts a tuple to a curried function. They have two different signatures and aren't directly compatible with one another.
I think you're going to have to roll your own.
Or switch to Haskell.
Edit: I've thought about it, and now know why one isn't the same as the other. In SML, nearly all of the functions you're used to actually accept only one parameter. It just so happens that most of the time you're actually passing it a tuple with more than one element. Still, a tuple is a single value and is treated as such by the function. You can't pass such a function a partial tuple. It's either the whole tuple or nothing.
Any function that accepts more than one parameter is, by definition, curried. When you define a function that accepts multiple parameters (as opposed to a single tuple with multiple elements), you can partially apply it and use its return value as the argument to another function.
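The tuple-versus-curried distinction is not specific to SML. As an illustration in Kotlin (an analogy only, not SML code), the same curry helper converts a two-argument function into a chain of one-argument functions that can be partially applied; `curry` and `addIfMissing` are hypothetical names mirroring the snippets above.

```kotlin
// Hypothetical helper: turns a two-argument function into a chain of one-argument functions.
fun <A, B, C> curry(f: (A, B) -> C): (A) -> (B) -> C =
    { a -> { b -> f(a, b) } }

// Mirrors the SML snippet: prepend x only if the list does not already contain it.
fun addIfMissing(x: Int, l: List<Int>): List<Int> {
    val curriedEquals = curry { a: Int, b: Int -> a == b }
    // curriedEquals(x) is a partially applied predicate of type (Int) -> Boolean.
    return if (l.any(curriedEquals(x))) l else listOf(x) + l
}

fun main() {
    println(addIfMissing(2, listOf(1, 2, 3))) // [1, 2, 3]
    println(addIfMissing(4, listOf(1, 2, 3))) // [4, 1, 2, 3]
}
```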