GroupBy function from .NET in Haskell

The LINQ library in the .NET Framework has a very useful function called GroupBy, which I use all the time.
Its type in Haskell would look like
Ord b => (a -> b) -> [a] -> [(b, [a])]
Its purpose is to classify items into buckets based on the given classification function f, with each bucket containing similar items; that is, pairs (b, l) such that for every item x in l, f x == b.
Its performance in .NET is O(N) because it uses hash tables, but in Haskell I am OK with O(N*log(N)).
I can't find anything similar in standard Haskell libraries. Also, my implementation in terms of standard functions is somewhat bulky:
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

myGroupBy :: Ord k => (a -> k) -> [a] -> [(k, [a])]
myGroupBy f = map toFst
            . groupBy ((==) `on` fst)
            . sortBy (comparing fst)
            . map (\a -> (f a, a))
  where
    toFst l@((k, _) : _) = (k, map snd l)
This is definitely not something I want to see amongst my problem-specific code.
My question is: how can I implement this function nicely exploiting standard libraries to their maximum?
Also, the seeming absence of such a standard function hints that it may rarely be needed by experienced Haskellers because they may know some better way. Is that true? What can be used to implement similar functionality in a better way?
Also, what would be the good name for it, considering groupBy is already taken? :)

GHC.Exts.groupWith
groupWith :: Ord b => (a -> b) -> [a] -> [[a]]
Introduced as part of generalised list comprehensions: http://www.haskell.org/ghc/docs/7.0.2/html/users_guide/syntax-extns.html#generalised-list-comprehensions
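groupWith returns the bare groups, so recovering the question's (b, [a]) shape takes one extra step. A minimal sketch (the wrapper name is made up here):

import GHC.Exts (groupWith)

-- Each group is non-empty and sorted by key, so applying f to the
-- first element of a group recovers that group's key.
groupByKey :: Ord b => (a -> b) -> [a] -> [(b, [a])]
groupByKey f = map (\g -> (f (head g), g)) . groupWith f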

Using Data.Map as the intermediate structure:
import Control.Arrow ((&&&))
import qualified Data.Map as M
myGroupBy f = M.toList . M.fromListWith (++) . map (f &&& return)
The map operation turns the input list into a list of keys paired with singleton lists containing the elements. M.fromListWith (++) turns this into a Data.Map, concatenating when two items have the same key, and M.toList gets the pairs back out again.
Note that this reverses the lists, so adjust for that if necessary. It is also easy to replace return and (++) with other monoid-like operations if, for example, you only wanted the sum of the elements in each group.
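If the reversal matters, one variant (a sketch along the same lines) reverses each bucket at the end, keeping the total cost at O(N*log(N)):

import Control.Arrow ((&&&))
import qualified Data.Map as M

-- As above, but each bucket is reversed afterwards so that elements
-- appear in their original input order.
myGroupBy' :: Ord k => (a -> k) -> [a] -> [(k, [a])]
myGroupBy' f = map (fmap reverse) . M.toList . M.fromListWith (++) . map (f &&& (: []))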

Related

Pointwise function equality proof

I'm just starting to play with Idris and theorem proving in general. I can follow most of the examples of proofs of basic facts on the internet, so I wanted to try something arbitrary on my own. So, I want to write a proof term for the following basic property of map:
map : (a -> b) -> List a -> List b
prf : map id = id
Intuitively, I can imagine how the proof should work: take an arbitrary list l and analyze the possibilities for map id l. When l is empty, it's obvious; when l is non-empty, it follows because function application preserves equality.
So, I can do something like this:
prf' : (l : List a) -> map id l = id l
It's like a for all statement. How can I turn it into a proof of the equality of the functions involved?
You can't. Idris's type theory (like Coq's and Agda's) does not support general function extensionality. Given two functions f and g that "act the same", you will never be able to prove Not (f = g), but you will only be able to prove f = g if f and g are defined the same way, up to alpha and eta equivalence or so. Unfortunately, things only get worse when you consider higher-order functions; there's a theorem about this in the Coq standard library, but I can't seem to find or remember it right now.
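The pointwise statement prf' from the question is provable by a routine induction, though. Here is a sketch of that proof in Lean 4 rather than Idris (the structure carries over almost verbatim):

theorem map_id_pointwise {α : Type} (l : List α) : l.map id = id l := by
  induction l with
  | nil => rfl                 -- map id [] and id [] both reduce to []
  | cons x xs ih => simp [ih]  -- map id (x :: xs) = x :: map id xs, then use ih

Even with this in hand, f = g for the full functions stays unprovable; some systems let you postulate function extensionality as an axiom to bridge the gap.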

How to modify a List element at a given index

What I've done (with some help from a friend) is create a function that takes a List, an Int for the index, and a function to be applied to the element at the specified index. It's similar to List.map, but instead of applying a function to every element, it applies it to only one element.
So my questions are:
Does this function already exist in the core somewhere? We couldn't find it.
If not, is there a better way of accomplishing this than how we have done it?
Here's the code:
import Html exposing (text)

main =
  let
    m = { arr = [1, 5, 3], msg = "" }
  in
    text (toString (getDisplay m 4 (\x -> x + 5)))

type alias Model =
  { arr : List Int
  , msg : String
  }

getDisplay : Model -> Int -> (Int -> Int) -> Model
getDisplay model i f =
  let
    m = changeAt model.arr i f
  in
    case m of
      Ok val ->
        { model | arr = val, msg = "" }

      Err err ->
        { model | arr = [], msg = err }

changeAt : List a -> Int -> (a -> a) -> Result String (List a)
changeAt l i func =
  let
    f j x = if j == i then func x else x
  in
    if i < List.length l && i >= 0 then
      Ok (List.indexedMap f l)
    else
      Err "Bad index"
NOTE: Elm discourages indexing Lists, as they are linked lists under the hood: to retrieve the 1001st element, you have to first visit all 1000 previous elements. Nonetheless, if you wanted to do it, this is one way.
List.indexedMap is a good way to do what you're describing.
However, since you mention the downside of having to visit all preceding elements in a list, the reality in your example is actually a little worse, if indeed you are super worried about performance.
Your list is actually traversed fully at least two times, regardless of whether the index exists or not. The simple act of asking for the length of a linked list has to traverse the entire list. Check out the source code: length is implemented in terms of a foldl.
Furthermore, List.indexedMap traverses the entire list at least once. I say, at least once, since the source of indexedMap also calls the length function in addition to using map. If we're lucky, the length call is memoized (I'm not familiar enough with Elm internals to know whether it is or not, hence the at least comment). The map itself traverses the entire list when called, unlike Haskell which evaluates things lazily, only as much as necessary.
And if you use indexedMap, the whole list is indexed regardless of the position you are interested in. That is, even if you want to apply the function at index zero, the entire list is indexed.
If you actually want to reduce the number of traversals to a minimum, you're going to (at this time) have to implement your own function and you'll have to do it without relying on length or indexedMap.
Here is an example of a changeAt function which avoids unnecessary traversals and stops traversing the list once it finds the position.
changeAt : List a -> Int -> (a -> a) -> Result String (List a)
changeAt l i func =
  if i < 0 then
    Err "Bad Index"
  else
    case l of
      [] ->
        Err "Not found"

      x :: xs ->
        if i == 0 then
          Ok <| func x :: xs
        else
          Result.map ((::) x) <| changeAt xs (i - 1) func
It's not terribly pretty, but if you want to avoid unnecessarily walking through the list - multiple times - then you might want to go with something like this.
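With this definition, changeAt [1,5,3] 1 (\x -> x + 5) evaluates to Ok [1,10,3] after visiting only the first two cells, while changeAt [1,5,3] 7 (\x -> x + 5) walks the whole list once and returns Err "Not found".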
You're looking for the set function for Arrays. Instead of using a List, which is inefficient as you described, this structure is better suited to your use case.
Here's an efficient implementation of the function you're looking for:
import Array exposing (Array, get, set)

changeAt : Int -> (a -> a) -> Array a -> Array a
changeAt i f array =
  case get i array of
    Just item ->
      set i (f item) array

    Nothing ->
      array
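With this version, changeAt 1 (\x -> x + 5) (Array.fromList [1,5,3]) yields an array holding [1,10,3], and an out-of-range index simply returns the array unchanged.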
Also note that the data structure is the last argument in this implementation.
Array is mentioned in the link in your question, but nobody on this thread had explicitly mentioned this option yet.

How to "read" Elm's -> operator

I'm really loving Elm, up to the point where I encounter a function I've never seen before and want to understand its inputs and outputs.
Take the declaration of foldl for example:
foldl : (a -> b -> b) -> b -> List a -> b
I look at this and can't help feeling as if there's a set of parentheses that I'm missing, or some other subtlety about the associativity of this operator (for which I can't find any explicit documentation). Perhaps it's just a matter of using the language more until I just get a "feel" for it, but I'd like to think there's a way to "read" this definition in English.
Looking at the example from the docs…
foldl (::) [] [1,2,3] == [3,2,1]
I expect the function signature to read something like this:
Given a function that takes an a and a b and returns a b, an additional b, and a List of a, foldl returns a b.
Is that correct?
What advice can you give to someone like me who desperately wants inputs to be comma-delimited and inputs/outputs to be separated more clearly?
Short answer
The missing parentheses you're looking for are due to the fact that -> is right-associative: the type (a -> b -> b) -> b -> List a -> b is equivalent to (a -> b -> b) -> (b -> (List a -> b)). Informally, in a chain of ->s, read everything before the last -> as an argument and only the rightmost thing as a result.
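Concretely for foldl: (a -> b -> b) -> (b -> (List a -> b)) says foldl takes the step function and returns a function still waiting for the initial b, which in turn returns a function still waiting for the List a; only then does it produce the final b. So the English reading in the question is essentially right.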
Long answer
The key insight you may be missing is currying -- the idea that if you have a function that takes two arguments, you can represent it with a function that takes the first argument and returns a function that takes the second argument and then returns the result.
For instance, suppose you have a function add that takes two integers and adds them together. In Elm, you could write a function that takes both elements as a tuple and adds them:
add : (Int, Int) -> Int
add (x, y) = x+y
and you could call it as
add (1, 2) -- evaluates to 3
But suppose you didn't have tuples. You might think that there would be no way to write this function, but in fact using currying you could write it as:
add : Int -> (Int -> Int)
add x =
  let
    addx : Int -> Int
    addx y = x + y
  in
    addx
That is, you write a function that takes x and returns another function that takes y and adds it to the original x. You could call it with
((add 1) 2) -- evaluates to 3
You can now think of add in two ways: either as a function that takes an x and a y and adds them, or as a "factory" function that takes x values and produces new, specialized addx functions that take just one argument and add it to x.
The "factory" way of thinking about things comes in handy every once in a while. For instance, if you have a list of numbers called numbers and you want to add 3 to each number, you can just call List.map (add 3) numbers; if you'd written the tuple version instead you'd have to write something like List.map (\y -> add (3,y)) numbers which is a bit more awkward.
Elm comes from a tradition of programming languages that really like this way of thinking about functions and encourage it where possible, so Elm's syntax for functions is designed to make it easy. To that end, -> is right-associative: a -> b -> c is equivalent to a -> (b -> c). This means if you don't parenthesize, what you're defining is a function that takes an a and returns a b -> c, which again we can think of either as a function that takes an a and a b and returns a c, or equivalently a function that takes an a and returns a b -> c.
There's another syntactic nicety that helps call these functions: function application is left-associative. That way, the ugly ((add 1) 2) from above can be written as add 1 2. With that syntax tweak, you don't have to think about currying at all unless you want to partially apply a function -- just call it with all the arguments and the syntax will work out.

Haskell: List fusion, where is it needed?

Let's say we have the following:
l = map f (map g [1..100])
And we want to do:
head l
So we get:
head (map f (map g [1..100]))
Now, we have to get the first element of this. map is defined something like so:
map f l = f (head l) : (map f (tail l))
So then we get:
f (head (map g [1..100]))
And then applied again:
f (g (head [1..100]))
Which results in
f (g 1)
No intermediate lists are formed, simply due to laziness.
Is this analysis correct? And with simple structures like this:
foldl' ... $ map f1 $ map f2 $ createlist
are intermediate lists ever created, even without "list fusion"? (I think laziness should eliminate them trivially).
The only place I can see a reason to keep a list is if we did:
l' = [1..100]
l = map f (map g l')
Where we might want to keep l' if it is used elsewhere. However, in the case of l' above, it should be fairly trivial for the compiler to realise it's quicker just to recalculate the list than to store it.
In your map example the list cells are created and then immediately garbage-collected. With list fusion no GC or thunk manipulation (which are both not free) is needed.
Fusion benefits most when the intermediate data structures are expensive. When the structure being passed is lazy, and accessed in a sequential manner, the structures may be relatively cheap (as is the case with lists).
For lazy structures, fusion removes the constant factor of creating a thunk, and immediately garbage collecting it.
For strict structures, fusion can remove O(n) work.
The greatest benefits are thus for strict arrays and similar structures. However, even fusing lazy lists is still a win, due to increased data locality, due to fewer intermediate structures being allocated as thunks.
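To make this concrete, here is roughly what fusing the earlier foldl' pipeline amounts to, written by hand (a sketch with assumed element types and (+) as the folding function, since the original elides both):

import Data.List (foldl')

-- Hand-fused equivalent of: foldl' (+) 0 (map f1 (map f2 xs)).
-- Both maps are absorbed into the accumulator function, so no
-- intermediate cons cells are built between the stages.
fusedSum :: (Int -> Int) -> (Int -> Int) -> [Int] -> Int
fusedSum f1 f2 = foldl' (\acc x -> acc + f1 (f2 x)) 0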

Can GHC really never inline map, scanl, foldr, etc.?

I've noticed the GHC manual says "for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored."
Doesn't this say every application of common recursive functional constructs like map, zip, scan*, fold*, sum, etc. cannot be inlined?
You could always rewrite all these functions when you employ them, adding appropriate strictness tags, or maybe employ fancy techniques like the "stream fusion" recommended here.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
Indeed, GHC cannot at present inline recursive functions. However:
GHC will still specialise recursive functions. For instance, given
fac :: (Eq a, Num a) => a -> a
fac 0 = 1
fac n = n * fac (n-1)
f :: Int -> Int
f x = 1 + fac x
GHC will spot that fac is used at type Int -> Int and generate a specialised version of fac for that type, which uses fast integer arithmetic.
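The effect is roughly as if a monomorphic copy had been written by hand (a sketch with a made-up name; the actual generated Core differs and uses unboxed arithmetic):

-- Hypothetical specialised copy: the Eq/Num dictionary arguments are gone,
-- so the comparisons and multiplications compile to machine arithmetic.
fac_Int :: Int -> Int
fac_Int 0 = 1
fac_Int n = n * fac_Int (n - 1)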
This specialisation happens automatically within a module (e.g. if fac and f are defined in the same module). For cross-module specialisation (e.g. if f and fac are defined in different modules), mark the to-be-specialised function with an INLINABLE pragma:
{-# INLINABLE fac #-}
fac :: (Eq a, Num a) => a -> a
...
There are manual transformations which make functions nonrecursive. The lowest-power technique is the static argument transformation, which applies to recursive functions with arguments which don't change on recursive calls (eg many higher-order functions such as map, filter, fold*). This transformation turns
map f [] = []
map f (x:xs) = f x : map f xs
into
map f xs0 = go xs0
  where
    go [] = []
    go (x:xs) = f x : go xs
so that a call such as
g :: [Int] -> [Int]
g xs = map (2*) xs
will have map inlined and become
g [] = []
g (x:xs) = 2*x : g xs
This transformation has been applied to Prelude functions such as foldr and foldl.
Fusion techniques also make many functions nonrecursive, and are more powerful than the static argument transformation. The main approach for lists, which is built into the Prelude, is shortcut fusion. The basic approach is to write as many functions as possible as non-recursive functions which use foldr and/or build; then all the recursion is captured in foldr, and there are special RULES for dealing with foldr.
Taking advantage of this fusion is in principle easy: avoid manual recursion, preferring library functions such as foldr, map, filter, and any functions in this list. In particular, writing code in this style produces code which is "simultaneously fast and elegant".
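As a small illustration of the style, here is map written without explicit recursion (a sketch; build is exported from GHC.Exts, and mapViaBuild is a name made up here):

import GHC.Exts (build)

-- All recursion is delegated to foldr: the list is rebuilt with the
-- caller-supplied cons/nil, so the foldr/build RULES can fuse this
-- producer with an adjacent consumer and eliminate the intermediate list.
mapViaBuild :: (a -> b) -> [a] -> [b]
mapViaBuild f xs = build (\cons nil -> foldr (cons . f) nil xs)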
Modern libraries such as text and vector use stream fusion behind the scenes. Don Stewart wrote a pair of blog posts (1, 2) demonstrating this in action in the now obsolete library uvector, but the same principles apply to text and vector.
As with shortcut fusion, taking advantage of stream fusion in text and vector is in principle easy: avoid manual recursion, preferring library functions which have been marked as "subject to fusion".
There is ongoing work on improving GHC to support inlining of recursive functions. This falls under the general heading of supercompilation, and recent work on this seems to have been led by Max Bolingbroke and Neil Mitchell.
In short, not as often as you would think. The reason is that the "fancy techniques" such as stream fusion are employed when the libraries are implemented, and library users don't need to worry about them.
Consider Data.List.map. The base package defines map as
map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (x:xs) = f x : map f xs
This map is self-recursive, so GHC won't inline it.
However, base also defines the following rewrite rules:
{-# RULES
"map"     [~1] forall f xs. map f xs = build (\c n -> foldr (mapFB c f) n xs)
"mapList" [1]  forall f.    foldr (mapFB (:) f) [] = map f
"mapFB"   forall c f g.     mapFB (mapFB c f) g = mapFB c (f.g)
  #-}
This replaces uses of map via foldr/build fusion, then, if the function cannot be fused, replaces it with the original map. Because the fusion happens automatically, it doesn't depend on the user being aware of it.
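For reference, mapFB itself is a tiny non-recursive wrapper; in GHC.Base it is defined roughly as follows (quoted from memory, so treat the exact form as approximate):

-- mapFB c f is "cons after applying f"; the "mapFB" rule above collapses
-- two adjacent wrappers into one, which is how consecutive maps fuse.
mapFB :: (elt -> lst -> lst) -> (a -> elt) -> a -> lst -> lst
mapFB c f = \x ys -> c (f x) ys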
As proof that this all works, you can examine what GHC produces for specific inputs. For this function:
proc1 = sum . take 10 . map (+1) . map (*2)
eval1 = proc1 [1..5]
eval2 = proc1 [1..]
when compiled with -O2, GHC fuses all of proc1 into a single recursive form (as seen in the core output with -ddump-simpl).
Of course there are limits to what these techniques can accomplish. For example, the naive average function, mean xs = sum xs / length xs, is easily transformed by hand into a single fold, and frameworks exist that can do so automatically; however, at present there's no known way to automatically translate between standard functions and the fusion framework. So in this case, the user does need to be aware of the limitations of the compiler-produced code.
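A single-fold mean might look like this (a sketch; the bang patterns keep both accumulator fields strict so the fold runs in constant space):

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- One-pass mean: carry the running sum and count in one accumulator
-- instead of traversing the list twice with sum and length.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (!acc, !len) x = (acc + x, len + 1)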
So in many cases compilers are sufficiently advanced to create code that's fast and elegant. Knowing when they will do so, and when the compiler is likely to fall down, is IMHO a large part of learning how to write efficient Haskell code.
for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored.
If something is recursive, inlining it fully would require knowing at compile time how many times it recurses. Since that depends on run-time input of variable length, it is not possible in general.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
There are certain techniques, though, that can make recursive calls much, much faster than usual; for example, tail call optimization (see the SO wiki).