Internal representation not hidden in F# module

I'm doing a project, using modules, where I compute images and print them on screen.
Everything in my code works just fine, but I'm not happy with this inconvenience: in my .fs file (where I define the type and all my functions) I've declared
type Picture = P of (Set<((float*float)*(float*float))> * (int*int))
The set describes the endpoints of the segments (the lines) in the picture,
and the pair of ints defines the bounding box in which the image will be shown (width and height).
When I try my functions (in another file) I use my grid function:
let rec setBuilder (ls: ((int*int)*(int*int)) list) =
    match ls with
    | [] -> Set.empty
    | ((x, y), (w, z)) :: xs ->
        setBuilder xs |> Set.add ((float x, float y), (float w, float z))

let grid ls (wdt, hgt) = P (setBuilder ls, (wdt, hgt))
With this I take an ((int*int)*(int*int)) list and build my Picture.
The problem is that when I build a Picture with grid (in the other file where I test all my functions), the internal representation is visible.
let persn = [((4, 0),(6, 7)); ((6, 7), (6, 10)); ((6, 10), (0, 10)); ((0, 10), (0, 12)); ((0, 12), (6, 12)); ((6, 12), (6, 14)); ((6, 14), (4, 16)); ((4, 16), (4, 18)); ((4, 18), (6, 20)); ((6, 20), (8, 20));
((8, 20), (10,18)); ((10, 18), (10, 16)); ((10, 16), (8, 14)); ((8, 14), (8, 12)); ((8, 12), (10, 12)); ((10, 12), (10, 14)); ((10, 14), (12, 14)); ((12, 14), (12, 10)); ((12, 10), (8, 10)); ((8, 10), (8, 8));
((8, 8), (10, 0)); ((10, 0), (8, 0)); ((8, 0), (7, 4)); ((7, 4), (6, 0)); ((6, 0), (4, 0))]
let box = (15,20)
let person = grid persn box
When I send the last line to the interpreter, this is what I get from the console:
val person : Picture =
P (set
[((0.0, 10.0), (0.0, 12.0)); ((0.0, 12.0), (6.0, 12.0));
((4.0, 0.0), (6.0, 7.0)); ((4.0, 16.0), (4.0, 18.0));
((4.0, 18.0), (6.0, 20.0)); ((6.0, 0.0), (4.0, 0.0));
((6.0, 7.0), (6.0, 10.0)); ((6.0, 10.0), (0.0, 10.0));
((6.0, 12.0), (6.0, 14.0)); ...], (15, 20))
Is there a way to hide this information? I looked it up and the solution seems to be tagged values (but I'm already using them).
* EDIT *
I noticed that this behavior seems to be associated with the static members in my implementation file; without them the inner representation is not shown.
type Picture with
    // scale a picture (and its bounding box) by a factor c
    static member (*) (c: float, pic: Picture) =
        match pic with
        | P (set, (wdt, hgt)) ->
            P (Set.map (fun ((x, y), (w, z)) -> ((x * c, y * c), (w * c, z * c))) set,
               (int (round (float wdt * c)), int (round (float hgt * c))))
    // place pic2 to the right of pic1, scaled to pic1's height
    static member (|>>) (pic1: Picture, pic2: Picture) =
        match pic1, pic2 with
        | P (set1, (w1, h1)), P (set2, (w2, h2)) ->
            let new_p2 = (float h1 / float h2) * pic2
            match new_p2 with
            | P (nset2, (nw2, nh2)) ->
                P (Set.union set1 (Set.map (fun ((x, y), (w, z)) -> ((x + float w1, y), (w + float w1, z))) nset2),
                   (w1 + nw2, h1))
    // place pic2 on top of pic1, scaled to pic1's width
    static member (|^^) (pic1: Picture, pic2: Picture) =
        match pic1, pic2 with
        | P (set1, (w1, h1)), P (set2, (w2, h2)) ->
            let new_pic2 = (float w1 / float w2) * pic2
            match new_pic2 with
            | P (nset2, (nw2, nh2)) ->
                P (Set.union set1 (Set.map (fun ((x, y), (w, z)) -> ((x, float h1 + y), (w, float h1 + z))) nset2),
                   (w1, h1 + nh2))
    // repeat a picture horizontally / vertically (n+1 copies in total)
    static member (>||>) (n, pic: Picture) =
        match n with
        | 0 -> pic
        | m -> pic |>> ((m - 1) >||> pic)
    static member (^||^) (n, pic: Picture) =
        match n with
        | 0 -> pic
        | m -> pic |^^ ((m - 1) ^||^ pic)

Simply write type Picture = private P of ...; then other modules cannot see the internals of Picture.
Note: if you write type private Picture = P of ... it means that other modules cannot see the Picture type at all.
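A minimal sketch of what this looks like; the module name and the boundingBox accessor are illustrative additions, not part of the original code:
module Pictures

// The representation is private to this module: other modules can still use
// Picture values, but they cannot see or pattern-match the P case.
type Picture = private P of (Set<((float*float)*(float*float))> * (int*int))

let grid ls (wdt, hgt) =
    let segs =
        ls |> List.map (fun ((x, y), (w, z)) -> ((float x, float y), (float w, float z)))
    P (Set.ofList segs, (wdt, hgt))

// Expose whatever read access callers need through ordinary functions.
let boundingBox (P (_, box)) = box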

Related

Supply different families of priors as a parameter in the bugs/stan model

This is the classic eight schools example from Bayesian Data Analysis by Andrew Gelman. Please see the Stan file and R code below. I use a Cauchy prior with parameter A for the hyperparameter tau in the Stan file. I am trying to supply the R function "school" with priors outside the Cauchy family, for example a uniform(0,1000) prior, so that I do not have to create a different Stan file for each new prior. Is this possible within Stan or BUGS?
schools.stan:
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // standard error of effect estimates
  real<lower=0> A;
}
parameters {
  real mu;                 // population treatment effect
  real<lower=0> tau;       // standard deviation in treatment effects
  vector[J] eta;           // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;  // school treatment effects
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
  tau ~ cauchy(0, A);
}
The R code:
school <- function(A = 100) {
  schools_dat <- list(J = 8,
                      y = c(28, 8, -3, 7, -1, 1, 18, 12),
                      sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                      A = A)
  fit <- stan(file = "schools.stan", data = schools_dat, iter = 20)
  print(fit)
}
school()
I tried the following but have no idea how to change the Stan file correspondingly.
school <- function(prior = "dunif(0,1000)") {
  schools_dat <- list(J = 8,
                      y = c(28, 8, -3, 7, -1, 1, 18, 12),
                      sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                      prior = prior)
  fit <- stan(file = "schools.stan", data = schools_dat, iter = 20)
  print(fit)
}
school()
It's possible to pre-specify more than one distribution in the Stan code, and then specify which distribution you want in the input data. Stan isn't really intended to be used this way, but it can be done!
Here's an example. I've added a new data variable, tau_prior; it's an integer that specifies which prior you want to use for tau. 1 = Cauchy, 2 = uniform, 3 = exponential. In addition, for each type of prior, there's a data variable that sets a hyperparameter. (Hyperparameters for the distributions that aren't chosen have no effect.)
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // standard error of effect estimates
  int<lower=1,upper=3> tau_prior;
  real<lower=0> cauchy_sigma;
  real<lower=0> uniform_beta;
  real<lower=0> exponential_beta;
}
parameters {
  real mu;                 // population treatment effect
  real<lower=0> tau;       // standard deviation in treatment effects
  vector[J] eta;           // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;  // school treatment effects
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
  if (tau_prior == 1) {
    tau ~ cauchy(0, cauchy_sigma);
  } else if (tau_prior == 2) {
    tau ~ uniform(0, uniform_beta);
  } else if (tau_prior == 3) {
    tau ~ exponential(exponential_beta);
  }
}
I've also modified the R function so that it provides default values for each hyperparameter, on a scale similar to the one you've used already.
school <- function(tau_prior = 1,
                   cauchy_sigma = 100,
                   uniform_beta = 1000,
                   exponential_beta = 0.01) {
  schools_dat <- list(J = 8,
                      y = c(28, 8, -3, 7, -1, 1, 18, 12),
                      sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                      tau_prior = tau_prior,
                      cauchy_sigma = cauchy_sigma,
                      uniform_beta = uniform_beta,
                      exponential_beta = exponential_beta)
  fit <- stan(file = "schools.stan", data = schools_dat, iter = 20)
  print(fit)
}
# The default: use a Cauchy prior with scale 100.
school()
# Use a uniform prior with the default upper limit (1000).
school(tau_prior = 2)
# Use an exponential prior with a non-default rate (1).
school(tau_prior = 3, exponential_beta = 1)

Array Pair Loading from databases using Qlik Sense

Does anyone have experience with how to load/prepare data like:
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
taken from an SQL database (stored there as a single value) into a Qlik Sense table:
ID, Value
1, a
2, b
3, c
4, d
Check out the annotated script below. After it runs, you will get the ID/Value table you described:
set vSQLData = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')];
SQLData:
Load
// at this point the data will look like: "1, a", "2, b"
// will split the string on "," and will
// get the first value as ID
// and the second one as Value
SubField(TempField2, ',', 1) as ID,
SubField(TempField2, ',', 2) as Value,
;
Load
// split the string by ")," and generate N number of rows
// then for each row remove "(", ")" and "'" characters
PurgeChar(SubField(TempField1, '),'), '''()''') as TempField2
;
Load
// remove "[" and "]" characters
PurgeChar('$(vSQLData)', '[]') as TempField1
AutoGenerate(1)
;

Function like `enumerate` for arrays with custom indices?

For an array with a non-one-based index like:
using OffsetArrays
a = OffsetArray( [1,2,3], -1)
Is there a simple way to get tuples of (index, value), similar to enumerate?
Plain enumerate still just counts the elements; collect(enumerate(a)) returns:
3-element Array{Tuple{Int64,Int64},1}:
(1, 1)
(2, 2)
(3, 3)
I'm looking for:
(0, 1)
(1, 2)
(2, 3)
The canonical solution is to use pairs:
julia> a = OffsetArray( [1,2,3], -1);
julia> for (i, x) in pairs(a)
println("a[", i, "]: ", x)
end
a[0]: 1
a[1]: 2
a[2]: 3
julia> b = [1,2,3];
julia> for (i, x) in pairs(b)
println("b[", i, "]: ", x)
end
b[1]: 1
b[2]: 2
b[3]: 3
It works for other types of collections too:
julia> d = Dict(:a => 1, :b => 2, :c => 3);
julia> for (i, x) in pairs(d)
println("d[:", i, "]: ", x)
end
d[:a]: 1
d[:b]: 2
d[:c]: 3
You can find a lot of other interesting iterators by reading the documentation of Base.Iterators.
Try eachindex(a) to get the indices; see the example below:
julia> tuple.(eachindex(a),a)
3-element OffsetArray(::Array{Tuple{Int64,Int64},1}, 0:2) with eltype Tuple{Int64,Int64} with indices 0:2:
(0, 1)
(1, 2)
(2, 3)

It looks like the keyword "in" is used as a variable

Pasting the following code in http://elm-lang.org/try and clicking on "Compile" :
import Html exposing (text)
main =
  let (x, y, _) = List.foldL (\elem (sum, diff, mult) ->
      (sum + elem, elem - diff, mult * elem)
) (0, 0, 0) [1, 2, 3, 4, 5] in
  text ("Hello, World!" ++ toString x)
results in an unexpected error:
Detected errors in 1 module.
-- SYNTAX PROBLEM ------------------------------------------------------------
It looks like the keyword in is being used as a variable.
7| ) (0, 0, 0) [1, 2, 3, 4, 5] in
^
Rename it to something else.
What is wrong here? Parentheses match.
Indentation is important in Elm, and you've got a closing parenthesis that is too far to the left (the second to last line). Changing it to this will be valid code (also, it's List.foldl, not foldL):
main =
    let (x, y, _) = List.foldl (\elem (sum, diff, mult) ->
                (sum + elem, elem - diff, mult * elem)
            ) (0, 0, 0) [1, 2, 3, 4, 5] in
    text ("Hello, World!" ++ toString x)
It is probably more idiomatic to put the in statement on its own line, aligned with let, just for keeping things clear:
main =
    let (x, y, _) = List.foldl (\elem (sum, diff, mult) ->
                (sum + elem, elem - diff, mult * elem)
            ) (0, 0, 0) [1, 2, 3, 4, 5]
    in
    text ("Hello, World!" ++ toString x)
You could also incorporate elm-format into your editing process to automatically format your code on save.

How do I get a SQL row_number equivalent for a Spark RDD?

I need to generate a full list of row_numbers for a data table with many columns.
In SQL, this would look like this:
select
key_value,
col1,
col2,
col3,
row_number() over (partition by key_value order by col1, col2 desc, col3)
from
temp
;
Now, let's say in Spark I have an RDD of the form (K, V), where V=(col1, col2, col3), so my entries are like
(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))
etc.
I want to order these using commands like sortBy(), sortWith(), sortByKey(), zipWithIndex, etc., and have a new RDD with the correct row_number:
(key1, (1,2,3), 2)
(key1, (1,4,7), 1)
(key1, (2,2,3), 3)
(key2, (5,5,5), 1)
(key2, (5,5,9), 2)
(key2, (7,5,5), 3)
etc.
(I don't care about the parentheses, so the form can also be (K, (col1,col2,col3,rownum)) instead)
How do I do this?
Here's my first attempt:
val sample_data = Seq(((3,4),5,5,5),((3,4),5,5,9),((3,4),7,5,5),((1,2),1,2,3),((1,2),1,4,7),((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
temp1.collect().foreach(println)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
// ((1,2),1,2,3)
// ((1,2),1,4,7)
// ((1,2),2,2,3)
temp1.map(x => (x, 1)).sortByKey().zipWithIndex.collect().foreach(println)
// ((((1,2),1,2,3),1),0)
// ((((1,2),1,4,7),1),1)
// ((((1,2),2,2,3),1),2)
// ((((3,4),5,5,5),1),3)
// ((((3,4),5,5,9),1),4)
// ((((3,4),7,5,5),1),5)
// note that this isn't ordering with a partition on key value K!
val temp2 = temp1.???
Also note that the function sortBy cannot be applied directly to an RDD, but one must run collect() first, and then the output isn't an RDD, either, but an array
temp1.collect().sortBy(a => a._2 -> -a._3 -> a._4).foreach(println)
// ((1,2),1,4,7)
// ((1,2),1,2,3)
// ((1,2),2,2,3)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
Here's a little more progress, but still not partitioned:
val temp2 = sc.parallelize(temp1.map(a => (a._1,(a._2, a._3, a._4))).collect().sortBy(a => a._2._1 -> -a._2._2 -> a._2._3)).zipWithIndex.map(a => (a._1._1, a._1._2._1, a._1._2._2, a._1._2._3, a._2 + 1))
temp2.collect().foreach(println)
// ((1,2),1,4,7,1)
// ((1,2),1,2,3,2)
// ((1,2),2,2,3,3)
// ((3,4),5,5,5,4)
// ((3,4),5,5,9,5)
// ((3,4),7,5,5,6)
The row_number() over (partition by ... order by ...) functionality was added to Spark 1.4. This answer uses PySpark/DataFrames.
Create a test DataFrame:
from pyspark.sql import Row, functions as F
testDF = sc.parallelize(
    (Row(k="key1", v=(1,2,3)),
     Row(k="key1", v=(1,4,7)),
     Row(k="key1", v=(2,2,3)),
     Row(k="key2", v=(5,5,5)),
     Row(k="key2", v=(5,5,9)),
     Row(k="key2", v=(7,5,5))
    )
).toDF()
Add the partitioned row number:
from pyspark.sql.window import Window
(testDF
 .select("k", "v",
         F.rowNumber()
          .over(Window.partitionBy("k").orderBy("k"))
          .alias("rowNum"))
 .show())
+----+-------+------+
| k| v|rowNum|
+----+-------+------+
|key1|[1,2,3]| 1|
|key1|[1,4,7]| 2|
|key1|[2,2,3]| 3|
|key2|[5,5,5]| 1|
|key2|[5,5,9]| 2|
|key2|[7,5,5]| 3|
+----+-------+------+
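As a side note, newer Spark releases (1.6 and later, if I recall the version history correctly) expose the same window function as F.row_number() rather than F.rowNumber(). A minimal sketch of the equivalent query, reusing testDF and the same window spec as above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same query as above, but with the Spark 1.6+ name row_number()
# (rowNumber() was deprecated and later removed).
(testDF
 .select("k", "v",
         F.row_number()
          .over(Window.partitionBy("k").orderBy("k"))
          .alias("rowNum"))
 .show())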
This is an interesting problem you're bringing up. I will answer it in Python but I'm sure you will be able to translate seamlessly to Scala.
Here is how I would tackle it:
1- Simplify your data:
temp2 = temp1.map(lambda x: (x[0],(x[1],x[2],x[3])))
temp2 is now a "real" key-value pair RDD. It looks like this:
[
((3, 4), (5, 5, 5)),
((3, 4), (5, 5, 9)),
((3, 4), (7, 5, 5)),
((1, 2), (1, 2, 3)),
((1, 2), (1, 4, 7)),
((1, 2), (2, 2, 3))
]
2- Then, use the group-by function to reproduce the effect of the PARTITION BY:
temp3 = temp2.groupByKey()
temp3 is now an RDD with 2 rows:
[((1, 2), <pyspark.resultiterable.ResultIterable object at 0x15e08d0>),
((3, 4), <pyspark.resultiterable.ResultIterable object at 0x15e0290>)]
3- Now, you need to apply a rank function within each group of the RDD. In Python, I would use the simple sorted function (enumerate will create your row_number column):
temp4 = temp3.flatMap(lambda x: tuple([(x[0],(i[1],i[0])) for i in enumerate(sorted(x[1]))])).take(10)
Note that to implement your particular order, you would need to feed the right "key" argument to sorted (in Python, I would just create a lambda function like this one):
lambda tuple : (tuple[0],-tuple[1],tuple[2])
Without the key argument, the end result looks like this (a combined sketch with the key plugged in follows the example output below):
[
((1, 2), ((1, 2, 3), 0)),
((1, 2), ((1, 4, 7), 1)),
((1, 2), ((2, 2, 3), 2)),
((3, 4), ((5, 5, 5), 0)),
((3, 4), ((5, 5, 9), 1)),
((3, 4), ((7, 5, 5), 2))
]
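For completeness, here is a sketch of step 3 with that key plugged in; variable names follow the snippets above, and the i + 1 is only there so row numbers start at 1, as SQL's row_number does:
# Same flatMap as above, but each group is sorted with the custom key so the
# row numbers follow "order by col1, col2 desc, col3" within each key.
temp4 = temp3.flatMap(
    lambda kv: [(kv[0], (v, i + 1))
                for i, v in enumerate(sorted(kv[1],
                                             key=lambda t: (t[0], -t[1], t[2])))])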
Hope that helps!
Good luck.
val test = Seq(("key1", (1,2,3)),("key1",(4,5,6)), ("key2", (7,8,9)), ("key2", (0,1,2)))
test: Seq[(String, (Int, Int, Int))] = List((key1,(1,2,3)), (key1,(4,5,6)), (key2,(7,8,9)), (key2,(0,1,2)))
test.foreach(println)
(key1,(1,2,3))
(key1,(4,5,6))
(key2,(7,8,9))
(key2,(0,1,2))
val rdd = sc.parallelize(test, 2)
rdd: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ParallelCollectionRDD[41] at parallelize at <console>:26
val rdd1 = rdd.groupByKey.map(x => (x._1, x._2.toArray)).map(x => (x._1, x._2.sortBy(x => x._1).zipWithIndex))
rdd1: org.apache.spark.rdd.RDD[(String, Array[((Int, Int, Int), Int)])] = MapPartitionsRDD[44] at map at <console>:25
val rdd2 = rdd1.flatMap {
  elem =>
    val key = elem._1
    elem._2.map(row => (key, row._1, row._2))
}
rdd2: org.apache.spark.rdd.RDD[(String, (Int, Int, Int), Int)] = MapPartitionsRDD[45] at flatMap at <console>:25
rdd2.collect.foreach(println)
(key1,(1,2,3),0)
(key1,(4,5,6),1)
(key2,(0,1,2),0)
(key2,(7,8,9),1)
Using Spark SQL, read the data files:
val df = spark.read.json("s3://s3bukcet/key/activity/year=2018/month=12/date=15/*")
The above files have the fields user_id, pageviews and clicks.
Generate the activity_id (row_number) partitioned by user_id and ordered by clicks:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val output = df.withColumn("activity_id",
  functions.row_number().over(Window.partitionBy("user_id").orderBy("clicks")).cast(DataTypes.IntegerType))