F# Deedle and Multi Index - dataframe

I have recently started to learn F# for data science (coming from simple C# and Python). I am starting to get used to the power of the functional-first paradigm for science.
However, I am still confused about how to handle a problem I could easily solve using pandas in Python. It is related to multi-index time series / data frames. I have looked extensively at Deedle, but I am still not sure whether Deedle could help me build a table like this:
Column Index 1:  A                    || B
Column Index 2:  A1       | A2        || B1       | B2
Column Index 3:  p1   p2  | p1   p2   || p1   p2  | p1   p2
Row Index:
date1            0.5  2.  | 2.   0.5  || 3.   0.  | 2.   3.
date2            ......
The idea is to be able to sum all p1 series where Index 1 = A, and so on.
I did not find an example of such a thing using Deedle.
If it is not available, what structure would you recommend for my data?
Thanks for helping an F# newbie (who is already in love with it).

In Deedle, you can create a frame or a series with a hierarchical index by using a tuple as the key:
let ts =
  series
    [ ("A", "A1", "p1") => 0.5
      ("A", "A1", "p2") => 2.
      ("A", "A2", "p3") => 2.
      ("A", "A2", "p4") => 0.5 ]
Deedle does have some special handling for this. For example, it will output the data as:
A A1 p1 -> 0.5
     p2 -> 2
  A2 p3 -> 2
     p4 -> 0.5
To apply aggregation over a part of the hierarchy, you can use the applyLevel function:
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1) Stats.mean
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1, l2) Stats.mean
The first argument is a function that receives the tuple of keys and selects which levels of the key to group by, so the two calls above aggregate over the top level and over the top two levels, respectively.
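To make the grouping semantics concrete, here is a minimal plain-Python sketch of what a level-based aggregation does. It is not Deedle's implementation; `apply_level` is a hypothetical helper, and the series is modelled as a dictionary keyed by tuples, which is an assumption made only for illustration.

```python
from collections import defaultdict
from statistics import mean


def apply_level(series, level_fn, agg):
    """Group a tuple-keyed mapping by level_fn(key) and aggregate each group."""
    groups = defaultdict(list)
    for key, value in series.items():
        groups[level_fn(key)].append(value)
    return {group_key: agg(values) for group_key, values in groups.items()}


# Same data as the series above, modelled as a dict (illustrative assumption).
ts = {
    ("A", "A1", "p1"): 0.5,
    ("A", "A1", "p2"): 2.0,
    ("A", "A2", "p3"): 2.0,
    ("A", "A2", "p4"): 0.5,
}

# Aggregate over the top level, then over the top two levels.
by_top = apply_level(ts, lambda k: k[0], mean)
by_top_two = apply_level(ts, lambda k: (k[0], k[1]), mean)
```

The key-projection function plays the same role as the lambda passed to applyLevel: whatever it returns becomes the key of the aggregated result.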

Related

how to make pandas code faster or using dask dataframe or how to use vectorization for this type of problem?

import pandas as pd
# two label columns and one metric column
label1 = ["a1", "a1", "a1","a1", "a2","a2","a2","a2", "b1","b1","b1","b1", "b2","b2","b2","b2"]
label2 = ["a1", "a2", "b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2"]
m1 = [ 0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'label1': label1, 'label2': label2,'m1':m1}
df = pd.DataFrame(data)
df
output of this dataframe:
label1 label2 m1
0 a1 a1 0
1 a1 a2 3
2 a1 b1 2
3 a1 b2 7
4 a2 a1 3
5 a2 a2 0
6 a2 b1 5
7 a2 b2 8
8 b1 a1 2
9 b1 a2 5
10 b1 b1 0
11 b1 b2 9
12 b2 a1 7
13 b2 a2 8
14 b2 b1 9
15 b2 b2 0
I want to write a function that takes two strings, samp1 (e.g. a) and samp2 (e.g. b), and a data frame df as input. The two strings are first preprocessed to build the labels that appear in the data frame. We then look up the indices of particular rows, such as (a1,b1) or (a2,b2), to get their corresponding 'm1' values. Next, we add those m1 values, store the sums in two variables, and return the minimum of the two. [Looking at the code snippet may make this easier to understand.]
The following is the code for this function:
def min_4line(samp1, samp2, df):
    k = ['1', '2']
    # k and samp are combined to generate the labels,
    # for example a and b produce a1, a2, b1, b2
    samp1_1 = samp1 + k[0]
    samp1_2 = samp1 + k[1]
    samp2_1 = samp2 + k[0]
    samp2_2 = samp2 + k[1]
    # print(samp1_1)  # a1
    # print(samp1_2)  # a2
    # print(samp2_1)  # b1
    # print(samp2_2)  # b2
    # We need the indices of particular rows to compute comb1:
    # comb1 sums (a1,b1) [located at ind1] and (a2,b2) [located at ind2];
    # comb2 is built the same way from (a2,b1) and (a1,b2)
    ind1 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_1)].tolist()
    ind2 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_2)].tolist()
    # print(ind1)  # [2]
    # print(ind2)  # [7]
    comb1 = int(df.loc[ind1, 'm1'].iloc[0]) + int(df.loc[ind2, 'm1'].iloc[0])
    # print('comb1: ', comb1)  # comb1: 10
    ind3 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_1)].tolist()
    ind4 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_2)].tolist()
    # print(ind3)  # [6]
    # print(ind4)  # [3]
    comb2 = int(df.loc[ind3, 'm1'].iloc[0]) + int(df.loc[ind4, 'm1'].iloc[0])
    # print('comb2: ', comb2)  # comb2: 12
    return min(comb1, comb2)  # 10
To collect the unique prefixes such as a, b from the data frame, we need a list operation:
#this list is needed so that I can compare how many unique values are there...
#it could get a,b,c,d.... and make comparison
#like (a,b), (a,c),(a,d), (b,c),(b,d),(c,d) for the function
list_line=list(df['label1'].unique())
string_test=[a[:-1] for a in list_line]
#string_test will exclude number portion of character
list_img=sorted(list(set(string_test)))
#print(list_img)#['a', 'b']
#print(len(list_img))#2
Now we need to create a data frame by going over 'list_img' and calling the min_4line function for pairs such as (a,b), (a,c), storing the corresponding output of the function. A nested loop is necessary here: if the list consists of [a,b,c,d], the loop visits (a,b),(a,c),(a,d),(b,c),(b,d),(c,d), so that each unique pair is processed once. The code for this is:
%%time
d=[]
for i in range(len(list_img)):
    for j in range(i+1,len(list_img)):
        a=min_4line(list_img[i],list_img[j],df)
        print(a)
        d.append({'label1':str(list_img[i]),'label2':str(list_img[j]), 'metric': str(a)})
dataf=pd.DataFrame(d)
dataf.head(5)
output is:
  label1 label2 metric
0      a      b     10
Is there any way to make the code faster? I broke the problem down into small parts. This operation is needed for 16 million rows, and I am interested in using dask for it. When I asked this type of question previously, many people failed to understand it because I was not able to state the problem clearly; I hope it is easier to follow this time. You can copy these code cells and run them in a Jupyter notebook to check the output and suggest a good way to make the program faster.
[updated]
Can anyone suggest how I can get the indices of those particular rows using numpy or any other kind of vectorized operation?
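One possible direction, sketched here in plain Python rather than as a vectorized pandas or dask solution: the repeated boolean-mask scans inside min_4line can be replaced by a dictionary keyed by (label1, label2) that is built once. The function name `min_metric_pairs` is hypothetical, and the sketch assumes that every (label1, label2) combination occurs exactly once and that every label ends in a single digit, 1 or 2.

```python
from itertools import combinations


def min_metric_pairs(label1, label2, m1):
    """Return one row per unique prefix pair with the smaller of the two sums."""
    # Build the (label1, label2) -> m1 lookup once instead of scanning per call.
    lookup = {(a, b): v for a, b, v in zip(label1, label2, m1)}
    # Prefixes such as 'a', 'b', obtained by dropping the trailing digit.
    prefixes = sorted({name[:-1] for name in label1})
    rows = []
    for p, q in combinations(prefixes, 2):
        comb1 = lookup[(p + "1", q + "1")] + lookup[(p + "2", q + "2")]
        comb2 = lookup[(p + "2", q + "1")] + lookup[(p + "1", q + "2")]
        rows.append({"label1": p, "label2": q, "metric": min(comb1, comb2)})
    return rows


# Same data as in the question.
label1 = ["a1"] * 4 + ["a2"] * 4 + ["b1"] * 4 + ["b2"] * 4
label2 = ["a1", "a2", "b1", "b2"] * 4
m1 = [0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]

rows = min_metric_pairs(label1, label2, m1)
```

Each pair then costs four dictionary lookups instead of four full-column comparisons. Unlike the loop in the question, the metric is kept as an integer rather than converted to a string.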

Find pairs with small difference in value

In a single table I need to find those pairs of rows for which the values of a certain column differ by at most a given amount. For example, given the following table and a maximum difference of 0.5:
val
---
1
1.2
1.3
4
4.5
6
The desired result would be:
val1 | val2
-----+-----
1 | 1.2
1 | 1.3
1.2 | 1.3
4 | 4.5
The main problem is that my table is gigantic and a cross product is not possible in reasonable time. i.e. this does not work:
SELECT t1.val, t2.val
FROM table t1, table t2
WHERE abs(t1.val - t2.val) <= 0.5
Is there a way to do this? I read up on window functions, so at least I know it is possible to compute, for each entry, the difference in value to the previous entry, obtaining for the example above:
val | diff
----+-----
1 | 0
1.2 | 0.2
1.3 | 0.1
4 | 2.7
4.5 | 0.5
6 | 1.5
From here on I need to find the ranges where the sum of diff does not exceed the given maximum. Is this possible? Are there more reasonable approaches?
I'm using spark.
Thank you.
EDIT: As pointed out, my query would also include symmetric pairs as well as pairs where the values are equal. Sorry for the ambiguity.
However, this is not the point. My problem is the join. The dataset is too large for a cartesian product. I am looking for a solution which avoids using one.
Also, the size of the dataset I'm dealing with is 1000000 tuples. I am not sure what execution time to expect, but it was suggested that there must be a solution which avoids using a cartesian product on the data.
Thank you.
What you tried is close. Just a few modifications needed:
select t1.val,t2.val
from tbl t1
join tbl t2 on t2.val-t1.val<=0.5 and t1.val<t2.val
You can generate a virtual time-based window:
import org.apache.spark.sql.functions._
import spark.implicits._ // Where spark is an instance of SparkSession
val df = Seq(1.0, 1.2, 1.3, 4.0, 4.5, 6.0).toDF("val")
val w = window(
  $"val".cast("timestamp"), "1000 milliseconds", "500 milliseconds"
).cast("struct<start:double,end:double>").alias("window")
val windowed = df.select(w, $"val")
Then join, filter, and remove duplicates:
val result = windowed.alias("left")
  .join(windowed.alias("right"), "window")
  .where(abs($"left.val" - $"right.val") <= 0.5 && $"left.val" < $"right.val")
  .drop("window").distinct
Result:
result.show
// +---+---+
// |val|val|
// +---+---+
// |1.0|1.2|
// |1.2|1.3|
// |4.0|4.5|
// |1.0|1.3|
// +---+---+
One thing I have been advised to do is to add a bucket column such that any two possibly matching tuples must be either in the same bucket or in adjacent buckets. I can then equijoin the table with itself on the bucket columns and keep only those tuples from the result for which the condition actually holds. I'm not sure if it is a good solution and I have not yet been able to verify it.
/* max difference cannot span more than 2 buckets */
spark.sql("set max_diff=0.001")
var dec_count = 3
var bucket_size = scala.math.pow(10,-1 * dec_count)
var songs_buckets = songs.orderBy(col("artist_familiarity")).withColumn("bucket", round(col("artist_familiarity"), dec_count))
/*
tuples in adjacent buckets can have very close `artist_familiarity`.
add id to avoid duplicate pairs or tuples paired with themselves.
*/
songs_buckets = songs_buckets.withColumn("bucket2", $"bucket" - bucket_size).withColumn("id", monotonically_increasing_id())
songs_buckets.createOrReplaceTempView("songs_buckets")
var tmp = sql("SELECT s1.title as t1, s2.title as t2, s1.artist_familiarity as f1, s2.artist_familiarity as f2, s1.id as id1, s2.id as id2 FROM songs_buckets s1 JOIN songs_buckets s2 ON s1.bucket = s2.bucket OR s1.bucket = s2.bucket2")
tmp.createOrReplaceTempView("tmp")
var result = sql("SELECT t1, t2 FROM tmp WHERE id1 < id2 and f2 - f1 <= ${max_diff}")
result.show()
I haven't bothered to change variable names back to the example in the question. It displays the first 20 rows of the result after about 12 seconds. Not sure if this has something to do with lazy loading, because it won't display the count of the result, but it's the best thing I could make work.
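The bucket idea can be checked on the small example from the question with a plain-Python sketch. This is not Spark code; `close_pairs` is a hypothetical name, and the sketch assumes a bucket width equal to the maximum difference, so that two values within that distance always fall into the same bucket or into neighbouring ones. Equal values are skipped, mirroring the t1.val < t2.val condition used above.

```python
import math
from collections import defaultdict


def close_pairs(values, max_diff):
    """Return sorted pairs (x, y) with x < y and y - x <= max_diff."""
    # Assign every value to a bucket of width max_diff.
    buckets = defaultdict(list)
    for v in values:
        buckets[math.floor(v / max_diff)].append(v)

    pairs = []
    for b, members in buckets.items():
        # Candidates are the same bucket plus the next bucket up, so each
        # cross-bucket pair is examined only once (from the lower bucket).
        candidates = members + buckets.get(b + 1, [])
        for x in members:
            for y in candidates:
                if x < y and y - x <= max_diff:
                    pairs.append((x, y))
    return sorted(pairs)


result = close_pairs([1, 1.2, 1.3, 4, 4.5, 6], 0.5)
```

Only values sharing a bucket or sitting in adjacent buckets are ever compared, which is what turns the cartesian product into an equijoin in the Spark version. Floating-point rounding at bucket boundaries can still matter for values that differ by almost exactly the maximum difference.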

Explain the Differential Evolution method

Can someone please explain the Differential Evolution method? The Wikipedia definition is extremely technical.
A dumbed-down explanation followed by a simple example would be appreciated :)
Here's a simplified description. DE is an optimisation technique which iteratively modifies a population of candidate solutions to make it converge to an optimum of your function.
You first initialise your candidate solutions randomly. Then at each iteration and for each candidate solution x you do the following:
1. You produce a trial vector: v = a + ( b - c ) / 2, where a, b, c are three distinct candidate solutions picked randomly among your population.
2. You randomly swap vector components between x and v to produce v'. At least one component from v must be swapped.
3. You replace x in your population with v' only if it is a better candidate (i.e. it better optimises your function).
(Note that the above algorithm is very simplified; don't code from it, find proper spec. elsewhere instead)
Unfortunately the Wikipedia article lacks illustrations. It is easier to understand with a graphical representation, you'll find some in these slides: http://www-personal.une.edu.au/~jvanderw/DE_1.pdf .
DE is similar to a genetic algorithm (GA), except that the candidate solutions are not treated as binary strings (chromosomes) but (usually) as real vectors. One key aspect of DE is that the mutation step size (see step 1 for the mutation) is dynamic, that is, it adapts to the configuration of your population and will tend to zero when it converges. This makes DE less vulnerable to genetic drift than GA.
Answering my own question...
Overview
The principal difference between Genetic Algorithms and Differential Evolution (DE) is that Genetic Algorithms rely on crossover while evolutionary strategies use mutation as the primary search mechanism.
DE generates new candidates by adding a weighted difference between two population members to a third member (more on this below).
If the resulting candidate is superior to the candidate with which it was compared, it replaces it; otherwise, the original candidate remains unchanged.
Definitions
The population is made up of NP candidates.
Xi = A parent candidate at index i (indexes range from 0 to NP-1) from the current generation. Also known as the target vector.
Each candidate contains D parameters.
Xi(j) = The jth parameter in candidate Xi.
Xa, Xb, Xc = three random parent candidates.
Difference vector = (Xb - Xa)
F = A weight that determines the rate of the population's evolution.
Ideal values: [0.5, 1.0]
CR = The probability of crossover taking place.
Range: [0, 1]
Xc` = A mutant vector obtained through the differential mutation operation. Also known as the donor vector.
Xt = The child of Xi and Xc`. Also known as the trial vector.
Algorithm
For each candidate in the population
for (int i = 0; i<NP; ++i)
Choose three distinct parents at random (they must differ from each other and i)
do
{
    a = random.nextInt(NP);
} while (a == i);
do
{
    b = random.nextInt(NP);
} while (b == i || b == a);
do
{
    c = random.nextInt(NP);
} while (c == i || c == b || c == a);
(Mutation step) Add a weighted difference vector between two population members to a third member
Xc` = Xc + F * (Xb - Xa)
(Crossover step) For every variable in Xi, apply uniform crossover with probability CR to inherit from Xc`; otherwise, inherit from Xi. At least one variable must be inherited from Xc`
int R = random.nextInt(D);
for (int j=0; j < D; ++j)
{
    double probability = random.nextDouble();
    if (probability < CR || j == R)
        Xt[j] = Xc`[j]
    else
        Xt[j] = Xi[j]
}
(Selection step) If Xt is superior to Xi then Xt replaces Xi in the next generation. Otherwise, Xi is kept unmodified.
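The steps above can be put together in a short Python sketch. It is a minimal illustration under stated assumptions rather than a reference implementation: the names `differential_evolution`, `sphere`, `np_size`, `f` and `cr` are chosen here for illustration, the objective is minimised, the random seed is fixed, and candidates are not clipped back into the bounds after mutation.

```python
import random


def sphere(x):
    """Example objective: sum of squares, minimised at the origin."""
    return sum(v * v for v in x)


def differential_evolution(objective, bounds, np_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    rng = random.Random(seed)
    d = len(bounds)
    # Random initial population inside the given bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(np_size)]
    scores = [objective(x) for x in population]

    for _ in range(generations):
        next_population = list(population)
        next_scores = list(scores)
        for i in range(np_size):
            # Three distinct parents, all different from i.
            others = [k for k in range(np_size) if k != i]
            a, b, c = rng.sample(others, 3)
            xa, xb, xc = population[a], population[b], population[c]
            # Mutation: Xc` = Xc + F * (Xb - Xa).
            mutant = [xc[j] + f * (xb[j] - xa[j]) for j in range(d)]
            # Crossover: index r is always inherited from the mutant.
            r = rng.randrange(d)
            trial = [mutant[j] if (rng.random() < cr or j == r)
                     else population[i][j] for j in range(d)]
            # Selection: the trial replaces Xi only if it is better.
            trial_score = objective(trial)
            if trial_score < scores[i]:
                next_population[i] = trial
                next_scores[i] = trial_score
        population, scores = next_population, next_scores

    best = min(range(np_size), key=lambda k: scores[k])
    return population[best], scores[best]


best_x, best_score = differential_evolution(sphere, [(-100.0, 100.0)] * 3)
```

The next generation is built in a separate list so that parents are always drawn from the current generation, matching the "next generation" wording of the selection step.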
Resources
See this for an overview of the terminology
See Optimization Using Differential Evolution by Vasan Arunachalam for an explanation of the Differential Evolution algorithm
See Differential Evolution: A Survey of the State-of-the-Art by Swagatam Das and Ponnuthurai Nagaratnam Suganthan for different variants of the Differential Evolution algorithm
See Differential Evolution Optimization from Scratch with Python for a detailed description of an implementation of a DE algorithm in Python.
The working of the DE algorithm is very simple.
Suppose you need to optimize (minimize, for example) ∑Xi^2 (the sphere model) within a given range, say [-100,100]. We know that the minimum value is 0. Let's see how DE works.
DE is a population-based algorithm. Each individual in the population has a fixed number of chromosomes (imagine a set of human beings, each with chromosomes or genes).
Let me explain DE with respect to the above function.
We need to fix the population size and the number of chromosomes or genes (also called parameters). For instance, let's consider a population of size 4 in which each individual has 3 chromosomes (or genes or parameters). Let's call the individuals R1, R2, R3, R4.
Step 1 : Initialize the population
We need to randomly initialise the population within the range [-100,100]
        G1     G2    G3       objective fn value
R1 -> | -90  | 2   | 1    |   => 8105
R2 -> | 7    | 9   | -50  |   => 2630
R3 -> | 4    | 2   | -9.2 |   => 104.64
R4 -> | 8.5  | 7   | 9    |   => 202.25
The objective function value is calculated using the given objective function, in this case ∑Xi^2. So for R1 the value is (-90)^2 + 2^2 + 1^2 = 8105. The values for the other individuals are found in the same way.
Step 2 : Mutation
Fix a target vector, say R1, then randomly select three other vectors (individuals), say R2, R3 and R4, and perform the mutation as follows:
MutantVector = R2 + F(R3-R4)
(The vectors can be chosen randomly and need not be in any order.) F (the scaling factor or mutation constant), within the range [0,1], is one of the few control parameters DE has. In simple words, it describes how different the mutated vector becomes. Let's keep F = 0.5.
| 7 | 9 | -50 |
+
0.5 * (
| 4 | 2 | -9.2 |
-
| 8.5 | 7 | 9 |
)
Performing the mutation gives the following mutant vector:
MV = | 4.75 | 6.5 | -59.1 | => 3557.62
Step 3 : Crossover
Now that we have a target vector (R1) and a mutant vector MV formed from R2, R3 and R4, we need to do a crossover. Consider R1 and MV as two parents; we need a child from these two parents. Crossover determines how much information is taken from each parent and is controlled by the crossover rate (CR). Every gene/chromosome of the child is determined as follows:
a random number between 0 and 1 is generated; if it is greater than CR, the gene is inherited from the target (R1), otherwise from the mutant (MV).
Let's set CR = 0.9. Since individuals have 3 chromosomes, we need to generate 3 random numbers between 0 and 1. Say those numbers are 0.21, 0.97 and 0.8 respectively. The first and last are less than CR, so those positions in the child's vector are filled with values from MV, and the second position is filled with the gene taken from the target (R1).
Target-> |-90 | 2 | 1 | Mutant-> | 4.75 | 6.5 | -59.1 |
random num - 0.21, => `Child -> |4.75| -- | -- |`
random num - 0.97, => `Child -> |4.75| 2 | -- |`
random num - 0.80, => `Child -> |4.75| 2 | -59.1 |`
Trial vector/child vector -> | 4.75 | 2 | -59.1 | => 3519.37
Step 4 : Selection
Now we have the child and the target. Compare the objective function values of both and see which is smaller (this is a minimization problem). Select that individual out of the two for the next generation.
R1 -> |-90 | 2 | 1 | => 8105
Trial vector/child vector -> | 4.75 | 2 | -59.1 | => 3519.37
Clearly, the child is better, so replace the target (R1) with the child. The new population becomes:
        G1     G2    G3       objective fn value
R1 -> | 4.75 | 2   | -59.1 |  => 3519.37
R2 -> | 7    | 9   | -50   |  => 2630
R3 -> | 4    | 2   | -9.2  |  => 104.64
R4 -> | 8.5  | 7   | 9     |  => 202.25
This procedure is repeated, taking each individual in turn as the target, until either the desired number of generations has been reached or we obtain the desired value. Hope this will give you some help.

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 lines (and many columns, out of which 2 columns are relevant to this problem, I'll call it M1, M2). I have another data.frame where column V1 with about 10,000 elements is essential to this task. My task is this:
For each of the element in V1, find where does it occur in M2 and pull out the corresponding M1. I am able to do this using for-loop and it is terribly slow! I am used to Matlab and Perl and this is taking for EVER in R! Surely there's a better way. I would appreciate any valuable suggestions in accomplishing this task...
for (x in seq_along(V$V1)) {
  start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only 1 element that will match, and so I can use the logical statement to directly get the element in start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example by @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
         test replications elapsed relative
2     f_aix()        10000  12.907 7.068456
3   f_chase()        10000   2.010 1.100767
1 f_ramnath()        10000   1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
  V1 M1
1 -1 10
2 12  3
3  5 15
4  7  3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.

Efficiently find mapping

Assume that we have two data sets A, B that have m to n relationship.
A = { k1, k2, k3 .... kn}
B = { g1, g2, g3..........gn}
All the elements in both the sets are alphanumeric.
Now, tuples consisting of one element each from set A and set B are stored in a table T.
For example:
(k1, g2)
(k2, g4)
(k1, g3)
(k4, g2)
...
...
..
(kn, gm)
The challenge is to find out what 'm' elements in set A map to what 'n' elements in set B in the most efficient way.
For ex, let's say we have the below tuples,
(k1, g1)
(k1, g2)
(k3, g1)
(k3, g2)
(k5, g1)
(k5, g2)
the o/p I need is (k1, k3, k5) -> (g1, g2).
As the mapping is m to n, a simple select won't work. Please let me know if you need further clarification.
Since this information is already in a database, I would prefer to get to this with some SQL.
Help much appreciated.
Thanks in advance...
You can often solve problems like this by using an aggregate, and a group by clause.
For example, if your table name is T then:
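select T.item1, group_concat(T.item2) from T group by T.item1
This gives you which item2 values each item1 maps to (GROUP_CONCAT is the MySQL/SQLite spelling of the aggregate; PostgreSQL and SQL Server use STRING_AGG instead). Then do it again with item1 and item2 switched to find which item1 values each item2 maps to.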
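The aggregate-then-group idea can be sketched outside SQL as follows. This is a plain-Python illustration, not a query against the actual table; `group_mapping` is a hypothetical name, and the sketch assumes the goal is to group together all keys that map to exactly the same set of values.

```python
from collections import defaultdict


def group_mapping(pairs):
    """Group keys that map to exactly the same set of values."""
    # Step 1: for each key, collect the set of values it maps to
    # (the equivalent of the GROUP BY with a string aggregate).
    values_by_key = defaultdict(set)
    for k, g in pairs:
        values_by_key[k].add(g)

    # Step 2: group the keys by their (sorted) value set.
    keys_by_values = defaultdict(list)
    for k, gs in values_by_key.items():
        keys_by_values[tuple(sorted(gs))].append(k)

    return {tuple(sorted(ks)): gs for gs, ks in keys_by_values.items()}


pairs = [("k1", "g1"), ("k1", "g2"), ("k3", "g1"),
         ("k3", "g2"), ("k5", "g1"), ("k5", "g2")]
mapping = group_mapping(pairs)
```

Sorting the values before using them as a grouping key matters: in SQL the same effect needs an ordered aggregate, otherwise two keys with identical value sets could produce differently ordered strings and land in different groups.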