How to use fold and reduce with grouping in Kotlin

I know how to use the reduce and fold operations, but I don't understand how to use them with a map.
val numbers = listOf("one", "two", "three", "four", "five")
println(numbers.groupingBy { it.first() }.eachCount()) // Output:- {o=1, t=2, f=2}
groupingBy returns a Grouping, which eachCount() turns into a Map. So I need to figure out how to use fold and reduce with it.
Any example is fine. I just need to use reduce and fold with grouping in Kotlin.

I have never used this in a real project, but the code below should make it clear.
it % 5 gives the remainders 0, 1, 2, 3, 4,
i.e. 5 % 5 = 0; 6 % 5 = 1; 7 % 5 = 2; 8 % 5 = 3; 9 % 5 = 4; 10 % 5 = 0.
val numb = listOf(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
val nmap = numb.groupingBy { it % 5 }
println(nmap.eachCount())
println("map = ${nmap.reduce { key, accumulator, element ->
    println("$key ($accumulator,$element)")
    accumulator + element
}}")
Output
{1=4, 2=4, 3=4, 4=4, 0=4}
1 (1,6)
2 (2,7)
3 (3,8)
4 (4,9)
0 (5,10) ---> 5 + 10 = 15
1 (7,11)
2 (9,12)
3 (11,13)
4 (13,14)
0 (15,15) ---> 15 + 15 = 30
1 (18,16)
2 (21,17)
3 (24,18)
4 (27,19)
0 (30,20) ---> 30 + 20 = 50
map = {1=34, 2=38, 3=42, 4=46, 0=50}
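The question also asks about fold, which the example above does not show. As a rough sketch of the semantics only (written in Python, not Kotlin, and not how the Kotlin standard library implements it), the hypothetical helpers grouping_reduce and grouping_fold below mimic what Grouping.reduce and Grouping.fold compute: reduce seeds each key's accumulator with the first element of that group, while fold seeds it with an initial value supplied by the caller.

```python
def grouping_reduce(items, key_of, operation):
    """Per key: start from the first element, then combine the rest."""
    result = {}
    for element in items:
        key = key_of(element)
        if key in result:
            result[key] = operation(key, result[key], element)
        else:
            result[key] = element  # first element of the group is the seed
    return result


def grouping_fold(items, key_of, initial, operation):
    """Per key: start from `initial`, then combine every element."""
    result = {}
    for element in items:
        key = key_of(element)
        result[key] = operation(result.get(key, initial), element)
    return result


numb = list(range(1, 21))
sums = grouping_reduce(numb, lambda n: n % 5, lambda key, acc, e: acc + e)
print(sums)  # {1: 34, 2: 38, 3: 42, 4: 46, 0: 50}

words = ["one", "two", "three", "four", "five"]
# Total length of the words in each group, keyed by first letter.
lengths = grouping_fold(words, lambda w: w[0], 0, lambda acc, w: acc + len(w))
print(lengths)  # {'o': 3, 't': 8, 'f': 8}
```

Because fold takes its own initial value, the accumulator may have a different type from the elements (here an Int total built from Strings), which reduce does not allow.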

Related

Obtain corresponding column based on another column that matches another dataframe

I want to find matching values from two data frames and return a third value.
For example, if cpg_symbol["Gene_Symbol"] corresponds with diff_meth_kirp_symbol.index, I want to assign cpg_symbol.loc["Composite_Element_REF"] as index.
My code returned an empty dataframe.
diff_meth_kirp.index = diff_meth_kirp.merge(cpg_symbol, left_on=diff_meth_kirp.index, right_on="Gene_Symbol")[["Composite_Element_REF"]]
Example:
diff_meth_kirp
        Hello  My  name  is
First       0   1     2   3
Second      4   5     6   7
Third       8   9    10  11
Fourth     12  13    14  15
Fifth      16  17    18  19
Sixth      20  21    22  23
cpg_symbol
Composite_Element_REF  Gene_Symbol
cg1                    First
cg2                    Third
cg3                    Fifth
cg4                    Seventh
cg5                    Ninth
cg6                    First
Expected output:
     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Your code works well for me but you can try this version:
out = (diff_meth_kirp.merge(cpg_symbol.set_index('Gene_Symbol'),
                            left_index=True, right_index=True)
                     .set_index('Composite_Element_REF')
                     .rename_axis(None).sort_index())
print(out)
# Output
     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Input dataframes:
import pandas as pd

data1 = {'Hello': {'First': 0, 'Second': 4, 'Third': 8, 'Fourth': 12, 'Fifth': 16, 'Sixth': 20},
         'My': {'First': 1, 'Second': 5, 'Third': 9, 'Fourth': 13, 'Fifth': 17, 'Sixth': 21},
         'name': {'First': 2, 'Second': 6, 'Third': 10, 'Fourth': 14, 'Fifth': 18, 'Sixth': 22},
         'is': {'First': 3, 'Second': 7, 'Third': 11, 'Fourth': 15, 'Fifth': 19, 'Sixth': 23}}
diff_meth_kirp = pd.DataFrame(data1)
data2 = {'Composite_Element_REF': {0: 'cg1', 1: 'cg2', 2: 'cg3', 3: 'cg4', 4: 'cg5', 5: 'cg6'},
         'Gene_Symbol': {0: 'First', 1: 'Third', 2: 'Fifth', 3: 'Seventh', 4: 'Ninth', 5: 'First'}}
cpg_symbol = pd.DataFrame(data2)
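To see what the inner merge is doing conceptually, here is a small pandas-free sketch using plain dictionaries. It is only an illustration of the matching logic under the example data above, not a replacement for merge: each (Composite_Element_REF, Gene_Symbol) pair is kept only when the gene symbol exists as a row label, and the row is then re-keyed by the cg identifier.

```python
# Rows of diff_meth_kirp keyed by index label; values are the
# Hello, My, name, is columns copied from the example.
diff_meth_kirp = {
    "First": [0, 1, 2, 3],
    "Second": [4, 5, 6, 7],
    "Third": [8, 9, 10, 11],
    "Fourth": [12, 13, 14, 15],
    "Fifth": [16, 17, 18, 19],
    "Sixth": [20, 21, 22, 23],
}

# (Composite_Element_REF, Gene_Symbol) pairs from cpg_symbol.
cpg_symbol = [
    ("cg1", "First"), ("cg2", "Third"), ("cg3", "Fifth"),
    ("cg4", "Seventh"), ("cg5", "Ninth"), ("cg6", "First"),
]

# Inner join: drop pairs whose gene symbol has no matching row,
# and allow one row to be reused by several cg identifiers.
out = {cpg: diff_meth_kirp[gene]
       for cpg, gene in cpg_symbol
       if gene in diff_meth_kirp}

for cpg in sorted(out):
    print(cpg, out[cpg])
```

cg4 and cg5 disappear because "Seventh" and "Ninth" are not row labels, and "First" is matched twice (cg1 and cg6), which is the same shape as the expected output.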

Count variable in Python function

def prime_numbers(number_list):
    prime_list = []
    for number in number_list:
        # Count how many values in 1..number-1 divide the number evenly.
        count = 0
        for prime in range(1, number):
            if number % prime == 0:
                count += 1
        # Only 1 divides a prime, so the count is exactly 1.
        if count == 1:
            prime_list.append(number)
    print(prime_list)

prime_numbers([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 21, 19])
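count here counts the divisors of number in the range 1 to number - 1.
Every number is divisible by 1, so a final count of exactly 1 means no other divisor was found and the number is prime. A higher count means the number is divisible by something else and hence not a prime. For 1 the inner range is empty, so count stays 0 and 1 is correctly left out.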
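The counting approach works but always tests every candidate divisor. A common alternative, sketched below with hypothetical names (is_prime, primes_in), stops at the first divisor found and only tries candidates up to the integer square root. This is a different technique from the code above, shown only for comparison.

```python
import math


def is_prime(number):
    """Trial division with early exit, testing divisors up to sqrt(number)."""
    if number < 2:
        return False
    for divisor in range(2, math.isqrt(number) + 1):
        if number % divisor == 0:
            return False  # found a divisor other than 1 and the number itself
    return True


def primes_in(number_list):
    """Keep the primes of number_list in their original order."""
    return [number for number in number_list if is_prime(number)]


primes = primes_in([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 21, 19])
print(primes)  # [2, 3, 5, 7, 11, 13, 19]
```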

Filter a range in kotlin

In Kotlin I want to filter a range of Int into odd and even numbers as an example, so I made a list from the range 1..50
val angka = listOf(1..50)
followed by applying filter for even and filterNot for odd
val genap = angka.filter { it % 2 == 0}
val ganjil = angka.filterNot { it % 2 == 0 }
then printing both the even/odd lists
println("Genap = $genap")
println("Ganjil = $ganjil")
I do not see any problem with the code, but the compiler reports the error below
Unresolved reference. None of the following candidates is applicable because of receiver type mismatch:
public inline operator fun BigDecimal.mod(other: BigDecimal): BigDecimal defined in kotlin
This is creating a List<IntRange> with a single element:
val angka = listOf(1..50)
You should instead directly filter the range:
val angka = 1..50
The rest of the code is correct.
If you are a beginner with Kotlin, either specify the types of values explicitly or turn on local variable type hints in your IDE.
That way you would have noticed the problem: your list angka is not a List<Int> but a List<IntRange>.
This means you are not evaluating Int % 2 == 0 but IntRange % 2 == 0.
If you want to get a list from a range, you need to do (x..y).toList(). So your code will be:
val angka = (1..50).toList() //or since you are not using this list anywhere else, just leave it as `1..50` and the filter will work fine on the IntRange.
val genap = angka.filter { it % 2 == 0 }
val ganjil = angka.filterNot { it % 2 == 0 }
println("Genap = $genap")
println("Ganjil = $ganjil")
With output:
Genap = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
Ganjil = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49]
Your declaration of the numbers is wrong. It is equivalent to val angka: List<IntRange> = listOf(1..50), so you created a list of ranges containing a single range.
This should work:
val angka = 1..50
val genap = angka.filter { it % 2 == 0}
val ganjil = angka.filterNot { it % 2 == 0 }
println("Genap = $genap")
println("Ganjil = $ganjil")
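The same mistake can be reproduced outside Kotlin. As an analogy only (Python, not Kotlin), wrapping a range in a list literal gives a one-element list whose single element is the range, and applying % to that element fails for the same reason the Kotlin compiler rejects IntRange % 2.

```python
wrapped = [range(1, 51)]   # analogous to listOf(1..50): one element, a range
numbers = range(1, 51)     # analogous to 1..50: fifty Int values

print(len(wrapped))        # 1
print(len(numbers))        # 50

try:
    [item for item in wrapped if item % 2 == 0]
except TypeError as error:
    # The element is a range, and a range has no % operator.
    print("cannot filter the wrapped range:", error)

genap = [n for n in numbers if n % 2 == 0]    # even numbers
ganjil = [n for n in numbers if n % 2 != 0]   # odd numbers
print("Genap =", genap)
print("Ganjil =", ganjil)
```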

Timeseries: Groupby and calculate variance

I have the following dataframe with timeseries data:
df = pd.DataFrame(columns = ['id', 'value'])
df['value'] =[9, 16, 10, 12, 11, 14]
df['id'] = [1, 1, 1, 2, 2, 2]
For each timeseries (defined by column 'id') I want to calculate the variance, to find timeseries that do not change at all or change only very little.
The final dataframe should look like this:
df_end = pd.DataFrame(columns = ['id','value', 'var'])
df_end['value'] =[9, 16, 10, 12, 11, 14]
df_end['id'] = [1, 1, 1, 2, 2, 2]
df_end['var'] = [21, 21, 21, 2.3, 2.3, 2.3]
I tried:
df.groupby(df['id']).var()
which gives me the values, but I couldn't put it into the df in the right form. I am sure, there is a handy function for this that I don't know about yet!
Thanks for helping out!
Use GroupBy.transform on the value column, which returns one result per original row:
df['var'] = df.groupby('id')['value'].transform('var')
print(df)
   id  value        var
0   1      9  14.333333
1   1     16  14.333333
2   1     10  14.333333
3   2     12   2.333333
4   2     11   2.333333
5   2     14   2.333333
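To make clear what transform('var') does, here is a standard-library sketch of the same two steps: compute one sample variance per id, then broadcast it back to every row of that id. It uses statistics.variance, which divides by n - 1 like the pandas default, and the variable names are only illustrative.

```python
from collections import defaultdict
from statistics import variance

ids = [1, 1, 1, 2, 2, 2]
values = [9, 16, 10, 12, 11, 14]

# Step 1: collect the values of each timeseries.
groups = defaultdict(list)
for series_id, value in zip(ids, values):
    groups[series_id].append(value)

# Step 2: one sample variance per group.
group_var = {series_id: variance(vals) for series_id, vals in groups.items()}

# Step 3: broadcast the group result back to the original row order.
var_column = [group_var[series_id] for series_id in ids]

for row in zip(ids, values, var_column):
    print(row)
```

The first group gives about 14.33 and the second about 2.33, matching the output above.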

Optimizing the Verhoeff Algorithm in R

I have written the following function to calculate a check digit in R.
verhoeffCheck <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algorithm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
}
In order to run on a vector of strings, sapply must be used. This is in part because of the use of strsplit, which returns a list of vectors. This does impact on the performance even for only moderately sized inputs.
How could this function be vectorized?
I am also aware that some performance is lost in having to create the tables in each iteration. Would storing these in a new environment be a better solution?
We begin by defining the lookup matrices. I've laid them out in a way
that should make them easier to check against a reference, e.g.
http://en.wikipedia.org/wiki/Verhoeff_algorithm.
d5_mult <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 2, 3, 4, 0, 6, 7, 8, 9, 5,
2, 3, 4, 0, 1, 7, 8, 9, 5, 6,
3, 4, 0, 1, 2, 8, 9, 5, 6, 7,
4, 0, 1, 2, 3, 9, 5, 6, 7, 8,
5, 9, 8, 7, 6, 0, 4, 3, 2, 1,
6, 5, 9, 8, 7, 1, 0, 4, 3, 2,
7, 6, 5, 9, 8, 2, 1, 0, 4, 3,
8, 7, 6, 5, 9, 3, 2, 1, 0, 4,
9, 8, 7, 6, 5, 4, 3, 2, 1, 0
)), ncol = 10, byrow = TRUE)
d5_perm <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 5, 7, 6, 2, 8, 3, 0, 9, 4,
5, 8, 0, 3, 7, 9, 6, 1, 4, 2,
8, 9, 1, 6, 0, 4, 3, 5, 2, 7,
9, 4, 5, 3, 1, 2, 6, 8, 7, 0,
4, 2, 8, 6, 5, 7, 3, 9, 0, 1,
2, 7, 9, 3, 8, 0, 6, 4, 1, 5,
7, 0, 4, 6, 9, 1, 3, 2, 5, 8
)), ncol = 10, byrow = TRUE)
d5_inv <- as.integer(c(0, 4, 3, 2, 1, 5, 6, 7, 8, 9))
Next, we'll define the check function, and try it out with a test input.
I've followed the derivation on Wikipedia as closely as possible.
p <- function(i, n_i) {
  d5_perm[(i %% 8) + 1, n_i + 1] + 1
}
d <- function(c, p) {
  d5_mult[c + 1, p]
}
verhoeff <- function(x) {
  # split and convert to numbers
  digs <- strsplit(as.character(x), "")[[1]]
  digs <- as.numeric(digs)
  digs <- rev(digs) ## right to left algorithm
  ## apply algorithm - note 1-based indexing in R
  c <- 0
  for (i in 1:length(digs)) {
    c <- d(c, p(i, digs[i]))
  }
  d5_inv[c + 1]
}
verhoeff(142857)
## [1] 0
This function is fundamentally iterative, as each iteration depends on
the value of the previous. This means that we're unlikely to be able to
vectorise in R, so if we want to vectorise, we'll need to use Rcpp.
However, before we turn to that, it's worth exploring if we can do the
initial split faster. First we do a little microbenchmark to see if it's
worth bothering:
library(microbenchmark)
digits <- function(x) {
digs <- strsplit(as.character(x), "")[[1]]
digs <- as.numeric(digs)
rev(digs)
}
microbenchmark(
digits(142857),
verhoeff(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.30 12.01 12.43 12.85 28.79 100
## verhoeff(142857) 32.24 33.81 34.66 35.47 95.85 100
It looks like it! On my computer, digits() accounts for about a third
of the run time (a median of roughly 12 out of 35 microseconds). A
little searching on Stack Overflow reveals another approach to turning
a number into digits:
digits2 <- function(x) {
n <- floor(log10(x))
x %/% 10^(0:n) %% 10
}
digits2(12345)
## [1] 5 4 3 2 1
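The arithmetic idea behind digits2() is to peel off the last digit with a remainder and drop it with an integer division. A Python sketch of that idea follows (the name digits_reversed is made up for illustration). It uses a loop instead of log10, which sidesteps the case x = 0, where log10 is undefined. Like digits2(), it works on a number, so leading zeros of an identifier stored as a string are lost.

```python
def digits_reversed(x):
    """Return the decimal digits of a non-negative integer, last digit first."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return [0]
    digits = []
    while x > 0:
        x, last = divmod(x, 10)  # drop the last digit, keep it as `last`
        digits.append(last)
    return digits


print(digits_reversed(12345))   # [5, 4, 3, 2, 1]
print(digits_reversed(142857))  # [7, 5, 8, 2, 4, 1]
```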
microbenchmark(
digits(142857),
digits2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.495 12.102 12.468 12.834 79.60 100
## digits2(142857) 2.322 2.784 3.358 3.561 13.69 100
digits2() is a lot faster than digits() but it has limited impact on
the whole runtime.
verhoeff2 <- function(x) {
digs <- digits2(x)
c <- 0
for (i in 1:length(digs)) {
c <- d(c, p(i, digs[i]))
}
d5_inv[c + 1]
}
verhoeff2(142857)
## [1] 0
microbenchmark(
verhoeff(142857),
verhoeff2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff(142857) 33.06 34.49 35.19 35.92 73.38 100
## verhoeff2(142857) 20.98 22.58 24.05 25.28 48.69 100
To make it even faster we could try C++.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff3_c(IntegerVector digits, IntegerMatrix mult, IntegerMatrix perm,
                IntegerVector inv) {
  int n = digits.size();
  int c = 0;
  for (int i = 0; i < n; ++i) {
    // The R loop starts at i = 1, so shift the 0-based index by one
    // to select the same row of the permutation table.
    int p = perm((i + 1) % 8, digits[i]);
    c = mult(c, p);
  }
  return inv[c];
}
verhoeff3 <- function(x) {
  verhoeff3_c(digits(x), d5_mult, d5_perm, d5_inv)
}
verhoeff3(142857)
## [1] 0
microbenchmark(
verhoeff2(142857),
verhoeff3(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.00 22.85 25.53 27.11 63.71 100
## verhoeff3(142857) 16.75 17.99 18.87 19.64 79.54 100
That doesn't yield much of an improvement. Maybe we can do better if we
pass the number to C++ and process the digits in a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff4_c(int number, IntegerMatrix mult, IntegerMatrix perm,
                IntegerVector inv) {
  int c = 0;
  // i starts at 1 to match the 1-based position used in the R version;
  // each pass consumes the last decimal digit of number.
  for (int i = 1; number > 0; ++i, number /= 10) {
    int p = perm(i % 8, number % 10);
    c = mult(c, p);
  }
  return inv[c];
}
verhoeff4 <- function(x) {
  verhoeff4_c(x, d5_mult, d5_perm, d5_inv)
}
verhoeff4(142857)
## [1] 0
microbenchmark(
verhoeff2(142857),
verhoeff3(142857),
verhoeff4(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.808 24.910 26.838 27.797 64.22 100
## verhoeff3(142857) 17.699 18.742 19.599 20.764 81.67 100
## verhoeff4(142857) 3.143 3.797 4.095 4.396 13.21 100
And we get a pay off: verhoeff4() is about 5 times faster than
verhoeff2().
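As an independent cross-check of the expected check digit, here is a compact Python sketch of the algorithm using the same three tables. The function names are illustrative. When generating a check digit, the rightmost payload digit sits at position 1, because position 0 is reserved for the check digit itself. When validating, the check digit is included and positions start at 0.

```python
D5_MULT = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
D5_PERM = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
D5_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Check digit for a string of decimal digits (leading zeros preserved)."""
    c = 0
    # Rightmost payload digit is position 1; position 0 is the check digit.
    for i, ch in enumerate(reversed(payload), start=1):
        c = D5_MULT[c][D5_PERM[i % 8][int(ch)]]
    return D5_INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """True if the last digit of `number` is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D5_MULT[c][D5_PERM[i % 8][int(ch)]]
    return c == 0


print(verhoeff_check_digit("142857"))  # 0
print(verhoeff_is_valid("1428570"))    # True
print(verhoeff_is_valid("1428571"))    # False
```

Taking a string rather than a number keeps leading zeros, which is the concern raised in the original question.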
If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.
verhoeffCheckNew <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## check for string since leading zeros with numbers will be lost
if (!is.character(x)) stop("Must enter a string")
#split and convert to numbers
digs <- strsplit(x, "")
digs <- lapply(digs, function(x) rev(as.numeric(x)))
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algorithm - note 1-based indexing in R
sapply(digs, function(x)
{
d <- 0
for (i in 1:length(x)){
d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
}
d5_inv[d+1]
})
}
Since d depends on its value from the previous iteration, there is no easy way to vectorise the for loop.
My version runs in about half the time for 1e5 strings.
rand_string <- function(n = 12)
{
paste(sample(as.character(0:9), sample(n), replace = TRUE), collapse = "")
}
big_test <- replicate(1e5, rand_string())
tic()
res1 <- unname(sapply(big_test, verhoeffCheck))
toc()
tic()
res2 <- verhoeffCheckNew(big_test)
toc()
identical(res1, res2) #hopefully TRUE!
See this question for tic and toc.
Further thoughts:
You may want additional input checking for "" and other strings that return NA when converted to numeric.
Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)
Richie C answered the vectorisation question nicely. As for creating the tables only once without cluttering the global namespace, one quick solution that does not require a package is
verhoeffCheck <- local(function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## apply algorithm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
})
assign("d5_mult", matrix(c(
0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_perm", matrix(c(
0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_inv", c(0,4:1,5:9), envir = environment(verhoeffCheck))
## Now just use the function
which keeps the data in the environment of the function. You can time it to see how much faster it is.
Hope this helps.
Allan