Kotlin summing with groupingBy and aggregate

tl;dr: How do I use groupingBy and aggregate to sum a Sequence of (key, number) pairs into a Map of counts?
I have 30 GB of CSV files, which are a breeze to read and parse.
File("data").walk().filter { it.isFile }.flatMap { file ->
    println(file.toString())
    file.inputStream().bufferedReader().lineSequence()
}. // now I have lines
Each line is "key,extraStuff,matchCount"
.map { line ->
    val (key, stuff, matchCount) = line.split(",")
    Triple(key, stuff, matchCount.toInt())
}.
and I can filter on the "stuff" which is good because lots gets dropped -- yay lazy Sequences. (code omitted)
But then I need a lazy way to get a final Map<String, Int> of key to count.
I think I should be using groupingBy and aggregate, because eachCount() would just count rows, not sum up matchCount, and groupingBy is lazy whereas groupBy isn't, but we have reached the end of my knowledge.
.groupingBy { (key, _, _) ->
    key
}.aggregate { (key, _, matchCount) ->
    ??? something with matchCount ???
}

You can use the Grouping.fold extension instead of Grouping.aggregate; it is more suitable for summing grouped entries by a particular property:
triples
    .groupingBy { (key, _, _) -> key }
    .fold(0) { acc, (_, _, matchCount) -> acc + matchCount }
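For reference, a minimal end-to-end sketch under the question's assumptions (directory layout and "key,extraStuff,matchCount" line format as described above; the filter on stuff is omitted, and, as in the question, the readers are never explicitly closed):

import java.io.File

// Lazily walks "data", parses "key,extraStuff,matchCount" lines and sums matchCount per key.
fun sumMatchCounts(root: File): Map<String, Int> =
    root.walk()
        .filter { it.isFile }
        .flatMap { file -> file.bufferedReader().lineSequence() }
        .map { line ->
            val (key, stuff, matchCount) = line.split(",")
            Triple(key, stuff, matchCount.toInt())
        }
        .groupingBy { (key, _, _) -> key }
        .fold(0) { acc, (_, _, matchCount) -> acc + matchCount }

fun main() {
    println(sumMatchCounts(File("data")))
}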

You need to pass a function with four parameters to aggregate:
From the documentation of aggregate, the operation function is invoked on each element with the following parameters:
key: the key of the group this element belongs to;
accumulator: the current value of the accumulator of the group, can be null if it's the first element encountered in the group;
element: the element from the source being aggregated;
first: indicates whether it's the first element encountered in the group.
Of them, you need accumulator and element (which you can destructure). The code would be:
.groupingBy { (key, _, _) -> key }
.aggregate { _, acc: Int?, (_, _, matchCount), _ ->
    (acc ?: 0) + matchCount
}
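To see the shape of the result, a tiny self-contained run with made-up triples (the sample values are illustrative only):

fun main() {
    val triples = sequenceOf(
        Triple("a", "x", 2),
        Triple("b", "y", 5),
        Triple("a", "z", 3)
    )
    val sums = triples
        .groupingBy { (key, _, _) -> key }
        .aggregate { _, acc: Int?, (_, _, matchCount), _ -> (acc ?: 0) + matchCount }
    println(sums) // prints {a=5, b=5}
}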

I ran into a similar problem today in one of Kotlin's tutorials and finally solved it. The code is as follows:
(Kotlin sample: https://play.kotlinlang.org/hands-on/Introduction%20to%20Coroutines%20and%20Channels/02_BlockingRequest)
/*
In the initial list each user is present several times, once for each
repository he or she contributed to.
Merge duplications: each user should be present only once in the resulting list
with the total value of contributions for all the repositories.
Users should be sorted in a descending order by their contributions.
The corresponding test can be found in test/tasks/AggregationKtTest.kt.
You can use the 'Navigate | Test' menu action (note the shortcut) to navigate to the test.
*/
fun List<User>.aggregate(): List<User> = this.groupBy { it.login }
    .map { (login, users) -> User(login, users.sumOf { it.contributions }) }
    .sortedByDescending { it.contributions }

// or
fun List<User>.aggregate2(): List<User> = this.groupingBy { it.login }
    .aggregate<User, String, User> { key, accumulator, element, _ ->
        User(key, (accumulator?.contributions ?: 0) + element.contributions)
    }
    .map { (_, v) -> v }
    .sortedByDescending { it.contributions }
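For completeness, a self-contained sketch assuming the tutorial's User(login, contributions) data class; both variants should produce the same result:

data class User(val login: String, val contributions: Int)

fun main() {
    val users = listOf(User("alice", 5), User("bob", 3), User("alice", 2))
    // Both print [User(login=alice, contributions=7), User(login=bob, contributions=3)]
    println(users.aggregate())
    println(users.aggregate2())
}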

Related

Kafka streams - Fetching 4 values from input topic, passing to a function, then sending to output topic aggregating based on key

In this simple version, I was just taking one input value and aggregating that. Instead, I would like to take 4 values received from the input topic, pass these into a function, and then use the value returned to be aggregated in the output topic based on the key.
Essentially keeping a running total of the values for each key. The 4 inputs are just Doubles that are used to calculate a single final figure which is added to the running total.
I'm not quite sure of the syntax for it, does anyone have any ideas?
private val streamsBuilder = StreamsBuilder()
private val trips = streamsBuilder.stream<String, String>(inputTopic)
private val distance_travelled = trips
    .peek { _, trip -> logger.info(trip) }
    .mapValues { _, trip -> objectMapper.readValue(trip, TripClass::class.java) }
    .groupByKey()
    .aggregate(
        { 0L },
        { _, trip, total -> total + trip.amount },
        Materialized.with(Serdes.String(), Serdes.Long())
    )
Calling the function here does sort of work, but .aggregate complains if I try to change the data type from Long to Double.
.aggregate(
    { 0L },
    { _, trip, total -> total + exampleFunction(trip) },
    Materialized.with(Serdes.String(), Serdes.Long())
)
EDIT
Clarification on the error produced by changing the data type from Serdes.Long to Serdes.Double:
Type inference failed: Cannot infer type parameter VR in
fun <VR : Any!> aggregate
(
initializer: (() → VR!)!,
aggregator: ((key: String!, value: Transaction!, aggregate: VR!) → VR!)!,
materialized: Materialized<String!, VR!, KeyValueStore<Bytes!, ByteArray!>!>!
)
: KTable<String!, VR!>!
None of the following substitutions
(
(() → Double!)!,
((String!, Transaction!, Double!) → Double!)!,
Materialized<String!, Double!, KeyValueStore<Bytes!, ByteArray!>!>!
)
(
(() → Long!)!,
((String!, Transaction!, Long!) → Long!)!,
Materialized<String!, Long!, KeyValueStore<Bytes!, ByteArray!>!>!
)
can be applied to
(
() → Long,
(String!, Transaction!, Double) → Double,
Materialized<String!, Double!, KeyValueStore<Bytes!, ByteArray!>!>!
)
Update
distance.aggregate(
    { 0.0 }, // 0.0 for a Double initializer, 0L for Long
    { _, trip, balance: Double -> balance + eh(trip) },
    { _, leftAggValue, rightAggValue -> leftAggValue + rightAggValue },
    Materialized.`as`<String, Double, SessionStore<Bytes, ByteArray>>("stateStoreName")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.Double())
        .withLoggingDisabled()
)
The above produces no error itself but causes an error on this line.
distance.toStream().to(outputTopic, Produced.with(Serdes.String(), Serdes.Double()))
There's no error without the window; the change of type causes some sort of issue. Any ideas?
The error:
None of the following functions can be called with the arguments supplied.
to(((key: Windowed<String!>!, value: Double!, recordContext: RecordContext!) → String!)!, Produced<Windowed<String!>!, Double!>!)
defined in org.apache.kafka.streams.kstream.KStream
to(String!, Produced<Windowed<String!>!, Double!>!)
defined in org.apache.kafka.streams.kstream.KStream
to(TopicNameExtractor<Windowed<String!>!, Double!>!, Produced<Windowed<String!>!, Double!>!)
defined in org.apache.kafka.streams.kstream.KStream
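The error indicates that after the windowed aggregation the stream's key type is Windowed<String>, not String, so Produced.with(Serdes.String(), Serdes.Double()) no longer matches. A minimal sketch of one way to write the result out, assuming the plain key is what the output topic should carry, is to unwrap the windowed key first:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.Produced

// Sketch only: the aggregate's result is keyed by Windowed<String>,
// so unwrap the window before producing plain String keys.
distance.toStream()
    .selectKey { windowedKey, _ -> windowedKey.key() }
    .to(outputTopic, Produced.with(Serdes.String(), Serdes.Double()))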

Error when trying to convert a list of objects into a string using the reduce function

I am playing with the Kotlin language and I tried the following code:
data class D( val n: Int, val s: String )
val lst = listOf( D(1,"one"), D(2, "two" ) )
val res = lst.reduce { acc:String, d:D -> acc + ", " + d.toString() }
The last statement causes the following errors:
Expected parameter of type String
Expected parameter of type String
Type mismatch: inferred type is D but String was expected
while this version of the last statement works:
val res = lst.map { e -> e.toString() }.reduce { acc, el -> acc + ", " + el }
I do not understand why the first version does not work. The formal definition of the reduce function, found here, is the following:
inline fun <S, T : S> Iterable<T>.reduce(
operation: (acc: S, T) -> S
): S
But this seems in contrast with the following sentence, on the same page:
Accumulates value starting with the first element and applying
operation from left to right to current accumulator value and each
element.
That is, as explained here:
The difference between the two functions is that fold() takes an
initial value and uses it as the accumulated value on the first step,
whereas the first step of reduce() uses the first and the second
elements as operation arguments on the first step.
But, to be able to apply the operation on the first and second element, and so on, it seems to me that the operation should have both arguments of the base type of the Iterable.
So, what am I missing?
Reduce is not the right tool here. The best function in this case is joinToString:
listOf(D(1, "one"), D(2, "two"))
.joinToString(", ")
.let { println(it) }
This prints:
D(n=1, s=one), D(n=2, s=two)
reduce is not designed for converting types, it's designed for reducing a collection of elements to a single element of the same type. You don't want to reduce to a single D, you want a string. You could try implementing it with fold, which is like reduce but takes an initial element you want to fold into:
listOf(D(1, "one"), D(2, "two"))
.fold("") { acc, d -> "$acc, $d" }
.let { println(it) }
However, this will add an extra comma:
, D(n=1, s=one), D(n=2, s=two)
Which is exactly why joinToString exists.
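If each element needs custom rendering, joinToString also accepts a transform lambda; a small self-contained sketch:

data class D(val n: Int, val s: String)

fun main() {
    val lst = listOf(D(1, "one"), D(2, "two"))
    // The trailing lambda controls how each element is turned into a string.
    println(lst.joinToString(", ") { it.s }) // prints: one, two
}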
You can see the definition to understand why it's not working.
To make it work, you can simply create an extension function:
fun List<D>.reduce(operation: (acc: String, D) -> String): String {
    if (isEmpty())
        throw UnsupportedOperationException("Empty list can't be reduced.")
    var accumulator = this[0].toString()
    for (index in 1..lastIndex) {
        accumulator = operation(accumulator, this[index])
    }
    return accumulator
}
you can use it as:
val res = lst.reduce { acc:String, d:D -> acc + ", " + d.toString() }
or simply:
val res = lst.reduce { acc, d -> "$acc, $d" }
You can modify the function to be more generic if you want to.
TL;DR
The acc: String in your code is already invalid in this line:
val res = lst.reduce { acc:String, d:D -> acc + ", " + d.toString() }
Because acc can only be a D, never a String! reduce returns the same type as the Iterable it is performed on, and lst is Iterable<D>.
Explanation
You already looked up the definition of reduce
inline fun <S, T : S> Iterable<T>.reduce(
operation: (acc: S, T) -> S
): S
so let's try to plug your code in:
lst is of type List<D>
since List extends Iterable, we can write lst : Iterable<D>
reduce will look like this now:
inline fun <D, T : D> Iterable<T>.reduce(
    operation: (acc: D, T) -> D // String is literally impossible here, because D is not a String
): D
and written out:
lst.reduce { acc: D, d: D -> ... }
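For contrast, a reduce on List<D> that does type-check has to return a D at every step, for example (a small illustrative sketch, not what the question actually needs):

data class D(val n: Int, val s: String)

fun main() {
    val lst = listOf(D(1, "one"), D(2, "two"))
    // The accumulator is a D, so every step must produce another D.
    val merged = lst.reduce { acc, d -> D(acc.n + d.n, acc.s + ", " + d.s) }
    println(merged) // D(n=3, s=one, two)
}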

Generating unique random values with SecureRandom

I'm currently implementing a secret sharing scheme (Shamir).
In order to generate some secret shares, I need to generate some random numbers within a range. For this purpose, I have this very simple code:
val sharesPRG = SecureRandom()

fun generateShares(k: Int): IntArray {
    val xs = IntArray(k) { sharesPRG.nextInt(5) }
    return xs
}
I have left out the part that actually creates the shares as coordinates, just to make it reproducible, and picked an arbitrarily small bound of 5.
My problem is that I of course need these shares to be unique; it doesn't make sense to have shares that are the same.
So would it be possible for SecureRandom.nextInt to not return a value that it has already returned?
Of course I could do some logic where I check for duplicates, but I really thought there should be something more elegant.
If your k is not too large you can keep adding random values to a set until it reaches size k:
fun generateMaterial(k: Int): Set<Int> = mutableSetOf<Int>().also {
    while (it.size < k) {
        it.add(sharesPRG.nextInt(50))
    }
}
You can then use the set as the source material to your list of pairs (k needs to be even):
fun main() {
    val pairList = generateMaterial(10).windowed(2, 2).map { it[0] to it[1] }
    println(pairList)
}
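If duplicates must be impossible by construction and the bound is small, another option (an alternative sketch, not part of the answer above) is to shuffle the whole range with the SecureRandom and take the first k values:

import java.security.SecureRandom
import kotlin.random.asKotlinRandom

val sharesPRG = SecureRandom()

// Shuffling 0 until bound guarantees k distinct values; only practical for small bounds.
fun generateDistinct(k: Int, bound: Int): List<Int> =
    (0 until bound).shuffled(sharesPRG.asKotlinRandom()).take(k)

fun main() {
    println(generateDistinct(10, 50))
}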

Simplifying the predicate when checking for several known values

Kotlin often uses very pragmatic approaches. I wonder whether there is one I don't know of that simplifies a filter predicate which just checks for some known values.
E.g. consider the following list:
val list = listOf("one", "two", "2", "three")
Filtering out "two" and "2" can be accomplished in several ways, e.g.:
list.filter {
    it in listOf("two", "2") // but that creates a new list every time... (didn't check though)
}
// extracting the list first, uses more code... and may hide the list somewhere sooner or later
val toCheck = listOf("two", "2")
list.filter { it in toCheck }
// similar, but probably less readable due to naming ;-)
list.filter(toCheck::contains)
// alternative using when, but that's not easier for this specific case and definitely longer:
list.filter {
    when (it) {
        "two", "2" -> true
        else -> false
    }
}
// probably one of the simplest... but not so nice if we need to check more than 2 values
list.filter { it == "two" || it == "2" }
I wonder... is there something like list.filter { it in ("two", "2") } or any other simple way to create/use a short predicate for known values/constants? In the end that's all I wanted to check.
EDIT: I just realised that the sample doesn't make much sense, as listOf("anything", "some", "other").filter { it in listOf("anything") } will always just be listOf("anything"). The list intersection makes more sense where the filter doesn't simply return the filtered values themselves, e.g. when dealing with a Map (.filterKeys). The subtraction (i.e. list.filterNot { it in listOf("two", "2") }) also makes sense for plain lists.
Kotlin provides some set operations on collections, which are:
intersect (what both collections have in common)
union (combine both collections)
subtract (collections without elements of the other)
In your case, instead of filter, you may use the set operation subtract
val filteredList = list.subtract(setOf("two","2"))
and there you go.
EDIT:
And the fun (pun intended) doesn't end there: you could extend the collections with your own functions, such as a missing outerJoin, filtering helpers like without, or operators, e.g. / for intersect.
For example, by adding these
infix fun <T> Iterable<T>.without(other: Iterable<T>) = this.subtract(other)
infix fun <T> Iterable<T>.excluding(other: Iterable<T>) = this.subtract(other)
operator fun <T> Iterable<T>.div(other: Iterable<T>) = this.intersect(other)
Your code - when applied to your example using the intersect - would become
val filtered = list / filter // instead of: list intersect filter (where "filter" is the collection of values to keep)
or - instead of subtract:
val filtered = list without setOf("two", "2")
or
val filtered = list excluding setOf("two", "2")
Pragmatic enough?
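Put together, a small self-contained demo of two of those extensions on the question's list (a sketch; note that subtract and intersect both return a Set):

infix fun <T> Iterable<T>.without(other: Iterable<T>) = this.subtract(other)
operator fun <T> Iterable<T>.div(other: Iterable<T>) = this.intersect(other)

fun main() {
    val list = listOf("one", "two", "2", "three")
    println(list without setOf("two", "2")) // [one, three]
    println(list / setOf("two", "2"))       // [two, 2]
}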
I ended up with the following now:
fun <E> containedIn(vararg elements: E) = { e:E -> e in elements }
fun <E> notContainedIn(vararg elements: E) = { e:E -> e !in elements }
which can be used for maps & lists using filter, e.g.:
list.filter(containedIn("two", "2"))
list.filter(notContainedIn("two", "2"))
map.filterKeys(containedIn("two", "2"))
map.filterValues(notContainedIn("whatever"))
In fact it can be used for anything (if you like):
if (containedIn(1, 2, 3)(string.toInt())) { ... }
My first approach was inspired by Gerald Mücke's answer, but uses minus instead of subtract (so it only covers the subtraction part):
(list - setOf("two", "2"))
.forEach ...
Or with own extension functions and using vararg:
fun <T> Iterable<T>.without(vararg other: T) = this - other
with the following usage:
list.without("two", "2")
.forEach... // or whatever...
With the vararg variant, however, infix is not possible. For a single exclusion an infix function can be supplied as well; otherwise the Iterable overload must be implemented:
infix fun <T> Iterable<T>.without(other : T) = this - other
infix fun <T> Iterable<T>.without(other : Iterable<T>) = this - other
Usages:
list without "two"
list without listOf("two", "2")
I don't think there is anything simpler than to create the filtering list/set and then apply it:
val toCheck = listOf("two", "2")
val filtered = list.filter { it in toCheck }
or
val toCheck = setOf("two", "2")
val filtered = list.filter { it in toCheck }
but if you prefer you can create a Predicate:
val predicate: (String) -> Boolean = { it in listOf("2", "two") }
val filtered = list.filter { predicate(it) }
Edit: as for the approach with minus (mentioned above, though not what was asked here), it does not provide simplicity or efficiency, since it itself uses filter:
/**
 * Returns a list containing all elements of the original collection except the elements contained in the given [elements] collection.
 */
public operator fun <T> Iterable<T>.minus(elements: Iterable<T>): List<T> {
    val other = elements.convertToSetForSetOperationWith(this)
    if (other.isEmpty())
        return this.toList()
    return this.filterNot { it in other }
}
(from Collections.kt)

How to calculate totals for each row in a table (rows*columns) structure in Kotlin?

I have a (simplified) table structure that is defined like this:
data class Column<T>(val name: String, val value: T)
data class Row(val data: List<Column<*>>)
data class Grid(val rows: List<Row>)
I now want to calculate the totals for each column in that grid, i.e. the ith element of each row needs to be accumulated.
My solution looks like this. I simply flatMap the data and group the column values by the column's name, which I then fold to the corresponding sums.
private fun calculateTotals(data: Grid) = data.rows
    .flatMap(Row::data)
    .groupingBy(Column<*>::name)
    .fold(0.0) { accumulator, (_, value) ->
        accumulator + when (value) {
            is Number -> value.toDouble()
            else -> 0.0
        }
    }
I could not come up with a better solution.

I think yours is really good, but I would suggest some syntactic improvements:
1. Use callable references.
2. Use destructuring syntax.
3. Don't use when if you only test for one specific type; use the safe cast operator (as?), the safe call operator (?.) and the elvis operator (?:).
private fun calculateTotals(data: Grid) = data.rows
    .flatMap(Row::data) // 1
    .groupingBy(Column<*>::name) // 1
    .fold(0.0) { accumulator, (_, value) -> // 2
        accumulator + ((value as? Number)?.toDouble() ?: 0.0) // 3
    }
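A quick hypothetical run of calculateTotals (data made up here) to show the shape of the result:

fun main() {
    val grid = Grid(
        listOf(
            Row(listOf(Column("a", 1.0), Column("b", 2))),
            Row(listOf(Column("a", 3.5), Column("b", "n/a")))
        )
    )
    // Non-numeric values fall back to 0.0, so this prints {a=4.5, b=2.0}
    println(calculateTotals(grid))
}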