How to use Lucene's DistinctValuesCollector? - kotlin

My objective is to collect the distinct values of selected fields and provide them as filter options for the frontend. DistinctValuesCollector seems to be the right tool for this, but I haven't found any code samples or documentation apart from the Javadocs, so I can't work out how to construct this collector correctly. Can anyone provide an example?
This is my attempt, which doesn't deliver the desired distinct values of the field PROJEKTSTATUS.name:
val groupSelector = TermGroupSelector(PROJEKTSTATUS.name)
val searchGroup = SearchGroup<BytesRef>()
val valueSelector = TermGroupSelector(PROJEKTSTATUS.name)
val groups = mutableListOf(searchGroup)
val distinctValuesCollector = DistinctValuesCollector(groupSelector, groups, valueSelector)
That field is indexed as follows:
document.add(TextField(PROJEKTSTATUS.name, aggregat.projektstatus, YES))
document.add(SortedDocValuesField(PROJEKTSTATUS.name, BytesRef(aggregat.projektstatus)))

Thanks to @andrewJames's hint pointing to a test class, I figured it out:
fun IndexSearcher.collectFilterOptions(
    query: Query,
    field: String,
    topNGroups: Int = 128,
    mapper: Function<String?, String?> = Function { it }
): Set<String?> {
    val firstPassGroupingCollector = FirstPassGroupingCollector(TermGroupSelector(field), Sort(), topNGroups)
    search(query, firstPassGroupingCollector)
    val topGroups = firstPassGroupingCollector.getTopGroups(0)
    val groupSelector = firstPassGroupingCollector.groupSelector
    val distinctValuesCollector = DistinctValuesCollector(groupSelector, topGroups, groupSelector)
    search(query, distinctValuesCollector)
    return distinctValuesCollector.groups.map { mapper.apply(it.groupValue.utf8ToString()) }.toSet()
}


Optimization query for DataFrame Spark

I am trying to create a DataFrame from a Hive table, but I am not very experienced with the Spark API.
I need help optimizing the query in the getLastSession method so that Spark does the work in one task instead of two:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if(onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)
def getLastSession: Dataset[Row] = {
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  */
  dfByLastSession
}
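PS. My source table (for example):
+-------------+----------+-----------+------------------+-------+
| name_process|id_session| time_write|               key|  value|
+-------------+----------+-----------+------------------+-------+
|   OtherClass|jsdfsadfsf|43434883477|schema0.table0.csv|Success|
|   OtherClass|jksdfkjhka|23212123323|schema1.table1.csv|Success|
|   OtherClass|alskdfksjd|23343212234|schema2.table2.csv|Failure|
|ExternalClass|sdfjkhsdfd|34455453434|schema3.table3.csv|Success|
+-------------+----------+-----------+------------------+-------+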
You can use row_number with Window like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}
val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn=1").drop("rn")
dfByLastSession.show()
However, because the window is not partitioned by any field, Spark has to move all rows into a single partition to compute it, which can degrade performance.
Another option is to use struct ordering to get the id_session associated with the most recent time_write in a single query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)
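The idea behind the struct trick is that (time_write, id_session) tuples are compared field by field, so the maximum tuple carries the id_session of the newest write. A minimal plain-Kotlin sketch of that ordering, outside Spark, might look like this; SessionRow and lastSessionRows are hypothetical names used only for illustration:

```kotlin
// Hypothetical stand-in for one row of the source table (not a Spark type).
data class SessionRow(val idSession: String, val timeWrite: Long, val key: String, val value: String)

// Returns every row that belongs to the session with the newest timeWrite.
fun lastSessionRows(rows: List<SessionRow>): List<SessionRow> {
    // Compare (timeWrite, idSession) field by field, mirroring max(struct(time_write, id_session)).
    val newest = rows.maxWithOrNull(compareBy<SessionRow>({ it.timeWrite }, { it.idSession }))
        ?: return emptyList()
    // Keep all rows of that session, not only the newest one.
    return rows.filter { it.idSession == newest.idSession }
}

fun main() {
    val rows = listOf(
        SessionRow("jsdfsadfsf", 43434883477, "schema0.table0.csv", "Success"),
        SessionRow("jksdfkjhka", 23212123323, "schema1.table1.csv", "Success"),
        SessionRow("alskdfksjd", 23343212234, "schema2.table2.csv", "Failure")
    )
    println(lastSessionRows(rows))
}
```

Because time_write is the first field of the tuple, id_session only acts as a tie-breaker when two rows share the same timestamp.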

How to add data to Map copying existing values based on List of identifiers

Sorry for the poor title but it is rather hard to describe my use case in a short sentence.
Context
I have the following model:
typealias Identifier = String
data class Data(val identifier: Identifier,
val data1: String,
val data2: String)
And I have three main data structures in my use case:
A Set of Identifiers that exist and are valid in a given context. Example:
val existentIdentifiers = setOf("A-1", "A-2", "B-1", "C-1")
A Map that contains a List of Data objects per Identifier. Example:
val dataPerIdentifier: Map<Identifier, List<Data>> = mapOf(
"A-1" to listOf(Data("A-1", "Data-1-A", "Data-2-A"), Data("A-1", "Data-1-A", "Data-2-A")),
"B-1" to listOf(Data("B-1", "Data-1-B", "Data-2-B")),
"C-1" to listOf(Data("C-1", "Data-1-C", "Data-2-C"))
)
A List of Lists that groups together the Identifiers that should share the same List<Data> (each List always includes exactly 2 Identifiers). Example:
val identifiersWithSameData = listOf(listOf("A-1", "A-2"), listOf("B-1", "B-2"))
Problem / Use Case
The problem that I am trying to tackle stems from the fact that dataPerIdentifier might not contain every Identifier listed in identifiersWithSameData, even though existentIdentifiers contains it. I need to add each such missing Identifier to dataPerIdentifier, copying the List<Data> of its partner that is already in there.
Example
Given the data in the Context section:
A-1=[Data(identifier=A-1, data1=Data-1-A, data2=Data-2-A),
Data(identifier=A-1, data1=Data-1-A, data2=Data-2-A)],
B-1=[Data(identifier=B-1, data1=Data-1-B, data2=Data-2-B)],
C-1=[Data(identifier=C-1, data1=Data-1-C, data2=Data-2-C)]
The desired outcome is to update dataPerIdentifier so that it includes:
A-1=[Data(identifier=A-1, data1=Data-1-A, data2=Data-2-A),
Data(identifier=A-1, data1=Data-1-A, data2=Data-2-A)],
B-1=[Data(identifier=B-1, data1=Data-1-B, data2=Data-2-B)],
C-1=[Data(identifier=C-1, data1=Data-1-C, data2=Data-2-C)],
A-2=[Data(identifier=A-2, data1=Data-1-A, data2=Data-2-A),
Data(identifier=A-2, data1=Data-1-A, data2=Data-2-A)]
The reason is that existentIdentifiers contains A-2 that is missing in the initial dataPerIdentifier Map. B-2 is also missing in the initial dataPerIdentifier Map but existentIdentifiers does not contain it, so it is ignored.
Possible solution
I already have working code (the handleDataForMultipleIdentifiers() method does the heavy lifting), but it does not feel like the cleanest or most readable solution:
fun main(args: Array<String>) {
    val existentIdentifiers = setOf("A-1", "A-2", "B-1", "C-1")
    val dataPerIdentifier: Map<Identifier, List<Data>> = mapOf(
        "A-1" to listOf(Data("A-1", "Data-1-A", "Data-2-A"), Data("A-1", "Data-1-A", "Data-2-A")),
        "B-1" to listOf(Data("B-1", "Data-1-B", "Data-2-B")),
        "C-1" to listOf(Data("C-1", "Data-1-C", "Data-2-C"))
    )
    val identifiersWithSameData = listOf(listOf("A-1", "A-2"), listOf("B-1", "B-2"))
    print("Original Data: ")
    println(dataPerIdentifier)
    print("Target Data: ")
    println(dataPerIdentifier.handleDataForMultipleIdentifiers(identifiersWithSameData, existentIdentifiers))
}
fun Map<Identifier, List<Data>>.handleDataForMultipleIdentifiers(
    identifiersWithSameData: List<List<Identifier>>,
    existentIdentifiers: Set<Identifier>
): Map<Identifier, List<Data>> {
    val additionalDataPerIdentifier = identifiersWithSameData
        .mapNotNull { identifiersList ->
            // Find the identifier of the group that already has data
            val identifiersWithData = identifiersList.find { it in this.keys }
            identifiersWithData?.let { it to identifiersList.minus(it).filter { it in existentIdentifiers } }
        }.flatMap { (existentIdentifier, additionalIdentifiers) ->
            val existentIdentifierData = this[existentIdentifier].orEmpty()
            // Copy the existing data for each additional identifier
            additionalIdentifiers.associateWith { identifier -> existentIdentifierData.map { it.copy(identifier = identifier) } }.entries
        }.associate { it.key to it.value }
    return this + additionalDataPerIdentifier
}
typealias Identifier = String
data class Data(val identifier: Identifier,
val data1: String,
val data2: String)
So my question is: how can I do this in a simpler way?
If identifiersWithSameData always contains 2 identifiers per item then it should not really be a list of lists, but rather a list of pairs or dedicated data classes. And if you convert this data structure into a map like this:
val identifiersWithSameData = mapOf("A-1" to "A-2", "A-2" to "A-1", "B-1" to "B-2", "B-2" to "B-1")
Then the whole solution is pretty simple:
existentIdentifiers.associateWith {
dataPerIdentifier[it] ?: dataPerIdentifier[identifiersWithSameData[it]!!]!!
}
I'm not sure about the two !! operators; for example, I don't know whether every identifier in existentIdentifiers is guaranteed to exist in identifiersWithSameData as well. You may need to tune this solution a little.
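If those guarantees do not hold, a variant without !! might look like the sketch below. It is only an illustration under assumptions: fillMissing and the partners parameter are hypothetical names, and unlike the one-liner above it keeps all original entries and rewrites the identifier in the copied Data objects, as the desired outcome in the question shows.

```kotlin
typealias Identifier = String

data class Data(val identifier: Identifier, val data1: String, val data2: String)

// Hypothetical helper: adds entries for existing identifiers that have no data yet,
// copying the data of their partner. Identifiers without a usable partner are skipped.
fun fillMissing(
    dataPerIdentifier: Map<Identifier, List<Data>>,
    partners: Map<Identifier, Identifier>,
    existentIdentifiers: Set<Identifier>
): Map<Identifier, List<Data>> {
    val additions = existentIdentifiers
        .filter { it !in dataPerIdentifier }
        .mapNotNull { missing ->
            // No partner, or partner without data: nothing to copy.
            val source = partners[missing]?.let { dataPerIdentifier[it] } ?: return@mapNotNull null
            missing to source.map { it.copy(identifier = missing) }
        }
    return dataPerIdentifier + additions
}

fun main() {
    val existentIdentifiers = setOf("A-1", "A-2", "B-1", "C-1")
    val dataPerIdentifier = mapOf(
        "A-1" to listOf(Data("A-1", "Data-1-A", "Data-2-A")),
        "B-1" to listOf(Data("B-1", "Data-1-B", "Data-2-B")),
        "C-1" to listOf(Data("C-1", "Data-1-C", "Data-2-C"))
    )
    val partners = mapOf("A-1" to "A-2", "A-2" to "A-1", "B-1" to "B-2", "B-2" to "B-1")
    println(fillMissing(dataPerIdentifier, partners, existentIdentifiers))
}
```

Here A-2 is added with copies of the A-1 data, while B-2 is ignored because it is not in existentIdentifiers.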

About binarySearch() of Kotlin List

I ran the examples from the official Kotlin documentation locally in Android Studio and found that the results differ from what I expected, but I don't know what is causing this.
data class Produce(
    val name: String,
    val price: Double
)
This is the data class I defined.
val list2 = listOf(
    Produce("AppCode", 52.0),
    Produce("IDEA", 182.0),
    Produce("VSCode", 2.75),
    Produce("Eclipse", 1.75)
)
This is my source list.
println(list2.sortedWith(compareBy<Produce> {
    it.price
}.thenBy {
    it.name
}))
The output on the console is:
[Produce(name=Eclipse, price=1.75), Produce(name=VSCode, price=2.75), Produce(name=AppCode, price=52.0), Produce(name=IDEA, price=182.0)]
I call binarySearch() like this:
println("result: ${
    list2.binarySearch(
        Produce("AppCode", 52.0), compareBy<Produce> {
            it.price
        }.thenBy {
            it.name
        }
    )
}")
I think the result should be 2, but it is 0:
result: 0
I don't know why it turned out like this. Please help me. Thanks a lot.
sortedWith() does not modify the list; it returns a new, sorted list. When you call list2.binarySearch(), you are still searching through the original, unsorted list.
You need to either do something like:
list2.sortedWith().binarySearch()
Or create your list with mutableListOf() and then use sort() which sorts in-place.
Broot is right. You need to pass the sorted list to the binarySearch() function. To clarify in code:
val comparator = compareBy<Produce> { it.price }.thenBy { it.name }
val sorted = list2.sortedWith(comparator)
println(sorted.joinToString("\n"))
val foundIndex = sorted.binarySearch(Produce("AppCode", 52.0), comparator)
println("Found at: $foundIndex")
Result:
Produce(name=Eclipse, price=1.75)
Produce(name=VSCode, price=2.75)
Produce(name=AppCode, price=52.0)
Produce(name=IDEA, price=182.0)
Found at: 2
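The in-place alternative mentioned in the first answer (a mutable list sorted with the same comparator) can be sketched as follows. The name byPriceThenName and the "Missing" element are only illustrative; the second lookup shows what binarySearch() returns for an element that is not in the list.

```kotlin
data class Produce(val name: String, val price: Double)

// The same ordering must be used for sorting and for searching.
val byPriceThenName = compareBy<Produce> { it.price }.thenBy { it.name }

fun main() {
    val products = mutableListOf(
        Produce("AppCode", 52.0),
        Produce("IDEA", 182.0),
        Produce("VSCode", 2.75),
        Produce("Eclipse", 1.75)
    )
    // sortWith() reorders the mutable list itself instead of returning a copy.
    products.sortWith(byPriceThenName)

    val found = products.binarySearch(Produce("AppCode", 52.0), byPriceThenName)
    // For an absent element the result is negative: -(insertion point) - 1.
    val missing = products.binarySearch(Produce("Missing", 10.0), byPriceThenName)
    println("found=$found missing=$missing") // prints "found=2 missing=-3"
}
```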

get a list of parsed json elements

I parsed a json string to the following object structure using gson:
data class Base (
    val expand: String,
    val startAt: Long,
    val maxResults: Long,
    val total: Long,
    val issues: List<Issue>
)
data class Issue (
    val expand: String,
    val id: String,
    val self: String,
    val key: String,
    val fields: Fields
)
data class Fields (
    val summary: String,
    val issuetype: Issuetype,
    val customfield10006: Long? = null,
    val created: String,
    val customfield11201: String? = null,
    val status: Status,
    val customfield10002: Customfield10002? = null,
    val customfield10003: String? = null
)
Everything works fine and also the object model is correct, because I can access each element of the object.
However, I don't know how to get a list of all fields elements. So far, I have only figured out how to access a single item (by using an index and the get() function):
val baseObject = gson.fromJson(response, Base::class.java)
val fieldsList = baseObject.issues.get(0).fields
I actually want to have a list of all field elements and not just one. Is there a gson function allowing me to do that? I couldn't find anything about it in the gson documentation for java.
You don't need a gson function once you have created baseObject. You just need to take the fields of each issue, and you can use the map function for that: it converts each issue into a new value, in this case the issue's fields.
val fieldFromAllIssues: List<Fields> = baseObject.issues.map { it.fields }
In this context, it refers to a single issue: it is the implicit name of the lambda's only parameter.
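As a self-contained illustration without gson, the same mapping can be tried on hand-built objects. The classes below are trimmed-down, hypothetical versions of the ones in the question, and allFields is an illustrative helper name:

```kotlin
// Simplified stand-ins for the parsed model (most properties omitted).
data class Fields(val summary: String, val created: String)
data class Issue(val id: String, val key: String, val fields: Fields)
data class Base(val total: Long, val issues: List<Issue>)

// map() visits every issue and collects its fields into a new list.
fun allFields(base: Base): List<Fields> = base.issues.map { issue -> issue.fields }

fun main() {
    val base = Base(
        total = 2,
        issues = listOf(
            Issue("1", "PRJ-1", Fields("First issue", "2021-01-01")),
            Issue("2", "PRJ-2", Fields("Second issue", "2021-01-02"))
        )
    )
    println(allFields(base).map { it.summary })
}
```

Writing the parameter explicitly as issue is equivalent to using the implicit it.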

Is it possible to parameterize queries or parameters for an Acolyte ScalaCompositeHandler?

Background:
I have attempted to accomplish the question defined here, and I have not been able to succeed. Acolyte requires you to define the queries and parameters you want to handle within a match expression, and the values used in match expressions must be known at compile time. (Note, however, that this StackOverflow answer appears to provide a way around this limitation).
If this is indeed not possible, the inability to dynamically define the parameters and queries for Acolyte would be, for my use case, a severe limitation of the framework. I suspect this would be a limitation for others as well.
One SO user who has advocated for the use of Acolyte across a handful of questions stated in this comment that it is possible to dynamically define queries and their responses. So, I have opened this question as an invitation for someone to show that to be the case.
Question:
Using Acolyte, I want to be able to encapsulate the logic for matching queries and generating their responses. This is a desired feature because I want to keep my code DRY. In other words, I am looking for something like the following pseudo-code:
def generateHandler(query: String, accountId: Int, parameters: Seq[String]): ScalaCompositeHandler = AcolyteDSL.handleQuery {
parameters.foreach(p =>
// Tell the handler to handle this specific parameter
case acolyte.jdbc.QueryExecution(query, ExecutedParameter(accountId) :: ExecutedParameter(p) :: Nil) =>
someResultFunction(p)
)
}
Is this possible in Acolyte? If so, please provide an example.
It is indeed possible to parameterize queries and/or parameters by utilizing pattern matching.
See the code below for an example:
import java.sql.DriverManager
import acolyte.jdbc._
import acolyte.jdbc.Implicits._
import org.scalatest.FunSpec
class AcolyteTest extends FunSpec {
  describe("Using pattern matching to extract a query parameter") {
    it("should extract the parameter and make it usable for dynamic result returning") {
      val query = "SELECT someresult FROM someDB WHERE id = ?"
      val rows = RowLists.rowList1(classOf[String] -> "someresult")
      val handlerName = "testOneHandler"
      val handler = AcolyteDSL.handleQuery {
        case acolyte.jdbc.QueryExecution(`query`, ExecutedParameter(id) :: _) =>
          rows.append(id.toString)
      }
      Driver.register(handlerName, handler)
      val connection = DriverManager.getConnection(s"jdbc:acolyte:anything-you-want?handler=$handlerName")
      val preparedStatement = connection.prepareStatement(query)
      preparedStatement.setString(1, "hello world")
      val resultSet = preparedStatement.executeQuery()
      resultSet.next()
      assertResult(resultSet.getString(1))("hello world")
    }
    it("should support a slightly more complex example") {
      val firstResult = "The first result"
      val secondResult = "The second result"
      val query = "SELECT someresult FROM someDB WHERE id = ?"
      val rows = RowLists.rowList1(classOf[String] -> "someresult")
      val results: Map[String, RowList1.Impl[String]] = Map(
        "one" -> rows.append(firstResult),
        "two" -> rows.append(secondResult)
      )
      def getResult(parameter: String): QueryResult = {
        results.get(parameter) match {
          case Some(row) => row.asResult()
          case _ => acolyte.jdbc.QueryResult.Nil
        }
      }
      val handlerName = "testTwoHandler"
      val handler = AcolyteDSL.handleQuery {
        case acolyte.jdbc.QueryExecution(`query`, ExecutedParameter(id) :: _) =>
          getResult(id.toString)
      }
      Driver.register(handlerName, handler)
      val connection = DriverManager.getConnection(s"jdbc:acolyte:anything-you-want?handler=$handlerName")
      val preparedStatement = connection.prepareStatement(query)
      preparedStatement.setString(1, "one")
      val resultSetOne = preparedStatement.executeQuery()
      resultSetOne.next()
      assertResult(resultSetOne.getString(1))(firstResult)
      preparedStatement.setString(1, "two")
      val resultSetTwo = preparedStatement.executeQuery()
      resultSetTwo.next()
      assertResult(resultSetTwo.getString(1))(secondResult)
    }
  }
}