Lucene: TokenFilter to replace chars and produce new tokens as synonyms

I want to map chars like this:
private static final Map<String, String> MAP = Map.of(
        "CH", "X",
        "X", "CH",
        "I", "Y",
        "Y", "I",
        "S", "Z",
        "Z", "S",
        "F", "PH",
        "PH", "F");
So, for example, XANTION would also be indexed as CHANTION, and PHYTOVEIN as FITOVEIN, keeping the original tokens.
These are medicine names; the filter would generate "synonyms" for them, to be used when analyzing search terms.
Could I use any existing token filter?
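As far as I know there is no stock filter that applies such bidirectional character mappings while keeping the original token: PatternReplaceFilter rewrites the term in place, and SynonymGraphFilter needs every synonym pair spelled out. The usual route is a small custom TokenFilter that queues the rewritten variant and emits it at position increment 0. A minimal sketch, assuming uppercased terms as in the map above (the class name and the single-pass rewrite are my own; it emits one variant with all mappings applied at once, whereas generating each substitution separately would mean queueing several pending variants):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class CharSwapSynonymFilter extends TokenFilter {

    private static final Map<String, String> MAP = Map.of(
            "CH", "X", "X", "CH", "I", "Y", "Y", "I",
            "S", "Z", "Z", "S", "F", "PH", "PH", "F");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
    private State savedState;       // attributes of the original token
    private String pendingVariant;  // variant waiting to be emitted

    public CharSwapSynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingVariant != null) {
            restoreState(savedState);                  // reuse offsets of the original
            termAtt.setEmpty().append(pendingVariant);
            posIncAtt.setPositionIncrement(0);         // same position => synonym
            pendingVariant = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String original = termAtt.toString();
        String variant = rewrite(original);
        if (!variant.equals(original)) {
            savedState = captureState();               // emit the variant on the next call
            pendingVariant = variant;
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        savedState = null;
        pendingVariant = null;
    }

    // Single left-to-right pass: two-char keys ("CH", "PH") win over one-char keys,
    // and every input position is consumed exactly once, so the swaps of a pair
    // (e.g. X -> CH and CH -> X) cannot undo each other within one token.
    private static String rewrite(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 2);
        int i = 0;
        while (i < term.length()) {
            if (i + 2 <= term.length() && MAP.containsKey(term.substring(i, i + 2))) {
                sb.append(MAP.get(term.substring(i, i + 2)));
                i += 2;
            } else {
                String one = term.substring(i, i + 1);
                sb.append(MAP.getOrDefault(one, one));
                i += 1;
            }
        }
        return sb.toString();
    }
}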


How to define an incomplete enum?

I want to define a property that can be either one of the values of an enum or some other string. I do not want to define the property as a plain string without an enum, and I do not want to add an OTHER value to the enum.
Definition of a property 'p':
"p": {
"type": "string",
"enum": ["A", "B", "C"]
}
I want it:
{
"p": "D"
}
to be valid.
Use the oneOf keyword: understanding-json-schema/reference/combining
One schema branch contains your enum; the other can use the pattern keyword, if there is a valid regular expression that defines the "other" values.
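For instance, a sketch assuming the "other" values are single uppercase letters outside the enum (the pattern here is hypothetical; note that oneOf requires exactly one branch to match, so the pattern must not also cover the enum values, otherwise use anyOf):
"p": {
  "oneOf": [
    { "type": "string", "enum": ["A", "B", "C"] },
    { "type": "string", "pattern": "^[D-Z]$" }
  ]
}
With this, both {"p": "A"} (enum branch) and {"p": "D"} (pattern branch) validate.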

Display discrepancies between collections

When asserting the equality of two string collections using
collectionA.Should().Equal(collectionB) the message is:
"Expected collectionA to be equal to {"a", "b", "c", …5 more…} , but {"a", "b", "c", …13 more…} contains 8 item(s) too many."
Is there a way to display the actual discrepancies between the collections?
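Not with Equal() itself as far as I know, but one workaround is to compute the differences with plain LINQ (using System.Linq) and assert on those; note that Except() ignores ordering and duplicates, so this complements rather than replaces the Equal() check:
var missing = collectionA.Except(collectionB).ToList();  // expected but absent
var extra = collectionB.Except(collectionA).ToList();    // present but not expected
missing.Should().BeEmpty("these expected items were missing: {0}", string.Join(", ", missing));
extra.Should().BeEmpty("these items were not expected: {0}", string.Join(", ", extra));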

How to query the items of a nested json/dict

My table is contacts_contact and the column is called fields.
Just for context, the first part of the JSON I'm interested in represents a flow ID and the corresponding response from the user.
Like this: <Key, which is the flow id>: {"text": "false"}
Here "text" is the data type of my field and "false" means that the user did not consent (they could also choose "true" and consent).
In pseudo code, here's what I'm trying to do:
Search through contacts_contact
Return all rows where the key is 6784cbd4-505d-4ee4-8568-fb69913d6998
AND the reply (value) is false. (meaning that 'text' is 'false')
In other words, every row that has "6784cbd4-505d-4ee4-8568-fb69913d6998": {"text": "false"} in the fields column should be SELECTED.
Here's a sample data under the fields column. The first part is what I'm interested in retrieving.
{
  "6784cbd4-505d-4ee4-8568-fb69913d6998": {"text": "false"},
  "70454b00-f408-4e69-8013-b010c3130fdd": {"text": "2020-05-04", "datetime": "2020-05-04T09:38:42.329388+02:00"},
  "9fc9e443-4bbb-4356-b9cc-71a6c15ded0e": {"text": "<1 month"},
  "abb3bb06-d4b7-4a58-8a3f-b100074b20a2": {"text": "<1 month"},
  "b55eb0e6-af0d-48c7-b2eb-f46529bdd07b": {"text": "True"},
  "b692354b-f314-406a-8ed8-47b7dde34379": {"text": "true"},
  "c7d75b60-f1d8-4588-affa-4ef148c75873": {"text": "WhatsApp"},
  "c80e14a9-e10f-41c8-ae59-c0ca14abf806": {"text": "true"},
  "cbfd64b8-739c-4913-b8c9-ba366043f1bd": {"text": "2020-04-20T00:00:00.000000+02:00", "datetime": "2020-04-20T00:00:00.000000+02:00"},
  "d5423a09-5486-4b80-bcae-4fc1e11b0dfa": {"text": "true"},
  "db36481d-3bb2-435d-b63d-bc9d5b5eadd3": {"text": "TRUE"},
  "e0f301ab-56eb-4bba-a6c1-d9668033172f": {"text": "Late Adopter"}
}
This can be done using the containment operator @>:
select *
from contacts_contact
where fields @> '{"6784cbd4-505d-4ee4-8568-fb69913d6998": {"text": "false"}}';
The above assumes that fields is a jsonb column (which it should be). If it's not, you need to cast it: fields::jsonb
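If the query has to be fast on a large table, note that a GIN index on the column supports the @> operator (the index name here is just a placeholder):
create index contacts_contact_fields_idx on contacts_contact using gin (fields);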

Kotlin how to split a list in to sublists

I would like to split a list into a few sublists, but I have no idea how to do it.
One of my ideas was splitting the list by the index of an element. For example, "B" is at index 0 and "S" is at index 2, so I would like to take the part between indexes 0 and 1 into the first sublist; the second sublist should then be the part between indexes 2 and 5.
Example of my list:
val listOfObj = listOf("B", "B" , "S", "B", "B", "X", "S", "B", "B", "P")
Result after splitting:
listOf(listOf("B","B"), listOf("S", "B", "B", "X"), listOf("S", "B", "B", "P") )
How do I achieve such a result?
Here it goes. I wrote this from my phone without checking, but the idea is simple:
val result = mutableListOf<List<String>>()
var current = mutableListOf<String>()
listOfObj.forEach { letter ->
    if (letter == "S") {
        // a separator starts a new sublist; flush the one collected so far
        if (current.isNotEmpty()) {
            result.add(current)
        }
        current = mutableListOf()
    }
    current.add(letter)
}
if (current.isNotEmpty()) {
    result.add(current)
}
You can even create an extension function on List<T> that takes a separator element as a parameter and returns a list of lists, as sketched below.
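A possible shape for that extension (the name splitBefore is made up; every occurrence of the separator starts a new sublist and stays at its head, matching the expected output above):
fun <T> List<T>.splitBefore(separator: T): List<List<T>> {
    val result = mutableListOf<MutableList<T>>()
    for (element in this) {
        // start a new chunk on each separator, and for the very first element
        if (element == separator || result.isEmpty()) {
            result.add(mutableListOf())
        }
        result.last().add(element)
    }
    return result
}
val parts = listOfObj.splitBefore("S")
// [[B, B], [S, B, B, X], [S, B, B, P]]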

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:
{
  key_a: value_a,
  key_b: {
    key_c: value_c,
    key_d: value_d
  },
  key_e: {
    key_f: value_f,
    key_g: value_g
  }
}
The keys under key_e are dynamic, i.e. in one response key_e will contain key_f and key_g, and in another it will instead contain key_h and key_i. New keys can be created at any time, so I cannot define a record with nullable fields for all possible keys.
Instead I want to create a column with a JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with STRING datatype, but the value of key_e is analysed as JSON and so the load fails.
How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?
Having your JSON as a single string column inside BigQuery is definitely an option. However, if you have a large volume of data, this can end up with a high query price, as all of your data will end up in one column, and the actual querying logic can become quite messy.
If you have the luxury of slightly changing your "design", I would recommend considering the one below, where you can employ REPEATED mode.
Table schema:
[
  { "name": "key_a", "type": "STRING" },
  { "name": "key_b", "type": "RECORD", "mode": "REPEATED",
    "fields": [
      { "name": "key",   "type": "STRING" },
      { "name": "value", "type": "STRING" }
    ]
  },
  { "name": "key_e", "type": "RECORD", "mode": "REPEATED",
    "fields": [
      { "name": "key",   "type": "STRING" },
      { "name": "value", "type": "STRING" }
    ]
  }
]
Example of JSON to load
{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}
Please note: it should be newline-delimited JSON, so each row must be on one line.
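With this layout, picking out one of the dynamic keys becomes an ordinary UNNEST in standard SQL (dataset and table names are placeholders):
SELECT key_a, e.value
FROM `my_dataset.my_table` t, UNNEST(t.key_e) AS e
WHERE e.key = 'key_f'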
You can't do this directly with BigQuery, but you can make it work in two passes:
(1) Import your JSON data as a CSV file with a single string column.
(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.
Example
I'll start with some JSON:
{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}
Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'þ') or it tripped over the default ','.
Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.
Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:
a (INTEGER)
b (STRING)
c (STRING -- packed JSON)
We can do this with a UDF:
// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
'extractAndRepack', // Name of the function exported to SQL
['blob'], // Names of input columns
[{'name': 'a', 'type': 'integer'}, // Output schema
{'name': 'b', 'type': 'string'},
{'name': 'c', 'type': 'string'}],
function (row, emit) {
var parsed = JSON.parse(row.blob);
var repacked = JSON.stringify(parsed.c);
emit({a: parsed.a, b: parsed.b, c: repacked});
}
);
And a corresponding query:
SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)
Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.
Row  a  b      c
1    0  zero   {"woodchuck":"a"}
2    1  one    {"chipmunk":"b"}
3    2  two    {"squirrel":"c"}
4    3  three  {"chinchilla":"d"}
5    4  four   {"capybara":"e"}
6    5  five   {"housemouse":"f"}
7    6  six    {"molerat":"g"}
8    7  seven  {"marmot":"h"}
9    8  eight  {"badger":"i"}
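The packed c column can then be queried with the JSON functions the question mentions, for example (legacy SQL to match the UDF above; the table name is a placeholder):
SELECT a, b, JSON_EXTRACT_SCALAR(c, '$.woodchuck') AS woodchuck
FROM [my_dataset.repacked]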
One way to do it is to load this file as CSV instead of JSON (and quote the values or eliminate newlines in the middle); then it will become a single STRING column inside BigQuery.
P.S. You are right that having a native JSON data type would have made this scenario much more natural, and the BigQuery team is well aware of it.