Elasticsearch / Lucene Misspelled Whitespace

How can I make Elasticsearch correct queries in which words that should be separated by whitespace were typed run together? E.g.
"thisisaquery" -> "this is a query"
My current settings are:
"settings": {
"index": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase", "engram"
]
}
},
"filter": {
"engram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 10
}
}
}
}
}

There isn't an out-of-the-box tokenizer or token filter that explicitly handles what you're asking for. The closest is the compound word token filter, which requires you to supply a dictionary file; in your case that would likely mean a full English dictionary for it to work correctly. Even then, it would probably have trouble with words that are stems of other words, abbreviations, etc. without a lot of additional logic. It may be good enough, though, depending on your exact requirements.
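To make the dictionary idea concrete, here is a minimal sketch of splitting a run-together query in application code before it reaches Elasticsearch. This is not how the compound word token filter is implemented; it is a plain dynamic-programming word break, and the names DICTIONARY and segment, along with the tiny word list, are made up for illustration.

```typescript
// Hypothetical mini-dictionary; a real one would hold many thousands of words.
const DICTIONARY: ReadonlySet<string> = new Set(["this", "is", "a", "query", "his"]);

// Splits `text` into dictionary words, or returns null when no full split exists.
// best[i] holds one segmentation of text.slice(0, i), preferring fewer words.
function segment(text: string, dictionary: ReadonlySet<string>): string[] | null {
  const best: (string[] | null)[] = new Array(text.length + 1).fill(null);
  best[0] = [];
  for (let end = 1; end <= text.length; end++) {
    for (let start = 0; start < end; start++) {
      const prefix = best[start];
      const word = text.slice(start, end);
      if (prefix != null && dictionary.has(word)) {
        const current = best[end];
        if (current == null || prefix.length + 1 < current.length) {
          best[end] = [...prefix, word];
        }
      }
    }
  }
  return best[text.length] ?? null;
}

const words = segment("thisisaquery", DICTIONARY);
console.log(words === null ? "no split found" : words.join(" "));
```

With a real dictionary there are often several valid splits, which is where the stem and abbreviation problems mentioned above show up; picking the fewest words is only one possible tie-breaker.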

This Ruby project claims to do this. You might try it if you're using Ruby, or just look at its code and copy the analyzer settings :)
https://github.com/ankane/searchkick


Is there precedent for object oriented syntax for a relative path?

I'm creating my own extended version of JSON for various reasons. One thing that I'm adding is the ability to self reference and I'm trying to come up with an OO syntax for relative paths.
To illustrate, let's say I have a nested object that is supposed to reference its parent object
{ my_item: { parent: ??? } }
??? symbolizes the missing syntax.
Now in most operating systems, going up one level is notated as .. so we could try doing the same
{ my_item: { parent: .. } }
Looks pretty neat, however, if I tried to reference anything else in the parent, I'd end up with
{ my_item_sibling: {}, my_item: { sibling_of_parent: ...my_item_sibling } }
Which is not as neat, since it looks the same as the spread syntax ..., which I'm also adding.
I could do something with parentheses, like so
{ my_item_sibling: {}, my_item: { sibling_of_parent: (..).my_item_sibling } }
Which is not terrible but I'd prefer something cleaner.
Maybe I'll reserve a symbol?
{ my_item_sibling: {}, my_item: { sibling_of_parent: #.my_item_sibling } }
In any case, these examples are just to illustrate what I'm doing. If there is an established or a particularly nice looking way to do it, I'll just copy that.
The question is: Is there a precedent for this? A relative path implemented in a C-like language?

How to validate a JSON object against a JSON schema based on object's type described by a field?

I have a number of objects (messages) that I need to validate against a JSON schema (draft-04). Each object is guaranteed to have a "type" field, which describes its type, but every type has a completely different set of other fields, so each type of object needs a unique schema.
I see several possibilities, none of which are particularly appealing, but I hope I'm missing something.
Possibility 1: Use oneOf for each message type. I guess this would work, but the problem is very long validation errors when something goes wrong: validators tend to report every schema that failed, which includes ALL elements in the "oneOf" array.
{
"oneOf":
[
{
"type": "object",
"properties":
{
"t":
{
"type": "string",
"enum":
[
"message_type_1"
]
}
}
},
{
"type": "object",
"properties":
{
"t":
{
"type": "string",
"enum":
[
"message_type_2"
]
},
"some_other_property":
{
"type": "integer"
}
},
"required":
[
"some_other_property"
]
}
]
}
Possibility 2: Nested "if", "then", "else" triads. I haven't tried it, but I guess that maybe errors would be better in this case. However, it's very cumbersome to write, as nested if's pile up.
Possibility 3: A separate schema for every possible value of "t". This is the simplest solution, however I dislike it, because it precludes me from using common elements in schemas (via references).
So, are these my only options, or can I do better?
Since "type" is a JSON Schema keyword, I'll follow your lead and use "t" as the type-discrimination field, for clarity.
There's no particular keyword to accomplish or indicate this (however, see https://github.com/json-schema-org/json-schema-spec/issues/31 for discussion). This is because, for the purposes of validation, everything you need to do is already possible. Errors are secondary to validation in JSON Schema. All we're trying to do is limit how many errors we see, since it's obvious there's a point where errors are no longer productive.
Normally when you're validating a message, you know its type first, then you read the rest of the message. For example in HTTP, if you're reading a line that starts with Date: and the next character isn't a number or letter, you can emit an error right away (e.g. "Unexpected tilde, expected a month name").
However in JSON, this isn't true, since properties are unordered, and you might not encounter the "t" until the very end, if at all. "if/then" can help with this.
But first, begin by factoring out the most important constraints and moving them to the top.
First, use "type": "object" and "required":["t"] in your top level schema, since that's true in all cases.
Second, use "properties" and "enum" to enumerate all the valid values of "t". This way, if "t" really is entered wrong, the error comes out of your top-level schema instead of a subschema.
If all of these constraints pass, but the document is still invalid, then it's easier to conclude the problem must be with the other contents of the message, and not the "t" property itself.
Now in each sub-schema, use "const" to match the subschema to the type-name.
We get a schema like this:
{
  "type": "object",
  "required": ["t"],
  "properties": { "t": { "enum": ["message_type_1", "message_type_2"] } },
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "t": { "const": "message_type_1" }
      }
    },
    {
      "type": "object",
      "properties": {
        "t": { "const": "message_type_2" },
        "some_other_property": {
          "type": "integer"
        }
      },
      "required": [ "some_other_property" ]
    }
  ]
}
Now, split out each type into a different schema file. Make it human-accessible by naming the file after the "t". This way, an application can read a stream of objects and pick the schema to validate each object against.
{
  "type": "object",
  "required": ["t"],
  "properties": { "t": { "enum": ["message_type_1", "message_type_2"] } },
  "oneOf": [
    {"$ref": "message_type_1.json"},
    {"$ref": "message_type_2.json"}
  ]
}
Theoretically, a validator now has enough information to produce much cleaner errors (though I'm not aware of any validators that can do this).
So, if this doesn't produce clean enough error reporting for you, you have two options:
First, you can implement part of the validation process yourself. As described above, use a streaming JSON parser like Oboe.js to read each object in a stream, parse the object and read the "t" property, then apply the appropriate schema.
Or second, if you really want to do this purely in JSON Schema, use "if/then" statements inside "allOf":
{
  "type": "object",
  "required": ["t"],
  "properties": { "t": { "enum": ["message_type_1", "message_type_2"] } },
  "allOf": [
    {"if":{"properties":{"t":{"const":"message_type_1"}}}, "then":{"$ref": "message_type_1.json"}},
    {"if":{"properties":{"t":{"const":"message_type_2"}}}, "then":{"$ref": "message_type_2.json"}}
  ]
}
This should produce errors to the effect of:
t not one of "message_type_1" or "message_type_2"
or
(because t="message_type_2") some_other_property not an integer
and not both.
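For the first option (doing the dispatch on "t" yourself), a rough sketch in TypeScript might look like the following. It assumes each message is already parsed, and it uses hand-written checks as stand-ins for message_type_1.json and message_type_2.json so that no particular validator library's API is implied; the names validators and validateMessage are hypothetical.

```typescript
type Validator = (message: Record<string, unknown>) => string[];

// Hypothetical per-type checks standing in for message_type_1.json and
// message_type_2.json; a real application would call its schema validator here.
const validators = new Map<string, Validator>([
  ["message_type_1", () => []],
  [
    "message_type_2",
    (message) =>
      Number.isInteger(message["some_other_property"])
        ? []
        : ["some_other_property missing or not an integer"],
  ],
]);

// Reads "t" first, then runs only the checks that belong to that type.
function validateMessage(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["message is not an object"];
  }
  const message = input as Record<string, unknown>;
  const t = message["t"];
  const validate = typeof t === "string" ? validators.get(t) : undefined;
  if (validate === undefined) {
    const known = Array.from(validators.keys()).map((k) => JSON.stringify(k));
    return ["t not one of " + known.join(" or ")];
  }
  return validate(message);
}

console.log(validateMessage({ t: "message_type_2", some_other_property: "x" }));
```

Because only one type's checks ever run, a bad message yields either the "t" error or the errors for its own type, never the errors of every alternative.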

Can a JSON schema validator be killed with this schema?

I tried this out with some JSON schema validators and some fail, but the problem is to figure out how much memory a validator uses that causes it to choke and be killed.
It turns out that we can implement finite state machines in JSON schema. To do so, the FSM nodes are object schemas and the FSM edges are a set of JSON Pointers wrapped in an anyOf. The whole thing is rather simple to do, but being able to do this has some consequences: what if we create an FSM that requires 2^N time or memory (depth first search or breadth first search, respectively) given a JSON schema with N definitions and some input to validate?
So let's create a JSON Schema with N definitions to implement a non-deterministic finite state machine (NFA) over an alphabet of two symbols a and b. All we need to do is to encode the regex
(a{N}|a(a|b+){0,N-1}b)*x, where x denotes the end. In the worst case, the NFA for this regex takes 2^N time to match text or 2^N memory (e.g. when converted to a deterministic finite state machine). Now notice that the word abbx can be represented by a JSON pointer a/b/b/x which in JSON is equivalent to {"a":{"b":{"b":{"x":true}}}}.
To encode this NFA as a schema, we first add a definition for state "0":
{
"$schema": "http://json-schema.org/draft-04/schema#",
"$ref": "#/definitions/0",
"definitions": {
"0": {
"type": "object",
"properties": {
"a": { "$ref": "#/definitions/1" },
"x": { "type": "boolean" }
},
"additionalProperties": false
},
Then we add N-1 definitions for each state <DEF> to the schema where <DEF> is enumerated "1", "2", "3", ... "N-1":
"<DEF>": {
"type": "object",
"properties": {
"a": { "$ref": "#/definitions/<DEF>+1" },
"b": {
"anyOf": [
{ "$ref": "#/definitions/0" },
{ "$ref": "#/definitions/<DEF>" }
]
}
},
"additionalProperties": false
},
where "<DEF>+1" wraps back to "0" when <DEF> is equal to N-1.
This "NFA" on a two-letter alphabet has N states, only one initial and one
final state. The equivalent minimal DFA has 2^N (2 to the power N) states.
This means that in the worst case, a validator that uses this schema either must be taking 2^N time or use 2^N memory "cells" to validate the input.
I don't see where this logic can go wrong, unless validators take shortcuts to approximate the validity checking.
I found this here.
I think in principle you are right. I am not 100% sure about the schema construction you've described, but theoretically it should be possible to construct a schema which requires 2^N time or space, exactly for the reasons you describe.
Practically most schema processors will probably just try to recursively validate anyOf. So, that would be exponential time.
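If you want to try this against a particular validator, a small generator saves writing the N definitions by hand. The sketch below only builds the schema from the question and a nested test document from a word; it does not run any validator, and the function names buildNfaSchema and wordToDocument are made up here.

```typescript
type Schema = Record<string, unknown>;

// Builds the N-state schema from the question: state "0" accepts "a" or the
// terminating "x"; states "1".."N-1" accept "a" (advance) or "b" (reset or stay).
function buildNfaSchema(n: number): Schema {
  if (!Number.isInteger(n) || n < 1) throw new RangeError("n must be an integer >= 1");
  const ref = (state: number): Schema => ({ $ref: "#/definitions/" + (state % n) });
  const definitions: Record<string, Schema> = {
    "0": {
      type: "object",
      properties: { a: ref(1), x: { type: "boolean" } },
      additionalProperties: false,
    },
  };
  for (let state = 1; state < n; state++) {
    definitions[String(state)] = {
      type: "object",
      properties: {
        a: ref(state + 1), // wraps back to "0" when state is N-1
        b: { anyOf: [ref(0), ref(state)] },
      },
      additionalProperties: false,
    };
  }
  return {
    $schema: "http://json-schema.org/draft-04/schema#",
    $ref: "#/definitions/0",
    definitions,
  };
}

// Turns a word such as "abbx" into {"a":{"b":{"b":{"x":true}}}}.
function wordToDocument(word: string): unknown {
  let doc: unknown = true;
  for (let i = word.length - 1; i >= 0; i--) {
    doc = { [word.charAt(i)]: doc };
  }
  return doc;
}

console.log(JSON.stringify(wordToDocument("abbx")));
```

Feeding the generated schema and progressively longer documents to a validator, while timing each run, would show whether that validator really goes exponential or takes a shortcut.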

TypeScript wiki (TypeScript-Handbook/pages/Classes.md) first example

https://github.com/Microsoft/TypeScript-Handbook/blob/master/pages/Classes.md
I'm trying to learn TypeScript. The first example in Classes:
class Greeter {
greeting: string;
constructor(message: string) {
this.greeting = message;
}
greet() {
return "Hello, " + this.greeting;
}
}
let greeter = new Greeter("world");
This seems straightforward, but when I log greeter: console.log(greeter);
instead of getting "Hello World" I get "Greeter {greeting: "world"}"
My setup:
package.json: (just TypeScript; no other libraries)
{
"name": "typescript learning",
"version": "1.0.0",
"scripts": {
"start": "concurrently \"npm run tsc:w\" \"npm run lite\" ",
"tsc": "tsc",
"tsc:w": "tsc -w",
"lite": "lite-server"
},
"license": "ISC",
"dependencies": {
"concurrently": "^2.0.0",
"lite-server": "^2.1.0",
"typescript": "^1.8.0"
}
}
and tsconfig.json just defaults:
{
"compilerOptions": {
"target": "es5",
"module": "system",
"moduleResolution": "node",
"sourceMap": true,
"emitDecoratorMetadata": true,
"experimentalDecorators": true,
"removeComments": false,
"noImplicitAny": false
},
"exclude": [
"node_modules",
"typings/main",
"typings/main.d.ts",
"typescript.notes.ts"
]
}
So, am I missing something fundamental? Or is this just an incomplete example that shouldn't be evaluated? Obviously I'm still quite new to TypeScript and don't have any background to take examples apart from their face value. Many thanks for any input,
-Mike
Your problem is:
let greeter = new Greeter("world");
console.log(greeter);
This only shows the class instance itself, and doesn't actually call a method on the class.
So what you want is:
let greeter = new Greeter("world");
console.log(greeter.greet());
To also answer your question in the comments:
One quick question, even though the method greet is part of the class, It doesn't get evaluated by calling the class? I see this is the case, but again, not what I expected. I'm trying to get a model in my mind for using the class instead of separate function.
At its essence, a class is basically nothing more than a collection of methods and variables that logically "belong" together for some reason.
For example, if I have a class Car, it might have the variable fuel and the methods drive() and refuel(). Calling the drive() and refuel() methods would alter the variable fuel. This way, you can easily create one, two, or a hundred instances of one class, and still easily keep track of stuff. Without object-oriented programming, all of that would be a lot harder to keep track of, especially when you start creating multiple cars.
Obviously, you don't want to immediately start drive() when creating a new car! There is the constructor method in your code, which does get run automatically every time an instance of the class is created. This is often useful to initialize some things, but is really nothing more than a shortcut for something like:
let greeter = new Greeter();
greeter.set_message("world")
Except that you can't forget to use it ;-) The constructor is often used for variables that the class should always have, like the string in your example, or, in our Car example, setting the fuel to some initial level. Hence the name: it is needed to construct an instance of the class.
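As a rough sketch of the Car idea in TypeScript (the one-unit-per-drive rule and the parameter names are just assumptions for the example):

```typescript
class Car {
  fuel: number;

  // Runs automatically on `new Car(...)`, so every car starts with a fuel level.
  constructor(initialFuel: number) {
    this.fuel = initialFuel;
  }

  // Burns one unit of fuel per call; returns false when the tank is empty.
  drive(): boolean {
    if (this.fuel <= 0) return false;
    this.fuel -= 1;
    return true;
  }

  // Adds fuel to this particular car only.
  refuel(amount: number): void {
    this.fuel += amount;
  }
}

const firstCar = new Car(2);
const secondCar = new Car(5);
firstCar.drive(); // nothing drives until you call the method yourself
console.log(firstCar.fuel, secondCar.fuel); // prints 1 5
```

Each instance keeps its own fuel, and creating a car only runs the constructor, which is the same reason new Greeter("world") does not call greet() for you.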
In the "real world" most classes are a bit more abstract, and there are some features which allow you to do more (like inheritance), but the basic idea is the same: a class is a collection of methods and variables that logically belong to the same "thing" or "object". I feel some guides make this a lot more complicated than it needs to be, by the way, as they want to introduce concepts such as inheritance right from the start without fully explaining the basic purpose of classes.
Don't worry if you don't fully comprehend everything when you're just starting out. I think few people do. I certainly didn't. Almost everyone struggles with stuff like this at first.
greeter is an object. So, calling console.log(greeter); logs the actual object, whose greeting property is set to "world".
You want to log greeter.greet() in order to see "Hello, world."

NSJSONSerialization generating NSCFString* and NSTaggedPointerString*

Executing NSJSONSerialization on the following json sometimes gives me NSCFString* and sometimes NSTaggedPointerString* on string values. Does anyone know why this is the case and what NSJSONSerialization uses to determine which type it returns?
jsonData = [NSJSONSerialization JSONObjectWithData:data
options:kNilOptions
error:&parseError];
{
"UserPermissionsService": {
"ServiceHeader": {},
"UserApplicationPermissions": {
"ApplicationPermissions": {
"ApplicationID": "TEST",
"Permission": [
{
"Locations": [
"00000"
],
"PermissionID": "LOGIN"
},
{
"Locations": [
"00000"
],
"PermissionID": "SALES_REPORT_VIEW"
}
]
}
}
}
}
"LOGIN" comes back as an NSTaggedPointerString*. "SALES_REPORT_VIEW" comes back as an NSCFString*. This is having an impact downstream where I'm using and casting the values.
UPDATE
Here's what I learned...
"NSTaggedPointerString results when the whole value can be kept in the pointer itself without allocating any data."
There's a detailed explanation here...
https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html
Since NSTaggedPointerString is a subclass of NSString, showing quotes or not showing quotes should never be an issue for me as the data is used.
Thanks to everyone who commented. I'm comfortable I understand what NSJSONSerialization is doing.
Much of Foundation is implemented as class clusters. TL;DR: you interact with it as an NSString, but Foundation will change the backing implementation to optimize for certain performance or space characteristics based on the actual contents.
If you are curious one of the Foundation team dumped a list of the class clusters as of iOS 11 here
I fixed it by using "mutableCopy"
I had the same issue. As some kind of "performance" mechanism, Apple apparently uses NSTaggedPointerString for "well known" strings such as "California", but this can be an issue since, for some weird reason, NSJSONSerialization doesn't add the quotes around strings of this NSTaggedPointerString type. The workaround is simple:
NSString *taggedString = @"California";
[data setObject:[taggedString mutableCopy] forKey:@"state"];
Works like a charm.