Filter by array size in Kibana Discover (Lucene)

I am trying to write a Lucene query in Kibana's Discover that returns only documents in which the array stored under a certain key has more than some number of elements.
In other words, I want all documents where the length of a specific array-valued field is greater than a given number.

There's no way to do this with a Lucene query. You'd need to create a Painless scripted/runtime field that counts the array items, then query that field with KQL.


Find out the amount of space each field takes in Google BigQuery

I want to optimize the space used by my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the Validation Message, which shows the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed; this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such “column profiling” for many tables, or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, loop through all fields and build the respective SELECT statement, then dry-run it (within the loop, for each column) and read totalBytesProcessed, which, as you already know, is the size of the respective column.
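The loop described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `leaf_paths`, `column_queries` and `profile_columns` are hypothetical helper names, the schema is assumed to be the list of field dicts (`name`, `type`, optional `fields`) found in a Tables.get response, and `dry_run_bytes` is a placeholder for whatever function submits a dry-run query through your client library and returns totalBytesProcessed. Whether a dotted path into a repeated record is accepted as written depends on the SQL dialect, so the generated statements may need adjusting.

```python
from typing import Callable, Dict, Iterator, List


def leaf_paths(fields: List[dict], prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of all leaf (non-record) fields in a schema."""
    for field in fields:
        path = prefix + field["name"]
        if field.get("type") in ("RECORD", "STRUCT"):
            # Descend into nested records and prefix the children with the parent name.
            yield from leaf_paths(field.get("fields", []), path + ".")
        else:
            yield path


def column_queries(table: str, fields: List[dict]) -> Dict[str, str]:
    """Build one single-column SELECT statement per leaf field."""
    return {path: f"SELECT {path} FROM {table}" for path in leaf_paths(fields)}


def profile_columns(
    table: str,
    fields: List[dict],
    dry_run_bytes: Callable[[str], int],  # assumption: wraps a dry-run API call
) -> Dict[str, int]:
    """Map each leaf field to the byte count reported by a dry run."""
    return {
        path: dry_run_bytes(sql)
        for path, sql in column_queries(table, fields).items()
    }
```

Keeping the dry-run call behind a plain callable means the schema walking can be checked locally, without credentials, before it is pointed at the real API.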
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types you can calculate the size directly from the row count.
For types such as string, you could get the average length by querying e.g. the first 1000 rows, and use this for your storage calculations.
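As a rough illustration of this kind of estimate, here is a minimal sketch. The per-value byte sizes in `FIXED_BYTES` and the 2-byte string overhead are assumptions based on BigQuery's documented data sizes and should be checked against the current documentation; the function names are hypothetical, and NULLs and repeated fields are ignored.

```python
# Assumed per-value sizes in bytes; verify against the BigQuery pricing docs.
FIXED_BYTES = {"INTEGER": 8, "FLOAT": 8, "BOOLEAN": 1, "TIMESTAMP": 8}
STRING_OVERHEAD = 2  # assumed fixed overhead per string value


def estimate_fixed(type_name: str, num_rows: int) -> int:
    """Size of a fixed-width column: bytes per value times row count."""
    return FIXED_BYTES[type_name] * num_rows


def estimate_string(sample: list, num_rows: int) -> int:
    """Extrapolate a string column's size from the average sampled length."""
    avg_len = sum(len(s.encode("utf-8")) for s in sample) / len(sample)
    return int(round((avg_len + STRING_OVERHEAD) * num_rows))
```

The string estimate is only as good as the sample: if the first rows are not representative of the table, sample more widely.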

Scala: Find edit-distance for all elements of a list not fitting in memory

In my previous question I asked for advice on an algorithm to compare all elements in a huge list:
Scala: Compare all elements in a huge list
A more general problem I am facing, and would be grateful to get some advice on, is approximate comparison of list elements for a list that does not fit into memory all at once. I am building this list from an SQL request that returns a cursor for iterating over a single string field of about 70 000 000 records. I need to find the edit distance between every two string elements in this list.
My idea here is to use a sliding window of N records to compare all 70 000 000 records:
1. Read N elements into a list that fits nicely into memory (N ~ 10 000).
2. Calculate the edit distance for all elements in this list using the algorithm described here: Scala: Compare all elements in a huge list
3. Read the next N elements (from N to 2N-1) into a new list. Compare all these as in 2.
4. Rewind the SQL query cursor to the first record.
5. Compare every string from index 0 to N with all strings in this new list using the same algorithm as in 2.
6. Slide the window to read the strings from 2N to 3N-1 into a new list.
7. Compare every string from index 0 to 2N with all strings in this new list using the same algorithm as in 2.
I need to write all comparison results into the DB as (String, String, Distance) records, where the first two elements are the strings being matched and the third is the result.
How do I force Scala to garbage-collect the lists that are no longer needed from the previous steps of this algorithm?
This algorithm is awful in terms of the number of calculations required to do the job. Are there any other algorithms or ideas on how to reduce the complexity?
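For illustration, the windowed comparison could be organised so that each pair is compared exactly once and only two blocks are held in memory at a time. The following is a minimal sketch in Python rather than Scala, under stated assumptions: `open_cursor` is a hypothetical zero-argument callable that returns a fresh iterator over the strings (standing in for re-running the SQL query), and `levenshtein`, `blocks` and `all_pairs` are illustrative names. It does not reduce the quadratic number of comparisons; it only bounds memory use.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rows of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def blocks(source: Iterable[str], size: int) -> Iterator[List[str]]:
    """Read the source in consecutive lists of at most `size` items."""
    it = iter(source)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def all_pairs(
    open_cursor: Callable[[], Iterable[str]], size: int
) -> Iterator[Tuple[str, str, int]]:
    """Yield (s, t, distance) for every unordered pair, one block pair at a time."""
    for i, outer in enumerate(blocks(open_cursor(), size)):
        # Pairs inside the current block.
        for x in range(len(outer)):
            for y in range(x + 1, len(outer)):
                yield outer[x], outer[y], levenshtein(outer[x], outer[y])
        # Pairs against every later block: re-open the cursor, skip blocks 0..i.
        for inner in islice(blocks(open_cursor(), size), i + 1, None):
            for s in outer:
                for t in inner:
                    yield s, t, levenshtein(s, t)
```

Because the results are produced lazily, they can be written to the database as they arrive, and each block becomes unreachable as soon as the loop moves past it, so nothing has to be garbage-collected by hand.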

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
Parse the dataset and put it in a DB like MySQL or Postgres (to be done at the start of the week). The reason I want an RDBMS in the first step is that I want to have the full time-series data.
Then generate some kind of key-value store out of this database with the latest valid data, which supports an operation to find out whether each item is present in the dataset or not (I am thinking of some kind of NoSQL DB optimised for search, like Redis; it should have persistence and be distributed). This data structure will be read-only.
Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast. I will be using this functionality as the back end to a REST API.
Side note: my preferred language is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis SINTER, which performs set intersection.
You might benefit from a grid structure that distributes the numbers across nodes with some hash function, such as the first digit of the phone number (there are probably better ones; you will have to experiment). With an optimal hash and 10 nodes, this would reduce the size per node to about 20 million entries.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
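The first two considerations can be illustrated with a small sketch. It uses plain in-memory Python sets as stand-ins for the key-value nodes, so it shows only the sharding and batched set-intersection logic, not a Redis client; `shard_for` and `ShardedLookup` are hypothetical names, and CRC32 is just one possible hash to experiment with.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(number: str, n_shards: int) -> int:
    """Pick a shard from a hash of the whole number (assumed hash: CRC32)."""
    return zlib.crc32(number.encode("ascii")) % n_shards


class ShardedLookup:
    """Read-only membership store split across n_shards in-memory sets."""

    def __init__(self, numbers: Iterable[str], n_shards: int = 10) -> None:
        self.n_shards = n_shards
        self.shards = [set() for _ in range(n_shards)]
        for number in numbers:
            self.shards[shard_for(number, n_shards)].add(number)

    def contains_many(self, queries: List[str]) -> Dict[str, bool]:
        """Check a whole batch with one set intersection per shard touched."""
        by_shard = defaultdict(set)
        for q in queries:
            by_shard[shard_for(q, self.n_shards)].add(q)
        found = set()
        for shard_id, wanted in by_shard.items():
            # Same idea as intersecting a query set with a stored set.
            found |= wanted & self.shards[shard_id]
        return {q: q in found for q in queries}
```

Hashing the whole number tends to spread entries more evenly than the first digit would, since leading digits of phone numbers are rarely uniformly distributed.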

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25F scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, but I got lost at the part where the author gets the average field length and document length. Could someone guide me as to how or where to get these from?
You can get field length from TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself, and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized, and store the counts for averaging. If your document collection is static, just dump the values for each field into a file and process them after indexing. If the index only ever gets updated with additions, you can store the number of documents and the average length per field, and recompute the average as documents are added. If documents are going to be removed and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.
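The bookkeeping described above, kept outside the index, might look like the following minimal sketch. `FieldLengthStats` is a hypothetical helper, not a Lucene class; it assumes you already have per-field token counts for each document from your own tokenization step, and it averages over all documents, so a document without a given field counts as length zero.

```python
from typing import Dict


class FieldLengthStats:
    """Running per-field token totals, kept outside the index."""

    def __init__(self) -> None:
        self.doc_count = 0
        self.total_tokens: Dict[str, int] = {}

    def add_document(self, field_lengths: Dict[str, int]) -> None:
        """Record a newly indexed document's token count per field."""
        self.doc_count += 1
        for field, length in field_lengths.items():
            self.total_tokens[field] = self.total_tokens.get(field, 0) + length

    def remove_document(self, field_lengths: Dict[str, int]) -> None:
        """Undo add_document; needs the removed document's token counts."""
        self.doc_count -= 1
        for field, length in field_lengths.items():
            self.total_tokens[field] = self.total_tokens.get(field, 0) - length

    def average_length(self, field: str) -> float:
        """Average field length over all documents currently counted."""
        if self.doc_count == 0:
            return 0.0
        return self.total_tokens.get(field, 0) / self.doc_count
```

Storing totals rather than averages avoids accumulating rounding error across many updates; the average is derived only when it is asked for.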

How does Lucene work

I would like to find out why Lucene search is so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.
A text search query using MySQL 5 text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.
Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.
Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.
However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.
Lucene creates a big index. The index contains the word id, the number of documents in which the word is present, and the positions of the word in those documents. So when you give a single-word query, it just looks the word up in the index (effectively O(1) for the lookup itself) instead of scanning the documents. The results are then ranked using different algorithms. For a multi-word query, it just takes the intersection of the sets of documents in which the words are present.
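To make the idea concrete, here is a toy inverted index. It is a minimal sketch of the general technique, not Lucene's actual data structures or API; `tokenize`, `build_index` and `search_all` are illustrative names, and the tokenizer is a deliberately naive stand-in for a real analyzer.

```python
from collections import defaultdict
from typing import Dict, List, Set

PUNCTUATION = ".,;:!?"


def tokenize(text: str) -> List[str]:
    """Lower-case, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index: Dict[str, Dict[int, List[int]]], query: str) -> Set[int]:
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)
```

A single-term query is one dictionary lookup, and a multi-term query intersects the document-id sets of its terms; the documents themselves are never scanned at query time.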
Thus Lucene is very very fast.
For more info read this article by Google developers-
In a word: indexing.
Lucene creates an index of your document that allows it to search much more quickly.
It's the same as the difference between a list (an O(N) data structure) and a hash table (an O(1) data structure). With the list, you have to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.
I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."
My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to do a table scan on every row, which will be O(N).
Lucene gets to parse the document into tokens, group them into n-grams at your direction, and calculate indexes for each one of those. It's O(1) to find a word in an indexed Lucene document.
Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents that contain it, together with its frequency count, which is nothing but an inverted index on the documents.
Example :
File 1 : Random Access Memory is the main memory.
File 2 : Hard disks are secondary memory.
Lucene creates an inverted index something like
File 1 :
  Term : Random
    Frequency : 1
    Position : 0
  Term : Memory
    Frequency : 2
    Position : 2
    Position : 6
So it is able to search and retrieve the searched content quickly. When there are too many matches for the search query, it orders the results by weight. Consider the search query "Main Memory": it searches for both words individually, and the result would be like,
File 1 : Frequency - 1
File 1 : Frequency - 2
File 2 : Frequency - 1
The result would be File 1 followed by File 2. To avoid getting carried away by the weights of very common words like 'and', 'or', 'the', it also considers the inverse document frequency (i.e. it decreases the weight of words that are common across the document set).
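The weighting can be illustrated with the two example files. This is a minimal sketch of a simplified TF-IDF score, not Lucene's actual scoring formula; the function names and the particular IDF variant, log(1 + N/df), are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

PUNCTUATION = ".,;:!?"


def tokenize(text: str) -> List[str]:
    """Lower-case, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def rank(docs: Dict[str, str], query: str) -> List[Tuple[str, float]]:
    """Score each document as the sum of tf * idf over the query terms."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores: Dict[str, float] = {}
    for term in tokenize(query):
        # Document frequency: in how many documents the term occurs.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)  # rarer terms get a larger weight
        for doc_id, counts in term_counts.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "File 1": "Random Access Memory is the main memory.",
    "File 2": "Hard disks are secondary memory.",
}
ranking = rank(docs, "Main Memory")
```

Here "main" occurs in only one file and so gets a higher weight than "memory", which occurs in both; File 1 matches both terms and ranks ahead of File 2.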