Luke: Where are my field values? - lucene

I've used Luke like four times per year for the past three years. I only break it out when I need it. One concept I've never understood is why only certain fields' values are displayed. I can query these "empty" fields for expected values and get the expected results, but Luke never displays these. I assume I'm missing something fundamental and obvious, but it's not so obvious to me.
Example Search tab:
Example Documents tab:

When a program creates a Lucene Document, it can tell Lucene whether or not to store the value of each field. See, for example, the stored argument to the StringField constructor. If the value is not stored, it can still be searched on, but the original bytes of the value are not saved in the index, since they are neither required nor used for searching.
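As a minimal sketch (Lucene 4+ style API; the field names and values are made up), the stored flag is what decides whether Luke has anything to display:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    Document doc = new Document();
    // Indexed AND stored: searchable, and the original value shows up in Luke's Documents tab.
    doc.add(new StringField("id", "42", Field.Store.YES));
    // Indexed but NOT stored: queries on it return hits, but Luke has no value to display.
    doc.add(new TextField("body", "the quick brown fox", Field.Store.NO));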
A typical pattern with, say, http://www.elasticsearch.org/ is to store the original JSON in a single field and not to store the individually indexed fields. That way the application working with the retrieved data can use its native data format and does not have to be aware of Lucene and its flat key-value Document.
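A rough sketch of that pattern (the field names and values are made up, and this only illustrates the layout an application might use, not what Elasticsearch literally does internally):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    String json = "{\"title\":\"Example Title\",\"author\":\"Jane Doe\"}";

    Document doc = new Document();
    // The individual fields are indexed for searching, but their values are not kept.
    doc.add(new TextField("title", "Example Title", Field.Store.NO));
    doc.add(new TextField("author", "Jane Doe", Field.Store.NO));
    // The original JSON is kept in one stored-only field and returned with each hit,
    // so the application never has to reassemble the document from Lucene fields.
    doc.add(new StoredField("_source", json));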

Related

Search in all columns in crate

Elasticsearch has the _all field, but I am not able to find anything like it in CrateDB. So do we need to maintain our own analyzed field for that purpose, or does Crate provide something built in?
The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved.
The _all field allows you to search for values in documents without knowing which field contains the value. This makes it a useful option when getting started with a new dataset.
Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
CrateDB doesn't have anything similar, so you'd either need to spell out the columns in the query or maintain a dedicated catch-all column yourself.

Lucene - Expected behavior when indexing multiple occurrences of a token within a field

Let's say that I'm indexing a string value "useridA;useridB,userdidC,useridA,useridA"
The field is set to ANALYZED and uses a custom CharTokenizer which looks for a boundary comma char.
What is the expected behavior in the index, as the token "useridA" occurs multiples times within the same field?
Will it just re-index the same value and take up the same space as if it had occurred only once?
At the most basic level, Lucene is an "inverted term index": it stores term->docID. So if a term occurs many times in a field, it'll only be recorded once.
Obviously this is a huge simplification. Positional information will also be stored depending on the TermVector value used when adding the field (you will need this to use phrase and slop queries).
Depending on your use case, I'd suggest you de-dupe the list either when indexing or just use a HashSet<string> for that property of whatever your class is.
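A minimal sketch of the de-dupe step in plain Java, splitting on the same boundary characters the custom CharTokenizer looks for (assumed here to be ',' and ';'):

    import java.util.LinkedHashSet;
    import java.util.Set;

    String raw = "useridA;useridB,userdidC,useridA,useridA";

    // Collapse repeated ids before handing the value to the analyzed field.
    Set<String> unique = new LinkedHashSet<>();   // keeps first-seen order
    for (String id : raw.split("[,;]")) {
        if (!id.isEmpty()) {
            unique.add(id);
        }
    }
    String deduped = String.join(",", unique);    // "useridA,useridB,userdidC"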

Yahoo finance API field misalignment

I'm trying to use the Yahoo Finance API to create a custom CSV, but depending upon the stock there is field misalignment.
For instance, if I just want to download the "k3" field for Yahoo, which corresponds to last trade size, I would craft the URL like so:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=k3
However, if you download that csv there are two columns of data.
Similarly, if I decide to get Float Shares, I want the URL:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6
However, that gives me 3 columns. Is there a way to get it in exactly one column? I want to automate this process, but the varying column counts make it very difficult, as different rows end up with different lengths and I am unable to easily match up the column names with the rows.
Bonus: if someone can explain where the 3 float share numbers come from, that would be great; I only seem to be able to match the first one up to the site...
Thank you for your help!
In short, you are describing known bugs that Yahoo isn't going to fix as the feed is officially unsupported.
Specifically, regarding Float Shares (f6): the number returned is the entire float; it is not 3 CSV values. The commas are not delimiters; rather, they are thousands separators. (I suspect the same is the case with k3, as with a couple of other known fields; see the link below.)
Two solutions:
(1) Write your own workaround using conditional statements (if or case) in your code.
(2) Download the buggy parameters separately from the clean ones.
See: Yahoo's official reply to your question.
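For instance, option (2) could look roughly like this (Java; the URL and field code are the ones from the question, and exception handling is omitted):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Fetch the problematic field on its own, so the "extra" columns created by the
    // thousands separators cannot shift any of the other fields.
    URL url = new URL("http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String floatShares = in.readLine();                      // e.g. "887,675,000"
        long value = Long.parseLong(floatShares.replaceAll("[^0-9]", ""));
        System.out.println(value);                               // 887675000
    }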
The multiple columns appear because Excel (or whatever CSV viewer you are using) treats the thousands separator as a column separator. We used to have this problem in our school project, and found a hack which is good only if you are using this API for a hobby project and are not concerned about data usage.
The idea is, instead of treating the result as a CSV, pick a static column (column A) whose value you know beforehand (e.g. column 's', the stock symbol), or put this value as the first column. When constructing the query, use this column to surround the columns with the formatting problem (the float columns). Once you get the quotes.csv, manually separate the results on the column A value.
For example, using
http://download.finance.yahoo.com/d/quotes.csv?s=yhoo&f=sf6sa5sb6
will get you
"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A
Then use ,"YHOO", to separate the results (excluding the first column).
Not an elegant way to solve the problem, but at least it gives you the correct result.
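In code, the separation step might look something like this (Java, using the sample row above; the regex and handling are only a sketch):

    // The sample row returned for f=sf6sa5sb6
    String line = "\"YHOO\", 887,675,000,\"YHOO\",400,\"YHOO\",N/A";

    // Split on the sentinel symbol column instead of on commas.
    String[] parts = line.split(",?\"YHOO\",?");

    // parts[1] = " 887,675,000"  (f6, thousands separators intact)
    // parts[2] = "400"           (a5)
    // parts[3] = "N/A"           (b6)
    for (String part : parts) {
        if (!part.isEmpty()) {
            System.out.println(part.trim());
        }
    }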

Untokenized field in Lucene search

I have stored an untokenized field in the index file. When I try to get that field's value from the index, I'm not able to get it.
Note: I have another untokenized field whose value I am able to get; the data stored in that field doesn't contain any white space.
Example: (smith,david,walter,john)... But the field I'm asking about has white spaces in it. Example: (david smith,mark john,bill man)...
I don't think that should be the reason, though.
Your help is appreciated.
Remember that tokenization, or the lack of it, has to match between indexing time and search time.
Did you try using a KeywordTokenizer on the search side?
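Something along these lines, assuming a recent Lucene version (the field name "author" and the value are just examples):

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // An untokenized field keeps the whole value, spaces included, as a single term,
    // so the query has to match that exact term.
    Query exact = new TermQuery(new Term("author", "david smith"));

    // If you go through QueryParser, give it a KeywordAnalyzer so the quoted text
    // is emitted as one token instead of being split at the space.
    // (parse(...) throws ParseException; handle or declare it in real code.)
    QueryParser parser = new QueryParser("author", new KeywordAnalyzer());
    Query parsed = parser.parse("\"david smith\"");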

Full Text Searching for single characters

I have a table with a TEXT column where the contents are just strings of CSV numbers, for example ",1,76,77,115,". Each string can have an arbitrary number of numbers.
I am trying to set up Full Text Indexing so that I can search this column rapidly. This works great. Instead of running queries with
where MY_COL LIKE '%,77,%' and MY_COL LIKE '%,115,%'
I can do
where CONTAINS(MY_COL,'77 and 115')
However, when I try to search for a single character it doesn't work.
where CONTAINS(MY_COL,'1')
But I know that there should be records returned! I quickly found that I need to edit the Noise file and rebuild the index. But even after doing that it still doesn't work.
Working with relational databases that way is going to hurt.
Use a proper schema. Either store the values in different rows or use an array datatype for the column.
That will make solving the problem trivial.
I fixed my own problem, although I'm not exactly sure what fixed it.
I dropped my table and populated a new one (my program does batch processing) and created a new Full Text Index. Maybe I wasn't being patient enough to allow the indexing to fully rebuild.
Agreed. How does 12,15,33 not return that record for a search for 1 with fulltext? Use an actual table schema to accomplish this.