How does Lucene store a document?

Basically, how is each field inside a document stored in the inverted index? Does Lucene internally create a separate index for each field? Also, suppose a query is on a specific field; how does search work for it internally?
I know how inverted indices work, but how do you store multiple fields in a single index, and how do you know to search only on particular fields when requested?

As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See this answer, How to view Lucene Index, for more details and some sample code. Basically, this codec generates human-readable index files (as opposed to the usual binary compressed formats).
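For reference, here is a minimal sketch of switching an index over to that codec (this assumes the separate lucene-codecs module is on the classpath; package and class names can vary slightly between Lucene versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Configure the writer to use the human-readable SimpleTextCodec instead of the default binary codec.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index_dir")), config);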
Below is a sample of what you can expect to see when you use the SimpleTextCodec.
How do you store multiple fields in a single index?
To show a basic example, assume we have a Lucene text field defined as follows:
Field textField1 = new TextField("bodytext1", content, Field.Store.NO);
And assume we have two documents as follows (analyzed using the StandardAnalyzer):
Document 0: echo charlie delta echo
Document 1: bravo alfa charlie
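In code, indexing those two documents might look like this (a sketch, reusing the writer configured above; imports from org.apache.lucene.document omitted):

Document doc0 = new Document();
doc0.add(new TextField("bodytext1", "echo charlie delta echo", Field.Store.NO));
writer.addDocument(doc0);   // becomes document 0

Document doc1 = new Document();
doc1.add(new TextField("bodytext1", "bravo alfa charlie", Field.Store.NO));
writer.addDocument(doc1);   // becomes document 1

writer.close();             // flush and commit so the index files are written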
This will give us a basic hierarchical index structure as follows:
field bodytext1
  term alfa
    doc 1
      freq 1
      pos 1
  term bravo
    doc 1
      freq 1
      pos 0
  term charlie
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 2
  term delta
    doc 0
      freq 1
      pos 2
  term echo
    doc 0
      freq 2
      pos 0
      pos 3
The general structure is therefore:
field [field 1]
  term [token value]
    doc [document ID]
      frequency
      position
field [field 2]
  term [token value]
    doc [document ID]
      frequency
      position
And so on, for as many fields as are indexed.
This structure supports basic field-based querying.
You can summarize it as:
field > term > doc > freq/pos
So, "does Lucene internally create a separate index for each field?" Yes, it does.
Lucene can also store additional structures in its index files, depending on how you configure your Lucene fields, so this is not the only way data can be indexed.
For example, you can request "term vector" data to also be indexed, in which case you will see an additional index structure:
doc 0
  numfields 1
  field 1
    name content2
    positions true
    offsets true
    payloads false
    numterms 3
    term charlie
      freq 1
      position 1
        startoffset 6
        endoffset 13
    term delta
      freq 1
      position 2
        startoffset 15
        endoffset 20
    term echo
      freq 2
      position 0
        startoffset 0
        endoffset 4
      position 3
        startoffset 23
        endoffset 27
doc 1
...
This structure starts with documents, not fields - and is therefore well suited for processing which already has a document selected (e.g. the "top hit" document). With this, it is easy to locate the position of a matched word in a specific document field.
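To have term vector data written at all, the field has to request it at index time, and it can then be fetched per document at search time. A sketch, using the content2 field name from the sample above (the FieldType flags are standard Lucene; adjust to your own field setup):

// Index time: ask for term vectors with positions and offsets on this field.
FieldType vectorType = new FieldType(TextField.TYPE_NOT_STORED);
vectorType.setStoreTermVectors(true);
vectorType.setStoreTermVectorPositions(true);
vectorType.setStoreTermVectorOffsets(true);
doc.add(new Field("content2", content, vectorType));

// Later, with a document already selected (e.g. the top hit):
Terms termVector = reader.getTermVector(0, "content2");   // terms, positions and offsets for doc 0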
This is far from a comprehensive list. But by using SimpleTextCodec, together with different field types, documents and analyzers, you can see for yourself exactly how Lucene indexes its data.

Related

Extractive Text Summarization: Weighting sentence location in document

I am looking at an extractive text summarization problem. Eventually, I want to generate a list of words (not sentences) that seem to be the most important. One of the ideas I had was to weight the words that appear early in the document more heavily.
I have two dataframes. The first is a set of words with their occurrence counts:
words.head()
words occurrences
0 '' 2
1 11-1 1
2 2nd 1
3 april 1
4 b.
And the second is a set of sentences. 0 is the first sentence in the document, 1 is the second, etc.
sentences.head()
sentences
0 Site Menu expandHave a correction?...
1 This will be a chance for ...
2 The event will include...
3 Further, this...
4 Contact:Share:
I managed to accomplish my goal like this:
weights = []
for value in words.index.values:
    weights.append(((len(sentences) - sentences.index.values) *
                    sentences['sentences'].str.contains(words['words'][value])).sum())
weights
[0,
5,
5,
0,
12,...]
words['occurrences'] *= weights
words.head()
words occurrences
0 '' 0
1 11-1 5
2 2nd 5
3 april 0
4 b. 12
However, this seems sort of sloppy. I know that I can use list comprehension (I thought it would be easier to read on here without it) - but, other than that, does anyone have thoughts on a more elegant solution to this problem?

need to extract all the content between two string in pandas dataframe

I have data in a pandas dataframe. I need to extract all the content between the string that starts with "Impact Factor:" and ends with "&#". If the content doesn't have "Impact Factor:", I want null in that row of the dataframe.
this is sample data from a single row.
Save to EndNote online &# Add to Marked List &# Impact Factor: Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#
I want the content like the below in a dataframe:
Journal 2 and Citation Reports 500
Journal 6 and Citation Reports 120
Journal 50 and Citation Reports 360
Journal 30 and Citation Reports 120
Hi, you can just use a regular expression here:
import re
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:(.*?)&#', x))
You may want to strip whitespace too, in which case you could use:
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:\s*(.*?)\s*&#', x))

Generating variable observations for one id to be observation for new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups) and thereby one can observe the characteristics of an individual's friends. What I have is an 8 digit identifier, id, each id's friend id's (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that give the foreign-born status of each friend.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice, MF1 means male friend 1 and then MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
Edit: I did a poor job of explaining the data structure. I have a cross section, so 1 observation per unique id. Row 1 is the first 8-digit id number, with all the variables following across the row. The repeating id numbers are between the friend ids listed for each person (mf1id, for example) and the id column. I hope that is a bit more clear.
Kevin Crow wrote vlookup that makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except it will not work for me. After trying vlookup with my data set, the UCLA data that Dimitriy used for his example, and a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' not unique within `key';"
            di in red /*
                */ "there are multiple observations with different `value'" /*
                */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why this would create a problem, but maybe it does. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close
input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end
program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)
sort aid
list
Well, this fixed my problem. Thanks to all who responded I would not have figured this out without you.

Looping calculations from data frames

I have a large dataset coming in from SQLdf. I use split to order it by an index field from the query and list2env to split these into several data frames. These data frames will have names like 1 through 178. After splitting them, I want to do some calculations on all of them. How should I "call" a calculation for 1 through 178 (the count might change from day to day)?
Simplification: one dataset becomes n data frames split on an index (like this):
return date return benchmark_return index
28-03-2014 0.03 0.05 6095
with typically 252 * 5 obs (i.e., 5 years)
Then I want to split these on the index into (now 178) data frames
and perform typical risk/return analytics from the PerformanceAnalytics package, for example chart.Histogram or charts.PerformanceSummary.
In the next step I would like to group these and insert them into a PDF for each index (the graphs/results, that is).
As others have pointed out, the question lacks a proper example, but indexing of environments can be done as with lists. In order to construct a list that has digits as index values one needs to use backticks, and arguments to [[ when accessing environments need to be characters:
> mylist <- list(`1`="a", `2`="b")
> myenv <- list2env(mylist)
> myenv$`1`
[1] "a"
> myenv[[as.character(1)]]
[1] "a"
If you want to extract values (and then possibly put them back into the environment):
sapply(1:2, function(n) get(as.character(n), envir=myenv) )
[1] "a" "b"
myenv$calc <- with(myenv, paste(`1`, `2`))

BitMap field in Lucene

Is there any way to store a bitmap field in Lucene and search it using bit mask operations?
I have a lot of boolean attributes for an object and instead of having a separate field for each one I'm considering if there's a way to store every attribute as a bit in a bitmap and search using a bitmask.
The field value could be something like:
Attr 1 | Attr 2 | Attr 3 | Attr 4
0 1 0 1
And if I search for documents with Attr 1 & Attr 3, I'd mask with:
Attr 1 | Attr 2 | Attr 3 | Attr 4
1 0 1 0
in a logical AND operation
A kludge is to convert the bit field to a number, then search on numeric values. For example, if you have "0 1 0 1", convert it to "5" and then search on "5". But this does not work directly for "find all documents with Attr 4" if the documents can have other attributes -- you need to search on (in query parser syntax):
+(1 3 5 7 9 11 13 15)
(this assumes that "Attr 4" becomes the least-significant bit in the resulting numeric value (and that the default operator is OR)).
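For illustration, here is roughly how that numeric kludge could look with a point field in a more recent Lucene version (IntPoint and the field name "attrs" are my own assumptions, not something from the question):

// Index time: pack the four boolean attributes into one int (Attr 1 = most-significant of the 4 bits).
int attrs = 0b0101;                          // Attr 2 and Attr 4 set, as in the example value above
doc.add(new IntPoint("attrs", attrs));       // indexed for numeric lookups
doc.add(new StoredField("attrs", attrs));    // optional: keep the raw value retrievable

// Query time: "find all documents with Attr 4" = every value whose least-significant bit is set.
List<Integer> matching = new ArrayList<>();
for (int v = 0; v < 16; v++) {
    if ((v & 0b0001) != 0) {
        matching.add(v);                     // 1, 3, 5, ..., 15
    }
}
Query query = IntPoint.newSetQuery("attrs", matching);

The drawback noted above still applies: each additional attribute doubles the number of values you have to enumerate for a single-bit query, so this only stays practical for a small number of bits.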