Lucene Indexing - lucene

I would like to use Lucene for indexing a table in an existing database. I have been thinking the process is like:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the data base to get the relevant information out
I realize there may not be a one size fits all answer ...

Your system will certainly be more responsive if you store everything in Lucene. A stored field does not affect query time; it only makes the index bigger. And probably not much bigger if only a small portion of the rows hold a lot of data. So if index size is not an issue for your system, I would go with that.

I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
Stored fields increase index size. This can be a problem with a relatively slow I/O system.
Stored fields are all loaded when you load a Document into memory. This can put considerable stress on the GC.
Stored fields are likely to impact reader reopen time.
The final answer, of course, is that it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.

When adding a row from the database to Lucene, you can decide whether each column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing too much data to the inverted index.
Likewise, you can decide whether a column's value will need to be fetched back from Lucene by key. If not, you needn't use Store.YES to store the data.
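To make the two options from the question concrete, here is a minimal toy model in Python. It is not Lucene code: the ToyIndex class, the DATABASE dict and the sample rows are all hypothetical, and the sketch only illustrates that searching uses the inverted index either way, while the large field is read back either from the index's stored values or from the original data store.

```python
from collections import defaultdict

# Hypothetical stand-in for the source database table: primary key -> row.
DATABASE = {
    1: {"title": "Intro", "body": "lucene stores and indexes fields"},
    2: {"title": "Notes", "body": "a stored field makes the index bigger"},
}


class ToyIndex:
    """Toy model: indexed text feeds an inverted index, stored values are kept verbatim."""

    def __init__(self, store_body):
        self.store_body = store_body
        self.postings = defaultdict(set)  # term -> set of primary keys
        self.stored = {}                  # primary key -> stored field values

    def add(self, pk, row):
        # The body is always indexed (searchable), whether or not it is stored.
        for term in row["body"].lower().split():
            self.postings[term].add(pk)
        kept = {"title": row["title"]}
        if self.store_body:
            kept["body"] = row["body"]
        self.stored[pk] = kept

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

    def body_for(self, pk):
        # Option 1: read the stored copy. Option 2: go back to the database.
        if self.store_body:
            return self.stored[pk]["body"]
        return DATABASE[pk]["body"]


def build(store_body):
    index = ToyIndex(store_body)
    for pk, row in DATABASE.items():
        index.add(pk, row)
    return index
```

Both variants return the same hits for the same query; they differ only in how much the index has to hold and in where the extra round trip happens after a hit.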

Related

Is it a bad idea to create index containing field that changes frequently?

I need to prevent table scan on a big table where a search of a record is based on three fields, one of which may be updated.
The searching query looks like this:
select blabla from ttg_transaction where uti = ? and txn_type = ? and state = ?
The index that comes to mind (not unique, not clustered) would cover the three fields above. But while the first two are constant, 'state' does change during the life cycle of a record.
Is this a good reason to exclude 'state' from the index?
Things that would make this a bad idea
If you have slow storage (spinning metal disks)
If your data types are large (TEXT/NTEXT, VARBINARY, XML, ...)
High frequency updates.
Or a combination of these.
Assuming you have fast storage, I wouldn't worry too much. If you still have slow storage, you could:
Profile the insert/update statements
Create the index
Profile the insert/update statements again once the index exists
Compare the results.
To profile you can use SET STATISTICS IO ON and/or SET STATISTICS TIME ON
To handle index fragmentation you could specify a fill factor that makes sense for your case.
If state is a random text field, this could interfere with your statistics as well, but you didn't specify.
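The profile, create index, profile again, compare loop can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module with an in-memory database rather than SQL Server, the index name ix_txn_lookup and the sample values are made up, and the timings it prints say nothing about your hardware or workload. Updates go through rowid so that the measured difference comes from index maintenance rather than from row lookup.

```python
import sqlite3
import time


def build_table(with_index, rows=2000):
    """Create the table from the question, optionally with the three-column index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ttg_transaction (uti TEXT, txn_type TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO ttg_transaction VALUES (?, ?, ?)",
        [(f"uti-{i}", "SWAP", "NEW") for i in range(rows)],
    )
    if with_index:
        conn.execute(
            "CREATE INDEX ix_txn_lookup ON ttg_transaction (uti, txn_type, state)"
        )
    conn.commit()
    return conn


def time_state_updates(conn, rows=2000):
    """Change 'state' on every row and return the elapsed seconds."""
    start = time.perf_counter()
    for rowid in range(1, rows + 1):
        conn.execute(
            "UPDATE ttg_transaction SET state = 'DONE' WHERE rowid = ?", (rowid,)
        )
    conn.commit()
    return time.perf_counter() - start


def lookup_plan(conn):
    """Return SQLite's query plan text for the search query from the question."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM ttg_transaction "
        "WHERE uti = ? AND txn_type = ? AND state = ?",
        ("uti-1", "SWAP", "NEW"),
    ).fetchall()
    return " ".join(row[-1] for row in plan)


if __name__ == "__main__":
    for with_index in (False, True):
        conn = build_table(with_index)
        elapsed = time_state_updates(conn)
        print(f"index={with_index} update_seconds={elapsed:.4f}")
        print("  plan:", lookup_plan(conn))
```

On SQL Server the same comparison would be run with SET STATISTICS IO ON and SET STATISTICS TIME ON, as mentioned above, against a realistic copy of the table.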

How to find out how much space a SQL Server table uses?

Is it possible to get the amount of space on disk that a particular table uses? Let's say I have a million users stored in my table and I want to know how much space is required to store all users and/or one of them.
Update:
I'm planning to use Redis to cache some fields from one particular table in memory so that the needed data can be retrieved quickly later. So I need to calculate approximately how much space it will take, and thus whether it will fit in memory. It definitely depends on the data types used in the table, but if a table has several dozen fields it would take too much time to count them one by one.
There is exactly such an answer for MySQL, though it is not suitable for SQL Server: How can you determine how much disk space a particular MySQL table is taking up? You can check it to see what I mean.
If you have SSMS, you can right-click on the table in the Object Explorer, go to Properties, and then look at the Storage page. The field, Data space, is the size of the data in that table, but it probably does not include some of the overhead costs of the table.
This is really an extended comment, because it does not directly answer the question.
For most purposes, you just use the size of the columns, add them together, and multiply by the number of rows. This lowballs the estimate, but it is reasonable, and (depending on how you handle the types) it might also be a reasonable estimate of the size of an export of the data.
That said, the storage of tables is a difficult matter. Here are some of the factors you need to take into account:
The size of individual fields. This is made slightly more difficult because some types have varying sizes, so those are entirely data dependent.
The number of pages occupied by a table (or equivalently how full each data page is). Note that this can vary, depending on how full each table is.
The number of pages occupied by "overflow" data types, such as varchar(max).
Whether or not the data pages are compressed or encrypted.
The indexes for the table.
How full each index page is.
And, no doubt, I've left out a bunch of other relevant internal details (here is a place to start on page layouts).
In other words, there isn't a simple answer. Equivalent tables on two different systems could occupy very different amounts of space. This is true of the "same" table on the same system at different times.
The general answer when working with databases is that you need a lot more space than number of rows * row size -- I seem to recall using a factor of 3 at one point in time. In general, storage is pretty cheap, so this is not the limiting factor when using a database.
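As a rough sketch of that arithmetic, the Python snippet below sums per-column sizes, multiplies by the row count, and optionally applies the factor of 3 mentioned above. The schema (id, age, created_at, name) and the average name length are assumptions made up for illustration, and the result ignores page fill, overflow pages, compression and indexes.

```python
# Hypothetical user table; byte sizes follow SQL Server's fixed-width types.
COLUMN_BYTES = {
    "id": 8,             # bigint
    "age": 4,            # int
    "created_at": 8,     # datetime
    "name": 2 * 20 + 2,  # nvarchar, assuming about 20 characters on average
}


def estimate_bytes(column_bytes, row_count, overhead_factor=1.0):
    """Lowball estimate: sum of column sizes times row count, times a fudge factor."""
    row_size = sum(column_bytes.values())
    return int(row_size * row_count * overhead_factor)


raw = estimate_bytes(COLUMN_BYTES, 1_000_000)           # 62,000,000 bytes
padded = estimate_bytes(COLUMN_BYTES, 1_000_000, 3.0)   # 186,000,000 bytes
```

For the Redis use case in the question, the raw figure is only a starting point, since an in-memory store has its own per-key overhead.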
We would need to see your full database schema, with tables and columns and all fields' data types. Without those pieces of information it's just a lucky guess. Here is a helpful cheat sheet of the sizes of each data type: https://www.connectionstrings.com/sql-server-2012-data-types-reference/
Then you just have to do the math and calculate the space needed for X, where X is your number of records.

SOLR index size reduction

We have some massive SOLR indices for a large project, and they are consuming more than 50 GB of space.
We have considered several ways to reduce the size that involve changing the content in the indices, but I am curious whether there are any changes we can make to a SOLR index that would reduce its size by 2 orders of magnitude or more, and that come down to either (1) maintenance commands we can run or (2) simple configuration parameters which may not be set right.
Another relevant question is (3): is there a way to trade index size for performance inside SOLR, and if so, how would it work?
Any thoughts on this would be appreciated... Thanks!
There are a couple things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint), but range queries will be slower when using an int.
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and query your database for the necessary data once you've got the results back from Solr.
Add omitNorms="true" to text fields that don't need length normalization
Add omitPositions="true" to text fields that don't require phrase matching
Special fields, like NGrams, can take up a lot of space
Are you removing stop words from text fields?

Can anyone please explain "storing" vs "indexing" in databases?

What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there any thorough article about this? I have made some searches without luck!
Thanks
Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index' which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
Storing is just putting data in the tables.
Storing vs. indexing is a SOLR concept.
In SOLR, a field that is stored but not indexed cannot be searched or sorted on. It can only be retrieved as part of the result of a query that searches on an indexed field.
In MySQL, by contrast, you can search and sort on unindexed fields too: it will just be slower, but still possible (unlike SOLR).
Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.
Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating a separate data structure that lets the database locate and retrieve that data faster than reading the entire database (or the entire table) and looking at each and every record until it finds what you asked for. Generally, databases use what are called balanced-tree (B-tree) indexes, which extend the concept of a binary tree. Look up binary trees on Google or Wikipedia to get a more in-depth understanding of how this works.
Data
L1. This
L2. Is
L3. My Data
And the index is
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.
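The small example above can be reproduced in a few lines of Python. This is only a sketch of the idea of an inverted index (word to location); the function name build_index is made up, and real engines add analysis, compression and much more.

```python
def build_index(lines):
    """Map each word to the labels of the lines that contain it."""
    index = {}
    for label, text in lines.items():
        for word in text.split():
            index.setdefault(word, []).append(label)
    return index


# The stored data: label -> content.
data = {"L1": "This", "L2": "Is", "L3": "My Data"}

# The index: word -> where to find it.
index = build_index(data)
```

Looking up index["Data"] gives ["L3"] directly, without scanning every stored line.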

How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size.
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text field etc., only once...so the size is dependent on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample and index it, and use that to extrapolate out for the complete source collection. However, the ratio of index size to source size decreases over time as well, as the words are already there in the index, so you might want to make the sample a decent percentage of the original.
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?
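To see why repeated terms matter and why a small sample can mislead, here is a rough Python sketch that tracks how many distinct terms have been seen as documents are added. It models only the term dictionary, using a naive whitespace tokenizer; the function name and the sample documents are made up, and a real Lucene index also holds postings, stored fields and possibly term vectors.

```python
def vocabulary_growth(docs):
    """Return the number of distinct terms seen after each document is added."""
    seen = set()
    growth = []
    for doc in docs:
        seen.update(doc.lower().split())
        growth.append(len(seen))
    return growth


# Five copies of the same text: the vocabulary stops growing immediately.
repeated = ["the quick brown fox"] * 5

# Five documents with wholly unique terms: the vocabulary grows linearly.
unique = [f"term{i}a term{i}b term{i}c term{i}d" for i in range(5)]

repeated_growth = vocabulary_growth(repeated)  # [4, 4, 4, 4, 4]
unique_growth = vocabulary_growth(unique)      # [4, 8, 12, 16, 20]
```

Real collections fall somewhere between these two extremes, which is why indexing a decent-sized sample and extrapolating from it tends to be more reliable than a formula.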