Splunk field extraction limit for a specific index - indexing

In limits.conf, field extraction is limited to 200 fields; we want 500 fields extracted for one specific index. We tried applying the limit to the sourcetype of that index, but it is not working. If we raise it globally in limits.conf, it will affect every index, leading to high storage use. Is there a way to change the field extraction limit for a specific index?

Changing limits.conf affects the entire system; there is no setting governing the number of fields each individual index can extract.
Changing limits.conf will not necessarily increase storage use, as that depends on the actual number of fields extracted.

Related

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space of my BigQuery and Google Storage tables. Is there an easy way to find the cumulative space that each field in a table takes up? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (not running) the query below, changing <column_name> to the field of your interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which includes the respective size.
Important: you do not need to run it; just check the validation message for bytesProcessed, and that will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do this kind of column profiling for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement for each column, dry-run it, and read totalBytesProcessed, which, as noted above, is the size of the respective column.
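For illustration, a minimal sketch of that loop using the google-cloud-bigquery Python client (the table id is a placeholder; the original answer leaves the client library up to you):

from google.cloud import bigquery

# Placeholder table id; replace with your own project.dataset.table.
TABLE_ID = "my-project.my_dataset.my_table"

client = bigquery.Client()
table = client.get_table(TABLE_ID)  # equivalent of Tables.get: fetches the schema

for field in table.schema:
    # Dry run: the query is validated and priced but never executed.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        f"SELECT `{field.name}` FROM `{TABLE_ID}`",
        job_config=job_config,
    )
    # For a single-column scan, total_bytes_processed approximates that column's size.
    print(f"{field.name}: {job.total_bytes_processed} bytes")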
I don't think this is exposed in any of the metadata.
However, you may be able to get good approximations based on your needs. The number of rows is provided, so for some of the data types you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying, for example, the first 1000 values, and use that for your storage calculations.
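As a hedged sketch of that estimate (table and column names are placeholders, and the "2 bytes plus the UTF-8 encoded size" figure for STRING comes from the pricing page linked above):

from google.cloud import bigquery

# Placeholders; replace with your own table and string column.
TABLE_ID = "my-project.my_dataset.my_table"
COLUMN = "description"

client = bigquery.Client()
table = client.get_table(TABLE_ID)

# Sample the average byte length of the column over the first 1000 values.
sample = client.query(
    f"""
    SELECT AVG(BYTE_LENGTH({COLUMN})) AS avg_len
    FROM (SELECT {COLUMN} FROM `{TABLE_ID}` LIMIT 1000) AS sample_rows
    """
).result()
avg_len = next(iter(sample)).avg_len or 0

# STRING storage is roughly 2 bytes plus the UTF-8 encoded size per value.
estimated_bytes = table.num_rows * (2 + avg_len)
print(f"~{estimated_bytes / 1e9:.2f} GB estimated for column {COLUMN}")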

Solr: how to store a limited amount of text instead of the full body

As per our business requirement, I need to index the full story body (consider a news story, for example), but in the Solr query result I need to return only a preview text (say, the first 400 characters) to bind to the target news listing page.
As far as I know, a field in the schema file can have stored=false or stored=true. The only way I can see right now is to set it to true, take the full story body in the result, and then excerpt the preview text manually, but this does not seem practical because (1) it will occupy GBs of disk space to store the full body and (2) the JSON response becomes very heavy (the query result can return 40K-50K stories).
I also know about limiting the number of records, but for certain reasons we need the complete result at once.
Any help with achieving this requirement efficiently?
In order to display just 400 characters in the news overview, you can simply use the Solr highlighting feature and specify the number of snippets and their size. For instance, the standard highlighter has these parameters:
hl.snippets: Specifies the maximum number of highlighted snippets to generate per field. Any number of snippets from zero to this value may be generated. This parameter accepts per-field overrides.
hl.fragsize: Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates that no fragmenting should be considered and the whole field value should be used. This parameter accepts per-field overrides.
If you want to index everything but store only part of the text, then you can follow the solution advised here in the Solr community.
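A minimal sketch of such a highlighting request from Python (the core name, field names, and query are assumptions, not taken from the question):

import requests

params = {
    "q": "body:election",
    "fl": "id,title",        # do not return the full stored body in the doc list
    "hl": "true",
    "hl.fl": "body",         # field to build the preview from
    "hl.snippets": 1,        # at most one snippet per document
    "hl.fragsize": 400,      # fragment size in characters (~ the 400-char preview)
    "rows": 50,
    "wt": "json",
}
data = requests.get("http://localhost:8983/solr/news/select", params=params).json()

highlighting = data.get("highlighting", {})
for doc in data["response"]["docs"]:
    snippets = highlighting.get(doc["id"], {}).get("body", [])
    print(doc["id"], snippets[0] if snippets else "")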

How to index and serve poems using Apache Solr

I am using Solr 4.10 and have to index poetry data in it. What should the document structure be? Basically, I want to provide search for a term within a poem, and only the specific distich that matches should be returned. Should I index the complete poem as a single document, or one document per distich? I know some poems use two lines for a single concept and some four, etc. What should the storage format be?
Index the distiches individually and link them through a poem identifier and a sequence id. That way you can also retrieve the distich before or after - or the whole poem.
If there are certain use cases that need to treat the poems as a whole instead, create a separate collection and index to both collections. That way you can adjust and tweak the search results as you need, depending on the use case.
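As a hedged illustration of that structure (the core name, field names, and numeric seq field are assumptions):

import requests

# One Solr document per distich, linked back to its poem by poem_id, ordered by seq.
distichs = [
    {"id": "rumi-101_1", "poem_id": "rumi-101", "seq": 1, "text": "first distich ..."},
    {"id": "rumi-101_2", "poem_id": "rumi-101", "seq": 2, "text": "second distich ..."},
]
requests.post("http://localhost:8983/solr/poems/update?commit=true", json=distichs)

# Search for a term and get only the matching distich back.
hit = requests.get(
    "http://localhost:8983/solr/poems/select",
    params={"q": "text:love", "fl": "poem_id,seq,text", "wt": "json"},
).json()["response"]["docs"][0]

# Fetch the neighbouring distichs (drop the seq filter to fetch the whole poem).
context = requests.get(
    "http://localhost:8983/solr/poems/select",
    params={
        "q": f'poem_id:"{hit["poem_id"]}" AND seq:[{hit["seq"] - 1} TO {hit["seq"] + 1}]',
        "sort": "seq asc",
        "wt": "json",
    },
).json()["response"]["docs"]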

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, and 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency; Document B's score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization so that Document B gets the higher score?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that comes down to the norms for the tags field, which are taken into account when computing the score with the default tf/idf similarity.
Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is across the index, in order to compare it with the other terms in the query (in your case it makes no difference, since you are searching for a single term).
But there is another factor, called norms, that rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms are enabled by default when the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (this usually makes sense, but depends on your data and domain), or add the "omit_norms": true option to the mapping for your tags field.
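A minimal sketch of such a mapping, assuming a 1.x-era Elasticsearch string field (the index and type names are placeholders, and the field generally needs to be mapped this way before existing documents are reindexed):

import requests

# Disable length normalization on "tags" so a short tag list is not rewarded.
mapping = {
    "post": {
        "properties": {
            "tags": {
                "type": "string",
                "index": "not_analyzed",  # or keep it analyzed and only omit norms
                "omit_norms": True,
            }
        }
    }
}
resp = requests.put("http://localhost:9200/blog/_mapping/post", json=mapping)
print(resp.json())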
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use the dfs_query_and_fetch search type to execute an initial scatter phase that computes the distributed term frequencies for more accurate scoring.
You can read more here.
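For illustration, a hedged sketch of passing that search type on a 1.x-era cluster (the index and type names are placeholders; newer Elasticsearch versions accept only the *_then_fetch variants):

import requests

query = {"query": {"match": {"tags": "Twitter"}}}
resp = requests.post(
    "http://localhost:9200/blog/post/_search",
    params={"search_type": "dfs_query_and_fetch"},  # dfs_query_then_fetch on newer versions
    json=query,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("tags"))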

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25F scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, but I got lost at the part where he gets the average field length and the document length. Could someone guide me as to how or where I can get them?
You can get field lengths from the TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized and store the counts for averaging. If your document collection is static, just dump the values for each field into a file and process them after indexing. If the index is updated with additions only, you can store the number of documents and the total (or average) length per field and recompute the average incrementally. If documents are going to be removed and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.
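A hedged sketch of that external bookkeeping (the whitespace split is a stand-in for your actual analysis chain, and the file name is arbitrary):

from collections import defaultdict
import json

# Running totals per field, kept outside the Lucene index.
field_totals = defaultdict(lambda: {"doc_count": 0, "token_count": 0})

def record_document(doc: dict) -> None:
    # Call this for every document as you feed it to the indexer.
    for field, text in doc.items():
        tokens = text.split()  # replace with the output of your real analyzer
        field_totals[field]["doc_count"] += 1
        field_totals[field]["token_count"] += len(tokens)

def average_field_lengths() -> dict:
    return {
        field: stats["token_count"] / stats["doc_count"]
        for field, stats in field_totals.items()
        if stats["doc_count"]
    }

# Persist alongside the index so the BM25F scoring code can load it at search time.
record_document({"title": "lucene scoring", "body": "bm25f needs the average field length"})
with open("field_stats.json", "w") as fh:
    json.dump({"totals": field_totals, "averages": average_field_lengths()}, fh)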