Elastalert - Text fields are not optimised for operations that require per-document field data - Please use a keyword field instead

I have set up ElastAlert on a server and managed to run a rule to monitor disk usage. It was running fine until I had to rebuild the server. Now when I run the rule I get the error below. I can't find a solution on the Internet. Any ideas? Thank you in advance.
ERROR:root:Error running query: RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [host.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')
alert rule:
name: "warning:High Disk Usage - Disk use is over 85% of capacity:warning"
type: metric_aggregation
index: metricbeat-*
metric_agg_key: system.filesystem.used.pct
metric_agg_type: avg
alert_subject: "High Disk Usage"
max_threshold: 0.15
filter:
- term:
    metricset.name: filesystem
- term:
    system.filesystem.mount_point: "/"
query_key: host.name
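The error itself points at the likely fix: in the rebuilt metricbeat indices, host.name is now mapped as a text field, and the query_key is used in a terms aggregation, which needs a keyword (or otherwise fielddata-enabled) field. If the rebuilt indices were created with dynamic mapping, which adds a .keyword sub-field to string fields, pointing the rule at that sub-field should work, e.g.:
query_key: host.name.keyword
Alternatively, reload the Metricbeat index template so host.name is mapped as keyword again, which is presumably how it was mapped before the rebuild, since the rule used to run.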

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads roughly 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we do some minor mapping and load the result as a broadcast variable for later use.
Now, in the local environment, we get a timeout exception from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with a local "spark.master".
When I limit the number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configurations in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped! I just hit the timeout after a longer wait...
I realized that the data frame I'm reading from the DB has a single partition of 62K records. After repartitioning it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a Spark configuration that can solve this instead of repartitioning?
Thanks!
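For reference, a minimal sketch of the repartition workaround described in the update; the partition count of 4 is just an illustrative value:
dataframe.select("id", "mapping_id")
  .repartition(4) // break the single 62K-row partition into smaller blocks before collecting
  .rdd.map(row => row.getString(0) -> row.getLong(1))
  .collectAsMap().toMap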

Exceeded max configured index size while indexing document while running LS Agent

In our project we have a LotusScript agent which is supposed to delete and re-create the full-text (FT) index. We haven't yet figured out why updating the index cannot be accomplished automatically (although in the database options this is set to update the index daily), so we decided to do roughly the same thing with our own very simple agent. The agent looks like this and runs nightly:
Option Public
Option Declare

Dim s As NotesSession
Dim ndb As NotesDatabase

Sub Initialize
    Set s = New NotesSession
    Set ndb = s.CurrentDatabase
    Print("BEFORE REMOVING INDEXES")
    Call ndb.RemoveFTIndex()
    Print("INDEXES HAVE BEEN REMOVED SUCCESSFULLY")
    Call ndb.CreateFTIndex(FTINDEX_ALL_BREAKS, True)
    Print("INDEXES HAVE BEEN CREATED SUCCESSFULLY")
End Sub
In most cases it works very well, but sometimes, when somebody creates a document that exceeds 12 MB (we really don't know how that is possible), the agent fails to create the index (and by then the old index has already been deleted).
Error message is:
31.05.2018 03:01:25 Full Text Error (FTG): Exceeded max configured index
size while indexing document NT000BD992 in database index *path to FT file*.ft
My question is: how do we avoid this problem? We've already raised the default 6 MB limit with the console command SET CONFIG FTG_INDEX_LIMIT=12582912. Can we raise it even further? And in general, how should we solve this? Thanks in advance.
Using FTG_INDEX_LIMIT is an option to avoid this error, yes, but it will impact server performance in two ways: FT index update processes will take more time and use more memory.
In theory there is no maximum for this limit, but since the update processes take memory from the common heap, setting it too high can lead to out-of-memory/overheaping errors and a server crash.
You can try to exclude attachments from the index: I don't think anyone can type more than 1 MB of text into a single document, but users can attach large text files, and that is what produces the error you are writing about.
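(In the agent above, excluding attachments would mean making sure the options passed to CreateFTIndex do not include FTINDEX_ATTACHED_FILES; with only FTINDEX_ALL_BREAKS, attachment contents should already be left out of the rebuilt index.)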
P.S. And yes, I agree with Scott: why do you need such an agent anyway? The regular FT indexing usually works fine.

HBase-indexer & Solr: data not found

I am currently using hbase-indexer to index HBase data in Solr.
When I execute the following command to check the indexer:
hbase-indexer$ bin/hbase-indexer list-indexers --zookeeper 127.0.0.1:2181
the result is:
myindexer
+ Lifecycle state: ACTIVE
+ Incremental indexing state: SUBSCRIBE_AND_CONSUME
+ Batch indexing state: INACTIVE
+ SEP subscription ID: Indexer_myindexer
+ SEP subscription timestamp: 2017-01-24T13:15:48.614+09:00
+ Connection type: solr
+ Connection params:
+ solr.zk = localhost:2181/solr
+ solr.collection = tagcollect
+ Indexer config:
222 bytes, use -dump to see content
+ Indexer component factory:
com.ngdata.hbaseindexer.conf.DefaultIndexerComponentFactory
+ Additional batch index CLI arguments:
(none)
+ Default additional batch index CLI arguments:
(none)
+ Processes
+ 1 running processes
+ 0 failed processes
I think hbase-indexer is working, because it shows "+ 1 running processes". (Prior to this, I had already started the hbase-indexer daemon with the command ~$ bin/hbase-indexer server.)
As a test, I inserted data into HBase with the put command and confirmed that the data was inserted.
But the Solr query returns no records:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":7,
    "params":{
      "q":"*:*",
      "indent":"on",
      "wt":"json",
      "_":"1485246329559"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]}
}
I would appreciate any knowledge or experience you can share on this. Thank you.
We encountered the same issue.
Since you say the server instance is healthy, below are the reasons it might not work.
First, if the write-ahead log (WAL) is disabled (perhaps for write-performance reasons), your puts won't create Solr documents.
The HBase NRT indexer works off the WAL; if it is disabled, no Solr documents are created.
Second, the Morphline configuration may be incorrect; if it is, no Solr documents are created either.
That said, I'd suggest writing a custom MapReduce program (or a Spark job) to index Solr documents by reading the HBase data. Note that this is not real time: data you put into HBase won't be reflected immediately; the Solr documents are created only after the MapReduce/Spark indexer runs.
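As a rough illustration of that batch approach (a sketch, not a drop-in solution): a Spark job in Scala could scan the HBase table and push documents to Solr with SolrJ. The table name, column family/qualifier, and Solr field names below are assumptions; the ZooKeeper address and collection come from the indexer listing in the question.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object HBaseToSolrBatchIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-to-solr-batch").getOrCreate()

    // Read the HBase table as an RDD of (rowkey, Result); "tagtable" is an assumed table name.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "tagtable")
    val rows = spark.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    rows.foreachPartition { partition =>
      // One SolrJ client per partition; ZooKeeper address and collection
      // are taken from the "Connection params" in the indexer listing.
      val solr = new CloudSolrClient.Builder(
        java.util.Collections.singletonList("localhost:2181"),
        java.util.Optional.of("/solr")).build()
      solr.setDefaultCollection("tagcollect")

      val docs = partition.map { case (_, result) =>
        val doc = new SolrInputDocument()
        doc.addField("id", Bytes.toString(result.getRow))
        // "cf" / "tag" are assumed column family / qualifier names.
        doc.addField("tag_s", Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("tag"))))
        doc
      }.toList

      if (docs.nonEmpty) {
        solr.add(docs.asJava)
        solr.commit()
      }
      solr.close()
    }

    spark.stop()
  }
}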

GraphDB's loadrdf tool loads ontology and data very slowly

I am using the GraphDB loadrdf tool to load an ontology and a fairly big dataset. I set pool.buffer.size=800000 and the JVM -Xmx to 24g. I tried both parallel and serial modes; they both slow down once the total statements in the repo go over about 10k, eventually dropping to 1 or 2 statements per second. Does anyone know whether this is normal behavior for loadrdf, or is there a way to optimize the performance?
Edit: I have increased tuple-index-memory. See part of my repository ttl configuration:
owlim:entity-index-size "45333" ;
owlim:cache-memory "24g" ;
owlim:tuple-index-memory "20g" ;
owlim:enable-context-index "false" ;
owlim:enablePredicateList "false" ;
owlim:predicate-memory "0" ;
owlim:fts-memory "0" ;
owlim:ftsIndexPolicy "never" ;
owlim:ftsLiteralsOnly "true" ;
owlim:in-memory-literal-properties "false" ;
owlim:transaction-mode "safe" ;
owlim:transaction-isolation "true" ;
owlim:disable-sameAs "true";
But somehow the process still slows down. It starts with "Global average rate: 1,402 st/s" but drops to "Global average rate: 20 st/s" after "Statements in repo: 61,831". I give my JVM -Xms24g -Xmx36g.
Can you please post your repository configuration? Inside it there is a parameter, tuple-index-memory, which determines the amount of changes (disk pages) that we are allowed to keep in memory; the bigger this value, the fewer flushes we have to do.
Check whether this is set to a value like 20G in your setup and retry the process.
I've looked at your repository configuration ttl. The parameter entity-index-size=45333 needs to be increased, e.g. set it to 100 million (entity-index-size=100000000). The default value for that parameter in GraphDB 7 is 10M, but since you've set it explicitly, your value overrides the default.
You can read more about that parameter here
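For example, with the value suggested above, the corresponding line in the repository ttl would look like this:
owlim:entity-index-size "100000000" ;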

Need help with Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not found the right approach despite multiple attempts.
I have 5 remote directories (which may be added or removed) containing log files. Every 15 minutes I want to download them to my local directory and, after the FTP transfer job completes, perform Lucene indexing. I also want to add the routes dynamically.
Since all those remote machines are different endpoints and different routes, I don't have any particular endpoint to kick all of this off from.
Start
<parallel>
<download remote dir from: sftp1>
<download remote dir from: sftp2>
....
</parallel>
<After above task complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all folders in parallel. Kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how these multiple routes (one per remote directory) should be started/initiated when I don't have a starter endpoint. I would like to run all FTP operations in parallel and, once they complete, start the indexing. Thanks for taking the time to read this post; I really appreciate your help.
I tried something like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b")
from("direct:a").from("sftp:xxx").to("localdir")
from("direct:b").from("sftp:xxx").to("localdir")
camel-ftp supports periodic polling via the consumer.delay property;
add camel-ftp consumer routes dynamically for each server, as shown in this unit test;
you can then aggregate your results based on a size or timeout value to initiate the Lucene indexing, etc.
[todo - put together an example]
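For what it's worth, here is a rough sketch of those pieces, written in Scala against the Camel 2.x Java DSL. The hostnames, credentials, directories, and the seda/direct endpoint names are placeholders, and the 15-minute poll is expressed with the delay option (900000 ms):
import org.apache.camel.{Exchange, Processor}
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.camel.processor.aggregate.UseLatestAggregationStrategy

object FtpDownloadAndIndexRoutes {
  def main(args: Array[String]): Unit = {
    // The list of remote directories can be built at runtime, which is how
    // routes get added "dynamically": one consumer route per entry.
    val remoteDirs = Seq(
      "sftp://user@host1//var/logs?password=secret&delay=900000",
      "sftp://user@host2//var/logs?password=secret&delay=900000")

    val context = new DefaultCamelContext()
    context.addRoutes(new RouteBuilder {
      override def configure(): Unit = {
        // One polling consumer per remote directory, all running in parallel.
        remoteDirs.zipWithIndex.foreach { case (uri, i) =>
          from(uri).routeId(s"sftp-download-$i")
            .to("file:/local/logdir")
            .to("seda:downloaded")
        }

        // Collect download notifications and, after a quiet period
        // (or a known batch size), trigger the Lucene indexing step.
        from("seda:downloaded")
          .aggregate(constant(true), new UseLatestAggregationStrategy())
          .completionTimeout(60000)
          .to("direct:index")

        from("direct:index").process(new Processor {
          override def process(exchange: Exchange): Unit = {
            // call your Lucene indexing code here
            println("starting Lucene indexing")
          }
        })
      }
    })

    context.start()
    Thread.sleep(Long.MaxValue) // keep the JVM alive so the pollers keep running
  }
}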