I have an existing application where there are multiple application flows in it.
All the flows are of JMS messaging flows - where different system exchanges messages of queue.
I want to find out the huge logger statements from the log - which are like more than 10 lines or so.
What i tried - I tried using patterns tab where splunk tells us what are repetitive patterns.
I am good with repetitive patterns - but i want to find out logger statements which are huge in size.
So - is it possible to find out such log statements which are longer/bigger
thank you in advance
Splunk has a built-in field called "linecount" that should do what you want.
index=foo
| where linecount > 10
You can also find the size of an event using the len function.
index=foo
| eval size=len(_raw)
| where size > 5000
Be aware that Splunk truncates large events to 10,000 characters by default, although that setting can be changed in props.conf via TRUNCATE = <n>.
Related
We are creating redis streams dynamically.
We want to keep track of summary of streams like
streamname
count/length
Is there a command that can do this?
SCAN was suggested somewhere but it was not returning expected result
Last option is to maintain diff list of stream-names in redis..
Above are basic requirement.
It would be nice to have info about last element.I think that can be used by
XREVRANGE temp + - COUNT 1
I have a message thread, these messages are coming on splunk.
The chain consists of ten different messages: five messages from one system, five messages from another (backup) system.
Messages from the primary system use the same SrcMsgId value, and messages from the backup system are combined with a common SrcMsgId.
Messages from the standby system also have a Mainsys_srcMsgId value - this value is identical to the main system's SrcMsgId value.
The message chain from the backup system enters the splunk immediately after the messages from the main system.
Tell me how can I display a chain of all ten messages? Perhaps first messages from the first system (main), then from the second (backup) with the display of the time of arrival at the server.
With time, I understand, I will include _time in the request. I got a little familiar with the syntax of queries, but still I still have a lot of difficulties with creating queries.
Please help me with an example of the correct request.
Thank you in advance!
You're starting with quite a challenging query! :-)
To combine the two chains, they'll need a common field. The SrcMsgId field won't do since it can represent different message chains. What you can do is create a new common field using Mainsys_srcMsgId, if present, and SrcMsgId. Then link the messages via that field using streamstats. Finally sort by the common field to put them together. Here's an untested sample query:
index=foo
```Get Mainsys_srcMsgId, if it exists; otherwise, get SrcMsgId```
| eval joiner = coalesce(Mainsys_srcMsgId, SrcMsgId)
| streamstats count by joiner
```Find the earliest event for each chain so can sort by it later```
| eventstats min(_time) as starttime by joiner
```Order the results by time, msgId, sequence
| sort starttime joiner count
```Discard our scratch fields```
| fields - starttime joiner count
We are considering moving out log analytics solution from ElasticSearch/Kibana to Splunk.
We currently use "document id" in ElasticSearch to deduplicate records when indexing :
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using hash of the content of the each log-record.
In Splunk, I found the internal field "_cd" which is unique to each record in Splunk index: https://docs.splunk.com/Documentation/Splunk/8.1.0/Knowledge/Usedefaultfields
However, using HTTP Event Collector to ingest records, I couldn't find any way to embed this "_cd" field in the request :
https://docs.splunk.com/Documentation/Splunk/8.1.0/Data/HECExamples
Any tips on how to achieve this in Splunk ?
What are you trying to achieve?
If you're sending "unique" events to the HEC, or you're running UFs on "unique" logs, you'll never get duplicate "records when indexing".
It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.
Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement
We currently use "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using hash of the content of the each log-record.
then you need to evaluate what is going "wrong" in your sending process that you feel you need to pre-clean the data before ingesting it.
It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming-in to be 'correct' from whatever is submitting it.
How are you getting duplicate data in the first place?
Fields in Splunk which begin with the underscore (eg _time, _cd, etc) are not editable/sendable - they're generated by Splunk when it receives data. IOW, they're all internal fields. Searchable. Usable. But not overrideable.
If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats and, when absolutely necessary/unavoidable, dedup).
HEC inputs don't go through the usual ingestion pipeline so not all internal fields are present.
Not that it matters, really, because Splunk doesn't deduplicate at index time. There is no provision for searching data to see if a given record is already present. Any deduplication must be done at search time.
One cannot use the _cd field to deduplicate at search time because two identical records will have different _cd values.
Consider using a tool such as Cribl to add a hash to each ingested record and use that hash in Splunk to deduplicate in your searches.
Good call #RichG. Cribl has some nice options for this use case.
https://cribl.io/blog/streaming-data-deduplication-with-cribl/
Be aware you can add other fields to HEC data if you are using Cribl LogStream. You get many more options using LogStream. It saved my old team so much time and effort.
Is there any pratical way to control quotas and limits on Airflow?.
I'm specially interested on controlling BigQuery concurrency.
There are different levels of quotas on BigQuery . So according to the Operator inputs, there should be a way to check if conditions are met, otherwise waiting for it to fulfill.
It seems to be a composition of Sensor-Operators, querying against a database like redis for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operators approach.
How can I encapsulate all of it under a single operator? To avoid repetition of code,
a single operator: QuotaBigQueryOperator
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an opened feature request to get/set other project quotas via API. You can post there about the specific case you would like to have implemented and follow it to track it and ask for updates.
Meanwhile, as workaround you can try to use the PythonOperator. With it you can define your own custom code and you would be able to implement retries for the queries that you send that get a quotaExceeded error (or the specific error you are getting). In this way you wouldn't have to explicitly check for the quota levels. You just run the queries and retry until they get executed. This is a simplified code for the strategy I am thinking about:
for query in QUERIES_TO_RUN:
while True:
try:
run(query)
except quotaExceededException:
continue # Jumps to the next cycle of the nearest enclosing loop.
break
Because of memory limitation i need to split a result from sql-component (List<Map<column,value>>) into smaller chunks (some thousand).
I know about
from(sql:...).split(body()).streaming().to(...)
and i also know
.split().tokenize("\n", 1000).streaming()
but the latter is not working with List<Map<>> and is also returning a String.
Is there a out of the Box way to create those chunks? Or do i need to add a custom aggregator just behind the split? Or is there another way?
Edit
Additional info as requested by soilworker:
At the moment the sql endpoint is configured this way:
SqlEndpoint endpoint = context.getEndpoint("sql:select * from " + lookupTableName + "?dataSource=" + LOOK_UP_DS,
SqlEndpoint.class);
// returns complete result in one list instead of one exchange per line.
endpoint.getConsumerProperties().put("useIterator", false);
// poll interval
endpoint.getConsumerProperties().put("delay", LOOKUP_POLL_INTERVAL);
The route using this should poll once a day (we will add CronScheduledRoutePolicy soon) and fetch a complete table (view). All the data is converted to csv with a custom processor and sent via a custom component to proprietary software. The table has 5 columns (small strings) and around 20M entries.
I don't know if there is a memory issue. But i know on my local machine 3GB isn't enough. Is there a way to approximate the memory footprint to know if a certain amount of Ram would be enough?
thanks in advance
maxMessagesPerPoll will help you get the result in batches