Search using previous query's values - Splunk

I am relatively new to Splunk, and I am attempting to perform a query like the following. The snippets below each step show some of what's been attempted.
Query for initial set of events containing a string value
* "string_value"
Get list of distinct values for a specific field returned from step 1
* "string_value" | stats list(someField)
Search for events containing any of the specific field's values returned from step 2
* "string_value" | stats list(someField) as myList | search someField in myList
I'm not entirely certain if this can be accomplished. I've read documents on subqueries, foreach, and various aggregate methods, though I am still uncertain on how to achieve what I need.
Other attempts:
someField IN [search * "string_value" | stats list(someField) as myList]
Thanks in advance.

You certainly can sequentially build a search like this, but you're likely better off doing it this way:
index=ndx sourcetype=srctp someField IN("my","list","of","values") "string_value"
| stats values(someField) as someField
The more you can put in your initial search, the better (in general).
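If you do want to do it in two passes, the usual pattern is a subsearch. Here is a minimal sketch, reusing the placeholder index/sourcetype from the search above; the bracketed subsearch runs first, and format turns its rows into an OR of someField=... terms that the outer search then matches against:
index=ndx sourcetype=srctp
    [ search index=ndx sourcetype=srctp "string_value"
      | dedup someField
      | fields someField
      | format ]
The outer search then returns any event whose someField equals one of the values found in the first pass, whether or not it also contains "string_value".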


Splunk join two queries based on result of first query

In Splunk I have two queries like the ones below:
Query 1 - index=mysearchstring1
Result - employid=123
Query 2 - index=mysearchstring2
Here I want to use employid=123 in query 2 as a lookup and return the final result.
Is this possible in Splunk?
It sounds like you're looking for a subsearch.
index=mysearchstring2 [ search index=mysearchstring1 | fields employid | format ]
Splunk will run the subsearch first and extract only the employid field. The results will be formatted into something like (employid=123 OR employid=456 OR ...) and that string will be appended to the main search before it runs.
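For illustration, using the ids 123 and 456 from above, the search that actually executes after the subsearch expands would look roughly like this:
index=mysearchstring2 ( ( employid=123 ) OR ( employid=456 ) )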

Alternative to subsearch to search more than a million entries

Hi, I have a subsearch command which gives me the required results but is dead slow in doing so. I have more than a million log entries to search, which is why I am looking for an optimized solution. I have gone through answers to similar questions but have not been able to achieve what I need.
I have a log with transactions against an entry_id; each entry always has a main entry and may or may not have a subEntry.
I want to count the version numbers for all the mainEntry logs that have a subEntry.
Sample query that I used:
index=index_a [search index=index_a ENTRY_FIELD="subEntry"| fields Entry_ID] Entry_FIELD="mainEntry" | stats count by version
Sample data:
Index=index_a
1) Entry_ID=abcd Entry_FIELD="mainEntry" version=1
   Entry_ID=abcd ENTRY_FIELD="subEntry"
2) Entry_ID=1234 Entry_FIELD="mainEntry" version=1
3) Entry_ID=xyz Entry_FIELD="mainEntry" version=2
4) Entry_ID=lmnop Entry_FIELD="mainEntry" version=1
   Entry_ID=lmnop ENTRY_FIELD="subEntry"
5) Entry_ID=ab123 Entry_FIELD="mainEntry" version=3
   Entry_ID=ab123 ENTRY_FIELD="subEntry"
Please help me optimize this.
It's not entirely clear what your sample data looks like.
Is it that events 1, 4 and 5 have the fields Entry_ID, Entry_FIELD, version, Entry_ID, Entry_FIELD? That is, two occurrences of Entry_ID and Entry_FIELD?
You can try something like the following, but I think you need to explain your data a bit better.
index=index_a Entry_FIELD="subEntry" OR Entry_FIELD="mainEntry"
| stats dc(Entry_FIELD) as Entry_FIELD_Count by Entry_ID, version
| where Entry_FIELD_Count==2
| stats count by version
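Note that if the subEntry events really carry no version field, the by clause above will drop them before the distinct count can reach 2. A sketch of one way around that, assuming the main and sub entries actually share the same field name (the sample data mixes Entry_FIELD and ENTRY_FIELD):
index=index_a Entry_FIELD="mainEntry" OR Entry_FIELD="subEntry"
| eval has_sub=if(Entry_FIELD="subEntry", 1, 0)
| stats max(has_sub) as has_sub, values(version) as version by Entry_ID
| where has_sub=1
| stats count by version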

Postgres use index with `split_part`

Context:
I have a test table:
=> \d+ test
                                        Table "public.test"
    Column     |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id            | character varying(255) |           |          |         | extended |              |
 configuration | jsonb                  |           |          |         | extended |              |
The configuration column contains "well-defined" json, which has a key called source_url (skipping other non-relevant keys). An example value for the configuration column is:
{
"source_url": "https://<resource-address>?Signature=R1UzTGphWEhrTTFFZnc0Q4qkGRxkA5%2BHFZSfx3vNEvRsrlDcHdntArfHwkWiT7Qxi%2BWVJ4DbHJeFp3GpbS%2Bcb1H3r1PXPkfKB7Fjr6tFRCetDWAOtwrDrVOkR9G1m7iOePdi1RW%2Fn1LKE7MzQUImpkcZXkpHTUgzXpE3TPgoeVtVOXXt3qQBARpdSixzDU8dW%2FcftEkMDVuj4B%2Bwiecf6st21MjBPjzD4GNVA%2F6bgvKA6ExrdYmM5S6TYm1lz2e6juk81%2Fk4eDecUtjfOj9ekZiGJVMyrD5Tyw%2FTWOrfUB2VM1uw1PFT2Gqet87jNRDAtiIrJiw1lfB7Od1AwNxIk0Rqkrju8jWxmQhvb1BJLV%2BoRH56OHdm5nHXFmQdldVpyagQ8bQXoKmYmZPuxQb6t9FAyovGMav3aMsxWqIuKTxLzjB89XmgwBTxZSv5E9bkWUbom2%2BWq4O3%2BCrVxYwsqg%3D%3D&Expires-At=1569340020&Issued-At=1568293200"
.
.
}
The URL contains a query param Expires-At
Problem:
There is a scheduled job that runs every 24 hours. This job should find all such records which are expired or about to expire (and then do something about it).
Solution:
I have this query to get my job done:
select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
Explanation:
The query first splits the source_url at Expires-At= and picks the part to the right of it, then splits the resulting string on & and picks the left part, thus getting the exact epoch time needed as text.
The same query also works for the corner case where Expires-At is the last query param in the source_url.
Once it extracts the epoch time as text, it first converts it to a bigint and then to a Postgres timestamp, and this timestamp is compared to check whether it is less than or equal to the time 24 hours from now().
All rows passing the above condition are selected.
So, in each run, the scheduler refreshes all the URLs that will expire in the next 24 hours (including the ones which are already expired).
Questions:
Though this solves my problem, I really don't like this solution. It involves a lot of string manipulation, which I find unclean. Is there a much cleaner way to do this?
If we "have" to go with the above solution, can we even use indexes for this kind of query? I know functions like lower(), upper(), etc. can be indexed, but I really can't think of any way to index this query.
Alternatives:
Unless there is a really clean solution, I am going to go with this:
I would introduce a new key inside the configuration json called expires_at, making sure this gets filled with the correct value every time a row is inserted.
And then directly query this newly added field (with an index on the configuration column).
I admit that this way I am repeating the information Expires-At, but out of all the possible solutions I could think of, this is the one I find cleanest.
Is there a better way than this that you folks can think of?
EDIT:
Updated the query to use substring() with a regex instead of the inner split_part():
select * from test where to_timestamp(split_part(substring(configuration->>'source_url' from 'Expires-At=\d+'), '=', 2)::bigint) <= now() + interval '24 hours';
Given your current data model, I don't find your WHERE condition that bad.
You can index it with
CREATE INDEX ON test (
    to_timestamp(
        split_part(
            split_part(
                configuration->>'source_url',
                'Expires-At=',
                2
            ),
            '&',
            1
        )::bigint
    )
);
Essentially, you have to index the whole expression on the left side of <=. You can only do that if all functions and operators involved are IMMUTABLE, which I think they are in your case.
I would change the data model though. First, I don't see the value of having a jsonb column with a single value in it. Why not have the URL as a text column instead?
You could go further and split the URL into individual parts which are stored in columns.
Whether all this is a good idea depends on how you use the value in the database: often it is a good idea to split off those parts of the data that you use in WHERE conditions and the like, and leave the rest "in a lump". This is to some extent a matter of taste.
You can use a URI parsing module, if that is the part you find unclean. You could use plperl or plpythonu, with whatever URI parser library in them you prefer. But if your json is really "well defined" I don't see much point. Unless you are already using plperl or plpythonu, adding those dependencies probably adds more "dirt" than it removes.
You can build an index:
create index on test (to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint));
set enable_seqscan TO off;
explain select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using test_to_timestamp_idx1 on test (cost=0.13..8.15 rows=1 width=36)
Index Cond: (to_timestamp(((split_part(split_part((configuration ->> 'source_url'::text), 'Expires-At='::text, 2), '&'::text, 1))::bigint)::double precision) <= (now() + '24:00:00'::interval))
I would introduce a new key inside the configuration json called expires_at, making sure this gets filled with the correct value every time a row is inserted.
Isn't that just re-arranging the dirt? It makes the query look nicer, at the expense of making the insert uglier. Perhaps you could put it in an INSERT OR UPDATE trigger.
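For what it's worth, a minimal sketch of that trigger idea; the column, function and trigger names here are made up for illustration, and the extraction expression is the same one already used in the WHERE clause above:
ALTER TABLE test ADD COLUMN expires_at timestamptz;

CREATE FUNCTION test_set_expires_at() RETURNS trigger AS $$
BEGIN
    -- copy the Expires-At epoch out of the URL into a plain column
    NEW.expires_at := to_timestamp(
        split_part(
            split_part(NEW.configuration->>'source_url', 'Expires-At=', 2),
            '&', 1
        )::bigint
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_expires_at_trg
    BEFORE INSERT OR UPDATE ON test
    FOR EACH ROW EXECUTE PROCEDURE test_set_expires_at();

CREATE INDEX ON test (expires_at);
The scheduled job's query then becomes a plain comparison: select * from test where expires_at <= now() + interval '24 hours';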

Splunk breakdown results by matched search phrase

I'm searching for a few different search terms, and I would like stats grouped by which term matched:
"PhraseA" "PhraseB" "PhraseC" | timechart count by <which Phrase matched>
What should be in place of <which Phrase matched>? I will be building a stacked bar chart with the results.
Try creating a category field using eval and case, and using that in your chart:
index=whatever_index "PhraseA" "PhraseB" "PhraseC"
| eval matched_phrase=case(searchmatch("PhraseA"), "PhraseA", searchmatch("PhraseB"), "PhraseB", searchmatch("PhraseC"), "PhraseC")
| timechart count by matched_phrase
There's lots more good info in the Splunk documentation for these functions.
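One thing to double-check: the base search above lists the phrases with an implicit AND, so it only returns events containing all three. If an event can match just one of them, put OR between the phrases; a sketch, assuming the same placeholder index:
index=whatever_index "PhraseA" OR "PhraseB" OR "PhraseC"
| eval matched_phrase=case(searchmatch("PhraseA"), "PhraseA", searchmatch("PhraseB"), "PhraseB", searchmatch("PhraseC"), "PhraseC")
| timechart count by matched_phrase
Also note that case() returns the first matching branch, so an event containing more than one phrase is counted under the first phrase listed.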

Query to loop through data in Splunk

I have the below lines in my log:
...useremail=abc@fdsf.com id=1234 ....
...useremail=pqr@fdsf.com id=4565 ....
...useremail=xyz@fdsf.com id=5773 ....
Capture all those userids for the period from -1d@d to @d.
For each user, search from the beginning of the index until -1d@d and see if the userid is already present by comparing the actual id field.
If it is not present, then add it to the counter.
Display this final count.
Can I achieve this in Splunk?
Thanks!
Yes, there are several ways to do this in Splunk, each varying in degrees of ease and ability to scale. I'll step through the subsearch method:
1) Capture all those userids for the period from -1d@d to @d
You want to first validate a search that returns only a list of ids, which will then be turned into a subsearch:
sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id
2) For each user, search from the beginning of the index until -1d@d and see if the userid is already present by comparing the actual id field
Construct a main search over a different timeframe that uses the subsearch from (1) to match against those ids (note that the subsearch must start with search):
sourcetype=<MY_SOURCETYPE> [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d
This will return a raw dataset of all events from the start of the index up to, but not including, -1d@d that contain the ids from (1).
3) If it is not present, then add it into the counter
Revise that search with a NOT against the entire subsearch and pipe the outer search to stats to see the ids it matched:
sourcetype=<MY_SOURCETYPE> NOT [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d | stats values(id)
4) Display this final count.
Revise the last stats command to return a distinct count number instead:
sourcetype=<MY_SOURCETYPE> NOT [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d | stats dc(id)
Performance considerations:
The above method works reasonably well for datasets under 1 million rows, on commodity hardware. The issue is that the subsearch is blocking, thus the outer search needs to wait. If you have larger datasets to deal with, then alternative methods need to be employed to make this an efficient search.
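If the subsearch becomes the bottleneck, one subsearch-free sketch of the stated requirement (count the ids seen between -1d@d and @d that never appeared before -1d@d) is a single pass over the index plus stats; <MY_SOURCETYPE> is the same placeholder as above:
sourcetype=<MY_SOURCETYPE> earliest=0 latest=@d
| eval is_recent=if(_time >= relative_time(now(), "-1d@d"), 1, 0)
| stats min(is_recent) as only_recent by id
| where only_recent=1
| stats dc(id) AS new_id_count
Since only_recent is 1 only when every event for that id falls inside the last day, the final dc(id) is the number of ids that never appeared earlier in the index.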
FYI, Splunk has a dedicated site where you can get answers to questions like this much faster: http://splunk-base.splunk.com/answers/