Alternative to subsearch to search more than a million entries - Splunk

Hi, I have a subsearch command that gives me the required results but is very slow. I have more than a million log entries to search, which is why I am looking for an optimized solution. I have gone through answers to similar questions but have not been able to achieve what I need.
I have a log with transactions against an Entry_ID; there is always a main entry, which may or may not have a subEntry.
I want to find the count by version number of all the mainEntry logs that have a subEntry.
Sample query that I used:
index=index_a [search index=index_a ENTRY_FIELD="subEntry"| fields Entry_ID] Entry_FIELD="mainEntry" | stats count by version
Sample data
Index=index_a
1) Entry_ID=abcd Entry_FIELD="mainEntry" version=1
   Entry_ID=abcd ENTRY_FIELD="subEntry"
2) Entry_ID=1234 Entry_FIELD="mainEntry" version=1
3) Entry_ID=xyz Entry_FIELD="mainEntry" version=2
4) Entry_ID=lmnop Entry_FIELD="mainEntry" version=1
   Entry_ID=lmnop ENTRY_FIELD="subEntry"
5) Entry_ID=ab123 Entry_FIELD="mainEntry" version=3
   Entry_ID=ab123 ENTRY_FIELD="subEntry"
Please help me optimize this.

It's not entirely clear what your sample data looks like.
Is it that events 1, 4 and 5 have the fields Entry_ID, Entry_FIELD, version, Entry_ID, Entry_FIELD? That is, two occurrences of Entry_ID and Entry_FIELD?
You can try something like the following, but I think you need to explain your data a bit better.
index=index_a Entry_FIELD="subEntry" OR Entry_FIELD="mainEntry"
| stats dc(Entry_FIELD) as Entry_FIELD_Count by Entry_ID, version
| where Entry_FIELD_Count==2
| stats count by version
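Note that if the subEntry events do not carry a version field (as the sample data suggests), grouping by Entry_ID and version will drop them before dc() is computed, so Entry_FIELD_Count would never reach 2. A variant that tolerates that, assuming version is only present on the mainEntry events:
index=index_a Entry_FIELD="subEntry" OR Entry_FIELD="mainEntry"
| stats dc(Entry_FIELD) as Entry_FIELD_Count, values(version) as version by Entry_ID
| where Entry_FIELD_Count==2
| stats count by version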

Related

Search using previous query's values

I am relatively new to Splunk, and I am attempting to perform a query like the following. The snippets below each step show some of what's been attempted.
Query for initial set of events containing a string value
* "string_value"
Get list of distinct values for a specific field returned from step 1
* "string_value" | stats list(someField)
Search for events containing any of the specific field's values returned from step 2
* "string_value" | stats list(someField) as myList | search someField in myList
I'm not entirely certain this can be accomplished. I've read documents on subqueries, foreach, and various aggregate methods, though I am still uncertain how to achieve what I need.
Other attempts:
someField IN [search * "string_value" | stats list(someField) as myList]
Thanks in advance.
You certainly can sequentially build a search like this, but you're likely better off doing it this way:
index=ndx sourcetype=srctp someField IN("my","list","of","values") "string_value"
| stats values(someField) as someField
The more you can put in your initial search, the better (in general).
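If the list of values isn't known ahead of time, a subsearch can generate it from the first search. A sketch, with the same placeholder index, sourcetype, and field names as above:
index=ndx sourcetype=srctp [ search index=ndx sourcetype=srctp "string_value" | dedup someField | fields someField ]
| stats values(someField) as someField
The subsearch's rows are expanded into an OR of someField=value terms for the outer search, subject to the usual subsearch limits (by default around 10,000 results and a 60-second runtime).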

Meaning of these two queries (SQL injection)

Can someone explain why these two queries (sometimes) cause errors? I googled some explanations but none of them were right. I don't want to fix them; these queries are actually meant to be used for a SQL injection attack (error-based SQL injection, I think). The triggered error should be "Duplicate entry". I'm trying to find out why they are only sometimes causing errors.
Thanks.
select count(*)
from information_schema.tables
group by concat(version(), floor(rand()*2));

select count(*), concat(version(), floor(rand()*2)) x
from information_schema.tables
group by x;
It seems the second one is trying to guess which database the victim of the injection is using.
The second one is giving me this:
+----------+------------------+
| count(*) | x                |
+----------+------------------+
|       88 | 10.1.38-MariaDB0 |
|       90 | 10.1.38-MariaDB1 |
+----------+------------------+
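For what it's worth, that output shows the primitive at work: in MySQL/MariaDB, RAND() can be evaluated more than once per row while the GROUP BY temporary table is built, so a row can be looked up under one key and then inserted under another key that already exists, which raises ERROR 1062 "Duplicate entry ... for key 'group_key'" and leaks version() in the error message. Since rand() is unseeded, the collision only happens sometimes; seeding it makes the error reproducible. A minimal sketch, assuming MySQL/MariaDB:
select count(*), concat(version(), floor(rand(0)*2)) x
from information_schema.tables
group by x;
-- floor(rand(0)*2) yields the fixed sequence 0,1,1,0,1,1,... so with a
-- handful of rows the duplicate-key insert (and the error) is guaranteed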
Okay, I'm going to post an answer - and it's more of a frame challenge to the question itself.
Basically: this query is silly, and it should be rewritten; find out what it's supposed to do and rewrite it in a way that makes sense.
What does the query currently do?
It looks like it's getting a count of the tables in the current database... except it's grouping by a calculated column. And that column takes version() and appends either a '0' or a '1' to it (chosen randomly).
So the end result? Two rows, each with a numerical value, the sum of which adds up to the total number of tables in the current database. If there are 30 tables, you might get 13/17 one time, 19/11 the next, followed by 16/14.
I have a hard time believing that this is what the query is supposed to do. So instead of just trying to fix the "error" - dig in and figure out what piece of data it should be returning - and then rewrite the proc to do it.

SQL - Need to identify the latest entry out of multiple groups in a single table

I reviewed Selecting latest entries for distinct entry, but I struggled because my table has more columns that I need in the result set.
We are creating rules around several sets of products, and we need to pull the LATEST rule for each CLASS_ID. By latest, I'm referring to the latest ENTRY_ID.
ENTRY_ID   TIMESTAMP       USER_ID   CLASS_ID   PRICE_POINT
1          3/2/2018 3:40   1         53         50
2          3/2/2018 3:56   1         12         50
3          3/2/2018 4:56   1         24         22
4          3/2/2018 4:57   1         564        22
5          3/3/2018 4:08   1         53         99
6          3/3/2018 4:09   1         53         99
The goal is to get the LATEST timestamp or entry (they should correspond with each other) for EACH class.
The desired output is (ordered by class):
TIMESTAMP       USER_ID   CLASS_ID   PRICE_POINT
3/2/2018 3:56   1         12         50
3/2/2018 4:56   1         24         22
3/3/2018 4:09   1         53         99
3/2/2018 4:57   1         564        22
I've spent a few hours looking at this, and it seems really simple, but I've struggled to find a way to work through it.
There will not be a lot of growth, a few thousand rows max, so I'm looking for code that is simple to understand and learn from rather than tuned for performance.
Thanks!
Kyle
Here is one common method:
select t.*
from t
where t.entry_id = (select max(t2.entry_id) from t t2 where t2.class_id = t.class_id);
With an index on (class_id, entry_id), this is often the fastest solution.
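If your database supports window functions, ROW_NUMBER() is another readable option. A sketch using the question's columns (and t as the table name, as above):
select timestamp, user_id, class_id, price_point
from (
    select t.*,
           row_number() over (partition by class_id order by entry_id desc) as rn
    from t
) latest
where rn = 1
order by class_id;
Note that TIMESTAMP is a reserved word in some dialects and may need quoting.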

Hive query for selecting the top 10 of each category

I have a Hive table with fields similar to:
Seller,catgid,subcatgid,prodid,productdetail1,productdetail2....
Now, I want to extract a list of the top 10 products (based on count) for each subcategory (a combo of Seller, catgid, subcatgid) and want a result like:
Seller1, catg1,subcatg1,{{prodid1,prod1details},{prodid2,prod2details},{prodid3,prod3details},{prodid4,prod4details}....}
Seller2, catg2,subcatg2,{{prodid5,prod5details},{prodid6,prod6details},{prodid7,prod7details},{prodid8,prod8details}....}
So basically I want the product details (preferably in JSON format) for the top 10 products at each subcategory level.
Is this even possible with a Hive query? If yes, could you please provide an example, and if not, is there an alternative?
I found the answer to the above question at http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
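In case that link goes away: the pattern it describes boils down to ranking rows within each group and keeping the first 10. A sketch on Hive 0.11+ (window functions), with products as an assumed source-table name:
select seller, catgid, subcatgid, prodid, cnt
from (
    select seller, catgid, subcatgid, prodid, cnt,
           row_number() over (partition by seller, catgid, subcatgid
                              order by cnt desc) as rn
    from (
        select seller, catgid, subcatgid, prodid, count(*) as cnt
        from products
        group by seller, catgid, subcatgid, prodid
    ) counted
) ranked
where rn <= 10;
Rolling the ten products up into one row with their details would additionally need collect_list() with named_struct(), or a UDF like the Brickhouse one suggested below.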
Mohit,
Take a look at the 'collect_max' UDF in Brickhouse (http://github.com/klout/brickhouse). I think it can provide a more scalable solution for larger datasets, in that you can reduce the amount of sorting that you need to do.

Query to loop through data in Splunk

I have the below lines in my log:
...useremail=abc@fdsf.com id=1234 ....
...useremail=pqr@fdsf.com id=4565 ....
...useremail=xyz@fdsf.com id=5773 ....
I want to:
1) Capture all those userids for the period from -1d@d to @d
2) For each user, search from the beginning of the index until -1d@d and see if the userid is already present by comparing the actual id field
3) If it is not present, add it to a counter
4) Display this final count.
Can I achieve this in Splunk?
Thanks!
Yes, there are several ways to do this in Splunk, each varying in degrees of ease and ability to scale. I'll step through the subsearch method:
1) Capture all those userids for the period from -1d@d to @d
You want to first validate a search that returns only a list of ids, which will then be turned into a subsearch:
sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id
2) For each user, search from the beginning of the index until -1d@d and see if the userid is already present by comparing the actual id field
Construct a main search over a different timeframe that uses the subsearch from (1) to match against those ids (note that the subsearch must start with search):
sourcetype=<MY_SOURCETYPE> [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d
This will return a raw dataset of all events from the start of the index up to but not including -1d@d that contain the ids from (1).
3) If it is not present, add it to a counter
Revise that search with a NOT against the entire subsearch and pipe the outer search to stats to see the ids it matched:
sourcetype=<MY_SOURCETYPE> NOT [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d | stats values(id)
4) Display this final count.
Revise the last stats command to return a distinct count number instead:
sourcetype=<MY_SOURCETYPE> NOT [search sourcetype=<MY_SOURCETYPE> earliest=-1d@d latest=@d | stats values(id) AS id] earliest=0 latest=-1d@d | stats dc(id)
Performance considerations:
The above method works reasonably well for datasets under 1 million rows on commodity hardware. The issue is that the subsearch is blocking, so the outer search needs to wait. If you have larger datasets to deal with, alternative methods are needed to make this an efficient search; one is sketched below.
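One such alternative is a single pass over the whole timerange that classifies each id by when it was first seen, avoiding the blocking subsearch entirely. A sketch, with the same placeholder sourcetype:
sourcetype=<MY_SOURCETYPE> earliest=0 latest=@d
| stats min(_time) as first_seen by id
| where first_seen >= relative_time(now(), "-1d@d")
| stats count as new_ids
An id whose earliest occurrence falls within the last day is, by definition, one that does not appear earlier in the index.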
FYI, Splunk has a dedicated site where you can get answers to questions like this much faster: http://splunk-base.splunk.com/answers/