Splunk query to take a search from one index and add a field's value from another index?

How can I write a Splunk query to take a search from one index and add a field's value from another index? I've been reading explanations that involve joins, subsearches, and coalesce, and none seem to do what I want -- even though the example is extremely simple. I am not sure what I am not understanding yet.
main-index has a src field, which is an IP address, and a field I will restrict my results on. I will look over a short time range, e.g.
index="main-index" sourcetype="main-index-source" main-index-field="wildcard-restriction*" earliest=-1h | stats count by src
other-index has a src_ip field, which is an IP address, and it also has the hostname. It holds DHCP leases, so I need to check a longer time frame and return only the most recent result for a given IP address. I want to get back the hostname from src_nt_host, e.g.
index="other-index" sourcetype="other-index-sourcetype" earliest=-14d
I would like to end up with the following values:
IP address, other-index.src_nt_host, main-index.count
main-index has the smallest number of records, if that helps for performance reasons.

If I understand you correctly, you need to look at two different time ranges in two different indices.
In that case, a join will most likely be needed.
Here's one way it can be done:
index=ndx1 sourcetype=srctp1 field1="someval" src="*" earliest=-1h
| stats count by src
| join src
[| search index=ndx2 sourcetype=srctp2 field2="otherval" src_ip=* src_nt_host=* earliest=-14d
| stats count by src_ip src_nt_host
| fields - count
| rename src_ip as src ]
You may need to flip the order of the searches, depending on how many results they each return, and how long they take to run.
You may also be able to achieve what you're looking for without a join, but we'd need some sample data to give a better answer. One join-free pattern is sketched below.
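For illustration only (untested, and reusing the field names from the question), an append plus a second stats pass can merge the two time ranges:
index="main-index" sourcetype="main-index-source" main-index-field="wildcard-restriction*" earliest=-1h
| stats count by src
| append
    [ search index="other-index" sourcetype="other-index-sourcetype" earliest=-14d
    | stats latest(src_nt_host) as src_nt_host by src_ip
    | rename src_ip as src ]
| stats sum(count) as count values(src_nt_host) as src_nt_host by src
| where count > 0
The final where discards DHCP leases whose IPs never appear in the main search. Bear in mind that append is subject to the same subsearch limits as join; for a very large DHCP index, a scheduled search that populates a lookup table is often the more robust route.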

Related

sdiff - limit the result set to X items

I want to get the diff of two sets in redis, but I don't need to return the entire array, just 10 items for example. Is there any way to limit the results?
I was thinking something like this:
SDIFF set1 set2 LIMIT 10
If not, are there any other options to achieve this in a performant way, considering that set1 can contain millions of objects and set2 is much, much smaller (hundreds)?
More info on what you want to achieve would be helpful. Something like this might require you to duplicate your data, though I don't know if that's something you want.
An option is chunking them:
1. Create a set with a unique generated id that can hold a max of 10 items.
2. Create a sorted set like so:
zadd(key, timestamp, chunkid)
where the timestamp is a unix time and the chunkid is the key that connects to the set. The key can be whatever name you would like it to be, or it could also be a uniquely generated id.
3. Use zrange to grab a specific one.
4. Repeat steps 1-3 for the second set.
Once you have one result from each of your sorted sets ("zset"), you can do your SDIFF using the chunkids.
Note that there are advantages and disadvantages to doing this, like more connection consumption (if calling from a client) and, obviously, a little more processing. It will help immensely if you put this in a Lua script, though.
Hope this helps, or at least gives you an idea of how to model your data. If this is critical data, you might need an automated script of some sort to move your data around to meet the modeling requirement.
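As a rough sketch of the idea with redis-py (the key names and the 10-item chunk size are just examples):
import time
import uuid
import redis

r = redis.Redis()

def add_chunked(zset_key, items, chunk_size=10):
    # Step 1: split items into sets of at most chunk_size members,
    # each stored under a unique generated id.
    for i in range(0, len(items), chunk_size):
        chunk_id = f"chunk:{uuid.uuid4()}"
        r.sadd(chunk_id, *items[i:i + chunk_size])
        # Step 2: zadd(key, timestamp, chunkid)
        r.zadd(zset_key, {chunk_id: time.time()})

add_chunked("zset:set1", ["a", "b", "c", "d"])
add_chunked("zset:set2", ["c", "d", "e"])

# Step 3: use zrange to grab one chunk id from each sorted set
chunk1 = r.zrange("zset:set1", 0, 0)[0]
chunk2 = r.zrange("zset:set2", 0, 0)[0]

# SDIFF on the two small chunks instead of the full sets
print(r.sdiff(chunk1, chunk2))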

What JOIN would be equivalent to this query?

I have three tables, the relevant structure looking like the following:
Routes
| ID |
Runs
| ID | RouteID |
Stops
| ID | RunID | Code | Timestamp |
I’m working on a portion of an application that needs to find the next run given a first run. I’ve got a SQL query that’s doing the job, but it’s turning out to be very slow, even though all of the fields being searched are indexed. It looks like this:
SELECT "RunID"
FROM "Stops"
WHERE "Code" = 'ABC'
AND "RunID" IN ('101', '202', '303')
AND "Timestamp" > '2017-02-07 12:34:56'
ORDER BY "Timestamp" ASC
FETCH FIRST 1 ROWS ONLY
Note that this is just the form the query generally takes. The primary keys are actually UUIDs, and obviously the tables are more complicated than shown above. But the idea is that I want to find the Stops that have a given code, one of a subset of RunIDs, and a timestamp after a given timestamp.
I’m wondering if the IN clause is causing the speed issue. All the above fields within the Stops table are indexed, so I would expect this to be a rather quick search, but it’s taking a few seconds each time, and this is within a loop, so this query is making the entire routine very slow.
So, is a JOIN perhaps the answer? The last piece that leads me to this question is that all the runs in the IN clause's list have the same parent route. So I'm really searching for all the stops that have a given code, are after a given timestamp, and have a parent run whose parent route is a given ID.
But, I’m honestly weak with SQL joins. I keep studying them, but I’ve never really gotten them to click for me. Is a join possibly the answer? And if so, how would I write it?
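For illustration, the join being described might look like this (the route ID literal is a placeholder):
SELECT s."RunID"
FROM "Stops" s
JOIN "Runs" r ON r."ID" = s."RunID"
WHERE s."Code" = 'ABC'
  AND r."RouteID" = 'the-parent-route-id'
  AND s."Timestamp" > '2017-02-07 12:34:56'
ORDER BY s."Timestamp" ASC
FETCH FIRST 1 ROWS ONLY
Whether this outperforms the IN version depends on the execution plan; a composite index on "Stops" covering ("Code", "RunID", "Timestamp") is often what makes this kind of lookup fast, rather than the join itself.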

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the below query, changing <column_name> to the field of your interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run the query. Just check the validation message for bytesProcessed; that is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such column profiling for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each column, dry-run it, and read totalBytesProcessed, which, as noted above, is the size of that column.
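A rough sketch of that loop in Python with the google-cloud-bigquery client (the table id is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

# Tables.get: fetch the table schema
table = client.get_table(table_id)

# Dry-run configuration: the query is validated and costed but never executed
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

for field in table.schema:
    job = client.query(f"SELECT {field.name} FROM `{table_id}`",
                       job_config=job_config)
    print(f"{field.name}: {job.total_bytes_processed} bytes")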
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying, e.g., the first 1000 rows, and use that in your storage calculations.
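For instance (a sketch; the column and table names are placeholders):
SELECT AVG(LENGTH(my_string_field))
FROM (SELECT my_string_field FROM `my_dataset.my_table` LIMIT 1000) AS sample
Multiply the average by the row count, remembering that the pricing page linked above counts a STRING as 2 bytes plus the UTF-8 encoded size.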

How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a low to a high score, for example:
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000, and depending on the limit (and the number of matches) it might never return any 1000 matches, even though they exist. I've tried sort:+- and grouping, but nothing seems to have any impact on the returned results.
So; how can the order of results returned be controlled?
What I ideally want is a selection of matches from the range but I assume this isn't possible, given that the query just starts filling the results in from the top?
For reference, the index looks like this:
function(doc) {
  var score = doc.score;
  index("score", score, {
    "store": "yes"
  });
  ...
I cannot comment on this, so I'm posting an answer here:
Based on the Cloudant documentation on Lucene queries, there isn't a way to sort the results of a query. The sort options given there are for grouping, and even for grouped results I never saw sort work; in any case, it is supposed to sort the sequence of the groups themselves, not the data within them.
#pal2ie, you are correct, and Cloudant has come back to me confirming it. It does make sense, in some way, but I was hoping I could at least control the direction (low->high, high->low). The solution I have implemented to get a better distribution across the range is to not use range queries but instead:
1. create a distribution of the number of desired results for each score in the range (a simple, discrete Gaussian, for example)
2. execute an individual query for each score in the range, with limit set to the number of desired results for that score
3. execute step 2 from min to max, filling up the result
It's not the most efficient approach, since it means multiple round trips to the server, but at least it gives me full control over the distribution across the range.
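A rough sketch of that loop in Python with requests (the account, database, design document, and index names are made up, and the hard-coded weights stand in for whatever distribution you build in step 1):
import requests

SEARCH_URL = "https://ACCOUNT.cloudant.com/mydb/_design/app/_search/by_score"  # hypothetical
AUTH = ("user", "password")  # hypothetical credentials

scores = range(1000, 12001, 1000)
# desired number of results per score, e.g. a discrete Gaussian over the 12 scores
weights = [1, 2, 4, 7, 10, 12, 12, 10, 7, 4, 2, 1]

rows = []
for score, limit in zip(scores, weights):
    resp = requests.get(SEARCH_URL,
                        params={"q": f"score:[{score} TO {score}]", "limit": limit},
                        auth=AUTH)
    resp.raise_for_status()
    rows.extend(resp.json()["rows"])  # fills up from min to max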

Best Way To Index & Search for Value Between Hi & Lo Byte[] Columns?

I know about text indexing, but this is different. I have two byte array columns in a table, labeled StartByteArray and EndByteArray. The StartByteArray column is a starting IP address in byte array form, and likewise EndByteArray is the ending IP. You can think of the high and low columns as the boundaries of a block of IP addresses. It looks like this (just 10 rows shown):
StartIPAddress StartByteArray EndIPAddress EndByteArray
41.0.0.0 0x29000000 41.31.255.255 0x291FFFFF
41.32.0.0 0x29200000 41.47.255.255 0x292FFFFF
41.48.0.0 0x29300000 41.55.255.255 0x2937FFFF
41.56.0.0 0x29380000 41.56.255.255 0x2938FFFF
41.57.0.0 0x29390000 41.57.63.255 0x29393FFF
41.57.64.0 0x29394000 41.57.79.255 0x29394FFF
41.57.80.0 0x29395000 41.57.95.255 0x29395FFF
41.57.96.0 0x29396000 41.57.111.255 0x29396FFF
41.57.112.0 0x29397000 41.57.115.255 0x293973FF
41.57.116.0 0x29397400 41.57.119.255 0x293977FF
That's it. The reason I did this was to make it easier to search for the row that 'contains', or bounds, a given IP address. It sounds harder than it is.
Put another way, I want to search for the row that my given IP address (once also converted to a byte array) falls within.
Now writing the usual SQL is easy (example on SO here, for example), but I've got a feeling there is a clever way to index these columns so that the search is efficient. All I have done before is text indexing, and here there are two columns I'm doing math comparisons against, not letters of words over x characters long.
I'm using SQL Server 2012, and can also convert the data to anything better suited, as I own the DB.
Any thoughts?
I sense there are some misunderstandings here. I hope I'll find them.
Indexing text columns is no different from indexing any other data type. A B-tree based index can index any data type that has a sort order. All it does is keep the index rows sorted by the key columns, which allows for range and point lookups. Binary data, string data, and integer data are all fully supported.
Now writing the usual SQL is easy (example on SO here, for example)
This query does not solve your problem. It returns all rows where the StartByteArray would be in a given range. You want the opposite: you want the search argument to be in the range that a certain row specifies.
I already answered how to look up an IP range.
I've got a feeling there is a clever way to index these columns in such a way that it will be efficient
Just index on StartByteArray. That allows you to find the first row that matches a given IP.
but all I have done is text indexing
Not sure what you mean, but whatever it is, it's probably a misunderstanding.
Using a binary(4) to store IPs is clever. I never thought of doing that. I used a bigint in the past. That takes twice the amount of storage that would strictly be needed, though.
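To make the suggested StartByteArray lookup concrete, a sketch in T-SQL (the table and variable names are hypothetical):
-- the IP to look up, already converted to binary(4): 41.56.90.15
DECLARE @ip binary(4) = 0x29385A0F;

SELECT TOP (1) StartIPAddress, EndIPAddress, EndByteArray
FROM dbo.IPRanges
WHERE StartByteArray <= @ip
ORDER BY StartByteArray DESC;
-- if the returned row's EndByteArray is >= @ip, the IP falls inside that range;
-- otherwise no range covers it
With an index on StartByteArray, this is a single seek to the last range starting at or below the given IP.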