Enrich Splunk search data based on temporal correlation from another search - splunk

I am trying to enrich my Table1 data by adding field_to_enrich1 and field_to_enrich2 from Table2, where field1-field3 match and Table2's _time is right before the _time of my Table1 event.
To clarify based on comments: by "right before" I mean the first Table2 event that occurs immediately prior to the _time of my current Table1 event, where field1-field3 all match.
I have conducted a left join on field1,field2,field3, but am trying to figure out how to add the _time correlation between the two tables.
I have two tables within Splunk like the ones below.
Table1
_time,field1,field2,field3,field4
2022-11-10 13:19:55.308,oepwy0s4mjt,n6u,field4_random_123
2022-11-10 13:19:56.308,6onbcity1n2,lwe,field4_random_456
2022-11-10 13:19:57.308,9rfkuntl7qx,2tc,field4_random_567
2022-11-10 13:19:58.308,fn44tlt6rtt,8tm,field4_random_234
2022-11-10 13:19:59.308,gj11nax4o68,lr3,field4_random_458
2022-11-10 13:20:00.308,mdgdj03sx9c,7pc,field4_random_124
Table2
_time,field1,field2,field3,field_to_enrich1,field_to_enrich2
2022-11-10 13:19:55.108,oepwy0s4mjt,n6u,83zuyt8vdyFF,ljr5furt0mFF
2022-11-10 13:19:55.208,oepwy0s4mjt,n6u,83zuyt8vdy75,ljr5furt0mfs
2022-11-10 13:19:56.108,6onbcity1n2,lwe,yeg1lhraoeGG,ngmly4majhGG
2022-11-10 13:19:56.208,6onbcity1n2,lwe,yeg1lhraoef0,ngmly4majhom
2022-11-10 13:19:57.108,9rfkuntl7qx,2tc,pfe6vssh0qej,me4yghhmj26t
2022-11-10 13:19:57.208,9rfkuntl7qx,2tc,pfe6vssh0qej,me4yghhmj26t
2022-11-10 13:19:58.108,fn44tlt6rtt,8tm,8l06613lartf,bx5h3v9l1udg
2022-11-10 13:19:58.208,fn44tlt6rtt,8tm,8l06613lartf,bx5h3v9l1udg
2022-11-10 13:19:59.208,oepwy0s4mjt,n6u,asdfasdfasdf,asdfasdfasdf
2022-11-10 13:20:00.208,oepwy0s4mjt,n6u,oimlkmjhgggh,asdfiiiidddd
Example output with the above tables is below.
Table3
_time,field1,field2,field3,field_to_enrich1,field_to_enrich2
2022-11-10 13:19:55.308,oepwy0s4mjt,n6u,field4_random_123,83zuyt8vdy75,ljr5furt0mfs
2022-11-10 13:19:56.308,6onbcity1n2,lwe,field4_random_456,yeg1lhraoef0,ngmly4majhom
2022-11-10 13:19:57.308,9rfkuntl7qx,2tc,field4_random_567,pfe6vssh0qej,me4yghhmj26t
2022-11-10 13:19:58.308,fn44tlt6rtt,8tm,field4_random_234,8l06613lartf,bx5h3v9l1udg
2022-11-10 13:19:59.308,gj11nax4o68,lr3,field4_random_458,FILLNULL,FILLNULL2
2022-11-10 13:20:00.308,mdgdj03sx9c,7pc,field4_random_124,FILLNULL,FILLNULL2
Any help would be greatly appreciated.

Answer #1
I prefer to avoid join because it is expensive, but I don't have a cheaper alternative here. We can approximate the "_time right before" requirement by using the dedup command to discard all but the latest Table2 event for a given set of fields. (Note that this keeps the latest event per triplet overall, not the latest before each individual Table1 timestamp; Answer #2 below handles that stricter case.)
<<your search for Table1>>
| fields _time,field1,field2,field3,field4
| join type=left field1,field2,field3 [
<<your search for Table2>>
| fields _time,field1,field2,field3,field_to_enrich1,field_to_enrich2
```Keep only the most recent event for each triplet```
| dedup field1,field2,field3
]
| fillnull value="FILLNULL" field_to_enrich1
| fillnull value="FILLNULL2" field_to_enrich2
| table _time,field1,field2,field3,field_to_enrich1,field_to_enrich2
Answer #2
Here's some ugliness that should handle duplicative events - at least it works with the sample data. Note: I've removed references to 'field3' since it's not included in the data. Also, I changed _time to time in the samples so _time can be used in the query.
| makeresults
| eval data="time,field1,field2,field4
2022-11-10 13:19:55.308,oepwy0s4mjt,n6u,field4_random_123
2022-11-10 13:19:56.308,6onbcity1n2,lwe,field4_random_456
2022-11-10 13:19:57.308,9rfkuntl7qx,2tc,field4_random_567
2022-11-10 13:19:58.308,fn44tlt6rtt,8tm,field4_random_234
2022-11-10 13:19:59.308,gj11nax4o68,lr3,field4_random_458
2022-11-10 13:20:00.308,mdgdj03sx9c,7pc,field4_random_124"
| eval _raw=data
| multikv forceheader=1
| eval _time=strptime(time,"%Y-%m-%d %H:%M:%S.%3N")
| sort - _time
```Above defines test data. Replace with your search for Table1```
| fields _time,field1,field2,field4
```Define fields we'll need in the map command```
| eval etime=_time,efield1=field1,efield2=field2,efield4=field4
```Repeat a search for each event in Table1
Change the value of maxsearches based on the expected number of rows in Table1```
| map maxsearches=1000 search="|makeresults | eval data=\"time,field1,field2,field_to_enrich1,field_to_enrich2
2022-11-10 13:19:55.108,oepwy0s4mjt,n6u,83zuyt8vdyFF,ljr5furt0mFF
2022-11-10 13:19:55.208,oepwy0s4mjt,n6u,83zuyt8vdy75,ljr5furt0mfs
2022-11-10 13:19:56.108,6onbcity1n2,lwe,yeg1lhraoeGG,ngmly4majhGG
2022-11-10 13:19:56.208,6onbcity1n2,lwe,yeg1lhraoef0,ngmly4majhom
2022-11-10 13:19:57.108,9rfkuntl7qx,2tc,pfe6vssh0qej,me4yghhmj26t
2022-11-10 13:19:57.208,9rfkuntl7qx,2tc,pfe6vssh0qej,me4yghhmj26t
2022-11-10 13:19:58.108,fn44tlt6rtt,8tm,8l06613lartf,bx5h3v9l1udg
2022-11-10 13:19:58.208,fn44tlt6rtt,8tm,8l06613lartf,bx5h3v9l1udg
2022-11-10 13:19:59.208,oepwy0s4mjt,n6u,asdfasdfasdf,asdfasdfasdf
2022-11-10 13:20:00.208,oepwy0s4mjt,n6u,oimlkmjhgggh,asdfiiiidddd\"
| eval _raw=data | multikv forceheader=1
| eval _time=strptime(time,\"%Y-%m-%d %H:%M:%S.%3N\")
```Everything from "|makeresults" to here defines test data.
Replace with your search for Table2```
| sort - _time
```Look for the fields passed in from Table1```
| search field1=$efield1$ field2=$efield2$ _time<$etime$
```If get more than one, pick the first (most recent)```
| head 1
```Use the values of _time and field4 from Table1```
| eval _time=$etime$, field4=\"$efield4$\"
```If nothing was found in Table2 then assign values```
| appendpipe [stats count | eval _time=$etime$, field1=\"$efield1$\", field2=\"$efield2$\", field4=\"$efield4$\", field_to_enrich1=\"FILLNULL\", field_to_enrich2=\"FILLNULL2\" | where count=0 | fields - count]
| table _time,field1,field2,field4,field_to_enrich1,field_to_enrich2"
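Stripped of the test-data stanzas, the production form of the search is sketched below; the << >> placeholders are as above, and maxsearches should be raised to the expected number of Table1 rows:
<<your search for Table1>>
| fields _time,field1,field2,field4
| eval etime=_time, efield1=field1, efield2=field2, efield4=field4
| map maxsearches=1000 search="<<your search for Table2>>
| sort - _time
| search field1=$efield1$ field2=$efield2$ _time<$etime$
| head 1
| eval _time=$etime$, field4=\"$efield4$\"
| appendpipe [stats count | eval _time=$etime$, field1=\"$efield1$\", field2=\"$efield2$\", field4=\"$efield4$\", field_to_enrich1=\"FILLNULL\", field_to_enrich2=\"FILLNULL2\" | where count=0 | fields - count]
| table _time,field1,field2,field4,field_to_enrich1,field_to_enrich2"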

Related

Tracking Growth of a Metric over Time In TimescaleDB

I'm currently running TimescaleDB. I have a table that looks similar to the following:
 one_day                | name | metric_value
------------------------+------+--------------
 2022-05-30 00:00:00+00 | foo  | 400
 2022-05-30 00:00:00+00 | bar  | 200
 2022-06-01 00:00:00+00 | foo  | 800
 2022-06-01 00:00:00+00 | bar  | 1000
I'd like a query that returns the % growth and raw growth of metric, so something like the following.
 name | % growth | growth
------+----------+--------
 foo  | 200%     | 400
 bar  | 500%     | 800
I'm fairly new to TimescaleDB and not sure what the most efficient way to do this is. I've tried using LAG, but the main problem I'm facing is that OVER (... GROUP BY time, name) doesn't restrict the window to rows with the same name, and I can't seem to get around it. The query works fine for a single name.
Thanks!
Use LAG to get the previous value for the same name using the PARTITION option:
lag(metric_value,1,0) over (partition by name order by one_day)
This says: ordered by 'one_day', within each 'name', give me the previous row's value of 'metric_value' (the second argument, 1, means one row back); if there is no previous row, give me '0' (the third argument).
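Putting it together, here is a minimal sketch of the full query; the table name 'metrics' is an assumption, since the question doesn't name the table:
SELECT name,
       round(100.0 * metric_value / prev_value) || '%' AS "% growth",
       metric_value - prev_value AS growth
FROM (
    SELECT name, one_day, metric_value,
           lag(metric_value) OVER (PARTITION BY name ORDER BY one_day) AS prev_value
    FROM metrics  -- assumed table name
) t
WHERE prev_value IS NOT NULL;  -- the first bucket per name has no previous row
Against the sample data this returns 200%/400 for foo and 500%/800 for bar.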

Progress query to remove duplicates based on number of duplicates

Our accounting department needs to pull tax data from our MIS every month and submit it online to the Dept. of Revenue. Unfortunately, when pulling the data, it is duplicated a varying number of times depending on which jurisdictions we have to pay taxes to. All she needs is the dollar amount for one jurisdiction, for one line, because she enters that on the website.
I've tried using DISTINCT to pull only one record of each type, in conjunction with LEFT() to pull just the first 7 characters of the jurisdiction, but it ended up excluding certain results that should have been included. I believe it was because the posting date and the amount on a couple of transactions were identical. They were separate transactions, but the query took them as duplicates and ignored them.
Here are a couple of examples of queries I've run that pull most of the data, but usually either too much or not enough:
SELECT DISTINCT LEFT("Sales-Tax-Jurisdiction-Code", 7), "Taxable-Base", "Posting-Date"
FROM ARInvoiceTax
WHERE ("Posting-Date" >= '2019-09-01' AND "Posting-Date" <= '2019-09-30')
AND (("Sales-Tax-Jurisdiction-Code" BETWEEN '55001' AND '56763')
OR "Sales-Tax-Jurisdiction-Code" = 'Dakota Cty TT')
ORDER BY "Sales-Tax-Jurisdiction-Code"
Here is a query that I ran to pull all of the data, with the result below it:
SELECT "Sales-Tax-Jurisdiction-Code", "Taxable-Base", "Posting-Date"
FROM ARInvoiceTax
WHERE ("Posting-Date" >= '2019-09-01' AND "Posting-Date" <= '2019-09-30')
AND (("Sales-Tax-Jurisdiction-Code" BETWEEN '55001' AND '56763')
OR "Sales-Tax-Jurisdiction-Code" = 'Dakota Cty TT')
ORDER BY "Sales-Tax-Jurisdiction-Code"
Below is a sample of the output:
Jurisdiction | Tax Amount | Posting Date
-------------|------------|-------------
5512100City | $50.00 | 2019-09-02
5512100City | $50.00 | 2019-09-03
5512100City | $70.00 | 2019-09-02
5512100Cnty | $50.00 | 2019-09-02
5512100Cnty | $50.00 | 2019-09-03
5512100Cnty | $70.00 | 2019-09-02
5512100State | $70.00 | 2019-09-02
5512100State | $50.00 | 2019-09-02
5512100State | $50.00 | 2019-09-03
5513100Cnty | $25.00 | 2019-09-12
5513100State | $25.00 | 2019-09-12
5514100City | $9.00 | 2019-09-06
5514100City | $9.00 | 2019-09-06
5514100Cnty | $9.00 | 2019-09-06
5514100Cnty | $9.00 | 2019-09-06
5515100State | $12.00 | 2019-09-11
5516100City | $6.00 | 2019-09-13
5516100City | $7.00 | 2019-09-13
5516100State | $6.00 | 2019-09-13
5516100State | $7.00 | 2019-09-13
As you can see, the data can be all over the place. One zip code can have multiple different lines. What the accounting department does now is print a report with this information and, in a spreadsheet, record only one (1) dollar amount per transaction. For example, for 55121 she would need to record $50.00, $50.00, and $70.00 (she tallies them and enters the total amount on the website); however, the SQL query gives me those (3) numbers, (3) times.
I can't seem to figure out a query that will pull only one set of the data. Unfortunately, I can't do it based on the words/letters after the 00 because not all jurisdictions have all 3 (city, cnty, state) and thus trying to remove lines based on that removes valid lines as well.
Can you use select distinct? If the first five characters are the zip code and you just want that:
select distinct left(jurisdiction, 5), tax_amount
from t;
Or take only one of City/Cnty/State per jurisdiction -- whichever sorts first:
select jurisdiction, tax_amount, Posting_Date
from (
    select *, dense_rank() over(partition by left(jurisdiction, 7) order by substring(jurisdiction, 8, len(jurisdiction))) rnk
    from taxes -- your output here
) ranked
where rnk=1;
SQL Server syntax; you may need other string functions in your DBMS.

Oracle, MySQL, how to get average

How do I get average fuel consumption using only MySQL or Oracle:
SELECT te.fuelName,
zkd.fuelCapacity,
zkd.odometer
FROM ZakupKartyDrogowej zkd
JOIN TypElementu te
ON te.typElementu_Id = zkd.typElementu_Id
AND te.idFirmy = zkd.idFirmy
AND te.typElementu_Id IN (3,4,5)
WHERE zkd.idFirmy = 1054
AND zkd.kartaDrogowa_Id = 42
AND zkd.data BETWEEN to_date('2015-09-01','YYYY-MM-DD')
AND to_date('2015-09-30','YYYY-MM-DD');
Result of this query is:
fuelName | fuelCapacity | odometer | tanking
-------------------------------------------------
'ON' | 534 | 1284172 | 2015-09-29
'ON' | 571 | 1276284 | 2015-09-02
'ON' | 470 | 1277715 | 2015-09-07
'ON' | 580.01 | 1279700 | 2015-09-11
'ON' | 490 | 1281103 | 2015-09-17
'ON' | 520 | 1282690 | 2015-09-23
We can do it later in Java or PHP, but we want to get the result directly from the query. How should we modify the above query to do that?
fuelCapacity is the number of liters of fuel poured into the car's tank at the gas station.
For one total average, what you need is the sum of the refills divided by the difference between the odometer readings at the start and the end, i.e. fuel used / distance travelled.
I don't have your table structure at hand, but this alteration to the select statement should do the trick:
select cast(sum(zkd.fuelCapacity) as float) / (max(zkd.odometer) - min(zkd.odometer)) as consumption ...
The cast(field AS float) does what the name implies, and typecasts the field as float, so the result will also be a float. (I do suspect that your fuelCapacity field is a float because there is one float value in your example, but this will make sure.)
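For reference, here is a sketch of the complete query with the aggregate dropped in (Oracle syntax as in the question; on MySQL versions before 8.0.17 you may need CAST(... AS DECIMAL) instead of FLOAT). Multiplying by 100 gives the conventional liters-per-100-km figure:
SELECT 100 * CAST(SUM(zkd.fuelCapacity) AS FLOAT)
         / (MAX(zkd.odometer) - MIN(zkd.odometer)) AS consumption
FROM ZakupKartyDrogowej zkd
JOIN TypElementu te
  ON te.typElementu_Id = zkd.typElementu_Id
 AND te.idFirmy = zkd.idFirmy
 AND te.typElementu_Id IN (3,4,5)
WHERE zkd.idFirmy = 1054
  AND zkd.kartaDrogowa_Id = 42
  AND zkd.data BETWEEN to_date('2015-09-01','YYYY-MM-DD')
                   AND to_date('2015-09-30','YYYY-MM-DD');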

Filling in the blanks with time series summary data

I'm trying to draw a simple (read: fast) sparkline for "data received from a sensor every n minutes".
The data is very simple: it's one or more readings for a given timestamp, identified by the sensor's MAC address:
# SELECT mac, ants, read_at FROM normalized_readings LIMIT 10;
mac | ants | read_at
-------------------+------+-------------------------
f0:d1:a9:a0:fe:e7 | -87 | 2013-07-14 09:25:15.215
74:de:2b:fa:ca:cf | -69 | 2013-07-14 09:25:14.81
74:de:2b:fa:ca:cf | -69 | 2013-07-14 09:25:14.81
74:de:2b:fa:ca:cf | -69 | 2013-07-14 09:25:15.247
38:aa:3c:8f:a0:4f | -85 | 2013-07-14 09:25:21.672
38:aa:3c:8f:a0:4f | -87 | 2013-07-14 09:25:21.695
60:67:20:c8:bc:80 | -83 | 2013-07-14 09:25:26.73
60:67:20:c8:bc:80 | -81 | 2013-07-14 09:25:26.737
f0:d1:a9:a0:fe:e7 | -83 | 2013-07-14 09:25:36.207
f0:d1:a9:a0:fe:e7 | -91 | 2013-07-14 09:26:07.77
(10 rows)
I'm trying to come up with something like:
# SELECT
mac, date_trunc('minute', read_at) AS minute, COUNT(*)
FROM
normalized_readings
GROUP BY mac, minute LIMIT 10;
mac | minute | count
-------------------+---------------------+-------
00:08:ca:e6:a1:86 | 2013-07-14 16:22:00 | 6
00:10:20:56:7c:e2 | 2013-07-27 05:29:00 | 1
00:21:5c:1c:df:7d | 2013-07-14 09:44:00 | 1
00:21:5c:1c:df:7d | 2013-07-14 09:46:00 | 1
00:21:5c:1c:df:7d | 2013-07-14 09:48:00 | 1
00:24:d7:b3:31:04 | 2013-07-15 06:51:00 | 1
00:24:d7:b3:31:04 | 2013-07-15 06:53:00 | 3
00:24:d7:b3:31:04 | 2013-07-15 06:59:00 | 3
00:24:d7:b3:31:04 | 2013-07-15 07:02:00 | 3
00:24:d7:b3:31:04 | 2013-07-15 07:06:00 | 3
(10 rows)
But notice all the empty periods; I'd like to be able to return 0 for those time periods to indicate that the sensors weren't recording data.
Probably I'll only ever want to show the last 12/24 hours' worth of data, so I suppose I could brute-force this by generating artificial dates from NOW() back 12/24 hours at each resolution (probably 1 or 5 minutes), then querying the readings table and summing the number of readings for each, but this sounds horribly inefficient.
Is there a way to do what I'm trying to do without brute-forcing things? As far as I can see, by grouping on minutes I'm automatically coming at this from the wrong side?
For this type of query, you want a driver table that generates all the combinations of "macs" and "minutes". Postgres has the nice function generate_series() to get a counter for each minute.
So, the idea is to start with all the macs and generate a series for each minute. Then use left outer join from the driver table to get a row for each value.
with t as (
    SELECT mac, date_trunc('minute', read_at) AS minute, COUNT(*) as cnt
    FROM normalized_readings
    GROUP BY mac, minute
    LIMIT 10
)
select driver.mac, driver.minute, coalesce(cnt, 0)
from (select mac, minminute,
             minminute + cast(cast(generate_series(0,
                      cast(extract(epoch from maxminute - minminute)/60 as int)
                    ) as character varying
               )||' minute' as interval
             ) as minute
      from (select mac, min(minute) as minminute, max(minute) as maxminute
            from t
            group by mac
           ) macs
     ) driver left outer join
     t
     on t.mac = driver.mac and
        t.minute = driver.minute
The only issue that I can see is how you get your original data -- the definition of t. I followed the example in the question, but it doesn't actually make sense: you have a LIMIT with no ORDER BY. You should put in an appropriate ORDER BY.
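As an aside, newer PostgreSQL versions let generate_series() step over timestamps directly, which makes the driver table much shorter. A sketch restricted to the last 12 hours (the window and the 1-minute step are assumptions; if read_at is a timestamp without time zone, cast the generated minutes to match):
SELECT m.mac, g.minute, COALESCE(r.cnt, 0) AS cnt
FROM (SELECT DISTINCT mac FROM normalized_readings) m
CROSS JOIN generate_series(
        date_trunc('minute', now() - interval '12 hours'),
        date_trunc('minute', now()),
        interval '1 minute') AS g(minute)
LEFT JOIN (
    SELECT mac, date_trunc('minute', read_at) AS minute, COUNT(*) AS cnt
    FROM normalized_readings
    GROUP BY 1, 2
) r ON r.mac = m.mac AND r.minute = g.minute
ORDER BY m.mac, g.minute;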

SQL Query: Calculating cross distances based on master-detail predefined tables

I have a database with many tables, in particular these two: one stores paths and the other stores the cities of a path:
Table Paths [ PathID, Name ]
Table Routes [ ID, PathID(Foreign Key), City, GoTime, BackTime, GoDistance, BackDistance]
Table Paths:
---------------------------------------
|PathID |Name |
|-------+-----------------------------|
|1 |NewYork Casablanca Alpha 1 |
|7 |Paris Dubai 6007 10:00 |
---------------------------------------
Table Routes:
ID  PathID  City        GoTime  BackTime  GoDistance  BackDistance
1   1       NewYork     08:00   23:46     5810        NULL
2   1       Casablanca  15:43   16:03     NULL        5800
3   7       Paris       10:20   14:01     3215        NULL
4   7       Cairo       14:50   09:31     2425        3215
5   7       Dubai       18:21   06:00     NULL        2425
I want a query that gives me all the possible combinations inside the same path, something like:
PathID CityFrom CityTo Distance
I don't know if I made myself clear, but I hope you guys can help me. Thanks in advance.
This is the correct result, built manually:
------------------------------------------------------
|PathID |Go_Back |CityA |CityB |Distance|
|-------+-----------+-----------+-----------+--------|
|1 |Go |NewYork |Casablanca |5810 |
|1 |Back |Casablanca |NewYork |5800 |
|7 |Go |Paris |Cairo |3215 |
|7 |Go |Paris |Dubai |5640 |
|7 |Go |Cairo |Dubai |2425 |
|7 |Back |Dubai |Cairo |2425 |
|7 |Back |Dubai |Paris |5640 |
|7 |Back |Cairo |Paris |3215 |
------------------------------------------------------
This comes down to two questions.
Q1:
How to split up the column "Name" from table "Paths" so that it is in first normal form. See Wikipedia for a definition: the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain. You must do this yourself. It might be cumbersome to use the text-processing functions of your database to split up the nonatomic column values; it may be easier to write a script (Perl/Python/...) that does this, and re-import the results into a new table.
Q2:
How to calculate the "possible path combinations".
Maybe it is possible with a simple SQL query by sorting the table; you haven't shown enough data to say.
Ultimately, this can be done with recursive SQL. Postgres can do this. It is an advanced topic.
You definitely must decide whether your paths can contain loops. (A traveller might take a circular detour many times; it makes no sense practically, but mathematically it is possible.)
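That said, for loop-free paths like the sample data, a plain self-join over a running-sum CTE is enough; recursion isn't required. A sketch (PostgreSQL syntax, "Go" direction only; it assumes GoDistance on each row is the distance from that city to the next stop, and the "Back" direction is analogous with BackTime/BackDistance):
WITH legs AS (
    -- running total of distance covered before reaching each city
    SELECT PathID, City, GoTime,
           COALESCE(SUM(GoDistance) OVER (
               PARTITION BY PathID ORDER BY GoTime
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS dist_so_far
    FROM Routes
)
SELECT a.PathID, 'Go' AS Go_Back, a.City AS CityA, b.City AS CityB,
       b.dist_so_far - a.dist_so_far AS Distance
FROM legs a
JOIN legs b ON a.PathID = b.PathID AND a.GoTime < b.GoTime
ORDER BY a.PathID, a.GoTime, b.GoTime;
Against the sample Routes rows this yields the four "Go" lines of the manual answer (e.g. Paris->Dubai = 3215 + 2425 = 5640).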