Matching network subnets with BigQuery - sql

I have a huge set of objects (traceroutes) hosted in Google BigQuery. Each object can be considered an array of IP addresses (called hops). I am trying to write a query that identifies all the traceroutes crossing a small set of prefixes of interest (prefixes_of_interest). Those prefixes are stored in another dataset, also hosted in BigQuery. I have been able to write a query that finds all the traceroutes crossing a specific set of IP addresses (ip_of_interests) after start_date:
WITH
  traceroutes AS (
    SELECT
      hops,
      start_time
    FROM
      `traceroute` AS t
    WHERE
      DATE(start_time) >= start_date )
SELECT
  hops,
  h.hop_addr,
  start_time
FROM
  traceroutes,
  UNNEST(hops) AS h
WHERE
  h.hop_addr IN (
    SELECT
      IP
    FROM
      `ip_addresses`)
Now, in practice, I could expand all of my prefixes into individual IP addresses and rerun a similar query, but that feels like a very inefficient way to proceed. I would like to know if there is a way to match the prefix of h.hop_addr against the prefixes in my prefixes_of_interest table. My subnetworks have different lengths (most are /24 but I have a few /20), but a solution that matches /24 would do the trick and, I reckon, save a lot of space and processing time. Hopefully the question makes sense!
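The matching idea in the question (compare the prefix of each hop rather than every individual address) can be sketched outside BigQuery. This is only an illustration in Python using the standard ipaddress module; the prefixes and the helper name crosses_prefix are hypothetical. BigQuery's NET functions (for example NET.IP_TRUNC) look like the natural counterpart for the truncation step, but that mapping is an assumption and is not shown here.

```python
import ipaddress

# Hypothetical prefixes of interest (documentation ranges, not real data).
prefixes_of_interest = {
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.96.0/20"),
}
# Distinct prefix lengths present, so each hop is truncated only a few times.
prefix_lengths = {net.prefixlen for net in prefixes_of_interest}

def crosses_prefix(hops):
    """Return True if any hop address falls inside a prefix of interest."""
    for hop in hops:
        for length in prefix_lengths:
            # strict=False masks the host bits: 192.0.2.77/24 -> 192.0.2.0/24
            truncated = ipaddress.ip_network(f"{hop}/{length}", strict=False)
            if truncated in prefixes_of_interest:
                return True
    return False
```

Because each hop is truncated once per distinct prefix length and then looked up in a set, the prefixes never need to be expanded into individual addresses.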

Related

Splunk query to take a search from one index and add a field's value from another index?

How can I write a Splunk query to take a search from one index and add a field's value from another index? I've been reading explanations that involve joins, subsearches, and coalesce, and none seem to do what I want -- even though the example is extremely simple. I am not sure what I am not understanding yet.
main-index has src field which is an IP address and a field I will restrict my results on. I will look over a short amount of time, e.g.
index="main-index" sourcetype="main-index-source" main-index-field="wildcard-restriction*" earliest=-1h | stats count by src
other-index has src_ip field which is an IP address, and has the hostname. It's DHCP leases, so I need to check a longer time frame, and return only the most recent result for a given IP address. I want to get back the hostname from src_nt_host, e.g.
index="other-index" sourcetype="other-index-sourcetype" earliest=-14d
I would like to end up with the following values:
IP address, other-index.src_nt_host, main-index.count
main-index has the smallest amount of records, if that helps for performance reasons.
If I understand you correctly, you need to look at two different time ranges in two different indices.
In that case, a join will most likely be needed.
Here's one way it can be done:
index=ndx1 sourcetype=srctp1 field1="someval" src="*" earliest=-1h
| stats count by src
| join src
[| search index=ndx2 sourcetype=srctp2 field2="otherval" src_ip=* src_nt_host=* earliest=-14d
| stats count by src_ip src_nt_host
| fields - count
| rename src_ip as src ]
You may need to flip the order of the searches, depending on how many results they each return, and how long they take to run.
You may also be able to achieve what you're looking for in another manner without the use of a join, but we'd need to have some sample data to possibly give a better result

Redis | Best data structure to store IPs and networks

My goal is to store in Redis:
Plain IP addresses like 228.228.228.228
IP networks like 228.228.228.0/24
in order to check in request/response cycle whether or not
current IP xxx.yyy.xxx.vvv is inside( contained by):
Plain ips
or
Ip network ( for example 228.228.228.228 inside 228.228.228.0/24)
Overall amount of ips and networks - few 1000 items.
Question is – what is the best way (best structure) to store both plain ips and networks in Redis and make aforementioned check without fetching data from Redis?
Thanks.
P.S. Current IP is already known.
UPDATE
OK, let's simplify it a bit with an example.
I have 2 IPs and 2 networks against which I want to check whether a certain IP is contained.
# 2 plain ip
202.76.250.29
37.252.145.1
# 2 networks
16.223.132.0/24
9.76.202.0/24
There are 2 possible ways in which a given IP might be contained:
1) Just in the plain IPs. For example, 202.76.250.29 is contained in the structure above, and 215.08.11.23 is not contained, simply by definition.
2) The IP might be contained inside a network. For example, 9.76.202.100 is contained inside network 9.76.202.0/24 but not in the list of plain IPs, as there is no exact IP = 9.76.202.100.
A little bit of explanation about IP networks, very simplified.
An IP network represents a range of IPs. For example, the IPv4 network "192.4.2.0/24" covers 256 addresses (192.4.2.0 to 192.4.2.255); its usable host addresses are:
IPv4Address('192.4.2.1'), IPv4Address('192.4.2.2'),
…
…
…
IPv4Address('192.4.2.253'), IPv4Address('192.4.2.254')
In other words, an IP network is a range of IP addresses;
here the hosts run from '192.4.2.1' up to '192.4.2.254'.
In our example, 9.76.202.100 is contained inside network 9.76.202.0/24 because it is one of the addresses inside the range.
My idea is like this:
Any ip address can be represented as integer. One of our ip addresses
202.76.250.29 converted to integer is 3394042397.
Since an IP network is a range of IPs, it can be converted to a range of integers by converting the first and last IPs of the range.
For example, one of our networks, 16.223.132.0/24, represents the range between IPv4Address('16.223.132.1') and IPv4Address('16.223.132.254'), or the integer range from 283083777 up to 283084030 with step 1.
An individual IP can be represented as the range between its integer and its integer + 1 (lower bound included, upper bound excluded).
Obviously search in plain ips can be done by putting them to SET and then using SISMEMBER. But what about searching inside networks. Can we do some trick with ranges maybe?
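The integer representation described above can be checked with Python's standard ipaddress module. This is a minimal sketch; the helper names ip_to_int and network_to_range are made up for illustration, and the range here is assumed to cover the whole block (including the .0 and .255 addresses), which is slightly wider than the host range listed in the question.

```python
import ipaddress

def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))

def network_to_range(cidr):
    """Return the (first, last) integers of the network, both inclusive."""
    net = ipaddress.IPv4Network(cidr)
    return int(net.network_address), int(net.broadcast_address)

plain_ip = ip_to_int("202.76.250.29")        # 3394042397, as stated in the question
low, high = network_to_range("9.76.202.0/24")
candidate = ip_to_int("9.76.202.100")
inside = low <= candidate <= high            # containment is a plain integer comparison
```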
"Best" is subjective (in memory, in speed, etc.), but you may use two sets or two hashes to store them. Since the entries are unique, both hashes and sets would be fine. If you prefer, you can use a single set/hash to hold both the plain IPs and the network addresses, but I would keep them separate since they are two different types of data (just like database tables).
Then you can use either of those
SISMEMBER with O(1) time complexity
HEXISTS with O(1) time complexity.
It can be handled on application level with multiple commands or lua script(in a single transaction).
Depending on your choice add to your keys with SADD and HSET(the field value would be 1).
--
Edit: (hope i get it right)
For the network ranges, build the set from the first three octets: a range such as 12.345.67.1-12.345.67.254 is represented as 12.345.67, and that is what you add to the set. When you want to search for 12.345.67.x, your application parses it into 12.345.67 and checks it with SISMEMBER. The same can be done with a hash and HEXISTS.
Since an IP address consists of four numbers separated by three dots, you discard the last dot and the last number, and the rest represents the network range (this holds for /24 networks only).
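The prefix-key lookup described in this answer might look roughly like the following. It is a sketch only: plain Python sets stand in for the two Redis sets (filled with SADD, queried with SISMEMBER), the function names are hypothetical, and the approach only works when every network is a /24.

```python
import ipaddress

# Plain Python sets stand in for the two Redis sets (SADD to fill, SISMEMBER to query).
plain_ips = {"202.76.250.29", "37.252.145.1"}
networks_24 = {"16.223.132", "9.76.202"}  # /24 networks stored as their first three octets

def network_key(ip):
    """Drop the last octet: '9.76.202.100' -> '9.76.202'."""
    ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    return ip.rsplit(".", 1)[0]

def is_listed(ip):
    """True if the IP is a listed plain IP or falls inside a listed /24."""
    return ip in plain_ips or network_key(ip) in networks_24
```

Both membership tests are constant-time, which matches the O(1) cost quoted for SISMEMBER.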
For IPs you can use Set and query by certain IP within O(1) time.
For IP range, I think you can use List with Lua Script for query. List will have O(n) time for searching, but since you only have 1000 items, O(N) and O(1) will not have a huge difference for Redis in memory query.
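The linear scan suggested here can be sketched as follows. A Python list stands in for the Redis list, and the scan that the answer would place in a Lua script is written as an ordinary function; the names to_range and in_any_network are hypothetical.

```python
import ipaddress

def to_range(cidr):
    """Return the (first, last) integers of a network, both inclusive."""
    net = ipaddress.IPv4Network(cidr)
    return int(net.network_address), int(net.broadcast_address)

# A Python list stands in for the Redis list of ranges.
network_ranges = [to_range("16.223.132.0/24"), to_range("9.76.202.0/24")]

def in_any_network(ip):
    """O(n) scan: compare the address against every stored range."""
    value = int(ipaddress.IPv4Address(ip))
    return any(low <= value <= high for low, high in network_ranges)
```

Unlike a lookup keyed on the first three octets, this works for any prefix length, at the cost of touching every entry.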

How to group similar GTFS trips

I need to group GTFS trips into human-understandable "route variants", since one route can run different trips depending on day, time, etc.
Is there any preferred way to group similar trips? The trip's shape_id looks promising, but is there any guarantee that all similar trips have the same shape_id?
My GTFS data is imported into my SQL database, and the database structure is the same as the GTFS txt files.
UPDATE
I'm not looking for a SQL query example; I'm looking for a high-level example of how to group similar trips into user-friendly "route variants".
Many route planning apps (like Moovit) use GTFS data as source and they display different route variants to users.
There is no official way to do this. The best way is probably to group by the ordered list of stops on each trip, sometimes known as the "stopping pattern" of the trip. The idea is discussed at a conceptual level here by Mapzen.
In practice, I have created concatenated strings of all stops on a given trip (from stop_times), and grouped by that to define similar trips. E.g., if the stops on a given trip are A, B, C, D, and E, create a string A-B-C-D-E or A_B_C_D_E and group trips on that string. This functionality is not part of the SQL spec, although MySQL implements it as GROUP_CONCAT and PostgreSQL uses arrays and array_to_string. You may also want to add route_id and shape_id into the grouping as well, to handle some corner cases.
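The grouping by stopping pattern described above can be sketched in a few lines. The rows below are hypothetical (trip_id, stop_sequence, stop_id) tuples as they might come out of stop_times, and the function name is made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows from stop_times: (trip_id, stop_sequence, stop_id).
stop_times = [
    ("t1", 1, "A"), ("t1", 2, "B"), ("t1", 3, "C"),
    ("t2", 1, "A"), ("t2", 2, "B"), ("t2", 3, "C"),
    ("t3", 2, "C"), ("t3", 1, "A"),
]

def group_by_stopping_pattern(rows):
    """Map each ordered stop string (e.g. 'A-B-C') to the trips that follow it."""
    stops_per_trip = defaultdict(list)
    for trip_id, sequence, stop_id in rows:
        stops_per_trip[trip_id].append((sequence, stop_id))
    variants = defaultdict(list)
    for trip_id, stops in stops_per_trip.items():
        # Sort by stop_sequence so the pattern does not depend on row order.
        pattern = "-".join(stop_id for _, stop_id in sorted(stops))
        variants[pattern].append(trip_id)
    return dict(variants)

variants = group_by_stopping_pattern(stop_times)
```

The separator should be a character that cannot occur inside a stop_id; otherwise two different patterns could produce the same string.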

Querying Sql server with index for range

I am using SQL SERVER 2012 that is running on windows datacenter 2012,
I have a database with a table that is built as follows:
[ID] (pk,int not null)
[Start] (float,null)
[End] (float, null)
[CID] (int,null) --country id
I have a web service that gets an IP, translates it to decimal
(see: IP address conversion to decimal and vice versa) and asks the database server for the country ID.
The table mentioned above contains ~200K rows, with start and end values representing IP ranges as decimals and a country ID related to each range.
I have encountered really high CPU usage under some heavy traffic we have been dealing with, so I added indexes on the start and end columns. Afterwards the CPU got a little better, but I think it should have improved much more: this is simply supposed to work as a search in a sorted list, which should be extremely fast, yet the result of adding the indexes was far from what I expected.
I suppose that is because it is not searching a list but searching a range.
What would be the best way to make this efficient? I am sure the resources this simple action consumes are far more than they should be.
Here is a picture from the activity monitor now (lower traffic, after indexing) :
This is running on Azure ExtraLarge VM (8 cores 14GB memory) - the vm is doing nothing but running a sql server with 1 table that only translates this 1 request ! the VM CPU on this lower traffic is ~30% and ~70% on higher traffic, i am sure some structure/logical changes should make a really small server\service handle this easily.
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
This gets you the nearest IP range that starts at or below the given IP. You then need to test whether the EndIP also matches. So:
SELECT *
FROM (
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
) x
WHERE EndIP >= yourIP
This amounts to a single-row index seek. Perfect performance.
The reason SQL Server cannot automatically do this is that it cannot know that IP ranges are ordered, meaning that the next StartIP is always greater than the current EndIP. We could have ranges of the form (100, 200), (150, 250). That is clearly invalid but it could be in the table.
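The same two-step lookup (seek to the nearest range start at or below the address, then check its end) can be illustrated outside SQL with a binary search. This is a sketch under the same assumption the answer makes, namely that ranges do not overlap; the data and the name lookup_country are hypothetical.

```python
import bisect

# Hypothetical non-overlapping ranges as (start, end, country_id), sorted by start.
ranges = [(100, 199, 1), (300, 399, 2), (400, 499, 3)]
starts = [r[0] for r in ranges]

def lookup_country(ip_value):
    """Return the country id of the range containing ip_value, or None."""
    # Step 1: rightmost range whose start is <= ip_value (the single seek).
    index = bisect.bisect_right(starts, ip_value) - 1
    if index < 0:
        return None
    start, end, country_id = ranges[index]
    # Step 2: the candidate only matches if its end also covers the value.
    return country_id if ip_value <= end else None
```

With overlapping ranges such as (100, 200) and (150, 250), the nearest start is no longer guaranteed to belong to the only matching range, which is why the database cannot apply this shortcut on its own.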
In my opinion, your main problem is the lack of parameterization, because (a) query compilation is (or can be) expensive and (b) these unparameterized queries seem to be executed a lot. The available screenshot shows two things regarding this aspect:
1) The recent expensive queries aren't "parameterized".
2) High values for "Plan count":
Plan Count The number of cached query plans for this query. A large
number might indicate a need for explicit query parameterization. For
more information, see Specifying Query Parameterization Behavior by
Using Plan Guides.
Source
So, I would try to use parameters for these queries:
SELECT TOP(1) CountryId FROM [IP] WHERE Column1 <= @param AND @param <= Column2
If you can't change the application (how SQL requests are sent to SQL Server), then you could try plan guides:
http://technet.microsoft.com/en-US/library/ms191275(v=sql.90).aspx

Indexed ranged search algorithm for IP Addresses

Given an ACL list with 10 billion IPv4 ranges in CIDR notation or between two IPs:
x.x.x.x/y
x.x.x.x - y.y.y.y
What is an efficient search/indexing algorithm for testing whether a given IP address meets the criteria of one or more ACL ranges?
Let's assume most ACL range definitions span a great number of class C blocks.
Indexing points via hash tables is easy, but try as I might, I have not been able to come up with a reasonable method for detecting which points are covered by a large list of "lines".
I had some thoughts, like indexing hints at a certain level of detail (say, pre-computing at the class C level each ACL that covers that point), but the table would be too large. Or some sort of KD tree to dynamically set levels of detail.
Also had the thought that maybe there are collision detection algorithms out there that can address this.
Any hints or pointers in the right direction?
The simple radix tree that is used for longest-prefix-match lookups in Internet routing can be extended to hold nodes for the larger CIDR subnets that overlap smaller ones. A longest-match lookup traverses these nodes on its way down, so collecting them along the path yields the entire set of CIDR subnets that match an IP address.
Now, to hold the IP ranges in the same tree, we can convert each range into a set of CIDR subnets. This can always be done, though the set may contain many subnets (and even some host IPs, that is, /32 CIDR addresses).
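A rough sketch of this idea follows. It uses an uncompressed binary trie rather than a true path-compressed radix tree, which keeps the code short but uses more memory; the class and method names are hypothetical. The range-to-CIDR step uses summarize_address_range from Python's standard ipaddress module.

```python
import ipaddress

class PrefixTrie:
    """Binary trie over IPv4 bits; a node carries labels if a prefix ends there."""

    def __init__(self):
        self.root = {}

    def insert(self, network, label):
        node = self.root
        bits = int(network.network_address)
        for i in range(network.prefixlen):
            bit = (bits >> (31 - i)) & 1
            node = node.setdefault(bit, {})
        node.setdefault("labels", []).append(label)

    def matches(self, ip):
        """Walk toward the longest match, collecting every prefix passed on the way."""
        value = int(ipaddress.IPv4Address(ip))
        node, found = self.root, list(self.root.get("labels", []))
        for i in range(32):
            node = node.get((value >> (31 - i)) & 1)
            if node is None:
                break
            found.extend(node.get("labels", []))
        return found

trie = PrefixTrie()
trie.insert(ipaddress.ip_network("10.0.0.0/8"), "10.0.0.0/8")
trie.insert(ipaddress.ip_network("10.1.0.0/16"), "10.1.0.0/16")

# A start-end range is first split into CIDR blocks, then stored like any other prefix.
first = ipaddress.IPv4Address("192.0.2.0")
last = ipaddress.IPv4Address("192.0.2.130")
range_blocks = list(ipaddress.summarize_address_range(first, last))
for block in range_blocks:
    trie.insert(block, "192.0.2.0-192.0.2.130")
```

A lookup visits at most 32 nodes regardless of how many rules are stored, and overlapping subnets simply show up as several labels along the same path.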
You have 10 billion rules to match 4 billion possible addresses?
Make a table of 4 billion addresses. For each of the 10 billion rules, 'paint' the addresses it applies to, doing something sensible when two or more rules apply to the same address.
You can look at the Interval tree to find all intervals that overlap with any given interval or point.
For non-overlapping ip-ranges, you can use a b-tree or compact-tries like Judy arrays (64-bits) for indexing and searching (Store the start-ip as key and the end-ip as value).