Using REGEXP_EXTRACT to get domain and subdomains - sql

I have only managed to extract the TLD of the list of websites that I have using
REGEXP_EXTRACT(Domain_name, r'(\.[^.:]*)\.?:?[0-9]*$') AS web_tld
Example:
I have
www.example1.abc.com
www.example2.efg.123.net
I want the result
Subdomain
example1
efg
Domain
abc
123
TLD
.com
.net
EDIT:
Encountered an error in my query
'Exactly one capturing group must be specified'
when I use (\.?([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$) as regex
SELECT
REGEXP_EXTRACT(Domain, r'(\.?([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$)'),
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;

As you can only use one capturing group, I think you can actually use 3 separate regular expressions to get the values you want:
SELECT
REGEXP_EXTRACT(Domain, r'([^.:]+):?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+)\.[^.:]+:?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+)\.[^.:]+\.[^.:]+:?[0-9]*$')
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;

Note that you may be better off using the built-in HOST(), DOMAIN(), and TLD() functions rather than custom regular expressions.
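To see what the three single-group patterns return, here is a minimal Python sketch that runs them against the sample hosts from the question. The helper names extract and split_host are made up for this illustration, and Python's re module stands in for REGEXP_EXTRACT, so treat it as a way to check the patterns rather than as BigQuery code.

```python
import re

# One pattern per part, each with exactly one capturing group,
# anchored at the end of the host (an optional :port is tolerated).
TLD_RE = re.compile(r'([^.:]+):?[0-9]*$')
DOMAIN_RE = re.compile(r'([^.:]+)\.[^.:]+:?[0-9]*$')
SUBDOMAIN_RE = re.compile(r'([^.:]+)\.[^.:]+\.[^.:]+:?[0-9]*$')


def extract(pattern, host):
    """Return the single captured group, or None when nothing matches."""
    m = pattern.search(host)
    return m.group(1) if m else None


def split_host(host):
    """Return (subdomain, domain, tld) for a host name (hypothetical helper)."""
    return (
        extract(SUBDOMAIN_RE, host),
        extract(DOMAIN_RE, host),
        extract(TLD_RE, host),
    )


if __name__ == "__main__":
    for host in ("www.example1.abc.com", "www.example2.efg.123.net"):
        print(host, split_host(host))
```

Note that the TLD comes back without the leading dot; prepend it yourself if you need ".com" rather than "com".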

Related

Match partial string from list with field

I'm trying to check if a field contains a value from a list using Kusto in Log analytics/Sentinel in Azure.
The list contains top-level domains, but I only want matches for subdomains of these top-level domains. The list value example.com should match values such as forum.example.com or api.example.com.
I got the following code but it does exact matches only.
let domains = dynamic(["example.com", "amazon.com", "microsoft.com", "google.com"]);
DeviceNetworkEvents
| where RemoteUrl in~ (domains)
| project TimeGenerated, DeviceName, InitiatingProcessAccountUpn, RemoteUrl
I tried with endswith, but couldn't get that to work with the list.
It seems that has_any() would work for you:
let domains = dynamic(["example.com", "amazon.com", "microsoft.com", "google.com"]);
DeviceNetworkEvents
| where RemoteUrl has_any(domains)
| project TimeGenerated, DeviceName, InitiatingProcessAccountUpn, RemoteUrl
Note that you can also use has_any_index() to find out which item in the array was matched.
In order to correctly match URLs with a list of domains, you need to build a regex from these domains, and then use the matches regex operator.
Make sure you build the regex correctly, in order not to allow these:
example.com.hacker.com
hackerexample.com
hacker.com/example.com
Etc...
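As a rough illustration of what "building the regex correctly" could look like, here is a Python sketch. It assumes RemoteUrl holds either a bare host or a full URL, and that only true subdomains of the listed domains should match. The function name build_subdomain_regex is hypothetical; the resulting pattern would still have to be adapted to the regex dialect used by Kusto's matches regex.

```python
import re

DOMAINS = ["example.com", "amazon.com", "microsoft.com", "google.com"]


def build_subdomain_regex(domains):
    """Build one anchored pattern that accepts only subdomains of `domains`."""
    # Escape each domain so that "." is treated literally.
    alternatives = "|".join(re.escape(d) for d in domains)
    return re.compile(
        r"^(?:https?://)?"           # optional scheme
        r"(?:[a-z0-9-]+\.)+"         # at least one subdomain label
        r"(?:" + alternatives + r")"  # one of the listed domains
        r"(?::\d+)?"                 # optional port
        r"(?:[/?#].*)?$",            # optional path, query or fragment
        re.IGNORECASE,
    )


SUBDOMAIN_OF_LISTED = build_subdomain_regex(DOMAINS)


def is_listed_subdomain(remote_url):
    return SUBDOMAIN_OF_LISTED.match(remote_url) is not None


if __name__ == "__main__":
    for url in ("forum.example.com", "example.com.hacker.com",
                "hackerexample.com", "hacker.com/example.com"):
        print(url, is_listed_subdomain(url))
```

Anchoring the pattern at both ends and requiring a literal dot before the listed domain is what rules out the three hostile examples above.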

URL parsing in SQL

I have inconsistent URLs in one of my tables.
The sample looks like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL, "blue" is the expected result, even though the URL contains two domains, blue and decibal.
For the second one, it is google.
For the third, it is again google.
My requirement is to parse the URL and match it against a lookup table of domain names, which contains blue, google, bing, etc.
However, the inconsistency of the URLs stored in the DB is a challenge. I need to write SQL that can identify the match and, if there are two domains, just pick the first one. The URL can be a site and is not expected to be a standard one.
Appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
SELECT REPLACE(url,'3A%','//') as url
FROM (VALUES
('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) as x (url)
)
SELECT split(split_part(split_part(url,'//',2),'/',1),'.') as url_array,
array_construct('google') as google_array,
array_construct('decibal') as decibal_array,
array_construct('bing') as bing_array,
CASE WHEN arrays_overlap(url_array,google_array) THEN 'GOOGLE'
WHEN arrays_overlap(url_array,decibal_array) THEN 'DECIBAL'
WHEN arrays_overlap(url_array,bing_array) THEN 'BING' END as domain_match
FROM x;
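The same split-then-match idea can be sketched outside the database. The Python below is only an illustration under stated assumptions: the KNOWN list stands in for the lookup table, the "3A%" repair mirrors the REPLACE in the query above, and the function names are invented for this example. It returns the first host label that appears in the lookup list, which is how "pick the first one" is interpreted here.

```python
import re

# Stand-in for the lookup table of known domain names (assumption).
KNOWN = ["blue", "google", "bing"]


def host_labels(url):
    """Return the dot-separated labels of the host part of a messy URL."""
    # Repair the mangled scheme separator seen in the sample data.
    url = url.replace("3A%", "://")
    # Drop the scheme if there is one.
    rest = url.split("://", 1)[-1]
    # Keep only the host: cut at the first "/", "?" or "#".
    host = re.split(r"[/?#]", rest, maxsplit=1)[0]
    return host.lower().split(".")


def first_known_domain(url, known=KNOWN):
    """Return the first host label found in `known`, or None."""
    for label in host_labels(url):
        if label in known:
            return label
    return None


if __name__ == "__main__":
    print(first_known_domain("https://blue.decibal.com.au/Transact?pi=9024"))
    print(first_known_domain("https3A%google.com/Transact?pi=9024"))
```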

group by part of url using regex splunk

I have multiple URLs that all start with /api/net. I want to group by the next couple of strings that are separated by /, like
/api/net/abc/def?key=value
/api/net/c/d?key1=value1
/api/net/j/h?key2=value2
I have the regular expression below, which parses all the URLs, but I have to spell out each required path explicitly in the regular expression.
| rex field=requestPath "(?<volga>.+?(\/abc\/def)|(\/c\/d)|(\/j\/h).+?)"
volga is a named capturing group, I want to do a group by on volga without adding /abc/def, /c/d,/j/h in regular expression so that I would know number of expressions in there instead of hard coding.
There are other expressions I would not know to add, So I want to group by on next 2 words split by / after "net" and do a group by , also ignore rest of the url. Let me know if you did not understand, I could explain more.
If I understand the question correctly, this regex will parse the URL and return the two path segments after /api/net as 'dom1' and 'dom2', respectively. Then you can group/sort on them.
... | rex field=requestPath "\/api\/net\/(?<dom1>[^\/]+)\/(?<dom2>[^\/\?]+)"
| stats values(*) as * by dom1,dom2
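To check the pattern outside Splunk, the following Python sketch applies the same two named groups to the sample paths and counts how many requests fall into each (dom1, dom2) pair. The function name group_paths and the use of Counter are choices made for this illustration, not something Splunk does internally.

```python
import re
from collections import Counter

# Same idea as the rex above: capture the two segments after /api/net/,
# stopping the second one at "/" or "?".
PATH_RE = re.compile(r"^/api/net/(?P<dom1>[^/]+)/(?P<dom2>[^/?]+)")


def group_paths(paths):
    """Count request paths by their (dom1, dom2) segment pair."""
    counts = Counter()
    for path in paths:
        m = PATH_RE.match(path)
        if m:
            counts[(m.group("dom1"), m.group("dom2"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "/api/net/abc/def?key=value",
        "/api/net/c/d?key1=value1",
        "/api/net/j/h?key2=value2",
        "/api/net/abc/def?key=other",
    ]
    for pair, n in group_paths(sample).items():
        print(pair, n)
```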

Two IP Address match

I need to match two IP addresses with a regular expression:
Like 20.20.20.20
should match with 20.20.20.20
should match with [http://20.20.20.20/abcd]
should not match with 20.20.20.200
should not match with [http://20.20.20.200/abcd]
should not match with [http://120.20.20.20/abcd]
At present I am using something like this regular expression: ".*[^(\d)]20.20.20.20[^(\d)].*"
But it is not working for the 1st and 3rd cases. Please help me with this regular expression.
You're ignoring the case where the line starts with 20.20.20.20:
"(.*[^\d]|^)20\.20\.20\.20([^\d].*|$)"
seems to work for me
You can do it like this:
select * from tablename
where ip = '20.20.20.20' or ip like 'http://20.20.20.20/%'
[^(\d)] without a quantifier means that you expect exactly one character that is not a digit (and, because ( and ) are literal inside a character class, not a parenthesis either)
using [^(\d)]* will help
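Another way to express "not preceded or followed by a digit" is with negative lookarounds, which need no special handling for the start or end of the string. The Python sketch below assumes an engine that supports lookbehind and lookahead; the helper name ip_pattern is made up for this example.

```python
import re


def ip_pattern(ip):
    """Build a pattern matching `ip` only when no digit touches either end."""
    # re.escape makes the dots literal; the lookarounds reject a neighbouring
    # digit without consuming any character.
    return re.compile(r"(?<!\d)" + re.escape(ip) + r"(?!\d)")


def contains_ip(text, ip="20.20.20.20"):
    return ip_pattern(ip).search(text) is not None


if __name__ == "__main__":
    for text in ("20.20.20.20", "[http://20.20.20.20/abcd]",
                 "20.20.20.200", "[http://120.20.20.20/abcd]"):
        print(text, contains_ip(text))
```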

Remove all text between 2 sentences in regex

I'm going crazy with regex.
I need to extract the words between FROM and WHERE in this syntax:
SELECT IDClient, Client FROM Client WHERE IDClient = 1 GROUP BY IDClient, Client ORDER BY IDClient
result = Client
How can I resolve this using regular expressions?
/FROM (.*) WHERE/i
(?<=FROM\s+).*(?=\s+WHERE)
That uses a look behind and a lookahead to get what is between FROM and WHERE, and can be modified depending on whether you want the whitespace or not.
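Not every engine accepts a variable-length lookbehind such as (?<=FROM\s+); Python's re module, for one, rejects it. In that case a capturing group does the same job. The sketch below is one way to write it, with the function name from_clause invented for this example and a lazy quantifier so that matching stops at the first WHERE.

```python
import re

# Capture whatever sits between FROM and the first WHERE, ignoring case.
FROM_WHERE = re.compile(r"\bFROM\s+(.*?)\s+WHERE\b", re.IGNORECASE | re.DOTALL)


def from_clause(sql):
    """Return the text between FROM and WHERE, or None if either is missing."""
    m = FROM_WHERE.search(sql)
    return m.group(1) if m else None


if __name__ == "__main__":
    query = ("SELECT IDClient, Client FROM Client WHERE IDClient = 1 "
             "GROUP BY IDClient, Client ORDER BY IDClient")
    print(from_clause(query))
```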
Use a regex cheat sheet, it's not too hard to work out.
You can use this online regular expression builder:
http://gskinner.com/RegExr/
Or try the tutorials at:
regular-expressions dot info