How to get rid of queries in a URL using Hive?

I have a few million urls that can look like:
www.wikipedia.com/helloworld?somekey=published_links&otherkey=1
www.wikipedia.com/helloworld?wowkey=20005
www.wikipedia.com/helloworld
I would like to get rid of the url queries so that they all look like:
www.wikipedia.com/helloworld
How can I do that? Is it safe to do it with regex? Should I use parse_url instead (Hive)?
Thanks!

You can use the parse_url function. Prepend http:// (or https://) to the existing column so that it parses as a full URL, then extract the HOST and PATH values and concatenate them to get the desired result.
select CONCAT(parse_url(concat('http://',col),'HOST'),
parse_url(concat('http://',col),'PATH')
)
from tbl
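For anyone who wants to check the idea outside Hive, here is a minimal Python sketch of the same approach: add a scheme so the value parses as a URL, then keep only the host and the path. The function name strip_query is made up for illustration, and this uses Python's urllib rather than Hive's parser, so edge cases may differ.

```python
from urllib.parse import urlsplit


def strip_query(url: str) -> str:
    """Return host + path, dropping any query string or fragment."""
    # The stored values have no scheme, so add one before parsing,
    # mirroring concat('http://', col) in the Hive query.
    parts = urlsplit("http://" + url)
    # hostname is lowercased and excludes any port or user info.
    return (parts.hostname or "") + parts.path


for sample in (
    "www.wikipedia.com/helloworld?somekey=published_links&otherkey=1",
    "www.wikipedia.com/helloworld?wowkey=20005",
    "www.wikipedia.com/helloworld",
):
    print(strip_query(sample))
```

All three sample URLs from the question come out as www.wikipedia.com/helloworld.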

Related

URL parsing in SQL

I have inconsistent URLs in one of my tables.
The sample looks like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL the expected result is "blue", although the host contains two domain names, blue and decibal.
For the second one it is google.
For the third it is again google.
My requirement is to parse the URL and match it against a lookup table of domain names, which contains blue, google, bing, etc.
However, the inconsistency of the URLs stored in the DB is a challenge. I need to write SQL that can identify the match and, if there are two domains, just pick the first one. The URL is not expected to be in a standard form.
I'd appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
SELECT REPLACE(url,'3A%','//') as url
FROM (VALUES
('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) as x (url)
)
SELECT split(split_part(split_part(url,'//',2),'/',1),'.') as url_array,
array_construct('google') as google_array,
array_construct('decibal') as decibal_array,
array_construct('bing') as bing_array,
CASE WHEN arrays_overlap(url_array,google_array) THEN 'GOOGLE'
WHEN arrays_overlap(url_array,decibal_array) THEN 'DECIBAL'
WHEN arrays_overlap(url_array,bing_array) THEN 'BING' END as domain_match
FROM x;
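The same split-and-match logic can be sketched in Python to make the steps easier to follow. This is only an illustration: the LOOKUP tuple stands in for the lookup table and first_matching_domain is a made-up name. Unlike the CASE expression above, it walks the host labels left to right, so it returns the first matching domain in the URL, which is what the question asks for.

```python
from typing import Optional, Sequence

# Hypothetical stand-in for the lookup table of domain names.
LOOKUP = ("blue", "google", "bing")


def first_matching_domain(url: str, lookup: Sequence[str] = LOOKUP) -> Optional[str]:
    """Return the first host label that appears in the lookup list."""
    # Repair the mangled scheme separator, as REPLACE(url, '3A%', '//') does.
    fixed = url.replace("3A%", "//")
    # Keep what follows '//' and precedes the next '/': the host part.
    after_scheme = fixed.split("//", 1)[1] if "//" in fixed else fixed
    host = after_scheme.split("/", 1)[0]
    # Check the dot-separated labels in order so the leftmost match wins.
    for label in host.lower().split("."):
        if label in lookup:
            return label
    return None
```

With the three sample URLs this gives blue, google and google respectively.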

How to create a correct filter string with OR and AND operators for django?

My app has a frontend in Vue.js and a backend in Django REST framework. I need to build a filter string in Vue that expresses something like this:
((status=closed) | (status=canceled)) & (priority=middle)
but I get an error in response:
["Invalid querystring operator. Matched: ') & '."]
After encoding my string looks like this:
?filters=((status%3D%D0%97%D0%B0%D0%BA%D1%80%D1%8B%D1%82)%20%7C%20(status%3D%D0%9E%D1%82%D0%BA%D0%BB%D0%BE%D0%BD%D0%B5%D0%BD))%20%26%20(priority%3D%D0%A1%D1%80%D0%B5%D0%B4%D0%BD%D0%B8%D0%B9)
which corresponds to
?filters=((status=closed)|(status=canceled))&(priority=middle)
What should a correct filter string for Django look like?
I have no problem if the statement includes only | or only &. For example, a filter string like this one works perfectly:
?filters=(status%3D%D0%97%D0%B0%D0%BA%D1%80%D1%8B%D1%82)%20%7C%20(status%3D%D0%9E%D1%82%D0%BA%D0%BB%D0%BE%D0%BD%D0%B5%D0%BD)
a.k.a. ?filters=(status=closed)|(status=canceled). But if I add an & after it, plus additional brackets to specify the order in which the conditions are evaluated, it fails with an error.
I also tried reducing the number of brackets and, as an experiment, ended up with a string like this:
?filters=(status%3D%D0%97%D0%B0%D0%BA%D1%80%D1%8B%D1%82%20%7C%20status%3D%D0%9E%D1%82%D0%BA%D0%BB%D0%BE%D0%BD%D0%B5%D0%BD)
a.k.a. ?filters=(status=closed | status=canceled). This one doesn't work: I get neither an error nor the data.
I need mixed results in my case: both statuses (closed and canceled) together with priority=middle, but the string format isn't correct. Please explain which format would be OK.
That doesn't look like a very uri friendly syntax you're trying to use there.
Try doing this instead:
?status[]=closed&status[]=canceled&priority=middle
Then use request.GET.getlist('status[]') to get back the list and use the values for logical OR queryset filtering:
qs = qs.filter(status__in=request.GET.getlist('status[]', []))
and then add any additional filtering, which works as a logical AND.
If you're using axios, it should automatically serialize a JavaScript array passed as the status param into this format.
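To see how the repeated status[] key turns into "OR within a key, AND across keys", here is a small sketch that runs without Django. It parses the query string with the standard library and filters plain dictionaries; filter_tickets and the ticket fields are invented for the example and are not part of Django or DRF.

```python
from urllib.parse import parse_qs


def filter_tickets(query_string: str, tickets: list) -> list:
    """Apply OR within a repeated key and AND across different keys."""
    # parse_qs collects repeated keys into lists, e.g.
    # {'status[]': ['closed', 'canceled'], 'priority': ['middle']}
    params = parse_qs(query_string)
    statuses = params.get("status[]", [])
    priorities = params.get("priority", [])

    result = tickets
    if statuses:
        # Any listed status is accepted (logical OR), like status__in.
        result = [t for t in result if t["status"] in statuses]
    if priorities:
        # Chaining a second filter narrows the result (logical AND).
        result = [t for t in result if t["priority"] in priorities]
    return result
```

In a real view the two list comprehensions would be replaced by chained qs.filter(...) calls, as in the answer above.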

SQL search for string within string, excluding another string

I have an SQL field containing a large chunk of HTML. I'd like to identify any records where there is the string "http://" but it is not part of a string that is "http://www.example.com." Many of the records include "http://www.example.com" -- I am not looking to exclude those. Rather, to return them if there is an additional "http://" link that is not of the same format.
As an example, I would want to return these records:
http://www.foo.com is a great site but http://www.example.com is not
http://www.foo.com is a great site
but not this one:
http://www.example.com is a great site
How's this?
SELECT *
FROM table
WHERE REPLACE(field, 'http://www.example.com', '') LIKE '%http://%'
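The trick in that query is to delete every occurrence of the allowed link first and then look for any "http://" that is left over. A quick Python sketch of the same idea, with has_other_link as a made-up helper name, lets you try it against the sample rows:

```python
def has_other_link(html: str, allowed: str = "http://www.example.com") -> bool:
    """True if the text contains an http:// link other than the allowed one."""
    # Remove all copies of the allowed link, then check what remains.
    remaining = html.replace(allowed, "")
    return "http://" in remaining


rows = [
    "http://www.foo.com is a great site but http://www.example.com is not",
    "http://www.foo.com is a great site",
    "http://www.example.com is a great site",
]
for row in rows:
    print(has_other_link(row), row)
```

One caveat of this approach, in SQL as well as here: a link that merely starts with the allowed string, such as http://www.example.com.other.org, is stripped too and will not be reported.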

postgresql replace function using pattern matching characters

I have a table named "OptionsDetail" with a column named "URL" in a PostgreSQL database. The "URL" column contains the following data:
URL
http://www.site1.com/ebw/file1/detail.action?cid=1
http://www.another2.com/ebw/file1/detail.action?cid=11&code=MYCODE
http://www.anothersite3.com/ebw/file1/detail.action?cid=12&code=ANOTHERCODE&option=ROOM
Now I want to replace the data in URL to
URL
/file1/detail.action?cid=1
/file1/detail.action?cid=11&code=MYCODE
/file1/detail.action?cid=12&code=ANOTHERCODE&option=ROOM
I wrote the following query to do this:
UPDATE "OptionsDetail" SET "URL" = replace("URL",'http://%/ebw/file1','/file1') WHERE "URL" LIKE '%/ebw/file1%';
I also tried this variant:
UPDATE "OptionsDetail" SET "URL" = replace("URL",'%/ebw/file1','/file1') WHERE "URL" LIKE '%/ebw/file1%';
The query executes successfully and reports, for example, 200 rows affected, but the "URL" column data does not change the way I need; it stays as it was.
Please help me resolve this issue.
The problem is that replace doesn't support wildcards like %:
replace("URL",'http://%/ebw/file1','/file1')
                     ^^^
You can use regexp_replace instead:
UPDATE "OptionsDetail"
SET "URL" = regexp_replace("URL", 'http://.*/ebw/file1', '/file1')
WHERE "URL" LIKE '%/ebw/file1%'
Note that regexp_replace uses different wildcards than like. In regular expressions, "Any number of any character" is .* instead of %.
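The difference between a literal replace and a regular-expression replace is easy to reproduce in Python, which may help when testing the pattern before running the UPDATE. This is just a sketch with a hypothetical helper name, strip_prefix, and it uses Python's re module rather than PostgreSQL's regex engine, though this particular pattern behaves the same way in both.

```python
import re


def strip_prefix(url: str) -> str:
    """Replace everything from 'http://' through '/ebw/file1' with '/file1'."""
    # '.*' means "any number of any character", the regex counterpart of
    # the '%' wildcard used by LIKE.
    return re.sub(r"http://.*/ebw/file1", "/file1", url)


url = "http://www.another2.com/ebw/file1/detail.action?cid=11&code=MYCODE"

# A literal replace treats '%' as an ordinary character, finds nothing,
# and returns the string unchanged.
print(url.replace("http://%/ebw/file1", "/file1"))

# The regex version rewrites the prefix.
print(strip_prefix(url))
```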

Two IP Address match

I need to match an IP address with a regular expression.
For example, 20.20.20.20
should match with 20.20.20.20
should match with [http://20.20.20.20/abcd]
should not match with 20.20.20.200
should not match with [http://20.20.20.200/abcd]
should not match with [http://120.20.20.20/abcd]
At present I am using a regular expression like this: ".*[^(\d)]20.20.20.20[^(\d)].*"
But it is not working for the 1st and 3rd cases. Please help me with this regular expression.
You're ignoring the case where the line starts or ends with 20.20.20.20. With both cases handled, and the dots escaped so that they only match a literal ".":
"(.*[^(\d)]|^)20\.20\.20\.20([^(\d)].*|$)"
seems to work for me.
You can do it like this:
select * from tablename
where ip = '20.20.20.20' or ip like 'http://20.20.20.20/%'
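[^(\d)] without a quantifier means that you expect exactly 1 character that is not a digit (or a parenthesis), so the pattern fails when the address sits at the very start or end of the string.
Using [^(\d)]* makes that character optional, but be aware that the pattern then also matches 20.20.20.200, so the digit check has to be enforced some other way.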
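One way to require "no digit on either side" without consuming a character is to use lookarounds, which also work at the start and end of the string. Here is a minimal Python sketch, assuming the regex engine in use supports lookbehind and lookahead; contains_ip is a made-up helper name.

```python
import re


def contains_ip(text: str, ip: str = "20.20.20.20") -> bool:
    """True if `ip` occurs in `text` and is not part of a longer number."""
    # re.escape makes the dots literal. (?<!\d) asserts there is no digit
    # just before the match and (?!\d) asserts there is none just after.
    pattern = r"(?<!\d)" + re.escape(ip) + r"(?!\d)"
    return re.search(pattern, text) is not None


cases = [
    "20.20.20.20",
    "[http://20.20.20.20/abcd]",
    "20.20.20.200",
    "[http://20.20.20.200/abcd]",
    "[http://120.20.20.20/abcd]",
]
for case in cases:
    print(contains_ip(case), case)
```

The first two cases from the question match and the last three do not.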