filter post regex in sql - sql

Q1: I am trying to capture patterns like abc-12345 with a regex and am using
'[aA-zZ]+\-[0-9]+'
Most results are correct, but a few come back with a leading '[', like '[abc-57489'. What's the best way to fix the column in SQL to remove the '['?
Q2: To capture more scenarios, I am doing:
coalesce(regexp_extract(column1,'[aA-zZ]+\-[0-9]+'),
coalesce(regexp_extract(column1,'[aA-zZ]+\- [0-9]+'),
coalesce(regexp_extract(column1,'[aA-zZ]+\ - [0-9]+'),
coalesce(regexp_extract(column1,'[aA-zZ]+\ -[0-9]+'),'')))) as columnoneadjusted,
How can I filter out items, after the regex, that don't contain 'abc'?

I found a simpler answer.
coalesce(regexp_extract(column1,'(abc)+\-[0-9]+'),
coalesce(regexp_extract(column1,'(abc)+\- [0-9]+'),
coalesce(regexp_extract(column1,'(abc)+\ - [0-9]+'),
coalesce(regexp_extract(column1,'(abc)+\ -[0-9]+'),'')))) as columnoneadjusted,
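
The stray '[' in Q1 most likely comes from the character class itself: in ASCII, a range that starts at uppercase A and ends at lowercase z also covers the six characters that sit between Z and a, namely [ \ ] ^ _ and the backtick. Below is a minimal Python sketch of that effect, plus one pattern that folds the four spacing variants into a single expression. Python's re module stands in for the SQL engine here, and the sample strings are made up for illustration.

```python
import re

# [aA-zZ] contains the range A-z (ASCII 65-122), which also includes [ \ ] ^ _ `
loose = re.compile(r'[aA-zZ]+\-[0-9]+')
# [A-Za-z] lists the two letter ranges separately, so punctuation is excluded
strict = re.compile(r'[A-Za-z]+-[0-9]+')
# \s* on both sides of the hyphen covers "abc-1", "abc- 1", "abc - 1", "abc -1"
abc_only = re.compile(r'abc\s*-\s*[0-9]+')

sample = 'ref [abc-57489 closed'  # hypothetical input value
loose_match = loose.search(sample).group(0)
strict_match = strict.search(sample).group(0)

variants = ['abc-12', 'abc- 12', 'abc - 12', 'abc -12', 'xyz-12']
abc_hits = [v for v in variants if abc_only.search(v)]

print(loose_match)   # includes the leading '['
print(strict_match)  # letters, hyphen, digits only
print(abc_hits)      # the four 'abc' variants, not 'xyz-12'
```

Whether the SQL engine in use accepts \s inside regexp_extract is an assumption; if it does not, a literal optional space (' ?' or ' *') is the usual substitute.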

Related

URL parsing in SQL

I have inconsistent URLs in one of my tables.
The samples look like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL, "blue" is the expected result, although the host contains two candidate names, blue and decibal.
For the second one it is google.
For the third it is again google.
My requirement is to parse the URL and match it against a lookup table of domain names, which contains blue, google, bing, etc.
However, the inconsistency of the URLs stored in the DB is a challenge. I need to write SQL that can identify the match and, if there are two domains, pick just the first one. The URL may be malformed and is not expected to follow a standard form.
I'd appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
SELECT REPLACE(url,'3A%','//') as url
FROM (VALUES
('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) as x (url)
)
SELECT split(split_part(split_part(url,'//',2),'/',1),'.') as url_array,
array_construct('google') as google_array,
array_construct('decibal') as decibal_array,
array_construct('bing') as bing_array,
CASE WHEN arrays_overlap(url_array,google_array) THEN 'GOOGLE'
WHEN arrays_overlap(url_array,decibal_array) THEN 'DECIBAL'
WHEN arrays_overlap(url_array,bing_array) THEN 'BING' END as domain_match
FROM x;
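
For comparison, the same split-and-match idea can be sketched in Python. This is only an illustration under assumptions: the lookup list below is hypothetical (the question keeps it in a lookup table), and the '3A%' repair simply mirrors the REPLACE call in the query above.

```python
import re

# Hypothetical lookup list; in the question this lives in a lookup table.
KNOWN_DOMAINS = ['blue', 'google', 'bing']

def first_known_domain(url, known=KNOWN_DOMAINS):
    """Return the first host label found in `known`, or None."""
    # Repair the mangled scheme separator seen in the third sample URL
    fixed = url.replace('3A%', '//')
    # Keep what follows '//' (or the whole string if there is no scheme)
    rest = fixed.split('//', 1)[-1]
    # The host ends at the first '/', '?' or '#'
    host = re.split(r'[/?#]', rest, maxsplit=1)[0]
    # Walk the labels left to right so the first known one wins
    for label in host.lower().split('.'):
        if label in known:
            return label
    return None

urls = [
    'https://blue.decibal.com.au/Transact?pi=9024&pai=2',
    'https://www.google.com/Transact?pi=9024&pai=2',
    'https3A%google.com/Transact?pi=9024&pai=2',
]
matches = [first_known_domain(u) for u in urls]
print(matches)
```

Note one difference from the CASE expression above: there the winner is decided by the order of the WHEN branches, whereas this sketch picks whichever known label appears first in the host, which is what the question asks for.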

Using regexp in BigQuery to extract URLs

I've been trying to extract any URL present within my 'Text' column in BigQuery. The column contains a mixture of text and URLs dotted throughout (a cell might contain more than one URL). I'm trying to use this regexp:
SELECT
REGEXP_EXTRACT (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files
I currently get 'failed to parse regular expression' when I try to run the query. I've tried modifying it but to no avail.
The regexp works in an online builder, but I'm just not sure how to incorporate it into BigQuery.
Any help would be much appreciated, or at least pointers on how to incorporate regular expressions into BigQuery!
Try the query below. It is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL).
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id,
REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
This gives you output with the id field and a repeated field holding all the URLs found in each row.
If you need a flattened result, you can use the variation below:
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id, URL
FROM (
SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
), UNNEST(URL) as URL
Note: you can use any regexp you find on the web here, with one requirement: only one capturing group is allowed. All inner groups should therefore be made non-capturing with ?:, as in the examples above. The ONLY group that you expect to see in the output should be left as is, without ?:.
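
The effect of a stray capturing group is easy to see outside BigQuery as well. The sketch below uses Python's re.findall as an analogy only (it is not BigQuery, and the sample text and simplified pattern are made up): when the pattern contains a capturing group, the result is the group's text instead of the whole match.

```python
import re

text = 'see http://a.example/x and https://b.example/y'  # made-up sample

# The inner group captures, so findall returns only that group's text
with_capture = re.findall(r'(https?)://[\w./-]+', text)
# (?:...) groups without capturing, so findall returns each whole match
without_capture = re.findall(r'(?:https?)://[\w./-]+', text)

print(with_capture)
print(without_capture)
```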
Your regex has an incomplete capturing group, and has 2 unescaped characters. I don't know which online regex builder you're using, but maybe you forgot to put your new regex into it?
The problems are as follows:
(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE --->                             ^                     ^^
                                                                   1                     2
1. This is the start of a capturing group with no end. You probably want the ) right before the *.
2. All slashes need to be escaped. This should probably be \/ or maybe even \/\\.
Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1
Good luck fixing it!
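
The unterminated group can be reproduced with any regex engine. Here is a small check in Python, used only as a stand-in for BigQuery; the `compiles` helper is a made-up name for this sketch. In Python's re the unescaped slashes are accepted, so only the missing parenthesis causes the failure here; tools that use / as a pattern delimiter may behave differently.

```python
import re

# The pattern from the question, copied as posted
broken = r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*'
# Close the final group just before the trailing '*', as the answer suggests
fixed = broken[:-1] + ')*'

def compiles(pattern):
    """Return True if `pattern` is accepted by Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(broken))
print(compiles(fixed))
```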

Product Index Using Django ORM

I have a list of Products with a field called 'Title', and I have been trying to get a list of initial letters without much luck. The closest I have is the following, which doesn't work because 'distinct' fails:
atoz = Product.objects.all().only('title').extra(select={'letter': "UPPER(SUBSTR(title,1,1))"}).distinct('letter')
I must be going wrong somewhere,
I hope someone can help.
You can compute it in Python after fetching the queryset, which is trivial:
products = Product.objects.values_list('title', flat=True).distinct()
atoz = set([i[0] for i in products])
If you are using MySQL, I found another answer useful, although it uses raw SQL (executed directly from Django):
SELECT DISTINCT LEFT(title, 1) FROM product;
The best answer I could come up with, which isn't 100% ideal as it requires post-processing, is this:
atoz = sorted(set(Product.objects.all().extra(select={'letter': "UPPER(SUBSTR(title,1,1))"}).values_list('letter', flat=True)))
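
The post-processing step can be isolated into a small function that does not depend on Django. This is a sketch under assumptions: `initial_letters` is a made-up helper name, and the sample list stands in for what `Product.objects.values_list('title', flat=True)` would return.

```python
def initial_letters(titles):
    """Return the sorted, de-duplicated uppercase first characters of `titles`."""
    # Skip empty or None titles so indexing never fails
    return sorted({t[0].upper() for t in titles if t})

# Hypothetical titles standing in for the queryset result
sample_titles = ['apple', 'Avocado', 'banana', '', 'cherry', 'Blueberry']
atoz = initial_letters(sample_titles)
print(atoz)
```

Upper-casing in Python rather than in SQL keeps the query portable across databases, at the cost of fetching every title.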

Unable to query using 4 conditions with WHERE clause

I am trying to query a database to obtain rows that match 4 conditions.
The code I'm using is the following:
$result = db_query("SELECT * FROM transportesgeneral WHERE CiudadOrigen LIKE '$origen%' AND DepartamentoOrigen LIKE '$origendep' AND DepartamentoDestino LIKE '$destinodep' AND CiudadDestino LIKE '$destino%'");
But it is not working. However, when I try it with only 3 conditions, i.e.:
$result = db_query("SELECT * FROM transportesgeneral WHERE CiudadOrigen LIKE '$origen%' AND DepartamentoOrigen LIKE '$origendep' AND DepartamentoDestino LIKE '$destinodep'");
It does work. Any idea what I'm doing wrong? Or is it not possible at all?
Thank you so much for your clarification smozgur.
Apparently this was the problem:
I was querying the database with a word that contains an accent mark (tilde), "Petén". I changed the database entry to the same word without the accent, "Peten", and it worked.
Now, I'm not sure why it does not accept the accent, but that was the problem.
If you have any ideas on how I can use the accented form, I would appreciate that very much.
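
The thread does not establish why the accented value failed. As a point of reference, the sketch below shows a four-condition LIKE query with bound parameters matching an accented value. It is written in Python with an in-memory SQLite database rather than the PHP setup from the question, and the row values are made up; only the table and column names come from the original query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE transportesgeneral ('
    'CiudadOrigen TEXT, DepartamentoOrigen TEXT, '
    'DepartamentoDestino TEXT, CiudadDestino TEXT)'
)
# Made-up rows; only the first has the accented department name
conn.executemany(
    'INSERT INTO transportesgeneral VALUES (?, ?, ?, ?)',
    [('Flores', 'Petén', 'Guatemala', 'Mixco'),
     ('Flores', 'Peten', 'Guatemala', 'Mixco')],
)

def find_routes(origen, origendep, destinodep, destino):
    """Run the four-condition query with all values bound as parameters."""
    sql = ('SELECT * FROM transportesgeneral '
           'WHERE CiudadOrigen LIKE ? AND DepartamentoOrigen LIKE ? '
           'AND DepartamentoDestino LIKE ? AND CiudadDestino LIKE ?')
    # The % wildcard is appended in Python, not spliced into the SQL text
    params = (origen + '%', origendep, destinodep, destino + '%')
    return conn.execute(sql, params).fetchall()

rows = find_routes('Flo', 'Petén', 'Guatemala', 'Mix')
print(rows)
```

Binding the values also avoids building the SQL string from user input, which the interpolated query in the question does.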

SQL Syntax and Rails: how to generate list of db records with trailing whitespace in console

I'm in the Rails console and I want to generate a list of user names that have trailing whitespace. I was thinking that the syntax would look like this, but it didn't work. Any chance a better programmer than me can point out what I'm doing wrong?
> User.name.where("% ")
Don't know if you're using MySQL, but an approach would be:
User.where("name LIKE '% '")
You may change this according to your database. This is kinda slow, though.
One way is
Job.all.select { |j| j.invoice_number =~ /^\d+$/ }
but it may not be as efficient as the MySQL version.
Another possibility is to use a named scope to hide the ugly SQL:
named_scope :all_digits, lambda { |regex_str|
{ :conditions => ["invoice_number REGEXP ?", regex_str] }
}
Then you have Job.all_digits(regex_str).
Answer taken from How to specify Ruby regex when using Active Record in Rails?
You can have
regex_str = '\w+\s+$' # single quotes keep the backslashes intact
Thanks
Here's what I went with:
User.all.select { |c| c.name.end_with?(" ") }
This got me the list I needed.
It's based on Paritosh's first answer. I'm making his the canonical answer because I think it's a better resource in general. My solution only helps me, but his has a lot of strategies that would be helpful.
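
One detail worth noting: both LIKE '% ' and end_with?(" ") look for a literal space only. If tabs or newlines should count as trailing whitespace too, the in-memory check can be widened. A minimal Python sketch of that broader test, with a made-up helper name and hypothetical user names:

```python
def has_trailing_whitespace(name):
    """True when `name` ends in any whitespace character."""
    # rstrip() drops trailing spaces, tabs and newlines; a change means there were some
    return name != name.rstrip()

names = ['alice', 'bob ', 'carol\t', ' dave']  # hypothetical user names
flagged = [n for n in names if has_trailing_whitespace(n)]
print(flagged)
```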