How do I check that a text consists of ASCII characters in PostgreSQL? - sql

I want to select only the text that consists of ASCII character values.
e.g.
"Grey's Anatomy : Station 19"
"Trésors sous les mers"
"Les Légendes des Studios Marvel"
"The Great North"
"Solar Opposites"
From the titles above, I want to select only:
"Grey's Anatomy : Station 19"
"The Great North"
"Solar Opposites"
How do I filter them in PostgreSQL?

You could use regex matching:
select * from titles where title ~ '^[[:ascii:]]+$';
Example http://sqlfiddle.com/#!17/2402a1/8
https://www.postgresql.org/docs/current/functions-matching.html
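The same `[[:ascii:]]` check can be sketched outside the database too. Here is a minimal Python illustration (not PostgreSQL itself) of what the pattern does: every character must fall in the 0-127 code point range.

```python
import re

titles = [
    "Grey's Anatomy : Station 19",
    "Trésors sous les mers",
    "Les Légendes des Studios Marvel",
    "The Great North",
    "Solar Opposites",
]

# Equivalent of PostgreSQL's '^[[:ascii:]]+$': the whole string
# must consist of characters in the 0x00-0x7F range.
ascii_only = [t for t in titles if re.fullmatch(r'[\x00-\x7f]+', t)]
```

Titles containing accented characters such as "é" fail the match and are filtered out.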

Related

Alternate workaround for lazy regular expressions in Snowflake, since this feature is not available in Snowflake

I am trying to parse the "name" and "address" from a string. I have written a regex pattern which works perfectly fine (I verified it on regex101.com) with the 'ungreedy/lazy' regex feature, but not with greedy matching. Here is my Snowflake query:
select
TRIM(REGEXP_SUBSTR(column1,'(^\\D*)((\\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\\d+.*)|(\\d+.*))[,.\\s]+([a-zA-Z]{2})[,.\\s]+(\\d{5}|\\d{5}-\\d{4})$',1,1,'is',1)) as test
from values(TRIM('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'));
--please ignore the latter part of the regex, as I am also fetching the territory code and zip code there and they work fine.
The above query returns "FIRST SECOND THIRD PO BOX"
And if I return the 2nd group, it returns "123 DUMMY"
What I want:
case 1 - when my string is 'FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD"
output of 2nd group: "PO BOX 123 DUMMY"
case 2 - WHEN my string is 'FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD FOURTH FIFTH"
output of 2nd group: "123 DUMMY"
Please suggest a workaround in Snowflake, since it doesn't have a lazy-matching feature.
PS. If you want to verify this in regex101, paste the code and test string below. You will see the result when you switch to Ungreedy.
(^\D*)((\bP[OST][ .]O[FFICE][ .]B[OX][ .]\d+.)|(\d+.))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$
Test String: FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789
Thanks
Writing a JavaScript UDF is always an option, and then you can use your regex unchanged:
create or replace function parse_address(F STRING)
returns VARIANT
language JAVASCRIPT
immutable
as $$
const regex = /(^\D*)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$/gm;
let m = regex.exec(F);
// Guard against non-matching input instead of throwing on m[1]
return m ? [m[1], m[2]] : null;
$$;
Usage:
select parse_address($1)
from values('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')
, ('FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789')
;
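The Ungreedy behaviour from regex101 can also be reproduced by making each quantifier explicitly lazy, as the JavaScript UDF's engine allows. Here is a small Python sketch (Python's `re` supports lazy quantifiers, unlike Snowflake's POSIX-style matching) showing both cases from the question:

```python
import re

# Same pattern as the question, but with \D* and .* made lazy
# (\D*?, .*?) so the name/address split lands where intended.
pattern = re.compile(
    r'(^\D*?)'                                           # group 1: leading names
    r'((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*?)'  # group 2: PO Box address...
    r'|(\d+.*?))'                                        # ...or plain street number
    r'[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$'   # territory code and zip
)

def parse(s):
    m = pattern.search(s)
    return (m.group(1).strip(), m.group(2).strip()) if m else None
```

`parse('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')` yields `('FIRST SECOND THIRD', 'PO BOX 123 DUMMY')`, matching case 1.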

OpenRefine rearrange value

In a CSV column I have this data:
My Dog (101)
ACat(f023.12)
My Dog (101)
ACat ad
I'd like to rearrange them like:
101, My Dog ()
f023.12, ACat()
101, My Dog ()
To match them I could use a simple regex like (.* ?)\((.*)\) (the last row would be kept untouched): https://regex101.com/r/ivrIa3/1
Is there an easier way of doing this than:
if(value.contains(/(.* ?)\((.*)\)/), value.match(/(.* ?)\((.*)\)/)[1] + ', ' + value.match(/(.* ?)\((.*)\)/)[0], value)
In OpenRefine, the easiest way would be to use a facet (like the « Text filter ») to select the lines that contain (…).
Then, use the column command « Edit cells -> Replace ».
Find: (.*)\s*\((.*)\)
Replace: $1, $2
Regards,
Antoine
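The Find/Replace step above can be sketched in plain Python (OpenRefine uses Java regex, but the behaviour is the same here). One small tweak to the answer's pattern: making the first group lazy (`.*?`) lets `\s*` absorb the trailing space, so the captured name has no dangling blank. Rows without parentheses pass through unchanged because no substitution occurs.

```python
import re

rows = ["My Dog (101)", "ACat(f023.12)", "My Dog (101)", "ACat ad"]

# Find: (.*?)\s*\((.*)\)   Replace: \1, \2
# Lazy first group so the space before '(' is not captured.
fixed = [re.sub(r'(.*?)\s*\((.*)\)', r'\1, \2', r) for r in rows]
```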

sql substr variable url extraction process

Context:
Until now I have used regexp in SQL to extract variable URLs. I find it very slow and want to optimize it using the substr and instr commands. That's important for me because, as I'm new to SQL, it helps me become more familiar with such commands.
database:
My DB is made of posts extracted from social platforms. The text field is called "titre". It contains variable URLs in different formats: www, http, https. I want to create a table or a table view (I'm not fixed on either) containing those URLs and the related id_post.
My work:
I have noticed that a URL always ends with a blank space, something like: "toto want to share with you this www.example.com in his post"
Here is what I've done so far:
--- length of the string starting at https
select LENGTH(substr(titre, INSTR(titre,'https:'))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- length of the string starting at the blank
select LENGTH(substr(titre, INSTR(titre,' ', 171))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- difference, to get the length of the URL string
select LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', 171))) as longueur_url from post_categorised_pages where id_post = '280853248721200_697941320345722';
---url
select substr(titre, 171, 54)from post_categorised_pages where id_post = '280853248721200_697941320345722';
Question:
How can I automate that over the whole table "post_categorised_pages"?
Can I introduce CASE WHEN statements to take https, http or www. into account, and how can I do that?
Thanks a lot!!!!
Maybe, instead of the "HTTP", "HTTPS" or "WWW" string, you would need the name of a column.
In this case it would probably be helpful to have a definition table where all possible sources are defined. This table would have 2 columns (ID and source_name).
Then, in your post_categorised_pages table, also insert the source of the message (the ID value).
Then, in the query, join with this definition table by ID and, instead of
select substr(titre, INSTR(titre,'https:'), (LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,'https:')))))))from post_categorised_pages where id_post = '280853248721200_697941320345722';
to have
select substr(titre, INSTR(titre,"definition table".source_name), (LENGTH(substr(titre, INSTR(titre,"definition table".source_name))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,"definition table".source_name)))))))from post_categorised_pages where id_post = '280853248721200_697941320345722';
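The substr/instr arithmetic above is easier to see outside SQL. Here is a plain-Python sketch of the same logic: find the URL prefix, then the next blank, and slice between the two. The prefix list and the `titre` name follow the question; treating `https://` rather than `https:` as the marker is my assumption.

```python
# find() plays the role of INSTR (returning -1 instead of 0 on no match),
# and slicing plays the role of SUBSTR.
def extract_url(titre, prefixes=('https://', 'http://', 'www.')):
    for p in prefixes:                      # CASE WHEN https / http / www.
        start = titre.find(p)               # INSTR(titre, p)
        if start != -1:
            end = titre.find(' ', start)    # INSTR(titre, ' ', start)
            return titre[start:] if end == -1 else titre[start:end]
    return None
```

Applied row by row over post_categorised_pages, this is what a CASE WHEN over the three prefixes would compute.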
OK guys, here is the solution I have found (there is still one mistake, see the end of the post).
I use two views to finally extract my strings.
The first view is created by a CONNECT BY query:
--- create intermediate table view with targeted pattern position
create or replace view Start_Position_Index as
with "post" as
(select id, text from "your_table" where id= 'xyz')
select id, instr(text,'#', 1, level) as position, text
from post
connect by level <= regexp_count(text, '#');
then
--- create working table view with full references, the blank position for each pattern match, and the string length of each one
create or replace view _#_index as
select id, position as hashtag_pos, INSTR(text,' ', position) as blank_position, INSTR(text,' ', position) - position as string_length, text
from Start_Position_Index;
At the end you will be able to retrieve the hashtags (in this case) you were looking for in your string.
OK, so the mistakes:
- if the pattern you are looking for is at the end of the string, it will retrieve a null value, because there is no blank space after it (as it sits at the end of the string).
- it is not well optimized, because here I am working with views and not tables. I think using tables would be faster.
But I'm pretty sure there are lots of things to do to optimize this code... any ideas? The challenge was how to extract a specific pattern recursively among strings without using costly regex and without using PL/SQL stuff. What do you think of that?
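A procedural sketch of what the two views compute together (not Oracle itself): for every '#', the start position, the next blank, and the substring in between. Falling back to the string length when no blank follows also fixes the end-of-string null issue mentioned above.

```python
# Mirrors the view logic: position of each '#', position of the next
# blank, and the slice between them. len(text) substitutes for the
# missing blank at the end of the string.
def hashtags(text):
    tags = []
    pos = text.find('#')
    while pos != -1:
        blank = text.find(' ', pos)
        end = len(text) if blank == -1 else blank
        tags.append(text[pos:end])
        pos = text.find('#', end)
    return tags
```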
How about using Oracle full-text search?
This will index all the words in the column and will provide the hashtags or web addresses, as both are written as one word, with no space in between.

Issues with JSON_EXTRACT in Presto for keys containing ' ' character

I'm using Presto (0.163) to query data and am trying to extract fields from a JSON.
I have a JSON like the one given below, which is present in the column 'style_attributes':
{
"attributes": {
"Brand Fit Name": "Regular Fit",
"Fabric": "Cotton",
"Fit": "Regular",
"Neck or Collar": "Round Neck",
"Occasion": "Casual",
"Pattern": "Striped",
"Sleeve Length": "Short Sleeves",
"Tshirt Type": "T-shirt"
}
}
I'm unable to extract the field 'Sleeve Length' ("Short Sleeves").
Below is the query I'm using:
Select JSON_EXTRACT(style_attributes,'$.attributes.Sleeve Length') as length from table;
The query fails with the following error: Invalid JSON path: '$.attributes.Sleeve Length'
For fields without a ' ' (space), the query runs fine.
I tried to find the resolution in the Presto documentation, but with no success.
presto:default> select json_extract_scalar('{"attributes":{"Sleeve Length": "Short Sleeves"}}','$.attributes["Sleeve Length"]');
_col0
---------------
Short Sleeves
or
presto:default> select json_extract_scalar('{"attributes":{"Sleeve Length": "Short Sleeves"}}','$["attributes"]["Sleeve Length"]');
_col0
---------------
Short Sleeves
JSON Function Changes
The json_extract and json_extract_scalar functions now support the square bracket syntax:
SELECT json_extract(json, '$.store[book]');
SELECT json_extract(json, '$.store["book name"]');
As part of this change, the set of characters allowed in a non-bracketed path segment has been restricted to alphanumerics, underscores and colons. Additionally, colons cannot be used in an un-quoted bracketed path segment. Use the new bracket syntax with quotes to match elements that contain special characters.
https://github.com/prestodb/presto/blob/c73359fe2173e01140b7d5f102b286e81c1ae4a8/presto-docs/src/main/sphinx/release/release-0.75.rst
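The underlying issue is path syntax, not the data: a dotted path like `$.attributes.Sleeve Length` cannot address a key containing a space, while an explicit key lookup (which is what the bracket syntax does) can. A small Python sketch of the distinction, not Presto itself:

```python
import json

# The bracket syntax $["attributes"]["Sleeve Length"] corresponds to
# plain key lookups, which handle spaces in key names without trouble.
doc = '{"attributes": {"Sleeve Length": "Short Sleeves"}}'
value = json.loads(doc)["attributes"]["Sleeve Length"]
```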
SELECT
tags -- It is column with Json string data
,json_extract(tags , '$.Brand') AS Brand
,json_extract(tags , '$.Portfolio') AS Portfolio
,cost
FROM
TableName
Sample data for tags - {"Name": "pxyblob", "Owner": "", "Env": "prod", "Service": "", "Product": "", "Portfolio": "OPSXYZ", "Brand": "Limo", "AssetProtectionLevel": "", "ComponentInfo": ""}
Here is the correct answer.
Let Say:
JSON : {"Travel Date":"2017-9-22", "City": "Seattle"}
Column Name: ITINERARY
And I want to extract 'Travel Date' from this JSON; then:
Query: SELECT JSON_EXTRACT(ITINERARY, "$.\"Travel Date\"") from Table
Note: Just add \" at the start and end of the key name.
Hope this works for your need. :)

FullText Contains "AND DO"

I found a weird issue with a query that uses a full-text index.
The following query
#1 SELECT * FROM tbparticipant where contains([FullTextQuery],'ALINE AND NASCIMENTO')
returns
ALINE DO NASCIMENTO
ALINE QUEIROZ DO NASCIMENTO
ALINE NASCIMENTO DE SOUZA
ALINE CORREIA DO NASCIMENTO
But this query
#2 SELECT * FROM tbparticipant where contains([FullTextQuery],'ALINE AND DO')
returns nothing.
I thought it might be a problem with "DO" being too short, but this query
#3 SELECT * FROM tbparticipant where contains([FullTextQuery],'ALINE AND DE')
returns
ALINE NASCIMENTO DE SOUZA
So, what's wrong with the query #2?
"Do" is on the stopword list - stopwords are words considered too common or too short to have any significant meaning for full-text queries. You can list the stopwords for the English language like this:
select * from sys.fulltext_system_stopwords where language_id = 1033
Reference:
http://msdn.microsoft.com/en-us/library/ms142551.aspx
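A toy illustration of why query #2 returns nothing: the full-text engine drops stopwords when building the index, so 'DO' never appears in it and the AND condition cannot be satisfied. The stopword set below is hypothetical (the real one comes from sys.fulltext_system_stopwords); note it includes 'do' but not 'de', matching the behaviour of queries #2 and #3.

```python
# Hypothetical stopword set for illustration only.
STOPWORDS = {'do', 'a', 'the', 'and'}

def index_tokens(text):
    # Stopwords are removed at indexing time, like a full-text index does.
    return [w for w in text.lower().split() if w not in STOPWORDS]

def contains_all(text, *terms):
    # Rough analogue of CONTAINS(col, 'a AND b'): every term must be
    # present among the indexed tokens.
    toks = index_tokens(text)
    return all(t.lower() in toks for t in terms)
```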