sql substr variable url extraction process

Context:
Until now I have used regexp in SQL to extract variable URLs. I find it very slow and want to optimize it using the substr and instr functions. This matters to me because I am new to SQL and it helps me become more familiar with such functions.
Database:
My database consists of posts extracted from social platforms. The text column is called "titre". It contains URLs in different formats: www, http, https. I want to create a table or a view (I have not decided yet) containing those URLs and the related id_post.
My work:
I have noticed that a URL always ends with a blank space, something like: "toto want to share with you this www.example.com in his post"
Here is what I have done so far:
--- length of the string starting at https
select LENGTH(substr(titre, INSTR(titre,'https:'))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- length of the string starting at the blank
select LENGTH(substr(titre, INSTR(titre,' ', 171))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- difference, giving the length of the URL string
select LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', 171))) as longueur_url from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- url
select substr(titre, 171, 54) from post_categorised_pages where id_post = '280853248721200_697941320345722';
Question:
How can I automate that over the whole table "post_categorised_pages"?
Can I use CASE WHEN statements to handle https, http or www, and how would I do that?
Thanks a lot!

Maybe, instead of the "HTTP", "HTTPS" or "WWW" string, you need the name of a column.
In that case it would probably help to have a definition table listing all possible sources. This table would have two columns (ID and source_name).
Then, in your post_categorised_pages table, also store the source of the message (the ID value).
Then, in the query, join with this definition table by ID and, instead of
select substr(titre, INSTR(titre,'https:'), (LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,'https:'))))))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
use
select substr(titre, INSTR(titre,"definition table".source_name), (LENGTH(substr(titre, INSTR(titre,"definition table".source_name))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,"definition table".source_name))))))) from post_categorised_pages where id_post = '280853248721200_697941320345722';

OK guys, here is the solution I have found (there are still a couple of mistakes, see the end of the post).
I use two views to finally extract my strings.
The first view is created with a CONNECT BY query:
--- create intermediate view with the position of each targeted pattern
create or replace view Start_Position_Index as
with post as
(select id, text from "your_table" where id = 'xyz')
select id, instr(text, '#', 1, level) as position, text
from post
connect by level <= regexp_count(text, '#');
Then:
--- create working view with full references: blank position and string length for each pattern match
create or replace view "_#_index" as
select id, position as hashtag_pos, INSTR(text,' ', position) as blank_position, INSTR(text,' ', position) - position as string_length, text
from Start_Position_Index;
In the end you can retrieve the hashtags (in this case) you were looking for in your string.
OK, so the mistakes:
- If the pattern you are looking for is at the end of your string, it returns a null value because there is no blank space after it.
- It is not well optimized because I am working with views and not tables. I think using tables would be faster.
But I am pretty sure there are lots of things to do to optimize this code. Any ideas? The challenge was how to extract a specific pattern repeatedly from strings without using costly regex and without PL/SQL. What do you think?
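To make the logic of the two views easier to follow outside the database, here is a minimal Python sketch of the same INSTR/SUBSTR idea. It is not Oracle code and not the poster's implementation: the function name extract_tokens and its prefixes parameter are hypothetical, and str.find stands in for INSTR. It also shows one possible way to handle the end-of-string case listed above, by falling back to the length of the string when no blank follows the match.

```python
def extract_tokens(text, prefixes=("https:", "http:", "www.")):
    """Return every blank-delimited token that starts with one of the prefixes.

    Hypothetical helper: str.find plays the role of INSTR, slicing the role of SUBSTR.
    """
    found = []
    pos = 0
    while pos < len(text):
        # Like INSTR(text, prefix, pos) for each prefix; keep the earliest hit.
        starts = [i for i in (text.find(p, pos) for p in prefixes) if i != -1]
        if not starts:
            break
        start = min(starts)
        # Like INSTR(text, ' ', start); -1 means there is no blank after the match.
        end = text.find(" ", start)
        if end == -1:
            end = len(text)  # the pattern sits at the end of the string
        found.append(text[start:end])
        pos = end  # continue searching after this token
    return found


urls = extract_tokens("toto want to share with you this www.example.com in his post")
tags = extract_tokens("a #one b #two", prefixes=("#",))
```

Taking the earliest match among the prefixes plays the role that a CASE WHEN over https, http and www would play in SQL.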

How about using Oracle full-text search?
It indexes all the words in the column and will give you the hashtags or web addresses, since both are written as a single word with no space in between.

Related

UTF8 changed to Latin1 - Umlauts are not considered

I am currently updating a table with an UPDATE statement in which a piece of text is also read from another table using a substring. The statement works fine so far. The only problem is the umlauts, which are not taken into account during the update. As I found out, for some reason the substring is converted to latin1, although the corresponding column of the table (aktion) is set to utf8. The update statement is below.
SQL:
update vms_vertrag_datei d
inner join vms_vertrag_verlauf v ON d.vertrag = v.vertrag
SET d.nutzer = v.nutzer, d.uploaddatum= v.timestamp
WHERE d.filename in (SELECT DISTINCT SUBSTRING(v.aktion, LOCATE('"',v.aktion)+1, (((LENGTH(v.aktion))-LOCATE('"', REVERSE(v.aktion))-1)-LOCATE('"',v.aktion))) FROM vms_vertrag_verlauf v)
AND v.aktion like 'Datei%hinzugefügt';
Does anyone know how I can make this work for text with umlauts as well? After lengthy online research I am close to despair.

How to get From & To Ip Address from CIDR BigQuery

BigQuery provides an updated geoip2 public dataset [bigquery-public-data -> geolite2 -> ipv4_city_blocks], which contains a network column with IPv4 CIDR values.
How do I convert the CIDR values in the network column into start and end IP address values using BigQuery SQL (and not a utility outside BigQuery), so that I can check whether an IP address is within a range or not? It would be helpful if you could provide the query that obtains the IP range for a CIDR value in the table.

Below is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION cidrToRange(CIDR STRING)
RETURNS STRUCT<start_IP STRING, end_IP STRING>
LANGUAGE js AS """
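// address part of the CIDR, i.e. everything before the '/'
var beg = CIDR.substr(0, CIDR.indexOf('/'));
// number of addresses in the block minus one, from the prefix length after the '/'
var off = (1<<(32-parseInt(CIDR.substr(CIDR.indexOf('/')+1))))-1;
var sub = beg.split('.').map(function(a){return parseInt(a)});
var buf = new ArrayBuffer(4);
var i32 = new Uint32Array(buf);
// start address as a 32-bit integer, plus the offset to the last address
i32[0] = (sub[0]<<24) + (sub[1]<<16) + (sub[2]<<8) + (sub[3]) + off;
// reverse the bytes (little-endian storage assumed) to get dotted order
var end = Array.apply([],new Uint8Array(buf)).reverse().join('.');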
return {start_IP: beg, end_IP: end};
""";
SELECT network, IP_range.*
FROM `bigquery-public-data.geolite2.ipv4_city_blocks`,
UNNEST([cidrToRange(network)]) IP_range
It took about 60 seconds to process all 3,037,858 rows.
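For readers who want to check the arithmetic outside BigQuery, the same CIDR-to-range computation can be sketched in Python. This is only an illustration of what the UDF above computes, not a replacement for it: the function names cidr_to_range and ip_in_cidr are made up for the example, and only the standard ipaddress module is used.

```python
import ipaddress


def cidr_to_range(cidr):
    """Return (start_ip, end_ip) as dotted strings for an IPv4 CIDR block."""
    base, prefix = cidr.split("/")
    # Number of addresses in the block minus one, e.g. 255 for a /24.
    host_mask = (1 << (32 - int(prefix))) - 1
    # Clear the host bits to get the first address, then set them for the last.
    start = int(ipaddress.IPv4Address(base)) & ~host_mask & 0xFFFFFFFF
    end = start | host_mask
    return str(ipaddress.IPv4Address(start)), str(ipaddress.IPv4Address(end))


def ip_in_cidr(ip, cidr):
    """Check whether a dotted IPv4 address falls inside the CIDR block."""
    start, end = cidr_to_range(cidr)
    value = int(ipaddress.IPv4Address(ip))
    return int(ipaddress.IPv4Address(start)) <= value <= int(ipaddress.IPv4Address(end))


example_range = cidr_to_range("192.168.1.0/24")
```

Once the start and end values are available as integers, the range check is a plain comparison, which is what the question asks for.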

This query will do the job:
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
)
SELECT city_name, SUM(c) c, ST_GeogPoint(AVG(longitude), AVG(latitude)) point
FROM (
SELECT ip, city_name, c, latitude, longitude, geoname_id
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
)
WHERE city_name IS NOT null
GROUP BY city_name, geoname_id
ORDER BY c DESC
LIMIT 5000
Find more details on:
https://towardsdatascience.com/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
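The core idea of the query above is to mask each IP address with every candidate prefix length and look for an exact match on (network_bin, mask), instead of comparing against start/end ranges. A rough Python sketch of that idea follows. The names build_index and lookup and the sample data are hypothetical, a dictionary stands in for the joined table, and the loop tries the longest prefix first, which is a choice made for this example rather than something the SQL join does.

```python
import ipaddress


def build_index(cidr_to_city):
    """Key each block by (network address as int, prefix length), like (network_bin, mask)."""
    index = {}
    for cidr, city in cidr_to_city.items():
        net = ipaddress.ip_network(cidr)
        index[(int(net.network_address), net.prefixlen)] = city
    return index


def lookup(ip, index, min_prefix=9, max_prefix=32):
    """Mask the address with each prefix length and return the first exact match."""
    addr = int(ipaddress.IPv4Address(ip))
    for prefix in range(max_prefix, min_prefix - 1, -1):
        # Same role as NET.IP_NET_MASK(4, mask) in the query.
        mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
        city = index.get((addr & mask, prefix))
        if city is not None:
            return city
    return None


# Made-up sample blocks, only to exercise the functions.
sample_index = build_index({"192.168.1.0/24": "A", "10.0.0.0/9": "B"})
```

Each lookup costs at most 24 dictionary probes here, which mirrors why the equality join scales better than a range comparison.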

The first thing to check is whether such a function already exists, so please refer to the BigQuery Functions and Operators documentation.
If not, you need a Standard SQL user-defined function (UDF), which lets you create a function using another SQL expression or another programming language, such as JavaScript.
Keep in mind that with a JavaScript UDF, BigQuery initializes a JavaScript environment with the function's contents on every shard of execution. There is no optimization to avoid loading the environment, so it can slow down the query.
According to the GeoIP2 City and Country CSV Databases site, there is a utility to convert the 'network' column to start/end IPs or start/end integers. Refer to the GitHub site for details.

January 2023 solution
Just wanted to respond to Felipe's comment here. I'm not sure why he suggests an alternate solution using Snowflake, as his existing solution works just fine. The only difference is that you need to create the dataset yourself.
I managed to solve this by going through the exact same steps listed in Felipe's very helpful original blog article:
1. Sign up to MaxMind and download the GeoLite2 databases.
2. Download the two CSV files GeoLite2-City-Blocks-IPv4.csv and GeoLite2-City-Locations-en.csv, upload them to a GCP bucket, and create tables from them. I lazily used the BQ automated schema feature and it worked just fine :)
3. Create a geolite2_locs table using a query similar to the one below (just keep or drop columns as required for your use case):
CREATE OR REPLACE TABLE `dataset.geolite2_locs` OPTIONS() AS (
SELECT
ip_ref.network,
NET.IP_FROM_STRING(REGEXP_EXTRACT(ip_ref.network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(ip_ref.network, r'/(.*)' ) AS INT64) mask,
ip_ref.geoname_id,
city_ref.continent_name as continent_name,
city_ref.country_name as country_name,
city_ref.city_name as city_name,
city_ref.subdivision_1_name as subdivision_1_name,
city_ref.subdivision_2_name as subdivision_2_name,
ip_ref.latitude as latitude,
ip_ref.longitude as longitude,
FROM `geolite2`.`geolite2-ipv4` ip_ref LEFT JOIN `geolite2`.`geolite2-city-en` city_ref USING (geoname_id)
);
Adapt the query in Felipe's guide, or just replace fh-bigquery.geocode.201806_geolite2_city_ipv4_locs with your new table in his answer above.
It should take you at most an hour to get this going. Hope it helps.

Like Clause over an 'Element' - ORACLE APEX

I have run into a problem with APEX that I don't understand. Let me be specific.
I have a select list item (P11_URL) that retrieves the top 50 most liked URLs. It is populated from a view, TOp_Domains.
I created an item called "Context" that has to display every text containing the URL the user picked in the select list. Those texts come from another table, let's say "twitter_post".
I created a dynamic action (show only) with this SQL statement:
Select TXT, NB_RT, RANK
from myschema.twitter_post
where TXT like '%:P11_URL%'
group by TXT, NB_RT, RANK
... and it doesn't work. I think APEX doesn't like the LIKE clause, but I don't know what to do instead. Keep in mind that a URL may have been shared by multiple tweets, which is why this "Context" item is important to me.
I tried to work around the problem by building a static region (in French, Statique) and a dynamic action that refreshes it, but that doesn't work either.
TriX

Right-click on 'P11_URL' and create a dynamic action (DA). Event: Change, Item: P11_URL. As the true action of the DA, select 'Set Value'. Write your query in the SQL statement area. In 'Page Items to Submit', select 'P11_URL'. In 'Affected Items', select 'Context'.
The query should be:
Select TXT, NB_RT, RANK
from myschema.twitter_post
where TXT like '%' || :P11_URL || '%'
group by TXT, NB_RT, RANK
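The difference between the two WHERE clauses is that a bind name inside quotes is plain text, while concatenating it with || lets the database substitute its value. A small sketch of that behaviour follows, using Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example runs anywhere; the table contents are made up). SQLite also supports :name bind variables and the || operator, so the two statements behave the same way in this respect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter_post (txt TEXT)")
conn.executemany(
    "INSERT INTO twitter_post (txt) VALUES (?)",
    [("read this www.example.com now",), ("nothing to see",)],
)
params = {"P11_URL": "www.example.com"}

# Bind name inside the quotes: it is literal text, so nothing is substituted
# and the pattern searches for the characters ':P11_URL' themselves.
literal = conn.execute(
    "SELECT txt FROM twitter_post WHERE txt LIKE '%:P11_URL%'"
).fetchall()

# Bind variable outside the quotes, glued to the wildcards with ||.
concatenated = conn.execute(
    "SELECT txt FROM twitter_post WHERE txt LIKE '%' || :P11_URL || '%'",
    params,
).fetchall()
```

Here literal comes back empty while concatenated contains the row mentioning the URL.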

Thanks to @Madona, whose example made me realise my mistake. I am writing the answer here in case somebody else encounters the same problem.
A select list item takes two values: a display value (the one shown on screen) and a return value (the one, I think, that is passed on to dynamic actions). So to solve my problem I had to shape my SQL statement as:
select hashtags d, hashtags r
from my_table
order by 1
[let's say that in APEX this item is now called P1_HASHTAGS]
That was the first step.
In fact, the ranking I had put as the second value in my SQL statement was breaking the search in my WHERE ... LIKE clause. Well, I am a newbie!
The second step was to correctly write the SQL statement that receives the value from my select list (P1_HASHTAGS) in my interactive report, as shown here:
Select Id, hashtags
from my_table
where txt like '%'||:P1_HASHTAGS||'%'
And it works!
Thank you Madona, your example helped me figure out my mistakes!

PowerPivot - Dax Find Text in another Cell - Cannot get string value

I have some simple data and I want to return Home or Motor if a FIND search returns true.
For example:
Skills       Type
I Home Sr
A Mot Pre
Type is my custom column.
I started with just returning true, and it fails right away with
=FIND("Home",[Skills])
with the error:
Calculation error in column 'Table1'[]: The search Text provided to
function 'FIND' could not be found in the given text.
Ultimately I want: if FIND matches "Home", return Home; if it matches "Motor", return Motor.
Desired output (please note that the Skills values start in other ways too, so I cannot use a fixed search position in the text):
Skills       Type
I Home Sr    Home
A Mot Pre    Motor

Use the following expression for Type column:
=
IF (
IFERROR ( SEARCH ( "Mot*", [Skills], 1, 0 ), 0 ),
"Motor",
IF ( IFERROR ( SEARCH ( "Hom*", [Skills], 1, 0 ), 0 ), "Home", "Nothing" )
)
It will generate Home or Motor for any occurrence of Hom* and Mot*; note the * wildcard.
It should produce the desired output from the question.
I tested it in Power BI, but this solution works in PowerPivot too; I don't have access to PowerPivot at the moment.
Note that if no occurrence is found it puts "Nothing" in your Type column, so you can replace "Nothing" with whatever string you want to appear in that case.
See the SEARCH function documentation for details.
Let me know if this helps.

looking for db2 text function or method I can do a text contain rather than like

I'm looking for a DB2 function that does a text "contains" search. At present I am running the following query:
SELECT distinct
s.search_id,
s.search_heading,
s.search_url
FROM repman.search s, repman.search_tags st
WHERE s.search_id = st.search_id
AND ( UPPER(s.search_heading) LIKE (cast('%REPORT%' AS VARGRAPHIC(32)))
OR (UPPER(st.search_tag) LIKE cast('%REPORT%' AS VARGRAPHIC(32)))
)
ORDER BY s.search_heading;
This returns results.
But if I change the search text to %REPORTS% rather than %REPORT% (which I need to do), the LIKE search does not work and I get zero results.
I read a post that used a function named CONTAINS, as below, but when I try to use that function I get an error.
SELECT distinct
s.search_id,
s.search_heading,
s.search_url
FROM repman.search s, repman.search_tags st
WHERE s.search_id = st.search_id
AND CONTAINS(s.search_heading, 'REPORTS') = 1
Has anyone got any suggestions? I'm on DB2 version DB2/LINUXPPC 9.1.6.
Thanks

To look for a pattern in a string, you can use regular expressions. They have been built into DB2 through XQuery since DB2 v9. There are also other ways to do that. I wrote an article on my blog about regular expressions in DB2 (it is in Spanish, but you can translate it).
xmlcast(xmlquery('fn:matches(\$TEXT,''^[A-Za-z 0-9]*$'')')
