I need to clean up users' names by removing prefixes before I can process them.
For example, my list of prefixes is:
am, auf, auf dem, aus der, d, da, de, de l’, del, de la, de le, di, do, dos, du,
im, la, le, mac, mc, mhac, mhíc, mhic giolla, mic, ni, ní, níc, o, ó,
ua, ui, uí, van, van de, van den, van der, vom, von, von dem, von den, von der
I want to remove any of these prefixes from the First Name if they are present.
For example - inputs:
Outputs:
I know I can take a brute-force approach and run a replace 40-odd times, but I was wondering whether there is a better or smarter way to do this, given that the list of names to process can run to tens of thousands daily.
Thank you
You could use apply:
select t.*, v.prefix_free_first_name
from t outer apply
(select top (1) left(t.first_name, len(t.first_name) - len(v.prefix) - 1) as prefix_free_first_name
from (values ('am'), ('auf'), . . .
) v(prefix)
where t.first_name like '% ' + v.prefix
order by len(v.prefix) desc
) v;
Note: This handles the situation where multiple prefixes match a name, such as "de le" and "le".
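The longest-match-first idea behind the ORDER BY can be sketched outside the database as well. The following is a minimal Python illustration, not the answer's own code: the function name strip_trailing_prefix and the shortened prefix list are hypothetical, and the lower-casing assumes the case-insensitive matching that LIKE performs under a typical SQL Server collation.

```python
# Hypothetical subset of the prefix list from the question.
PREFIXES = ["am", "auf", "auf dem", "de", "de le", "le", "van", "van der", "von"]


def strip_trailing_prefix(first_name, prefixes=PREFIXES):
    """Remove the longest prefix that ends first_name as a separate word.

    Mirrors the query: the condition corresponds to LIKE '% ' + prefix and
    the longest-first ordering corresponds to ORDER BY len(prefix) DESC.
    """
    lowered = first_name.lower()  # assumption: matching is case-insensitive
    for prefix in sorted(prefixes, key=len, reverse=True):
        if lowered.endswith(" " + prefix):
            # Drop the prefix and the space in front of it.
            return first_name[: len(first_name) - len(prefix) - 1]
    return first_name


print(strip_trailing_prefix("Jean de le"))
print(strip_trailing_prefix("Ludwig van"))
```

Because the longest candidate is tried first, "de le" is removed as a whole rather than only its trailing "le".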
I am in need of solving a problem for my project.
I need to clean up an address field in PostgreSQL by removing everything to the right of a street name.
I found an approach in the question "PostgreSQL replace characters to right of string":
SELECT regexp_replace('100 broadway street 100', '(.*)(Street).*', '\1\2', 'i');
However, I would like to replace '100 broadway street 100' more flexibly, like this:
SELECT regexp_replace('100 broadway street 100', '(.*)(Street OR Str. OR Ward OR W. OR Dist).*', '\1\2', 'i');
Can someone help me write the right syntax, or point me to any links I haven't found yet?
Input 1: "100 Alexandre de Rhodes Street, District 10, HCM City"
Input 2: "100 Quang Trung Str., District 10, HCM City"
Input 3: "123 Newton St., GV District, HCM City"
Output 1: "100 ABC Street, Ward 16"
Output 2: "100 Quang Trung Str."
and so on.
In other words, everything after the street name needs to be removed.
I think you are looking for the | (alternation) operator, like this:
SELECT regexp_replace('100 broadway Dist 100', '(.*)(Street|Str|Ward|Dist).*', '\1\2', 'i');
Output
100 broadway Dist
Update based on comments
You can replace the leading (.*) with (.), so that the match starts at the first keyword instead of the last:
SELECT regexp_replace('100 broadway Dist Str 100 Str abc Street',
'(.)(Street|Dist|Ward|Str).*', '\1\2', 'i');
Output
100 broadway Dist
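The difference between the two patterns is easiest to see side by side. The sketch below uses Python's re module rather than PostgreSQL's regexp_replace, so it only illustrates the alternation and greediness behaviour. The function names are hypothetical, and count=1 is an assumption standing in for regexp_replace without the 'g' flag.

```python
import re

KEYWORDS = r"(Street|Dist|Ward|Str)"


def cut_after_last_keyword(address):
    # Greedy (.*) gives back as little as possible, so \2 ends up being
    # the LAST keyword in the string.
    return re.sub(r"(.*)" + KEYWORDS + r".*", r"\1\2", address,
                  count=1, flags=re.IGNORECASE)


def cut_after_first_keyword(address):
    # A single-character group lets the leftmost match win, so \2 is the
    # FIRST keyword (it must be preceded by at least one character).
    return re.sub(r"(.)" + KEYWORDS + r".*", r"\1\2", address,
                  count=1, flags=re.IGNORECASE)


sample = "100 broadway Dist Str 100 Str abc Street"
print(cut_after_last_keyword(sample))
print(cut_after_first_keyword(sample))
```

With the greedy group the sample string comes back unchanged, because the last keyword is the final "Street"; with the single-character group everything after the first "Dist" is dropped.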
I am searching for a particular value in a string. Each value in the string designates a clause, and the string looks like this:
K1,K2,K3,K114,K22,K110,...
Standard results
Let's say I want to search for people who have K1.
When I use the following filter in my WHERE clause
where q2.klause like '%K1%'
it returns everyone who has K1, but it also returns anyone who has, for instance, K11 without K1, which is not what I want!
So I tried this approach
where q2.klause like '%K1'
Sadly that doesn't work either, because it only matches values that end in K1, which in my data means people who have K1 alone, with no other clauses.
I want anybody who has K1, whether or not they have other clauses beside it!
UIDs Klause
6548 K1,K35,K37,K4
48 K1,K34
486 K1,K14
8974 K1
456568 K11,K12,K2
8814 K2,K14,K34
6248 K14,K2
2236 K1,K35,K37,K4
547 K2
397812 K2
586 K2,K11
1358 K1,K13,K14
5856 K1,K14
9872 K1,K14
64789 K1,K14
22344 K1,K14,K35,K37
4788 K1,K14
4587 K1,K14,K35,K37
14561 K11,K12,K14,K2
232156 K1,K14
156 K1,K114
475 K11,K12,K14,K2
45645 K13,K14
456454 K13,K14
In this case there are 14 people who have K1. This is the required final answer! Please be aware that the same should also work for the other clauses.
Use a regular expression:
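WHERE q2.klause ~ '\mK1\M'
\m matches the beginning of a word and \M the end of a word. The ~ operator is case-sensitive, so the pattern must use an uppercase K (or use ~* for a case-insensitive match).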
You should fix your data model. But you can use LIKE this way:
where ',' || q2.klause || ',' like '%,K1,%'
WHERE q2.klause LIKE '%K1'
OR q2.klause LIKE '%K1,%'
I would convert the column value to an array and use the contains operator @>:
where string_to_array(q2.klause,',') @> array['K1']
alternatively:
where 'K1' = any(string_to_array(q2.klause,','))
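As a quick check of the delimiter-wrapping LIKE suggested above, here is a small sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for PostgreSQL here (its LIKE is case-insensitive for ASCII by default, unlike PostgreSQL's), and the helper count_matches is a hypothetical name.

```python
import sqlite3

# Sample rows copied from the question: (uid, klause).
ROWS = [
    (6548, "K1,K35,K37,K4"), (48, "K1,K34"), (486, "K1,K14"), (8974, "K1"),
    (456568, "K11,K12,K2"), (8814, "K2,K14,K34"), (6248, "K14,K2"),
    (2236, "K1,K35,K37,K4"), (547, "K2"), (397812, "K2"), (586, "K2,K11"),
    (1358, "K1,K13,K14"), (5856, "K1,K14"), (9872, "K1,K14"),
    (64789, "K1,K14"), (22344, "K1,K14,K35,K37"), (4788, "K1,K14"),
    (4587, "K1,K14,K35,K37"), (14561, "K11,K12,K14,K2"), (232156, "K1,K14"),
    (156, "K1,K114"), (475, "K11,K12,K14,K2"), (45645, "K13,K14"),
    (456454, "K13,K14"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q2 (uid INTEGER, klause TEXT)")
conn.executemany("INSERT INTO q2 VALUES (?, ?)", ROWS)


def count_matches(where_clause, params=()):
    """Count the rows of q2 that satisfy the given WHERE clause."""
    sql = "SELECT COUNT(*) FROM q2 WHERE " + where_clause
    return conn.execute(sql, params).fetchone()[0]


# Substring match: also counts K11, K13, K14, ...
naive = count_matches("klause LIKE '%K1%'")
# Wrapping both sides in commas makes every clause comma-delimited.
exact = count_matches("',' || klause || ',' LIKE '%,' || ? || ',%'", ("K1",))
print(naive, exact)
```

Passing the clause as a parameter means the same filter can be reused for K2, K14 and so on without editing the SQL.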
Context:
Until now I have used regular expressions in SQL to extract URLs of varying form. I find this very slow and want to optimize it using the substr and instr functions. This matters to me because I am new to SQL, and it helps me become more familiar with such functions.
database:
My database consists of posts extracted from social platforms. The text column is called "titre". It contains URLs in different formats: www, http, https. I want to create a table or a view (I have not decided which) containing those URLs and the related id_post.
My work:
I have noticed that a URL always ends with a blank space, something like: "toto want to share with you this www.example.com in his post"
Here is what I have done so far:
--- length of the string starting from https
select LENGTH(substr(titre, INSTR(titre,'https:'))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- length of the string starting from the blank
select LENGTH(substr(titre, INSTR(titre,' ', 171))) from post_categorised_pages where id_post = '280853248721200_697941320345722';
--- difference, which gives the length of the URL string
select LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', 171))) as longueur_url from post_categorised_pages where id_post = '280853248721200_697941320345722';
---url
select substr(titre, 171, 54)from post_categorised_pages where id_post = '280853248721200_697941320345722';
Question:
How can I automate that over the whole table "post_categorised_pages"?
Can I introduce CASE WHEN expressions to take https, http or www. into account, and how can I do that?
Thanks a lot!!!!
Perhaps, instead of the hard-coded "HTTP", "HTTPS" or "WWW" string, you need the name of a column.
In that case it would probably help to have a definition table listing all possible sources. This table would have 2 columns (ID and source_name).
Then, in your post_categorised_pages table, also store the source of the message (the ID value).
Then, in the query, join with this definition table by ID and, instead of
select substr(titre, INSTR(titre,'https:'), (LENGTH(substr(titre, INSTR(titre,'https:'))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,'https:')))))))from post_categorised_pages where id_post = '280853248721200_697941320345722';
you would have
select substr(titre, INSTR(titre,"definition table".source_name), (LENGTH(substr(titre, INSTR(titre,"definition table".source_name))) - LENGTH(substr(titre, INSTR(titre,' ', (INSTR(titre,"definition table".source_name)))))))from post_categorised_pages where id_post = '280853248721200_697941320345722';
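The substr/instr arithmetic in these queries (start at the marker, stop at the next blank) can be sketched in a few lines of Python. This is only an illustration of the logic, not Oracle code: extract_first_url and the MARKERS tuple are hypothetical names, and falling back to the end of the string when no blank follows is an added assumption.

```python
# Markers that may start a URL, as listed in the question.
MARKERS = ("https:", "http:", "www.")


def extract_first_url(titre):
    """Return the first URL-like token in titre, or None if there is none."""
    # Position of each marker that is present (str.find returns -1 if absent).
    starts = [pos for pos in (titre.find(m) for m in MARKERS) if pos != -1]
    if not starts:
        return None
    start = min(starts)            # the earliest marker wins
    end = titre.find(" ", start)   # the URL ends at the next blank ...
    if end == -1:
        end = len(titre)           # ... or at the end of the string
    return titre[start:end]


print(extract_first_url(
    "toto want to share with you this www.example.com in his post"))
```

Taking the minimum over all marker positions plays the role of the CASE WHEN asked about in the question: whichever of https, http or www. appears first determines where the URL starts.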
OK, here is the solution I have found (there is still one mistake, see the end of the post).
I use two views to extract my strings.
The first view is created with a CONNECT BY query:
--- create intermediate view with the position of each targeted pattern
create or replace view Start_Position_Index as
with post as
(select id, text from "your_table" where id = 'xyz')
select id, instr(text, '#', 1, level) as position, text
from post
connect by level <= regexp_count(text, '#');
then
--- create working view with full references, the blank position after each pattern match, and the string length of each match
create or replace view "_#_index" as
select id, position as hashtag_pos, INSTR(text,' ', position) as blank_position, INSTR(text,' ', position) - position as string_length, text
from Start_Position_Index;
With these views you can retrieve the hashtags (in this case) you were looking for in your string.
OK, so the mistakes:
- If the pattern you are looking for is at the end of the string, a null value is returned, because there is no trailing blank space.
- It is not well optimized, because I am working with views rather than tables. I think using tables would be faster.
I am pretty sure there is a lot more to do to optimize this code. Any ideas? The challenge was how to extract a specific pattern repeatedly from strings without using costly regex and without using PL/SQL. What do you think?
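The first mistake listed above (a match at the end of the string has no trailing blank) can be handled by treating the end of the string as the terminator. Below is a minimal Python sketch of that loop, assuming the same "marker up to the next blank" rule; extract_tokens is a hypothetical name and this is not a translation of the Oracle views.

```python
def extract_tokens(text, marker="#"):
    """Return (1-based position, token) for every marker occurrence."""
    tokens = []
    start = text.find(marker)
    while start != -1:
        end = text.find(" ", start)
        if end == -1:
            # No blank after the marker: the token runs to the end of the
            # string. This is the case the views above do not handle.
            end = len(text)
        tokens.append((start + 1, text[start:end]))  # 1-based, like INSTR
        start = text.find(marker, end)               # continue after the token
    return tokens


print(extract_tokens("go #sql and #oracle"))
```

In SQL the same fallback can be obtained by searching in the text with a single blank appended, so that a terminating blank always exists.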
How about using Oracle full-text search (Oracle Text)?
It indexes all the words in the column and will give you the hashtags or web addresses, since both are written as a single word with no spaces in between.
I am trying to break up the following fixed string into several columns: street, city, state and zip code. Is it possible to do this in sqldf via the instr and substr functions?
Sample Address String. The difficult part is the NV and zip code parsing.
727 Wright Brothers Ln, Las Vegas, NV 89119, USA
I am able to parse the city/street information using sqldf/instr but unable to parse the final two values for state/zip code
parsed_tweetAddressdf <- sqldf("SELECT lon, lat, result, substr(result,0,instr(result,',')) AS street, substr(result,instr(result,',')+1,instr(result,',')-1) AS city from tweetAddressdf")
Here are some alternatives. They all use instr and substr as required by the question although the third also writes out the data and reads it back in (in addition to using instr and substr). Notes at the end point out that it is also easy to do this in plain R or using read.pattern in gsubfn.
1) Assume state, zip and country fields are fixed width. With only one sample record it is impossible to know what your general case is, but if we assume that every record ends in SS ZZZZZ, USA where SS is the two-letter state abbreviation and ZZZZZ is a 5-digit zip, then this works:
DF <- data.frame(v = "727 Wright Brothers Ln, Las Vegas, NV 89119, USA")
library(sqldf)
sqldf("select
substr(v, 0, instr(v, ',')) street,
substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) city,
substr(v, -13, 2) state,
substr(v, -10, 5) zip
from DF")
giving:
                  street      city state   zip
1 727 Wright Brothers Ln Las Vegas    NV 89119
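Since sqldf runs its queries on SQLite by default, the offsets in approach (1) can be checked directly against SQLite. The sketch below does this from Python's sqlite3 module instead of R, purely as a way of verifying the arithmetic; the table and column names follow the R example, everything else is an assumption of this illustration.

```python
import sqlite3

ADDRESS = "727 Wright Brothers Ln, Las Vegas, NV 89119, USA"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DF (v TEXT)")
conn.execute("INSERT INTO DF VALUES (?)", (ADDRESS,))

# Same query as in approach (1). In SQLite, substr(v, 0, n) returns the
# first n - 1 characters, and a negative start counts from the right.
row = conn.execute("""
    select
      substr(v, 0, instr(v, ',')) street,
      substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) city,
      substr(v, -13, 2) state,
      substr(v, -10, 5) zip
    from DF
""").fetchone()

print(row)
```

The constant 16 is the 14 characters of ", NV 89119, USA" plus the comma and the space that follow the street, which is why this version depends on the fixed-width tail.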
2) Separate strictly based on commas (except state/zip). This approach avoids certain assumptions in (1) at the expense of additional complication. It takes the first two comma-separated fields, the 2-character state, and everything after that up to the next comma as the zip.
It uses a triple nested select. The innermost select denoted a parses the input string into: street and a.rest. The next one proceeding outward denoted b returns the street already parsed from a, and parses a.rest into city and the b.rest. The outermost one returns the street and city already parsed plus it returns the two state characters in b.rest and everything beyond them in b.rest to the next comma as zip.
library(sqldf)
sqldf("
select
street,
city,
substr(b.rest, 1, 2) state,
substr(b.rest, 4, instr(b.rest, ',') - 4) zip
from (
select
street,
substr(a.rest, 0, instr(a.rest, ',')) city,
substr(a.rest, instr(a.rest, ',') + 2) rest
from (select
substr(v, 0, instr(v, ',')) street,
substr(v, instr(v, ',') + 2) rest
from DF) a) b
")
giving:
                  street      city state   zip
1 727 Wright Brothers Ln Las Vegas    NV 89119
3) read.csv.sql. If it's OK to write it out and read it back in, then we can use read.csv.sql, a wrapper around sqldf. Although the question did not ask for it, this one also parses out the country:
write.table(DF, "addresses.csv", row.names = FALSE, col.names = FALSE,
sep = ",", quote = FALSE)
read.csv.sql("addresses.csv", header = FALSE, sql =
"select V1 street,
V2 city,
substr(V3, 2, 2) state,
substr(V3, 4) zip,
V4 country
from file")
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA
Note 1: This is also easy in plain R.
dd <- read.table(text = as.character(DF$v), sep = ",",
col.names = c("street", "city", "state_zip", "country"))
transform(dd,
state = substring(state_zip, 2, 3),
zip = substring(state_zip, 4))[c(1, 2, 5, 6, 4)]
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA
Note 2: It is even easier using read.pattern from gsubfn:
library(gsubfn)
pat <- "(.*), (.*), (..) (.*), (.*)"
read.pattern(text = as.character(DF$v), pattern = pat,
col.names = c("street", "city", "state", "zip", "country"))
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA