Stripping strings of left brackets and slashes in a pandas dataframe

I need to strip the characters after the third '-', after the first "(", and after the first '/' , and keep the result in a new column, keepcat.
          violation_code        violation_description                                  keepcat
ticket_id
22056     9-1-36(a)             Failure of owner to obtain certificate of compliance  9-1-36
27586     61-63.0600            Failed To Secure Permit For Lawful Use Of Building    61-63.0600
18738     61-63.0500            Failed To Secure Permit For Lawful Use Of Land        61-63.0500
18735     61-63.0100            Noncompliance/Grant Condition/BZA/BSE                 61-63.0100
23812     61-81.0100/32.0066    Open Storage/ Residential/ Vehicles                   61-81.0100/32.0066
26686     61-130.0000/130.0300  Banner/ Signage/ Antenna                              61-130.0000/130.0300
325555    9-1-43(a) - (Structu  Fail to comply with an Emergency                      9-1-43
I have managed to delete the dashes ("-") and the brackets ("(") with this:
df['keepcat']=df['violation_code'].apply(lambda x: "-".join(x.split("-")[:3]) and x.split('(')[0].strip())
However, when I add the "/" split, it does not delete the slashes. I have tried:
df['violation_code'].apply(lambda x: "-".join(x.split("-")[:3]) and x.split('(')[0].strip()) and x.split('/')[0].strip() )
Thank you.

Does it work if you parse the conditions separately:
df['keepcat'] = df['violation_code'].apply(lambda x: "-".join(x.split("-")[:3]))
df['keepcat'] = df['keepcat'].apply(lambda x: x.split('(')[0].strip())
df['keepcat'] = df['keepcat'].apply(lambda x: x.split('/')[0].strip())
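If you prefer a single chained expression, the same three rules can be applied in one pass with the vectorized .str accessor (a sketch, assuming you do want to cut at the first '/' as in the splits above):

import pandas as pd

df = pd.DataFrame({'violation_code': ['9-1-36(a)', '61-63.0600', '9-1-43(a) - (Structu']})

# Keep at most the first three dash-separated parts, then cut at the
# first '(' and the first '/', and strip stray whitespace.
df['keepcat'] = (
    df['violation_code']
    .apply(lambda x: '-'.join(x.split('-')[:3]))
    .str.split('(').str[0]
    .str.split('/').str[0]
    .str.strip()
)
print(df['keepcat'])  # 9-1-36, 61-63.0600, 9-1-43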

Pattern match using regexp_extract_all

I am trying to build an array from this string and need help with the pattern for REGEXP_EXTRACT_ALL.
Here is my input string, which contains keyword/value pairs:
BEGIN
DECLARE p_JSON STRING DEFAULT """
{
"instances": [{
"LT_20MN_SalesContrctCnt": 388.0,
"Pyramid_Index": '',
"MARKET": "'Growth Markets','Europe'",
"SERVICE_DIM": "'S&C','F&M'",
"SG_MD": "'All Service Group'"
}]}
""";
SELECT split(x,":")[OFFSET(0)] as keyword, split(x,":")[OFFSET(1)] keyword_value
FROM unnest(split(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'([\'\"\[\]{}])', ''))) as x
END;
The above SQL fails at SPLIT because of commas within the data.
All I am trying to do here is build two columns, keyword and value.
The idea is that if I can extract each row using REGEXP_EXTRACT_ALL without the trailing "," then I should be able to split it into keyword and keyword_value columns. By the way, the names and number of keywords/values are not fixed.
Intended output from REGEXP_EXTRACT_ALL:
"LT_20MN_SalesContrctCnt": 388.0
"Pyramid_Index": ''
"MARKET": "'Growth Markets','Europe'"
"SERVICE_DIM": "'S&C','F&M'"
"SG_MD": "'All Service Group'"
I'd appreciate it if you can suggest a better way to handle this.
Thanks in advance.
Using your sample data, I just added an extra REGEXP_REPLACE to replace ," with #" so we can avoid splitting on the commas inside the values. See the approach below:
SELECT
SPLIT(arr,":")[OFFSET(0)] as keyword,
SPLIT(arr,":")[OFFSET(1)] as keyword_value,
FROM sample_data,
UNNEST(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'[\[\]{}]',''),r',"','#"'),'#')) arr
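To see why the sentinel replacement works, here is the same transformation traced step by step in Python (illustrative only; the JSON_EXTRACT output is simulated by hand):

import re

# Roughly what JSON_EXTRACT(p_JSON, '$.instances') returns (simulated)
extracted = ('[{"LT_20MN_SalesContrctCnt":388.0,"Pyramid_Index":"",'
             '"MARKET":"\'Growth Markets\',\'Europe\'"}]')

step1 = re.sub(r'[\[\]{}]', '', extracted)  # drop brackets and braces
step2 = step1.replace(',"', '#"')           # ," separates pairs; commas inside values are ,'
for pair in step2.split('#'):
    keyword, _, keyword_value = pair.partition(':')  # split at the first ':' only
    print(keyword, '->', keyword_value)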

Multiline regex match in MariaDB/Mediawiki

I am trying to match text (contained in a Mediawiki template) in multiple lines via the Replace Text extension in MW 1.31, server running MariaDB 10.3.22.
An example of the template is the following (other templates may exist on the same page):
{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}}
Above and below it there could be other templates, with a varying number of line breaks, e.g.:
{{MyTemplatename
|Text=text, text, text
}}
{{WoodhouseENELnames
|Text=text, text, text
}}
{{OtherTemplatename
|Text= text, text, text
}}
There is a varying number of lines and/or line breaks within the template. I want to match the full template and delete it; that is, match from {{WoodhouseENELnames to the closing }} without matching any templates further down (i.e., stop matching if further {{ are encountered).
The closest I got was using something like:
Find
({{WoodhouseENELnames\n\|Text=)(.*?)\n+(.*?)\n+(.*?)\n+(.*?)(\n+}})
And adding/removing (.*?)\n+ in the regex to match cases with more or less lines. The problem is that this expression might inadvertently match other templates following this one.
Is there a regex that would match all possible text/line breaks contained within the template (in a lazy way, as there may be other templates below and above) on the same page? The templates are delimited by opening {{ and closing }}.
Edited to clear up any confusion.
This is a recursion simulation for use with Java- and Python-style engines that do not support function calls (recursion):
(?s)(?={{WoodhouseENELnames)(?:(?=.*?{{(?!.*?\1)(.*}}(?!.*\2).*))(?=.*?}}(?!.*?\2)(.*)).)+?.*?(?=\1)(?:(?!{{).)*(?=\2$)
Recursion Simulation demo
Just check the matches for the result.
This is real recursion for use on Perl, PCRE style engines
(?s){{WoodhouseENELnames((?:(?>(?:(?!{{|}}).)+)|{{(?1)}})*)}}
Recursion demo
Note that .NET handles recursion differently and is not included here.
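If you want to experiment with the PCRE-style recursive pattern outside Perl, Python's third-party regex module (not the built-in re) supports recursion and atomic groups too; a minimal sketch:

import regex  # third-party: pip install regex; the built-in re has no recursion

text = '''{{MyTemplatename
|Text=text, text, text
}}
{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]some text
}}
{{OtherTemplatename
|Text= text, text, text
}}'''

# Same recursive pattern as above: group 1 eats runs of non-brace text
# or a balanced {{ ... }} sub-template, recursing into itself via (?1).
pat = regex.compile(r'(?s){{WoodhouseENELnames((?:(?>(?:(?!{{|}}).)+)|{{(?1)}})*)}}')

print(pat.sub('', text))  # removes the whole template, nested {{filepath:...}} included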
I can only think of a brute-force, iterative approach using a recursive query.
The idea is to walk through the string, starting at the first occurrence of string part '{{WoodhouseENELnames'. From there on, we set a counter that keeps track of how many opening and closing brackets were met. When the count reaches 0, we know the pattern is exhausted. The final step is to rebuild a string that retains the parts before and after the pattern (a procedural sketch of the same walk follows the results below).
For this to work, you need a unique column to identify each row. I assumed id.
with recursive cte as (
    select
        n_open n0,
        n_open n1,
        1 cnt,
        mycol,
        id
    from (select t.*, locate('{{WoodhouseENELnames', mycol) n_open from mytable t) x
    where n_open > 0
    union all
    select
        n0,
        n1 + 2 + case when n_open > 0 and n_open < n_close then n_open else n_close end,
        cnt + case when n_open > 0 and n_open < n_close then 1 else -1 end,
        mycol,
        id
    from (
        select
            c.*,
            locate('{{', substring(mycol, n1 + 2)) n_open,
            locate('}}', substring(mycol, n1 + 2)) n_close
        from cte c
    ) x
    where cnt > 0
)
select id, concat(substring(mycol, 1, min(n0) - 1), substring(mycol, max(n1) + 1)) mycol
from cte
group by id
Demo on DB Fiddle
Set-up - I added string parts before and after the pattern (including double brackets for extra fun):
create table mytable(id int, mycol varchar(2000));
insert into mytable values (
1,
'{{abcd{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}} efgh{{'
);
Results:
id | mycol
-: | :------------
1 | {{abcd efgh{{
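For comparison, the counting walk that the recursive CTE simulates looks like this in ordinary procedural code (a sketch of the same idea, not something MariaDB can run):

def strip_template(s, opener='{{WoodhouseENELnames'):
    """Remove the first opener...}} block, honoring nested {{ }} pairs."""
    start = s.find(opener)
    if start < 0:
        return s
    i, depth = start + len(opener), 1
    while i < len(s) and depth > 0:
        if s.startswith('{{', i):
            depth += 1   # a nested template opens
            i += 2
        elif s.startswith('}}', i):
            depth -= 1   # a template closes; 0 means ours is done
            i += 2
        else:
            i += 1
    return s[:start] + s[i:]

print(strip_template('{{abcd{{WoodhouseENELnames x {{filepath:y}} z }} efgh{{'))
# -> {{abcd efgh{{   (same result as the SQL demo above)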
MariaDB uses the PCRE regex engine.
If you can ensure that
- the opening tag of your template ({{WoodhouseENELnames) starts on a new line,
- the closing tag of your template (}}) starts on a new line, and
- no other closing tag (}}) in between starts on a new line,
the following regex will do:
(?ms)^{{WoodhouseENELnames.+?^}}
Description:
(?ms) tells the engine that ^ matches at the start of every line and that . also matches newlines.
Then search for your opening tag at the start of a line.
Search for the shortest possible string, including any characters (also newlines), up to
a closing tag (}}) at the start of a line.
If you want to capture the match, enclose the regex within ( and ).
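Under those constraints the pattern needs no recursion, so even Python's built-in re can run it; a quick check on assumed sample text:

import re

text = '''{{MyTemplatename
|Text=text, text, text
}}
{{WoodhouseENELnames
|Text=text, text, text
}}
{{OtherTemplatename
|Text= text, text, text
}}'''

# (?m) lets ^ match at line starts, (?s) lets . cross newlines.
# Note a blank line is left behind where the template was.
print(re.sub(r'(?ms)^{{WoodhouseENELnames.+?^}}', '', text))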
EDIT:
As PCRE2 supports recursive patterns, the following, more complex regex will match regardless of the beginning-of-line constraints above:
(?msx)
({{WoodhouseENELnames   # group 1: matches the whole template
  (                     # group 2: matches the contents of the template, including subpatterns
    [^{}]*              # zero or more characters except { or }
    {{                  # the beginning of a subpattern
      (                 # group 3, containing either:
        [^{}]++         # one or more characters except { or } (possessive)
        | (?2)          # or the recursive pattern of group 2
      )*                # zero or more times
    }}                  # the closing of the subpattern
    [^{}]*              # zero or more characters except { or }
  )
}}
)
Caveat: doesn't cater for single { or } within the templates.
EDIT 2
I hate giving up before the job is done :-) This regex should work regardless of all constraints above:
(?msx)                    # note the additional 'x' option, allowing free spacing
({{WoodhouseENELnames     # search group 1 - top-level template:
  (                       # search group 2 - top-level template contents:
    (                     # search group 3 - subtemplate contents:
      [^{}]*              # zero or more characters except { or }
      | {(?!{)            # or a single { not followed by a {
      | }(?!})            # or a single } not followed by a }
    )*                    # closing search group 3
    {{                    # opening subtemplate tag
    (                     # search group 4:
      (?3)*               # reusing search group 3, zero or more times
      | (?2)              # or recurse search group 2 (of which this is a part)
    )*                    # group 4 zero or more times
    }}                    # closing subtemplate tag
    (?3)*                 # reusing search group 3, zero or more times
  )                       # closing search group 2 - template contents
}}                        # top-level template closing tag
)                         # closing search group 1
The last two solutions are based on the PCRE2 documentation.

I want to split the string and keep the first word only

I have a dataframe which contains details of cars. Now I want to keep only the brand name and remove the model name.
I've tried using the str.split function to separate the car name. However, it gives me a list, and I'm not able to extract the first name from it.
splitted = df['CarName'].str.split(' ', n=1)
Expected result:
alfa-romero
Audi
VW
Actual result:
[alfa-romero, giulia]
[alfa-romero, stelvio]
[alfa-romero, Quadrifoglio]
[audi, 100 ls]
[audi, 100ls]
You can do this in two ways: either as WeNYoBen explained in his comment, or by using extract against a list of brands.
df['brand'] = df['cars'].str.split(' ', n=1).str[0]
or
pattern = ['audi', 'alfa-romero']
df['brand_2'] = df['cars'].str.extract("(" + "|".join(pattern) + ")", expand=False)
Then you can do
splitted = df['CarName'].str.split(' ', n=1).str[0]
This could be achieved using pandas.Series.apply with str.split:
df['res'] = df['CarName'].apply(lambda x: str(x).split(' ')[0])
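An equivalent vectorized alternative (a sketch) is to extract everything up to the first whitespace, which avoids the intermediate lists entirely:

import pandas as pd

df = pd.DataFrame({'CarName': ['alfa-romero giulia', 'audi 100 ls', 'vw rabbit']})

# ^(\S+) captures the leading run of non-space characters, i.e. the brand
df['brand'] = df['CarName'].str.extract(r'^(\S+)', expand=False)
print(df['brand'].tolist())  # ['alfa-romero', 'audi', 'vw']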

Check constraint for Emails in an Oracle Database

I've searched everywhere for a decent and logical CHECK constraint to validate that an email is in the right format. So far I've found really long and unnecessary expressions like:
create table t (
email varchar2(320) check (
regexp_like(email, '[[:alnum:]]+@[[:alnum:]]+\.[[:alnum:]]')
)
);
and
create table stk_t (
email varchar2(320) check (
email LIKE '%@%.%' AND email NOT LIKE '@%' AND email NOT LIKE '%@%@%'
)
);
Surely there is a simpler way?
I'm using Oracle 11g database and SQL Developer IDE.
This is what I have:
constraint Emails_Check check (Emails LIKE '%_@%_._%')
Can someone please let me know if this is the most efficient way of validating emails?
You can try this:
email varchar2(255) check (
email LIKE '%@%.%' AND email NOT LIKE '@%' AND email NOT LIKE '%@%@%' )
CREATE TABLE MYTABLE(
EMAIL VARCHAR2(30) CHECK(REGEXP_LIKE (EMAIL,'^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'))
)
Explanation of the regular expression:
^                    # start of the line
[_A-Za-z0-9-]+       # must start with characters from the bracket [ ], one or more (+)
(                    # start of group #1
\.[_A-Za-z0-9-]+     # followed by a dot "." and characters from the bracket [ ], one or more (+)
)*                   # end of group #1, this group is optional (*)
@                    # must contain an "@" symbol
[A-Za-z0-9]+         # followed by characters from the bracket [ ], one or more (+)
(                    # start of group #2 - first-level TLD check
\.[A-Za-z0-9]+       # followed by a dot "." and characters from the bracket [ ], one or more (+)
)*                   # end of group #2, this group is optional (*)
(                    # start of group #3 - second-level TLD check
\.[A-Za-z]{2,}       # followed by a dot "." and characters from the bracket [ ], minimum length 2
)                    # end of group #3
$                    # end of the line
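Whichever pattern you pick, it is worth sanity-checking it against sample addresses before baking it into a constraint. Here is the pattern from the CREATE TABLE above tried out in Python (illustrative only; Oracle's POSIX-style regex dialect differs in places, but this simple character-class pattern behaves the same way):

import re

# Same pattern as in the CHECK constraint above
EMAIL_RE = re.compile(r'^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$')

for addr in ('alice@example.com', 'bob@mail', '@nope.com', 'x.y@sub.domain.org'):
    print(addr, '->', bool(EMAIL_RE.match(addr)))
# alice@example.com -> True, bob@mail -> False, @nope.com -> False, x.y@sub.domain.org -> True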
Stumbled upon this answer while hunting for a simple solution on the internet:
ALTER TABLE YourTableName
ADD CONSTRAINT YourConstraintName CHECK(YourColumnName LIKE '%___@___%.__%')
All points to @bhanu_nz here

Using SQLDF to select specific values from a column

SQLDF newbie here.
I have a data frame which has about 15,000 rows and 1 column.
The data looks like:
cars
autocar
carsinfo
whatisthat
donnadrive
car
telephone
...
I wanted to use the package sqldf to loop through the column and
pick all values which contain "car" anywhere in their value.
However, the following code generates an error.
> sqldf("SELECT Keyword FROM dat WHERE Keyword="car")
Error: unexpected symbol in "sqldf("SELECT Keyword FROM dat WHERE Keyword="car"
There is no unexpected symbol, so I'm not sure what's wrong.
So first, I want to know all the values which contain 'car'.
Then I want to know only those values which are exactly 'car' by itself.
Can anyone help?
EDIT:
All right, there was an unexpected symbol after all, but the query gives me just 'car' and not every row which contains 'car'.
> sqldf("SELECT Keyword FROM dat WHERE Keyword='car'")
Keyword
1 car
Using = will only return exact matches.
You should probably use the like operator combined with the wildcards % or _. The % wildcard will match multiple characters, while _ matches a single character.
Something like the following will find all instances of car, e.g. "cars", "motorcar", etc:
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
And the following will match four-character values beginning with "car", such as "cars" (note that _ matches exactly one character, so plain "car" would not match):
sqldf("SELECT Keyword FROM dat WHERE Keyword like 'car_'")
This has nothing to do with sqldf; your SQL statement is the problem. You need:
dat <- data.frame(Keyword=c("cars","autocar","carsinfo",
"whatisthat","donnadrive","car","telephone"))
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
# Keyword
# 1 cars
# 2 autocar
# 3 carsinfo
# 4 car
You can also use regular expressions to do this sort of filtering. grepl returns a logical vector (TRUE / FALSE) stating whether or not there was a match. You can get very sophisticated to match specific items, but a basic query will work in this case:
# Using @Joshua's dat data.frame
subset(dat, grepl("car", Keyword, ignore.case = TRUE))
Keyword
1 cars
2 autocar
3 carsinfo
6 car
Very similar to the solution provided by @Chase. Because we do not use subset, we do not need a logical vector and can use either grep or grepl:
df <- data.frame(keyword = c("cars", "autocar", "carsinfo", "whatisthat", "donnadrive", "car", "telephone"))
df[grep("car", df$keyword), , drop = FALSE] # or
df[grepl("car", df$keyword), , drop = FALSE]
keyword
1 cars
2 autocar
3 carsinfo
6 car
I took the idea from Selecting rows where a column has a string like 'hsa..' (partial string match)
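For readers coming from the pandas questions above, the same two filters (substring match and exact match) are one-liners there as well; a sketch:

import pandas as pd

dat = pd.DataFrame({'Keyword': ['cars', 'autocar', 'carsinfo',
                                'whatisthat', 'donnadrive', 'car', 'telephone']})

# LIKE '%car%'  ->  str.contains
print(dat[dat['Keyword'].str.contains('car')])

# Keyword = 'car'  ->  exact comparison
print(dat[dat['Keyword'] == 'car'])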