Regular Expression Check for Capital Names in PostgreSQL - sql

I have a database, which holds many first names, occurring like the following pattern:
The name can consist of many first names (like double, or triple names), separated by either a '-' or a ' '.
Each of the names consists of either lowercase or UPPERCASE letters or a capital first letter and the rest lowercase.
I would like to write a query to count all names which either consist of UPPERCASE letters only, or have a word that does not start with a capital letter (at the beginning or after a separator between two words).
Sample Table and Data
CREATE TABLE names( name VARCHAR, PRIMARY KEY(name) );
INSERT INTO names values('Veronika isabella');
INSERT INTO names values('Veronika Isabella');
INSERT INTO names values('Michael Karl Otto- Emil');
INSERT INTO names values('Michael karl-Otto-emil');
INSERT INTO names values('philipp');
INSERT INTO names values('Philipp');

SELECT count(*) AS misfits
FROM names
WHERE name !~ '[[:lower:]]' -- not a single lower case letter
OR name ~ '\m[[:lower:]]' -- lower case letter at beginning of a word
OR name ~ '[[:lower:]][[:upper:]]'; -- upper case letter directly after a lower case letter
Details in the manual.
Or maybe initcap() fits your requirements (like a_horse commented).
SELECT count(*) AS misfits
FROM names
WHERE name <> initcap(name);
SQL Fiddle.
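The three conditions of the regex query can also be tried outside the database. This is only a rough Python sketch, not PostgreSQL itself: it assumes ASCII letters, uses `\b` as a stand-in for PostgreSQL's `\m` (start of word), and the name `is_misfit` is made up for illustration.

```python
import re

# Sample rows from the question.
NAMES = [
    "Veronika isabella",
    "Veronika Isabella",
    "Michael Karl Otto- Emil",
    "Michael karl-Otto-emil",
    "philipp",
    "Philipp",
]


def is_misfit(name):
    """Mirror the three WHERE conditions (ASCII-only approximation)."""
    # No lower case letter at all, i.e. the name is entirely upper case.
    no_lower = re.search(r"[a-z]", name) is None
    # A lower case letter at the beginning of a word.
    lower_at_word_start = re.search(r"\b[a-z]", name) is not None
    # An upper case letter directly after a lower case letter.
    upper_after_lower = re.search(r"[a-z][A-Z]", name) is not None
    return no_lower or lower_at_word_start or upper_after_lower


misfits = [name for name in NAMES if is_misfit(name)]
print(len(misfits), misfits)
```

With the sample rows this flags 'Veronika isabella', 'Michael karl-Otto-emil' and 'philipp'.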

Related

Match specific pattern only up until character using RegEx

I am trying to create a RegEx to get the database, schema and table names from an SQL CREATE TABLE statement:
Example 1:
CREATE TABLE "finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);
Example 2:
CREATE TABLE "commerce"."finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);
The depth of the schema varies, so I'm trying to come up with a regular expression that matches the names, whether they include the database name, schema name or only the table name:
\"[A-Za-z0-9_]*\"
Unfortunately, this also matches the column names. Is there a way to match this specific pattern only up until the first opening bracket?
Try this:
"\w+"(?:\."\w+"){0,2}(?=\s*\()
"\w+" -- any word text inside the double quotes (e.g. "commerce")
(?:\."\w+"){0,2} -- optionally, up to 2 further name parts (e.g. ."finance"."invoices_1" or ."invoices_1")
(?=\s*\() -- positive lookahead that requires the opening bracket symbol "(" to follow
Demo: https://regex101.com/r/0F6egD/1
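To try the pattern outside regex101, here is a small Python sketch. The function name `table_name_parts` is invented for this example, and it assumes the regex flavour behaves like Python's `re` for this pattern (lookahead and `\w` do).

```python
import re

# Pattern from the answer: one to three quoted parts, followed by "(".
QUALIFIED_NAME = re.compile(r'"\w+"(?:\."\w+"){0,2}(?=\s*\()')


def table_name_parts(ddl):
    """Return the quoted name parts that precede the first opening bracket."""
    match = QUALIFIED_NAME.search(ddl)
    if match is None:
        return []
    # Strip the quotes and dots from the matched qualified name.
    return re.findall(r'"(\w+)"', match.group(0))


example_1 = '''CREATE TABLE "finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);'''

example_2 = '''CREATE TABLE "commerce"."finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);'''

print(table_name_parts(example_1))
print(table_name_parts(example_2))
```

The column names are skipped because none of them is followed directly by an opening bracket.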

I am unable to drop a column from DB2 table

I am trying to drop a column from my DB2 table.
Table name = Instructor
Column name is Page
Command used is:
ALTER TABLE instructor
DROP COLUMN page;
I am getting this error
Column, attribute, or period "PAGE" is not defined in "GFQ70186.INSTRUCTOR".. SQLCODE=-205, SQLSTATE=42703, DRIVER=4.25.1301
Please help me to understand this error
If your column name is Page (i.e. with a capital P and lower case age) then you will need to use double quotes
ALTER TABLE INSTRUCTOR
DROP COLUMN "Page"
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0000720.html
Ordinary identifier:
An ordinary identifier is an uppercase letter followed by zero or more characters, each of which is an uppercase letter, a digit, or the underscore character. Note that lowercase letters can be used when specifying an ordinary identifier, but they are converted to uppercase when processed
Delimited identifier:
A delimited identifier is a sequence of one or more characters enclosed by double quotation marks. Leading blanks in the sequence are significant. A delimited identifier can be used when the sequence of characters does not qualify as an ordinary identifier. In this way an identifier can include lowercase letters.
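The folding rule quoted above can be illustrated with a few lines of Python. This is only a simplified model, not anything Db2 provides: `resolve_identifier` is a made-up helper, and it ignores details such as escaped quotes inside delimited identifiers.

```python
def resolve_identifier(written):
    """Return the name that is looked up for an identifier as written in SQL."""
    if len(written) >= 2 and written[0] == '"' and written[-1] == '"':
        # Delimited identifier: the case is preserved exactly.
        return written[1:-1]
    # Ordinary identifier: lowercase letters are folded to uppercase.
    return written.upper()


# The table has a column that was created as "Page" (mixed case).
columns = {"Page"}

print(resolve_identifier("page") in columns)    # looked up as PAGE
print(resolve_identifier('"Page"') in columns)  # looked up as Page
```

This is why `DROP COLUMN page` reports that "PAGE" is not defined, while `DROP COLUMN "Page"` finds the column.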

how to separate columns in hive

I have a file:
id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica
I have to create a Hive table with three columns (id, name and address) from this comma-delimited file, but the address column itself contains a comma. How can this be handled?
One possible solution is using RegexSerDe:
CREATE TABLE my_table (
id string,
name string,
address string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$')
location 'put location here'
;
Replace location property with your table location and put the file(s) into that location.
First group (.*?) will match everything before first comma, second group will match everything after first comma and before second comma and third group will match everything after second comma.
Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip header and it always exists in the file. If header can be absent, then you can filter header rows using where id !='id'
Also you can easily test Regex for extracting columns even without creating table, like this:
select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
Result:
002|michael|57-D,costa rica
In this example the query returns the three groups separated by |. This way you can easily test your regular expression and check that the groups are defined correctly before creating the table with it.
Answering question in the comment. You can have address with comma and one more column without comma like this:
select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102, mont vert|sydney
Checking that the comma is optional in the address column:
hive> select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102 mont vert|sydney
Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562
[^,] means not a comma, last column can be everything except comma.
And of course add one more column to the DDL.
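The same two patterns can be checked quickly in Python before touching Hive. This is just a sketch: Hive uses Java regular expressions, which behave the same way for these patterns as far as I know, and `split_row` is a helper name made up here.

```python
import re

# Three columns: the last group swallows the rest of the line, commas included.
THREE_COLUMNS = re.compile(r"^(.*?),(.*?),(.*?)$")
# Four columns: the last group must not contain a comma.
FOUR_COLUMNS = re.compile(r"^(.*?),(.*?),(.*?),([^,]*?)$")


def split_row(pattern, line):
    """Return the captured groups as a list, or None if the line does not match."""
    match = pattern.match(line)
    return list(match.groups()) if match else None


print(split_row(THREE_COLUMNS, "002,michael,57-D,costa rica"))
print(split_row(FOUR_COLUMNS, "001,adam,1-A102, mont vert,sydney"))
print(split_row(FOUR_COLUMNS, "001,adam,1-A102 mont vert,sydney"))
```

Each group corresponds to one column in the DDL, in order.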

Comparing non identical fields in two different tables

I am trying to compare names in 2 different tables.
In Table1 the field is called Name1 and has values like Lynn Smith.
In Table2, the field is called Name2 and it has the value like Lynn Smith (Extra)
How can I compare the two name values ignoring the text in the brackets?
I want to write a query where I need some other fields where the main name is the same.
One method would use like:
select . . .
from t1 join
t2
on t2.name2 like t1.name1 + ' (%)';
However, this is probably not efficient. If you want performance, you can extract the name into a separate column in the second table and create an index on it:
alter table t2 add name_cleaned as
(left(name2, charindex(' (', name2 + ' (') - 1));
create index idx_t2_name_cleaned on t2(name_cleaned);
Then you can phrase the query as:
select . . .
from t1 join
t2
on t2.name_cleaned = t1.name1;
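The idea of cutting name2 at the first ' (' and then joining on the cleaned value can be sketched in Python. The helper `clean_name` and the rows other than Lynn Smith are invented for illustration; appending ' (' before searching plays the same role as `name2 + ' ('` in the SQL, so names without brackets are kept whole.

```python
def clean_name(name2):
    """Drop a trailing ' (...)' part; names without one are returned unchanged."""
    cut = (name2 + " (").find(" (")  # the appended sentinel guarantees a hit
    return name2[:cut]


table1 = ["Lynn Smith", "Adam Jones"]
table2 = ["Lynn Smith (Extra)", "Adam Jones", "Eve Adams (Temp)"]

# Build a lookup on the cleaned name, which is what the index does in SQL.
index = {}
for name2 in table2:
    index.setdefault(clean_name(name2), []).append(name2)

# Equality join on the cleaned name.
matches = [(name1, name2) for name1 in table1 for name2 in index.get(name1, [])]
print(matches)
```

The lookup on the cleaned value is an exact match, which is what lets the database use an index instead of evaluating a LIKE for every pair of rows.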
One way to do this is to compare the names directly after cleaning up one side.
Unlike Gordon's answer, I'd do this with another table containing data to compare from table2.
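SELECT Table2Id, Name2, CAST(NULL AS VARCHAR(255)) AS cleanedName INTO NewTable FROM Table2 -- typed NULL so the column is not created as INT; adjust the length to match Name2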
Now we update the cleanedName column to strip off extra information from Name2 column like below. You may also create an index on this table.
UPDATE NewTable
SET cleanedName = RTRIM(LEFT(Name2, CHARINDEX('(', Name2 + '(') - 1))
Now drop and re-create index on CleanedName column and then compare with Table1.Name1 column
If every value in the Name2 column of Table2 has a space between the end of the name and the opening bracket, then you could use this:
SELECT SUBSTRING('Lynn Smith (Extra)',1,PATINDEX('%(%','Lynn Smith (Extra)')-2)
If you were to replace 'Lynn Smith (Extra)' with the column name:
SELECT SUBSTRING(name2,1,PATINDEX('%(%',name2)-2) FROM Table2
then it would show a list of the values in name2 without the text in the brackets, in other words, in the same format (as such) as the names in name1 on table1.
SUBSTRING and PATINDEX are String functions.
SUBSTRING asks for three 'arguments': (1) expression (2) start and (3) length.
(1) As you can see above the first argument can be (amongst other things)
either a constant ('Lynn Smith (Extra)') or a column (name2, written without quotes)
(2) the start of the result you want so, in this example, the first (or left)
character in the string in the column or constant is signified by the number 1.
(3) how many characters do you want to see in the result? In this example I have used PATINDEX to create a number (see below).
PATINDEX asks for two arguments: (1) %pattern% and (2) expression
(1) is the character or group of characters (shape or 'pattern') you are looking
to locate, the reason for the wildcard characters %% either side of the
pattern is because there may be characters either side of the pattern
(2) is (amongst other things) the constant or column that contains the pattern
from argument 1.
Whilst SUBSTRING returns character data (part of the string), PATINDEX returns a number: the position at which the pattern first starts in the expression, counting from 1 at the left of the expression.
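The arithmetic is easier to follow with the two functions imitated in Python. These are rough stand-ins written for this explanation, not the real T-SQL functions: `patindex_of` only searches for a literal piece of text (no wildcards), and `substring` assumes start is at least 1.

```python
def patindex_of(literal, expression):
    """1-based position of the first occurrence of literal, or 0 if absent."""
    return expression.find(literal) + 1


def substring(expression, start, length):
    """Return length characters of expression, beginning at 1-based start."""
    if length < 0:
        raise ValueError("length must not be negative")
    return expression[start - 1:start - 1 + length]


value = "Lynn Smith (Extra)"
bracket_at = patindex_of("(", value)  # position of the opening bracket
# Subtract 2: one for the bracket itself, one for the space before it.
print(substring(value, 1, bracket_at - 2))
```

If a value has no bracket at all, the position is 0 and the length becomes negative, so such rows need separate handling.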

How to select rows containing ONLY cyrillic characters in UPPERCASE from the table using LIKE statement in MS SQL

I want to select rows where column [Name] contains ONLY Cyrillic characters in UPPERCASE, and comma and hyphen from the table using LIKE :
SELECT *
FROM Clients
WHERE NAME LIKE '%[А-Я][,-]%' COLLATE Cyrillic_General_CS_AS
Or using explicit pattern:
SELECT *
FROM Clients
WHERE NAME LIKE '%[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ][,-]%' COLLATE Cyrillic_General_CS_AS
But these select rows in which at least one character matches the pattern, and they still allow any other characters that are not in the pattern.
Maybe using ^ (NOT predicate) excluding any other characters like this:
SELECT *
FROM Clients
WHERE NAME LIKE '%[^A-Z][./=+]%' COLLATE Cyrillic_General_CS_AS
But this requires enumerating a large number of unwanted characters.
How best to make a selection?
Use a double negative. Search for rows where the column does not contain any character outside the set you're interested in:
SELECT *
FROM Clients
WHERE NAME NOT LIKE '%[^-АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ,]%' COLLATE Cyrillic_General_CS_AS
(I'm not quite sure what you were attempting to do by placing the hyphen and comma in a separate grouping, but I've moved them into the same group for now, since that seems to make some sense)
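The double negative is easy to check in Python, where it reads as "no character outside the allowed set was found". This is only a sketch of the logic, not of SQL Server's collation behaviour; `only_allowed` is a made-up name, and Ё is listed separately because it lies outside the А-Я range in Unicode.

```python
import re

# Any single character that is NOT an upper case Cyrillic letter, comma or hyphen.
DISALLOWED = re.compile(r"[^А-ЯЁ,\-]")


def only_allowed(name):
    """True when no character outside the allowed set occurs in name."""
    return DISALLOWED.search(name) is None


for candidate in ["ИВАНОВ-ПЕТРОВ,ЁЖ", "Иванов", "ИВАНОВ 1", "IVANOV"]:
    print(candidate, only_allowed(candidate))
```

Note that an empty string also passes, because it contains no disallowed character; the same is true of the NOT LIKE version.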