Match specific pattern only up until character using RegEx - sql

I am trying to create a RegEx to get the database, schema and table names from an SQL CREATE TABLE statement:
Example 1:
CREATE TABLE "finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);
Example 2:
CREATE TABLE "commerce"."finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);
The depth of the schema varies, so I'm trying to come up with a regular expression that matches the names, whether they include the database name, the schema name, or only the table name:
\"[A-Za-z0-9_]*\"
Unfortunately, this also matches the column names. Is there a way to match this specific pattern only up until the first opening bracket?

Try this:
"\w+"(?:\."\w+"){0,2}(?=\s*\()
"\w+" -- any word characters inside double quotes (e.g. "commerce")
(?:\."\w+"){0,2} -- up to two further dot-separated name parts (e.g. ."finance"."invoices_1" or ."invoices_1")
(?=\s*\() -- positive lookahead that requires optional whitespace followed by the opening parenthesis, (
Demo: https://regex101.com/r/0F6egD/1
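As a quick check outside regex101, the pattern can also be tried in Python. This is only a sketch: the function name extract_names is made up for illustration, and it assumes the names are always double-quoted and consist of word characters, as in the question.

```python
import re

# Pattern from the answer: one quoted name, up to two more dot-separated
# quoted names, then (checked by lookahead) optional whitespace and "(".
QUALIFIED_NAME = re.compile(r'"\w+"(?:\."\w+"){0,2}(?=\s*\()')


def extract_names(ddl):
    """Return the name parts that precede the opening parenthesis."""
    match = QUALIFIED_NAME.search(ddl)
    if match is None:
        return []
    # Drop the quotes and split the qualified name into its parts.
    return re.findall(r'"(\w+)"', match.group(0))


example_1 = '''CREATE TABLE "finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);'''

example_2 = '''CREATE TABLE "commerce"."finance"."invoices_1" (
"abc" NUMBER(15,0),
"def" VARCHAR2(200),
"ghi" DATE
);'''

print(extract_names(example_1))
print(extract_names(example_2))
```

The column names are skipped because none of them is followed directly by an opening parenthesis; the type name sits in between.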

SELECT column name starting with numeral

This one is out of morbid curiosity. I have a very badly named table here:
CREATE TABLE badtable (
id INT PRIMARY KEY,
"customer name" VARCHAR(63),
"order" VARCHAR(12),
"1st" date,
"last-date" date
);
I am trying to show when you might desperately need delimited column names. However, the following is not an error:
SELECT
"customer name",
"order",
1st, -- no delimiter
"last-date"
FROM badtable;
Instead it happily gives me a column called st.
This works on both PostgreSQL and Microsoft SQL Server, so it’s not limited to a quirk of one of them.
How is the 1st column name being interpreted?
In some situations whitespace is not required as long as the DBMS is able to read the expression unambiguously.
select 1st
selects the literal 1. What follows is read as the alias name. Hence it is the same as
select 1 st
or
select 1 as st

how to separate columns in hive

I have a file:
id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica
I have to create a Hive table with three columns (id, name and address) from this comma-delimited file, but the address column itself contains commas. How can this be handled?
One possible solution is using RegexSerDe:
CREATE TABLE my_table (
id string,
name string,
address string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$')
location 'put location here'
;
Replace the location property with your table location and put the file(s) into that location.
The first group (.*?) matches everything before the first comma, the second group matches everything between the first and second commas, and the third group matches everything after the second comma.
Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip the header and it is always present in the file. If the header can be absent, you can filter out header rows using where id != 'id'
Also you can easily test Regex for extracting columns even without creating table, like this:
select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
Result:
002|michael|57-D,costa rica
In this example the query returns the three groups separated by |. This way you can easily test your regular expression and check that the groups are defined correctly before creating the table with it.
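The same check can be made without a Hive session. Below is a small Python sketch; it assumes that Python's re module handles this particular pattern the same way as the Java regex engine behind Hive, which should hold for these simple lazy groups. The helper name split_row is made up for illustration, and Python writes backreferences as \1 rather than $1.

```python
import re

# Same pattern as in the SERDEPROPERTIES above.
ROW_PATTERN = re.compile(r'^(.*?),(.*?),(.*?)$')


def split_row(line):
    """Return the three captured groups, or None if the line does not match."""
    match = ROW_PATTERN.match(line)
    return match.groups() if match else None


sample = '002,michael,57-D,costa rica'
joined = ROW_PATTERN.sub(r'\1|\2|\3', sample)

print(split_row(sample))
print(joined)
```

The first two lazy groups stop at the first and second comma, so every later comma ends up in the third group.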
Answering question in the comment. You can have address with comma and one more column without comma like this:
select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102, mont vert|sydney
Checking that the comma is optional in the address column:
select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102 mont vert|sydney
Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562
[^,] means not a comma, last column can be everything except comma.
And of course add one more column to the DDL.
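To see how the [^,] in the last group anchors the split, here is a hedged Python sketch of the four-column pattern. It again assumes that Python's re module behaves like Hive's Java regex for this pattern; the function name parse_row is hypothetical.

```python
import re

# The last group may not contain a comma, so the third group has to absorb
# any extra commas that belong to the address.
FOUR_COLUMNS = re.compile(r'^(.*?),(.*?),(.*?),([^,]*?)$')


def parse_row(line):
    """Return the four captured groups, or None if the line does not match."""
    match = FOUR_COLUMNS.match(line)
    return match.groups() if match else None


print(parse_row('001,adam,1-A102, mont vert,sydney'))
print(parse_row('001,adam,1-A102 mont vert,sydney'))
```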

Regular Expression Check for Capital Names in PostgreSQL

I have a database, which holds many first names, occurring like the following pattern:
The name can consist of many first names (like double, or triple names), separated by either a '-' or a ' '.
Each of the names consists of either lowercase or UPPERCASE letters or a capital first letter and the rest lowercase.
I would like to write a query to count all names which have either just UPPERCASE letters, or do not have a capital letter after a break of two words.
Sample Table and Data
CREATE TABLE names( name VARCHAR, PRIMARY KEY(name) );
INSERT INTO names values('Veronika isabella');
INSERT INTO names values('Veronika Isabella');
INSERT INTO names values('Michael Karl Otto- Emil');
INSERT INTO names values('Michael karl-Otto-emil');
INSERT INTO names values('philipp');
INSERT INTO names values('Philipp');
SELECT count(*) AS misfits
FROM names
WHERE name !~ '[[:lower:]]' -- not a single lower case letter
OR name ~ '\m[[:lower:]]' -- lower case letter at beginning of a word
OR name ~ '[[:lower:]][[:upper:]]'; -- upper case letter directly after a lower case letter
Details in the manual.
Or maybe initcap() fits your requirements (like a_horse commented).
SELECT count(*) AS misfits
FROM names
WHERE name <> initcap(name);
SQL Fiddle.
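For readers without a PostgreSQL instance at hand, the logic of both queries can be approximated in Python against the sample data. This is a sketch under assumptions: \b[a-z] stands in for PostgreSQL's \m[[:lower:]], the character classes are ASCII-only, and str.title() is only a rough counterpart of initcap() (the two differ on digits, for example). The name is_misfit is made up for illustration.

```python
import re

names = [
    'Veronika isabella',
    'Veronika Isabella',
    'Michael Karl Otto- Emil',
    'Michael karl-Otto-emil',
    'philipp',
    'Philipp',
]


def is_misfit(name):
    """Mirror the three conditions of the first query."""
    return (
        re.search(r'[a-z]', name) is None              # not a single lower case letter
        or re.search(r'\b[a-z]', name) is not None     # lower case letter at the start of a word
        or re.search(r'[a-z][A-Z]', name) is not None  # upper case directly after lower case
    )


regex_misfits = [n for n in names if is_misfit(n)]
# Rough counterpart of "name <> initcap(name)".
title_misfits = [n for n in names if n != n.title()]

print(len(regex_misfits), regex_misfits)
print(len(title_misfits), title_misfits)
```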

Is there any term like 'DOT(.) notation' used in SQL joins?

Is there any term like 'DOT(.) notation' used in SQL joins?
If it is in use, please explain how to use it.
Thanks in advance.
Yes. Here is how you do it when you write your SELECT:
SELECT firstname, lastname from dbo.names n -- The n becomes an alias
JOIN address a --- another alias
on a.userid = n.userid
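A runnable variant of this join, sketched with Python's built-in sqlite3 module. The table contents are invented for illustration, and because SQLite has no dbo schema the example uses main, the name SQLite gives its primary database, to show the schema.table form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (userid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE address (userid INTEGER, city TEXT);
    INSERT INTO names VALUES (1, 'Ada', 'Lovelace');
    INSERT INTO address VALUES (1, 'London');
""")

# alias.column qualifies each column with the table (alias) it comes from;
# main.names qualifies the table with its schema.
rows = conn.execute("""
    SELECT n.firstname, n.lastname, a.city
    FROM main.names AS n
    JOIN address AS a
      ON a.userid = n.userid
""").fetchall()

print(rows)
```

Qualifying userid is required here because both tables have a column of that name; qualifying the other columns is optional but makes their origin explicit.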
Collated from multiple sources of official documentation.
Dot notation (sometimes called the membership operator) allows you to qualify an SQL identifier with another SQL identifier of which it is a component. You separate the identifiers with the period ( . ) symbol. For example, you can qualify a column name with any of the following SQL identifiers:
Table name: table_name.column_name
View name: view_name.column_name
Synonym name: syn_name.column_name
These forms of dot notation are called column projections.
You can also use dot notation to directly access the fields of a named or unnamed ROW column, as in the following example:
row-column name.field name
This use of dot notation is called a field projection. For example, suppose you have a column called rect with the following definition:
CREATE TABLE rectangles
(
area float,
rect ROW(x int, y int, length float, width float)
)
The following SELECT statement uses dot notation to access field length of the rect column:
SELECT rect.length FROM rectangles WHERE area = 64
Selecting Nested Fields
When the ROW type that defines a column itself contains other ROW types, the column contains nested fields. Use dot notation to access these nested fields within a column.
For example, assume that the address column of the employee table contains the fields: street, city, state, and zip. In addition, the zip field contains the nested fields: z_code and z_suffix. A query on the zip field returns values for the z_code and z_suffix fields. You can specify, however, that a query returns only specific nested fields. The following example shows how to use dot notation to construct a SELECT statement that returns rows for the z_code field of the address column only:
SELECT address.zip.z_code
FROM employee
Rules of Precedence
The database server uses the following precedence rules to interpret dot notation:
schema name_a . table name_b . column name_c . field name_d
table name_a . column name_b . field name_c . field name_d
column name_a . field name_b . field name_c . field name_d
When the meaning of an identifier is ambiguous, the database server uses precedence rules to determine which database object the identifier specifies. Consider the following two tables:
CREATE TABLE b (c ROW(d INTEGER, e CHAR(2)));
CREATE TABLE c (d INTEGER);
In the following SELECT statement, the expression c.d references column d of table c (rather than field d of column c in table b) because a table identifier has a higher precedence than a column identifier:
SELECT *
FROM b,c
WHERE c.d = 10
For more information about precedence rules and how to use dot notation with ROW columns, see the IBM Informix: Guide to SQL Tutorial.
Using Dot Notation with Row-Type Expressions
Besides specifying a column of a ROW data type, you can also use dot notation with any expression that evaluates to a ROW type. In an INSERT statement, for example, you can use dot notation in a subquery that returns a single row of values. Assume that you created a ROW type named row_t:
CREATE ROW TYPE row_t (part_id INT, amt INT)
Also assume that you created a typed table named tab1 that is based on the row_t ROW type:
CREATE TABLE tab1 OF TYPE row_t
Assume also that you inserted the following values into table tab1:
INSERT INTO tab1 VALUES (ROW(1,7));
INSERT INTO tab1 VALUES (ROW(2,10));
Finally, assume that you created another table named tab2:
CREATE TABLE tab2 (colx INT)
Now you can use dot notation to insert the value from only the part_id column of table tab1 into the tab2 table:
INSERT INTO tab2
VALUES ((SELECT t FROM tab1 t
WHERE part_id = 1).part_id)
The asterisk form of dot notation is not necessary when you want to select all fields of a ROW-type column because you can specify the column name alone to select all of its fields. The asterisk form of dot notation can be quite helpful, however, when you use a subquery, as in the preceding example, or when you call a user-defined function to return ROW-type values.
Suppose that a user-defined function named new_row returns ROW-type values, and you want to call this function to insert the ROW-type values into a table. Asterisk notation makes it easy to specify that all the ROW-type values that the new_row( ) function returns are to be inserted into the table:
INSERT INTO mytab2 SELECT new_row (mycol).* FROM mytab1
References to the fields of a ROW-type column or a ROW-type expression are not allowed in fragment expressions. A fragment expression is an expression that defines a table fragment or an index fragment in SQL statements like CREATE TABLE, CREATE INDEX, and ALTER FRAGMENT.
Additional Examples of How to Specify Names With the Dot Notation
Dot notation is used for identifying record fields, object attributes, and items inside packages or other schemas. When you combine these items, you might need to use expressions with multiple levels of dots, where it is not always clear what each dot refers to. Here are some of the combinations:
Field or Attribute of a Function Return Value
func_name().field_name
func_name().attribute_name
Schema Object Owned by Another Schema
schema_name.table_name
schema_name.procedure_name()
schema_name.type_name.member_name()
Packaged Object Owned by Another User
schema_name.package_name.procedure_name()
schema_name.package_name.record_name.field_name
Record Containing an Object Type
record_name.field_name.attribute_name
record_name.field_name.member_name()
Differences in Name Resolution Between PL/SQL and SQL
The name resolution rules for PL/SQL and SQL are similar. You can avoid the few differences if you follow the capture avoidance rules. For compatibility, the SQL rules are more permissive than the PL/SQL rules. SQL rules, which are mostly context sensitive, recognize as legal more situations and DML statements than the PL/SQL rules.
PL/SQL uses the same name-resolution rules as SQL when the PL/SQL compiler processes a SQL statement, such as a DML statement. For example, for a name such as HR.JOBS, SQL matches objects in the HR schema first, then packages, types, tables, and views in the current schema.
PL/SQL uses a different order to resolve names in PL/SQL statements such as assignments and procedure calls. In the case of a name HR.JOBS, PL/SQL searches first for packages, types, tables, and views named HR in the current schema, then for objects in the HR schema.

What are valid table names in SQLite?

Which combinations of characters make a valid table name in SQLite? Do all combinations of alphanumerics (A-Z, a-z and 0-9) constitute a valid name?
Ex. CREATE TABLE 123abc(...);
What about a combination of alphanumerics with dashes "-" and periods ".", is that valid as well?
Ex. CREATE TABLE 123abc.txt(...);
Ex. CREATE TABLE 123abc-ABC.txt(...);
Thank you.
I haven't found a reference for it, but table names that are valid without using brackets around them should be any alphanumeric combination that doesn't start with a digit:
abc123 - valid
123abc - not valid
abc_123 - valid
_123abc - valid
abc-abc - not valid (looks like an expression)
abc.abc - not valid (looks like a database.table notation)
With brackets you should be able to use pretty much anything as a table name:
[This should-be a_valid.table+name!?]
All of these are allowed, but you may have to quote them in "".
sqlite> CREATE TABLE "123abc"(col);
sqlite> CREATE TABLE "123abc.txt"(col);
sqlite> CREATE TABLE "123abc-ABC.txt"(col);
sqlite> select tbl_name from sqlite_master;
123abc
123abc.txt
123abc-ABC.txt
In general, though, you should stick to the alphabet.
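The session above can be reproduced with Python's built-in sqlite3 module. A minimal sketch; the unquoted name 456def is made up to show the failing case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Quoted names: digits first, periods and dashes are all accepted.
for name in ("123abc", "123abc.txt", "123abc-ABC.txt"):
    conn.execute('CREATE TABLE "{}"(col)'.format(name))

tables = [row[0] for row in conn.execute("SELECT tbl_name FROM sqlite_master")]

# The same kind of name without quotes is rejected by the SQL parser.
try:
    conn.execute("CREATE TABLE 456def(col)")
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)

print(tables)
print(unquoted_error)
```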
Per Clemens on the sqlite-users mailing list:
Everything is allowed, except names beginning with "sqlite_".
CREATE TABLE "TABLE"("#!#""'☺\", "");
You can use keywords ("TABLE"), special characters ("#!#""'☺\"), and even the empty string ("").
From SQLite documentation on CREATE TABLE, the only names forbidden are those that begin with sqlite_ :
Table names that begin with "sqlite_" are reserved for internal use. It is an error to attempt to create a table with a name that starts with "sqlite_".
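The reserved prefix is easy to confirm from Python's sqlite3 module. In this sketch the helper try_create and the table names are made up for illustration.

```python
import sqlite3


def try_create(table_name):
    """Try to create a table with the given (quoted) name in a fresh database."""
    conn = sqlite3.connect(":memory:")
    try:
        # Double any embedded quote so the identifier stays well-formed.
        quoted = '"{}"'.format(table_name.replace('"', '""'))
        conn.execute("CREATE TABLE {}(col)".format(quoted))
        return "ok"
    except sqlite3.OperationalError as exc:
        return "error: {}".format(exc)
    finally:
        conn.close()


print(try_create("TABLE"))        # a keyword, accepted when quoted
print(try_create("my table!"))    # spaces and punctuation, accepted when quoted
print(try_create("sqlite_demo"))  # reserved prefix, rejected
```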
If you use periods in the name, you will run into issues with your SQL queries, because an unquoted period is read as the database.table separator. So I would avoid them.