BigQuery extract string between two "." - google-bigquery

I have three string types:
'en-ae.sssports.com','amazon.co.uk','farfetch.com'
I require the client name i.e. Sssports, Amazon, Farfetch from each.
I tried using the regexp '(?:.)[^.]*(?:.)' to extract the string between the two dots, but I need dynamic code that extracts the name from all three string types.

You can try the query below:
SELECT INITCAP(RTRIM(NET.REG_DOMAIN(str), '.' || NET.PUBLIC_SUFFIX(str))) AS client_name
FROM UNNEST(['en-ae.sssports.com','amazon.co.uk','farfetch.com']) str
Query results
+-------------+
| client_name |
+-------------+
| Sssports |
| Amazon |
| Farfetch |
+-------------+
References
NET.REG_DOMAIN()
Takes a URL as a STRING and returns the registered or registerable domain (the public suffix plus one preceding label), as a STRING.
NET.PUBLIC_SUFFIX()
Takes a URL as a STRING and returns the public suffix (such as com, org, or net) as a STRING.
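One caveat worth checking before reusing the query above: the second argument of RTRIM is a set of characters to strip, not a suffix, so a name that ends in one of the suffix's letters can lose extra characters. The Python sketch below mirrors that behaviour with str.rstrip, which also takes a character set, and compares it with removing the suffix as a whole. The (registered domain, public suffix) pairs are hard-coded stand-ins for what NET.REG_DOMAIN and NET.PUBLIC_SUFFIX return, and cisco.com is a hypothetical extra input that is not in the question.

```python
# Stand-ins for NET.REG_DOMAIN(str) and NET.PUBLIC_SUFFIX(str); hard-coded assumptions.
pairs = [
    ("sssports.com", "com"),
    ("amazon.co.uk", "co.uk"),
    ("farfetch.com", "com"),
    ("cisco.com", "com"),  # hypothetical input, not from the question
]


def trim_charset(reg_domain, public_suffix):
    # Like RTRIM(value, chars): strips any trailing character found in the set.
    return reg_domain.rstrip("." + public_suffix)


def strip_suffix(reg_domain, public_suffix):
    # Removes the exact trailing ".suffix" once, if present.
    tail = "." + public_suffix
    return reg_domain[: -len(tail)] if reg_domain.endswith(tail) else reg_domain


charset_names = [trim_charset(d, s).capitalize() for d, s in pairs]
suffix_names = [strip_suffix(d, s).capitalize() for d, s in pairs]

print(charset_names)  # the three names from the question survive, cisco.com does not
print(suffix_names)
```

In BigQuery itself, taking the first label of the registered domain, for example SPLIT(NET.REG_DOMAIN(str), '.')[OFFSET(0)], might avoid the character-set trim; treat that as an untested suggestion rather than part of the original answer.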

Related

single vs double quotes in WHERE clause returning different results

It seemed that Athena was including CSV column headers in my query results. I recreated the tables with the DDL included below using TBLPROPERTIES ("skip.header.line.count"="1") to remove the headers.
I'm running the following queries to validate that the CREATE TABLE DDL worked. The only difference between the queries below is the use of single vs. double quotes in the WHERE clause. The issue is that I'm getting different results when running them.
Query 1:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
The query above returns the actual data (see sample table below), rather than only rows where the file_name field is "file_name".
+-------+--------------------+
| Row # | file_name |
+-------+--------------------+
| 1 | |
| 2 | 1586786323.8194735 |
| 3 | |
| 4 | 1586858857.3117666 |
| 5 | 1586858857.3117666 |
| 6 | 1586858857.3117666 |
| ... | |
+-------+--------------------+
Query 2:
SELECT
file_name
FROM table
WHERE file_name = 'file_name'
The query above returns no results, as expected if the CSV column headers are not being included in the results.
I'm quite confused by the first query returning any results at all. I've scoured the AWS documentation and it doesn't seem that I did anything wrong with the DDL, and SQL should not care whether I use single vs. double quotes. What am I missing here?
DDL:
CREATE EXTERNAL TABLE `table` (
`file_name` string,
`ticker` string,
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION
's3://{bucket_name}/{folder}/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Single quotes are the SQL standard for delimiting strings.
Double quotes are used for escaping identifiers. So "file_name" refers to the column of that name. Some databases also accept double quotes for strings, which is just confusing. Don't do that.
In your original tags, for instance, Hive uses backticks to escape identifiers and double quotes for strings. Presto uses double quotes (which is the standard) to delimit identifiers.
Just to expand on Gordon's answer a little. Your first query:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
In this case, the double quotes are causing the query engine to treat "file_name" as a column identifier, not a value, so that query is functionally the same as:
SELECT
file_name
FROM table
WHERE file_name = file_name
Obviously (when written that way) the condition is always true, so the full table is returned.
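The same effect can be reproduced locally. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for Athena, on the assumption that it resolves a double-quoted name to a column when one exists, as described above. The table name t and the two sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (file_name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("1586786323.8194735",), ("1586858857.3117666",)],
)

# Double quotes: "file_name" is read as the column, so this compares the column to itself.
double_quoted = conn.execute(
    'SELECT file_name FROM t WHERE file_name = "file_name"'
).fetchall()

# Single quotes: 'file_name' is a string literal, which no row contains.
single_quoted = conn.execute(
    "SELECT file_name FROM t WHERE file_name = 'file_name'"
).fetchall()

print(len(double_quoted), len(single_quoted))
```

Rows where file_name is NULL would not satisfy file_name = file_name, so "always true" holds only for non-NULL values.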

Only return fields that contain numbers or special characters EXCEPT . Error

In Redshift I want to return fields that contain numbers or special characters EXCEPT . (anything other than a-z and A-Z)
The following gets me anything that contains a number but I need to extend this to any special character except full stop (.)
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]'
I need something like:
SELECT DISTINCT name
FROM table
WHERE name ~ '[0-9]' OR name ~'[,#';:#~[]{}etcetc'
Sample Data:
name
john
joh1n1
j!ohn!
jo!h2n
joh.n
jo.&hn
j.3ohn
j.$9ohn
Expected Output:
name
joh1n1
j!ohn!
jo!h2n
jo.&hn
j.3ohn
j.$9ohn
You may use
WHERE name !~ '^[[:alpha:].]+$'
Here, all records that do not consist of only alphabetic or dot symbols will be returned. ^ matches the start of a string position, [[:alpha:].]+ matches one or more letters or dots and $ matches the end of string position.
If it is for PostgreSQL you may use
WHERE name SIMILAR TO '%[^[:alpha:].]%'
The SIMILAR TO operator accepts POSIX character classes and bracket expressions and wildcards, too, and requires a full string match. So, % allows any chars before any 1 char other than letter or dot ([^[:alpha:].]), and then there may also be any other chars till the end of the string.
You can do:
SELECT DISTINCT name FROM table WHERE name !~* '[a-z]'
This means: match on names that do not contain any alphabetic character.
Operator !~* means:
Does not match regular expression, case insensitive
Edit based on the provided sample data and expected results.
If you want to match on names that contain at least one character other than an alphabetic character or a dot, then you can do:
select * from mytable where name ~* '[^a-z.]'
Demo on DB Fiddle:
with mytable(name) as (values
('john'),
('joh1n1'),
('j!ohn!'),
('jo!h2n'),
('joh.n'),
('jo.&hn'),
('j.3ohn'),
('j.$9ohn')
)
select * from mytable where name ~* '[^a-z.]'
| name |
| :------ |
| joh1n1 |
| j!ohn! |
| jo!h2n |
| jo.&hn |
| j.3ohn |
| j.$9ohn |
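To compare the patterns from the answers above without a database, the following Python sketch applies rough equivalents to the sample data with the re module. It assumes ASCII letters only ([[:alpha:]] can be locale dependent in a real engine), and the variable names are illustrative.

```python
import re

names = ["john", "joh1n1", "j!ohn!", "jo!h2n", "joh.n", "jo.&hn", "j.3ohn", "j.$9ohn"]

# Rough equivalent of name ~* '[^a-z.]':
# at least one character that is neither a letter nor a dot.
has_other_char = [n for n in names if re.search(r"[^a-z.]", n, re.IGNORECASE)]

# Rough equivalent of name !~ '^[[:alpha:].]+$':
# the whole string is not made of letters and dots only.
not_only_alpha_dot = [n for n in names if not re.fullmatch(r"[A-Za-z.]+", n)]

# Rough equivalent of name !~* '[a-z]':
# names with no letter at all, which is why the first attempt returns nothing here.
no_letters = [n for n in names if not re.search(r"[a-z]", n, re.IGNORECASE)]

print(has_other_char)
print(not_only_alpha_dot)
print(no_letters)
```

The first two filters agree on this sample. They would differ on an empty string, which the negated full-match form keeps and the search form drops.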

Format a number to NOT have commas (1,000,000 -> 1000000) in Google BigQuery

In BigQuery, how do we format a number in the result set so that it has no commas, e.g. 1,000,000 becomes 1000000?
I am assuming that your data type is string here.
You can use the REGEXP_REPLACE function to remove certain symbols from strings.
SELECT REGEXP_REPLACE("1,000,000", r',', '') AS Output
Returns:
+-----+---------+
| Row | Output |
+-----+---------+
| 1 | 1000000 |
+-----+---------+
If your data contains strings with and without commas, this function will return the ones without as they are so you don't need to worry about filtering the input.
See the BigQuery documentation for REGEXP_REPLACE for details.
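The replacement logic is easy to sanity-check outside BigQuery. This Python sketch applies the same comma-stripping substitution with re.sub; the inputs other than "1,000,000" are made-up examples, and the final int conversion is an extra step that the answer does not include.

```python
import re

values = ["1,000,000", "2500", "12,345"]  # only the first value comes from the question

# Same idea as REGEXP_REPLACE(value, r',', ''): drop every comma, leave the rest untouched.
cleaned = [re.sub(r",", "", v) for v in values]

# Optional: once the commas are gone, the strings parse as integers.
as_numbers = [int(v) for v in cleaned]

print(cleaned)
print(as_numbers)
```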

Extract particular character using StandardSQL

I would like to extract particular characters from strings using Standard SQL.
I would like to extract the value after limit=.
For instance, from the strings below I would like to extract 10, 3 and null. I would also like to replace every null with 1.
partner=&limit=10
partner=aex&limit=3&filters%5Bpartner%5D
partner=aex&limit=&filters%5Bpartner%5D
I only know how to use the substring function, but the problem here is that the position of limit= is not always the same.
You can use REGEXP_EXTRACT. For example:
SELECT REGEXP_EXTRACT('partner=aex&limit=3&filters%5Bpartner%5D', 'limit=(\\d+)');
+-------+
| $col1 |
+-------+
| 3 |
+-------+
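The example above does not cover the second half of the question, replacing a missing limit with 1. The Python sketch below runs the same pattern over the three strings from the question and falls back to a default when nothing matches; extract_limit is a hypothetical helper name.

```python
import re

queries = [
    "partner=&limit=10",
    "partner=aex&limit=3&filters%5Bpartner%5D",
    "partner=aex&limit=&filters%5Bpartner%5D",
]


def extract_limit(query_string, default=1):
    # Same pattern as the REGEXP_EXTRACT call: digits right after "limit=".
    match = re.search(r"limit=(\d+)", query_string)
    # No digits after "limit=" means no match, so fall back to the default.
    return int(match.group(1)) if match else default


limits = [extract_limit(q) for q in queries]
print(limits)
```

In BigQuery, wrapping the call as IFNULL(REGEXP_EXTRACT(...), '1') should give a comparable result, since REGEXP_EXTRACT returns NULL when there is no match; that variant is a suggestion and not part of the original answer.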

Query to search substring in column

I have a table that has a substring value in the column and I want to write a query that checks if input string has the substring.
My table looks like:
| company | host |
| ------- | ---------- |
| ebay | ebay.com |
| google | google.com |
| yahoo | yahoo.com |
My input will be like www.ebay.com or https://www.ebay.com or www.qa.ebay.com or www.dev.ebay.com.
If I get any of these inputs I want to return the first record.
I looked at CHARINDEX and INSTR, but they work in reverse: in my scenario the substring to search for is in the table and the full string is the input.
Any help is appreciated.
You can use like for this, but you also need string concatenation. In ANSI standard SQL, this looks like:
select t.*
from t
where @inputstring like concat('%.', t.host)
where @inputstring is the string you are inputting.
Note: You can also use the concatenation infix operation, which is typically || (standard) or +.
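A quick way to try this reversed LIKE, with the input on the left and the column inside the pattern, is SQLite through Python's sqlite3 module. The sketch below is an approximation: it uses the || operator instead of concat, a ? placeholder for the input, and the table name t from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (company TEXT, host TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("ebay", "ebay.com"), ("google", "google.com"), ("yahoo", "yahoo.com")],
)

# The input string is matched against a pattern built from each row's host.
query = "SELECT company FROM t WHERE ? LIKE '%.' || host"

inputs = ["www.ebay.com", "https://www.ebay.com", "www.qa.ebay.com", "www.dev.ebay.com"]
matches = {s: [row[0] for row in conn.execute(query, (s,))] for s in inputs}

# A bare host has no dot in front of it, so the '%.' prefix does not match.
bare = [row[0] for row in conn.execute(query, ("ebay.com",))]

print(matches)
print(bare)
```

If bare hosts such as ebay.com can also arrive as input, adding OR ? = host to the WHERE clause is one way to cover them.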
You can use the SQL wildcard like so:
SELECT * FROM table WHERE host LIKE '%ebay.com';
Go for this:
SELECT * FROM table WHERE host LIKE '%SearchString%'
It will pull all rows containing the SearchString.
You can achieve this using the LIKE operator:
Select * from yourtable
where ? like concat('%', company, '%');
Replace the ? parameter with your input.