Format a number to NOT have commas (1,000,000 -> 1000000) in Google BigQuery - google-bigquery

In BigQuery: how do we format a number that is part of the result set so that it has no commas, e.g. 1,000,000 to 1000000?

I am assuming that your data type is string here.
You can use the REGEXP_REPLACE function to remove certain symbols from strings.
SELECT REGEXP_REPLACE("1,000,000", r',', '') AS Output
Returns:
+-----+---------+
| Row | Output  |
+-----+---------+
| 1   | 1000000 |
+-----+---------+
If your data contains strings both with and without commas, the function returns the comma-free strings unchanged, so you don't need to filter the input.
Documentation for this function can be found here.
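If you ultimately need the cleaned value as a number rather than a string, you can cast the result of REGEXP_REPLACE. A minimal sketch, assuming an integer value in a hypothetical column amount_text of a table my_table:
-- amount_text and my_table are placeholder names; use FLOAT64 instead of INT64 for decimal values
SELECT CAST(REGEXP_REPLACE(amount_text, r',', '') AS INT64) AS amount
FROM my_table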

Related

How to extract string between quotes, with a delimiter in Snowflake

I've got a bunch of fields which are double quoted with delimiters but for the life of me, I'm unable to get any regex to pull out what I need.
In short - the delimiters can be in any order and I just need the value that's between the double quotes after each delimiter. Some sample data is below, can anyone help with what regex might extract each value? I've tried
'delimiter_1=\\W+\\w+'
but I only seem to get the first word after the delimiter (unfortunately - they do have spaces in the value)
some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"
The problem is returning a varying number of values from the regex function. For example, if you know that there will be 4 delimiters, you can use REGEXP_SUBSTR for each match, but if the text has a varying number of delimiters, this approach doesn't work.
I think the best solution is to write a function to parse the text:
create or replace function superparser( SRC varchar )
returns array
language javascript
as
$$
// Capture every key="value" pair: group 1 is the key, group 2 is the quoted value.
const regexp = /([^ =]*)="([^"]*)"/gm;
const array = [...SRC.matchAll(regexp)];
return array;
$$;
Then you can use LATERAL FLATTEN to process the returning values from the function:
select f.VALUE[1]::STRING key, f.VALUE[2]::STRING value
from values ('some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"') tmp(x),
lateral flatten( superparser(x) ) f;
+-------------+------------------+
| KEY         | VALUE            |
+-------------+------------------+
| delimiter_1 | some value       |
| delimiter_2 | some other value |
| delimiter_4 | another value    |
| delimiter_3 | the last value   |
+-------------+------------------+
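If you would rather get the pairs back as a single OBJECT per input row instead of one row per pair, the flattened output can be aggregated with OBJECT_AGG. A sketch built on the same superparser function (the VALUES clause is just illustrative input):
select tmp.x
      ,object_agg(f.VALUE[1]::string, f.VALUE[2]::variant) as pairs  -- keys must be strings, values variants
from values ('some content delimiter_1="some value" delimiter_2="some other value"') tmp(x),
     lateral flatten( superparser(x) ) f
group by tmp.x;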

How to add delimiter to String after every n character using hive functions?

I have a Hive table column whose value is as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any Hive string manipulation function available to achieve the above output?
For a fixed-length string this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution works for strings of dynamic length. Split the string at every position that is preceded by the end of the last match (\\G) plus two digits (\\d{2}), expressed as a lookbehind ((?<=...)); then concatenate the array with concat_ws and remove the trailing delimiter (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If the string can contain characters other than digits, use a dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex and lookahead: replace every 2 characters that are followed by at least one more character with themselves plus ':'.
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0               |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+
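The same lookahead pattern generalizes to "every n characters" by changing the quantifier; a sketch with n hard-coded to 3 purely for illustration:
select regexp_replace('112312452343','.{3}(?=.)','$0:')
Result:
112:312:452:343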

single vs double quotes in WHERE clause returning different results

It seemed that Athena was including CSV column headers in my query results. I recreated the tables with the DDL included below using TBLPROPERTIES ("skip.header.line.count"="1") to remove the headers.
I'm running the following queries to validate that the CREATE TABLE DDL worked. The only difference between the queries below is the use of single vs double quotes in the WHERE clause. The issue is that I'm getting different results when running them.
Query 1:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
The query above returns the actual data (see sample table below), rather than only rows where the file_name field is "file_name".
+-------+--------------------+
| Row # | file_name          |
+-------+--------------------+
| 1     |                    |
| 2     | 1586786323.8194735 |
| 3     |                    |
| 4     | 1586858857.3117666 |
| 5     | 1586858857.3117666 |
| 6     | 1586858857.3117666 |
| ...   |                    |
+-------+--------------------+
Query 2:
SELECT
file_name
FROM table
WHERE file_name = 'file_name'
The query above returns no results, as expected if the CSV column headers are not being included in the results.
I'm quite confused by the first query returning any results at all. I've scoured the AWS documentation at this point; it doesn't seem I did anything wrong with the DDL, and SQL should not care whether I use single vs. double quotes. What am I missing here?
DDL:
CREATE EXTERNAL TABLE `table` (
`file_name` string,
`ticker` string,
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION
's3://{bucket_name}/{folder}/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Single quotes are the SQL standard for delimiting strings.
Double quotes are used for escaping identifiers, so "file_name" refers to the column of that name. Some databases also accept double quotes for strings; that is just confusing, so don't do that.
In your original tags, for instance, Hive uses backticks to escape identifiers and double quotes for strings. Presto uses double quotes (which is the standard) to delimit identifiers.
Just to expand on Gordon's answer a little. Your first query:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
In this case, the double quotes are causing the query engine to treat "file_name" as a column identifier, not a value, so that query is functionally the same as:
SELECT
file_name
FROM table
WHERE file_name = file_name
Obviously (when written that way) the condition is true for every row where file_name is not NULL, so effectively the full table is returned.
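To make the distinction concrete in Athena's (Presto) syntax, double quotes delimit identifiers and single quotes delimit string literals. A small sketch against a placeholder table; this version really does compare the column's contents to the text file_name:
-- my_table is a placeholder name; "file_name" names the column, 'file_name' is a string literal
SELECT "file_name"
FROM my_table
WHERE "file_name" = 'file_name'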

Extract particular character using StandardSQL

I would like to extract particular characters from strings using StandardSQL.
I would like to extract the value that appears after limit=.
For instance, from the strings below I would like to extract 10, 3 and null. Wherever the result is null I would also like to replace it with 1.
partner=&limit=10
partner=aex&limit=3&filters%5Bpartner%5D
partner=aex&limit=&filters%5Bpartner%5D
I only know how to use substring function but the problem here is the positions of limit= are not always the same.
You can use REGEXP_EXTRACT. For example:
SELECT REGEXP_EXTRACT('partner=aex&limit=3&filters%5Bpartner%5D', 'limit=(\\d+)');
+-------+
| $col1 |
+-------+
| 3     |
+-------+
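To also cover the "make all null = 1" part of the question, you can cast the extracted value and fall back to 1 when there is no match. A minimal sketch, assuming the URL-style strings live in a hypothetical column query_string of a table my_table:
-- query_string and my_table are placeholder names
SELECT IFNULL(CAST(REGEXP_EXTRACT(query_string, r'limit=(\d+)') AS INT64), 1) AS limit_value
FROM my_table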

Concat multiple rows with a delimiter in Hive

I need to concat string values row-wise with '~' as the delimiter.
My data has the columns id, row_id and Comment.
I need to concat the 'Comment' column for each 'id' in ascending order of 'row_id' with '~' as the delimiter, so the expected output is one concatenated string per id (e.g. ABC~PRQ~XYZ for id 1).
GROUP_CONCAT is not an option since it's not recognized in my Hive version.
I can use collect_set or collect_list, but I won't be able to insert delimiter in between.
Is there any workaround?
collect_list returns an array, not a string.
The array can be converted to a delimited string using concat_ws.
This will work, though with no guaranteed order of the comments (see the sketch after the result table for a way to impose the row_id order):
select id
,concat_ws('~',collect_list(comment)) as comments
from mytable
group by id
;
+----+-------------+
| id | comments    |
+----+-------------+
| 1  | ABC~PRQ~XYZ |
| 2  | LMN~OPQ     |
+----+-------------+
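To get the comments concatenated in ascending row_id order, a common workaround (not formally guaranteed by Hive, so treat it as a sketch) is to pre-sort the rows in a subquery before collecting them:
select id
      ,concat_ws('~',collect_list(comment)) as comments
from
(select id, row_id, comment
 from mytable
 distribute by id   -- keep all rows of an id on one reducer
 sort by id, row_id -- sort within the reducer so collect_list sees rows in row_id order
) s
group by id
;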