How to count the number of commas present in a CSV file - sql

I have a CSV file (ABC.txt) which has data as below:
1234,"djjdjd",45566,84774,45666,"djdjd"
I want to count the number of commas present in a row of this CSV file.
How can I do this?

REGEXP_COUNT() can do this easily:
with tbl(data_string) as
(
select '1234,"djjdjd",45566,84774,45666,"djdjd"' from dual
)
select regexp_count(data_string, ',') from tbl;

For a pure (Oracle) SQL solution, try
SELECT NVL(LENGTH(REGEXP_REPLACE(csv_row, '[^,]', '')), 0) FROM csv_table
assuming that you have the data from your CSV file stored in the database.
The query removes every character except commas from the original string, so the length of what remains equals the number of commas. The NVL covers rows without any comma, where Oracle treats the resulting empty string as NULL.
Note that this also counts commas inside quoted CSV fields, which you might want to treat differently.
Alternative (see Alex K.'s comment):
SELECT LENGTH(csv_row) - NVL(LENGTH(REPLACE(csv_row, ',', '')), 0) FROM csv_table
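To illustrate the note about commas inside CSV fields, here is a minimal Python sketch (not part of the SQL answers; the function names and the second sample row are made up for illustration). It compares a plain comma count with a count of field separators taken from Python's csv module, which ignores commas inside double-quoted fields.

```python
import csv
import io


def count_commas(row: str) -> int:
    """Count every comma in the row, quoted or not."""
    return row.count(",")


def count_separators(row: str) -> int:
    """Count only the commas that separate fields (quoted commas are ignored)."""
    fields = next(csv.reader(io.StringIO(row)), [])
    return max(len(fields) - 1, 0)


sample = '1234,"djjdjd",45566,84774,45666,"djdjd"'
tricky = '1,"a,b",3'  # hypothetical row with a comma inside a quoted field

print(count_commas(sample), count_separators(sample))
print(count_commas(tricky), count_separators(tricky))
```

For the row from the question both counts agree; they only differ once a quoted field contains a comma.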

Related

Impala / Hive - Extract text from comma delimited string where each separation doesn't match a pattern

Is there a way in Hive or Impala to extract a string from a delimited string, but only where the string I want doesn't match one or more patterns?
For instance, I have a field with IPs (the number varies depending on the network adapters):
169.254.182.175,192.168.0.1,10.199.44.111
I would like to extract the IPs that don't start with 169.254. (there could be many of these) and don't equal 192.168.0.1.
The IPs can be in any order as well.
I tried using substr with nested CASE expressions, but because of the unknown number of IPs in the string it didn't work out.
Could this be accomplished with regexp_extract or something similar?
Thanks,
You may use regexp_replace with a capturing group for each pattern that you do not want to keep, and reference only the group of interest in the replacement string.
See example below in Impala (impalad version 3.4.0):
select
addr_list,
/*Concat is used just for visualization*/
rtrim(ltrim(regexp_replace(addr_list,concat(
/*Group of 169.254.*.* that should be excluded*/
'(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
/*Another group for 192.168.0.1*/
'(192\\.168\\.0\\.1)', '|',
/*And the group that we need to keep*/
'(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
/*So keep the third group in the output.
Other groups will be replaced with empty string*/
), '\\3'), ','), ',') as ip_whitelist
from(values
('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
('10.58.3.142,169.254.2.12'),
('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
addr_list | ip_whitelist
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111
10.58.3.142,169.254.2.12 | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3
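regexp_extract works differently: the same regex with 3 as the return group returns nothing at all for cases 1 and 3. The likely reason is that regexp_extract only looks at the first match in the string, and in those rows the first match is an excluded address (group 1 or 2), so group 3 is empty.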
select
t.addr_list,
rtrim(ltrim(regexp_replace(addr_list, r.regex, '\\3'), ','), ',') as ip_whitelist,
regexp_extract(addr_list, r.regex, 3) as ip_wl_extract
from(values
('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
('10.58.3.142,169.254.2.12'),
('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
cross join (
select concat(
'(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
'(192\\.168\\.0\\.1)', '|',
'(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
) as regex
) as r
addr_list | ip_whitelist | ip_wl_extract
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111 | (empty)
10.58.3.142,169.254.2.12 | 10.58.3.142 | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3 | (empty)
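One possible explanation for the difference, sketched in Python rather than Impala SQL: a replace call visits every match in the string, while an extract call looks only at the first one. The sketch below reuses the same alternation and mimics both behaviours with Python's re module. That the Impala functions behave this way is an assumption inferred from the output above, and the function names are made up for illustration.

```python
import re

# Same alternation as in the SQL: groups 1 and 2 are excluded, group 3 is kept.
PATTERN = re.compile(
    r"(169\.254\.\d{1,3}\.\d{1,3})"
    r"|(192\.168\.0\.1)"
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
)


def whitelist_replace(addr_list: str) -> str:
    """Replace every match with its group 3 (empty when group 3 did not match)."""
    kept = PATTERN.sub(lambda m: m.group(3) or "", addr_list)
    return kept.strip(",")  # drop the leftover leading/trailing commas


def whitelist_extract(addr_list: str) -> str:
    """Look only at the first match, as an extract-style function would."""
    first = PATTERN.search(addr_list)
    return (first.group(3) or "") if first else ""


rows = [
    "169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2",
    "10.58.3.142,169.254.2.12",
    "192.168.0.1,192.100.0.2,154.16.171.3",
]
for row in rows:
    print(repr(whitelist_replace(row)), repr(whitelist_extract(row)))
```

In rows 1 and 3 the first address is an excluded one, so the first match never populates group 3 and the extract-style call comes back empty.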

Regex: how to get the text between a few colons?

I have a lot of strings like the ones below in my database:
product1:1stparty:single_aduls:android:
product2:3rdparty:married_adults:ios:
product3:3rdparty:other_adults:android:
I need a regex to get only the text after the product name and before the device category. So, in the first line I'd get 1stparty:single_aduls, in the second 3rdparty:married_adults and in the third 3rdparty:other_adults. I'm stuck and can't find a way to solve that. Could anyone help me please?
As a regular expression, you can use:
select regexp_extract('product1:1stparty:single_aduls:android:', '^[^:]*:(.*):[^:]*:$')
This returns everything after the first colon and before the penultimate colon.
We can try using REGEXP_REPLACE here:
SELECT REGEXP_REPLACE(val, r"^.*?:|:[^:]+:$", "") AS output
FROM yourTable;
This approach removes either the leading ...: or trailing :...: from the column, leaving behind the content you want.
You can also use the standard split function and access the resulting array elements by index, which is quite clear to read and understand.
with a as (
select split('product1:1stparty:single_aduls:android:', ':') as splitted
)
select splitted[ordinal(2)] || ':' || splitted[ordinal(3)] as subs
from a
Consider the example below:
with your_table as (
select 'product1:1stparty:single_aduls:android:' txt union all
select 'product2:3rdparty:married_adults:ios:' union all
select 'product3:3rdparty:other_adults:android:'
)
select *,
(
select string_agg(part, ':' order by offset)
from unnest(split(txt, ':')) part with offset
where offset in (1, 2)
) result
from your_table
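For the three sample rows, the result column should contain 1stparty:single_aduls, 3rdparty:married_adults and 3rdparty:other_adults.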
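As a quick cross-check outside the database, the three ideas from the answers (extract with a capturing group, strip both ends with a replace, split and rejoin) can be mirrored in Python. This is only a sketch with made-up function names; Python's re module is not the same engine as the one used in the SQL, although these particular patterns behave the same way in both.

```python
import re


def middle_by_extract(s: str) -> str:
    """Capture everything between the first colon and the penultimate colon."""
    m = re.match(r"^[^:]*:(.*):[^:]*:$", s)
    return m.group(1) if m else ""


def middle_by_replace(s: str) -> str:
    """Strip the leading 'xxx:' and the trailing ':xxx:' parts."""
    return re.sub(r"^.*?:|:[^:]+:$", "", s)


def middle_by_split(s: str) -> str:
    """Split on colons and rejoin the second and third elements."""
    return ":".join(s.split(":")[1:3])


samples = [
    "product1:1stparty:single_aduls:android:",
    "product2:3rdparty:married_adults:ios:",
    "product3:3rdparty:other_adults:android:",
]
for s in samples:
    print(middle_by_extract(s), middle_by_replace(s), middle_by_split(s))
```

The split variant assumes the wanted text is always exactly the second and third fields, whereas the two regex variants keep whatever sits between the first and the last field.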

Split and rejoin part of a string in BigQuery

I'm querying the GitHub sample_files dataset in BigQuery and I want to get the path excluding the filename.
So if I have /path/to/file.txt
I want it to return /path/to
In python I could do something like
"/".join(str.split(a, "/")[0:-1])
but I'm not sure how to do that in BigQuery/SQL.
Any ideas? Thanks!
One method is regexp_replace():
regexp_replace('/path/to/file.txt', '/[^/]+$', '')
I would use REGEXP_EXTRACT, as in the example below:
REGEXP_EXTRACT(full_path, r'(.+)/[^/]*$')
If for some reason you need, or are more comfortable with, the same split-and-rejoin approach as in your question, you can use SPLIT as below (provided along with sample data for testing and playing with):
#standardSQL
WITH `project.dataset.table` AS (
SELECT '/path/to/file.txt' full_path UNION ALL
SELECT '/path/to/'
)
SELECT full_path,
(
SELECT STRING_AGG(part, '/' ORDER BY OFFSET)
FROM UNNEST(SPLIT(full_path, '/')) part WITH OFFSET
WHERE OFFSET < ARRAY_LENGTH(SPLIT(full_path, '/')) - 1
) path
FROM `project.dataset.table`
with output
Row full_path path
1 /path/to/file.txt /path/to
2 /path/to/ /path/to
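The three approaches do not treat a trailing slash in quite the same way. The Python sketch below mirrors them so the difference is easy to see (the function names are made up, and Python's re stands in for the BigQuery regex functions, which is an assumption that holds for these simple patterns).

```python
import re


def dir_by_replace(path: str) -> str:
    """Remove the last '/name' segment; needs at least one character after the slash."""
    return re.sub(r"/[^/]+$", "", path)


def dir_by_extract(path: str):
    """Capture everything before the last slash; None if there is no match."""
    m = re.search(r"(.+)/[^/]*$", path)
    return m.group(1) if m else None


def dir_by_split(path: str) -> str:
    """Split on '/', drop the last element, and rejoin."""
    return "/".join(path.split("/")[:-1])


for p in ["/path/to/file.txt", "/path/to/"]:
    print(p, dir_by_replace(p), dir_by_extract(p), dir_by_split(p))
```

For /path/to/file.txt all three agree. For /path/to/ the replace variant leaves the string unchanged, because nothing follows the last slash, while the extract and split variants drop the trailing slash.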

create new columns from xml value in hive

I have a column desc_txt in my table whose contents resemble XML, as shown below:
desc_txt
-----------
<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>
The requirement is to create a new table/view from this table with additional columns such as Criticality, Country and City, holding the values High, India and Indore, respectively.
How can this be achieved in Hive/Impala?
This can be done in two steps. I assumed you have only four columns to pull.
Load the data as is into a table, with everything in one column.
Then use the SQL below to split the data into multiple columns. I assumed 4 columns; you can increase this as per your requirement.
with t as (
SELECT rtrim(ltrim(
regexp_replace( replace( trim(
regexp_replace(
regexp_replace("<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>","</?[^>]*>",",")
,',,',',') ), ' ,', ',' ), '(,){2,}', ','),','),',')
str)
select split_part(str, ',', 1) as first_col,
split_part(str, ',', 2) as second_col,
split_part(str, ',', 3) as third_col,
split_part(str, ',', 4) as fourth_col
from t
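The query is tricky: first it replaces every tag with a comma, then it collapses runs of commas into a single comma, then it strips the commas from the start and end of the string. split_part then splits the whole string on commas to create the individual columns. Note that the field names and their values alternate in the resulting list, so with the sample data first_col is Criticality and second_col is High.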
HTH...
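The same tag-stripping steps can be traced in a few lines of Python, which makes the intermediate string easier to inspect. This is a sketch, not Hive/Impala code: the function names are made up, and the final pairing of names with values is an extra step the SQL does not perform. It also assumes that no value contains a comma.

```python
import re

desc_txt = (
    "<td><strong>Criticality</strong></td><td>High</td></tr>"
    "<td><strong>Country</strong></td><td>India</td></tr>"
    "<tr><td><strong>City</strong></td><td>Indore</td>"
)


def tags_to_csv(html: str) -> str:
    """Replace each tag with a comma, collapse comma runs, trim the ends."""
    no_tags = re.sub(r"</?[^>]*>", ",", html)
    collapsed = re.sub(r",{2,}", ",", no_tags)
    return collapsed.strip(",")


def to_columns(html: str) -> dict:
    """Pair alternating names and values: name1,value1,name2,value2,..."""
    parts = tags_to_csv(html).split(",")
    return dict(zip(parts[0::2], parts[1::2]))


print(tags_to_csv(desc_txt))
print(to_columns(desc_txt))
```

In the SQL, picking the even-numbered positions with split_part (2, 4, 6) would give the values for Criticality, Country and City when the fields always appear in that order.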

Split string into words using Postgres

I am looking for some help in separating scientific names in my data. I want to take only the genus names and group them, but genus and species are stored together in the same column. I saw that SQL Server has a CHARINDEX function, but PostgreSQL does not. Does a function need to be created for this? If so, how would it look?
I want to change 'Mallotus philippensis' to just 'Mallotus' or to just 'philippensis'.
I am currently using Postgres 11 and 12.
Use SPLIT_PART:
WITH yourTable AS (
SELECT 'Mallotus philippensis'::text AS genus
)
SELECT
SPLIT_PART(genus, ' ', 1) AS genus,
SPLIT_PART(genus, ' ', 2) AS species
FROM yourTable;
string_to_array will probably be slightly more efficient than split_part here, because the string is split only once per row.
SELECT
val_arr[1] AS genus,
val_arr[2] AS species
FROM (
SELECT string_to_array(val, ' ') as val_arr
FROM (
VALUES
('aaa bbb'),
('cc dddd'),
('e fffff')
) t (val)
) tt;
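One practical difference between the two approaches shows up when a row has only one word: in PostgreSQL, split_part returns an empty string for a missing field, while an out-of-range array subscript returns NULL. The Python sketch below emulates both for positive field numbers only; the helper names are hypothetical and None stands in for NULL.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Emulate split_part for n >= 1: a missing field gives an empty string."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def array_element(text: str, delimiter: str, n: int):
    """Emulate (string_to_array(text, delimiter))[n]: a missing field gives None."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else None


for name in ["Mallotus philippensis", "Mallotus"]:
    print(split_part(name, " ", 1), repr(split_part(name, " ", 2)),
          repr(array_element(name, " ", 2)))
```

This matters for grouping and filtering: a check such as species IS NULL only finds genus-only rows with the array version.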