How to parse simple data in BigQuery

I need to do a simple parse of some data coming from a field. Example:
1/2
1/3
10/20
12/31
I simply need to split this on "/". Is there a simple function that will allow me to do this?

The example below is for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, '1/2' list UNION ALL
SELECT 2, '1/3' UNION ALL
SELECT 3, '10/20' UNION ALL
SELECT 4, '15/' UNION ALL
SELECT 5, '12/31'
)
SELECT id,
SPLIT(list, '/')[SAFE_OFFSET(0)] AS first_element,
SPLIT(list, '/')[SAFE_OFFSET(1)] AS second_element
FROM `project.dataset.table`
-- ORDER BY id
with the result below:
Row  id  first_element  second_element
1    1   1              2
2    2   1              3
3    3   10             20
4    4   15
5    5   12             31

Check the following SQL functions:
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#split
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#regexp_extract_all
For example, with SPLIT you can do
SELECT parts[SAFE_OFFSET(0)], parts[SAFE_OFFSET(1)]
FROM (SELECT SPLIT(field, '/') parts FROM UNNEST(["1/2", "10/20"]) field)
(note that SPLIT's delimiter defaults to ',', so '/' must be passed explicitly)
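For intuition, here is a small Python sketch (not BigQuery code) of the same split-and-index semantics, including SAFE_OFFSET's return-null-instead-of-error behavior:

```python
def safe_offset(parts, i):
    # Mimic BigQuery's SAFE_OFFSET: None instead of an error when out of range.
    return parts[i] if i < len(parts) else None

for s in ["1/2", "1/3", "10/20", "15/", "12/31"]:
    parts = s.split("/")                       # SPLIT(list, '/')
    first = safe_offset(parts, 0)
    second = safe_offset(parts, 1)
    print(first, second)
```

Note that '15/' splits into ['15', ''] in both Python and BigQuery, so the second element is an empty string rather than NULL.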

Related

Converting monthly to daily data

I have monthly data that I would like to transform to daily data. The data looks like this. The extraction_dt is in date format.
isin  extraction_dt  yield
001   2013-01-31     100
001   2013-02-28     110
001   2013-03-31     105
...   ...            ...
002   2013-01-31     200
...   ...            ...
And I would like to have something like this:
isin  extraction_dt  yield
001   2013-01-01     100
001   2013-01-02     100
001   2013-01-03     100
...   ...            ...
001   2013-02-01     110
...   ...            ...
I tried the following code but it does not work. I get the error message AnalysisException: Could not resolve table reference: 'cte'. How would you convert monthly to daily data?
with cte as
(select isin, extraction_dt, yield
from datashop
union all
select isin, extraction_dt, dateadd(d, 1, extraction_dt) AS date_dt, yield
from cte
where datediff(m,date_dt,dateadd(d, 1, date_dt))=0
)
select isin, date_dt,
1.0*isin / count(*) over (partition by isin, date_dt) AS daily_yield
from cte
order by 1,2
I can suggest an easy solution:
generate a date series
match it with your data so it gets repeated
Here is the SQL you can use for Impala:
select isin, extraction_dt, a.dt AS date_dt, yield
from
datashop d,
(
select now() - INTERVAL (a.a + (10 * b.a) + (100 * c.a) + (1000 * d.a) ) DAY as dt
from (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as a
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as b
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as c
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as d
) a
WHERE
from_timestamp(a.dt,'yyyy/MM') =from_timestamp(d.extraction_dt,'yyyy/MM')
order by 1,2,3
The subquery aliased a generates a series of dates.
The WHERE clause restricts them to the month of extraction_dt, so you get every day of that month.
The ORDER BY gives a readable output.
Your WITH clause has a recursive (self-referencing) query. In most SQL dialects, this requires using WITH RECURSIVE, not plain WITH. According to the Impala SQL reference, Impala does not support recursive common table expressions:
The Impala WITH clause does not support recursive queries in the
WITH, which is supported in some other database systems.
In other words, you cannot do this in Impala.
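Since the database cannot recurse, it may help to see the expansion logic outside SQL. A minimal Python sketch (row layout assumed from the question) that repeats each monthly observation for every day of its month:

```python
import calendar
from datetime import date

def monthly_to_daily(rows):
    # Expand (isin, extraction_dt, yield) monthly rows into one row per day
    # of each extraction month, carrying the monthly yield forward.
    daily = []
    for isin, dt, y in rows:
        days_in_month = calendar.monthrange(dt.year, dt.month)[1]
        for day in range(1, days_in_month + 1):
            daily.append((isin, date(dt.year, dt.month, day), y))
    return daily

rows = [("001", date(2013, 1, 31), 100), ("001", date(2013, 2, 28), 110)]
daily = monthly_to_daily(rows)
```

The SQL date-series join above achieves the same thing set-wise: every generated day is matched to the monthly row whose year-month it falls in.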

How to select a set of records if an ID is present in one of them

Here is the table where ORGID/USERID makes unique combination:
ORGID USERID
1 1
1 2
1 3
1 4
2 1
2 5
2 6
2 7
3 9
3 10
3 11
I need to select all records (organizations and users) wherever USERID 1 is present. USERID 1 is present in ORGID 1 and 2, so select all users for those organizations, including user 1 itself, i.e.
ORGID USERID
1 1
1 2
1 3
1 4
2 1
2 5
2 6
2 7
Is it possible to do it with one SQL query rather than SELECT *.. WHERE USERID IN (SELECT...
You could use exists:
select *
from mytable t
where exists (select 1 from mytable t1 where t1.orgid = t.orgid and t1.userid = 1)
Another option is window functions. In Postgres:
select *
from (
select t.*,
bool_or(userid = 1) over(partition by orgid) has_user_1
from mytable t
) t
where has_user_1
Or a more generic approach, that uses portable expressions:
select *
from (
select t.*,
max(case when userid = 1 then 1 else 0 end) over(partition by orgid) has_user_1
from mytable t
) t
where has_user_1 = 1
Yes, you can do it with a single select statement - no in or exists conditions, no analytic or aggregate functions in a subquery, etc. Why you want to do it that way is not clear; in any case, it is possible that the solution below is also more efficient than the alternatives. You will have to test on your real-life data to see if that is true.
The solution below has two potential disadvantages: it only works in Oracle (it uses a proprietary extension of SQL, the match_recognize clause); and it only works in Oracle 12.1 or higher.
with
my_table(orgid, userid) as (
select 1, 1 from dual union all
select 1, 2 from dual union all
select 1, 3 from dual union all
select 1, 4 from dual union all
select 2, 1 from dual union all
select 2, 5 from dual union all
select 2, 6 from dual union all
select 2, 7 from dual union all
select 3, 9 from dual union all
select 3, 10 from dual union all
select 3, 11 from dual
)
-- End of SIMULATED data (for testing), not part of the solution.
-- In real life you don't need the WITH clause; reference your actual table.
select *
from my_table
match_recognize(
partition by orgid
all rows per match
pattern (x* a x*)
define a as userid = 1
);
Output:
ORGID USERID
---------- ----------
1 1
1 2
1 3
1 4
2 1
2 5
2 7
2 6
You can use exists:
select ou.*
from orguser ou
where exists (select 1
from orguser ou2
where ou2.orgid = ou.orgid and ou2.userid = 1
);
Apart from EXISTS and window functions, you can use IN as follows:
select *
from your_table t
where t.orgid in (select t1.orgid from your_table t1 where t1.userid = 1)
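All three answers implement the same two-pass idea; a quick Python sketch of it, using the sample data, can help sanity-check the expected result:

```python
rows = [(1, 1), (1, 2), (1, 3), (1, 4),
        (2, 1), (2, 5), (2, 6), (2, 7),
        (3, 9), (3, 10), (3, 11)]

# Pass 1: which orgs contain user 1 (the EXISTS / IN subquery).
orgs_with_user_1 = {org for org, user in rows if user == 1}

# Pass 2: keep every row belonging to one of those orgs.
result = [row for row in rows if row[0] in orgs_with_user_1]
```

Organization 3 has no user 1, so all of its rows drop out.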

How to get sums of values without each element using BigQuery?

I have a table in BigQuery. I want to compute the sum of the values in a column, removing each element in turn by id. As output I want to see the removed id and the sum of the other values.
WITH t as (SELECT 1 AS id, "LY" as code, 34 AS value
UNION ALL
SELECT 2, "LY", 45
UNION ALL
SELECT 3, "LY", 23
UNION ALL
SELECT 4, "LY", 5
UNION ALL
SELECT 5, "LY", 54
UNION ALL
SELECT 6, "LY", 78)
SELECT lv id, SUM(lag) sum_wo_id
FROM
(SELECT *, FIRST_VALUE(id) OVER (ORDER BY id DESC) lv, LAG(value) OVER (Order by id) lag from t)
GROUP BY lv
In the example above I can see the sum of values without id = 6. How can I modify this query to get the sums over the other id subsets, like 12346, 12356, 12456, 13456, 23456, and see which id was removed?
Below is for BigQuery Standard SQL.
Assuming ids are distinct, you can simply use the following:
#standardSQL
SELECT id AS removed_id,
SUM(value) OVER() - value AS sum_wo_id
FROM t
If applied to the sample data from your question, the output is
Row removed_id sum_wo_id
1 1 205
2 2 194
3 3 216
4 4 234
5 5 185
6 6 161
In case id is not unique, you can first group by id, as in the example below:
#standardSQL
SELECT id AS removed_id,
SUM(value) OVER() - value AS sum_wo_id
FROM (
SELECT id, SUM(value) AS value
FROM t
GROUP BY id
)
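The trick in both queries is the same: compute one grand total, then subtract each row's own value. In Python terms, with the sample values from the question:

```python
rows = [(1, 34), (2, 45), (3, 23), (4, 5), (5, 54), (6, 78)]

total = sum(v for _, v in rows)                # SUM(value) OVER() -- same total for every row
sums_wo_id = {i: total - v for i, v in rows}   # remove each id's value in turn
```

This is O(n) over the table, versus recomputing a sum per excluded id.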

How to bucketize a column using another table in Bigquery SQL?

I have a column grams in the table info which can be any positive integer.
Also, I have a table map which has two columns price and grams, in which grams takes some discrete values (let's say ~50) in ascending order.
I want to add a column named cost to table info by fetching price from table map for the smallest map.grams such that info.grams <= map.grams. In other words, I want to bucketize info.grams based on the discrete values of map.grams and fetch the corresponding price.
What I know?
I can use CASE WHEN to bucketize info.grams like below and then join the two tables and fetch price. But since the discrete values are not limited, I want to find a clean way of doing it without making my query a mess.
CASE WHEN grams<=1 THEN 1
WHEN grams<=5 THEN 5
WHEN grams<=10 THEN 10
WHEN grams<=20 THEN 20
WHEN grams<=30 THEN 30
...
Below is for BigQuery Standard SQL.
You can use the RANGE_BUCKET function for this:
#standardSQL
SELECT i.*,
price_map[SAFE_OFFSET(RANGE_BUCKET(grams, grams_map))] price
FROM `project.dataset.info` i,
(
SELECT AS STRUCT
ARRAY_AGG(grams + 1 ORDER BY grams) AS grams_map,
ARRAY_AGG(price ORDER BY grams) AS price_map
FROM `project.dataset.map`
)
You can test and play with the above using sample data, as in the example below:
#standardSQL
WITH `project.dataset.info` AS (
SELECT 1 AS grams UNION ALL
SELECT 3 UNION ALL
SELECT 5 UNION ALL
SELECT 7 UNION ALL
SELECT 10 UNION ALL
SELECT 13 UNION ALL
SELECT 15
), `project.dataset.map` AS (
SELECT 5 AS grams, 0.99 price UNION ALL
SELECT 10, 1.99 UNION ALL
SELECT 15, 2.99
)
SELECT i.*,
price_map[SAFE_OFFSET(RANGE_BUCKET(grams, grams_map))] price
FROM `project.dataset.info` i,
(
SELECT AS STRUCT
ARRAY_AGG(grams + 1 ORDER BY grams) AS grams_map,
ARRAY_AGG(price ORDER BY grams) AS price_map
FROM `project.dataset.map`
)
with the result
Row grams price
1 1 0.99
2 3 0.99
3 5 0.99
4 7 1.99
5 10 1.99
6 13 2.99
7 15 2.99
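RANGE_BUCKET is essentially a binary search over the sorted boundary array, which is why the query aggregates map.grams into grams_map. The same lookup can be sketched in Python with bisect (boundaries and prices taken from the sample data above):

```python
from bisect import bisect_left

gram_bounds = [5, 10, 15]       # sorted map.grams
prices = [0.99, 1.99, 2.99]     # price for each bucket

def cost(grams):
    # Price for the smallest map.grams with grams <= map.grams;
    # None when grams exceeds the largest boundary (like SAFE_OFFSET).
    i = bisect_left(gram_bounds, grams)
    return prices[i] if i < len(prices) else None
```

bisect_left finds the first boundary >= grams, matching the "smallest map.grams such that info.grams <= map.grams" rule.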
Oh, it would be nice to use standard SQL for this, with lead() and join:
select i.*, m.*
from info i left join
(select m.*, lead(grams) over (order by grams) as next_grams
from map m
) m
on i.grams >= m.grams and
(i.grams < next_grams or next_grams is null);
However, one limitation of BigQuery is that it does not support non-equi outer joins. So, you can convert the map table to an array and use unnest() to do what you want:
with info as (
select 1 as grams union all select 5 union all select 10 union all select 15
),
map as (
select 5 as grams, 'a' as bucket union all
select 10 as grams, 'b' as bucket union all
select 15 as grams, 'c' as bucket
)
select i.*,
(select map
from unnest(m.map) map
where map.grams >= i.grams
order by map.grams
limit 1
) m
from info i cross join
(select array_agg(map order by grams) as map
from map
) m;
In addition to Gordon's and Mikhail's answers, I would like to suggest a third alternative using FIRST_VALUE(), a built-in window function in BigQuery.
If we LEFT JOIN the info and map tables using grams as the key, we get a null price for every gram value that is not present in the map table. We can then fill each of those nulls with the next available price using FIRST_VALUE(). According to the documentation:
Returns the value of the value_expression for the first row in the
current window frame.
Thus, for each row where price is null, we select the first non-null value between the current row and the end of the window frame. The syntax is as follows:
#sample data info
WITH info AS (
SELECT 1 AS grams UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7 UNION ALL
SELECT 8 UNION ALL
SELECT 9 UNION ALL
SELECT 10 UNION ALL
SELECT 11 UNION ALL
SELECT 13 UNION ALL
SELECT 15 UNION ALL
SELECT 16 UNION ALL
SELECT 18 UNION ALL
SELECT 19 UNION ALL
SELECT 20
),
#sample data map
map AS (
SELECT 5 AS grams, 1.99 price UNION ALL
SELECT 10, 2.99 UNION ALL
SELECT 15, 3.99 UNION ALL
SELECT 20, 4.99
),
#using left join, so there are rows with price = null
t AS (
SELECT i.grams, price
FROM info i LEFT JOIN map USING(grams)
ORDER BY grams
)
SELECT grams, FIRST_VALUE(price IGNORE NULLS) OVER (ORDER BY grams ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS price
FROM t ORDER BY grams
and the output,
Row grams price
1 1 1.99
2 2 1.99
3 3 1.99
4 4 1.99
5 5 1.99
6 6 2.99
7 7 2.99
8 8 2.99
9 9 2.99
10 10 2.99
11 11 3.99
12 13 3.99
13 15 3.99
14 16 4.99
15 18 4.99
16 19 4.99
17 20 4.99
The last SELECT statement performs the action we described above. In addition, I would like to point out that:
UNBOUNDED FOLLOWING: The window frame ends at the end of the
partition.
And
CURRENT ROW: The window frame starts at the current row.
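What this frame implements is a backward fill: each null price takes the next non-null one. A compact Python sketch of that fill, scanning from the end:

```python
def fill_with_next(values):
    # Backward fill: replace each None with the next non-None value, like
    # FIRST_VALUE(price IGNORE NULLS) OVER (... CURRENT ROW AND UNBOUNDED FOLLOWING).
    filled, nxt = [], None
    for v in reversed(values):
        if v is not None:
            nxt = v
        filled.append(nxt)
    return filled[::-1]
```

For example, fill_with_next([None, None, 1.99, None, 2.99]) fills the leading gaps with 1.99 and the later gap with 2.99, mirroring the price column in the output table.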

How to unnest/explode/flatten the comma separated value in a column in Amazon Redshift?

I am trying to generate a new row for each value in col2. As the value is in string format, I need to wrap it in double quotes before using any Redshift json function on it.
Input:
col1(int) col2(varchar)
1 ab,cd,ef
2 gh
3 jk,lm,kn,ut,zx
Output:
col1(int) col2(varchar)
1 ab
1 cd
1 ef
2 gh
3 jk
3 lm
3 kn
3 ut
3 zx
with NS AS (
select 1 as n union all
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10
)
select
B.col1,
TRIM(SPLIT_PART(B.col2, ',', NS.n)) AS col2
from NS
inner join table B ON NS.n <= REGEXP_COUNT(B.col2, ',') + 1
Here, NS (number sequence) is a CTE that returns the numbers 1 to N. We have to make sure the maximum number is at least the maximum count of comma-separated values in any row, so you may need to add more numbers to the list depending on your data.
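The number-sequence join is emulating a split-to-rows operation; the intended result can be sketched in Python to make the target shape clear:

```python
rows = [(1, "ab,cd,ef"), (2, "gh"), (3, "jk,lm,kn,ut,zx")]

# One output row per comma-separated token, keeping col1 alongside it --
# the shape the NS join with SPLIT_PART produces in Redshift.
exploded = [(col1, token.strip())
            for col1, col2 in rows
            for token in col2.split(",")]
```

Each input row contributes as many output rows as it has tokens, which is exactly what the join condition NS.n <= REGEXP_COUNT(col2, ',') + 1 enforces.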