Reshape from wide to long in BigQuery (Standard SQL)

Unfortunately, reshaping in BigQuery is not as easy as in R, and I can't export my data for this project.
Here is input
date country A B C D
20170928 CH 3000.3 121 13 3200
20170929 CH 2800.31 137 23 1614.31
Expected output
date country Metric Value
20170928 CH A 3000.3
20170928 CH B 121
20170928 CH C 13
20170928 CH D 3200
20170929 CH A 2800.31
20170929 CH B 137
20170929 CH C 23
20170929 CH D 1614.31
Also my table has many more columns and rows (but I assume a lot of manual work will be required)

Below is for BigQuery Standard SQL and does not require repeating SELECTs that depend on the number of columns. It will pick up as many columns as you have and transform them into metric/value pairs.
#standardSQL
SELECT DATE, country,
metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT DATE, country,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM `project.dataset.yourtable` t,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')
You can test / play with the above using dummy data as in your question:
#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
metric, SAFE_CAST(value AS FLOAT64) value
FROM (
SELECT DATE, country,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
FROM `project.dataset.yourtable` t,
UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')
The result is as expected:
DATE country metric value
20170928 CH A 3000.3
20170928 CH B 121.0
20170928 CH C 13.0
20170928 CH D 3200.0
20170929 CH A 2800.31
20170929 CH B 137.0
20170929 CH C 23.0
20170929 CH D 1614.31
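To see what the TO_JSON_STRING trick is doing, the same unpivot can be sketched in plain Python: serialize each row as a dict and treat every key that is not an identifier column as a metric. This is only an illustrative sketch of the idea, not BigQuery code.

```python
def melt_rows(rows, id_cols=("date", "country")):
    """Unpivot: each non-id column becomes a (metric, value) pair,
    mirroring the TO_JSON_STRING + SPLIT approach in the query above."""
    out = []
    for row in rows:
        ids = {k: row[k] for k in id_cols}
        for key, val in row.items():
            if key not in id_cols:
                out.append({**ids, "metric": key, "value": float(val)})
    return out

rows = [
    {"date": "20170928", "country": "CH", "A": 3000.3, "B": 121, "C": 13, "D": 3200},
    {"date": "20170929", "country": "CH", "A": 2800.31, "B": 137, "C": 23, "D": 1614.31},
]
melted = melt_rows(rows)
```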

You need a UNION, which is denoted using commas in BigQuery Legacy SQL:
SELECT date, country, Metric, Value
FROM (
SELECT date, country, 'A' as Metric, A as Value FROM your_table
), (
SELECT date, country, 'B' as Metric, B as Value FROM your_table
), (
SELECT date, country, 'C' as Metric, C as Value FROM your_table
) , (
SELECT date, country, 'D' as Metric, D as Value FROM your_table
)

Most answers that I managed to find required specifying the name of EVERY column to be melted. This is not tractable when I have hundreds to thousands of columns in the table. Here is an answer that works for an arbitrarily wide table.
It utilizes dynamic SQL and automatically extracts the column names from the data schema, collates a command string, and then evaluates that string. This is intended to mimic the behavior of Python pandas.melt() / R reshape2::melt().
I intentionally did not create user defined functions because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do that.
Input:
id0 id1 _2020_05_27 _2020_05_28
1 1 11 12
1 2 13 14
2 1 15 16
2 2 17 18
Output:
id0 id1 date value
1 2 _2020_05_27 13
1 2 _2020_05_28 14
2 2 _2020_05_27 17
2 2 _2020_05_28 18
1 1 _2020_05_27 11
1 1 _2020_05_28 12
2 1 _2020_05_27 15
2 1 _2020_05_28 16
#standardSQL
-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: lunahuang#google.com
-- run this script with Google BigQuery Web UI in the Cloud Console
-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments
DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
WITH original AS (
-- query to retrieve the original table
%s
),
nested AS (
SELECT
[
-- sub command to be automatically generated
%s
] as s,
-- equivalent to id_vars in pandas.melt()
%s,
FROM original
)
SELECT
-- equivalent to id_vars in pandas.melt()
%s,
-- equivalent to var_name in pandas.melt()
s.key AS %s,
-- equivalent to value_name in pandas.melt()
s.value AS %s,
FROM nested
CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
WITH
columns AS (
-- query to retrieve the column names
-- equivalent to value_vars in pandas.melt()
-- the resulting table should have only one column
-- with the name: column_name
%s
),
scs AS (
SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
FROM columns
)
SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");
-- -- -- EXAMPLE BELOW -- -- --
-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
id0 INT64,
id1 INT64,
_2020_05_27 INT64,
_2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);
-- MELTING STARTS --
-- execute these two command to melt the table
-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
-- please do not change this argument
subcmd,
-- query to retrieve the column names
-- equivalent to value_vars in pandas.melt()
-- the resulting table should have only one column
-- with the name: column_name
"""
SELECT column_name
FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
"""
) INTO subcmd;
-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
-- please do not change this argument
cmd,
-- query to retrieve the original table
"""
SELECT *
FROM `tmp.example`
""",
-- please do not change this argument
subcmd,
-- equivalent to id_vars in pandas.melt()
-- !!please type these twice!!
"id0, id1", "id0, id1",
-- equivalent to var_name in pandas.melt()
"date",
-- equivalent to value_name in pandas.melt()
"value"
);
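For comparison, the script above mimics pandas.melt(); assuming pandas is available, the same reshape of the example table is a one-liner (column names below are the ones from the example):

```python
import pandas as pd

# the example table from above
df = pd.DataFrame({
    "id0": [1, 1, 2, 2],
    "id1": [1, 2, 1, 2],
    "_2020_05_27": [11, 13, 15, 17],
    "_2020_05_28": [12, 14, 16, 18],
})

# id_vars / var_name / value_name correspond to the FORMAT arguments above
melted = pd.melt(df, id_vars=["id0", "id1"], var_name="date", value_name="value")
```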

Related

Oracle SQL: Merging multiple columns into 1 with conditions

I am new to SQL and don't really have a lot of experience. I need help with this: I have Table A and I want to write a SQL query to generate the Result Table. Any help would be greatly appreciated! Thanks!
Table A
Name     Capacity A   Capacity B   Capacity C
Plant 1  10                        20
Plant 2               10
Result Table
Name     Type   Capacity
Plant 1  A,C    10,20
Plant 2  B      10
I know the LISTAGG function might be able to combine a few columns into one, but is there any way for me to generate the additional column 'Type' where it's smart enough to know which column I am taking my value from? Preferably without creating any additional views/tables.
Use NVL2 (or CASE) and concatenate the columns and trim any excess trailing commas:
SELECT Name,
RTRIM(
NVL2(CapacityA,'A,',NULL)
||NVL2(CapacityB,'B,',NULL)
||NVL2(CapacityC,'C',NULL),
','
) AS type,
RTRIM(
NVL2(CapacityA,CapacityA||',',NULL)
||NVL2(CapacityB,CapacityB||',',NULL)
||NVL2(CapacityC,CapacityC,NULL),
','
) AS capacity
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name (name, capacitya, capacityb, capacityc) AS
SELECT 'Plant1', 10, NULL, 20 FROM DUAL UNION ALL
SELECT 'Plant2', NULL, 10, NULL FROM DUAL;
Outputs:
NAME    TYPE   CAPACITY
Plant1  A,C    10,20
Plant2  B      10
db<>fiddle here
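The NVL2 logic can be checked outside the database; here is a small Python sketch of the same conditional concatenation plus trailing-comma trim (NVL2(x, a, b) returns a when x is not NULL, else b). This is an illustration of the technique, not Oracle code.

```python
def nvl2(x, if_not_null, if_null=None):
    # Oracle NVL2: second argument when x is not NULL, third otherwise
    return if_not_null if x is not None else if_null

def row_type_and_capacity(a, b, c):
    # concatenate the conditional pieces, then RTRIM the trailing comma
    type_ = ((nvl2(a, "A,") or "") + (nvl2(b, "B,") or "") + (nvl2(c, "C") or "")).rstrip(",")
    cap = ((nvl2(a, f"{a},") or "") + (nvl2(b, f"{b},") or "") + (nvl2(c, str(c)) or "")).rstrip(",")
    return type_, cap
```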
Here's one option:
sample data in lines #1 - 4
temp CTE simply - conditionally - concatenates types and capacities
final query (line #17)
removes double separators (commas) (regexp)
removes superfluous leading/trailing commas (trim)
SQL> with test (name, capa, capb, capc) as
2 (select 'Plant1', 10, null, 20 from dual union all
3 select 'Plant2', null, 10, null from dual
4 ),
5 temp as
6 (select name,
7 --
8 case when capa is not null then 'A' end ||','||
9 case when capb is not null then 'B' end ||','||
10 case when capc is not null then 'C' end as type,
11 --
12 case when capa is not null then capa end ||','||
13 case when capb is not null then capb end ||','||
14 case when capc is not null then capc end as capacity
15 from test
16 )
17 select name,
18 trim(both ',' from regexp_replace(type , ',+', ',')) as type,
19 trim(both ',' from regexp_replace(capacity, ',+', ',')) as capacity
20 from temp;
NAME TYPE CAPACITY
------ ---------- ----------
Plant1 A,C 10,20
Plant2 B 10
SQL>
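The cleanup step in the final query (collapse runs of commas, then trim leading/trailing ones) maps directly onto a regular-expression replace; a quick Python check of that idea:

```python
import re

def squeeze(s):
    # collapse runs of commas, then strip leading/trailing commas,
    # like regexp_replace(s, ',+', ',') wrapped in trim(both ',' from ...)
    return re.sub(",+", ",", s).strip(",")
```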

Random Digit Data Masking on Id_number in Vertica

I need to randomly mask 4 digits of id_number between the 6th and 12th digits.
Example:
555444888777 --> 55544x8x8xx7
I wrote the following code, but it masks randomly within every 2 digits. Is there any solution in Vertica for randomly masking within a given interval, with a given number of digits to mask?
SELECT OVERLAYB(OVERLAYB(OVERLAYB(OVERLAYB('555444888777', 'x', 5+RANDOMINT(2)),'x', 7+RANDOMINT(2)),'x', 9+RANDOMINT(2)),'x',11+RANDOMINT(2));
If you really want to randomly replace a digit with an 'x' at, randomly, the first or second digit after positions 5, 7, 9 and 11, as you coded it, then create a function as I did, so you don't need to re-code the nested OVERLAYB() calls every time.
You can replace 5,7,9 and 11 with a RANDOMINT() call, too, if you want more variance.
If, however, you want to vary the number of replacements (from 4 times to another number of times), you will have to re-write the function for a different number of replacements. Or go through the trouble of writing a UDx (User Defined Extension), in C++, Java, R or Python.
Check the Vertica documentation for that; start here:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATEFUNCTIONUDF.htm?zoom_highlight=create%20function
Having said that, here goes the function with your same functionality, and its test:
CREATE OR REPLACE FUNCTION maskrand(
s VARCHAR(256)
)
RETURN VARCHAR(256)
AS
BEGIN
RETURN (
OVERLAYB(
OVERLAYB(
OVERLAYB(
OVERLAYB(
s
, 'x'
, 5+RANDOMINT(2)
)
,'x'
, 7+RANDOMINT(2)
)
,'x'
, 9+RANDOMINT(2)
)
,'x'
, 11+RANDOMINT(2)
)
);
END;
-- test ...
WITH indata(s) AS (
SELECT '555444888777'
UNION ALL SELECT '666333444888'
)
SELECT
s, maskrand(s) AS masked
FROM indata;
-- out s | masked
-- out --------------+--------------
-- out 555444888777 | 5554x48x8xx7
-- out 666333444888 | 6663x3x4x88x
This is a bit crazy, but seems to work:
with my_range as (
select 6 as n
union all select 7 as n
union all select 8 as n
union all select 9 as n
union all select 10 as n
union all select 11 as n
union all select 12 as n),
random_positions as (
select n
from my_range
order by random()
limit 4),
quartiles as (
select n,
ntile(4) over(order by n) as quartile
from random_positions),
sorted_positions as (
select max(case when quartile = 1 then n end) as random1,
max(case when quartile = 2 then n end) as random2,
max(case when quartile = 3 then n end) as random3,
max(case when quartile = 4 then n end) as random4
from quartiles)
select '555444888777' as string_original,
overlay(
overlay(
overlay(
overlay('555444888777' placing 'x' from random1)
placing 'x' from random2)
placing 'x' from random3)
placing 'x' from random4)
as string_masked
from sorted_positions;
Sample executions:
$ vsql -f tmp.sql
string_original | string_masked
-----------------+---------------
555444888777 | 55544xx8xx77
(1 row)
$ vsql -f tmp.sql
string_original | string_masked
-----------------+---------------
555444888777 | 55544xx8x7x7
(1 row)
$ vsql -f tmp.sql
string_original | string_masked
-----------------+---------------
555444888777 | 55544x8x8x7x
(1 row)
Explanation:
my_range builds a recordset with numbers from 6 to 12
random_positions sorts the recordset randomly, only returning 4 positions
quartiles assigns the quartile to each random position so sorted_positions can pivot the results into a single record
finally, the last select applies the OVERLAY function four times at the four random positions
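The position-picking logic of this answer (choose 4 distinct positions from 6 to 12, then overlay an 'x' at each) can be sketched in a few lines of Python, where random.sample plays the role of the ORDER BY random() LIMIT 4 step. Purely an illustration of the idea, not Vertica SQL:

```python
import random

def mask_random(s, lo=6, hi=12, k=4):
    # pick k distinct 1-based positions in [lo, hi], replace each with 'x'
    chars = list(s)
    for pos in random.sample(range(lo, hi + 1), k):
        chars[pos - 1] = "x"
    return "".join(chars)
```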

Return five rows of random DNA instead of just one

This is the code I have to create a string of DNA:
prepare dna_length(int) as
with t1 as (
select chr(65) as s
union select chr(67)
union select chr(71)
union select chr(84) )
, t2 as ( select s, row_number() over() as rn from t1)
, t3 as ( select generate_series(1,$1) as i, round(random() * 4 + 0.5) as rn )
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))
select array_to_string(array(select s from t4),'') as dna;
execute dna_length(20);
I am trying to figure out how to re-write this to give a table of 5 rows of strings of DNA of length 20 each, instead of just one row. This is for PostgreSQL.
I tried:
CREATE TABLE dna_table(g int, dna text);
INSERT INTO dna_table (1, execute dna_length(20));
But this does not seem to work. I am an absolute beginner. How to do this properly?
PREPARE creates a prepared statement that can be used "as is". If your prepared statement returns one string, then you can only get one string. You can't use it inside other operations such as INSERT.
In your case you may create a function:
create or replace function dna_length(int) returns text as
$$
with t1 as (
select chr(65) as s
union
select chr(67)
union
select chr(71)
union
select chr(84))
, t2 as (select s,
row_number() over () as rn
from t1)
, t3 as (select generate_series(1, $1) as i,
round(random() * 4 + 0.5) as rn)
, t4 as (select t2.s
from t2
join t3 on (t2.rn = t3.rn))
select array_to_string(array(select s from t4), '') as dna
$$ language sql;
And use it in a way like this:
insert into dna_table(g, dna) select generate_series(1,5), dna_length(20)
From the official doc:
PREPARE creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
About functions.
This can be much simpler and faster:
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1,100) g -- 100 = 5 rows * 20 nucleotides
GROUP BY g%5;
random() produces a random value in the range 0.0 <= x < 1.0. Multiply by 4 and take the mathematical ceiling with ceil() (cheaper than round()), and you get a random distribution of the numbers 1-4. Convert those to A/C/T/G, and aggregate with GROUP BY g%5 - % being the modulo operator.
About string_agg():
Concatenate multiple result rows of one column into one, group by another column
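The ceil(random() * 4) mapping is easy to reproduce outside the database; here is a Python sketch of the same generate-then-group idea (int(r*4)+1 stands in for ceil(), and list slicing stands in for the grouping). Illustrative only, not PostgreSQL code:

```python
import random

def dna_rows(rows, length):
    # int(r*4)+1 gives a uniform 1..4, matching ceil(random()*4) for r > 0
    bases = {1: "A", 2: "C", 3: "T", 4: "G"}
    return ["".join(bases[int(random.random() * 4) + 1] for _ in range(length))
            for _ in range(rows)]
```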
As prepared statement, taking
$1 ... the number of rows
$2 ... the number of nucleotides per row
PREPARE dna_length(int, int) AS
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1, $1 * $2) g
GROUP BY g%$1;
Call:
EXECUTE dna_length(5,20);
Result:
| dna |
| :------------------- |
| ATCTTCGACACGTCGGTACC |
| GTGGCTGCAGATGAACAGAG |
| ACAGCTTAAAACACTAAGCA |
| TCCGGACCTCTCGACCTTGA |
| CGTGCGGAGTACCCTAATTA |
db<>fiddle here
If you need it a lot, consider a function instead. See:
What is the difference between a prepared statement and a SQL or PL/pgSQL function, in terms of their purposes?

How to build "Star Rating" report in BigQuery (or sparklines, or color gradients)

Suppose I have the following sample input:
WITH Ratings AS (
(SELECT 'A' name, 2 score) UNION ALL
(SELECT 'B' name, 0 score) UNION ALL
(SELECT 'C' name, 5 score) UNION ALL
(SELECT 'D' name, 1 score))
Where score is a number between 0 and 5.
How can I produce a report showing names and the corresponding number of stars?
We can build star rating as a string using two Unicode characters:
★ - Unicode code point 9733
☆ - Unicode code point 9734
We can use the CODE_POINTS_TO_STRING function to build the stars, and the REPEAT function to produce the right number of stars.
Combined together, the solution for the sample input is:
WITH Ratings AS (
(SELECT 'A' name, 2 score) UNION ALL
(SELECT 'B' name, 0 score) UNION ALL
(SELECT 'C' name, 5 score) UNION ALL
(SELECT 'D' name, 1 score))
SELECT
name,
CONCAT(
REPEAT(CODE_POINTS_TO_STRING([9733]), score),
REPEAT(CODE_POINTS_TO_STRING([9734]), 5-score)) score
FROM Ratings
It will produce the following result:
name score
A ★★☆☆☆
B ☆☆☆☆☆
C ★★★★★
D ★☆☆☆☆
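The same star-building can be checked in Python, where chr() plays the role of CODE_POINTS_TO_STRING and string repetition plays the role of REPEAT:

```python
def stars(score, max_score=5):
    # chr(9733) = '★' (filled), chr(9734) = '☆' (empty)
    return chr(9733) * score + chr(9734) * (max_score - score)
```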
My entry does a color gradient, because sparklines only look good with certain fonts - and that's not a font that the BigQuery web UI uses.
During a day, when is Stack Overflow the most active per tag:
#standardSQL
CREATE TEMP FUNCTION barchart(v ARRAY<FLOAT64>, mm STRUCT<min FLOAT64, max FLOAT64>) AS ((
SELECT STRING_AGG(SUBSTR('🏿🏾🏽🏼🏻', 1+CAST(ROUND(y) AS INT64), 1), '')
FROM (SELECT IFNULL(SAFE_DIVIDE((e-mm.min),(mm.max-mm.min))*4, 0) y FROM UNNEST(v) e)));
CREATE TEMP FUNCTION vbar(v ARRAY<FLOAT64>) AS (
barchart(v, (SELECT AS STRUCT MIN(a), MAX(a) FROM UNNEST(v) a))
);
WITH top_tags AS (
(SELECT x.value FROM (SELECT APPROX_TOP_COUNT(tag, 24) x FROM `bigquery-public-data.stackoverflow.posts_questions`, UNNEST(SPLIT(tags,'|')) tag WHERE EXTRACT(YEAR FROM creation_date)>=2016), UNNEST(x) x)
)
SELECT tag, vbar(ARRAY_AGG(1.0*hhh.count ORDER BY hhh.value)) gradient, SUM(hhh.count) c
FROM (
SELECT tag, APPROX_TOP_COUNT(EXTRACT(HOUR FROM creation_date), 24) h_h
FROM `bigquery-public-data.stackoverflow.posts_questions`, UNNEST(SPLIT(tags,'|')) tag
WHERE tag IN (SELECT * FROM top_tags) AND EXTRACT(YEAR FROM creation_date)>=2016
GROUP BY 1
), UNNEST(h_h) hhh
GROUP BY tag
ORDER BY STRPOS(gradient, '🏼')
Row gradient c tag
1 🏿🏿🏿🏿🏾🏽🏼🏼🏼🏻🏻🏻🏻🏼🏼🏼🏼🏽🏽🏽🏽🏾🏾🏿 317538 android
2 🏿🏿🏿🏿🏾🏽🏼🏼🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏽🏽🏽🏾🏾🏿 59445 asp.net
3 🏿🏿🏿🏿🏾🏽🏼🏼🏼🏻🏻🏻🏼🏼🏼🏼🏽🏽🏽🏽🏾🏾🏾🏿 159134 ios
4 🏿🏿🏿🏿🏾🏽🏼🏼🏼🏻🏻🏻🏻🏻🏻🏼🏼🏽🏽🏽🏽🏾🏾🏿 111988 angularjs
5 🏿🏿🏿🏿🏾🏾🏽🏼🏼🏻🏻🏻🏻🏻🏻🏼🏼🏼🏽🏽🏽🏽🏾🏿 212843 jquery
6 🏿🏿🏿🏾🏾🏾🏽🏼🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏼🏽🏽🏽🏾🏿 138143 mysql
7 🏿🏿🏿🏿🏿🏾🏽🏼🏼🏻🏻🏻🏼🏻🏻🏻🏻🏼🏼🏼🏼🏽🏾🏾 107586 swift
8 🏿🏿🏿🏿🏾🏾🏽🏼🏼🏻🏻🏻🏼🏻🏼🏼🏼🏽🏽🏽🏽🏾🏾🏿 318294 php
9 🏿🏿🏿🏿🏾🏾🏽🏼🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏼🏽🏽🏽🏾🏾 84723 json
10 🏿🏿🏿🏿🏿🏾🏽🏼🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏼🏼🏽🏽🏾🏾 233100 html
11 🏿🏿🏿🏿🏿🏾🏽🏼🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏼🏽🏽🏽🏾🏿 390245 java
12 🏿🏿🏿🏿🏿🏾🏽🏽🏼🏻🏻🏼🏻🏻🏻🏻🏼🏽🏽🏽🏽🏽🏾🏿 83787 angular
13 🏿🏿🏿🏿🏾🏾🏽🏽🏼🏼🏼🏼🏼🏻🏻🏻🏼🏼🏽🏽🏽🏽🏾🏿 70150 sql-server
14 🏿🏿🏿🏿🏿🏾🏽🏽🏼🏻🏻🏻🏻🏻🏻🏻🏼🏼🏼🏼🏽🏽🏾🏾 534663 javascript
15 🏿🏿🏿🏿🏿🏾🏽🏽🏼🏻🏻🏼🏼🏻🏻🏻🏼🏼🏽🏽🏽🏾🏾🏿 291541 c#
16 🏿🏿🏿🏿🏿🏿🏾🏾🏽🏼🏼🏽🏼🏼🏻🏻🏻🏻🏻🏼🏼🏽🏽🏾 65668 c
17 🏿🏿🏿🏿🏿🏾🏽🏽🏽🏼🏼🏼🏼🏻🏻🏻🏼🏼🏼🏼🏽🏽🏾🏿 111792 sql
18 🏿🏿🏿🏿🏿🏾🏾🏽🏽🏼🏻🏼🏼🏻🏻🏻🏻🏼🏼🏼🏼🏽🏾🏾 158999 css
19 🏿🏿🏿🏿🏿🏿🏾🏽🏽🏼🏼🏼🏼🏻🏻🏻🏻🏼🏼🏼🏼🏽🏽🏾 88146 arrays
20 🏿🏿🏿🏿🏿🏿🏾🏾🏽🏼🏼🏽🏼🏼🏻🏻🏻🏼🏼🏼🏼🏼🏽🏾 61840 ruby-on-rails
21 🏿🏿🏿🏿🏿🏿🏾🏾🏽🏼🏼🏼🏼🏻🏻🏻🏼🏼🏼🏼🏼🏽🏾🏾 136265 c++
22 🏿🏿🏿🏿🏿🏾🏽🏽🏽🏻🏻🏼🏼🏻🏻🏻🏻🏼🏼🏼🏽🏽🏾🏾 104218 node.js
23 🏿🏿🏿🏿🏿🏿🏿🏾🏾🏽🏽🏽🏼🏼🏻🏻🏻🏼🏼🏼🏼🏽🏾🏾 360396 python
24 🏿🏿🏿🏿🏿🏿🏿🏾🏾🏽🏽🏽🏽🏼🏻🏻🏻🏼🏼🏼🏼🏽🏾🏾 98690 r
And a more compact shaded gradient, but with only 3 values:
#standardSQL
CREATE TEMP FUNCTION barchart(v ARRAY<FLOAT64>, mm STRUCT<min FLOAT64, max FLOAT64>) AS ((
SELECT STRING_AGG(SUBSTR('▓▒░', 1+CAST(ROUND(y) AS INT64), 1), '')
FROM (SELECT IFNULL(SAFE_DIVIDE((e-mm.min),(mm.max-mm.min))*2, 0) y FROM UNNEST(v) e)));
CREATE TEMP FUNCTION vbar(v ARRAY<FLOAT64>) AS (
barchart(v, (SELECT AS STRUCT MIN(a), MAX(a) FROM UNNEST(v) a))
);
WITH top_countries AS (
(SELECT x.value FROM (SELECT APPROX_TOP_COUNT(country_code, 12) x FROM `ghtorrent-bq.ght_2017_09_01.users`), UNNEST(x) x)
)
SELECT vbar(ARRAY_AGG(1.0*hhh.count ORDER BY hhh.value)) gradient, SUM(hhh.count) c, country_code
FROM (
SELECT country_code, APPROX_TOP_COUNT(EXTRACT(HOUR FROM a.created_at), 24) h_h
FROM `githubarchive.year.2017` a
JOIN `ghtorrent-bq.ght_2017_09_01.users` b
ON a.actor.login=b.login
WHERE country_code IN (SELECT * FROM top_countries)
AND actor.login NOT IN (SELECT value FROM (SELECT APPROX_TOP_COUNT(actor.login, 1000) x FROM `githubarchive.year.2017` WHERE type='WatchEvent'), UNNEST(x))
AND a.type='WatchEvent'
GROUP BY 1
), UNNEST(h_h) hhh
GROUP BY country_code
ORDER BY STRPOS(gradient, '░')
Row gradient c country_code
1 ░░░░░░░▒▒▒▒▒▒▒▒▓▓▓▓▓▓▒▒░ 204023 au
2 ▒░░░░░░░░░▒▒▒▒▒▒▒▓▓▓▓▓▓▒ 293589 jp
3 ▓▒░░▒▒░░░░▒▒▒▒▒▒▒▓▓▓▓▓▓▓ 2125724 cn
4 ▓▓▓▒▒░░░░░░░░▒▒▒▒▒▒▒▒▓▓▓ 447092 in
5 ▓▓▓▓▓▓▒▒░░░░░░░░▒▒▒▒▒▒▒▓ 381510 ru
6 ▓▓▓▓▓▓▒▒░░░░░░░░▒▒▒▒▒▒▒▒ 545906 de
7 ▓▓▓▓▓▓▓▒░░░▒░░░░▒▒▒▒▒▒▒▒ 395949 fr
8 ▓▓▓▓▓▓▓▒▒░░░░░░░░▒▒▒▒▒▒▒ 491068 gb
9 ▒▒▒▒▓▓▓▓▓▓▓▒░░░▒░░░░░▒▒▒ 419608 br
10 ▒▒▒▒▒▒▒▓▓▓▓▓▓▒▒░░░░░░░░▒ 2443381 us
11 ▒▒▒▒▒▒▒▓▓▓▓▓▓▒▒░░░░░░░▒▒ 294793 ca
And a short code for sparklines - works great with Data Studio:
#standardSQL
CREATE TEMP FUNCTION barchart(v ARRAY<FLOAT64>, mm STRUCT<min FLOAT64, max FLOAT64>) AS ((
SELECT STRING_AGG(SUBSTR('▁▂▃▄▅▆▇█', 1+CAST(ROUND(y) AS INT64), 1), '')
FROM (SELECT IFNULL(SAFE_DIVIDE((e-mm.min),(mm.max-mm.min))*7, 0) y FROM UNNEST(v) e)));
CREATE TEMP FUNCTION vbar(v ARRAY<FLOAT64>) AS (
barchart(v, (SELECT AS STRUCT MIN(a), MAX(a) FROM UNNEST(v) a))
);
Adding a more-or-less generic option for producing time-series / sparkline type reports:
#standardSQL
CREATE TEMP FUNCTION sparklines(arr ARRAY<INT64>) AS ((
SELECT STRING_AGG(CODE_POINTS_TO_STRING([code]), '')
FROM UNNEST(arr) el,
UNNEST([(SELECT MAX(el) FROM UNNEST(arr) el)]) mx,
UNNEST([(SELECT MIN(el) FROM UNNEST(arr) el)]) mn
JOIN UNNEST([9602, 9603, 9605, 9606, 9607]) code WITH OFFSET pos
ON pos = CAST(IF(mx = mn, 1, (el - mn) / (mx - mn)) * 4 AS INT64)
));
WITH series AS (
SELECT 1 id, [3453564, 5343333, 2876345, 3465234] arr UNION ALL
SELECT 2, [5743231, 3276438, 1645738, 2453657] UNION ALL
SELECT 3, [1,2,3,4,5,6,7,8,9,0] UNION ALL
SELECT 4, [3245876, 2342879, 5876324, 7342564]
)
SELECT
id, TO_JSON_STRING(arr) arr, sparklines(arr) sparklines
FROM series
with the result as below:
Row id arr sparklines
1 1 [3453564,5343333,2876345,3465234] ▃▇▂▃
2 2 [5743231,3276438,1645738,2453657] ▇▅▂▃
3 3 [1,2,3,4,5,6,7,8,9,0] ▂▃▃▅▅▆▆▇▇▂
4 4 [3245876,2342879,5876324,7342564] ▃▂▆▇
Adding Mosha's version (taken from his comments below)
#standardSQL
CREATE TEMP FUNCTION barchart(v ARRAY<FLOAT64>, MIN FLOAT64, MAX FLOAT64) AS (
IF(
MIN = MAX,
REPEAT(CODE_POINTS_TO_STRING([9603]), ARRAY_LENGTH(v)),
(
SELECT STRING_AGG(CODE_POINTS_TO_STRING([9601 + CAST(ROUND(y) AS INT64)]), '')
FROM (
SELECT SAFE_DIVIDE(e-min, MAX - MIN) * 7 y
FROM UNNEST(v) e)
)
)
);
CREATE TEMP FUNCTION vbar(v ARRAY<FLOAT64>) AS (
barchart(v, (SELECT MIN(a) FROM UNNEST(v) a), (SELECT MAX(a) FROM UNNEST(v) a))
);
WITH numbers AS (
SELECT 1 id, [3453564., 5343333., 2876345., 3465234.] arr UNION ALL
SELECT 2, [5743231., 3276438., 1645738., 2453657.] UNION ALL
SELECT 3, [1.,2,3,4,5,6,7,8,9,0] UNION ALL
SELECT 4, [3245876., 2342879, 5876324, 7342564]
)
SELECT
id, TO_JSON_STRING(arr) arr, vbar(arr) sparklines
FROM numbers
If applied to the same dummy data as the versions above, it produces:
Row id arr sparklines
1 1 [3453564,5343333,2876345,3465234] ▃█▁▃
2 2 [5743231,3276438,1645738,2453657] █▄▁▂
3 3 [1,2,3,4,5,6,7,8,9,0] ▂▃▃▄▅▆▆▇█▁
4 4 [3245876,2342879,5876324,7342564] ▂▁▆█
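The bucketing behind these sparkline UDFs is the same in any language: scale each value into 0..7 and index into the 8 block characters. A Python sketch of that mapping (including the flat-series branch, like Mosha's MIN = MAX case):

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # U+2581..U+2588

def sparkline(values):
    lo, hi = min(values), max(values)
    if lo == hi:
        return BLOCKS[2] * len(values)  # flat series: repeat one mid-height block
    # scale each value into 0..7, then pick the matching block character
    return "".join(BLOCKS[round((v - lo) / (hi - lo) * 7)] for v in values)
```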
More craziness here 😊
Totally useless - but fun to play with
Applying all the different options presented in this post to image processing and drawing (using profile pictures of those who contributed to this post), plus some new ones:
The 1st and 2nd results (for Felipe's picture) were produced using Felipe's color-gradient approach with different scaling options.
The 3rd result uses Felipe's shaded-gradient approach.
The 4th result uses Mikhail's (mine) / Mosha's sparkline approach.
Finally, the 5th and 6th results use ASCII character sets representing ASCII shades of gray - respectively:
Short set - " .:-=+*#%#"
Full (long) set - "$#B%8&WM#*oahkbdpqwmZO0QLCJUYXzcvunxrjft/\|()1{}[]?-_+~<>i!lI;:,"^``'. "
The code is trivial and literally the same as in the respective answers - the only difference is that the data used in the above exercises is the image's pixel data, simply acquired using the HTML canvas getImageData() method - obviously outside of BigQuery, with just a simple HTML page.
The options for getting crazy here and having fun with image transformation / processing are limitless! But probably useless outside of a learning scope 😜
Fitting a vertical bar chart into a single character is challenging because there are only 8 different heights we can use. But horizontal bar charts don't have this limitation - we can scale a horizontal chart to an arbitrary length. The example below uses 30, and shows the number of births per day of week as a horizontal bar chart. The data is based on a public dataset:
create temp function hbar(value int64, max int64) as (
repeat('█', cast(30 * value / max as int64))
);
select
['sunday', 'monday', 'tuesday', 'wednesday',
'thursday', 'friday', 'saturday'][ordinal(wday)] wday, bar from (
select wday, hbar(count(*), max(count(*)) over()) bar
from `bigquery-public-data.samples.natality`
where wday is not null
group by 1
order by 1 asc)
Results in
wday bar
---------------------------------------------
sunday ███████████████████
monday ███████████████████████████
tuesday ██████████████████████████████
wednesday ██████████████████████████████
thursday █████████████████████████████
friday █████████████████████████████
saturday █████████████████████
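The hbar function above is just proportional string repetition; an equivalent Python sketch (using round() to stand in for BigQuery's float-to-INT64 cast, which rounds rather than truncates):

```python
def hbar(value, maximum, width=30):
    # scale `value` to at most `width` block characters
    return "█" * round(width * value / maximum)
```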

Compare two delimited strings and return corresponding value in PL SQL

I have two columns with hash-delimited values, i.e. Email#Web#Telephone#SMS#MMS & 0#0#0#1#0. Note that each delimited value of the second column matches up with its corresponding delimited value in the first column, i.e. Email = 0, Web = 0, Telephone = 0, SMS = 1, etc.
Based on a parameter, I want to return the matching values from the second column, i.e. incoming param = Web#Telephone#SMS, thus the value that I want to return is 0#0#1.
This needs to be done in PL/SQL, and I have no clue where to start, which explains the lack of sample code.
Any help please?
There are a couple of very useful utility functions in an Oracle package called APEX_UTIL. (This package concerns Oracle Application Express aka APEX, but can be used anywhere). They are:
apex_util.string_to_table
apex_util.table_to_string
Using string_to_table you can convert the delimited string into a table of values:
declare
v_table apex_application_global.vc_arr2; -- This is the table type apex_util uses
begin
v_table := apex_util.string_to_table ('Email#Web#Telephone#SMS#MMS', '#');
end;
You now have an array with 5 elements ('Email', 'Web', 'Telephone', 'SMS', 'MMS');
You can do the same with the values string to get a table with elements ('0', '0', '0', '1', '0'). And you can do the same with the parameter to get a table with elements ('Web', 'Telephone', 'SMS').
You can then use PL/SQL logic to build a new array with elements for the values you need to return, i.e. ('0', '0', '1'). I have left this part to you!
Finally you can turn that back into a delimited string:
return apex_util.table_to_string (v_return_table, '#');
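The part left to the reader - building the return array from the three tables - is a simple lookup; a Python sketch of the whole round trip (split, map each requested name to its value, join) with '#' as the delimiter. Illustrative only, not PL/SQL:

```python
def match_values(attrs, values, param, sep="#"):
    # pair up attribute names with their values, then look up each requested name
    mapping = dict(zip(attrs.split(sep), values.split(sep)))
    return sep.join(mapping[name] for name in param.split(sep))
```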
Firstly, you should normalize the table and have the attributes in different columns rather than delimited strings in a single column.
Anyway, you could do it in many ways using the techniques to split comma-delimited strings in a table.
For example, using REGEXP_SUBSTR and CONNECT BY clause:
SQL> WITH DATA(attr, val) AS(
2 SELECT 'Email#Web#Telephone#SMS#MMS', '0#0#0#1#0' FROM dual
3 )
4 SELECT lines.COLUMN_VALUE,
5 trim(regexp_substr(t.attr, '[^#]+', 1, lines.COLUMN_VALUE)) attr,
6 trim(regexp_substr(t.val, '[^#]+', 1, lines.COLUMN_VALUE)) val
7 FROM data t,
8 TABLE (CAST (MULTISET
9 (SELECT LEVEL FROM dual CONNECT BY LEVEL <= regexp_count(t.attr, '#')+1
10 ) AS sys.odciNumberList ) ) lines
11 /
COLUMN_VALUE ATTR VAL
------------ --------------------------- ---------
1 Email 0
2 Web 0
3 Telephone 0
4 SMS 1
5 MMS 0
SQL>
Now, you can get the respective values for each attribute.
You could put the entire logic in a FUNCTION and return the corresponding values of each attribute and call the function in SELECT statement.
For example,
SQL> CREATE OR REPLACE
2 FUNCTION get_val_from_attr(
3 attr_name VARCHAR2)
4 RETURN NUMBER
5 IS
6 var_val NUMBER;
7 BEGIN
8 WITH DATA(attr, val) AS
9 ( SELECT 'Email#Web#Telephone#SMS#MMS', '0#0#0#1#0' FROM dual
10 ),
11 t2 AS
12 (SELECT lines.COLUMN_VALUE,
13 trim(regexp_substr(t.attr, '[^#]+', 1, lines.COLUMN_VALUE)) attr,
14 trim(regexp_substr(t.val, '[^#]+', 1, lines.COLUMN_VALUE)) val
15 FROM data t,
16 TABLE (CAST (MULTISET
17 (SELECT LEVEL FROM dual CONNECT BY LEVEL <= regexp_count(t.attr, '#')+1
18 ) AS sys.odciNumberList ) ) lines
19 )
20 SELECT val INTO var_val FROM t2 WHERE attr = attr_name;
21 RETURN var_val;
22 END;
23 /
Function created.
Let's call the function:
SQL> SELECT get_val_from_attr('Email') FROM dual;
GET_VAL_FROM_ATTR('EMAIL')
--------------------------
0
SQL> SELECT get_val_from_attr('SMS') FROM dual;
GET_VAL_FROM_ATTR('SMS')
------------------------
1