BigQuery: unnest 1 table into 2 while preserving refs - sql

I'm currently trying to import BQ tables full of arrays into a third-party visualization tool that does not support them. I'm more of a Node/NoSQL guy, and this BQ step is somewhat of a complex exception within the project, so I suspect I'm not approaching the problem correctly to begin with.
A table looks like this:
Entry ID (primary) | User ID | ...more metadata (finite) | Field ID 1 (dynamic) | Field ID 2 (dynamic) | ...more fields (dynamic)
K1 | U1 | strings & numbers | [Value ID1, Value ID2] | [Value ID5, Value ID6] | ...more arrays of values
K2 | U1 | strings & numbers | [Value ID2] | [Value ID5, Value ID6] | ...more arrays of values
K3 | U2 | strings & numbers | [Value ID1] | [Value ID4, Value ID6] | ...more arrays of values
Some more context:
our system follows a simple pattern: 1 org = 1 dataset = many users
the datasets are organized in exactly the same way across orgs (in terms of the number of tables and their IDs)
from now on I'll focus on one given table per org (let's call it "the Data table"): the one shared above
that Data table only shares half of its schema across orgs (primary key, user ID, and a few more metadata columns; this part is finite and known), while the second part of the schema (all the "Field ..." columns) varies from one org to another (both the number of columns and the column names)
everything we're discussing will be handled by a Node process that iterates over the org datasets, so it must be generic enough to handle all of them
any intermediary step, like running another pre-process to create intermediary tables or views, is acceptable
although I used JS notation for the arrays, the BQ schema of the "fields" is STRING/REPEATED, but it is possible to alter the way tables are exported to BigQuery if necessary
What I've tried:
flattening the table by parsing the arrays into strings within Node at the moment those tables are exported to BigQuery => the third party doesn't support custom logic on cells, so in the end the visualization can't correctly interpret the values
doing everything described in "What I believe I should do" below, but through Node only: i.e. reading from BQ, parsing and mapping, then creating the 2 views => it screams inefficiency, as I believe Node should only handle the automation part and simply send the queries to BQ
doing it through SQL, but even though I can read it and run simple queries, as soon as I try to mix UNNEST, JOIN, and a dynamic number of unknown columns, I'm kind of lost
What I believe I should do:
the third party allows creating a Data Model and relations before visualizing, so I could have one view with one row per "values group", and another view that looks like the initial table, except the arrays of values are replaced by a string referencing the "primary key" of that "values group" view
The 2 outputs would look like this:
Refs
Ref ID | Value 1 (index 0) | Value 2 | Value 3 | ...values
Ref1 | Value ID1 | Value ID2
Ref2 | Value ID5 | Value ID6
Ref3 | Value ID2
Ref4 | Value ID1
Ref5 | Value ID4 | Value ID6
Map
Entry ID (primary) | User ID | ...more metadata (finite) | Field ID 1 (dynamic) | Field ID 2 (dynamic) | ...more fields (dynamic)
K1 | U1 | strings & numbers | Ref1 | Ref2 | ...more refs
K2 | U1 | strings & numbers | Ref3 | Ref2 | ...more refs
K3 | U2 | strings & numbers | Ref4 | Ref5 | ...more refs
The questions:
does it sound logical (from a data-analysis standpoint) and doable (from a BQ query standpoint)?
I keep thinking 1 process > 1 read > 2 outputs for efficiency because of Node, but I should actually have one query from the Data table to UNNEST into the Refs view, and then another query from Data & Refs to generate the Map view, right?
should I use GENERATE_UUID() to handle the Ref ID generation, or is there something else more suited?
Thanks for making it this far, I'll gladly take any input at this point.

You want to bring the nested table back to a relational data structure.
This is possible and, depending on the requirements, a good choice.
Please be aware that the following query has only been tested on a small dataset.
with tbl as (
  -- your sample data
  select "K1" as EntryID, "U1" as UserID, "strings & numbers" as metadata, ["Value ID1", "Value ID2"] as ID1, ["Value ID5", "Value ID6"] as ID2
  union all select "K2", "U1", "strings & numbers", ["Value ID2"], ["Value ID5", "Value ID6"]
  union all select "K3", "U2", "strings & numbers", ["Value ID1"], ["Value ID4", "Value ID6"]
),
ref as (
  select *, row_number() over (order by ref_name) as ref_id
  from (
    -- serialize each array to a string so it can be deduplicated and joined on
    select distinct format("%T", ID1) as ref_name,
      ID1[safe_offset(0)] as Value1,
      ID1[safe_offset(1)] as Value2,
      ID1[safe_offset(2)] as Value3,
      ID1[safe_offset(3)] as Value4
    from (select ID1 from tbl union all select ID2 from tbl)
  )
)
select T.* except (ID1, ID2)
  , A.ref_id as Field_ID1
  , B.ref_id as Field_ID2
from tbl T
left join ref A on format("%T", ID1) = A.ref_name
left join ref B on format("%T", ID2) = B.ref_name
First we generate your sample table tbl.
Table ref
The columns ID1 and ID2 are combined (union all).
Each array is converted into a string with format("%T", ID1) in the column ref_name.
For each entry of the array, we generate a column: ID1[safe_offset(0)] as Value1, and so on.
The select distinct keeps only unique arrays.
Finally, we create a row_number for a unique reference id.
This is put in the ref CTE. However, you should save it as a table: create or replace table yourdataset.ref_table as ...
Table Map
We query the tbl table without the array columns.
We convert the array column ID1 to a string and join the reference id from the ref table: format("%T", ID1) = A.ref_name.
The same has to be done for ID2.
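On the Ref ID question: GENERATE_UUID() would produce a different id on every run, and row_number() depends on how the ref rows sort, so a deterministic fingerprint of the serialized array is arguably a better fit. Below is a minimal, hypothetical sketch only: names like yourdataset.data_table, ref_table, map_table, ID1 and ID2 are placeholders, and the number of Value/Field columns would be templated per org by the Node process.
-- Hypothetical sketch: persist both outputs so the visualization tool can model the relation.
-- FARM_FINGERPRINT over the serialized array yields a ref id that is stable across runs.
create or replace table yourdataset.ref_table as
select distinct
  farm_fingerprint(format("%T", arr)) as ref_id,
  arr[safe_offset(0)] as Value1,
  arr[safe_offset(1)] as Value2,
  arr[safe_offset(2)] as Value3
from (
  select ID1 as arr from yourdataset.data_table
  union all
  select ID2 from yourdataset.data_table
);

create or replace table yourdataset.map_table as
select
  t.* except (ID1, ID2),
  farm_fingerprint(format("%T", ID1)) as Field_ID1,
  farm_fingerprint(format("%T", ID2)) as Field_ID2
from yourdataset.data_table t;
Because the ref id is derived from the array contents, the Map table needs no join back to Refs, and the Node process only has to fill in the dataset name and the dynamic field columns.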

Related

Mapping array to composite type to a different row type

I want to map an array of key/value pairs (GroupCount) to a composite type (GroupsResult), mapping only specific keys.
I'm using unnest to turn the array into rows, and then use 3 separate select statements to pull out the values.
This feels like a lot of code for something so simple.
Is there an easier / more concise way to do the mapping from the array type to the GroupsResult type?
create type GroupCount AS (
Name text,
Count int
);
create type GroupsResult AS (
Cats int,
Dogs int,
Birds int
);
WITH unnestedTable AS (WITH resultTable AS (SELECT ARRAY [ ('Cats', 5)::GroupCount, ('Dogs', 2)::GroupCount ] resp)
SELECT unnest(resp)::GroupCount t
FROM resultTable)
SELECT (
(SELECT (unnestedTable.t::GroupCount).count FROM unnestedTable WHERE (unnestedTable.t::GroupCount).name = 'Cats'),
(SELECT (unnestedTable.t::GroupCount).count FROM unnestedTable WHERE (unnestedTable.t::GroupCount).name = 'Dogs'),
(SELECT (unnestedTable.t::GroupCount).count FROM unnestedTable WHERE (unnestedTable.t::GroupCount).name = 'Birds')
)::GroupsResult
fiddle
http://sqlfiddle.com/#!17/56aa2/1
A bit simpler. :)
SELECT (min(u.count) FILTER (WHERE name = 'Cats')
, min(u.count) FILTER (WHERE name = 'Dogs')
, min(u.count) FILTER (WHERE name = 'Birds'))::GroupsResult
FROM unnest('{"(Cats,5)","(Dogs,2)"}'::GroupCount[]) u;
db<>fiddle here
See:
Aggregate columns with additional (distinct) filters
Subtle difference: our original raises an exception if one of the names pops up more than once, while this will just return the minimum count. May or may not be what you want - or be irrelevant if duplicates can never occur.
For many different names, crosstab() is typically faster. See:
PostgreSQL Crosstab Query
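For completeness, a hypothetical sketch of that crosstab() variant on the question's sample values (the tablefunc extension must be installed; the column names in the AS ct(...) clause are chosen here for illustration, not prescribed):
-- Hypothetical sketch: pivot the name/count pairs with crosstab() instead of FILTER aggregates.
create extension if not exists tablefunc;
select (ct.cats, ct.dogs, ct.birds)::GroupsResult
from crosstab(
  $$select 1 as row_id, u.name, u.count
    from unnest('{"(Cats,5)","(Dogs,2)"}'::GroupCount[]) u
    order by 1$$,
  $$values ('Cats'), ('Dogs'), ('Birds')$$
) as ct(row_id int, cats int, dogs int, birds int);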

SQL query to get conflicting values in JSONB from a group

I have a table defined similar to the one below. location_id is a FK to another table. Reports are saved in an N+1 fashion: for a single location, N reporters are available, and there's one report used as the source of truth, if you will. Reports from reporters have a single-letter code (let's say R), while the source of truth has a different code (let's say T). The keys for the JSONB column are regular strings; values are any combination of strings, integers and integer arrays.
create table report (
id integer not null primary key,
location_id integer not null,
report_type char(1),
data jsonb
)
Given all the information above, how can I get all location IDs where the data values for a given set of keys (supplied at query time) are not all the same for the report_type R?
There are at least two solid approaches, depending on how complex you want to get and how numerous and/or dynamic the keys are. The first is very straightforward:
select location_id
from report
where report_type = 'R'
group by location_id
having count(distinct data->'key1') > 1
or count(distinct data->'key2') > 1
or count(distinct data->'key3') > 1
The second construction is more complex, but has the advantage of providing a very simple list of keys:
--note that we also need distinct on location id to return one row per location
select distinct on(location_id) location_id
--jsonb_each returns the key, value pairs with value in type JSON (or JSONB) so the value field can handle integers, text, arrays, etc
from report, jsonb_each(data)
where report_type = 'R'
and key in('key1', 'key2', 'key3')
group by location_id, key
having count(distinct value) > 1
order by location_id
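For illustration only, here are hypothetical sample rows (not from the question) showing what gets flagged: location 1 has conflicting values for key1 among its 'R' reports, location 2 does not.
-- Hypothetical sample data:
insert into report (id, location_id, report_type, data) values
  (1, 1, 'R', '{"key1": "a", "key2": 1}'),
  (2, 1, 'R', '{"key1": "b", "key2": 1}'),   -- conflicts with row 1 on key1
  (3, 2, 'R', '{"key1": "a", "key2": 1}'),
  (4, 2, 'R', '{"key1": "a", "key2": 1}'),
  (5, 2, 'T', '{"key1": "zzz"}');            -- ignored: report_type <> 'R'
Both queries above then return only location_id = 1.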

Tricky PostgreSQL join and order query

I've got four tables in a PostgreSQL 9.3.6 database:
sections
fields (child of sections)
entries (child of sections)
data (child of entries)
CREATE TABLE section (
id serial PRIMARY KEY,
title text,
"group" integer
);
CREATE TABLE fields (
id serial PRIMARY KEY,
title text,
section integer,
type text,
"default" json
);
CREATE TABLE entries (
id serial PRIMARY KEY,
section integer
);
CREATE TABLE data (
id serial PRIMARY KEY,
data json,
field integer,
entry integer
);
I'm trying to generate a page that looks like this:
section title
field 1 title | field 2 title | field 3 title
entry 1 | data 'as' json | data 1 json | data 3 json <-- table
entry 2 | data 'df' json | data 5 json | data 6 json
entry 3 | data 'gh' json | data 8 json | data 9 json
The way I have it set up right now each piece of 'data' has an entry it's linked to, a corresponding field (that field has columns that determine how the data's json field should be interpreted), a json field to store different types of data, and an id (1-9 here in the table).
In this example there are 3 entries, and 3 fields and there is a data piece for each of the cells in between.
It's set up like this because one section can have different field types and quantity than another section and therefore different quantities and types of data.
Challenge 1:
I'm trying to join the tables together in a way that makes the result sortable by any of the columns (the contents of the data json for that field). For example, if I sort field 3 (the third column) in reverse order, the table would look like this:
section title
field 1 title | field 2 title | field 3 title
entry 3 | data 'gh' json | data 8 json | data 9 json
entry 2 | data 'df' json | data 5 json | data 6 json
entry 1 | data 'as' json | data 1 json | data 3 json <-- table
I'm open to doing it another way too if there's a better one.
Challenge 2:
Each field has a 'default value' column - Ideally I only have to create 'data' entries when they have a value that isn't that default value. So the table might actually look like this if field 2's default value was 'asdf':
section title
field 1 title | field 2 title | field 3 title
entry 3 | data 'gh' json | data 8 json | data 9 json
entry 2 | data 'df' json | 'asdf' | data 6 json
entry 1 | data 'as' json | 'asdf' | data 3 json <-- table
The key to writing this query is understanding that you just need to fetch all the data for a single section and join the rest. With your schema you also can't filter data directly by section, so you'll need to join entries just for that:
SELECT d.* FROM data d JOIN entries e ON (d.entry = e.id)
WHERE e.section = ?
You can then join fields to each row to get the defaults, types and titles:
SELECT d.*, f.title, f.type, f."default"
FROM data d JOIN entries e ON (d.entry = e.id)
JOIN fields f ON (d.field = f.id)
WHERE e.section = ?
Or you can select fields in a separate query to save some network traffic.
So that was the answer; here come the bonuses:
Use foreign keys instead of plain integers to refer to other tables; it will make the database check consistency for you.
Relations (tables) should be named in the singular by convention, so it's section, entry and field.
Referencing fields are called <name>_id, e.g. field_id or section_id, also by convention.
The whole point of JSON fields is to store data whose structure isn't statically defined, so it would make much more sense to drop the entries and data tables and use a single table with a JSON column containing all the fields instead.
Like this:
CREATE TABLE "row" ( -- "row" is a reserved word, so it needs quoting; a less generic name would be even better
id int primary key,
section_id int references section (id),
data json
)
With data fields containing something like:
{
"title": "iPhone 6",
"price": 650,
"available": true,
...
}
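Under that single-table layout, Challenge 1 (sorting by any field) reduces to an ORDER BY over the JSON document. A hypothetical sketch; the 'price' key is only an example value from the document above:
-- Hypothetical sketch against the suggested "row" table: ->> returns text,
-- so cast when numeric ordering is wanted.
select id, data
from "row"
where section_id = 1
order by (data ->> 'price')::numeric desc nulls last;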
@Suor has provided good advice, some of which you have already accepted. I am building on the updated schema.
Schema
CREATE TABLE section (
section_id serial PRIMARY KEY,
title text,
grp integer
);
CREATE TABLE field (
field_id serial PRIMARY KEY,
section_id integer REFERENCES section,
title text,
type text,
default_val json
);
CREATE TABLE entry (
entry_id serial PRIMARY KEY,
section_id integer REFERENCES section
);
CREATE TABLE data (
data_id serial PRIMARY KEY,
field_id integer REFERENCES field,
entry_id integer REFERENCES entry,
data json
);
I changed two more details:
section_id instead of id, etc. "id" as a column name is an anti-pattern that became popular because a couple of ORMs use it. Don't. Descriptive names are much better. Identical names for identical content is a helpful guideline. It also allows using the shortcut USING in join clauses.
Don't use reserved words as identifiers. Use legal, lower-case, unquoted names exclusively to make your life easier.
Are PostgreSQL column names case-sensitive?
Referential integrity?
There is another inherent weakness in your design. What stops entries in data from referencing a field and an entry that don't go together? Closely related question on dba.SE
Enforcing constraints “two tables away”
Query
Not sure if you need the complex design at all. But to answer the question, this is the base query:
SELECT entry_id, field_id, COALESCE(d.data, f.default_val) AS data
FROM entry e
JOIN field f USING (section_id)
LEFT JOIN data d USING (field_id, entry_id) -- can be missing
WHERE e.section_id = 1
ORDER BY 1, 2;
The LEFT JOIN is crucial to allow for missing data entries and use the default instead.
SQL Fiddle.
crosstab()
The final step is cross tabulation. Cannot show this in SQL Fiddle since the additional module tablefunc is not installed.
Basics for crosstab():
PostgreSQL Crosstab Query
SELECT * FROM crosstab(
$$
SELECT entry_id, field_id, COALESCE(d.data, f.default_val) AS data
FROM entry e
JOIN field f USING (section_id)
LEFT JOIN data d USING (field_id, entry_id) -- can be missing
WHERE e.section_id = 1
ORDER BY 1, 2
$$
,$$SELECT field_id FROM field WHERE section_id = 1 ORDER BY field_id$$
) AS ct (entry int, f1 json, f2 json, f3 json) -- static
ORDER BY f3->>'a'; -- static
The tricky part here is the return type of the function. I provided a static type for 3 fields, but you really want that dynamic. Also, I am referencing a field in the json type that may or may not be there ...
So build that query dynamically and execute it in a second call.
More about that:
Dynamic alternative to pivot with CASE and GROUP BY
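One workable pattern for the dynamic part (a sketch, not taken from the linked answer): let a first query build the column-definition list from field, then splice it into the crosstab() call in the client and run that as the second call.
-- First round trip: build the column-definition list for the crosstab() return type.
SELECT string_agg(format('f%s json', field_id), ', ' ORDER BY field_id) AS col_defs
FROM   field
WHERE  section_id = 1;
-- returns e.g.:  f1 json, f2 json, f3 json
-- Second round trip (assembled client-side): substitute col_defs into
--   ... ) AS ct (entry int, <col_defs>)
-- and execute the full crosstab() query shown above.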

Assign unique ID's to three tables in SELECT query, ID's should not overlap

I am working on SQL Server and I want to assign unique IDs to rows being pulled from three tables, but the IDs should not overlap.
Let's say table one contains car data, table two contains house data, and table three contains city data. I want to pull all this data into a single table with a unique ID for each row, say cars from 1-100, houses from 101-200 and cities from 300-400.
How can I achieve this using only SELECT queries? I can't use INSERT statements.
To be more precise:
I have one table with computer system/server host information, which has IDs from 500-700.
I have other tables: storage devices (IDs from 200-600) and routers (IDs from 700-900). I have already collected the systems data. Now I want to pull the storage and router data in such a way that the consolidated data at my end has a unique ID for every record. This needs to be done using SELECT queries only.
I was using SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS UniqueID and storing it in temp tables (separate ones for storage and routers). But I believe this may lead to some overlap. Please suggest another way to do this.
An extension to this question:
Creating a consistent integer from a string:
All I have is various strings like this
String1
String2Hello123
String3HelloHowAreYou
I need to convert them into positive integers, something like
String1 = 12
String2Hello123 = 25
String3HelloHowAreYou = 4567
Note that I am not expecting the numbers in any order. The only requirement is that the number generated for one string should not conflict with the number for another.
Now, later, after a reboot, suppose I no longer have the 2nd string and instead there is a new string:
String1 = 12
String3HelloHowAreYou = 4567
String2Hello123HowAreyou = 28
Note that the number 25 generated for the 2nd string earlier cannot be reused for the new string.
Using extra storage (temp tables) is not allowed
If you don't care where the data comes from:
with dat as (
select 't1' src, id from table1
union all
select 't2' src, id from table2
union all
select 't3' src, id from table3
)
select *
, id2 = row_number() over( order by _some_column_ )
from dat
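For the "consistent integer from a string" extension: row_number() is not stable across runs, so a number derived from the string itself may fit better. A hedged sketch follows (CHECKSUM is fast but can collide; HASHBYTES makes collisions practically negligible, though still not impossible):
-- Hypothetical sketch: repeatable numbers computed from the strings themselves, no extra storage.
-- The sample strings are the ones from the question.
select s.str,
       abs(checksum(s.str))                               as quick_id,   -- fast, small collision risk
       abs(convert(bigint, hashbytes('SHA2_256', s.str))) as hashed_id   -- last 8 bytes of the hash
from (values ('String1'), ('String2Hello123'), ('String3HelloHowAreYou')) as s(str);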

Fetch a single field from DB table into itab

I want to fetch a field, say excep_point, from a transparent table z_accounts for a combination of company_code and account_number. How can I do this in ABAP SQL?
Assume that table structure is
|company_code | account_number | excep_point |
Assuming you have the full primary key...
data: gv_excep_point type zaccounts-excep_point.
select single excep_point
into gv_excep_point
from zaccounts
where company_code = some_company_code
and account_number = some_account_number.
if you don't have the full PK and there could be multiple values for excep_point
data: gt_excep_points type table of zaccounts-excep_point.
select excep_point
into table gt_excep_points
from zaccounts
where company_code = some_company_code
and account_number = some_account_number.
There is at least one other variation, but those are the 2 I use most often.
For information only: when you select data into a table, you can write more complex expressions to combine different fields. For example, say you have an internal table (itab) with two fields "A" and "B", and you are going to select data from a DB table (dbtab) which has 6 columns - "z", "x", "y", "u", "v", "w" - each of type char2. You aim to combine "z", "x", "y", "u" into the "A" field of the internal table and "v", "w" into the "B" field. You can write simple code:
select z as A+0(2)
x as A+2(2)
y as A+4(2)
u as A+6(2)
v as B+0(2)
w as B+2(2) FROM dbtab
INTO CORRESPONDING FIELDS OF TABLE itab
WHERE <where condition>.
This simple code gets the job done very simply.
In addition to Bryan's answer, here is the official online documentation on Open SQL.