Cluster table by nested fields - google-bigquery

I have worked with partitioned and clustered tables before. I always use simple fields for this, but now I have the necessity to cluster my table by some nested fields. This is a reduced example of my schema:
id
timestamp
source
|-email
|-service
|-country
I do not have any problem for the partitioning as the timestamp field is a simple one. However, I would like to cluster the table by source.service and source.country fields and I do not know how to refer to those nested columns when declaring the clusters on table creation.
I have search over other related questions and documentation on Internet but I have not found any solution. Is there a way to achieve it?

The partitioning column must be a top-level field.
You cannot use a leaf field from a RECORD (STRUCT) as the partitioning column.
The partitioning column must be either a scalar DATE, TIMESTAMP, or DATETIME column. While the mode of the column can be REQUIRED or NULLABLE, it cannot be REPEATED (array-based).
Check all limitations
Similar limitation for clustering
Clustering columns must be top-level, non-repeated columns of one of the following types:
DATE
BOOL
GEOGRAPHY
INT64
NUMERIC
BIGNUMERIC
STRING
TIMESTAMP
DATETIME

Related

Is json suitable for holding values coming from dynamic columns

Is json suitable for holding values coming from dynamic columns?
I want to store data about documents where the structure is known in advance, e.g. Name, Name, Year, Place, etc. For now, I have defined these fields as columns in a table in MS SQL.
However, I would like to store data about "dynamic" documents, where the user himself will select a field from a certain pool and insert a value. For now, I store this data as json in one column. For example once will be something like this:'{"Name":"John Doe","Place":"Chicago"}'
and other times it can only be: '{"Name": "John Doe"}'
I wonder how to unify this data. I thought that data from first type of documents (with a known number of columns) should also be stored in json. But I don't know if this is a good approach with large amounts of data, eg 100,000 records.
First, 100,000 rows is not a large number of columns. It is doubtful that you are even talking about a gigabyte of data.
Both XML and JSON incur overhead for storing field names in the data. If you have lots of repeated field names, then lots of redundant field names are being stored. And bigger rows slow down queries.
JSON and XML also have challenges in verifying the field names. This can be handled through the application or constraints. That said, they can be quite useful when the attributes are rarely used.
Your sample data simply suggests NULLable columns. You can have name and place. If there is no place then the value is NULL. Given that you have a fixed pool, there is a good change that this structure is the simplest and most efficient. The only downside is that adding a new column requires adding a column to a table. And that can be an expensive operation.
An alternative is an EAV model, which is mentioned in the comments. This solves the problem of repeating the names, because you can use ids instead. So, you could have:
create table optionalFields (
optionFieldId int identity(1, 1) primary key,
name varchar(255)
);
create table userOptionalFields (
userOptionalFieldId int identity (1, 1) primary key,
userId int references users(userId),
optionalFieldId int references optionalFields(optionalFieldId),
value varchar(255)
);
The downside to an EAV model is that it is simplest when all the values are strings, and that can be a little tricky if some of the values are numbers or dates. On the positive side, the database ensures that the fields are valid.
The choice between the different data models depends on factors such as:
The data type of the values.
The total number of fields.
How often new fields are added.
Whether the fields (for a given user) are updated and if so, if they are updated one-by-one or all-at-once.
How common fields are for a given user.
How familiar you are with XML and JSON.
Whether field names have synonyms (for instance "FullName for "Name").
Whether field names ever change. For instance, might "Name" suddenly become "FullName"?
And no doubt other issues as well.

Index Integer Substring of Varchar ID PostgreSQL

I am going to be creating a very large table (320k+ rows) that I am going to be doing many complicated operations on so performance is very important. Each row will be a reference to a page / entity from an external site that already has unique IDs. In order to keep the data easy to read and for consistency reasons I would rather use those external IDs as my own row IDs, however the problem is that the IDs are in the format of XXX######## where the XXX part is always the same identical string prefix and the second ######## part is a completely unique number. From what I know, using varchar ids is measurably slower performance wise, and only looking at the numerical part will have the same results.
What is the best way to do this? I still want to be able to do queries like WHERE ID = 'XXX########' and have the actual correct ids displayed in result sets rather than trimmed ones. Is there a way to define getters and setters for a column? Or is there a way to create an index that is a function on just the numerical part of the id?
Since your ID column (with format XXX########) is a primary key, there will already be an index on that column. If you wish to create an index based on the "completely unique number" portion of the ID, it is possible to create an expression index in Postgres:
CREATE INDEX pk_substr_idx ON mytable (substring(id,4));
This will create an index on the ######## portion of your column. However, bear in mind that the values stored in the index will be text, not numbers. Therefore, you might not be able to see any real benefit to having this index around (i.e., you'll only be able to check for equality = and not comparison >/</>=/<=.
The other drawback of this approach is that for every row you insert, you'll be updating two indexes (the one for the PK, and the one for the substring).
Therefore, if at all possible, I would recommend splitting your ID into separate prefix (the XXX portion) and id_num (the ######## portion) columns. Since you stated that "the XXX part is always the same identical string prefix", you would stand to reap a performance benefit by either 1) splitting the string into two columns or 2) hard-code the XXX portion into your app (since it's "always the same identical string prefix") and only store the numeric portion in the database.
Another approach (if you are willing to split the string into separate prefix and id_num columns) is to create a composite index. The table definition would then look something like:
CREATE TABLE mytable (
prefix text,
id_num int,
<other columns>,
PRIMARY KEY (prefix, id_num)
);
This creates a primary key on the two columns, and you would be able to see your queries use the index if you write your application with two columns in mind. Again, you would need to split the ID up into text and number portions. I believe this is the only way to get the best performance out of your queries. Any value that mixes text and numbers will ultimately be stored and interpreted as text.
Disclosure: I work for EnterpriseDB (EDB)
Use an IDENTITY type column for the primary key and load the external IDs as a separate column

A single table that represents multiple tables

I have a problem with finding a way to represent multiple tables hash tables into a single table.
Say I have 3 tables with the format:
Table1(Table1_PK1,Table1_PK2,Table1_PK3,Table1_Hash)
Table2(Table2_PK1,Table2_PK2,Table2_Hash)
Table3(Table3_Pk1,Table3_PK2,Table3_PK3,Table3_PK4,Table3_PK5,Table3_Hash)
Table1_PK1,Table1_PK2,Table1_PK3... are columns and they might have different datatypes (VARCHAR, INT or DATETIME ...).
My question is if there is a way to create a single table (fixed number of columns) that can represent all of these 3 tables (may be more in practical).
I am trying to do this for my database tool. Each table actual a table which contains primary keys and a hash data associating with them.
Since you're apparently building a database tool, not a database, it might make more sense to do this in application code rather than in a database table.
In a different answer, you commented
I am still looking for a dynamic way to do it without knowing how many primary keys a table can have.
A table can have only one primary key. That primary key can consist of more than one column, though. (You already knew this; you were just using the wrong words, which might confuse others.)
A table can also have an arbitrary number of other keys, which will be either declared (as NOT NULL UNIQUE) or "undeclared" (by creating an index that guarantees uniqueness over a set of columns).
You can look all that stuff up at run time in one or both of two ways. (Links go to documentation for PostgreSQL.)
System tables, sometimes called system catalogs
information_schema views
As far as I know, all modern SQL platforms implement at least one of these interfaces. The information_schema views are covered in the SQL standards, but there seems to be some room for interpretation. They don't look quite the same on all platforms.
Why combine the 3 tables into one? Would be really bad db design. But here's a way to do it:
The one table will have a column for each of the 3 tables' columns you want in the final table. I am making the assumption that TableX_Hash is the same type, so that remains as one unique column:
Table_All_in_One (
Table1_PK1,
Table1_PK2,
Table1_PK3,
# space just for clarity of grouping
Table2_PK1,
Table2_PK2,
Table3_PK1,
Table3_PK2,
Table3_PK3,
Table3_PK4,
Table3_PK5,
TableX_Hash # Assuming all the _Hash'es are the same type+length,
# otherwise, add Table1_Hash, Table2_Hash, Table3_Hash
# This can be your new primary key
)
The Primary Keys (PKx) are required to be non-NULL only in their own tables. For this table, they have to allow nulls. The idea is that each row of this new table will only hold the data for one of the tables. The other columns will be empty for that row. If you want to associate the row of one table with another, you can add that to the same row or add FK_Table1_Hash, FK_Table2_Hash and FK_Table3_Hash columns which will refer to the TableX_Hash value of a record.
PS: I wonder if what you are really looking for is a View and not this really bad all-in-one table.
Edit: Combining them into one "without knowing how many primary keys a table can have." as per your comment:
Store all the _PKs concatenated into one column:
Table_All_in_One (
New_PK,
TableX_Hash,
Table1_PKx, # Concatenated PKs of Table1
Table2_PKx, # Concatenated PKs of Table2, etc.
...,
# OR just one
TableX_PKs, # concatenate all the PK's into one VARCHAR field
# Add a pipe `|` between them optionally.
Table_Num # If using just one, then you'll need to store the table number
)
You will not be able to conveniently pick records based on part of their composite primary key. It will always have to be TableX_PKs = CONCAT_WS('|', Table1_PK1, Table1_PK2, ...). So your only dependency is the number of PKs in the original column.
In order to model a bunch of tables you will need 3 tables. An entity table that contains the table names of the tables you wish to set up this way called a factor or entity table. A Factor_detail table that contains all the columns and their associated properties of the tables. A table, factor_detail_value, for storing things like lookup values for lookup tables. I'm trying to learn more about this myself as well because we are using this technique at work as well. Genrate sql on the fly for any table so encoded, and store the data in a repository pertiinant to the data itself. This way if a table changes and you need to add a column or change a datatype, you can add a row to the factor detail table without affecting a database shut down in production. In most businesses a four hour shut down to make a sql data table change can cost thousands of dollars. If you are dealing with insurance for example, each additional state that you sell insurance in has different requirements for being able to seel it and that will result in table changes. We reduced our table count way down from over 700 tables in this manner also we can make changes without database shut down thus avoiding the loss in revenue.

How to use Oracle Indexes

I am a PHP developer with little Oracle experience who is tasked to work with an Oracle database.
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL. Instead they seem to create an index out of two fields.
For example I noticed that one of the indexes is a combination of a Date Field and foreign key ID field. The Date field seems to store the entire date and timestamp so the combination is fairly unique.
If the index name was PLAYER_TABLE_IDX how would I go about using this index in my PHP code?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
Any advice Oracle/PHP gurus?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
There's no way around that you have to use reference all the columns in a composite primary key to get a unique row.
You can't use an index directly in a SQL query.
In Oracle, you use the hint syntax to suggestion an index that should be used, but the only means of hoping to use an index is by specifying the column(s) associated with it in the SELECT, JOIN, WHERE and ORDER BY clauses.
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL.
Oracle (and PostgreSQL) have what are called "sequences". They're separate objects from the table, but are used for functionality similar to MySQL's auto_increment. Unlike MySQL's auto_increment, you can have more than one sequence used per table (they're never associated), and can control each one individually.
Instead they seem to create an index out of two fields.
That's what the table design was, nothing specifically Oracle about it.
But I think it's time to address that an index has different meaning in a database than how you are using the term. An index is an additional step to make SELECTing data out of a table faster (but makes INSERT/UPDATE/DELETE slower because of maintaining them).
What you're talking about is actually called a primary key, and in this example it'd be called a composite key because it involves more than one column. One of the columns, either the DATE (consider it DATETIME) or the foreign key, can have duplicates in this case. But because of the key being based on both columns, it's the combination of the two values that makes them the key to a unique record in the table.
http://use-the-index-luke.com/ is my Web-Book that explains how to use indexes in Oracle.
It's an overkill to your question, however, it is probably worth reading if you want to understand how things work.

How do you handle multiple value types in MySQL when inheritance isn't supported?

I need to create a table of attributes where each record is essentially just a name-value pair. The problem is that the value can be a string, integer or decimal and I'm using MySQL which doesn't support table inheritance. So, the question is - should I create a separate table for each value type or should I just create str_value, int_value and dec_value columns with an additional value_type column that tells you which type to use? There won't be many records in this table (less than 100), so performance shouldn't be much of an issue, but I just don't want to make a design decision that's going to make the SQL more complex than it has to be.
Having different tables, or even multiple columns where only one of the 3 are populated, is going to be a nightmare. Store it all as varchar along with a type column.
When querying the database, you will always receive strings, no matter what the column type is - so there is no reason to make a design decision here - just store everything as a string.
By the way: Having an additional value_type column is redundant - the entry has the type of the only column that has a not-null value.
If you create different columns for each type, you're going to have to be smart about it in your code anyway...
Why not just store everything as a string and convert to the desired type in your code?