BigTable: Storing IDs as Qualifiers? - bigtable

The GCP documentation says:
Because Cloud Bigtable tables are sparse, you can create as many column qualifiers as you need in each row. There is no space penalty for empty cells in a row. As a result, it often makes sense to treat column qualifiers as data. For example, if your table is storing user posts, you could use the unique identifier for each post as the column qualifier.
https://cloud.google.com/bigtable/docs/schema-design#column_families
Can anyone help me with an example?
If I have 1M users and each posts 1000 posts, does it make sense to have 1B column qualifiers (1M * 1000)?
Thanks!

There are a couple of constraints that are relevant here:
There is a hard limit of 256 MB per row
A row cannot be split across different nodes, which prevents parallelization
So you would want to avoid storing data from multiple users in a single row, which means you wouldn't put 1B posts into one row. However, having 1M rows, each with 1000 qualifiers, should be fine. You can think of the column qualifiers as keys in a hashmap.
Unlike SQL or column families, the qualifiers in each row are completely unrelated to the qualifiers in a different row.
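As a rough sizing check (the per-post size here is an assumed figure for illustration, not something from your numbers): at ~1 KB per post, 1000 posts * 1 KB ≈ 1 MB per user row, far below the 256 MB row limit, whereas packing all 1B posts into a single row would need on the order of 1 TB and would also funnel every read and write of that row through a single node.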

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' column as the PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes, unless it keeps column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I unintentionally asked an XY question. Basically, we have tables with hundreds of millions of rows, 100 columns each, and we receive all sorts of SELECT queries on them. The WHERE clause varies, but no incoming request needs all columns. Many of the cell values are also NULL.
So I was thinking of exploring a column-oriented database to get better compression and faster retrieval. My understanding is that column-oriented databases load only the requested columns, and compression should also save space and hopefully help performance.
For MySQL: indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY needs to be accessed; for a million rows, that is about 3 blocks. Within the leaf block there are probably dozens of rows, each with all its columns (unless a column is "too big", but that is a different discussion).
For MariaDB's ColumnStore: the contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located; after fetching it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space for both simple and complex queries.
Your simple query is easy and efficient for a regular RDBMS, but messier for a columnstore. Columnstore is a niche product, and your query is atypical for it.
Be aware that fetching blocks is typically the slowest part of performing a query, especially when I/O is required. There is a cache of blocks in RAM.
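If a handful of columns dominates the workload, one common way to avoid touching the full 100-column row is a covering index. A minimal sketch, assuming SQL Server syntax and the EMP/id/first_name names from the question (the index name is made up):
CREATE NONCLUSTERED INDEX ix_emp_id_first_name
ON EMP (id)
INCLUDE (first_name);
-- This query can now be answered from the index alone,
-- without reading the wide base row:
SELECT first_name FROM EMP WHERE id = 10;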

Multiple tables vs one table with more columns

My chosen database is MongoDB, but the question should be database-independent.
So, for example, each record will have a flag that can take one of two possible values.
What are the pros and cons of:
Having 1 table with a column to hold the value of this flag.
versus:
Having 2 tables to hold the two different types of records, distinguished by the aforementioned flag?
Would this be cheaper in terms of storage, since you don't have that extra column?
Would this also be faster in queries, since you know exactly which table to look without having to perform a filter?
What is the common practice in industry?
Storage for a single column holding just a flag (e.g. active and archived) should be negligible. Queries could be faster with two tables, but your application becomes more complex: you have to write two queries.
When you have only 2 distinct values and these values are more or less evenly distributed, an index on the flag will not be used, so performance should be about equal - unless you select the entire table.
It might be useful to have 2 tables if the flag values are not evenly distributed, for example a rather small active data set which is queried frequently and an archive data set which is much bigger but hardly queried.
If available, you can also work with partitions, which give you a good combination of both, as sketched below.
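A minimal sketch of that combined approach, assuming PostgreSQL-style list partitioning and hypothetical table/column names (MySQL, Oracle and others have equivalent syntax; note that in PostgreSQL the partition key must be part of the primary key):
-- One logical table, physically split by the flag value.
CREATE TABLE records (
    id      bigint NOT NULL,
    status  text   NOT NULL,  -- the flag: 'active' or 'archived'
    payload text,
    PRIMARY KEY (id, status)
) PARTITION BY LIST (status);
CREATE TABLE records_active   PARTITION OF records FOR VALUES IN ('active');
CREATE TABLE records_archived PARTITION OF records FOR VALUES IN ('archived');
-- Queries still go against one table; only the matching partition is scanned.
SELECT * FROM records WHERE status = 'active' AND id = 42;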

What is the maximum number of table names under a single schema in a DB2 subsystem?

What is the maximum number of results possible from the following SQL query for DB2 on z/OS?
SELECT NAME FROM SYSIBM.SYSTABLES WHERE TYPE='T' AND CREATOR=? ORDER BY NAME ASC
This query is intended to fetch a list of all table names under a specific schema/creator in a DB2 subsystem.
I am having trouble finding a definitive answer. According to IBM's "Limits in DB2 for z/OS" article, the maximum number of internal objects for a DB2 database is 32767. Objects include views, indexes, etc.
I would prefer a more specific answer for maximum number of table names under one schema. For instance, here is an excerpt from an IDUG thread for a related question:
Based on the limit of 32767 objects in one database, where each tablespace takes two entries, and tables and indexes take one entry each, then the theoretical max would seem to be, with one tablespace per database,
32767 - 2 (for the single tablespace) = 32765 / 2 = 16382 tables, assuming you need at least one index per table.
Are these assumptions valid (each tablespace takes two entries, at least one index per table)?
assuming you need at least one index per table.
That assumption doesn't seem valid. Tables don't always have indexes. And you are thinking about edge cases where someone is already doing something weird, so I definitely wouldn't presume there will be indexes on each table.*
If you really want to handle all possible cases, I think you need to assume that you can have up to 32765 tables (two object identifiers are needed for a table space, as mentioned in the quote).
*Also, the footnote in the documentation you linked indicates that an index takes up two internal object descriptors. So the math is also incorrect in that quote. It would actually be 10921 tables if they each had an index. But I don't think that is relevant anyway.
I'm not sure your assumptions are appropriate, because there are just too many possibilities to consider, and in the grand scheme of things it probably doesn't make much difference to the answer from your point of view.
I'll rephrase your question to make sure I understand you correctly: you are after the maximum number of rows, i.e. the worst-case scenario, that could possibly be returned by your SQL query?
DB2 System Limits
Maximum databases: limited by system storage and EDM pool size
Maximum number of databases: 65217
Maximum number of internal objects for each database: 32767
The number of internal object descriptors (OBDs) for external objects are as follows:
Table space: 2 (minimum required)
Table: 1
Therefore the maximum number of rows from your SQL query:
65217 * (32767 - 2) = 2,136,835,005
N.B. DB2 for z/OS does not have a 1:1 ratio between schemas and databases
N.N.B. This figure assumes 32,765 tables per table space per database, i.e. a 32765:1:1 ratio
I'm sure ±2 billion rows is NOT a "reasonable" expectation for the maximum number of table names that might show up under a schema, but it is possible
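If you want to see how close a real schema is to those limits, counting is cheaper than fetching every name; a small sketch against the same catalog table used in your query:
SELECT COUNT(*) AS table_count
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'T' AND CREATOR = ?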

Best practice for tables with varying content

Currently I am working on a problem where I have to log data in an Oracle 10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point; the devices share a bit of information, and the rest is device-specific.
So I could create arrays for every device-specific column, and if a device is in use, the corresponding array field gets populated.
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space by doing it like that, because the arrays would be only sparsely populated.
The other method I can think of would be to use the same ID for a multi-line entry and then put as many rows into the table as there are devices in use:
ID TIMESTAMP BOARD DEVICE_ID ERROR_CNT TEMP MORE_DATA
1 437892 1 1 100 25 xxx
1 437892 1 2 50 28 yyy
Now the shared information is stored multiple times in the database and the data is scattered across multiple rows.
Another issue is that some columns might be used by only part of the devices while others do not carry that information, so there might be even more unused fields. So maybe it would be best to create multiple tables, split the devices into groups according to the information they have, and log their data in the corresponding tables.
I appreciate any help; maybe I am just being paranoid about wasted DB space, should not worry about it, and should simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;
-- A collection type and a table with a nested-table column:
create or replace type error_count_type is table of number;
create table device(id number, error_count error_count_type)
nested table error_count store as error_count_table;
-- Instantiate the type to insert a row whose collection holds two values:
insert into device values(1, error_count_type(10, 20));
commit;
-- Unnest the collection with TABLE(...) and COLUMN_VALUE to aggregate it:
select sum(column_value) error_count
from device
cross join table(error_count);
Not many people or tools understand creating types, store as, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
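A minimal sketch of that simple, normalized layout, assuming hypothetical table and column names based on the ones in the question:
-- Shared information stored once per data point.
create table measurement (
    measurement_id number primary key,
    sample_time    timestamp not null,
    board          number not null
);
-- One row per device that actually reported, so unused devices cost nothing.
create table measurement_device (
    measurement_id number references measurement(measurement_id),
    device_id      number not null,
    error_cnt      number,
    temp           number,
    more_data      varchar2(100),
    primary key (measurement_id, device_id)
);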
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with alter table statements and that might affect queries you have already written.

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I need to store lists of possible data.
Currently I have an SQL table with the following structure, accessed with PHP.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The Primary Key here would be boxId, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. It's possible that in the future there will be 5-10 different named data items or more. It's also possible to store 1,000 - 10,000 entries of data in one week for just one named data item. It will be searched as well, e.g. getting the place for a specific serialNum and then getting all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each named data item. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if I did it this way)
To sum it up: which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have; if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space, and it may be more efficient to store the data in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns, as sketched below. A volume of 1,000 - 10,000 rows per week is not that much data; that turns into roughly 50,000 - 500,000 rows per year, which SQL Server should easily be able to handle. You don't say how many named data items you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.
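A minimal sketch of that named-column layout, assuming MySQL syntax and the hypothetical column names from your example (adjust the types, and add one column per named data item; the choice of primary key is an assumption):
CREATE TABLE box_item (
    boxID     INT         NOT NULL,
    place     VARCHAR(50) NOT NULL,
    serialNum VARCHAR(50),
    itemNum   VARCHAR(50),
    idLock    VARCHAR(50),
    PRIMARY KEY (boxID, place),
    KEY idx_serialnum (serialNum)  -- supports lookups like "find the place for a serialNum"
);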