BigQuery storage costs relative to schema? - google-bigquery

If I have a column with "numbers" in it, does the storage cost change if the schema specifies that column to be an INTEGER vs STRING?
Example: I have dozens of terabytes of numeric data stored as STRING. If I need to perform math on that column, it's easy enough to cast at query time. But if I change the schema, will the data be stored any differently such that it'll consume less bits at rest, and thus, cost me less?

Given that BigQuery charge STRING/INT64 column as
STRING | 2 bytes + the UTF-8 encoded string size
INT64 | 8 bytes
Not sure how are you planning to encode your numeric data into string, from my gut feeling, unless you have most of the numeric value less than 16 bit, you don't gain much by storing them as STRING than as INT64.
But if you do have small numbers, it is not only saving the cost on storage, but also saving the cost on query if you pay by scanned bytes, which may be more saving than on storage if you scan your data a lot.
Reference: https://cloud.google.com/bigquery/pricing#data

Related

BigQuery performance impact with column data length

Problem statement -
We are planning to store hexadecimal string data with length 64 in a BigQuery column. Will it affect the BigQuery query performance when queries are run with filter/join operations on these columns (with large string lengths) compared to when a smaller length string is stored?
Example -
Let's assume there is a BigQuery table - abc.HACKERNEWS.news
Columns -
id, time, time_ts, encrypted_data, news, status.
Known - encrypted_data column has String with length 32.
Query -
SELECT time FROM abc.HackerNews.news where encrypted_data = 'abcdefghijklmnopqrstuvwxyz123deabcdefghijklmnopqrstuvwxyzabcde' LIMIT 1000
How will the performance impact with the change encrypted_data length?
Will the query perform better if the length of the string length
stored in encrypted_data column is shorter say 5?
Refer to this documentation here in regards to data size calculation:
STRING (data types are equal to) 2 bytes + the UTF-8 encoded string size
So answering your question: yes, the longer the string, the more bytes the query will need to process, and the slower it will be. Therefore, choosing a shorter string length might improve the query performance.

Why should I use char instead of varchar? [duplicate]

This question already has answers here:
What are the use cases for selecting CHAR over VARCHAR in SQL?
(19 answers)
Closed 8 years ago.
Char and varchar are datatypes in SQL, as they are in many other languages(So this question could be multi-language).
From what I understand, the difference is that if I declared a Char as Char(20) it would allocate 20 (bytes/bits) [Could someone clarify this bit too? For now, I'll use bytes.]. Then if I only used 16 bytes, I would still have four allocated to that field. (A waste of 4 bytes of memory.)
However, if I declared a varchar as varchar(20) and only used 16 bytes, it would only allocate 16 bytes.
Surely this is better? Why would anyone choose char? Is it foe legacy reasons, or is there something I'm missing?
Prefer VARCHAR.
In olden days of tight storage, it mattered for space. Nowadays, disk storage is cheap, but RAM and IO are still precious. VARCHAR is IO and cache friendly; it allows you to more densely pack the db buffer cache with data rather than wasted literal "space" space, and for the same reason, space padding imposes an IO overhead.
The upside to CHAR() used to be reduced row chaining on frequently updated records. When you update a field and the value is larger than previously allocated, the record may chain. This is manageable, however; databases often support a "percent free" setting on your table storage attributes that tells the DB how much extra space to preallocate per row for growth.
VARCHAR is almost always preferable because space padding requires you to be aware of it and code differently. Different databases handle it differently. With VARCHAR you know your field holds only exactly what you store in it.
I haven't designed a schema in over a decade with CHAR.
FROM Specification
char[(n)]
Fixed-length non-Unicode character data with length of n bytes. n must
be a value from 1 through 8,000. Storage size is n bytes. The SQL-92
synonym for char is character.
So Char(20) will allocate fixed 20 Bytes space to hold the data.
Usage:
For example if you have a column named Gender and you want to assign values like only M for Male (OR) F for female and you are sure that the field/column are non-null column . In such case, it's much better to define it as CHAR(1) instead like
Gender CHAR(1) not null
Also, varchar types carries extra overhead of 2 bytes as stated in document . The storage size is the actual length of the data entered + 2 bytes.
In case of char that's not the case.

Is varchar(128) better than varchar(100)

Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it would be a C-Program I'd spend some time to think about that, too. But with a database I'd leave it to the DB engine.
DB programmers spent a lot of time in thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? One, two or 4 bytes to store the length? Is it stored as plain byte sequence or encoded in UTF-8 UTF-16 UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one byte more is needed internally.
Also with VARCHAR it is not neccessary true, that the DB will always allocate 100 (128) bytes for your string. Maybe it stores just a pointer to where space for the actual data is.
So I'd strongly suggest to use VARCHAR(100) if that is your requirement. If the DB decides to align it somehow there's room for extra internal data, too.
Other way around: Let's assume you use VARCHAR(128) and all things come together: The DB allocates 128 bytes for your data. Additionally it needs 2 bytes more to store the actual string length - makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32 byte) boundary: The actual data needed on the disk is now 160 bytes 8-}
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several block will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR specification size is the max size of your data, so in this case, 32 characters. However, VARCHARS are a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (ie. after a SELECT), 32 bytes will be allocated in memory for that field for every record.

in sql,How does fixed-length data type take place in memory?

I want to know in sql,how fixed-length data type take places length in memory?I know is that for varchar,if we specify length is (20),and if user input length is 15,it takes 20 by setting space.for varchar2,if we specify length is (20),and if user input is 15,it only take 15 length in memory.So how about fixed-length data type take place?I searched in Google,but I did not find explanation with example.Please explain me with example.Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but either implicitly trim trailing blanks off or the query might have to do it explicitly.
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed length record offsets and sizes rather than compute the offset of each data item in a record which has variable length fields. Furthermore, a fixed length data record is easily addressed in a data file by computing the byte offset of its beginning by multiplying the record size times the record number (and adding the length of whatever fixed header data is at the beginning of file).