Problem statement -
We are planning to store hexadecimal string data with length 64 in a BigQuery column. Will it affect the BigQuery query performance when queries are run with filter/join operations on these columns (with large string lengths) compared to when a smaller length string is stored?
Example -
Let's assume there is a BigQuery table - abc.HACKERNEWS.news
Columns -
id, time, time_ts, encrypted_data, news, status.
Known - encrypted_data column has String with length 32.
Query -
SELECT time FROM abc.HackerNews.news WHERE encrypted_data = 'abcdefghijklmnopqrstuvwxyz123deabcdefghijklmnopqrstuvwxyzabcde' LIMIT 1000
How will performance be affected by changing the length of encrypted_data?
Will the query perform better if the strings stored in the
encrypted_data column are shorter, say 5 characters?
The BigQuery documentation on data size calculation gives the size of a STRING value as:
STRING | 2 bytes + the UTF-8 encoded string size
So, to answer your question: yes, the longer the string, the more bytes the query needs to process, and the slower it will be. Choosing a shorter string length may therefore improve query performance.
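To make the size formula above concrete, here is a minimal Python sketch of the arithmetic. The function name and the row count are hypothetical, chosen only for illustration; the sketch estimates the logical bytes of the column and says nothing about actual query time.

```python
def bq_string_bytes(value: str) -> int:
    """Logical size of one STRING value: 2 bytes + UTF-8 encoded size."""
    return 2 + len(value.encode("utf-8"))


ROWS = 1_000_000  # hypothetical row count, for illustration only

long_value = "a" * 64   # stands in for a 64-character hex string
short_value = "a" * 5   # stands in for a 5-character string

# Bytes scanned for the column if every row holds a value of that length.
long_total = bq_string_bytes(long_value) * ROWS
short_total = bq_string_bytes(short_value) * ROWS

print(f"64-char column: {long_total} bytes")
print(f" 5-char column: {short_total} bytes")
```

Under these assumptions the 64-character column accounts for 66 bytes per row against 7 bytes per row for the 5-character one.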
Related
If I have a column with "numbers" in it, does the storage cost change if the schema specifies that column to be an INTEGER vs STRING?
Example: I have dozens of terabytes of numeric data stored as STRING. If I need to perform math on that column, it's easy enough to cast at query time. But if I change the schema, will the data be stored any differently such that it'll consume less bits at rest, and thus, cost me less?
BigQuery charges for STRING and INT64 columns as follows:
STRING | 2 bytes + the UTF-8 encoded string size
INT64 | 8 bytes
I'm not sure how you plan to encode your numeric data as strings, but my gut feeling is that unless most of your numeric values fit in 16 bits, you don't gain much by storing them as STRING rather than INT64.
But if you do have small numbers, you save not only on storage but also on query cost if you pay by scanned bytes, which may be the larger saving if you scan your data a lot.
Reference: https://cloud.google.com/bigquery/pricing#data
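The comparison above can be sketched in Python. This is only an illustration, and it assumes the numbers are stored as plain decimal text (the function name is hypothetical); a different string encoding would give different sizes.

```python
INT64_BYTES = 8  # fixed size of an INT64 value, per the table above


def string_bytes_for_number(n: int) -> int:
    """Size of n stored as decimal text: 2 bytes + UTF-8 encoded size."""
    return 2 + len(str(n).encode("utf-8"))


for n in (7, 65535, 123456, 1234567890):
    as_string = string_bytes_for_number(n)
    if as_string < INT64_BYTES:
        verdict = "STRING is smaller"
    elif as_string == INT64_BYTES:
        verdict = "same size"
    else:
        verdict = "INT64 is smaller"
    print(f"{n}: STRING={as_string} bytes, INT64={INT64_BYTES} bytes ({verdict})")
```

Under this assumption, values of up to five digits are smaller as STRING, six digits break even, and anything longer is smaller as INT64.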
Should I round a column's length up from the actual length to the next power of 2?
First case: I have a table column that stores no more than 7 characters.
Should I use NVARCHAR(8)? I heard somewhere that SQL Server may do an implicit
conversion internally, allocating 8 characters of space and truncating automatically.
Second case: if the length is fixed (assume it is 7), which should I use, NCHAR(7) or NCHAR(8)?
Is there any performance difference between these two cases?
You should use the actual length of the string. Now, if you know that the value will always be exactly 7 characters, then use CHAR(7) rather than VARCHAR(7).
The reason you see powers-of-2 is for columns that have an indeterminate length -- a name or description that may not be fixed. In most databases, you need to put in some maximum length for the varchar(). For historical reasons, powers-of-2 get used for such things, because of the binary nature of the underlying CPUs.
Although I almost always use powers-of-2 in these situations, I can think of almost no real performance difference. There is one exception: in some databases the length of a varchar(255) value is stored using 1 byte, whereas a varchar(256) uses 2 bytes. That is a pretty minor difference, even when multiplied over millions of rows.
Assuming I have a table in PostgreSQL, how can I find the exact byte size used by the system in order to save a specific row of my table?
For example, assume I have a table with a VARCHAR(1000000) field and some rows contain really big strings for this field while others really small. How can I check the byte size of a row in this case? (including the byte size even in the case TOAST is being used).
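Use pg_column_size and octet_length. pg_column_size reports the number of bytes used to store a value (reflecting any compression), while octet_length reports the number of bytes in the string itself.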
See:
How can pg_column_size be smaller than octet_length?
How can I find out how big a large TEXT field is in Postgres?
I'm developing a Java web application that deals with large amounts of text (HTML code strings encoded using base64), which I need to save in my database. I'm using Firebird 2.0, and every time I try to insert a new record with strings longer than 32767 characters, I receive the following error:
GDS Exception. 335544726. Error reading data from the connection.
I have done some research about it, and apparently this is the character limit for Firebird, both for query strings and records in the database. I have tried a couple of things, like splitting the string in the query and then concatenating the parts, but it didn't work. Does anyone know any workarounds for this issue?
If you need to save a large amount of text data in the database, just use BLOB fields. VARCHAR field size is limited to 32 KB.
For better performance you can use binary BLOBs and store zipped data in them.
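As a rough illustration of the "zipped data in a binary BLOB" idea, the Python sketch below compresses text before it would be written and restores it after it is read back. It uses zlib as one possible compressor; the helper names and the sample HTML are hypothetical, and the actual database write (for example through a prepared statement) is left out.

```python
import zlib


def pack_text(text: str) -> bytes:
    """Compress text into bytes suitable for a binary BLOB column."""
    return zlib.compress(text.encode("utf-8"))


def unpack_text(blob: bytes) -> str:
    """Restore the original text from the compressed bytes."""
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical sample: repetitive HTML well above the 32 KB VARCHAR limit.
sample_html = "<p>Hello, world!</p>\n" * 5000

blob = pack_text(sample_html)
print(len(sample_html.encode("utf-8")), "bytes before,", len(blob), "bytes after")
```

How much is saved depends on the data; repetitive markup like this sample compresses far better than text that is already compressed.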
Firebird query strings are limited to 64 kilobytes in Firebird 2.5 and earlier. The maximum length of a varchar field is 32766 bytes (which means it can only store 8191 characters when using UTF-8!). The maximum size of a row (with blobs counting for 8 bytes) is 64 kilobytes as well.
If you want to store values longer than 32 kilobytes, you need to use a BLOB SUB_TYPE TEXT, and you need to use a prepared statement to set the value.
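The character limit quoted above follows from simple arithmetic, sketched here in Python. The byte limit is the figure given in the answer, the 4 bytes per character is the assumed worst case for UTF-8, and the helper name is hypothetical.

```python
MAX_VARCHAR_BYTES = 32766      # varchar byte limit as quoted in the answer above
UTF8_MAX_BYTES_PER_CHAR = 4    # assumed worst-case bytes per UTF-8 character

# Largest character count that is guaranteed to fit in a UTF-8 varchar.
MAX_UTF8_VARCHAR_CHARS = MAX_VARCHAR_BYTES // UTF8_MAX_BYTES_PER_CHAR


def needs_blob(text: str) -> bool:
    """True if the text is too long for a UTF-8 varchar under these limits."""
    return len(text) > MAX_UTF8_VARCHAR_CHARS


print(MAX_UTF8_VARCHAR_CHARS)
print(needs_blob("x" * 40000))
```

A check like this could decide up front whether a value has to go into a BLOB SUB_TYPE TEXT column instead of a varchar.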
Hello, I am trying to find a good way to hash a set of numbers so that the output is under 20 characters, positive, and unique. Does anyone have any suggestions?
For hashing in general, I'd use the HASHBYTES function. You can then convert the binary data to a string and just pick the first 20 characters; that should still be unique enough.
To get around HASHBYTES limitations (8000 bytes for instance), you can incrementally hash, e.g. for each value concat the previous hash with the value to be added and hash that again. This will make it unique with order etc. and unless you append close to 8000 bytes in one value it will not cause data truncation for the hashing.
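The incremental approach described above can be sketched in Python. This is not HASHBYTES itself: hashlib's SHA-256 stands in for it, the function name is hypothetical, and the 20-character cut is the truncation suggested earlier. Truncating a hash raises the chance of collisions, so the result is "unique enough" rather than guaranteed unique.

```python
import hashlib
from typing import Iterable


def chained_hash(values: Iterable[int], length: int = 20) -> str:
    """Hash values one at a time, feeding each digest into the next step."""
    digest = b""
    for value in values:
        # Concatenate the previous digest with the next value, then re-hash.
        digest = hashlib.sha256(digest + str(value).encode("utf-8")).digest()
    # Hex-encode and keep only the first `length` characters.
    return digest.hex()[:length]


print(chained_hash([101, 202, 303]))
print(chained_hash([303, 202, 101]))  # a different order gives a different hash
```

Because each step only hashes one value plus a fixed-size digest, no single hash input grows with the size of the set.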