What is the size of a Django CharField in Postgres? [duplicate] - sql

Assuming I have a table in PostgreSQL, how can I find the exact byte size used by the system in order to save a specific row of my table?
For example, assume I have a table with a VARCHAR(1000000) field and some rows contain really big strings for this field while others really small. How can I check the byte size of a row in this case? (including the byte size even in the case TOAST is being used).

Use pg_column_size and octet_length.
See:
How can pg_column_size be smaller than octet_length?
How can I find out how big a large TEXT field is in Postgres?

Related

Oracle: How to convert row count into data size

I would like to know how to convert a number of rows into a size in MB or KB.
Is there a way or a formula to do that?
The reason I am doing this is that I have a set of data that is only part of what is in the tablespace, and I would like to know how much space that set of data uses.
Thanks,
keith
If you want an estimate, you could multiply the row count by the value of user_tables.avg_row_len for that table.
If you want the real size of the table on disk, this is available in user_segments.bytes. Note that the smallest unit Oracle will use is a block. So even for an empty table, you will see a value that is bigger than zero in that column. That is the actual size of the space reserved in the tablespace for that table.
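As a rough illustration of the estimate above, here is a minimal Python sketch of the arithmetic. The function names and the sample figures are made up for illustration; in practice the average row length would come from avg_row_len.

```python
def estimate_size_bytes(row_count: int, avg_row_len: int) -> int:
    """Rough data volume for a set of rows: row count times average row length."""
    return row_count * avg_row_len


def human_readable(num_bytes: float) -> str:
    """Format a byte count using 1024-based units (illustrative helper)."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


# Hypothetical numbers: 250,000 rows averaging 120 bytes each.
size = estimate_size_bytes(250_000, 120)
print(size, human_readable(size))
```

This only estimates the row data itself; it does not account for block overhead or free space, which is why the segment size is usually larger.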

Is there a way around the 8k row length limit in SQL Server?

First off, I know that in general having large numbers of wide columns is a bad idea, but this is the format I'm constrained to.
I have an application that imports CSV files into a staging table before manipulating them and inserting/updating values in the database. The staging table is created on the fly and has a variable number of NVARCHAR columns into which the file is imported, plus two INT columns used as row IDs.
One particular file I have to import is about 450 columns wide. With the 24 byte pointer used in a large NVARCHAR column, this adds up to around 10k by my calculations, and I get the error Cannot create a row of size 11166 which is greater than the allowable maximum row size of 8060.
Is there a way around this or are my only choices modifying the importer to split the import or removing columns from the file?
You can use text/ntext, which uses a 16-byte pointer, whereas varchar/nvarchar uses a 24-byte pointer.
NVARCHAR(max) or NTEXT can store more than 8 KB of data, but a record cannot be larger than 8 KB up to SQL Server 2012. If the data does not fit in the 8 KB page, the data of the larger column is moved to another page and a 24-byte pointer (for varchar/nvarchar) is stored in the main row as a reference. For text/ntext, a 16-byte pointer is used.
For details, see the following links:
Work around SQL Server maximum columns limit 1024 and 8kb record size
or
http://msdn.microsoft.com/en-us/library/ms186939(v=sql.90).aspx
If you are using SQL Server 2005, 2008 or 2012, you should be able to use NVARCHAR(max) or NTEXT, which can hold more than 8,000 characters. MAX will give you 2^31 - 1 bytes of storage (2 GB):
http://msdn.microsoft.com/en-us/library/ms186939(v=sql.90).aspx
I agree that varchar(max) or nvarchar(max) is a good solution and will probably work for you, but for completeness I will suggest that you can also create more than one table, with the tables having a one-to-one relationship.

How much disk-space is needed to store a NULL value using postgresql DB?

Let's say I have a column in my table defined as follows:
"MyColumn" smallint NULL
Storing a value like 0, 1 or something else should need 2 bytes (1). But how much space is needed if I set "MyColumn" to NULL? Will it need 0 bytes?
Are there some additional needed bytes for administration purpose or such things for every column/row?
(1) http://www.postgresql.org/docs/9.0/interactive/datatype-numeric.html
Laramie is right about the bitmap and links to the right place in the manual. Yet, this is almost, but not quite correct:
So for any given row with one or more nulls, the size added to it
would be that of the bitmap(N bits for an N-column table, rounded up).
One has to factor in data alignment. The HeapTupleHeader (per row) is 23 bytes long, actual column data always starts at a multiple of MAXALIGN (typically 8 bytes). That leaves one byte of padding that can be utilized by the null bitmap. In effect NULL storage is absolutely free for tables up to 8 columns.
After that, another MAXALIGN (typically 8) bytes are allocated for the next MAXALIGN * 8(typically 64) columns. Etc. Always for the total number of user columns (all or nothing). But only if there is at least one actual NULL value in the row.
I ran extensive tests to verify all of that. More details:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
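The alignment arithmetic in this answer can be sketched in a few lines of Python. This is only an illustration of the rule as described above (23-byte header, MAXALIGN of 8, one bitmap bit per column when the row has at least one NULL); the function name is made up and this is not PostgreSQL code.

```python
HEAP_TUPLE_HEADER = 23  # bytes, as stated in the answer
MAXALIGN = 8            # typical value; platform dependent


def tuple_header_size(n_columns: int, has_null: bool) -> int:
    """Bytes used before column data starts, under the assumptions above."""
    # The null bitmap is only present if the row contains at least one NULL.
    bitmap_bytes = (n_columns + 7) // 8 if has_null else 0
    raw = HEAP_TUPLE_HEADER + bitmap_bytes
    # Column data starts at the next multiple of MAXALIGN.
    return (raw + MAXALIGN - 1) // MAXALIGN * MAXALIGN


for cols in (8, 9, 72, 73):
    print(cols, tuple_header_size(cols, False), tuple_header_size(cols, True))
```

With up to 8 columns the bitmap fits into the padding byte, so both results are the same; from 9 columns on, a row containing a NULL needs one more MAXALIGN step.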
Null columns are not stored. The row has a bitmap at the start and one bit per column that indicates which ones are null or non-null. The bitmap could be omitted if all columns are non-null in a row. So for any given row with one or more nulls, the size added to it would be that of the bitmap(N bits for an N-column table, rounded up).
More in depth discussion from the docs here
It should need 1 byte (0x00) however it's the structure of the table that makes up most of the space, adding this one value might change something (Like adding a row) which needs more space than the sum of the data in it.
Edit: Laramie seems to know more about null than me :)

Which data structure should I use for storing hash values?

I have a hash table that I want to store to disk. The list looks like this:
<16-byte key > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17-bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.
You could just sort entries by key and do a binary search.
Fixed size keys and data entries means you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on meta data.
I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
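A minimal Python sketch of this idea, assuming the whole table is available as one bytes-like object (for example read from the file or memory-mapped). The function names are illustrative only.

```python
from typing import Dict, Optional

KEY_SIZE = 16
RECORD_SIZE = 17  # 16-byte key + 1-byte result


def build_table(entries: Dict[bytes, int]) -> bytes:
    """Concatenate key+value records in ascending key order."""
    return b"".join(key + bytes([entries[key]]) for key in sorted(entries))


def lookup(table: bytes, key: bytes) -> Optional[int]:
    """Binary search over fixed-size records; O(log n) comparisons."""
    lo, hi = 0, len(table) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        start = mid * RECORD_SIZE
        mid_key = table[start:start + KEY_SIZE]
        if mid_key == key:
            return table[start + KEY_SIZE]
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None


entries = {
    bytes.fromhex("a7b4903def8764941bac7485d97e4f76"): 0x04,
    bytes.fromhex("b859de04f2f2ff76496879bda875aecf"): 0x03,
}
table = build_table(entries)
print(lookup(table, bytes.fromhex("a7b4903def8764941bac7485d97e4f76")))
```

Because every record has the same size, the position of record number mid is simply mid * 17, so no index or metadata is needed.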
If you're really willing to tolerate long access times, do sort the table but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section of the file at the start. Using this scheme, you can find the block containing the key you need in time linear in the number of blocks and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.
Using an off the shelf compression scheme (like GZIP) you can tune the compression ratio as needed; larger files will presumably have quicker lookup times.
I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to lookup the offset of a block to decompress and use.
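A rough Python sketch of the block scheme, with some simplifications that are assumptions of this example rather than part of the answer: the blocks and their first keys are kept in in-memory lists instead of a file header with offsets, zlib stands in for GZIP, the block is found with bisect rather than a linear pass, and the decompressed block is scanned linearly.

```python
import bisect
import zlib
from typing import List, Optional, Tuple

KEY_SIZE = 16
RECORD_SIZE = 17


def build_blocks(sorted_table: bytes, records_per_block: int) -> Tuple[List[bytes], List[bytes]]:
    """Split a sorted table into compressed blocks; return (first_keys, blocks)."""
    first_keys, blocks = [], []
    step = records_per_block * RECORD_SIZE
    for start in range(0, len(sorted_table), step):
        chunk = sorted_table[start:start + step]
        first_keys.append(chunk[:KEY_SIZE])
        blocks.append(zlib.compress(chunk))
    return first_keys, blocks


def lookup_block(first_keys: List[bytes], blocks: List[bytes], key: bytes) -> Optional[int]:
    """Find the one block that could hold the key, decompress it, and scan it."""
    idx = bisect.bisect_right(first_keys, key) - 1
    if idx < 0:
        return None  # key sorts before the first block
    chunk = zlib.decompress(blocks[idx])
    for start in range(0, len(chunk), RECORD_SIZE):
        if chunk[start:start + KEY_SIZE] == key:
            return chunk[start + KEY_SIZE]
    return None
```

Only one block is decompressed per lookup, so records_per_block is the knob that trades memory and lookup time against compression ratio.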
5 million records is about 81 MB, which is acceptable to work with as an array in memory.
As you describe the problem, these are more like unique keys than hash values.
Try using a hash table to access the values (look at this link).
If I have misunderstood and these are real hashes, try building a second hash level on top of them.
A hash table can be successfully organized on disk too (e.g. as a separate file).
Addition
A solution with good search performance and little overhead:
1. Define a hash function that produces integer values from keys.
2. Sort the records in the file by the values this function produces.
3. Store the file offset where each hash value starts.
4. To locate a value:
4.1. compute its hash with the function;
4.2. look up the offset in the file;
4.3. read records from the file starting at this position until the key is found, the offset of the next hash value is reached, or end of file.
Some additional points:
The hash function must be fast to be effective.
The hash function must produce uniformly distributed values, or close to it.
The table of hash value offsets can be placed in a separate file.
The table of hash value offsets can also be built dynamically by reading the whole sorted file sequentially at application start and kept in memory.
At step 4.3, records must be read in blocks, not one by one, to be effective. Ideally, read all records with the computed hash into memory at once.
You can find some examples of hash functions here.
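The steps of this answer could look roughly like the following Python sketch. The hash function here (the first byte of the key) is a placeholder chosen for illustration, and the "file" is a bytes object in memory; both are assumptions of the example.

```python
from typing import Dict, List, Optional, Tuple

KEY_SIZE = 16
RECORD_SIZE = 17
N_BUCKETS = 256


def bucket_of(key: bytes) -> int:
    """Hypothetical hash function: the first byte of the key."""
    return key[0]


def build_index(entries: Dict[bytes, int]) -> Tuple[bytes, List[int]]:
    """Sort records by hash value and record where each hash value starts."""
    records = sorted(entries.items(), key=lambda kv: bucket_of(kv[0]))
    data = b"".join(k + bytes([v]) for k, v in records)
    # offsets[b] is the start of bucket b, offsets[b + 1] is its end.
    offsets = [0] * (N_BUCKETS + 1)
    for k, _ in records:
        offsets[bucket_of(k) + 1] += RECORD_SIZE
    for b in range(N_BUCKETS):
        offsets[b + 1] += offsets[b]
    return data, offsets


def lookup(data: bytes, offsets: List[int], key: bytes) -> Optional[int]:
    """Read the whole bucket at once, then scan it for the key."""
    b = bucket_of(key)
    block = data[offsets[b]:offsets[b + 1]]
    for start in range(0, len(block), RECORD_SIZE):
        if block[start:start + KEY_SIZE] == key:
            return block[start + KEY_SIZE]
    return None
```

With 256 buckets and 5 million evenly distributed keys each bucket would still hold roughly 20,000 records, so a real setup would presumably use more hash bits; the structure stays the same.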
Would the simple approach work and store them in a sqlite database? I don't suppose it'll get any smaller but you should get very good lookup performance, and it's very easy to implement.
First of all, multiple files are not OK if you want to optimize for disk space, because of the cluster size: when you create a file of ~100 bytes, free disk space decreases by a whole cluster, 2 kB for example.
Secondly, in your case I would store the whole table in a single binary file, sorted in ascending order by the byte values of the keys. That gives you a file whose length is exactly entriesNumber*17, which is minimal unless you use compression, and it allows a very quick search in ~log2(entriesNumber) steps: split the file into two parts and compare the key at the border with the key you need. If the border key is bigger, take the first part of the file; if it is smaller, take the second part. Then split the chosen part into two again, and so on.
So you will need about log2(entriesNumber) read operations to look up a single key.
Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.
You could make a hash table, or
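Use a Bentley-style unrolled binary search (at most 24 comparisons).
Here's the unrolled loop (with 32-bit ints). It assumes a[] is sorted and that unused slots up to 2^24 are padded with maximal keys.
#include <stdint.h>
uint32_t key[4];        /* 128-bit key as four 32-bit words, most significant first */
uint32_t a[1<<24][4];   /* sorted table */
/* Lexicographic comparison of two keys: <0, 0 or >0, like memcmp */
static int compare(const uint32_t *k, const uint32_t *e) {
    for (int w = 0; w < 4; w++)
        if (k[w] != e[w]) return k[w] < e[w] ? -1 : 1;
    return 0;
}
int i = 0;
if (compare(key, a[i+(1<<23)]) >= 0) i += (1<<23);
if (compare(key, a[i+(1<<22)]) >= 0) i += (1<<22);
if (compare(key, a[i+(1<<21)]) >= 0) i += (1<<21);
...
if (compare(key, a[i+(1<<3)]) >= 0) i += (1<<3);
if (compare(key, a[i+(1<<2)]) >= 0) i += (1<<2);
if (compare(key, a[i+(1<<1)]) >= 0) i += (1<<1);
if (compare(key, a[i+(1<<0)]) >= 0) i += (1<<0);
/* a[i] is now the largest entry <= key (or a[0] if there is none); compare once more for equality */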
As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values are as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 4 hex digits (2 bytes) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries with leading 2 bytes 0x0000; pad to the total of 100 entries (1700 bytes I think) with 0s;
at offset 1700 start writing all the entries with leading 2 bytes 0x0001, pad,
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.
Insertion is a bit of a problem, it's up to you with your knowledge of your data to decide whether new entries (a) necessitate the re-sorting of a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 2 bytes (4 hex digits) from your entries, since they are computed by the calculation for the offset into the file.
What I'm describing, not terribly well, is a hash table.
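The offset calculation in this answer can be sketched as follows in Python. The example assumes the subset is selected by the leading 4 hex digits of the key (its first 2 bytes), which is what gives the 16^4 = 65536 subsets, and 100 padded slots of 17 bytes each; the function names are illustrative.

```python
from typing import Optional

KEY_SIZE = 16
RECORD_SIZE = 17
SLOTS_PER_SUBSET = 100
SUBSET_BYTES = SLOTS_PER_SUBSET * RECORD_SIZE  # 1700 bytes
N_SUBSETS = 16 ** 4                            # 65536


def subset_offset(key: bytes) -> int:
    """File offset of the padded subset that would hold this key."""
    prefix = int.from_bytes(key[:2], "big")  # leading 4 hex digits
    return prefix * SUBSET_BYTES


def padded_file_size() -> int:
    """Total size of the padded file."""
    return N_SUBSETS * SUBSET_BYTES


def find_in_subset(subset: bytes, key: bytes) -> Optional[int]:
    """Scan one subset (up to 100 records) for the key.

    Caveat: with zero padding, an all-zero key looks like an empty slot.
    """
    for start in range(0, len(subset), RECORD_SIZE):
        if subset[start:start + KEY_SIZE] == key:
            return subset[start + KEY_SIZE]
    return None
```

A lookup then seeks to subset_offset(key), reads SUBSET_BYTES bytes and passes them to find_in_subset. The padding costs space: the file is about 111 MB regardless of how many entries are actually stored.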

In SQL, how much memory does a fixed-length data type take up?

I want to know how a fixed-length data type takes up space in memory in SQL. As I understand it, for varchar, if we specify a length of 20 and the user input has length 15, it takes 20 by padding with spaces. For varchar2, if we specify a length of 20 and the user input is 15, it only takes 15 in memory. So how does a fixed-length data type take up space? I searched Google, but did not find an explanation with an example. Please explain with an example. Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but either implicitly trim trailing blanks off or the query might have to do it explicitly.
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed length record offsets and sizes rather than compute the offset of each data item in a record which has variable length fields. Furthermore, a fixed length data record is easily addressed in a data file by computing the byte offset of its beginning by multiplying the record size times the record number (and adding the length of whatever fixed header data is at the beginning of file).
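As a small illustration of both points (padding and offset arithmetic), here is a Python sketch. The 20-character field and the header size are example values, not taken from any particular database.

```python
def pad_fixed(value: str, length: int) -> str:
    """Store a value in a fixed-length field by padding with trailing spaces."""
    if len(value) > length:
        raise ValueError("value does not fit in the field")
    return value.ljust(length)


def record_offset(record_number: int, record_size: int, header_size: int = 0) -> int:
    """Byte offset of a record in a file of fixed-length records."""
    return header_size + record_number * record_size


field = pad_fixed("hello", 20)
print(repr(field), len(field))       # always 20 characters long
print(record_offset(3, 20, 16))      # record 3 in a file with a 16-byte header
```

A 5-character value still occupies all 20 characters of the field, and the position of any record follows from one multiplication and one addition, with no need to scan the preceding records.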