How do I set the capacity when creating a table with the indexedTable function?
I use indexedTable to create a table; the code is:
t1 = indexedTable(`sym`id, 1:0, `sym`id`val, [SYMBOL,INT,INT])
t2 = indexedTable(`sym`id, 100:0, `sym`id`val, [SYMBOL,INT,INT])
I find no difference in writing or querying data, so I wonder: what is the role of capacity?
The capacity parameter is a positive integer specifying how much memory (measured in number of records) the system allocates for the table when it is created. When the number of records exceeds the capacity, the system first allocates a new block of memory 1.2 to 2 times the size of the current capacity, then copies the data to the new block, and finally releases the original memory. For large tables, such an operation can drive memory usage very high, so it is recommended to allocate a reasonable capacity up front when creating the table.
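As a rough sketch, if you expect on the order of ten million records, reserve that capacity up front (the figure here is hypothetical; pick a value close to the number of records you actually expect):
// reserving the expected row count avoids repeated reallocate-and-copy cycles
t3 = indexedTable(`sym`id, 10000000:0, `sym`id`val, [SYMBOL,INT,INT])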
The table simply has an id, a string of up to 400 characters, and a length column that records the length of the string. The problem is that when I run a query such as select * from table where length = <some number>, it never responds (or keeps calculating...). I was wondering whether this is due to the large dataset. Should I somehow split the table into several? But I noticed that while executing a query like the one above, there are three PostgreSQL processes occupying only 2 MB of RAM each, with a 4-5 MB/s transfer rate. Is that normal?
Environment: 12 GB RAM, PostgreSQL 12 on Windows 10.
Yes, that is perfectly normal.
Your query is performing a parallel sequential scan with two additional worker processes. Reading a large table from disk neither requires much RAM nor much CPU. You are probably I/O bound.
Two remarks:
Depending on the number of result rows, an index on the column or expression in the WHERE clause can speed up processing considerably (see the sketch after these remarks).
Unless you really need it for speed, storing the length of the string in an extra column is bad practice. You can calculate that from the string itself.
Storing such redundant data not only wastes a little space but also opens the door to inconsistencies (unless you add a CHECK constraint).
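A minimal sketch of both remarks in PostgreSQL syntax (the names docs, len, and body are hypothetical):
-- remark 1: an index can turn the sequential scan into an index scan
CREATE INDEX idx_docs_len ON docs (len);
-- or index the expression and drop the redundant column entirely:
CREATE INDEX idx_docs_body_len ON docs (length(body));
SELECT * FROM docs WHERE length(body) = 123;  -- can use the expression index
-- remark 2: if you keep the extra column, guard it against drift
ALTER TABLE docs ADD CONSTRAINT docs_len_consistent CHECK (len = length(body));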
All this is not PostgreSQL specific; it will be the same with any database.
We are trying to upload 80 GB of data onto 2 host servers, each with 48 GB of RAM (96 GB in total). We have a partitioned table too, but even after partitioning we are only able to upload about 10 GB of data. In the VMC interface, we checked the size worksheet: the number of rows in the table is 400,000,000, the table's maximum size is 1,053,200,000 KB, and its minimum size is 98,000,000 KB. So what is the issue with uploading 80 GB even after partitioning, and what does this table size mean?
The size worksheet provides minimum and maximum size in memory that the number of rows would take, based on the schema of the table. If you have VARCHAR or VARBINARY columns, then the difference between min and max can be quite substantial, and your actual memory use is usually somewhere in between, but can be difficult to predict because it depends on the actual size of the strings that you load.
But I think the issue is that the minimum size is 98 GB according to the worksheet; the minimum is what you would use if every nullable string were null and every not-null string were an empty string. Even without taking into account the heap size and any overhead, this is higher than your 96 GB capacity.
What is your kfactor setting? If it is 0, there is only one copy of each record. If it is 1, there are two copies of each record, so you would need at least 196 GB in that configuration.
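For reference, kfactor lives in the cluster element of the deployment file; a minimal sketch from memory (host and site counts are illustrative, so check the docs for your version):
<?xml version="1.0"?>
<deployment>
    <!-- kfactor="1" keeps two copies of every record across the cluster,
         roughly doubling the memory required; kfactor="0" keeps one copy -->
    <cluster hostcount="2" sitesperhost="8" kfactor="1" />
</deployment>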
The size per record in RAM depends on the datatypes chosen and on whether there are any indexes. Also, VARCHAR values longer than 15 characters or 63 bytes are stored in pooled memory, which carries more overhead than fixed-width storage, although it can reduce wasted space when values are smaller than the maximum size.
If you want some advice on how to minimize the per-record size in memory, please share the definition of your table and any indexes, and I might be able to suggest adjustments that could reduce the size.
You can add more nodes to the cluster, or use servers with more RAM to add capacity.
Disclaimer: I work for VoltDB.
From the Oracle docs:
If you give every column the maximum length or precision for its data type, then your application needlessly allocates many megabytes of RAM. For example, suppose that a query selects 10 VARCHAR2(4000) columns and a bulk fetch operation returns 100 rows. The RAM that your application must allocate is 10 x 4,000 x 100, almost 4 MB. In contrast, if the column length is 80, the RAM that your application must allocate is 10 x 80 x 100, about 78 KB. This difference is significant for a single query, and your application will process many queries concurrently. Therefore, your application must allocate the 4 MB or 78 KB of RAM for each connection.
As far as I know, VARCHAR2 is a variable-length datatype, so the database only allocates the space actually used by the column; i.e., if a column value is only 10 characters, it allocates only the bytes those characters need. But according to the statement above, even if the column's actual content is only 10 characters, when the datatype length is defined as 4000, will it still occupy 4000 bytes?
The space allocated on disk will only be as long as required to store the actual data for each row.
The space allocated in memory will (in some cases) be the maximum required based on the datatype.
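To make the quoted arithmetic concrete, a sketch with hypothetical tables (only two of the ten columns shown):
-- client fetch buffers are sized from the declared length, not the actual data
-- 10 columns x 4,000 bytes x 100 rows = 4,000,000 bytes, almost 4 MB
CREATE TABLE t_wide (c1 VARCHAR2(4000), c2 VARCHAR2(4000) /* ...8 more... */);
-- 10 columns x 80 bytes x 100 rows = 80,000 bytes, about 78 KB
CREATE TABLE t_narrow (c1 VARCHAR2(80), c2 VARCHAR2(80) /* ...8 more... */);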
The documentation itself is wrong or misleading in several ways. The sentence right before the quoted paragraph says "...length and precision affect storage requirements." And yet, right after that, the doofus who wrote the documentation article goes on to refer to RAM. Storage means on disk; RAM is memory. Unless we are talking about an in-memory database (which that documentation article does not), it makes no sense to talk about RAM after saying that something "affects storage requirements." The declared length does NOT affect storage, but it MAY affect memory allocation.
Specifically, it MAY affect memory allocation when an application (often written in general-purpose languages like Java, C#, etc.) needs to allocate memory ahead of time, when the only information it has is what's in the data dictionary. Memory can be allocated statically (at compilation time), but then you can't use the extra information from the actual data, namely that all your strings are 100 bytes at most; all that is known AT THAT STAGE is the 4000-byte maximum. Memory can also be allocated DYNAMICALLY, and that can use the extra information, but it is MUCH, MUCH slower!
In many "interactions" between the DB and applications written in other languages, you don't even have the option of dynamic memory allocation; in the present world, the assumption is that "time" is worth much, much more than RAM, so if your code runs out of memory, you buy more RAM rather than worry about dynamic allocation. Which means that if you declare VARCHAR2(4000), you should expect a lot of RAM to be allocated, potentially wastefully. Just declare VARCHAR2(100) if that's all you need.
The source for that interesting question is here.
The article is very clear about VARCHAR2 storage:
Oracle Database blank-pads values stored in CHAR columns but not values stored in VARCHAR2 columns. Therefore, VARCHAR2 columns use space more efficiently than CHAR columns.
What they are saying about the RAM allocation is that your application would not know how much RAM to allocate if you had NOT defined a limit for your VARCHAR2 column. Also, if the limit is too high, it would allocate too much RAM to start with, so always choose the most efficient limit.
There is also a comprehensive article about the OCI usage of data types here.
I have a complex process that interacts with multiple systems.
Each of these systems may produce error messages that I would like to store in a table of my Oracle database (note that I have statuses but the nature of the process is such that the errors may not always be predefined).
We are talking about hundreds of thousands of transactions each day, of which about 1% may result in various errors.
1) What is a reasonable/acceptable length for the database field, and how big a message should I be storing?
2) Memory-wise, does it really matter how large the field is defined in the database?
"Reasonable" and "acceptable" depends on the application. Assuming that you want to define the database column as a VARCHAR2 rather than a CLOB, and assuming that you aren't using 12.1 or later, you can declare the column to hold up to 4000 bytes. Is that enough for whatever error messages you need to support? Is there a lower limit on the length of an error message that you can establish? If you're producing error messages that are designed to be shown to a user, you're probably going to be generating shorter messages. If you're producing and storing stack traces, you may need to declare the column as a CLOB because 4000 bytes may not be sufficient.
What sort of memory are we talking about? On disk, a VARCHAR2 will only allocate the space that is actually required to store the data. When the block is read into the buffer cache, it will also only use the space required to store the data. If you start allocating local variables in PL/SQL, depending on the size of the field, Oracle may allocate more space than is required to store the particular data for that local variable in order to try to avoid the cost of growing and shrinking the allocation when you modify the string. If you return the data to a client application (including a middle tier application server), that client may allocate a buffer in memory based on the maximum size of the column rather than based on the actual size of the data.
Suppose I have a table with a column name of type varchar(20), and I store a row with name = "abcdef".
INSERT INTO tab(id, name) values(12, 'abcdef');
How is the memory allocation for name done in this case?
There are two ways I can think of:
a)
20 bytes is allocated but only 6 used. In this case varchar2 does not have any significant advantage over char, in terms of memory allocation.
b)
Only 6 bytes are allocated. If this is the case, and I added a couple more rows after this one,
INSERT INTO tab(id, name) values(13, 'yyyy');
INSERT INTO tab(id, name) values(14, 'zzzz');
and then I do an UPDATE,
UPDATE tab SET name = 'abcdefghijkl' WHERE id = 12;
Where does the DBMS get the extra 6 bytes from? It can be the case that the next 6 bytes are not free (if only 6 were allocated initially, the following bytes might have been allotted to something else).
Is there any way other than shifting the row out to a new place? Even shifting would be a problem for index-organized tables (it might be okay for heap-organized tables).
There may be variations depending on the RDBMS you are using, but generally:
Only the actual data that you store in a varchar field is allocated. The size is only a maximum allowed, it's not how much is allocated.
I think that goes for char fields also, on some systems. Variable size data types are handled efficiently enough that there is no longer any gain in allocating the maximum.
If you update a record so that it needs more space, the records inside the same allocation block are moved down, and if the records no longer fit in the block, another block is allocated and the records are distributed between the blocks. That means that records are contiguous inside the allocation blocks, but the blocks don't have to be contiguous on disk.
It certainly doesn't allocate more space than needed; that would defeat the point of using a variable-length type.
In the case you mention, I would think that the rows below would have to be moved down on the page; perhaps this is optimized somehow. I don't know the exact details; maybe someone else can comment further.
This is probably heavily database dependent.
A couple of points, though: databases that implement MVCC don't actually update data on disk or in the memory cache. They insert a new row with the updated data and mark the old row as deleted from a certain transaction on. After a while, the deleted row is no longer visible to any transaction, and its space is reclaimed.
As for the space storage issue, a value is usually stored as 1-4 bytes of header + data (+ padding).
In the case of char, the data is padded to the declared length. In the case of varchar or text, the header stores the length of the data that follows.
Edit: For some reason I thought this was tagged Microsoft SQL Server. I think the answer is still relevant, though.
That's why the official recommendation is:
Use char when the sizes of the column data entries are consistent.
Use varchar when the sizes of the column data entries vary considerably.
Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.
It's a trade-off you need to consider when designing your table structure; you would probably need to factor the frequency of updates vs. reads into the decision too.
Worth noting that for char, a NULL value still uses all the storage space. There is an add-in for Management Studio called SQL Internals Viewer that lets you easily see how your rows are stored.
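Following that recommendation, a minimal sketch in SQL Server syntax (table and sizes are hypothetical):
CREATE TABLE customer (
    country_code CHAR(2)      NOT NULL,  -- entries are consistently 2 characters
    full_name    VARCHAR(200) NOT NULL,  -- entries vary considerably
    notes        VARCHAR(MAX) NULL       -- may exceed 8,000 bytes
);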
Given the VARCHAR2 in the question title, I assume your question is focused around Oracle. In Oracle, you can reserve space for row expansion within a data block with the use of the PCTFREE clause. That can help mitigate the effects of updates making rows longer.
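A sketch of the clause, reusing the question's table (the 20 percent figure is illustrative, not a recommendation):
-- keep 20% of each block free so rows can grow in place on UPDATE
CREATE TABLE tab (
    id   NUMBER PRIMARY KEY,
    name VARCHAR2(20)
) PCTFREE 20;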
However, if Oracle doesn't have enough free space within the block to write the row back, it does what is called row migration: it leaves the original address on disk alone (so it doesn't necessarily need to update indexes), but instead of storing the data in the original location, it stores a pointer to the row's new address.
This can cause performance problems in cases where a table is heavily accessed by indexes if a significant number of the rows have migrated, as it adds additional I/O to satisfy queries.
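If you suspect this is happening, Oracle's ANALYZE command populates a counter that covers both chained and migrated rows; a sketch:
ANALYZE TABLE tab COMPUTE STATISTICS;
-- CHAIN_CNT > 0 indicates chained or migrated rows
SELECT table_name, chain_cnt FROM user_tables WHERE table_name = 'TAB';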