Why is assignment faster than APPEND LINES OF? - abap

I'm currently learning ABAP. Can anyone explain why t_table2 = t_table1 is significantly faster than APPEND LINES OF t_table1 TO t_table2?
t_table1 and t_table2 are internal tables.

In addition to the answers by Zero and Cameron Smith, there's also a concept called "table sharing" (AKA "copy-on-write") which delays the copy until either the source or the target internal table is changed.
To simplify a lot, one could describe the assignment as a copy of 8 bytes (the address of the source internal table). In practice, though, one of the two internal tables will usually be changed afterwards (otherwise, why would there be a copy in the code?), so the final performance is often almost the same; the sharing only pays off occasionally, when the code is written "badly" enough to copy without ever modifying.

Memory allocation
When you define an internal table with DATA, the kernel allocates space for more than one row at a time, so the rows are stored together. Likewise, every time those rows are filled up, another larger batch is reserved.
You can see this in memory dumps; in this case, space for 16 rows would have been allocated.
When you copy with APPEND LINES OF, the kernel copies line-by-line.
When you just say itab1 = itab2, it gets copied in blocks.
How much faster
Based on the information above, you might think line-by-line copying is 16 times slower. In practice it is not: depending on row width, number of lines, kernel version and many other factors, it is typically only 10-30% slower.

I can't say this is a full reason (there's probably more going on behind the scenes that I don't know), but some of the reasons definitely include the following.
A thing to note here: on small to medium data sets the difference in speed is negligible.
t_table2 = t_table1 simply takes all of the data and copies it, overwriting t_table2 (it does NOT append). In some cases (such as when passing parameters) the data does not even get copied: the same data may be shared, and a copy is only produced if t_table2 needs to be changed.
APPEND LINES OF t_table1 TO t_table2 is basically a loop, which appends records row by row.
The reason I mention the append is that overwriting a table can be as simple as copying the data (or, in rare cases, a data reference) from a to b, while an append has to check whether the table is sorted, indexed and so on. Even if the table is in its most basic state, appending to an internal table is a slightly more complex operation than overwriting a variable.

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a less-than-optimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes, unless it keeps column markers within a row. I understand there may be optimization techniques, but is this the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Compression will also help save space and, hopefully, performance.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY must be accessed; for a million rows, for example, that is about 3 blocks. Within the leaf block there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
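To illustrate that point lookup (table and column names are made up to mirror the question), one could check the access path with EXPLAIN; with id as the InnoDB PRIMARY KEY, the plan should report a single-row, const-type primary-key access:

-- Hypothetical table mirroring the question; id is the clustered PRIMARY KEY.
CREATE TABLE emp (
    id         INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
    -- ... imagine roughly 100 more columns here
) ENGINE=InnoDB;

-- Descends the PRIMARY KEY B+Tree and reads the one leaf block that
-- holds the whole row; EXPLAIN should show access type "const".
EXPLAIN SELECT first_name FROM emp WHERE id = 10;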
For MariaDB's ColumnStore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located; after fetching it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient for a regular row-oriented RDBMS, but messier to execute in a columnstore. Columnstore is a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
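To make the contrast concrete, here is a rough sketch (hypothetical table, MariaDB ColumnStore syntax) of the columnstore side of the comparison; each column lives in its own compressed extents, so only id and first_name are touched, but there is no PRIMARY KEY B+Tree to jump straight to one row:

-- Each column is stored (and compressed) separately.
CREATE TABLE emp_cs (
    id         INT,
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
) ENGINE=ColumnStore;

-- Touches only the id and first_name extents, but must scan the id
-- column to locate the matching row(s).
SELECT first_name FROM emp_cs WHERE id = 10;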

Which is more space efficient, multiple columns or multiple rows? [closed]

Suppose I have a table A with 100 columns of the same data type and 100 rows, and a table B with 2 columns and 5000 rows, of the same data type as table A's columns.
Which table takes more disk space to store, and which is more efficient?
A table either has 2 columns or 100. You would not convert one into the other; if you did, you would be doing something very wrong.
A product table may have 100 columns (item number, description, supplier number, material, list price, actual price ...). How would you make this a two-column table? A key-value table? A very bad idea.
A country table may have 2 columns (ISO code and name). How would you make this a 100-column table? By having columns usa_name, usa_code, germany_name, germany_code, ...? An even worse idea.
So: the question is moot :-) There is nothing to decide between.
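To make the comparison concrete, here is a rough sketch (names invented) of the wide product table versus the key-value version being warned against; in the key-value table every value is forced into one generic column, and simple questions turn into self-joins:

-- Wide table: one row per product, one column per attribute.
CREATE TABLE product (
    item_number INT PRIMARY KEY,
    description VARCHAR(200),
    supplier_id INT,
    list_price  DECIMAL(10,2)
    -- ... further attribute columns
);

-- Key-value ("EAV") version of the same data: one row per attribute value.
CREATE TABLE product_kv (
    item_number INT,
    attr_name   VARCHAR(50),
    attr_value  VARCHAR(200),
    PRIMARY KEY (item_number, attr_name)
);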
The real answer here is... it depends.
Oracle stores its data in "data blocks", which are stored in "Extents" which are stored in "Segments" which make up the "Tablespace". See here.
A data block is much like a block used to store data for the operating system. In fact, an Oracle data block should be specified in multiples of the operating system's blocks so there isn't unnecessary I/O overhead.
A data block is split into 5 chunks:
The Header - Holds information about the block
The Table Directory - Tells Oracle which table this block stores data for
The Row Directory - The portion of the block that stores information about the rows in the block, such as their addresses
The Row Data - The meat and potatoes of the block, where the row data is stored (keeping in mind that rows can span blocks)
The Free Space - This is the middle of the bingo board; you don't have to actually put your chip here
So the two important parts of Oracle data storage for this question, within its data blocks, are the Row Data and the Row Directory (and, to some extent, the Free Space).
In your first table you have very large rows, but fewer of them. This would suggest a smaller row directory (unless it spans multiple blocks because of the size of the rows, in which case it would be Rows*Blocks-Necessary-To-Store-Them). In your second table you have more rows, which would suggest a larger row directory than the first table.
I believe that a row directory entry is two bytes; it describes the offset in bytes from the start of the block where the row data can be found. If the data types of the two columns in your second table are TINYINT, then your rows would be 2 bytes as well. In effect, you have more rows, so your directory here is as big as your data: the total is datasize*2, which means you store more overall for this table.
The other gotcha here is that data stored in the row directory of a block is not deleted when the row is deleted. The header that contains the row directory in the block is only reused when a new insert comes along that needs the space.
Also, every block has its free space, which it keeps for storing more rows and header information, as well as for holding transaction entries (see the link above for that).
At any rate, it's unlikely that your row directory in a given block would be larger than your row data, and even then Oracle may be holding onto free space in the block that trumps both, depending on the size of the table, how often it's accessed, and whether Oracle is automatically managing free space for you or you're managing it manually (does anyone do that?).
Also, if you toss an index on either of these tables, you'll change the stats all around anyway. Indexes are stored like tables, and they have their own Segments, extents, and blocks.
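If you do want to see what the two layouts actually occupy, one simple check in Oracle is to compare segment sizes; this assumes the tables are literally named A and B as in the question:

-- Blocks and bytes actually allocated to each table's segment.
SELECT segment_name, blocks, bytes
FROM   user_segments
WHERE  segment_name IN ('A', 'B');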
In the end, your best bet is not to worry too much about the blocks and whatnot (after all, storage is cheap). Instead:
Define appropriate field types for your data. Don't store boolean values in a CHAR(100), for instance (see the sketch after this list).
Define your indexes wisely. Don't add an index just to be sure; make good decisions when you're tuning.
Design your schema for your end user's needs. Is this a reporting database? In that case, shoot for denormalized pre-aggregated data to keep reads fast. Try to keep down the number of joins a user needs to get at their result set.
Focus on cutting CPU and I/O requirements based on the queries that will come through for the schema you have created. Storage is cheap, CPU and I/O aren't, and your end user isn't going to give a rat's ass about how many hard drives (or RAM, if it's in-memory) you needed to cram into your box. They are going to care about how quickly the application reads and writes.
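As a minimal sketch of the first point above (table and column names are invented): a flag column declared with an honest, narrow type instead of a CHAR(100):

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    is_paid  CHAR(1) CHECK (is_paid IN ('Y', 'N'))  -- 1 byte per row, not up to 100
);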
p.s. Forgive me if I misrepresented anything here. Logical database storage is complicated stuff, and I don't deal much with Oracle, so I may be missing a piece of the puzzle, but the overall gist is the same. There is the actual data you store, and then there is the metadata for that data. It's unlikely that the metadata will trump the data itself in size, but given the right circumstances it's possible (especially with indexing figured in). And, in the end, don't worry too much about it anyway. Focus on the needs of the end user/application in designing your schema. The end user will balk a hell of a lot more than your box.
Efficiency is a nebulous concept, and depends on what you are measuring. If you have to jump through hoops extracting data that is poorly indexed (or requires function-based indexes) because disk space was deemed more important than proper design, then I'd say you will wind up with a far less efficient app from a data-retrieval standpoint, not to mention having to deal with the code complexity added to try to overcome the poor design.
Considering that each column has to store some meta data, I'm guessing that table B might be more space efficient, since the size of your actual data is constant and equal in both cases.
In terms of memory, I think it depends on the type of data (images, videos, int, varchar, etc.) stored in the tables (assuming you do not mean that both tables contain the same data, as I do not see how you'd change the columns into rows).
In terms of efficiency, I hope I am right in saying that table B is more efficient, as indexing 2 columns is easier than indexing 100, so data can be retrieved more easily in whatever way you need, compared to a wide table where some kinds of query might take longer.

Inserting "bigger" data into PostgreSQL makes the system faster?

So, I witnessed the following behaviour while using PostgreSQL.
I have a table like this: (id INTEGER ..., msg VARCHAR(2000))
and then I run two programs A and B that do the exact same thing,
namely doing 20000 insertions and then 20000 retrievals (based on their id). The only
difference is that program A does insertions with messages containing
2000 characters while B just inserts messages containing at most 10 characters.
The thing is that the average time of all the insertions and retrievals
in A is always about ~15ms less than in B which doesn't really make sense,
since A is adding "bigger" data.
Any ideas or hints on why this could be happening? Could it be that, when not all of the characters of msg are used, the system uses the remaining space for other purposes, and is therefore faster when msg is full?
Based on @Dan Bracuk's comment, I timed the different events and realized the following: in program A there are quite a few insertions that are really, really fast, while in program B this is never the case. That's why A is faster than B on average, but I cannot explain this behaviour either.
I can't reproduce this without more detail about your setup and your programs, so the following is just an educated guess. It's conceivable that your observation is due to TOAST. Once a text field exceeds a certain size, it is stored in a physically separate table. Therefore, the main table is actually smaller than in the case where all the text values are stored inline, and so searches could be faster.
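If you want to test that guess, one possible check (table name is hypothetical) is to compare the heap size with the size that includes out-of-line storage, and to look up the TOAST table attached to the main table:

-- Main heap only vs. heap plus TOAST (and free-space/visibility maps).
SELECT pg_relation_size('mytable') AS main_table_bytes,
       pg_table_size('mytable')    AS with_toast_bytes;

-- The TOAST table associated with the main table, if any.
SELECT reltoastrelid::regclass
FROM   pg_class
WHERE  relname = 'mytable';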

How does this not make varchar2 inefficient?

Suppose I have a table with a column name of type varchar(20), and I store a row with name = "abcdef".
INSERT INTO tab(id, name) values(12, 'abcdef');
How is the memory allocation for name done in this case?
There are two ways I can think of:
a)
20 bytes is allocated but only 6 used. In this case varchar2 does not have any significant advantage over char, in terms of memory allocation.
b)
Only 6 bytes are allocated. If this is the case, and I added a couple more rows after this one,
INSERT INTO tab(id, name) values(13, 'yyyy');
INSERT INTO tab(id, name) values(14, 'zzzz');
and then I do a UPDATE,
UPDATE tab SET name = 'abcdefghijkl' WHERE id = 12;
Where does the DBMS get the extra 6 bytes it needs? The next 6 bytes might not be free (if only 6 were allocated initially, the following bytes might already have been allotted to something else).
Is there any other way than shifting the row out to a new place? Even shifting would be a problem in the case of index-organized tables (it might be okay for heap-organized tables).
There may be variations depending on the rdbms you are using, but generally:
Only the actual data that you store in a varchar field is allocated. The size is only a maximum allowed, it's not how much is allocated.
I think that goes for char fields also, on some systems. Variable size data types are handled efficiently enough that there is no longer any gain in allocating the maximum.
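For Oracle specifically (since the question title says varchar2), one way to convince yourself that only the actual bytes are stored is VSIZE, which returns the internal byte length of the stored value rather than the declared VARCHAR2(20) maximum; tab, id and name are the names from the question:

SELECT id, name, VSIZE(name) AS stored_bytes
FROM   tab
WHERE  id IN (12, 13, 14);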
If you update a record so that it needs more space, the records inside the same allocation block are moved down, and if they no longer fit in the block, another block is allocated and the records are distributed between the blocks. That means that records are contiguous inside the allocation blocks, but the blocks don't have to be contiguous on disk.
It certainly doesn't allocate more space than needed; that would defeat the point of using a variable-length type.
In the case you mention, I would think that the rows below would have to be moved down on the page; perhaps this is optimized somehow. I don't know the exact details; perhaps someone else can comment further.
This is probably heavily database dependent.
A couple of points, though: MVCC-observing databases don't actually update data on disk or in the memory cache. They insert a new row with the updated data and mark the old row as deleted from a certain transaction on. After a while the deleted row is no longer visible to any transaction and is reclaimed.
As for the storage question, a value is usually stored as 1-4 bytes of header + the data (+ padding).
In the case of char, the data is padded to the declared length. In the case of varchar or text, the header stores the length of the data that follows.
Edit: For some reason I thought this was tagged Microsoft SQL Server. I think the answer is still relevant, though.
That's why the official recommendation is
Use char when the sizes of the column data entries are consistent.
Use varchar when the sizes of the column data entries vary considerably.
Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.
It's a trade-off you need to consider when designing your table structure. You would probably also need to factor in the frequency of updates vs. reads.
Worth noting that for char, a NULL value still uses all the storage space. There is an add-in for Management Studio called SQL Internals Viewer that lets you easily see how your rows are stored.
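A small T-SQL sketch of how those guidelines typically land in a table definition (column names and sizes are made up):

CREATE TABLE customer (
    country_code CHAR(2)      NOT NULL,  -- entries are consistently 2 characters
    full_name    VARCHAR(100) NOT NULL,  -- lengths vary considerably
    notes        VARCHAR(MAX) NULL       -- may exceed 8,000 bytes
);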
Given the VARCHAR2 in the question title, I assume your question is focused around Oracle. In Oracle, you can reserve space for row expansion within a data block with the use of the PCTFREE clause. That can help mitigate the effects of updates making rows longer.
However, if Oracle doesn't have enough free space within the block to write the row back, it does what is called row migration: it leaves the original address on disk alone (so it doesn't necessarily need to update indexes), but instead of storing the data in the original location, it stores a pointer to the row's new address.
This can cause performance problems in cases where a table is heavily accessed by indexes if a significant number of the rows have migrated, as it adds additional I/O to satisfy queries.
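A hedged Oracle sketch of both points, using a hypothetical re-creation of the question's table: reserve extra free space per block at creation time, then count migrated/chained rows later (CHAINED_ROWS is the table created by the standard utlchain.sql script):

-- Keep 20% of each block free for rows that grow on UPDATE.
CREATE TABLE tab (
    id   NUMBER PRIMARY KEY,
    name VARCHAR2(20)
) PCTFREE 20;

-- Count rows that have been migrated or chained.
ANALYZE TABLE tab LIST CHAINED ROWS INTO chained_rows;
SELECT COUNT(*) FROM chained_rows WHERE table_name = 'TAB';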

Performance benefit when a SQL query is limited to certain columns vs. selecting the entire row?

How much of a performance benefit is there from selecting only the required fields in a query instead of the entire row? For example, if I have a row of 10 fields but only need 5 fields in the display, is it worth querying only those 5? What is the performance benefit of this limitation vs. the risk of having to go back and add fields to the SQL query later if needed?
It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.
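As an illustration (SQL Server syntax, invented table and column names): a covering index carries the displayed columns in its leaf level, so the query below can be answered from the index alone, whereas SELECT * would force a lookup back into the clustered index:

CREATE NONCLUSTERED INDEX ix_orders_customer
    ON orders (customer_id)
    INCLUDE (order_date, status, total, currency);

-- Satisfied entirely from ix_orders_customer: no key/bookmark lookup.
SELECT customer_id, order_date, status, total, currency
FROM   orders
WHERE  customer_id = 42;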
It depends on how many rows are selected and how much memory those extra fields consume. It can run much slower if, for example, several text/blob fields are present or many rows are selected.
How is adding fields later a risk? Modifying queries to fit changing requirements is a natural part of the development process.
The only benefit I know of to explicitly naming your columns in your select statement is that, if a column your code uses gets renamed, your select statement will fail before your code does. Even better, if your select statement is within a proc, the proc and the DB script will not compile. This is very handy if you are using tools like VS DB edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.
The number of fields retrieved is a second order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously, if the extra fields include a megabyte blob, the equation is skewed. But my experience is that the transaction overhead is of the same order as, or larger than, the actual data retrieved. I vaguely remember from many years ago that an "empty" NOP TNS request is about 100 bytes on the wire.
If the SQL server is not the same machine you're querying from, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not forgetting that the server also has to read more data from disk and allocate more memory to hold the results.
No single one of these things causes a problem by itself, but add them up and together they cause performance issues. Every little bit helps when you have lots of queries or lots of data.
The risk, I guess, is that you have to add the fields to the query later, which possibly means changing code, but then you generally have to add more code to handle the extra fields anyway.