Representing objects of properties and methods in memory , if anyone have picture or drawing to expalin how computer deal with it and store properties in memory?
Computers do not really store abstract information of that sort at the basic level. There, you essentially have numbers--in binary, but that is not important--and it is generally up to software to interpret these numbers.
In the Von Neuman model, that close to every system is based on, you have one big address space. You can index into it, so your CPU can, for example, fetch the number that sits on a given address, or write a new number to an address, and that is mostly what there is to storing data. Usually, but not always, the addresses pick individual bytes of your memory, but your computer could address into larger or smaller word sizes, for example, you might have a computer that would address into 32 bit words instead of 8 bit words. It doesn't matter for the overall model, though. You just have a big block of memory and you can get the data at individual addresses.
How you interpret this data is up to the program. Well, almost. In this figure, I've tried to illustrate memory and where we have some data. The data is the zero-terminated string "Hello, World\n", but only if we interpret it as an ASCII-encoded string. If we interpreted it as an array of integers instead, then it would be that. The hardware doesn't care how you interpret the data.
What makes a computer a Neuman model is that both data and program is represented in the same memory. Not only can we get to any data via its address, but we can get to the code we want to run as well. There isn't any difference between the two. A program, or a function, or a method, is just an address where you have a sequence of numbers, and the CPU can interpret these numbers as executable code. You can, in theory, point to "Hello, World\n" and then tell the CPU to run it as a program. (I won't recommend it).
When it comes to executable code, there is the slight difference that the CPU does the interpretation. In your own program, you can mostly choose how to represent data (although there might be some penalties if you want different representations than what you get from the raw hardware), but the CPU will interpret the different numbers as specific instructions and execute them as such. At least that is how it works if you run native code; if you have a virtual machine, then the virtual machine is a program that interprets your code, and its interpretation of the data can be quite different from the CPU's. The virtual machine, though, will typically run native code, so you are still relying on the CPU's interoperation of numbers, although indirectly.
I should also mention that modern hardware and operating systems do not usually stick with the simple Von Neuman model. If you treat program and data as interchangeable, you get some massive security holes. In practise, you have some form of permission set on different memory blocks, and your code has to sit in a block that you are allowed to execute, and your data (typically) is not. You can switch the permissions, though, if you want to autogenerate native executable code, and virtual machines often do this.
Anyway, for simplicity, let's just say that we have a simple Von Neuman model. Then both program and data are just chunks of memory that we either interpret as program (and it will then be executed by the CPU when we tell it to run the code at a given address) or as data (and then our software is responsible for interpreting the numbers in memory as some higher data structure).
There aren't any differences between object, properties, or other higher-level concepts at this level. Those are entirely dealt with at the level(s) above the hardware. They are simply interpretations of the raw numbers that sit in memory.
Update: a few more details...
Storing objects
The hardware doesn’t know anything about objects. It has addresses and there are numbers (or bit-patterns, if you prefer) at those addresses. Most data types span more than one address. If, for example, we can address bytes, but integers take up four bytes (i.e. they are 32-bit integers), then naturally we need four bytes, at four addresses, to represent an integer. They will be represented as four contiguous bytes, and depending on the architecture you might have the most-significant byte first or last (this is known as endianess) So, the number 10 (which fits in a single byte, but is still a four-byte integer) might be represented as 0x00 0x00 0x00 0x0a or 0x0a 0x00 0x00 0x00. The 0x0a byte is 10 and it might be first or last.
What then about structures, which is what is closest to what we think of as objects? They are larger blocks of attributes/properties/entries/whatever, and they are represented the same way. Blocks of memory is all we have.
If you have an object that contains two integers, say a representation of a rectangle, then the object sits somewhere in memory and will contain the representation of those two integers.
rect:
h, w: int
I’ve intentionally made up the syntax for this, since it isn’t language specific, and different languages and runtime systems have different variations on how they do this, but they all do something similar.
Here, one representation could be a block of 8 bytes, two 4-byte integers, where the first is h and the second is w. There might be padding between elements, so the objects are aligned the way the hardware prefers, but I will ignore that here.
If the object sits at address 0xafode4, that means that h also sits there (assuming that there is no extra information stored in the object), and that means that w sits four bytes later, if integers take up four bytes of space. Again, the details will differ, but this is generally how it is done if you know the layout of objects at compile time. (If you don’t know them until runtime, you will instead have a table of attributes, and the object contains the table instead).
Now, what happens if an object contains other objects? Say, what if the rectangle is represented by two points instead, and the points are objects
point:
x, y: int
rect:
p1, p2: point
In the simplest version, nothing changes. The rect object contains two points, so the points are embedded in the memory that represents the rect.
This doesn’t always work, though. If you have polymorphic types, you might not know the concrete type of a contained object, so you cannot allocate memory. In that case, instead of containing the other object, you will have a reference to it, a pointer. The rect object would hold the addresses of the two points, and the points would sit elsewhere in memory. This is also what you have to do if you want to build non-trivial data structures, so it isn’t specific to object orientation or objects.
In an OOP context, there might be a bit more work to it, but we will get to that. First, let’s consider functions (and let’s go back to a rectangle that just holds h and w).
Representation of functions
Code is just blocks of memory as well, but where the numbers represent instructions to the CPU. Let’s say we want to multiply two numbers, then we might have an instruction that looks like
mul a, b, c
that says that the CPU should take the numbers in registers a and b, multiply them, and put the result in register c. You usually have instructions that take the input from memory or as constants or such as well, but let’s just consider a single simple instruction: multiply two numbers you have in registers and put the result in a third register.
The mul instruction has a number. Completely arbitrarily we can say that it is the byte 0xef. The three arguments specify registers, and if they are a byte each we can have up to 256 registers. The full instruction would contain four bytes, the mul instruction 0xef and the three arguments. If we want to multiply register r1 with register r2 and put the result in register r0, the instruction would be
mul r1, r2, r0
0xef 0x01 0x02 0x00
so what the computer sees is the program 0xef 0x01 0x02 0x00.
For functions, we need two things more: a way to return, and a way to handle input and output.
The return bit is easy. There will be a ret instruction that returns to where the function was called, handling stack registers and such in the process. We can pretend that ret has code 0xab.
Input and output is specified by a calling convention, and it isn’t tied to the hardware as such. You need an agreed upon way to pass arguments to functions and you need to know where the result is when the function returns, but that is all there is to it. On our imaginary architecture, we could say that input one and two will be in registers r1 and r2 and that the output should be in r0 when we return. That way, we can make a simple multiplication function
fun mult(a, b): return a * b
with the instructions
mul r1, r2, r0 ; 0xef 0x01 0x02 0x00
ret ; 0xab
and the computer will store it as the numbers 0xef 0x01 0x02 0x00 0xab. If you know where this code/data sits in memory, e.g. 0x00beef, you can call the function call 0x00beef with some other instruction call (that also has a number, say 0x10) and the address (here an address is typically 8 bytes on a desktop, or 64 bits, so the three bytes in 0x00beef would have zeros before or after it, depending on endianes. I will pretend that we have three byte addresses to make it more readable).
To call the function, you first need to get the arguments into the correct registers, so if you want to get the area of our rect object, you want to get h and w into registers r1 and r2.
What you want to do is call
area = mult(rect.h, rect.w)
so how do you get rect.h and rect.w into registers? You need instructions for that. Let’s say that we have a mov instruction (0x12) that looks like this:
mov adr, reg
where adr is an address (3 bytes on this imaginary architecture) and reg is a register (1 byte). The full instruction is 5 bytes (the 0x12 instruction, the 3 byte address and the 1 byte register). If your rect object sits at 0xaf0de4, then we have rect.h at 0xaf0de4 as well, and we have rect.w four bytes later, at 0xaf0de8. Calling mult(rect.h, rect.w) involves these instructions
mov 0xaf0de4, r1 ; rect.h -> r1
mov 0xaf0de8, r2 ; rect.h -> r2
call 0x00beef ; mult(rect.h, rect.w)
; now rect.h * rect.w is in r0
The actual data stored on the computer is the codes for this:
; mov 0xaf0de4, r1
0x12 0xaf 0x0d 0xe4 0x01
; mov 0xaf0de8, r2
0x12 0xaf 0x0d 0xe8 0x02
; call 0x00beef
0x10 0x00 0xbe 0xef
Everything is still just numbers that we can access through addresses.
Here, of course, the addresses we have used are hardwired into the program, and that doesn’t work in real life. You don’t know where all the objects will be when you compile your program. Some addresses you do know, once you fire up your executable. The location of functions, for example, will be known, and the linker can insert the correct addresses where you need them. Locations of objects, typically not. But there will be instructions like mov that takes the address from a register instead of from the program. We could, for example, have an instruction
mov a[offset], b
that moves data from the address stored in register a + offset into register b. It might have a another number, say 0x13 instead of 0x12, but in assembly you typically have the same code so you don’t see it there.
You would also have an instruction for putting a constant into a register, and I wouldn’t be surprised if that is also called mov and would have the form
mov a, b
where a is now a constant, i.e. some number, and you put that number in register b. The assembly looks the same, but the instruction might have number 0x14.
Anyway, we could use that to call mult(rect.h, rect.w) instead. Then the code would be
mov 0xaf0de4, r3 ; put the address of rect in r3
; 0x14 0xaf 0x0d 0xe4 0x03
mov r3[0], r1 ; put the value at r3+0 into r1
; 0x13 0x03 0x00 0x01
mov r3[4], r2 ; put the value at r3+4 into r2
; 0x13 0x03 0x04 0x02
call 0x00beef
; 0x10 0x00 0xbe 0xef
If we have these instructions, we could also modify our function mult(a,b) to one that takes a rectangle as input and returns the area
fun area(rect): rect.h * rect.w
The function can get the address of the object as its single argument, where it would go in register r1, and from there it could load rect.h and rect.w to multiply them.
; area(rect) -- address of rect in r1
mov r1[0], r2 ; rect.h -> r2
mov r1[4], r3 ; rect.w -> r3
mul r2, r3, r0 ; rect.h * rect.w -> r0
ret ; return rect.h * rect.w
It gets more complicated than this, but you should have the idea now. Our functions are sequences of such instructions, and the arguments to them, and the result value, is passed back and forth, usually through registers, by some calling convention. If you want to pass a value to a function, you need to put it in the right register (or on the stack, depending on the calling convention), and then the function will operate on it. What it does with the object is entirely software; the hardware doesn’t care that much.
Classes and polymorphism
What then if we want polymorphic methods? If we have a class hierarchy of geometric objects and rect is just one of them, and all of them should have an area method that, when called, is dispatched based on the objects’ class?
When you have polymorphic methods, what you really have is a bunch of different functions. If you call x.area() on an object x that happens to be a circle, then you are really calling circle_area(x), while if x is a rect you are calling rect_area(x). The only thing you need to make this work is having a mechanism for dispatching to the right function call.
Here, again, the details differ (a lot), but a simple solution is to put pointers to the correct function in the objects. If you call x.area() maybe you know that the first element in the memory of x is a pointer to its specific area function. So, instead of calling a function directly, you fetch the address of the function from x and then you call it.
x.area() == (x.area_func)(x)
All objects you can call area() on should have this function, and they should have it at the same offset from the address of the object, and then it can be as simple as that.
This can, of course, be wasteful in memory if your classes have lots of methods. You are storing a pointer to each method in each object (and you also have to spend time on initialising this, so there is additional overhead there as well).
Then another solution can be to add a level of indirection. If the methods are the same for all objects of a class (which they often are, but not for all languages) then you can put the table of methods in a class object and have a single pointer to the class in each object. When you need to get the right function you first get the class and then you get the function from it.
x.area() == (x.class.area_func)(x)
With single inheritance, the tables in the different classes can have different sizes, and it doesn’t get more complicated because of that. With multiple inheritance, it does get more complicated, but that is handled very differently in different languages so it is hard to say anything general about that.
This system call code is not working at all. The compiler is optimizing things out and generally behaving strangely:
template <typename... Args>
inline void print(Args&&... args)
{
char buffer[1024];
auto res = strf::to(buffer) (std::forward<Args> (args)...);
const size_t size = res.ptr - buffer;
register const char* a0 asm("a0") = buffer;
register size_t a1 asm("a1") = size;
register long syscall_id asm("a7") = ECALL_WRITE;
register long a0_out asm("a0");
asm volatile ("ecall" : "=r"(a0_out)
: "m"(*(const char(*)[size]) a0), "r"(a1), "r"(syscall_id) : "memory");
}
This is a custom system call that takes a buffer and a length as arguments.
If I write this using global assembly it works as expected, but program code has generally been extraordinarily good if I write the wrappers inline.
A function that calls the print function with a constant string produces invalid machine code:
0000000000120f54 <start>:
start():
120f54: fa1ff06f j 120ef4 <public_donothing-0x5c>
-->
120ef4: 747367b7 lui a5,0x74736
120ef8: c0010113 addi sp,sp,-1024
120efc: 55478793 addi a5,a5,1364 # 74736554 <add_work+0x74615310>
120f00: 00f12023 sw a5,0(sp)
120f04: 00a00793 li a5,10
120f08: 00f10223 sb a5,4(sp)
120f0c: 000102a3 sb zero,5(sp)
120f10: 00500593 li a1,5
120f14: 06600893 li a7,102
120f18: 00000073 ecall
120f1c: 40010113 addi sp,sp,1024
120f20: 00008067 ret
It's not loading a0 with the buffer at sp.
What am I doing wrong?
It's not loading a0 with the buffer at sp.
Because you didn't ask for a pointer as an "r" input in a register. The one and only guaranteed/supported behaviour of T foo asm("a0") is to make an "r" constraint (including +r or =r) pick that register.
But you used "m" to let it pick an addressing mode for that buffer, not necessarily 0(a0), so it probably picked an SP-relative mode. If you add asm comments inside the template like "ecall # 0 = %0 1 = %1 2 = %2" you can look at the compiler's asm output and see what it picked. (With clang, use -no-integrated-as so asm comments in the template come through in the -S output.)
Wrapping a system call does need the pointer in a specific register, i.e. using "r" or +"r"
asm volatile ("ecall # 0=%0 1=%1 2=%2 3=%3 4=%4"
: "=r"(a0_out)
: "r"(a0), "r"(a1), "r"(syscall_id), "m"(*(const char(*)[size]) a0)
: // "memory" unneeded; the "m" input tells the compiler which memory is read
);
That "m" input can be used instead of the "memory" clobber, not instead of an "r" pointer input. (For write specifically, because it only reads that one area of pointed-to memory and has no other side-effects on memory user-space can see, only on kernel write write buffers and file-descriptor positions which aren't C objects this program can access directly. For a read call, you'd need the memory to be an output operand.)
With optimization disabled, compilers do typically pick another register as the base for the "m" input (e.g. 0(a5) for GCC), but with optimization enabled GCC picks 0(a0) so it doesn't cost extra instructions. Clang still picks 0(a2), wasting an instruction to set up that pointer, even though the "=r"(a0_out) is not early-clobber. (Godbolt, with a very cut-down version of the function that doesn't call strf::to, whatever that is, just copies a byte into the buffer.)
Interestingly, with optimization enabled for my cut-down stand-alone version of the function without fixing the bug, GCC and clang do happen to put a pointer to buffer into a0, picking 0(a0) as the template expansion for that operand (see the Godbolt link above). This seems to be a missed optimization vs. using 16(sp); I don't see why they'd need the buffer address in a register at all.
But without optimization, GCC picks ecall # 0 = a0 1 = 0(a5) 2 = a1. (In my simplified version of the function, it sets a5 with mv a5,a0, so it did actually have the address in a0 as well. So it's a good thing you had more code in your function to make it not happen to work by accident, so you could find the bug in your code.)
I just read the Shader Modules Vulkan tutorial, and I didn't understand something.
Why is createInfo.pCode a uint32_t rather than unsigned char or uint8_t? Is it faster? (because moving pointer is now 4 bytes)?
VkShaderModule createShaderModule(const std::vector<char>& code) {
VkShaderModuleCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
createInfo.codeSize = code.size();
createInfo.pCode = reinterpret_cast<const uint32_t*>(code.data());
VkShaderModule shaderModule;
if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
throw std::runtime_error("failed to create shader module!");
}
}
A SPIR-V module is defined a stream of 32-bit words. Passing in data using a uint32_t pointer tells the driver that the data is 32-bit aligned and allows the shader compiler in the driver to directly access the data using aligned 32-bit loads.
This is usually faster (in most CPU designs) than random unaligned access, in particular for cases that cross cache lines.
This is also more portable C/C++. Using direct unaligned memory access instructions is possible in most CPU architectures, but not standard in the language. The portable alternative using a byte stream requires assembling 4 byte loads and merging them, which is less efficient than just making a direct aligned word load.
Note using reinterpret_cast here assumes the data is aligned correctly. For the base address of std::vector it will work (data allocated via new must be aligned enough for the largest supported primitive type), but it's one to watch out for if you change where the code comes from in future.
According to VUID-VkShaderModuleCreateInfo-pCode-parameter, the specification requires that:
pCode must be a valid pointer to an array of 4/codeSize uint32_t values
pCode being a uint32_t* allows this requirement to be partially validated by the compiler, making it one less thing to worry about for users of the API.
Because SPIR-V word is 32 bits.
It should not matter for performance. It is just a type.
Trying to write to flash to store some configuration. I am using an STM32F446ze where I want to use the last 16kb sector as storage.
I specified VOLTAGE_RANGE_3 when I erased my sector. VOLTAGE_RANGE_3 is mapped to:
#define FLASH_VOLTAGE_RANGE_3 0x00000002U /*!< Device operating range: 2.7V to 3.6V */
I am getting an error when writing to flash when I use FLASH_TYPEPROGRAM_WORD. The error is HAL_FLASH_ERROR_PGP. Reading the reference manual I read that this has to do with using wrong parallelism/voltage levels.
From the reference manual I can read
Furthermore, in the reference manual I can read:
Programming errors
It is not allowed to program data to the Flash
memory that would cross the 128-bit row boundary. In such a case, the
write operation is not performed and a program alignment error flag
(PGAERR) is set in the FLASH_SR register. The write access type (byte,
half-word, word or double word) must correspond to the type of
parallelism chosen (x8, x16, x32 or x64). If not, the write operation
is not performed and a program parallelism error flag (PGPERR) is set
in the FLASH_SR register
So I thought:
I erased the sector in voltage range 3
That gives me 2.7 to 3.6v specification
That gives me x32 parallelism size
I should be able to write WORDs to flash.
But, this line give me an error (after unlocking the flash)
uint32_t sizeOfStorageType = ....; // Some uint I want to write to flash as test
HAL_StatusTypeDef flashStatus = HAL_FLASH_Program(TYPEPROGRAM_WORD, address++, (uint64_t) sizeOfStorageType);
auto err= HAL_FLASH_GetError(); // err == 4 == HAL_FLASH_ERROR_PGP: FLASH Programming Parallelism error flag
while (flashStatus != HAL_OK)
{
}
But when I start to write bytes instead, it goes fine.
uint8_t *arr = (uint8_t*) &sizeOfStorageType;
HAL_StatusTypeDef flashStatus;
for (uint8_t i=0; i<4; i++)
{
flashStatus = HAL_FLASH_Program(TYPEPROGRAM_BYTE, address++, (uint64_t) *(arr+i));
while (flashStatus != HAL_OK)
{
}
}
My questions:
Am I understanding it correctly that after erasing a sector, I can only write one TYPEPROGRAM? Thus, after erasing I can only write bytes, OR, half-words, OR, words, OR double words?
What am I missing / doing wrong in above context. Why can I only write bytes, while I erased with VOLTAGE_RANGE_3?
This looks like an data alignment error, but not the one related with 128-bit flash memory rows which is mentioned in the reference manual. That one is probably related with double word writes only, and is irrelevant in your case.
If you want to program 4 bytes at a time, your address needs to be word aligned, meaning that it needs to be divisible by 4. Also, address is not a uint32_t* (pointer), it's a raw uint32_t so address++ increments it by 1, not 4. As far as I know, Cortex M4 core converts unaligned accesses on the bus into multiple smaller size aligned accesses automatically, but this violates the flash parallelism rule.
BTW, it's perfectly valid to perform a mixture of byte, half-word and word writes as long as they are properly aligned. Also, unlike the flash hardware of F0, F1 and F3 series, you can try to overwrite a previously written location without causing an error. 0->1 bit changes are just ignored.
i'm trying to write a routine that will logically bitshift by n positions to the right all elements of a vector in the most efficient way possible for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type depending on the type of processor (SSE2 supported, only MMX suppported, only standard instruction se supported). Therefore i need 12 functions in total.
I have already found by myself how to backup and restore the registers that i need, how to make a loop, how to copy data into regular registers or MMX registers and how to shift by 1 position logically.
Because i'm not familiar with assembly language that's about it.
Which registers should i use for each instruction set?
How will the availability of the large vector (an image) in L1 cache be optimized?
How do i find the next element of the vector (a pointer kind of thing), i know i can make a mov by address and i assume i have to increment the address by 1, 2 or 4 depending on my type of data?
Although i have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what i'm trying to do for MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
Sounds like you'd benefit from either using or looking into BitMagic's source. its entirely intrinsics based too, which makes its far more portable (though from the looks of it your using GCC, so it might have to get an MSVC to GCC intrinics mapping).