In C++, the std::map class is very convenient. Instead of going for a separate database, I want to store all the rows as objects and create a map over the columns I need to search. I am concerned about the maximum number of objects a process can handle. And is using a map to retrieve an object among, say, 10 million objects (if Linux permits) a good choice? I'm not worried about persisting the data.
What you are looking for is std::map::max_size, quoting from the reference:
...reflects the theoretical limit on the size of the container. At runtime, the size of the container may be limited to a value smaller than max_size() by the amount of RAM available.
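For a rough sense of what that means in practice, here is a minimal C++ sketch (the Row type and the key choice are made up for illustration) that prints max_size() and then fills a std::map with 10 million entries; the practical ceiling is the RAM the process can obtain, and each lookup costs O(log n):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Hypothetical row type standing in for whatever the real rows look like.
struct Row {
    std::uint64_t id;
    std::string   name;
};

int main() {
    std::map<std::uint64_t, Row> rows;

    // Theoretical element limit of this particular map type on this platform.
    std::cout << "max_size: " << rows.max_size() << '\n';

    // Inserting 10 million rows is fine as long as the machine has the RAM
    // (expect on the order of a gigabyte here); the practical limit is memory,
    // not a per-process object count.
    for (std::uint64_t i = 0; i < 10'000'000; ++i) {
        rows.emplace(i, Row{i, "row-" + std::to_string(i)});
    }

    // Lookup by key is O(log n) -- roughly 23-24 comparisons for 10 million keys.
    auto it = rows.find(5'000'000);
    if (it != rows.end()) {
        std::cout << it->second.name << '\n';
    }
}
```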
No, there is no maximum number of objects per process. Objects (as in, C++ objects) are an abstraction which the OS is unaware of. The only meaningful limit in this regard is the amount of memory used.
You can completely fill your RAM using as much map as it takes, I promise.
As you can see in the reference documentation, the member function map::max_size will give you the number.
This is typically on the order of 2^31−1 on 32-bit x86 hardware/OS, and vastly larger on amd64 hardware with a 64-bit OS (roughly the addressable range divided by the size of an internal tree node).
Possible additional information here.
An object is a programming-language concept; the process itself is not aware of objects. With enough RAM, you can allocate as many objects as you like in your program.
About your second question: which data structure you choose depends on the problem you want to solve. A map is well suited to quickly accessing objects, testing existence, and so on, but it does not preserve insertion order (std::map keeps its elements sorted by key instead).
Related
I'm in the process of switching over to Julia from other programming languages, and one of the things Julia will let you hang yourself on is memory. I think this is likely a good thing: a language where you actually have to think about memory management forces the coder to write more efficient code. This is in contrast to something like R, where you can seemingly load datasets that are larger than the available memory. Of course, you can't actually do that, so I wonder how R gets around that problem.
Part of what I've done in other programming languages is work on large tabular datasets, often converted into an R data frame or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to read the data all at once with something like readlines("my_file.txt"), or to open("my_file.txt", "r") and stream it? If possible, wouldn't it be better to load a large dataset all at once for speed, or is it better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
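For instance, here is a minimal sketch of row-by-row processing with CSV.Rows (the file name and the value column are placeholders for your actual data); only a small buffer is held in memory at any time:

```julia
using CSV

# Iterate the file one row at a time; only a small buffer is kept in memory.
# "my_file.csv" and the :value column name are placeholders for your actual data.
total = 0.0
for row in CSV.Rows("my_file.csv")
    # CSV.Rows yields string fields by default, so parse the one we need.
    total += parse(Float64, row.value)
end
println("sum of value column: ", total)
```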
While it may not be relevant to your use case, the mechanisms mentioned in @Przemyslaw Szufel's answer are used other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that supports optional memory-mapping when applicable.
In R you cannot have a data frame larger than memory; there is no magical buffering mechanism. However, when running R-based analytics you could use the disk.frame package for that.
Similarly, in Julia, if you want to process data frames larger than memory you need to use an appropriate package. The most reasonable and natural option in the Julia ecosystem is JuliaDB.
If you want a more low-level solution, have a look at:
Mmap, which provides memory-mapped I/O and exactly solves the issue of conveniently handling data too large to fit into memory
SharedArrays, which offers a disk-mapped array whose implementation is based on Mmap.
In conclusion, if your data is data-frame based, try JuliaDB; otherwise have a look at Mmap and SharedArrays (look at the filename parameter).
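As a rough sketch of the Mmap route (the file name and element type are assumptions), you can map a large flat binary file and index it like an ordinary Vector while the OS pages data in and out on demand:

```julia
using Mmap

# "big_data.bin" is a placeholder: a flat binary file of Float64 values.
open("big_data.bin", "r") do io
    n = filesize(io) ÷ sizeof(Float64)
    # The returned Vector is backed by the file; pages are loaded lazily.
    data = Mmap.mmap(io, Vector{Float64}, n)
    m = min(n, 1_000_000)
    println("mean of first ", m, " elements: ", sum(@view data[1:m]) / m)
end
```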
I am writing a program in Java for my application and I am concerned about speed performance. I have done some benchmarking tests and the speed does not seem good enough. I think it has to do with the add and get methods of ArrayList, since when I take a snapshot in the JVM profiler it tells me that most of the time is spent in the add and get methods of ArrayList.
I read some years ago, when I took the OCPJP test, that if you do a lot of adds and deletes you should use LinkedList, but if you want fast iteration you should use ArrayList. In other words, use ArrayList when you will mostly call get and LinkedList when you will mostly call add, and that is what I have done.
I am not sure anymore whether this is right.
I would appreciate any advice on whether I should stick with that rule, or whether there is another way to improve performance.
I think it has to do with the add and get methods of ArrayList, since when I take a snapshot in the JVM profiler it tells me that most of the time is spent in the add and get methods of ArrayList
It sounds like you have used a profiler to check what the actual issues are -- that's the first place to start! Are you able to post the results of the analysis that might, perhaps, hint at the calling context? The speed of some operations differs between the two implementations, as summarized in other questions. If the calls you see are really coming from another method in the List implementation, you might be chasing the wrong thing (e.g. frequently inserting near the beginning of an ArrayList, which can cause terrible performance).
In general, performance will depend on the implementation, but when running benchmarks myself under real-world conditions I have found that ArrayLists generally fit my use case better when I am able to size them appropriately on creation.
LinkedList may or may not keep a pool of pre-allocated memory for new nodes, but once the pool is empty (if present at all) it will have to go allocate more -- an expensive operation relative to CPU speed! That said, it only has to allocate at least enough space for one node and then tack it onto the tail; no copies of any of the data are made.
An ArrayList's implementation pre-allocates more space than is actually required for the underlying array and grows it as elements are added. If you initialize an ArrayList with the default constructor, it starts with an internal array capacity of 10 elements. The catch is that when the list outgrows that initially-allocated size, it must allocate a new contiguous block of memory large enough for the old and the new elements and then copy the elements from the old array into the new one.
In short, if you:
use ArrayList
do not specify an initial capacity that guarantees all items fit
proceed to grow the list far beyond its original capacity
you will incur a lot of overhead copying items. If that is the problem, the cost should be amortized over the long run by the resizing it avoids later ... unless, of course, you repeat the whole process with a new list rather than re-using the original, which has now grown in size.
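To make that concrete, here is a small Java sketch (the element count is arbitrary) contrasting a list that grows from the default capacity with one that is pre-sized when the final size is known up front:

```java
import java.util.ArrayList;
import java.util.List;

public class PreSizeDemo {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Grows from the default capacity; the backing array is re-allocated
        // and copied every time the list outgrows it.
        List<Integer> growing = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            growing.add(i);
        }

        // Pre-sized: one allocation up front, no copies while filling.
        List<Integer> preSized = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            preSized.add(i);
        }

        // get(i) is O(1) for both lists, so index-based access stays cheap.
        System.out.println(growing.get(n - 1) + " " + preSized.get(n - 1));
    }
}
```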
As for iteration, an array is a contiguous chunk of memory. Since many items may be adjacent, fetches from main memory can end up being much faster than for the nodes of a LinkedList, which could be scattered all over depending on how things get laid out in memory. I'd strongly suggest trusting the profiler's numbers for the different implementations and tracking down what is actually going on.
I'd like to know if there is a guideline for the maximum number of attributes any one NSManagedObject subclass should have. I've read Apple's docs carefully, but find no mention of a limit where performance starts to erode.
I DO see a compiler flag option in Xcode that provides a warning when an NSManagedObject has more than 100 attributes, but I can't find any documentation on it. Does anyone here have experience with Core Data MOs that have a large number of attributes?
I'm focused on performance, not memory usage. In my app, there will only ever be about 10-20 instances of this MO that has a large number of attributes and I'm developing on OS X rather than iOS, so memory usage isn't a factor. But if there is a point where performance (especially when faulting) starts to die, I'd like to know now so I can change the structure of my data model accordingly.
Thanks!
Each attribute gets mapped to a table column, if you're using SQLite backing stores. SQLite has a hard limit on the number of columns you can use (2000 by default, though it's compile-time configurable so Apple's implementation could differ), and they recommend not using more than one hundred. That could well be why the Xcode warning sets its threshold at 100.
That same linked page on limits also notes that there are some O(N^2) algorithms where N is the number of columns, so it sounds like you should generally avoid high column counts.
For other file formats, I don't know of any limits or recommendations. But I'd expect similar things - i.e. there's probably some algorithm in there that's O(N^2) or worse, so you want to avoid becoming an uncommon edge case.
Not that I have run across, even on iOS. The biggest limiting factor on performance is the cache size in the NSPersistentStoreCoordinator, which on Mac OS X is pretty big.
If your attributes are strings, numbers, dates, etc. (i.e. not binary data) then you can probably have a huge number of attributes before you start to see a performance hit. If you are working with binary data then I would caution you against blowing the cache, and I would consider storing the binary data outside of SQLite. More recent versions of the OS can even do this automatically for you.
However, I would question why you would want to do this. Surely there are attributes that are going to be less vital than others and can be abstracted away into child entities on the other side of a one-to-one relationship?
I've just built a simple RAM in Minecraft (with redstone), with 4 bits for the address and 4 bits stored in each cell. Our next goal is to store different kinds of variables in it and to process them differently.
We are not engineers, so we don't really know, but we have made some quite complex things and we think we can do this. The problem is that we can't figure out how to store variables with more bits than can fit in a single cell. I'll give an example.
Think of a 16-bit variable. We thought there was no sense in creating big cells, so we decided to store that data 4 bits per cell. But that's not enough: we had to relate those 4 cells somehow. So we thought we had to create 8-bit cells, with 4 bits of content and 4 bits storing the address where the next 4 bits of the variable are kept. However, 4 bits of address is nothing for a RAM; we could hardly store anything that way, so we would need at least 8 bits for the address. 4 bits of content also seems quite low, and we would also need at least another 4 bits to store the type of the variable.
In the end we decided that technique was absurd and that it couldn't be done like that in real life, but now we don't know how to do it. I've searched the web for how RAM works, and the little I've found was too complex for our needs.
Could someone please explain to us how this is done in real life?
Heh you're playing the blame game, trying to pin all the responsibility of memory management on the physical RAM implementation.
In fact, RAM is just that: a storage device (your redstone tiles). Actually organizing the data in it is your program's responsibility. Put another way, there doesn't need to be a standardized memory-cell "linking" strategy built into the RAM, because it is your program that writes the data and then reads it back, so it knows its own conventions.
With that in mind, storing values is easy. Say you want a 16-bit integer stored in your 4-bit-per-word RAM (so 4 words of data). Simply treat addresses 0 through 3 as that variable and that's it. No "linking" is necessary, because your program knows both how to write it and how to read it back, and you won't step on your own toes (in theory).
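To make that concrete, here is a tiny C++ sketch (standing in for whatever your redstone control logic would do) that stores a 16-bit value as four 4-bit words at consecutive addresses and reassembles it on read:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t ram[16] = {};      // 16 cells, 4 usable bits each
    std::uint16_t value = 0xBEEF;   // the 16-bit variable to store

    // Write: one nibble per cell, starting at address 0 (low nibble first).
    for (int i = 0; i < 4; ++i) {
        ram[i] = (value >> (4 * i)) & 0xF;
    }

    // Read: reassemble the four nibbles back into a 16-bit value.
    std::uint16_t loaded = 0;
    for (int i = 0; i < 4; ++i) {
        loaded |= static_cast<std::uint16_t>(ram[i]) << (4 * i);
    }

    std::cout << std::hex << loaded << '\n';  // prints beef
}
```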
Additional thoughts for growing your construct: special locations for specialized registers (a stack pointer so you can use a stack for recursive computation, a program counter for a Turing machine, etc.). I had one more but I forgot it while writing that one; if I remember it I'll edit.
I'm writing an API that gets information about the CPU (using CPUID). What I'm wondering is: should I store the values from the bit fields returned by CPUID in separate integer variables, or should I just store the raw bit fields and write functions that extract the different values on the fly?
What is preferable in this case? Memory usage or speed? If it's memory usage, I'll just store the entire bit field in a single variable. If it's speed, I'll store each value in a separate variable.
You're only going to query a CPU once. With modern computers having both huge amounts of memory and processing power, it would make no difference either way.
Just do what would make more sense for the next person who reads it.
Programs must be written for people to read, and only incidentally for machines to execute.
— Structure and Interpretation of Computer Programs
I think it does not matter here, because you will not call your CPUID code 10,000 times per second... will you?
I think you can define a separate interface (method) for each value; this is clearer and easier to use. A clear, accurate, easy-to-use interface should be the first thing to consider, then performance (memory usage and speed).
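For what it's worth, here is a minimal C++ sketch of the "store the raw registers once, decode on demand" option; the CpuidLeaf and HasSse2 names are made up for illustration, and __get_cpuid is the GCC/Clang intrinsic (MSVC spells it differently):

```cpp
#include <cpuid.h>   // GCC/Clang intrinsic; MSVC would use __cpuid from <intrin.h>
#include <iostream>

// Store the raw registers once and decode individual fields on demand.
struct CpuidLeaf {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
};

CpuidLeaf QueryLeaf(unsigned int leaf) {
    CpuidLeaf r;
    __get_cpuid(leaf, &r.eax, &r.ebx, &r.ecx, &r.edx);
    return r;
}

// Accessors extract bits on the fly from the stored registers.
bool HasSse2(const CpuidLeaf& leaf1) {
    return (leaf1.edx >> 26) & 1;  // CPUID.01H:EDX bit 26 = SSE2
}

int main() {
    CpuidLeaf leaf1 = QueryLeaf(1);  // queried once, then reused
    std::cout << "SSE2: " << (HasSse2(leaf1) ? "yes" : "no") << '\n';
}
```

Either way, the query happens once, so the cost of decoding on access is negligible; the accessor-per-value interface just keeps the call sites readable.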