How many objects can an NSArray hold? [duplicate] - objective-c

I was just wondering how many objects I can put into an NSArray, because I need to find something that functions like an array, but I need to hold a lot of data (between 900 and 1200 strings). I was thinking about using an NSDictionary to hold the data, but it doesn't seem to fit the bill. Do you think an NSArray will hold that many objects, or should I use an NSDictionary?

Technically, NSArray can hold up to NSUIntegerMax objects (this is the largest value that can be returned from count). On a 32-bit system like the iPhone, that is a little over 4 billion. On a 64-bit system like most Macs, it is many orders of magnitude higher. By the time you even need to think about running out of room in an NSArray, you're going to have other scaling problems to deal with first, like the fact that 4 billion four-character strings will take up something like 16 GB of memory.
NSArray has two internal implementations for differently sized arrays — 1200 items would still be well within the "small array" implementation.
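The limits quoted above are easy to check with a little arithmetic. The sketch below is Python rather than Objective-C and assumes that NSUIntegerMax is the largest unsigned integer of the platform's word size; `nsuinteger_max` is a hypothetical helper name, and the memory figure counts only one byte per character, ignoring per-object overhead.

```python
def nsuinteger_max(bits: int) -> int:
    """Largest value of an unsigned integer with the given width (assumed model of NSUIntegerMax)."""
    return 2 ** bits - 1

max_32 = nsuinteger_max(32)  # "a little over 4 billion"
max_64 = nsuinteger_max(64)  # many orders of magnitude higher

# The answer's estimate: 4 billion strings of four one-byte characters each,
# counting only the character data.
payload_bytes = 4_000_000_000 * 4

print(max_32)
print(max_64)
print(payload_bytes / 10**9, "GB of character data")
```

Against numbers like these, 900 to 1200 strings is a tiny collection.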

How much memory do you have?
There is no practical upper limit other than the memory it takes to hold all that data. 1200 items is fine. But if each of those items is a 10,000-word string, you may start needing too much memory to hold them all.

An NSArray can hold as many objects as fit in memory. 900-1200 strings is not a large number, but the memory used obviously depends on the length of each string. Do you know in advance whether they will be long?

Related

What is the time complexity of Search in ArrayList?

This is an interview question that I couldn't answer, and I couldn't find any relevant answers online.
I know that an ArrayList retrieves an element in constant time when given its index.
Suppose an ArrayList holds 10,000 elements and the one we want is at index 5000, but we are not given that index. To search for a particular value (for example, the integer 3, which happens to be at index 5000), we have to traverse the ArrayList, and that takes linear time, right?
Because if we traverse the ArrayList to find the data, it takes linear time, not constant time.
In short, I want to know how the contains method works internally when I have to check for a particular value and don't have the index. It has to traverse the array to check for that value, so it takes O(n) time, right?
Thanks in advance.
I hope this is what you want to know about search in ArrayList:
Arrays are laid out sequentially in memory. This means that if an array of integers uses 4 bytes per element and starts at memory address 1000, the next element is at 1004, the one after that at 1008, and so forth. Thus, if I want the element at position 20 in my array, the code in get() has to compute:
1000 + 20 * 4 = 1080
to get the exact memory address of the element. RAM (Random Access Memory) got its name because it is built with a hierarchy of hardware multiplexers that lets it access any stored memory unit (byte?) in constant time, given the address.
Thus, two simple arithmetic operations and one access to RAM are said to be O(1). Searching by value is different: contains() has no address to compute, so it compares elements one by one from the start, which is O(n) in the worst case, as you suspected.
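The contrast between indexed access and a search by value can be sketched in a few lines. This is a Python illustration, not Java's ArrayList source; `element_address`, `get` and `contains` are hypothetical helper names, and the comparison counter is only there to make the linear cost visible.

```python
def element_address(base: int, index: int, element_size: int) -> int:
    """Address arithmetic from the answer: one multiplication and one addition."""
    return base + index * element_size

def get(items, index):
    # Indexed access: the position is computed directly, no scan is needed.
    return items[index]

def contains(items, target):
    """Linear search by value; returns (found, number of comparisons made)."""
    comparisons = 0
    for value in items:
        comparisons += 1
        if value == target:
            return True, comparisons
    return False, comparisons

data = list(range(100, 10_100))  # 10,000 distinct values
target = data[5000]              # the value stored at index 5000

found, steps = contains(data, target)
print(element_address(1000, 20, 4))  # address of element 20
print(found, steps)                  # comparisons grow with the position of the match
```

A value that is absent forces a comparison against every element, which is the O(n) worst case.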

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2,000,000 documents, you're kind of stuck with an integer for each document id; that makes 4 bytes + 4 bytes. The similarity seems to lie between 0.00 and 1.00, so I guess a single byte would do if you encode 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2) * 9 / 2 bytes are needed, which is about 18 * 10^12 bytes, so no better than your text files.
Of course, that's if you have just a basic table. Since you don't plan on querying it very often, I guess performance isn't that much of an issue. So you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square, and each 'intersection' would be a byte representing the relationship between its coordinates. This would "only" require 4 * 10^12 bytes (about 3.6 TiB), but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
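The two estimates so far can be reproduced with a short calculation. This is only a back-of-the-envelope sketch in Python: it assumes 4-byte ids and a 1-byte score as described above, ignores all database overhead, and uses the exact pair count n(n-1)/2 rather than n^2/2.

```python
N_DOCS = 2_000_000

# Layout 1: one (id1, id2, score) record per unordered pair of documents.
pairs = N_DOCS * (N_DOCS - 1) // 2
pair_table_bytes = pairs * (4 + 4 + 1)

# Layout 2: a full N x N square with one byte per cell, no ids stored.
square_matrix_bytes = N_DOCS * N_DOCS

def to_tib(n_bytes: int) -> float:
    """Convert bytes to tebibytes (2**40 bytes)."""
    return n_bytes / 2**40

print(f"pairs:         {pairs:,}")
print(f"pair table:    {pair_table_bytes:,} bytes = {to_tib(pair_table_bytes):.1f} TiB")
print(f"square matrix: {square_matrix_bytes:,} bytes = {to_tib(square_matrix_bytes):.1f} TiB")
```

Dropping the explicit ids is where most of the saving comes from; the square still stores every pair twice.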
So I'd suggest using a hybrid approach: a table with 2 columns. The first column would hold the 'left' document id (4 bytes); the 2nd column would hold, as a varbinary, a string with one byte for every document whose id is above the id in the first column. Since a varbinary only takes the space that it needs, this helps us win back the space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2,000,000-1) bytes as value for the 2nd column
record 2 would have a string of (2,000,000-2) bytes as value for the 2nd column
record 3 would have a string of (2,000,000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (including overhead) to store the information. Add compression and I'm pretty sure you can store it on a modern disk.
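The row layout above can be modelled in memory to see how a lookup would work. This is a toy Python sketch, not a database schema: `TriangularStore` and `encode` are hypothetical names, document ids are assumed to be 1-based, and a `bytearray` stands in for the varbinary column.

```python
def encode(score: float) -> int:
    """Map a similarity in 0.00-1.00 to an integer 0..100 that fits in one byte."""
    return round(score * 100)

class TriangularStore:
    def __init__(self, n_docs: int):
        self.n_docs = n_docs
        # rows[i - 1] is the byte string of record i: one byte for each
        # document with an id above i, so its length is n_docs - i.
        self.rows = [bytearray(n_docs - i) for i in range(1, n_docs + 1)]

    def _locate(self, a: int, b: int):
        if a == b:
            raise ValueError("a document is not paired with itself")
        lo, hi = sorted((a, b))        # symmetry: (a, b) and (b, a) share one cell
        return lo - 1, hi - lo - 1     # (row number, byte offset inside the row)

    def set(self, a: int, b: int, score: float) -> None:
        row, offset = self._locate(a, b)
        self.rows[row][offset] = encode(score)

    def get(self, a: int, b: int) -> float:
        row, offset = self._locate(a, b)
        return self.rows[row][offset] / 100

    def total_bytes(self) -> int:
        return sum(len(row) for row in self.rows)

store = TriangularStore(5)
store.set(1, 2, 0.35)
store.set(4, 2, 0.99)
print(store.get(2, 1), store.get(2, 4), store.total_bytes())
```

With n documents the payload is n(n-1)/2 bytes, which for 2,000,000 documents is the roughly 2 TB mentioned above.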
Of course, the system is far from optimal. In fact, querying the information will require some patience, as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record, plus 1 extra record at the end. Operations like that will be costly, though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah... technicalities.
Selecting and updating will require some creative use of SubString() operations, but nothing too complex.
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could store 8 values in 7 bytes and save another eighth of the space (roughly 250 GB of the 2 TB), but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
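Because values in 0..100 fit in 7 bits, eight of them fit in 56 bits, which is exactly 7 bytes. A minimal Python sketch of that packing follows; `pack8` and `unpack8` are hypothetical helper names, and the big-endian layout is an arbitrary choice.

```python
def pack8(values):
    """Pack eight integers in 0..127 into 7 bytes (8 * 7 bits = 56 bits)."""
    if len(values) != 8 or any(not 0 <= v <= 127 for v in values):
        raise ValueError("need exactly eight values in 0..127")
    bits = 0
    for v in values:
        bits = (bits << 7) | v      # append 7 bits per value
    return bits.to_bytes(7, "big")

def unpack8(blob):
    """Inverse of pack8: recover the eight 7-bit values in their original order."""
    bits = int.from_bytes(blob, "big")
    return [(bits >> (7 * i)) & 0x7F for i in range(7, -1, -1)]

scores = [0, 100, 35, 42, 99, 4, 45, 38]
packed = pack8(scores)
print(len(packed), unpack8(packed) == scores)
```

The cost is that a single score no longer sits at a whole-byte offset, so every read and update needs this shifting and masking.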
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

NSMutableArray memory allocation

I have an NSMutableArray with 14 indexes acting as a global NSArray; let's call it 'A'. At each of these indexes I have a sub-array (so I have 14 sub-arrays within 'A').
These arrays then form the data for my UITableViews.
If I check with the server and download a new sub-array (on a background thread), let's call it 'B', and want to replace one of the arrays in 'A' (on the main thread), is it safe to be reading from one index within 'A' whilst concurrently rewriting a different array at a different index (I emphasise it will never be the same index)? Will it cause any memory issues?
My knowledge of how memory allocation and pointers work is limited and I can't seem to find information on this.
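Yes, you will be fine as long as the swap into 'A' happens on the main thread, as you describe, and the background thread only builds 'B' without touching 'A' or a sub-array that is currently being displayed. Apple documents NSMutableArray as not thread-safe, so mutating 'A' on one thread while another thread reads it is not guaranteed to be safe, even at a different index. If the sub-arrays are themselves NSMutableArrays, you can either rewrite the data within them or remove one and add a new array in its place, again on the main thread.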
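One common way to keep this kind of update safe is to build the new sub-array entirely on the background thread and then hand it to the main thread, which performs the swap itself. The sketch below shows the pattern in Python as an analogy only, not Foundation code: `download` is a hypothetical stand-in for the server fetch, and the queue plays the role that dispatching a block to the main queue would play in Cocoa.

```python
import queue
import threading

# 'A': the shared container with 14 sub-arrays, only ever touched by the main thread.
A = [[f"old-{i}"] for i in range(14)]
updates = queue.Queue()  # thread-safe hand-off channel

def download(index: int) -> None:
    """Hypothetical fetch: builds 'B' without reading or writing 'A'."""
    B = [f"new-{index}-{k}" for k in range(3)]
    updates.put((index, B))

worker = threading.Thread(target=download, args=(5,))
worker.start()
worker.join()  # a real UI would keep running here instead of blocking

# Back on the main thread: replace the sub-array in one step.
index, B = updates.get()
A[index] = B
print(A[5], A[4])
```

Because only the main thread ever reads or writes 'A', there is no concurrent access to reason about.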

Is varchar(128) better than varchar(100)

Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
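A toy model makes the point concrete. The Python sketch below is not how any particular DBMS stores rows; it assumes the layout described above (the bytes of the string plus a 1-byte length prefix) and treats the declared limit as a byte count. `stored_size` is a hypothetical helper name.

```python
def stored_size(value: str, declared_max: int, length_prefix_bytes: int = 1) -> int:
    """Bytes used on disk under the assumed layout: length prefix + actual data."""
    data = value.encode("utf-8")
    if len(data) > declared_max:
        raise ValueError(f"value does not fit in VARCHAR({declared_max})")
    # The declared maximum only acts as a limit; it does not pad the value.
    return length_prefix_bytes + len(data)

print(stored_size("hello world", 100))
print(stored_size("hello world", 128))
print(stored_size("x" * 16, 100))
```

Under this model the declared maximum only matters once a value actually exceeds it.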
If this were a C program, I'd spend some time thinking about that, too. But with a database I'd leave it to the DB engine.
DB programmers have spent a lot of time thinking about the best memory layout, so just tell the database what you need and it will (usually) store the data in the way that suits the DB engine best.
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? Are one, two, or four bytes used to store the length? Is it stored as a plain byte sequence or encoded in UTF-8, UTF-16, or UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence, in which case one more byte is needed internally.
Also, with VARCHAR it is not necessarily true that the DB will always allocate 100 (or 128) bytes for your string. Maybe it stores just a pointer to where the space for the actual data is.
So I'd strongly suggest using VARCHAR(100) if that is your requirement. If the DB decides to align it somehow, there's room for extra internal data, too.
The other way around: let's assume you use VARCHAR(128) and everything comes together: the DB allocates 128 bytes for your data. Additionally, it needs 2 more bytes to store the actual string length, which makes 130 bytes, and then the DB might align the data to the next (let's say 32-byte) boundary. The actual space needed on disk is now 160 bytes 8-}
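The worst case described here is just a round-up to the next boundary. A small Python sketch of that arithmetic, using the answer's own hypothetical numbers (a 2-byte length field and 32-byte alignment, neither of which is taken from a real DBMS):

```python
def aligned_size(payload_bytes: int, length_bytes: int = 2, boundary: int = 32) -> int:
    """Round payload plus length field up to the next multiple of the boundary."""
    raw = payload_bytes + length_bytes
    return -(-raw // boundary) * boundary  # ceiling division, then scale back up

print(aligned_size(128))  # 128 + 2 = 130, rounded up to the next 32-byte boundary
print(aligned_size(100))  # 100 + 2 = 102, rounded up to the next 32-byte boundary
```

Under these assumptions the "round" binary size is the one that spills over a boundary, while the decimal one leaves room for the extra internal data.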
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several blocks will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.

Using NSDecimalNumber in objective-c

I have a calculation that goes something like this:
Price = value * randomNumberBetween(decimalValueA, decimalValueB)
I was originally generating this using floats/doubles. However, after looking up a bit more on objective-c, it was mentioned numerous times that when calculating currency you should use NSDecimalNumber.
The issue I have is that I use this 'price' variable in comparisons and things, for example:
if ((deposit / price) < 0.2)
    return price * 0.05;
Using NSDecimalNumber makes this a lot more difficult. As far as I'm aware I should be converting any magic numbers (in this case 0.2 and 0.05) to NSDecimalNumber so then I can compare them and use functions such as NSDecimalMultiply.
Also, if I have a function that is something like:
return minRandomPercentage + ((maxRandomPercentage - minRandomPercentage) * randomNumber);
it ends up becoming this ridiculous string of nested method calls like:
return [minRandomPercentage decimalNumberByAdding:[[maxRandomPercentage decimalNumberBySubtracting:minRandomPercentage] decimalNumberByMultiplyingBy:random]];
Is this seriously how objective-c deals with decimals? Can anyone give me any clues on how to make this a lot less arduous? I can live with the nested function calls if I could do comparisons with the result and not have to be casting every magic number I have.
If you can't afford to deal with the rounding errors that can occur with the standard base-2 floating point types, you'll have to use NSDecimal or NSDecimalNumber. NSDecimal is a C struct, and Foundation provides a C interface for dealing with it. It provides functions NSDecimalAdd, NSDecimalMultiply, etc.
From the Number and Value Programming Guide: You might consider the C interface if you don’t need to treat decimal numbers as objects—that is, if you don’t need to store them in an object-oriented collection like an instance of NSArray or NSDictionary. You might also consider the C interface if you need maximum efficiency. The C interface is faster and uses less memory than the NSDecimalNumber class.
If you're writing object-oriented code, and you're not interacting with massive data sets, it might be best to stick with NSDecimalNumber. If you profile your code and find that using NSDecimalNumber is causing a high memory overhead, then you may need to consider alternatives.
If rounding errors are not a concern, you can also use native C scalars. See: How to add two NSNumber objects?
NSNumber and NSDecimalNumber are used as object wrappers when you need to pass a number to a method or store numbers in a collection. Since NSArray, NSSet, NSDictionary, etc. only allow you to store objects of type 'id', you can't store ints, floats, etc. natively.
If you're dealing with large data sets and can afford rounding errors, you can use ints, floats, doubles, etc. raw. Then when you have your result and you need to store it or pass it to another object, you can wrap it up in an NSNumber accordingly.
If you do have a need to store large collections of numbers, it's much more efficient to use C arrays than to initialize and store lots of NSNumber objects.
Seriously, this is how you do base 10 arithmetic in iOS. As you're probably aware, many numbers that have exact representations in base 10 don't have exact representations in base 2, and that can lead to unacceptable rounding when working with base 10 systems like currency or metric measurements.
Values represented by NSDecimalNumber are objects, unlike built-in numeric types like int, float, and double. It seems odd at first to use methods for arithmetic operations, but it makes more sense when you start thinking about the values as objects.
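The base-2 versus base-10 issue is not specific to Objective-C, and it is easy to see in any language that offers both kinds of number. The sketch below uses Python's standard `decimal` module as a stand-in for NSDecimalNumber. `fee` is a hypothetical function modelled on the comparison in the question, and returning zero when the condition fails is an assumption, since the question does not show that branch.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact base-2 representation.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)                # False

# Base-10 arithmetic: the same values are represented exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))   # True

# The "magic numbers" become decimal constants created once, from strings.
THRESHOLD = Decimal("0.2")
RATE = Decimal("0.05")

def fee(deposit: Decimal, price: Decimal) -> Decimal:
    """Hypothetical version of the question's check, using base-10 values throughout."""
    if deposit / price < THRESHOLD:
        return price * RATE
    return Decimal("0")  # assumption: the question does not show this branch

print(fee(Decimal("10"), Decimal("100")))
```

Defining the constants once, as decimal values, is the same idea as converting the magic numbers to NSDecimalNumber a single time rather than at every comparison.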