I have a program producing a lot of data, which it writes to a CSV file line by line (as the data is created). If I were able to open the CSV file in Excel it would be about 1 billion cells (75,000 * 14,600). I get a System.OutOfMemoryException thrown every time I try to access it (or even create an array of this size). If anyone has any idea how I can get the data into VB.NET so I can do some simple operations (all data needs to be available at once), then I'll try every idea you have.
I've looked at increasing the amount of RAM used, but other articles/posts say this will run short well before the 1 billion mark. There are no issues with time here; assuming it's no more than a few days/weeks I can deal with it (I'll only be running it once or twice a year). If you don't know any way to do it, the only other solutions I can think of would be increasing the number of columns in Excel to ~75,000 (if that's possible - I can't write the data the other way around), or I suppose another language that could handle this?
At present it fails right at the start:
Dim bigmatrix(75000, 14600) As Double
Many thanks,
Fraser :)
First, this will always require a 64-bit operating system and a fairly large amount of RAM, as you're trying to allocate about 8 GB.
This is theoretically possible in Visual Basic targeting .NET 4.5 if you turn on gcAllowVeryLargeObjects. That said, I would recommend using a jagged array instead of a multidimensional array if possible, as this removes the need for a single 8 GB allocation. (It will also potentially allow it to work in .NET 4 or earlier.)
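For what it's worth, here is a rough sketch of the jagged idea, written in Java only because it makes the memory layout easy to see; the VB.NET version would be a Double()() jagged array built the same way, row by row. Either way you still need a 64-bit process and roughly 9 GB of free memory.

```java
// Sketch only: Java is used here just to show the memory layout; the
// VB.NET version is a Double()() jagged array built the same way.
public class JaggedMatrixSketch {
    public static void main(String[] args) {
        final int rows = 75_000;
        final int cols = 14_600;

        // Each row is its own ~114 KB allocation (14,600 * 8 bytes), so the
        // runtime never needs one contiguous ~8 GB block. The total heap used
        // is still roughly 8.8 GB, so run this on a 64-bit JVM with a large
        // heap (e.g. -Xmx10g).
        double[][] bigmatrix = new double[rows][];
        for (int i = 0; i < rows; i++) {
            bigmatrix[i] = new double[cols];
        }

        bigmatrix[0][0] = 1.23; // indexed just like the rectangular version
        System.out.println(bigmatrix[0][0]);
    }
}
```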
Was hoping for some advice on best practice here. Working in Objective-C and Xcode.
I made a "FileConverter" class which has a method that reads a CSV file with 7 columns of float values into a SQLite database (after verifying the data and parsing it).
The way I have done this is to load the whole file into an NSString, then split it into row components, then split each row into column components (saving the result as a two-dimensional NSArray, i.e. an array of row arrays).
I then open the database and copy the array into the SQLite database. I'm using the TEXT datatype for storage at the moment. Once there, I plan to graph the data.
It seems to check and convert the CSV OK. However, if the CSV is quite long (say, 10,000 rows), I get the spinning wheel for several seconds while it does its work. For shorter files it converts almost instantly.
Ultimately, at the point the user clicks "Convert CSV", I will also be running another method which will graph the data, and I expect this will result in a huge delay while it reads the SQL database, assembles the data into CGPoints and then draws into the graph view.
My question is about how best to optimise the process, so it can handle the larger files without spinning wheels appearing. Is this possible?
a) Using NSStrings and NSArrays certainly makes the job of reading and splitting up the data super simple and makes verifying the data easy. Is this the best way? Should I malloc a float array instead?
b) I'm working on the basis that by saving the data as TEXT values in the database, converting them to CGFloat values will be straightforward, but realise this will add processing time.
c) I'm imagining that a sqlite3 database would be a faster way to get the data when I come to graphing it, but I could also simply copy the CSV file and parse it at the point of graphing the data.
Really appreciate advice on this
Your main thread runs the UI, and you're doing your CSV computation on that same thread, so the UI hangs (stops updating) and the OS shows the spinning wheel. To avoid this you need to move your compute-intensive operations to secondary threads.
When the CSV file is small, a synchronous operation can perform the computation and update the UI almost immediately. With big files, the UI (app window) has to wait for the computations to finish before it can be updated. In such cases you need to run the CSV calculations asynchronously on a background thread. There are many ways to achieve this, the most popular being Apple's GCD; a general sketch of the pattern follows the links below.
The following links, Apple's guide to a non-blocking UI and the GCD reference, explain it in detail:
Link to Apple's non blocking guide
Link to GCD
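The shape is the same in any UI framework: do the heavy parsing on a worker and only touch the UI once the result is ready. Here is a language-neutral sketch of that pattern (in Java with a plain ExecutorService, not GCD; parseCsv and onParsingFinished are made-up names standing in for your conversion and graphing code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BackgroundParseSketch {
    // A single worker thread for the heavy lifting, so button clicks return
    // immediately and the UI keeps redrawing.
    private static final ExecutorService worker = Executors.newSingleThreadExecutor();

    static void convertButtonClicked(String csvPath) {
        // Submitting returns right away; the parse runs off the UI thread.
        worker.submit(() -> {
            List<float[]> rows = parseCsv(csvPath); // slow part, in the background
            onParsingFinished(rows);                // in a real app, marshal back to the UI/main thread here
        });
    }

    // Placeholder for the actual read/split/validate logic.
    static List<float[]> parseCsv(String path) {
        return List.of();
    }

    // Placeholder for updating the UI / starting the graphing step.
    static void onParsingFinished(List<float[]> rows) {
        System.out.println("Parsed " + rows.size() + " rows");
    }
}
```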
I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows), which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc. in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50 GB of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I think perhaps the background operations are causing Stata to run into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and utilize informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation, such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily, such as comparing values across time, won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this, but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have rewritten some of my Stata data-manipulation code in SQL and the response time is much quicker. I believe I can make large optimization gains by correctly utilizing indexes and using parallel processing via partitions/shards if necessary. After all the data manipulation has been done, I can import that data into Stata via ODBC.
Since you are familiar with Stata, there is a well-documented FAQ about large datasets in Stata, Dealing with Large Datasets, which you might find helpful.
I would clean by columns: split them up, run any specific cleaning routines, and merge them back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.
I'm writing an application in vb.net 2005. The app reads a spreadsheet into a DataSet with ADO.NET and uses a column of that table to populate a ListBox. When a ListBox Item is selected, the user will be presented with detailed information on the selected record.
One part of this information isn't in the DataSet. I have to compare a column from the spreadsheet with several external data sources to determine the nature of the record in question. Here's where I have my problem.
This comparison has to search through 9.5m rows in a SQL table at one stage. I've checked and there's no way to "shrink" the query down as I'm already only searching absolutely essential data.
What happens is that the application never visibly does anything. The CPU usage shoots up to 100% regardless of what it was at beforehand and the system's performance becomes almost unbearably slow.
Can anyone suggest a way I can improve this situation while this massive query is running?
EDIT: I was originally going to write the contents of the 9.5m rows in the database table to a text file which I'd then read from, but after 6.5m rows, I got an OutOfMemoryException.
I suspect your CPU might be being used up populating the DataSet, though you would have to profile your application to confirm that. Try using a DataReader instead and either store the results in some more compact format in memory or, if you're running out of memory, write them to a file as you go. With the DataReader approach you never need to hold the entire result set in memory at once.
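To make the streaming idea concrete, here is a sketch in Java/JDBC rather than ADO.NET (the connection string, table and column names are placeholders): read forward-only and write each row out as you go, so the full 9.5m rows are never held in memory at once. The ADO.NET shape is the same: ExecuteReader, then a While reader.Read() loop that writes each row out.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamRowsToFile {
    public static void main(String[] args) throws SQLException, IOException {
        // Connection string, table and column names are placeholders.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://server;databaseName=db;user=u;password=p");
             Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("lookup.txt"))) {

            st.setFetchSize(10_000); // pull rows in batches, never the whole result set

            try (ResultSet rs = st.executeQuery("SELECT key_column FROM big_table")) {
                while (rs.next()) {
                    // One row in, one row out: 9.5m rows never become
                    // 9.5m in-memory objects.
                    out.write(rs.getString(1));
                    out.newLine();
                }
            }
        }
    }
}
```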
An index on the column you search?
A new field in the table to help search faster?
I'm working on a project in Objective-C where I need to work with large quantities of data stored in an NSDictionary (around ~2 GB in RAM at most). After all the computations that I perform on it, it seems like it would be quicker to save/load the data when needed (versus re-parsing the original file).
So I started to look into saving large amounts of data. I've tried using NSKeyedUnarchiver and [NSDictionary writeToFile:atomically:], but both failed with malloc errors (Can not allocate ____ bytes).
I've looked around SO, Apple's dev forums and Google, but was unable to find anything. I'm wondering if it might be better to create the file bit by bit instead of all at once, but I can't find any way to append to an existing file. I'm not completely opposed to saving with a bunch of small files, but I would much rather use one big file.
Thanks!
Edited to include more information: I'm not sure how much overhead NSDictionary gives me, as I don't take all the information from the text files. I have a 1.5 GB file (of which I keep ~1/2), and it turns out to be around 900 MB to 1 GB in RAM. There will be some more data that I need to add eventually, but it will be constructed with references to what's already loaded into memory - it shouldn't double the size, but it may come close.
The data is all serial, and could be separated in storage, but needs to all be in memory for execution. I currently have integer/string pairs, and will eventually end up with string/string pairs (with all the values also being keys for a different set of strings, so the final storage requirements will be the same strings that I currently have, plus a bunch of references).
In the end, I will need to associate ~3 million strings with some other set of strings. However, the only important thing is the relationship between those strings - I could hash all of them, but NSNumber (as NSDictionary needs objects) might give me just as much overhead.
NSDictionary isn't going to give you the scalable storage that you're looking for, at least not for persistence. You should implement your own type of data structure/serialisation process.
Have you considered using an embedded SQLite database? Then you can process the data while only loading a fragment of the data structure at a time.
If you can, rebuilding your application in 64-bit mode will give you a much larger heap space.
If that's not an option for you, you'll need to create your own data structure and define your own load/save routines that don't allocate as much memory.
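To make "load/save routines that don't allocate as much memory" concrete, here is a minimal sketch of incremental serialisation, written in Java because the idea is language-agnostic (an Objective-C version would stream through an NSOutputStream or NSFileHandle in the same way): entries are written one at a time, so saving never builds a single dictionary-sized blob in memory.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ChunkedMapStore {
    // Save entry by entry: at no point is a single archive the size of the
    // whole map built in memory. (writeUTF caps a string at ~64 KB; longer
    // values would need a length-prefixed byte[] instead.)
    static void save(Map<Integer, String> map, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeInt(map.size());
            for (Map.Entry<Integer, String> e : map.entrySet()) {
                out.writeInt(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
    }

    // Load the same way, one entry at a time.
    static Map<Integer, String> load(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int n = in.readInt();
            Map<Integer, String> map = new HashMap<>(n * 2);
            for (int i = 0; i < n; i++) {
                map.put(in.readInt(), in.readUTF());
            }
            return map;
        }
    }
}
```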
I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.
This would be easy to do in many cases using a trie or suffix array; unfortunately the strings I need to index are 800+ MB, which means that building the index in memory is unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.
(edit for clarification)
I am only interested in the headers of proteins, so for the largest database I'm interested in, this is about 800 MBs of text.
I would like to be able to find an exact substring in O(N) time in the length of the input string. This must be usable on 32-bit machines, as it will be shipped to random people who are not expected to have 64-bit machines.
I want to be able to index against any word break within a line, to the end of the line (though lines can be several MBs long).
Hopefully this clarifies what is needed and why the current solutions given are not illuminating.
I should also add that this needs to be done from within Java, and must be done on client computers on various operating systems, so I can't use any OS-specific solution, and it must be a programmatic solution.
In some languages programmers have access to "direct byte arrays" or "memory maps", which are provided by the OS. In Java we have java.nio.MappedByteBuffer. This allows one to work with the data as if it were a byte array in memory, when in fact it is on disk. The size of the file one can work with is limited only by the OS's virtual memory capabilities, and is typically less than 4 GB for 32-bit computers. 64-bit? In theory 16 exabytes (17.2 billion GB), but I think modern CPUs are limited to a 40-bit (1 TB) or 48-bit (128 TB) address space.
This would let you easily work with the one big file.
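A minimal sketch of the mapping itself (the file name is a placeholder; note that a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes, roughly 2 GB, so an 800 MB file fits in one map but larger files have to be mapped in windows):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFastaSketch {
    public static void main(String[] args) throws IOException {
        // File name is a placeholder for your FASTA file.
        try (FileChannel ch = FileChannel.open(Paths.get("proteins.fasta"),
                                               StandardOpenOption.READ)) {
            // One MappedByteBuffer is capped at Integer.MAX_VALUE bytes (~2 GB),
            // so an 800 MB file fits in a single map; bigger files need to be
            // mapped in windows.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            // Random access without reading the file onto the heap; the OS
            // pages the data in and out as needed.
            byte first = buf.get(0);                      // '>' if the file starts with a header
            byte middle = buf.get((int) (ch.size() / 2)); // some byte near the middle
            System.out.printf("first=%c, middle=%c%n", (char) first, (char) middle);
        }
    }
}
```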
The FASTA file format is very sparse. The first thing I would do is generate a compact binary format, and index that - it should be maybe 20-30% the size of your current file, and the process for coding/decoding the data should be fast enough (even with 4GB) that it won't be an issue.
At that point, your file should fit within memory, even on a 32 bit machine. Let the OS page it, or make a ramdisk if you want to be certain it's all in memory.
Keep in mind that memory is only around $30 a GB (and getting cheaper) so if you have a 64 bit OS then you can even deal with the complete file in memory without encoding it into a more compact format.
Good luck!
-Adam
I talked to a few co-workers and they just use Vim/grep to search when they need to. Most of the time I wouldn't expect someone to search for a substring like this, though.
But I don't see why MS Desktop Search or Spotlight or Google's equivalent can't help you here.
My recommendation is splitting the file up, by gene or species; hopefully the input sequences aren't interleaved.
I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack
It uses an index file to count newlines and sequence start offsets. Once the index is generated you can rapidly extract subsequences; the extraction is driven by fseek64.
It will work very, very well in the case that your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the outputs from short-read sequencing or some de novo assemblies) you will want to use another solution, such as a disk-backed key-value store.
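For anyone who just wants the flavour of the index-and-seek idea without pulling in a dependency, here is a much simplified sketch in Java (this is not fastahack's on-disk format, just the same principle of recording byte offsets once and seeking to them later):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class FastaOffsetIndexSketch {
    // One pass over the file, recording the byte offset of every record
    // header ('>' line). The index is just a list of longs, tiny next to
    // the 800 MB file, and could itself be written to disk.
    static List<Long> buildHeaderIndex(String fastaPath) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(fastaPath, "r")) {
            long pos = 0;
            String line;
            while ((line = f.readLine()) != null) {
                if (!line.isEmpty() && line.charAt(0) == '>') {
                    offsets.add(pos);
                }
                pos = f.getFilePointer();
            }
        }
        return offsets;
    }

    // Jump straight to the i-th record and read its header line, without
    // touching the rest of the file.
    static String readHeader(String fastaPath, List<Long> index, int i) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(fastaPath, "r")) {
            f.seek(index.get(i));
            return f.readLine();
        }
    }
}
```

RandomAccessFile.readLine() is unbuffered and slow, so the indexing pass of a real tool would buffer its reads; the point here is only that the index is a small list of offsets and lookups become a seek plus one read.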