Is there a limit to the number of rows Pandas read_csv can load? - pandas

I am trying to load a .csv file using the Pandas read_csv method; the file has 29872046 rows and its total size is 2.2G.
I notice that most of the loaded lines are missing values for a large number of columns. The csv file, when browsed from the shell, does contain those values...
Are there any limits on the files read_csv can load? If not, how could this be debugged?
Thanks

#d1337,
I wonder if you have memory issues. There is a hint of this here.
Possibly this is relevant or this.
If I were attempting to debug it I would do the simple thing: cut the file in half - what happens? If it's OK, go up 50%, if not, go down 50%, until you can identify the point where it's happening. You might even want to start with 20 lines and just make sure the problem is size-related.
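A minimal sketch of that bisection approach using pandas itself (the file name is a placeholder and the nrows/chunksize values are just starting points):
import pandas as pd

# Read only the first n rows; if they come back intact, increase n
# (or decrease it on failure) until the point where values go missing.
n = 20
df = pd.read_csv("data.csv", nrows=n)
print(df.isnull().sum())        # missing values per column

# Reading in chunks also keeps memory bounded for a 2.2 GB file.
for chunk in pd.read_csv("data.csv", chunksize=1000000):
    print(len(chunk), chunk.isnull().sum().sum())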
I'd also add OS and memory information, plus the version of Pandas you're using, to your post in case it's relevant (I'm running Pandas 0.11.0, Python 3.2, Linux Mint x64 with 16 GB of RAM, so I'd expect no issues, for example). Also, if possible, you might post a link to your data so that someone else can test it.
Hope that helps.

Related

Grib2 data extraction with xarray and cfgrib very slow, how to improve the code?

The code is taking about 20 minutes to load one month for each variable, with 168 time steps for the 00 and 12 UTC cycles of each day. When it comes to saving to CSV the code takes even longer; it has been running for almost a day and still hasn't saved a single station. How can I improve the code below?
Reading .grib files using xr.open_mfdataset() and cfgrib:
I can speak to the slowness of reading grib files using xr.open_mfdataset(). I had a similar task where I was reading in many grib files using xarray and it was taking forever. Other people have experienced similar issues with this as well (see here).
According to the issue raised here, "cfgrib is not optimized to handle files with a huge number of fields even if they are small."
One thing that worked for me was converting as many of the individual grib files as I could to one (or several) netcdf files and then reading the newly created netcdf file(s) into xarray instead. Here is a link to show you how you could do this with several different methods. I went with the grib_to_netcdf command from the ecCodes tool.
In summary, I would start with converting your grib files to netcdf, as that should let xarray read the data in a much more performant manner. Then you can focus on other optimizations further down in your code.
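Once converted, reading the netcdf files back is simple; a minimal sketch, assuming the converted files sit in a local converted/ directory and using a hypothetical variable name:
import xarray as xr

# Open all converted netcdf files lazily as one dataset; netcdf reads tend
# to be much faster than opening many small grib fields through cfgrib.
ds = xr.open_mfdataset("converted/*.nc", combine="by_coords")
print(ds)

# A single variable can then be flattened and written out via pandas,
# e.g. (the variable name "t2m" is hypothetical):
ds["t2m"].to_dataframe().to_csv("t2m.csv")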
I hope this helps!

Unable to open or view data with python

I have a netcdf3 dataset that I need to get into a data frame (preferably pandas) for further manipulation, but am unable to do so. Since the data is multi-hierarchical, I am first reading it into xarray and then converting it to pandas using the to_dataframe method. This has worked well on other data, but in this case it just kills the kernel. I have tried converting it to a netcdf4 file using ncks, but I still cannot open it. I have also tried calling a single row or column of the xarray data structure just to view it, and this similarly kills the kernel. I produced this data, so it is probably not corrupted. Does anyone have any experience with this? The file size is about 890 MB, if that's relevant.
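Roughly, the workflow is the following (the file name is a placeholder):
import xarray as xr

# Open the netcdf3 dataset lazily, then flatten it to a pandas DataFrame.
# It is the to_dataframe() step (and even simple indexing) that kills the kernel.
ds = xr.open_dataset("model_output.nc")   # placeholder file name
df = ds.to_dataframe()
print(df.head())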
I am running Python 2.7 on a Mac.

How to handle a very big array in vb.net

I have a program producing a lot of data, which it writes to a csv file line by line (as the data is created). If I were able to open the csv file in Excel it would be about 1 billion cells (75,000 * 14,600). I get a System.OutOfMemoryException every time I try to access it (or even create an array of this size). If anyone has any idea how I can get the data into vb.net so I can do some simple operations (all data needs to be available at once), I'll try every idea you have.
I've looked at increasing the amount of RAM used, but other articles/posts say this will run short well before the 1 billion mark. There are no issues with time here; assuming it's no more than a few days/weeks I can deal with it (I'll only be running it once or twice a year). If you don't know any way to do it, the only other solutions I can think of would be increasing the number of columns in Excel to ~75,000 (if that's possible - I can't write the data the other way around), or I suppose another language that could handle this?
At present it fails right at the start:
Dim bigmatrix(75000, 14600) As Double
Many thanks,
Fraser :)
First, this will always require a 64-bit operating system and a fairly large amount of RAM, as you're trying to allocate roughly 8.8 GB (75,001 * 14,601 * 8 bytes).
This is theoretically possible in Visual Basic targeting .NET 4.5 if you turn on gcAllowVeryLargeObjects. That being said, I would recommend using a jagged array instead of a multidimensional array if possible, as this avoids needing a single contiguous 8+ GB allocation. (It also potentially allows it to work in .NET 4 or earlier.)
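Since the question leaves the door open to other languages, here is a purely illustrative sketch of a disk-backed array in Python using numpy.memmap (this is not the .NET jagged-array route recommended above; the file name is a placeholder):
import numpy as np

# A 75,000 x 14,600 array of doubles backed by a file on disk; the OS pages
# pieces in and out on demand, so ~8 GB of RAM is never needed at once.
bigmatrix = np.memmap("bigmatrix.dat", dtype=np.float64,
                      mode="w+", shape=(75000, 14600))

bigmatrix[0, 0] = 1.0      # reads and writes look like a normal array
print(bigmatrix[0, :5])
bigmatrix.flush()          # push dirty pages back to the file on disk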

Spark RDD.saveAsTextFile writing empty files to S3

I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.
The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")
However, I'm having a bunch of trouble getting the writes to work. I'm using the same bucket to output the files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
When my input is extremely small (< 100 records), this seems to work fine. I'm seeing a part-NNNNN file written per partition, with some of those files having 0 bytes and the rest being under 1 KB. Spot checking the non-empty files shows the correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I'm seeing the same number of part-NNNNN files (the number of partitions is constant for these experiments), but each one is empty.
When I was experimenting with much bigger data sets (millions of records), my thought was that I was exceeding some S3 limit that was causing this problem. However, 500 records (which amounts to ~65 KB zipped) is still a trivially small amount of data that I would think Spark and S3 should handle easily.
I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem as outlined here, but I get the same results. I've turned on logging for my S3 bucket, but I can't seem to find a smoking gun there.
Has anyone else experienced this? Or can otherwise give me a clue as to what might be going wrong?
Turns out I was working on this too late at night. This morning, I took a step back and found a bug in my map-reduce which was effectively filtering out all the results.
You should use coalesce() before saveAsTextFile().
From the Spark programming guide:
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
e.g.:
outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")
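For completeness, a minimal PySpark sketch of the whole pipeline with the coalesce step in place (the bucket, paths, and map function are placeholders for the real job):
from pyspark import SparkContext

sc = SparkContext(appName="s3-output-example")

# Read the gzipped input from S3 (placeholder bucket/path).
inputRDD = sc.textFile("s3n://bucket/path/to/file/file.gz")

# Placeholder transformation standing in for the real map-reduce logic.
outputRDD = inputRDD.map(lambda line: line.strip())

# Reduce the partition count before writing so S3 receives fewer, larger files.
outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")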

On Disk Substring index

I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.
This would be easy to do in many cases using a trie or a suffix array; unfortunately, the strings I need to index total 800+ MB, which means doing this in memory is unacceptable, so I'm looking for a reasonable way to create this index on disk with minimal memory usage.
(edit for clarification)
I am only interested in the headers of the proteins, so for the largest database I'm interested in, this is about 800 MB of text.
I would like to be able to find an exact substring in O(N) time, where N is the length of the input string. This must be usable on 32-bit machines, as it will be shipped to random people who are not expected to have 64-bit machines.
I want to be able to index against any word break within a line, to the end of the line (though lines can be several MB long).
Hopefully this clarifies what is needed and why the current solutions given are not illuminating.
I should also add that this needs to be done from within Java, and must work on client computers running various operating systems, so I can't use any OS-specific solution; it must be a programmatic solution.
In some languages programmers have access to "direct byte arrays" or "memory maps", which are provided by the OS. In Java we have java.nio.MappedByteBuffer. This allows one to work with the data as if it were a byte array in memory, when in fact it is on disk. The size of the file one can work with is limited only by the OS's virtual memory capabilities, which is typically less than ~4 GB for 32-bit computers. 64-bit? In theory 16 exabytes (about 17.2 billion GB), but I think modern CPUs are limited to a 40-bit (1 TB) or 48-bit (256 TB) address space.
This would let you easily work with the one big file.
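The same memory-mapping idea, sketched in Python rather than Java purely to illustrate the access pattern (the file name and the substring are placeholders):
import mmap

# Memory-map the FASTA file read-only; the OS pages data in on demand,
# so the whole 800+ MB file is never loaded into RAM at once.
with open("proteins.fasta", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The search runs directly against the mapped bytes; find() returns
    # the byte offset of the match in the original file, or -1 if absent.
    offset = mm.find(b"HEADER_SUBSTRING")
    print(offset)
    mm.close()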
The FASTA file format is very sparse. The first thing I would do is generate a compact binary format, and index that - it should be maybe 20-30% the size of your current file, and the process for coding/decoding the data should be fast enough (even with 4GB) that it won't be an issue.
At that point, your file should fit within memory, even on a 32 bit machine. Let the OS page it, or make a ramdisk if you want to be certain it's all in memory.
Keep in mind that memory is only around $30 a GB (and getting cheaper) so if you have a 64 bit OS then you can even deal with the complete file in memory without encoding it into a more compact format.
Good luck!
-Adam
I talked to a few co-workers and they just use VIM/Grep to search when they need to. Most of the time I wouldn't expect someone to search for a substring like this though.
But I don't see why MS Desktop Search or Spotlight or Google's equivalent can't help you here.
My recommendation is splitting the file up, by gene or species; hopefully the input sequences aren't interleaved.
I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack
It uses an index file to count newlines and sequence start offsets. Once the index is generated you can rapidly extract subsequences; the extraction is driven by fseek64.
It will work very, very well when your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the output of short-read sequencing or some de novo assemblies), you will want to use another solution, such as a disk-backed key-value store.
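Not fastahack itself, but a minimal sketch of the same offset-index idea in Python (file name and header are placeholders): scan the file once, record the byte offset of every header, and later seek() straight to a record.
def build_header_index(path):
    # Map each FASTA header line to the byte offset where its record starts,
    # so a later lookup can seek there without rescanning the whole file.
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b">"):
                index[line.rstrip(b"\r\n")] = offset
            offset += len(line)
    return index

index = build_header_index("proteins.fasta")      # placeholder file name
with open("proteins.fasta", "rb") as f:
    f.seek(index[b">some_protein_header"])        # hypothetical header
    print(f.readline())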