Unable to open or view data with python - pandas

I have a netCDF3 dataset that I need to get into a data frame (preferably pandas) for further manipulation, but am unable to do so. Since the data is multi-hierarchical, I first read it into xarray and then convert it to pandas using the to_dataframe method. This has worked well on other data, but in this case it just kills the kernel. I have tried converting the file to netCDF4 using ncks, but I still cannot open it. I have also tried calling up a single row or column of the xarray data structure just to view it, and this similarly kills the kernel. I produced this data, so it is probably not corrupted. Does anyone have any experience with this? The file size is about 890 MB, if that's relevant.
I am running Python 2.7 on a Mac.
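Roughly what I am doing, as a minimal sketch (the file name and the 'time' dimension are placeholders for my actual data):

    import xarray as xr

    # Placeholder name for the ~890 MB netCDF3 file described above.
    ds = xr.open_dataset('output.nc')

    # Inspect dimensions/variables without loading everything into memory.
    print(ds)

    # Pull a single slice first (assuming a 'time' dimension exists); if even
    # this kills the kernel, the problem is likely in reading/decoding rather
    # than in the pandas conversion itself.
    subset = ds.isel(time=0)
    df = subset.to_dataframe()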

Related

Grib2 data extraction with xarray and cfgrib very slow, how to improve the code?

The code takes about 20 minutes to load one month of data for each variable, with 168 time steps covering the 00 and 12 UTC cycles of each day. Saving to csv takes even longer: it has been running for almost a day and still hasn't saved anything for a single station. How can I improve the code below?
Reading .grib files using xr.open_mfdataset() and cfgrib:
I can speak to the slowness of reading grib files using xr.open_mfdataset(). I had a similar task where I was reading in many grib files using xarray, and it was taking forever. Other people have experienced similar issues with this as well (see here).
According to the issue raised here, "cfgrib is not optimized to handle files with a huge number of fields even if they are small."
One thing that worked for me was converting as many of the individual grib files as I could into one (or several) netCDF files, and then reading the newly created netCDF file(s) into xarray instead. Here is a link showing several different ways you could do this. I went with the grib_to_netcdf command from the ecCodes tool.
In summary, I would start by converting your grib files to netCDF, as that should let xarray read the data in a more performant manner. Then you can focus on other optimizations further down in your code.
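A rough sketch of that conversion step, assuming ecCodes is installed and using placeholder file names (here driven from Python for convenience):

    import subprocess
    import xarray as xr

    # One-time conversion with the ecCodes CLI mentioned above.
    # 'input.grib' and 'output.nc' are placeholder names.
    subprocess.run(['grib_to_netcdf', '-o', 'output.nc', 'input.grib'], check=True)

    # Read the converted file(s) with xarray's netCDF backend instead of cfgrib.
    ds = xr.open_mfdataset('output*.nc', combine='by_coords')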
I hope this helps!

Not enough disk space when loading dataset with TFDS

I was implementing a DCGAN application based on the lsun-bedroom dataset. I was planning to use tfds, since lsun is in its catalog. Because the total dataset contains 42.7 GB of images, I only wanted to load a portion (10%) of the full data, and used the following code to load the data according to the manual. Unfortunately, the same error occurred, informing me that there is not enough disk space. Is there a possible solution with tfds, or should I use another API to load the data?
tfds.load('lsun/bedroom',split='train[10%:]')
Not enough disk space. Needed: 42.77 GiB (download: 42.77 GiB, generated: Unknown size)
I was testing on Google Colab
TFDS downloads the dataset from the original author's website. As datasets are often published as a monolithic archive (e.g. lsun.zip), it is unfortunately impossible for TFDS to download/install only part of the dataset.
The split argument only filters the dataset after it has been fully generated. Note: you can see the download size of each dataset in the catalog: https://www.tensorflow.org/datasets/catalog/overview
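A small sketch of what that means in practice (the dataset name is just an example): the slice given to split is applied only at read time, after the full dataset has been downloaded and generated.

    import tensorflow_datasets as tfds

    # The full archive is still downloaded and prepared, regardless of the split.
    builder = tfds.builder('lsun/bedroom')
    builder.download_and_prepare()  # requires the full ~42.77 GiB download

    # The slice is applied only when reading the already-generated data.
    ds = builder.as_dataset(split='train[:10%]')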
To me, there seems to be some kind of issue or, at least, a misunderstanding about the 'split' argument of tfds.load().
'split' seems to be intended to load a given portion of the dataset, once the whole dataset has been downloaded.
I got the same error message when downloading the dataset called "librispeech". Whatever I set 'split' to, the whole dataset gets downloaded, which is too big for my disk.
I managed to download the much smaller "mnist" dataset, but I found that both the train and test splits were downloaded even when setting 'split' to 'test'.

apache arrow - adequacy for parallel processing

I have a huge dataset and am using Apache Spark for data processing.
Using Apache Arrow, we can convert a Spark DataFrame to a pandas DataFrame and run operations on it.
By converting the DataFrame, will it achieve the parallel-processing performance seen in Spark, or will it behave like pandas?
As you can see in the documentation here:
Note that even with Arrow, toPandas() results in the collection of all records in the DataFrame to the driver program and should be done on a small subset of the data
The data will be sent to the driver when it is moved into a pandas DataFrame. That means you may have performance issues if there is too much data for the driver to handle. For that reason, if you have decided to use pandas, try to group the data before calling the toPandas() method.
It won't have the same parallelization once it's converted to a pandas DataFrame, because the Spark executors won't be working in that scenario. The beauty of Arrow is being able to move from the Spark DataFrame to pandas directly, but you have to think about the size of the data.
Another possibility would be to use other frameworks like Koalas. It has some of the "beauties" of Pandas but it's integrated into Spark.
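As a rough illustration of the "group the data before toPandas()" advice (the parquet path and column names are made up; older Spark versions use spark.sql.execution.arrow.enabled instead of the pyspark-suffixed key):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Let Arrow speed up the Spark -> pandas conversion.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    df = spark.read.parquet("events.parquet")  # placeholder path

    # Reduce on the executors first; only the small aggregated result is
    # collected to the driver by toPandas().
    summary = df.groupBy("user_id").agg(F.count("*").alias("n_events"))
    pdf = summary.toPandas()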

Applying XGBOOST with large data set

I have a large dataset, approximately 5.3 GB in size, which I have stored using bigmemory in R. Please let me know how to apply XGBoost to this kind of data.
There is currently no support for this in xgboost. You could file an issue on the GitHub repo with respect to the R package.
Otherwise, you could try having xgboost read your data from a file. The docs say you can point it at a local data file. I'm not sure about format restrictions or how it will be handled in RAM, but it's something to explore.
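For illustration, a minimal sketch of that idea using xgboost's Python API (the file name and parameters are hypothetical; R's xgb.DMatrix() similarly accepts a path to a local data file):

    import xgboost as xgb

    # Hypothetical file: DMatrix reads it from disk (libsvm/csv URI) rather
    # than requiring the full dataset as an in-memory matrix.
    dtrain = xgb.DMatrix('train.libsvm?format=libsvm')

    params = {'objective': 'binary:logistic', 'max_depth': 6, 'eta': 0.1}
    booster = xgb.train(params, dtrain, num_boost_round=100)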

Is there a limit to the amount of rows Pandas read_csv can load?

I am trying to load a .csv file using the Pandas read_csv method; the file has 29872046 rows and its total size is 2.2G.
I notice that most of the loaded lines are missing their values for a large number of columns. The csv file, when browsed from the shell, does contain those values...
Are there any limitations to loaded files? If not, how could this be debugged?
Thanks
#d1337,
I wonder if you have memory issues. There is a hint of this here.
Possibly this is relevant or this.
If I were attempting to debug it I would do the simple thing. Cut the file in half - what happens? If it's OK, go up 50%; if not, go down 50%, until you can identify the point where it's happening. You might even want to start with 20 lines and just make sure the problem is size related.
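A sketch of that bisection idea using read_csv's nrows argument (the path is a placeholder):

    import pandas as pd

    path = 'data.csv'  # placeholder path to your 2.2G file

    # Start with 20 lines: are the values present?
    head = pd.read_csv(path, nrows=20)
    print(head.isnull().sum())

    # Then bisect on row count until the missing values start appearing.
    half = pd.read_csv(path, nrows=29872046 // 2)
    print(half.isnull().sum())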
I'd also add OS and memory information plus the version of Pandas you're using to your post in case it's relevant (I'm running Pandas 11.0, Python 3.2, Linux Mint x64 with 16G of RAM, so I'd expect no issues, say). Also, possibly, you might post a link to your data so that someone else can test it.
Hope that helps.