Speed of deleting vs writing a new file, Jupyter pandas

I am working on a project in the Jupyter Notebook/Lab environment, and I'm wondering whether anyone knows how deleting rows from a CSV compares, speed-wise, with just writing a whole new CSV file. I am going to apply around 20 filters and calculations, and would prefer to keep the filters separate for error checking (some are required to be done separately).
I am very new to thinking about performance and can't find documentation on which approach would be faster.
Basically, would I be better off keeping a single file that I repeatedly remove rows from as I filter, or writing each filtered result to a new file?
I have previously written to a new file each time, but the size of this dataset concerns me. So far I haven't had errors with anything, but speed is already an issue even at small sizes.
I've also debated just truncating a file instead of making a new one.
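To make the comparison concrete, the two approaches I'm weighing look roughly like this in pandas (file names and filter conditions are just placeholders); either way the whole CSV ends up being rewritten, since rows can't be deleted from a CSV in place:
import pandas as pd

# Approach 1: one working file that is overwritten after every filter.
df = pd.read_csv("working.csv")
df = df[df["value"] > 0]              # hypothetical filter
df.to_csv("working.csv", index=False)

# Approach 2: write each filtered stage to its own file for error checking.
df = pd.read_csv("stage_01.csv")
df = df[df["category"] != "bad"]      # hypothetical filter
df.to_csv("stage_02.csv", index=False)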

Related

Grib2 data extraction with xarray and cfgrib very slow, how to improve the code?

The code takes about 20 minutes to load one month of data for each variable, with 168 time steps covering the 00 and 12 UTC cycles of each day. Saving to CSV takes even longer: it has been running for almost a day and still hasn't saved data for a single station. How can I improve the code below?
Reading .grib files using xr.open_mfdataset() and cfgrib:
I can speak to the slowness of reading grib files with xr.open_mfdataset(). I had a similar task where I was reading many grib files into xarray and it was taking forever. Other people have experienced similar issues as well (see here).
According to the issue raised here, "cfgrib is not optimized to handle files with a huge number of fields even if they are small."
One thing that worked for me was converting as many of the individual grib files as I could into one (or several) netCDF files and then reading the newly created netCDF file(s) into xarray instead. Here is a link showing several different methods for doing this. I went with the grib_to_netcdf command from the ecCodes tool.
In summary, I would start by converting your grib files to netCDF, as xarray should be able to read that data in a much more performant manner. Then you can focus on other optimizations further down in your code.
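A rough sketch of that workflow, assuming ecCodes' grib_to_netcdf is on your PATH (directory and file names below are placeholders):
import glob
import subprocess
import xarray as xr

# Convert each grib file to netCDF with ecCodes' grib_to_netcdf.
for path in glob.glob("data/*.grib"):
    subprocess.run(["grib_to_netcdf", "-o", path.replace(".grib", ".nc"), path], check=True)

# Open the converted files in one go; this is usually much faster than
# going through cfgrib on a huge number of small grib fields.
ds = xr.open_mfdataset("data/*.nc", combine="by_coords")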
I hope this helps!

VB.NET - Getting list of all files in network resource

Using VB.NET I need to get a list of all files in a network (NAS) folder. This seems really slow:
Dim searchFolder As String = "\\NAS\Tool files"
My.Computer.FileSystem.GetFiles(searchFolder, FileIO.SearchOption.SearchAllSubDirectories, sp).ToList
Is there a faster way?
EDIT: I should have mentioned that the network resource has over 100,000 files.
A great question! I have been in a situation where I needed to iterate over close to 3 million files from a slow network storage area, read some data out of them and create database filepath pointers for each one. GetFiles was prohibitively slow in that use case.
The problem with .GetFiles() is that it performs the entire retrieval upfront before commencing your iteration.
A simple tweak is to use .EnumerateFiles(), which is used in much the same way but finds the first file that matches the pattern and lets you start iterating immediately, retrieving the rest as you go. More on this here.
Depending on what operation you're performing in each iteration, you may also want to take a step into the land of threading. This is a bit more advanced, but it will let you perform operations in parallel, which can bring further speed gains.
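The eager-versus-lazy idea isn't specific to VB.NET; purely as an illustration of the pattern, here is a rough Python sketch where os.scandir() yields entries as they are found and a thread pool works on them in parallel (the per-file work is a placeholder):
import os
from concurrent.futures import ThreadPoolExecutor

def process(path):
    # placeholder for whatever per-file work is needed
    return os.path.getsize(path)

def walk(root):
    # lazily yield file paths instead of materialising the whole list first
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk(entry.path)
        elif entry.is_file():
            yield entry.path

# start working on files as soon as they are found, several at a time
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(process, walk(r"\\NAS\Tool files")))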
Good luck!

How can I make a Spark paired RDD from many S3 files whose URLs are in an RDD?

I have millions of S3 files whose sizes average about 250 KB but are highly variable (a few are as large as 4 GB). I can't easily use wildcards to pick out multiple files, but I can make an RDD holding the S3 URLs of the files I want to process at any time.
I'd like to get two kinds of paired RDDs. The first would have the S3 URL, then the contents of the file as a Unicode string. (Is that even possible when some of the files can be so long?) The second could be computed from the first, by split()-ting the long string at newlines.
I've tried a number of ways to do this, typically getting a Python PicklingError, unless I iterate through the S3 URLs one at a time. Then I can use union() to build up the big pair RDDs I want, as described in another question. But I don't think that will run in parallel, which will be important when dealing with lots of files.
I'm currently using Python, but can switch to Scala or Java if needed.
Thanks in advance.
The size of the files shouldn't matter as long as your cluster has the in-memory capacity. Generally, you'll need to do some tuning before everything works.
I'm not well versed in Python, so I can't comment too much on the pickling error. Perhaps these links might help; I'll add the python tag so that someone more familiar with it can take a look.
cloudpickle.py
pyspark serializer can't handle functions
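For reference, a minimal sketch of the two paired RDDs the question describes, assuming boto3 is installed on the executors; defining the fetch helper at module level and creating the S3 client inside it is what typically sidesteps the PicklingError (bucket names and URLs below are placeholders):
import boto3
from pyspark import SparkContext

def fetch(url):
    # url is assumed to look like "s3://bucket/key"
    bucket, key = url[len("s3://"):].split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return url, body.decode("utf-8")

sc = SparkContext()
urls = sc.parallelize(["s3://my-bucket/part-0000", "s3://my-bucket/part-0001"])
by_file = urls.map(fetch)                                       # (url, whole file as a unicode string)
by_line = by_file.flatMapValues(lambda text: text.split("\n"))  # (url, one line of the file)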

solutions for cleaning/manipulating big data (currently using Stata)

I'm currently using a 10% sample of a very large dataset (10 variables, over 300 million rows), which amounts to over 200 GB of data when the full dataset is stored in .dta format. Stata is able to handle operations like egen, collapse, merging, etc. in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50 GB of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I suspect the background operations are pushing Stata into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and use informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have rewritten some of my Stata data-manipulation code in SQL, and the response time is much quicker. I believe I can make large optimization gains by using indexes correctly and by parallel processing via partitions/shards if necessary. Once all the data manipulation has been done, I can import the data into Stata via ODBC.
Since you are familiar with Stata, there is a well-documented FAQ about large datasets in Stata, Dealing with Large Datasets, that you might find helpful.
I would clean by columns: split them up, run any specific cleaning routines, and merge them back in later.
Depending on your machine's resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your dataset quite a lot.
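This thread is about Stata, but purely as an illustration, the same split-the-columns-then-merge-back pattern looks roughly like this in Python/pandas (file, column, and key names are placeholders):
import pandas as pd

# Read only the columns needed for one cleaning routine; usecols keeps
# memory proportional to those columns rather than the whole file.
income = pd.read_csv("full_data.csv", usecols=["id", "income"])
income["income"] = income["income"].clip(lower=0)   # placeholder cleaning step
income.to_csv("income_clean.csv", index=False)

# ...repeat for the other column groups, then merge the cleaned pieces back on the key.
other = pd.read_csv("other_clean.csv")
merged = income.merge(other, on="id")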

How to handle a very big array in vb.net

I have a program producing a lot of data, which it writes to a CSV file line by line (as the data is created). If I were able to open the CSV file in Excel it would be about 1 billion cells (75,000 × 14,600). I get a System.OutOfMemoryException every time I try to access it (or even create an array of this size). If anyone has any idea how I can get the data into VB.NET so I can do some simple operations (all the data needs to be available at once), I'll try every idea you have.
I've looked at increasing the amount of RAM used, but other articles/posts say this will run short well before the 1 billion mark. There are no issues with time here; assuming it's no more than a few days/weeks I can deal with it (I'll only be running it once or twice a year). If there's no way to do it, the only other solutions I can think of are increasing the number of columns in Excel to ~75,000 (if that's possible; I can't write the data the other way around), or, I suppose, another language that could handle this?
At present it fails right at the start:
Dim bigmatrix(75000, 14600) As Double
Many thanks,
Fraser :)
First, this will always require a 64-bit operating system and a fairly large amount of RAM, as you're trying to allocate about 8 GB (75,000 × 14,600 cells × 8 bytes per Double ≈ 8.8 GB).
This is theoretically possible in Visual Basic targeting .NET 4.5 if you turn on gcAllowVeryLargeObjects. That said, I would recommend using a jagged array instead of a multidimensional array if possible, as that removes the need for a single 8 GB allocation. (This will also potentially allow it to work in .NET 4 or earlier.)
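Since the question leaves the door open to another language, here is a minimal sketch of the same-sized matrix handled in Python with a NumPy memory-mapped array, so the ~8.8 GB lives on disk and only the pages you touch are pulled into RAM (the file name is a placeholder):
import numpy as np

# On-disk array with the same shape as bigmatrix, backed by "bigmatrix.dat".
big = np.memmap("bigmatrix.dat", dtype=np.float64, mode="w+", shape=(75000, 14600))

# Simple operations work as on an ordinary array without ever holding
# the whole matrix in memory at once.
big[0, :] = 1.0
col_mean = big[:, 0].mean()
big.flush()  # write pending changes back to disk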