Julia: CSV.write very memory inefficient? - dataframe

I noticed that when saving large dataframes as CSVs, the memory allocations are at least an order of magnitude higher than the size of the dataframe in memory (or the size of the CSV file on disk). Why is this the case? And is there a way to prevent it, i.e. a way to save a dataframe to disk without using (much) more memory than the actual dataframe occupies?
In the example below I generate a dataframe with one integer column and 10m rows. It weighs 76MB but writing the CSV allocates 1.35GB.
using DataFrames, CSV
function generate_df(n::Int64)
    DataFrame(a = 1:n)
end
julia> @time tmp = generate_df(10000000);
0.671053 seconds (2.45 M allocations: 199.961 MiB)
julia> Base.summarysize(tmp) / 1024 / 1024
76.29454803466797
julia> @time CSV.write("~/tmp/test.csv", tmp)
3.199506 seconds (60.11 M allocations: 1.351 GiB)

What you see is not related to CSV.write, but to the fact that DataFrame is type-unstable. This means that it will allocate when iterating rows and accessing their contents. Here is an example:
julia> df = DataFrame(a=1:10000000);
julia> f(x) = x[1]
f (generic function with 1 method)
julia> @time sum(f, eachrow(df)) # after compilation
0.960045 seconds (40.07 M allocations: 613.918 MiB, 4.18% gc time)
50000005000000
This is a deliberate design decision to avoid unacceptable compilation times for very wide data frames (which are common in practice in certain fields of application). Now, this is the way to reduce allocations:
julia> @time CSV.write("test.csv", df) # after compilation
1.976654 seconds (60.00 M allocations: 1.345 GiB, 5.64% gc time)
"test.csv"
julia> @time CSV.write("test.csv", Tables.columntable(df)) # after compilation
0.439597 seconds (36 allocations: 4.002 MiB)
"test.csv"
(this will work OK if the table is narrow, for wide tables it might hit compilation time issues)
This is one of the patterns you will often encounter in Julia (Julia itself works this way: the args field of Expr is a Vector{Any}): type-unstable code is often acceptable when you do not care about raw performance but want to avoid excessive compilation latency, and it is easy to switch to a type-stable mode in the places where compilation time does not matter and type stability does.
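To illustrate that switch (this snippet is not from the original answer; it assumes Tables.jl, which DataFrames.jl and CSV.jl already depend on), the row-sum example above can be made type-stable by iterating a NamedTuple-based view of the table instead of DataFrameRows:
using DataFrames, Tables
df = DataFrame(a = 1:10_000_000)
f(x) = x[1]
tbl = Tables.columntable(df)               # (a = [1, 2, ...],) with concrete element types
sum(f, Tables.namedtupleiterator(tbl))     # rows are concretely typed NamedTuples,
                                           # so this should allocate very little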

Python Pandas:
import pandas as pd
df = pd.DataFrame({'a': range(10_000_000)})
%time df.to_csv("test_py.csv", index=False)
memory consumption (measured in Task Manager): 135 MB (before writing) -> 151 MB (during writing), Wall time: 8.39 s
Julia:
using DataFrames, CSV
df = DataFrame(a=1:10_000_000)
#time CSV.write("test_jl.csv", df)
#time CSV.write("test_jl.csv", df)
memory consumption: 284 MB (before writing) -> 332 MB (after 1st writing),
2.196639 seconds (51.42 M allocations: 1.270 GiB, 7.49% gc time)
2nd execution (no compilation required anymore): -> 357 MB,
1.701374 seconds (50.00 M allocations: 1.196 GiB, 6.26% gc time)
The memory increase during writing of the CSV file is similar in Python and Julia (~ 15 MB). In Julia, I observe a significant memory increase after execution of the first write command, probably due to caching of compiled code.
Note that allocations != memory requirement: even though 1.2 GiB is allocated in total, only about 15 MB is in use at any one time (the peak).
Regarding performance, Julia is nearly 4 times faster than Python, even including compilation time.
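If you want to check this from within Julia rather than in the Task Manager, one rough way (an illustration, not part of the original comparison) is to contrast the cumulative allocation total reported by @time with Sys.maxrss(), which reports the peak resident memory of the whole process:
using DataFrames, CSV
df = DataFrame(a = 1:10_000_000)
@time CSV.write("test_jl.csv", df)   # cumulative GC allocations (~1.2 GiB above)
Sys.maxrss() / 1024^2                # peak resident set size of the process, in MiB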

Related

Numpy: memory-efficient way to produce np.exp(1j*x) for real x

For a given real numpy array x, the peak memory consumption of np.exp(1j*x) seems to be 4 times X (where X is the memory size of x). This seems to be because it first computes result = 1j*np.sin(x), with peak memory use of 4X, falling back to 2X, and then computes result += np.cos(x), again peaking at 4X and coming back to 2X. However, the result of np.exp(1j*x) itself only needs 2X. Is there any way to reduce the peak memory to 2X?
As explained in the comments, the 4x peak comes from 1j*x taking 2x the storage of x and then the result of exp() also taking 2x.
Also, as already mentioned, Numba/Cython/numexpr can do the operation without intermediate arrays, in a single loop.
The 4X peak can be avoided using only numpy, but it requires 2 loops over x:
import numpy as np

# allocate the complex result once, then fill its real and imaginary parts in place
result = np.empty_like(x, np.result_type(x, 1j))
np.sin(x, out=result.imag)
np.cos(x, out=result.real)

Dask pandas apply freezes and ultimately gets killed

I parallelized my pandas.apply with dask.map_partitions and it works fine on a test DataFrame (600 rows), cutting my calculation time from 6 to 2 minutes and computing the inner function as expected.
However, when I run the full set (6000 rows), the code becomes unresponsive, the CPU and memory appear fully loaded, and the process ultimately gets killed:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=params.parallel)
applied = ddf.map_partitions(lambda dframe: dframe.apply(lambda x: my_fun(inData, x), axis=1)).compute(scheduler='processes')
After 15 minutes (by the way, a simple pandas.apply on the full dataset takes ca. 15 minutes) I get this in the console:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I am running macOS, Python 3.7, calling from PyCharm. Any hints?
PS: I tried both a few partitions (2, 4) and many (32), since I thought it might be a memory issue.
PS2: I pass quite big objects in inData to my apply function (e.g. a network graph of ca. 700 MB). In classic pandas this does not matter, since it runs sequentially in a single process, but here I suspect that might be the issue. Can it be?

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask
@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source

How can I speed up application of Log(x+1) to a sparse array in Julia

A sparse matrix in Julia only stores nonzero elements. Some functions, such as log(x+1) (in all bases), map zero to zero, and thus don't need to be applied to those zero elements. (I think we would call this a Monoid homomorphism.) How can I use this fact to speed up an operation?
Example code:
X = sprand(10^4,10^4, 10.0^-5, rand)
function naiveLog2p1(N::SparseMatrixCSC{Float64,Int64})
    log2(1+N) |> sparse
end
Running:
@time naiveLog2p1(X)
Output is:
elapsed time: 2.580125482 seconds (2289 MB allocated, 6.86% gc time in 3 pauses with 0 full sweep)
Running it a second time (so that the function is already compiled):
elapsed time: 2.499118888 seconds (2288 MB allocated, 8.17% gc time in 3 pauses with 0 full sweep)
Little change, presumably because the function is so simple to compile. The large allocation is roughly what you would expect if 1+N materializes a full dense 10^4 x 10^4 matrix (800 MB of Float64) and log2 and the final sparse conversion allocate comparable temporaries.
Following the suggestion of the Julia manual section on "Sparse matrix operations", I would extract the nonzero entries with findnz(), apply the log operation to the values, and then reconstruct the sparse matrix with sparse().
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    I,J,V = findnz(N)
    return sparse(I,J,log2(1+V))
end
@time improvedLog2p1(X)
elapsed time: 0.000553508 seconds (473288 bytes allocated)
My solution would be to actually operate on the inside of the data structure itself:
mysparselog(N::SparseMatrixCSC) =
    SparseMatrixCSC(N.m, N.n, copy(N.colptr), copy(N.rowval), log2(1+N.nzval))
Note that if you want to operate on the sparse matrix in place, which I imagine would be fairly common in practice, this would be a zero-memory operation. Benchmarking reveals that this performs similarly to the @Oxinabox answer, as it does about the same amount of memory work (although that answer, as originally posted, rebound ms to a new vector instead of mutating the matrix, which is why its mean output differs):
with warmup times removed:
naiveLog2p1
elapsed time: 1.902405905 seconds (2424151464 bytes allocated, 10.35% gc time)
mean(M) => 0.005568094618997372
mysparselog
elapsed time: 0.022551705 seconds (24071168 bytes allocated)
elapsed time: 0.025841895 seconds (24071168 bytes allocated)
mean(M) => 0.005568094618997372
improvedLog2p1
elapsed time: 0.018682775 seconds (32068160 bytes allocated)
elapsed time: 0.027129497 seconds (32068160 bytes allocated)
mean(M) => 0.004995127985160583
What you are looking for is the sparse nonzeros function.
nonzeros(A)
Return a vector of the structural nonzero values in sparse matrix A. This includes zeros that are explicitly stored in the sparse matrix. The returned vector points directly to the internal nonzero storage of A, and any modifications to the returned vector will mutate A as well.
You can use this as below:
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    M = copy(N)
    ms = nonzeros(M)       # a view into the stored values of M
    ms .= log2.(1 .+ ms)   # update in place; mutating ms mutates M
    M
end
@time improvedLog2p1(X)
Running it for the first time, the output is:
elapsed time: 0.002447847 seconds (157 kB allocated)
Running it a second time, the output is:
0.000102335 seconds (109 kB allocated)
That is a four-orders-of-magnitude improvement in speed and memory use.
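For reference (this is not part of the original answers, and it assumes Julia 1.0 or later, where SparseArrays is a standard library and vectorized operations are written with dot-broadcasting), the same idea can be packaged up like this:
using SparseArrays

function log2p1!(M::SparseMatrixCSC)
    v = nonzeros(M)        # view into the stored values; mutating v mutates M
    v .= log2.(1 .+ v)
    return M
end

log2p1(N::SparseMatrixCSC) = log2p1!(copy(N))

X = sprand(10^4, 10^4, 1e-5)
@time log2p1(X)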

Doing efficient Numerics in Haskell

I was inspired by this post called "Only fast languages are interesting" to look at the problem he suggests (summing a couple of million numbers from a vector) in Haskell and compare to his results.
I'm a Haskell newbie, so I don't really know how to time this correctly or how to do it efficiently; my first attempt at the problem is below. Note that I'm not using random numbers in the vector, as I'm not sure how to do that in a good way. I'm also printing stuff in order to ensure full evaluation.
import System.TimeIt
import Data.Vector as V
vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    print $ V.length vec
    return vec

sumit :: IO ()
sumit = do
    vec <- vector
    print $ V.sum vec

time = timeIt sumit
Loading this up in GHCi and running time tells me that it took about 0.22s to run for 3 million numbers and 2.69s for 30 million numbers.
Compared to the blog author's results of 0.02s and 0.18s in Lush, this is quite a lot worse, which leads me to believe it can be done in a better way.
Note: The above code needs the package TimeIt to run. cabal install timeit will get it for you.
First of all, realize that GHCi is an interpreter, and it's not designed to be very fast. To get more useful results you should compile the code with optimizations enabled. This can make a huge difference.
Also, for any serious benchmarking of Haskell code, I recommend using criterion. It uses various statistical techniques to ensure that you're getting reliable measurements.
I modified your code to use criterion and removed the print statements so that we're not timing the I/O.
import Criterion.Main
import Data.Vector as V
vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    return vec

sumit :: IO Int
sumit = do
    vec <- vector
    return $ V.sum vec

main = defaultMain [bench "sumit" $ whnfIO sumit]
Compiling this with -O2, I get this result on a pretty slow netbook:
$ ghc --make -O2 Sum.hs
$ ./Sum
warming up
estimating clock resolution...
mean is 56.55146 us (10001 iterations)
found 1136 outliers among 9999 samples (11.4%)
235 (2.4%) high mild
901 (9.0%) high severe
estimating cost of a clock call...
mean is 2.493841 us (38 iterations)
found 4 outliers among 38 samples (10.5%)
2 (5.3%) high mild
2 (5.3%) high severe
benchmarking sumit
collecting 100 samples, 8 iterations each, in estimated 6.180620 s
mean: 9.329556 ms, lb 9.222860 ms, ub 9.473564 ms, ci 0.950
std dev: 628.0294 us, lb 439.1394 us, ub 1.045119 ms, ci 0.950
So I'm getting an average of just over 9 ms with a standard deviation of less than a millisecond. For the larger test case, I'm getting about 100ms.
Enabling optimizations is especially important when using the vector package, as it makes heavy use of stream fusion, which in this case is able to eliminate the data structure entirely, turning your program into an efficient, tight loop.
It may also be worthwhile to experiment with the new LLVM-based code generator by using the -fllvm option. It is apparently well-suited for numeric code.
Your original file, uncompiled, then compiled without optimization, then compiled with a simple optimization flag:
$ runhaskell boxed.hs
3000000
30000000
CPU time: 0.35s
$ ghc --make boxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.34s
$ ghc --make -O2 boxed.hs
$ ./boxed
3000000
30000000
CPU time: 0.09s
Your file with import qualified Data.Vector.Unboxed as V instead of import qualified Data.Vector as V (Int is an unboxable type) --
first without optimization then with:
$ ghc --make unboxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.27s
$ ghc --make -O2 unboxed.hs
$ ./unboxed
3000000
30000000
CPU time: 0.04s
So, compile, optimize ... and where possible use Data.Vector.Unboxed
Try to use an unboxed vector, although I'm not sure whether it makes a noticeable difference in this case. Note also that the comparison is slightly unfair, because the vector package should optimize the vector away entirely (this optimization is called stream fusion).
If you use big enough vectors, unboxed vectors might become impractical. For me, pure (lazy) lists are quicker if the vector size is > 50000000:
import System.TimeIt
sumit :: IO ()
sumit = print . sum $ replicate 50000000 10
main :: IO ()
main = timeIt sumit
I get these times:
Unboxed Vectors
CPU time: 1.00s
List:
CPU time: 0.70s
Edit: I've repeated the benchmark using Criterion and making sumit pure. Code and results follow:
Code:
import Criterion.Main
sumit :: Int -> Int
sumit m = sum $ replicate m 10
main :: IO ()
main = defaultMain [bench "sumit" $ nf sumit 50000000]
Results:
warming up
estimating clock resolution...
mean is 7.248078 us (80001 iterations)
found 24509 outliers among 79999 samples (30.6%)
6044 (7.6%) low severe
18465 (23.1%) high severe
estimating cost of a clock call...
mean is 68.15917 ns (65 iterations)
found 7 outliers among 65 samples (10.8%)
3 (4.6%) high mild
4 (6.2%) high severe
benchmarking sumit
collecting 100 samples, 1 iterations each, in estimated 46.07401 s
mean: 451.0233 ms, lb 450.6641 ms, ub 451.5295 ms, ci 0.950
std dev: 2.172022 ms, lb 1.674497 ms, ub 2.841110 ms, ci 0.950
It looks like print makes a big difference, as is to be expected!