Which is faster: read(), readline(), or readlines() with respect to file I/O in Julia?

Please correct me if I am wrong.
I assume read is the most efficient because:
a) read fetches the whole file content into memory in one go, similar to Python.
b) readline and readlines bring one line at a time into memory.

To expand on the comment, here is an example benchmark (which also shows how you can run such tests yourself).
First create some random test data:
open("testdata.txt", "w") do f
    for i in 1:10^6
        println(f, "a"^100)
    end
end
We will want to read in the data in four ways (and calculate the aggregate length of lines):
f1() = sum(length(l) for l in readlines("testdata.txt"))
f2() = sum(length(l) for l in eachline("testdata.txt"))
function f3()
    s = 0
    open("testdata.txt") do f
        while !eof(f)
            s += length(readline(f))
        end
    end
    s
end
function f4()
    s = 0
    for c in read("testdata.txt", String)
        s += c != '\n' # assume Linux for simplicity
    end
    s
end
Now we compare the performance and memory usage of the given options:
julia> using BenchmarkTools
julia> @btime f1()
239.857 ms (2001558 allocations: 146.59 MiB)
100000000
julia> @btime f2()
179.480 ms (2001539 allocations: 137.59 MiB)
100000000
julia> @btime f3()
189.643 ms (2001533 allocations: 137.59 MiB)
100000000
julia> @btime f4()
158.055 ms (13 allocations: 96.32 MiB)
100000000
If you run it on your machine you should get similar results.
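As a small, self-contained sketch of the two extremes compared above (whole file in memory versus streaming line by line), the snippet below writes its own temporary file instead of testdata.txt. The helper names count_bytes and count_stream are made up for this illustration, and it assumes ASCII content, so that bytes and characters coincide.

```julia
# Write a small temporary file so the example does not depend on testdata.txt.
path, io = mktemp()
for i in 1:1000
    println(io, "a"^100)
end
close(io)

# Whole file into memory as raw bytes, then count everything except newlines.
# (Assumes ASCII content, so one byte equals one character.)
count_bytes(p) = count(b -> b != UInt8('\n'), read(p))

# Streaming: eachline yields one line at a time and strips the newline.
count_stream(p) = sum(length, eachline(p))

total_bytes = count_bytes(path)
total_stream = count_stream(path)

rm(path)  # clean up the temporary file
```

Both totals should agree. Which variant is faster on a given machine is best checked with @btime, as in the benchmark above.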

How to change array elements that are greater than 5 to 5, in one line?

I would like to take an array x and change all numbers greater than 5 to 5. What is the standard way to do this in one line?
Below is some code that does this in several lines. This question on logical indexing is related but appears to concern selection rather than assignment.
Thanks
x = [1 2 6 7]
for i in 1:length(x)
    if x[i] > 5
        x[i] = 5
    end
end
Desired output:
x = [1 2 5 5]
The broadcast operator . works with any function, including relational operators, and it also works with assignment. Hence an intuitive one-liner is:
x[x .> 5] .= 5
This part x .> 5 broadcasts > 5 over x, resulting in a vector of booleans indicating elements greater than 5. This part .= 5 broadcasts the assignment of 5 across all elements indicated by x[x .> 5].
However, inspired by the significant speed-up in Benoit's very cool answer below (please do check it out) I decided to also add an optimized variant with a speed test. The above approach, while very intuitive looking, is not optimal because it allocates a temporary array of booleans for the indices. A (more) optimal approach that avoids temporary allocation, and as a bonus will work for any predicate (conditional) function is:
function f_cond!(x::Vector{Int}, f::Function, val::Int)
    @inbounds for n in eachindex(x)
        f(x[n]) && (x[n] = val)
    end
    return x
end
So using this function we would write f_cond!(x, a->a>5, 5) which assigns 5 to any element for which the conditional (anonymous) function a->a>5 evaluates to true. Obviously this solution is not a neat one-liner, but check out the following speed tests:
julia> using BenchmarkTools
julia> x1 = rand(1:10, 100);
julia> x2 = copy(x1);
julia> @btime $x1[$x1 .> 5] .= 5;
327.862 ns (8 allocations: 336 bytes)
julia> @btime f_cond!($x2, a->a>5, 5);
15.067 ns (0 allocations: 0 bytes)
This is more than 20 times faster in this test. You can also make the function generic by replacing Int with a type parameter T. Given the speed-up, one might wonder if there is a function in Base that already does this. A one-liner is:
map!(a->a>5 ? 5 : a, x, x)
and while this is significantly faster than the first approach, it falls well short of the second.
Incidentally, I felt certain this must be a duplicate of another StackOverflow question, but 5 minutes of searching didn't reveal anything.
You can broadcast min as well:
x .= min.(x, 5)
Note that this is (slightly) more efficient than using x[x .> 5] .= 5 because it does not allocate the temporary array of Booleans, x .> 5, and it can be automatically vectorized, with a single pass over the memory (as per Oscar's comment below):
julia> using BenchmarkTools
julia> x = [1 2 6 7] ; @btime $x .= min.($x, 5) ; # fast, no allocations
19.144 ns (0 allocations: 0 bytes)
julia> x = [1 2 6 7] ; @btime $x[$x .> 5] .= 5 ; # slower, allocates
148.678 ns (5 allocations: 304 bytes)
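For comparison, here is a hedged sketch of three in-place alternatives that avoid the temporary Boolean mask. The helper set_where! is a made-up, generic take on the loop-based idea (any array type, any element type, no @inbounds); replace! and clamp! are functions from Base. No timings are claimed here, so benchmark them yourself before choosing.

```julia
# Hypothetical generic helper: assign `val` wherever `pred` holds.
function set_where!(pred, x::AbstractArray, val)
    for i in eachindex(x)
        pred(x[i]) && (x[i] = val)
    end
    return x
end

a = [1, 2, 6, 7]
set_where!(v -> v > 5, a, 5)

# Base.replace! with a function maps every element in place.
b = [1, 2, 6, 7]
replace!(v -> v > 5 ? 5 : v, b)

# Base.clamp! restricts values to [lo, hi]; typemin(Int) leaves the lower side open.
c = [1, 2, 6, 7]
clamp!(c, typemin(Int), 5)
```

All three leave the array as [1, 2, 5, 5]. Note that clamp! only fits the special case of capping at a bound, while the other two accept an arbitrary predicate.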

Manipulating data in DataFrame: how to calculate the square of a column

I would like to calculate the square of column A (1, 2, 3, 4), combine it with some other calculations, and store the result in column C:
using CSV, DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
df.C = ((((df.A./2).^2).*3.14)./1000)
Is there an easier way to write it?
I am not sure how much shorter you would want the formula to be, but you can write:
df.C = @. (df.A / 2) ^ 2 * 3.14 / 1000
to avoid having to write . everywhere.
Or you can use transform!, but it is not shorter (its benefit is that you can use it in a processing pipeline, e.g. using Pipe.jl):
transform!(df, :A => ByRow(a -> (a / 2) ^ 2 * 3.14 / 1000) => :C)
Try this:
df.D = .5df.A .^2 * 0.00314
Explanation:
fewer parentheses are needed
multiplying a scalar by a vector is as good here as vectorization for short vectors (up to something like 100 elements)
A simple benchmark using BenchmarkTools:
julia> @btime $df.E = .5*$df.A .^2 * 0.00314;
592.085 ns (9 allocations: 496 bytes)
julia> @btime $df.F = @. ($df.A / 2) ^ 2 * 0.00314;
875.490 ns (11 allocations: 448 bytes)
The fastest, however, is a longer version where you provide the type information, @. (df.A::Vector{Int} / 2) ^ 2 * 0.00314 (again, this matters mostly for short DataFrames; note that the Z column must exist, so we create it here):
julia> @btime begin $df.Z = Vector{Float64}(undef, nrow(df)); @. $df.Z = ($df.A::Vector{Int} / 2.0) ^ 2.0 * 0.00314; end;
162.564 ns (3 allocations: 208 bytes)
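Another way to keep the formula readable is to put the scalar computation in a small function and broadcast it with a single dot. The sketch below uses a plain range in place of the DataFrame column so that it runs without any packages; the name scaled_square is made up for illustration. With a DataFrame, the same idea would presumably read df.C = scaled_square.(df.A).

```julia
# Scalar formula from the question; `scaled_square` is a hypothetical name.
scaled_square(a) = (a / 2)^2 * 3.14 / 1000

A = 1:4                  # stands in for the column df.A
C = scaled_square.(A)    # one dot broadcasts the whole formula elementwise
```

Because the function is defined on scalars, it can also be passed to ByRow in a transform! call without rewriting the formula.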

Why do the trigonometric functions in Julia seem to be slower than in NumPy?

I'm new to Julia, so I may be doing something wrong. But I ran a simple test of trigonometric functions, and Julia seems to be significantly slower than NumPy. I need some help to see why.
--- Julia version:
x = rand(100000);
y = similar(x);
@time y.=sin.(x);
--- Numpy version:
import numpy
x = numpy.random.rand(100000)
y = numpy.zeros(x.shape)
%timeit y = numpy.sin(x)
The Julia version regularly gives 1.3 ~ 1.5 ms, but the Numpy version usually gives 0.9 ~ 1 ms. The difference is quite significant. Why is that? Thanks.
x = rand(100000);
y = similar(x);
f(x,y) = (y.=sin.(x));
@time f(x,y)
@time f(x,y)
@time f(x,y)
Gives
julia> @time y.=sin.(x);
0.123145 seconds (577.97 k allocations: 29.758 MiB, 5.70% gc time)
julia> @time y.=sin.(x);
0.000515 seconds (6 allocations: 192 bytes)
julia> @time y.=sin.(x);
0.000512 seconds (6 allocations: 192 bytes)
The first time you call a function, Julia compiles it. Broadcast expressions generate and use an anonymous function, so if you broadcast in the global scope it will compile it each time. Julia works best in function scopes.
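A minimal sketch of the resulting measuring pattern: wrap the broadcast in a function, call it once so that compilation is out of the way, and time a later call. The function name fill_sin! is made up for this illustration, and @elapsed from Base is used instead of @time so the timing is returned as a value.

```julia
x = rand(100_000)
y = similar(x)

# Keep the broadcast inside a function so it is compiled once and reused.
fill_sin!(y, x) = (y .= sin.(x); y)

fill_sin!(y, x)                     # first call includes compilation time
t = @elapsed fill_sin!(y, x)        # later calls measure only the run time
```

For more robust numbers, @btime from BenchmarkTools runs the call many times and reports the minimum.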

How to show field values in Julia

I was wondering if there is a possibility to show field values in Julia.
For example, this Python program gets the instance variable wealth from the Consumer class:
class Consumer:
    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w
    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y
    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth
c1 = Consumer(10) # Create instance with initial wealth 10
c1.spend(5)
c1.wealth
The wealth variable is 5. I want to know how I can translate this code to Julia.
The simplest approach is pretty much like Python:
mutable struct Consumer
    wealth
end
function earn(c::Consumer, y)
    c.wealth += y
end
function spend(c::Consumer, y)
    c.wealth -= y
end
And now you can use it like:
julia> c1 = Consumer(10)
Consumer(10)
julia> spend(c1, 5)
5
julia> c1.wealth
5
You can read more about it here.
But probably in Julia you would write it like:
mutable struct ConsumerTyped{T<:Real}
    wealth::T
end
function earn(c::ConsumerTyped, y)
    c.wealth += y
end
function spend(c::ConsumerTyped, y)
    c.wealth -= y
end
On the surface this works almost the same. The difference is the type parameter T, which specifies the type of wealth. There are two benefits: you get type control in your code, and the functions will run faster.
Given such a definition the only thing you need to know is that the constructor can be called in two flavors:
c2 = ConsumerTyped{Float64}(10) # explicitly specifies T
c3 = ConsumerTyped(10) # T implicitly derived from the argument
Now let us compare the performance of both types:
julia> using BenchmarkTools
julia> c1 = Consumer(10)
Consumer(10)
julia> c2 = ConsumerTyped(10)
ConsumerTyped{Int64}(10)
julia> @benchmark spend(c1, 1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 56.434 ns (0.00% GC)
median time: 57.376 ns (0.00% GC)
mean time: 60.126 ns (0.84% GC)
maximum time: 847.942 ns (87.69% GC)
--------------
samples: 10000
evals/sample: 992
julia> @benchmark spend(c2, 1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 29.858 ns (0.00% GC)
median time: 30.791 ns (0.00% GC)
mean time: 32.835 ns (1.63% GC)
maximum time: 966.188 ns (90.20% GC)
--------------
samples: 10000
evals/sample: 1000
and you see that you get ~2x speedup.
Julia doesn't support classes (in terms of OOP).
However, there are composite types which can represent the variables of your class:
mutable struct Consumer # written as `type Consumer` before Julia 1.0
    wealth::Float64
end
Now, since Julia doesn't support classes, all methods have to live outside this type which allows one of the key features of Julia, multiple dispatch, to also work with user-defined types. (https://docs.julialang.org/en/stable/manual/methods/, https://www.juliabloggers.com/julia-in-ecology-why-multiple-dispatch-is-good/)
Hence, you would have to add a method like this:
function earn!(consumer::Consumer, y::Float64)
    println("The consumer earns $y dollars")
    consumer.wealth = consumer.wealth + y
end
(Similarly, the spend function can be implemented.)
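The Julia snippets above leave out the "insufficient funds" check from the Python original. A hedged sketch that includes it, and that also shows two generic ways to inspect fields, might look like this. The type name CheckedConsumer and the bang-suffixed function names are made up for this illustration; fieldnames and getfield are functions from Base.

```julia
mutable struct CheckedConsumer{T<:Real}
    wealth::T
end

# The consumer earns y dollars.
earn!(c::CheckedConsumer, y) = (c.wealth += y; c)

# The consumer spends x dollars if feasible, as in the Python version.
function spend!(c::CheckedConsumer, x)
    new_wealth = c.wealth - x
    if new_wealth < 0
        println("Insufficient funds")
    else
        c.wealth = new_wealth
    end
    return c
end

w = CheckedConsumer(10)
spend!(w, 5)
spend!(w, 100)                      # prints the message, wealth is unchanged

names = fieldnames(CheckedConsumer) # tuple of field names as Symbols
value = getfield(w, :wealth)        # same as w.wealth
```

The dot syntax w.wealth is the usual way to read a field; fieldnames and getfield are mainly useful when the field name is only known at run time.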

How to specify the format for printing an array of Floats in julia?

I have an array or matrix that I want to print, but only to three digits of precision. How do I do that? I tried the following:
> @printf("%.3f", rand())
0.742
> @printf("%.3f", rand(3))
LoadError: TypeError: non-boolean (Array{Bool,1}) used in boolean context
while loading In[13], in expression starting on line 1
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
The OP said:
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
This answer to a similar question suggests something like this:
julia> VERSION
v"1.0.0"
julia> using Printf
julia> m = 3; n = 5;
julia> A = rand(m, n)
3×5 Array{Float64,2}:
0.596055 0.0574471 0.122782 0.829356 0.226897
0.606948 0.0312382 0.244186 0.356534 0.786589
0.147872 0.61846 0.494186 0.970206 0.701587
# For this session of the REPL, redefine show function. Next REPL will be back to normal.
# Note the %1.3f spec in the printf format string to get 3 digits to the right of the decimal point.
julia> Base.show(io::IO, f::Float64) = @printf(io, "%1.3f", f)
# Now we have the 3 digits to the right spec working in the REPL.
julia> A
3×5 Array{Float64,2}:
0.596 0.057 0.123 0.829 0.227
0.607 0.031 0.244 0.357 0.787
0.148 0.618 0.494 0.970 0.702
# The print function prints with 3 decimals as well, but note the semicolons for rows.
# This may not be what was wanted either, but could have a use.
julia> print(A)
[0.596 0.057 0.123 0.829 0.227; 0.607 0.031 0.244 0.357 0.787; 0.148 0.618 0.494 0.970 0.702]
How about this?
julia> print(round.(rand(3); digits=3))
[0.188,0.202,0.237]
I would do it this way:
julia> using Printf
julia> map(x -> @sprintf("%.3f",x), rand(3))
3-element Array{String,1}:
"0.471"
"0.252"
"0.090"
I don't think @printf accepts a list of arguments as you might be expecting.
One solution you could try is to use @sprintf to create formatted strings, but collect them up in a list comprehension. You might then use join to concatenate them together like so:
join([@sprintf("%3.2f", x) for x in rand(3)], ", ")
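Building on the @sprintf and join idea, here is a sketch of a small helper in the spirit of the printx function the question asks for, without redefining Base.show. The name format3 is made up, the format is fixed at three decimals because @sprintf needs a literal format string, and the helper only handles matrices.

```julia
using Printf

# Hypothetical helper: format every element with 3 decimals, one matrix row per line.
function format3(A::AbstractMatrix)
    rows = [join([@sprintf("%.3f", A[i, j]) for j in axes(A, 2)], "  ")
            for i in axes(A, 1)]
    return join(rows, "\n")
end

M = [1.0 2.5; 0.12345 10.0]
s = format3(M)
println(s)
```

Since the helper returns a String, the result can be printed, logged, or written to a file, and the default display of Float64 values elsewhere in the session stays untouched.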