How to multiply a dataframe's column by a log using julia?

I have a dataframe. I want to multiply column "b" by a "log" and then replace NaN by 0s.
How can I do that in Julia?
I am checking this: DataFrames.jl
But I do not understand.
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))

I want to multiply column "b" by a "log"
Assuming you mean you want to apply the (natural) log to each element in column :b, you can do the following:
log.(df.b)
log(x) applies the (natural) log to an individual element x. By putting a dot after the log, you are broadcasting the log function across each element.
If you wanted to replace column b, do the following:
df.b = log.(df.b)
and then replace NaN by 0s
I'm assuming you want to handle the case where you have a DomainError (i.e. taking the log of a negative number). Your best bet is to handle the error before it arises:
map( x -> x <= 0 ? 0.0 : log(x), df.b)
This maps the anonymous function x -> x <= 0 ? 0.0 : log(x) across each element of column b in your DataFrame. This function tests whether x is less than or equal to zero: if yes, it returns 0.0, else it returns log(x). This "one-line if statement" is called the ternary operator.
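If you want to write the result straight back into the DataFrame, here is a minimal sketch (assuming the df defined above; the :b_log column name is just an illustration):
# Overwrite :b with the "safe" log, returning 0.0 for non-positive values:
df.b = map(x -> x <= 0 ? 0.0 : log(x), df.b)
# Or keep :b and store the result in a new column via DataFrames.jl's source => function => destination syntax:
transform!(df, :b => ByRow(x -> x <= 0 ? 0.0 : log(x)) => :b_log)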

Use a generator:
( v <= 0. ? 0. : log(v) for v in df.c )
If you want to add a new column:
df[!, :d] .= ( v <= 0. ? 0. : log(v) for v in df.c)
This is faster than using map (these tests assume that df.d already exists):
julia> using BenchmarkTools
julia> @btime $df[!, :d] .= ( v <= 0.0 ? 0.0 : log(v) for v in $df.c)
1.440 μs (14 allocations: 720 bytes)
julia> @btime $df[!, :d] .= map( x -> x <= 0.0 ? 0.0 : log(x), $df.c);
1.570 μs (14 allocations: 720 bytes)
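The question also asks to replace NaN by 0s afterwards; a small sketch using Base's replace (the vector v below is only an illustration, not data from the question):
v = [1.0, NaN, 2.0]
replace(x -> isnan(x) ? 0.0 : x, v)      # returns [1.0, 0.0, 2.0]
# or in place, on the :d column created above:
replace!(x -> isnan(x) ? 0.0 : x, df.d)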

How can I make Julia output vectors as pretty as Numpy?

If I do the following:
A = [
2 1 3 0 0;
1 1 2 0 0;
0 6 0 2 1;
6 0 0 1 1;
0 0 -20 3 2
]
b = [10; 8; 0; 0; 0]
println(A\b)
The output is:
[8.000000000000002, 12.0, -6.000000000000001, -23.999999999999975, -24.000000000000043]
However, I would prefer it look similar to the way Numpy outputs the result of the same problem (EDIT: preferably keeping a trailing zero and the commas, though):
[ 8. 12. -6. -24. -24.]
Is there an easy way to do this? I could write my own function to do this, of course, but it would be pretty sweet if I could just set some formatting flag instead.
Thanks!
The standard way to do it is to change the IOContext:
julia> println(IOContext(stdout, :compact=>true), A\b)
[8.0, 12.0, -6.0, -24.0, -24.0]
You can write your own function, e.g. (I am not trying to be fully general here, but rather to show you the idea):
printlnc(x) = println(IOContext(stdout, :compact=>true), x)
and then just call printlnc in your code.
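If you also want the compact form somewhere other than the REPL, a hedged generalization is to let the caller pick the output stream (the file name below is just an example):
printlnc(io::IO, x) = println(IOContext(io, :compact => true), x)
printlnc(x) = printlnc(stdout, x)
# write the compact representation to a file
open("result.txt", "w") do io
    printlnc(io, A \ b)
end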
You could change the REPL behavior in Julia by overriding the Base.show method for floats. For example:
Base.show(io::IO, f::Float64) = print(io, rstrip(string(round(f, digits=7)),'0') )
Now you have:
julia> println(A\b)
[8., 12., -6., -24., -24.]
As noted by @DNF, Julia uses commas in vectors. If you want a horizontal vector (which is in fact a 1×n matrix) you would need to transpose:
julia> (A\b)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
8. 12. -6. -24. -24.
julia> println((A\b)')
[8. 12. -6. -24. -24.]
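The two ideas can also be combined: passing the compact IOContext to the three-argument show prints the matrix-style layout without the extra digits (a sketch, not the only way to do it):
show(IOContext(stdout, :compact => true), MIME("text/plain"), (A \ b)')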
Numpy lies to you. It just hides the digits when printing. To check that it only manipulates the printing of the output, do print(A @ X - b) and see the result.
print(A @ X - b)
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 -3.55271368e-15 0.00000000e+00]
Julia, on the other hand, makes this clear upfront. If you do the same in Julia, you get the same result (I use Float64 as Numpy does):
julia> X = A \ b;
julia> Float64.(A) * X - b
5-element Vector{Float64}:
0.0
0.0
0.0
-3.552713678800501e-15
0.0
You can, however, round the values to remove the unnecessary digits:
julia> round.(X, digits=7)
5-element Vector{Float64}:
8.0
12.0
-6.0
-24.0
-24.0
This is much better than the "ugly" 8. 12. -6. -24. -24.

I would like to concatenate anonymous calls in Julia

So, I'm learning more about Julia and I would like to do the following:
I have a 2-row by 3-column matrix, which is fixed,
A = rand(2,3)
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.705942 0.553562 0.731246
0.205833 0.106978 0.131893
Then, I would like to have an anonymous function, which does the following:
a = ones(1,3);
a[2] = rand();
Finally, I would like to broadcast
broadcast(+, ones(1,3) => a[2]=rand(), A)
So that the middle column of A, i.e. A[:,2], has two different random numbers added to it, while ones are added to the rest of the columns.
EDIT:
If I add a, as it is:
julia> a = ones(1,3)
1×3 Matrix{Float64}:
1.0 1.0 1.0
julia> a[2] = rand()
0.664824196431979
julia> a
1×3 Matrix{Float64}:
1.0 0.664824 1.0
I would like this a to be dynamic, i.e. a function.
So that:
broadcast(+, a, A)
Would give:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 0.553562 + rand() (correct) 1.73125
1.20583 0.106978 + rand() (a different rand()) 1.13189
Instead of:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 1.21839 (0.553562 + -> 0.664824) 1.73125
1.20583 0.771802 (0.106978 + -> 0.664824) 1.13189
So, I thought of this pseudo-code:
broadcast(+, a=ones(1,3) => a[2]=rand(), A)
Formalizing:
broadcast(+, <anonymous-function>, A)
Second EDIT:
Rules/Constrains:
Rule 1: the call must be data-transparent. That is, A must not change state, just like when we call f.(A).
Rule 2: no auxiliary variable may be created (a must not exist). The only vector that exists, before and after the call, is A.
Rule 3: f.(A) must be anonymous; that is, you can't define f as function f(A) ... end
With the caveat that I don't know how much you really learn by setting artificial rules like this, some tidier ways are:
julia> A = [ 0.705942 0.553562 0.731246
0.205833 0.106978 0.131893 ]; # as given
julia> r = 0.664824196431979; # the one random number
julia> (A' .+ (1, r, 1))' # no extra vector
2×3 adjoint(::Matrix{Float64}) with eltype Float64:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> mapslices(row -> row .+ (1, r, 1), A; dims=2) # one line, but slow
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> B = A .+ 1; @views B[:, 2] .+= (-1 + r); B # fast, no extra allocations
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
I can't tell from your question whether you want one random number or two different ones. If you want two, then you can do this:
julia> using Random
julia> Random.seed!(1); mapslices(row -> row .+ (1, rand(), 1), A; dims=2)
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
julia> Random.seed!(1); B = A .+ 1; @views B[:, 2] .+= (-1 .+ rand.()); B
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
Note that (-1 .+ rand.()) isn't making a new array on the right; it's fused by .+= into one loop over a column of B. Note also that B[:,2] .= stuff just writes into B, but B[:, 2] .+= stuff means B[:, 2] .= B[:, 2] .+ stuff, and so, without @views, the slice B[:, 2] on the right would allocate a copy.
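To see the allocation difference yourself, a small sketch with BenchmarkTools (array size is arbitrary; timings omitted because they are machine-dependent):
using BenchmarkTools
B = rand(10_000, 10);
@btime $B[:, 2] .+= 1.0;          # the right-hand side B[:, 2] allocates a temporary copy
@btime @views $B[:, 2] .+= 1.0;   # both sides are views of B, no temporary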
Firstly, I'd like to say that the approach taken in the other answers is the most performant one. It seems like you want the entire matrix at the end; in that case, for the best performance it is generally good to get data (like randomness) in big batches and not to "hide" data from the compiler (especially type information). A lot of interesting things can be achieved with higher-level abstractions, but since you say performance is important, let's establish a baseline:
function approach1(A)
a = ones(2,3)
@. a[:, 2] = rand()
broadcast(+, a, A)
end
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.199619 0.273481 0.99254
0.0927839 0.179071 0.188591
julia> @btime approach1($A)
65.420 ns (2 allocations: 256 bytes)
2×3 Matrix{Float64}:
1.19962 0.968391 1.99254
1.09278 1.14451 1.18859
With that out of the way let's try some other solutions.
If a single row with lazy elements doesn't count as an auxiliary variable this seems like a good starting point:
function approach2(A)
a = Matrix{Function}(undef, 1, 3)
fill!(a, ()->1.0)
a[2] = rand
broadcast((a,b)->a() + b, a, A)
end
We get a row a = [()->1.0 rand ()->1.0] and evaluate each function when the broadcast gets that element.
julia> @btime approach2($A)
1.264 μs (24 allocations: 960 bytes)
The performance is 20 times worse. Why? We've hidden type information from the compiler: it can't tell that a() is a Float64. Just asserting this (changing the last line to broadcast((a,b)->a()::Float64 + b, a, A)) increases the performance almost tenfold:
julia> @btime approach2($A)
164.108 ns (14 allocations: 432 bytes)
If this is acceptable we can make it cleaner: introduce a LazyNumber type that keeps track of the return type and has promote rules/operators, so that we can get back to broadcast(+, ...). However, we are still 2-3 times slower; can we do better?
An approach that could allow us to squeeze out some more would be to represent the whole array lazily: something like a Fill type, or a LazySetItem that applies on top of a matrix. Once again, actually creating the array will be cheaper unless you can avoid materializing parts of it.
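For concreteness, here is a rough sketch of the LazyNumber idea (the type and its methods are hypothetical, not an existing package API):
struct LazyNumber{T} <: Number
    f::Function
end
# evaluate the wrapped thunk, asserting the promised element type T for the compiler
value(l::LazyNumber{T}) where {T} = l.f()::T
# mixing with an ordinary number just forces evaluation of the lazy side
Base.:+(l::LazyNumber, x::Number) = value(l) + x
Base.:+(x::Number, l::LazyNumber) = x + value(l)
A = rand(2, 3)
a = [LazyNumber{Float64}(() -> 1.0)  LazyNumber{Float64}(rand)  LazyNumber{Float64}(() -> 1.0)]
broadcast(+, a, A)   # the rand-backed element is re-evaluated for every row of A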
I agree that it is not very clear what you are trying to achieve, or even whether what you want to learn is how to achieve something in practice or how something works in theory.
If all you want is just to add a random vector to a matrix column (and 1 elsewhere), it is as simple as... add a random vector to the desired matrix column and 1 elsewhere:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.94194 0.691855 0.583107
0.198166 0.740017 0.914133
julia> A[:,[1,3]] .+= 1
2×2 view(::Matrix{Float64}, :, [1, 3]) with eltype Float64:
1.94194 1.58311
1.19817 1.91413
julia> A[:,2] += rand(size(A,1))
2-element Vector{Float64}:
1.0306116987831297
0.8757712661515558
julia> A
2×3 Matrix{Float64}:
1.94194 1.03061 1.58311
1.19817 0.875771 1.91413
Why not just have a be the same size as A, and then you don't even need broadcasting or any weird tricks:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.564824 0.765611 0.841353
0.566965 0.322331 0.109889
julia> a = ones(2,3);
julia> a[:, 2] .= [rand() for _ in 1:size(a, 1)] #in every row, make the second column's value a different rand() result
julia> a
2×3 Matrix{Float64}:
1.0 0.519228 1.0
1.0 0.0804104 1.0
julia> A + a
2×3 Matrix{Float64}:
1.56482 1.28484 1.84135
1.56696 0.402741 1.10989

Julia: How to find the locations of strings (in a string array) in another string array

I want to return a logical vector showing, for each element of string array A, whether it is also a member of string array B.
In Matlab, this would be
A = ["me","you","us"]
B = ["me","us"]
myLogicalVector = ismember(A,B)
myLogicalVector =
1×3 logical array
1 0 1
How do I achieve this in Julia?
I have tried
myLogicalVector = occursin.(A,B)
myLogicalVector = occursin(A,B)
It seems that occursin only works if the two input string arrays are of the same length or one string is a scalar - I am not sure if I am correct on this one.
You can write:
julia> in(B).(A)
3-element BitArray{1}:
1
0
1
More verbose versions of a similar operation are (note that the type of the array is different in all cases except the first):
julia> in.(A, Ref(B))
3-element BitArray{1}:
1
0
1
julia> [in(a, B) for a in A]
3-element Array{Bool,1}:
1
0
1
julia> map(a -> in(a, B), A)
3-element Array{Bool,1}:
1
0
1
julia> map(a -> a in B, A)
3-element Array{Bool,1}:
1
0
1
julia> [a in B for a in A]
3-element Array{Bool,1}:
1
0
1
If A and B were large and you needed performance then convert B to a Set like this:
in(Set(B)).(A)
(you pay a one-time cost of creating the set, but then the lookups will be faster)
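If you query the same B repeatedly, build the Set once and reuse it; a minimal sketch:
Bset = Set(B)                 # one-time construction cost
in(Bset).(A)                  # same result as in(B).(A), but each lookup is O(1) on average
in(Bset).(["me", "them"])     # the set can be reused for further queries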

Julia DataFrames equivalent of pandas pct_change()

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
result = [NaN]
for i in 2:length(input)
push!(result, (input[i] - input[i-1])/abs(input[i-1]))
end
return result
end
This works as expected. But wanted to know whether there is a built-in function for Julia DataFrames similar to pandas pct_change which I can use directly? Or any other better way or improvements that I can make to my function above?
This is a very specific function and is not provided in DataFrames.jl, but rather in TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options for what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which the percent change cannot be calculated (whereas they are retained in pandas).
If you want your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not preserve the most accurate representation of the numbers (e.g. if you wanted to use BigFloat or exact calculations using the Rational type, they would be converted to Float64). Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
[missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
[i == 1 ? missing : (input[i]-input[i-1])/input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.
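To apply either definition to a DataFrame column directly, here is a sketch with DataFrames.jl (the df and the :price / :price_pct column names are just for illustration):
using DataFrames
df = DataFrame(price = [100.0, 105.0, 103.0, 110.0])
# source => function => destination passes the whole column to pct_change
# and stores the result (missing in the first row) as :price_pct
transform!(df, :price => pct_change => :price_pct)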

Apply a pandas window function to a reversed MultiIndex level?

I need to get an .expanding calculation on a MultiIndex DataFrame. But I need it to run in reverse. Here's a sample DataFrame:
np.random.seed(456)
j = [(a, b) for a in ['A','B','C'] for b in pd.date_range('2018-01-01', periods=5, freq='W')]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df.loc[df['Vals'] < 0] = np.nan
And here is an example of what I want to do for each of the level-0 index values:
k = df.loc['A']
k['Missing'] = k[::-1].isnull().expanding().sum() # Expanding-sum on reversed level-1
This produces the correct results for that one top-level value 'A':
Vals Missing
Num
2018-01-07 NaN 2.0
2018-01-14 NaN 1.0
2018-01-21 0.618576 0.0
2018-01-28 0.568692 0.0
2018-02-04 1.350509 0.0
But how do I get that to apply to all top-level index values, so I can set df['Missing'] =?
I was trying things of the form df.groupby('Name')[::-1].isnull().expanding().sum() ... but I can't get a functional form that allows the level-1 index to be reversed for the calculation.
What is a pandaic expression to do this?
Got it: After grouping we have to strip the top level off the MultiIndex before working on the inner set:
df['Missing'] = df.groupby('Name').apply(
lambda x: x.reset_index(level=0, drop=True)[::-1].isnull().expanding().sum()
)
I.e., we groupby('Name'), and then for each group the lambda expression strips the level-0 index via .reset_index(level=0, drop=True), at which point we can use the remaining DataFrame in reverse order: x[::-1].