How can I make Julia output vectors as pretty as Numpy?

If I do the following:
A = [
2 1 3 0 0;
1 1 2 0 0;
0 6 0 2 1;
6 0 0 1 1;
0 0 -20 3 2
]
b = [10; 8; 0; 0; 0]
println(A\b)
The output is:
[8.000000000000002, 12.0, -6.000000000000001, -23.999999999999975, -24.000000000000043]
However, I would prefer it look similar to the way Numpy outputs the result of the same problem (EDIT: preferably keeping a trailing zero and the commas, though):
[ 8. 12. -6. -24. -24.]
Is there an easy way to do this? I could write my own function to do this, of course, but it would be pretty sweet if I could just set some formatting flag instead.
Thanks!

The standard way to do it is to change the IOContext:
julia> println(IOContext(stdout, :compact=>true), A\b)
[8.0, 12.0, -6.0, -24.0, -24.0]
You can wrap this in your own function, e.g. (I am not trying to be fully general here, but rather to show you the idea):
printlnc(x) = println(IOContext(stdout, :compact=>true), x)
and then just call printlnc in your code.
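If you want the compact representation as a string instead (e.g. for logging or interpolation), sprint accepts the same context property. A minimal sketch using only Base functions (compactstring is a made-up helper name):
compactstring(x) = sprint(print, x; context = :compact => true)
compactstring(A\b)  # returns "[8.0, 12.0, -6.0, -24.0, -24.0]"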

You could change the REPL behavior in Julia by overriding the Base.show method for floats. For example:
Base.show(io::IO, f::Float64) = print(io, rstrip(string(round(f, digits=7)),'0') )
Now you have:
julia> println(A\b)
[8., 12., -6., -24., -24.]
As noted by @DNF, Julia uses commas in vectors. If you want a horizontal vector (which is in fact a 1×n matrix) you need to transpose:
julia> (A\b)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
8. 12. -6. -24. -24.
julia> println((A\b)')
[8. 12. -6. -24. -24.]

Numpy lies to you. It just hides the digits when printing. To check that it only manipulates the printing of the output, do print(A @ X - b) and see the result.
print(A @ X - b)
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 -3.55271368e-15 0.00000000e+00]
Julia, on the other hand, makes this clear upfront. If you do the same in Julia, you get the same result (I use Float64 as Numpy does):
julia> X = A \ b;
julia> Float64.(A) * X - b
5-element Vector{Float64}:
0.0
0.0
0.0
-3.552713678800501e-15
0.0
You can, however, use rounding to remove the unnecessary digits:
julia> round.(X, digits=7)
5-element Vector{Float64}:
8.0
12.0
-6.0
-24.0
-24.0
This is much better than the "ugly" 8. 12. -6. -24. -24.

Related

pandas dataframe function mean() not working correctly to ignore nan values

By default, the mean() method should ignore NaN values, but in my case it doesn't: the NaN still ends up in the result.
import numpy as np
import pandas as pd

a = np.array([1,9])
b = np.array([3,np.nan])
c = np.array([7,8])
d = {'value': [a,b,a,c], 'group': [3,3,4,4], 'garbage':['asd','acas','asdasdc','ghfas']}
df = pd.DataFrame(data=d)
df
OUTPUT:
value group garbage
0 [1, 9] 3 asd
1 [3.0, nan] 3 acas
2 [1, 9] 4 asdasdc
3 [7, 8] 4 ghfas
for i,j in df.groupby('group')['value']:
    print(j.mean())
    print("=========")
OUTPUT:
[ 2. nan]
=========
[4. 8.5]
=========
I am not sure what you are trying to do here, but I'll take a stab at it.
Firstly, the value column is a column of numpy arrays, so it is two-dimensional. Then, when you run groupby, j becomes a pd.Series of numpy arrays. Thus, when you call mean you are taking the mean by aligning the axes of the numpy arrays. This is inadvisable because these objects can change shape, which would cause an error.
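To see this axis alignment in isolation, here is a minimal sketch (the [ 2. nan] matches the first group's output above):
import numpy as np
import pandas as pd

s = pd.Series([np.array([1.0, 9.0]), np.array([3.0, np.nan])])
print(s.mean())  # element-wise mean of the two arrays: [ 2. nan]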
I think what you are trying to do is take the mean across all the arrays in each group. You can do that with:
for i,j in df.groupby('group')['value']:
    print(np.nanmean(np.concatenate(j.values)))
Whatever you are trying to do, it is going to be way easier to interact with once you combine the values in your loop.
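For instance, one way to combine them first (a sketch assuming pandas >= 0.25 for DataFrame.explode) is to flatten each array into its own row, after which the ordinary NaN-skipping mean applies:
flat = df.explode('value')                    # one scalar per row
flat['value'] = flat['value'].astype(float)   # explode keeps object dtype
print(flat.groupby('group')['value'].mean())  # mean() skips NaN here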

Binarize a continuous feature with NaNs Python

I have a pandas dataframe of 4000 rows and 35 features, in which some of the continuous features contain missing values (NaNs). For example, one of them (with 46 missing values) has a very left-skewed distribution and I would like to binarize it by choosing a threshold of 1.5 below which I would like to set it as the class 0 and above or equal to 1.5 as the class 1.
Like: X_original = [0.01,2.80,-1.74,1.34,1.55], X_bin = [0, 1, 0, 0, 1].
I tried doing: dataframe["bin"] = (dataframe["original"] > 1.5).astype(int).
However, I noticed that the missing values (NaNs) disappeared and they are encoded in the 0 class.
How could I solve this problem?
To the best of my knowledge there is no way to keep the missing values after a comparison, but you can do the following:
import pandas as pd
import numpy as np
X_original = pd.Series([0.01,2.80,-1.74, np.nan,1.55])
X_bin = X_original > 1.5
X_bin[X_original.isna()] = np.NaN
print(X_bin)
Output
0 0.0
1 1.0
2 0.0
3 NaN
4 1.0
dtype: float64
To keep the column as Integer (and also nullable), do:
X_bin = X_bin.astype(pd.Int8Dtype())
print(X_bin)
Output
0 0
1 1
2 0
3 <NA>
4 1
dtype: Int8
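As an aside, newer pandas versions (1.2 and later, if I recall correctly) also offer the nullable Float64 dtype, whose comparisons propagate missing values directly, so the explicit masking step can be skipped. A sketch:
X_nullable = X_original.astype("Float64")  # NaN becomes <NA> in the nullable dtype
X_bin = X_nullable > 1.5                   # "boolean" dtype; <NA> is preserved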
The best way I found to handle this issue was to use a list comprehension:
dataframe["Bin"] = [0 if el<1.5 else 1 if el >= 1.5 else np.NaN for el in dataframe["Original"]]
Then I convert the float numbers to strings, leaving the np.NaN untouched:
dataframe["Bin"] = dataframe["Bin"].replace([0.0,1.0],["0","1"])

How to multiply a dataframe's column by a log using julia?

I have a dataframe. I want to multiply column "b" by a "log" and then replace NaN by 0s.
How can I do that in Julia?
I am checking this: DataFrames.jl
But I do not understand.
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
I want to multiply column "b" by a "log"
Assuming you mean you want to apply the (natural) log to each element in column :b, you can do the following:
log.(df.b)
log(x) applies the (natural) log to an individual element x. By putting a dot after the log, you are broadcasting the log function across each element.
If you wanted to replace column b, do the following:
df.b = log.(df.b)
and then replace NaN by 0s
I'm assuming you want to handle the case where you have a DomainError (i.e. taking the log of a negative number). Your best bet is to handle the error before it arises:
map( x -> x <= 0 ? 0.0 : log(x), df.b)
This maps the anonymous function x -> x <= 0 ? 0.0 : log(x) across each element of column b in your DataFrame. This function tests if x is less than zero - if yes then return 0.0 else return log(x). This "one-line if statement" is called a ternary operator.
Use a generator:
( v <= 0. ? 0. : log(v) for v in df.c )
If you want to add a new column:
df[!, :d] .= ( v <= 0. ? 0. : log(v) for v in df.c)
This is faster than using map (these tests assume that df.d already exists):
julia> using BenchmarkTools
julia> @btime $df[!, :d] .= ( v <= 0.0 ? 0.0 : log(v) for v in $df.c)
1.440 μs (14 allocations: 720 bytes)
julia> @btime $df[!, :d] .= map( x -> x <= 0.0 ? 0.0 : log(x), $df.c);
1.570 μs (14 allocations: 720 bytes)
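If you prefer the DataFrames.jl transformation mini-language over broadcasting into a column, transform! expresses the same operation (a sketch reusing the column names above):
using DataFrames
transform!(df, :c => ByRow(v -> v <= 0 ? 0.0 : log(v)) => :d)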

Julia DataFrames equivalent of pandas pct_change()

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
    result = [NaN]
    for i in 2:length(input)
        push!(result, (input[i] - input[i-1])/abs(input[i-1]))
    end
    return result
end
This works as expected. But wanted to know whether there is a built-in function for Julia DataFrames similar to pandas pct_change which I can use directly? Or any other better way or improvements that I can make to my function above?
This is a very specific function and is not provided in DataFrames.jl, but rather in TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options to what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which the percent change cannot be calculated (whereas they are retained in Python).
If you want your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not produce the most accurate representation of the numbers (e.g. if you wanted to use BigFloat or exact calculations with the Rational type, they would be converted to Float64). Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
    res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
    [missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
    [i == 1 ? missing : (input[i]-input[i-1])/input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.
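To attach the result to a data frame, either definition above plugs straight into the DataFrames.jl mini-language (a sketch; the :price column is made up for illustration):
using DataFrames
df = DataFrame(price = [100.0, 110.0, 99.0])
transform!(df, :price => pct_change => :ret)  # adds a :ret column, first entry missing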

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking it into 2 quantiles (of the same boundary value):
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer value for number of quantiles an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
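To apply this robustly across the thousands of rows mentioned in the question, the bin count can be capped per row. A hedged sketch (safe_qcut is a hypothetical helper name):
def safe_qcut(row, q=2):
    n = row.dropna().nunique()  # distinct non-NaN values in this row
    if n == 0:
        return row              # nothing to bin; keep the NaNs
    return pd.qcut(row, min(q, n), duplicates='drop')

safe_qcut(s)  # reproduces the single-bin output above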
You can try filling your object/numeric columns with an appropriate placeholder ('null' for strings and 0 for numerics):
#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7. This worked for me.