Convert inputs to arrays in julia - numpy

Is there an alternative to numpy.atleast_2d() in Julia?
The Python function is described at this link: https://www.geeksforgeeks.org/numpy-atleast_2d-in-python/

Looking at the Python Numpy docs this needs to be defined as:
atleast_2d(a) = fill(a,1,1)
atleast_2d(a::AbstractArray) = ndims(a) == 1 ? reshape(a, :, 1) : a
Testing:
julia> atleast_2d(3)
1×1 Matrix{Int64}:
3
julia> atleast_2d([4,5])
2×1 Matrix{Int64}:
4
5
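One detail worth checking against the NumPy docs: numpy.atleast_2d places a 1-D input of length N in a single row, giving shape (1, N), while the definition above produces an N×1 column. The sketch below imitates NumPy's shape rule with plain nested lists (no NumPy involved; the function is a hypothetical stand-in), so the two layouts can be compared.

```python
def atleast_2d(a):
    """Nested-list imitation of NumPy's shape rule, for illustration only."""
    if not isinstance(a, list):
        return [[a]]   # scalar -> 1x1
    if not a or not isinstance(a[0], list):
        return [a]     # 1-D of length N -> 1xN (a single row)
    return a           # already 2-D or deeper: returned unchanged


print(atleast_2d(3))
print(atleast_2d([4, 5]))
```

If the row layout is wanted in Julia, using reshape(a, 1, :) in place of reshape(a, :, 1) should give it.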

Related

Pandas interpolation type when method='index'?

The pandas documentation indicates that when method='index', the numerical values of the index are used. However, I haven't found any indication of the underlying interpolation method employed. It looks like it uses linear interpolation. Can anyone confirm this definitively or point me to where this is stated in the documentation?
It turns out the documentation is a bit misleading. On reading
‘index’, ‘values’: use the actual numerical values of the index.
one is likely to conclude that NaN values are filled with the numerical values of the index, which is not correct. It should be read as: interpolate linearly, using the actual numerical values of the index as the x-coordinates.
The difference between method='linear' and method='index' in the source code of pandas.DataFrame.interpolate is mainly in the following code:
if method == "linear":
    # prior default
    index = np.arange(len(obj.index))
    index = Index(index)
else:
    index = obj.index
So if the DataFrame uses the default RangeIndex, method='linear' and method='index' give the same result; with a different index the results can differ. The following example shows the difference:
import pandas as pd
import numpy as np
d = {'val': [1, np.nan, 3]}
df0 = pd.DataFrame(d)
df1 = pd.DataFrame(d, [0, 1, 6])
print("df0:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df0.interpolate(method='index'), df0.interpolate(method='linear')))
print("df1:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df1.interpolate(method='index'), df1.interpolate(method='linear')))
Outputs:
df0:
method_index:
   val
0  1.0
1  2.0
2  3.0
method_linear:
   val
0  1.0
1  2.0
2  3.0
df1:
method_index:
        val
0  1.000000
1  1.333333
6  3.000000
method_linear:
   val
0  1.0
1  2.0
6  3.0
As you can see, when index=[0, 1, 6] with val=[1.0, NaN, 3.0], the value interpolated at index 1 is 1.0 + (3.0 - 1.0) * (1 - 0) / (6 - 0) = 1.333333
Following the call path through the pandas source code (generic.py -> managers.py -> blocks.py -> missing.py), we find the implementation that interpolates linearly using the actual numerical values of the index:
NP_METHODS = ["linear", "time", "index", "values"]
if method in NP_METHODS:
    # np.interp requires sorted X values, #21037
    indexer = np.argsort(inds[valid])
    result[invalid] = np.interp(
        inds[invalid], inds[valid][indexer], yvalues[valid][indexer]
    )
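To make the arithmetic concrete, here is a minimal pure-Python sketch of the same idea: linear interpolation with the index values as x-coordinates. It is not the pandas implementation; the function name is hypothetical, missing values are represented by None, and it assumes a sorted, unique index whose first and last values are present.

```python
from bisect import bisect_left


def interpolate_on_index(index, values):
    """Fill None entries by linear interpolation, using index as x-coordinates.

    Assumes index is sorted and unique, and the first and last values are present.
    """
    known_x = [x for x, v in zip(index, values) if v is not None]
    known_y = [v for v in values if v is not None]
    filled = []
    for x, v in zip(index, values):
        if v is not None:
            filled.append(v)
            continue
        hi = bisect_left(known_x, x)  # position of the first known point right of x
        x0, x1 = known_x[hi - 1], known_x[hi]
        y0, y1 = known_y[hi - 1], known_y[hi]
        filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled


# Index values as x-coordinates (the method='index' case)
print(interpolate_on_index([0, 1, 6], [1.0, None, 3.0]))
# Row positions as x-coordinates (what method='linear' substitutes)
print(interpolate_on_index([0, 1, 2], [1.0, None, 3.0]))
```

The only difference between the two calls is the x-coordinate list, which mirrors the branch on method == "linear" in the pandas source quoted earlier.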

Julia CartesianIndex to integer LinearIndex conversion

To convert a CartesianIndex, such as CartesianIndex(1,2), to a linear index, I can use the LinearIndices function:
julia> a = rand(2,2)
2×2 Array{Float64,2}:
0.57097 0.0647051
0.767868 0.531104
julia> I = LinearIndices(a)
2×2 LinearIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}:
1 3
2 4
julia> I[CartesianIndex(1,2)]
3
However, how do I get the linear index 3 for CartesianIndex(1,2) without constructing the array a, assuming I know the ranges for the CartesianIndex (1:2, 1:2)?
Just use LinearIndices with a tuple of the axes (or even just a tuple of dimension sizes):
julia> LinearIndices((1:2,1:2))
2×2 LinearIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}:
1 3
2 4
julia> LinearIndices((1:2,1:2))[1,2]
3
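For intuition about what that lookup computes, the mapping is plain column-major arithmetic. A small Python sketch, assuming 1-based indices and axes that all start at 1 (the function name is hypothetical, and this is an illustration rather than Julia's implementation):

```python
def linear_index(cartesian, dims):
    """1-based column-major linear index for a 1-based Cartesian index."""
    linear, stride = 1, 1
    for i, n in zip(cartesian, dims):
        if not 1 <= i <= n:
            raise IndexError(f"index {i} out of range 1:{n}")
        linear += (i - 1) * stride  # the first dimension varies fastest
        stride *= n
    return linear


print(linear_index((1, 2), (2, 2)))  # 3, matching LinearIndices((1:2,1:2))[1,2]
```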

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
import numpy as np
import pandas as pd
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
I also attempted np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
                  a second
0  [1234, abc, 444]    abc
1            [5678]    NaN
2       [2468, def]    def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
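Another option is a plain list comprehension over the column. The sketch below uses a hypothetical helper and ordinary Python lists so that it runs without pandas; whether it beats apply on a given dataset is something to measure rather than assume.

```python
import math


def nth_or_nan(lists, n):
    """Return the n-th (1-based) element of each list, or NaN if the list is too short."""
    return [lst[n - 1] if len(lst) >= n else math.nan for lst in lists]


rows = [['1234', 'abc', '444'], ['5678'], ['2468', 'def']]
second = nth_or_nan(rows, 2)
print(second)
# With the DataFrame from the question this could be assigned as:
#     t['second'] = nth_or_nan(t['a'], 2)
```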

Taking an expression as an argument in Julia function

I'm trying to implement OLS regression in Julia as a learning exercise. A feature I would like to have is accepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
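As a language-neutral illustration of the "mapping" step, here is a rough Python sketch that splits a formula string into a response and predictors, then looks the names up in a dict of columns. It is string-based, so it is not the macro mechanism described above; the helper names are hypothetical and only terms joined by + are handled.

```python
def parse_formula(formula):
    """Split 'Y ~ x1 + x2' into ('Y', ['x1', 'x2']). Hypothetical helper."""
    lhs, rhs = formula.split("~")
    return lhs.strip(), [term.strip() for term in rhs.split("+")]


def design_matrix(formula, columns):
    """Look up the named columns and build (X, y), with an intercept column in X."""
    response, predictors = parse_formula(formula)
    y = list(columns[response])
    X = [[1.0] + [columns[name][i] for name in predictors] for i in range(len(y))]
    return X, y


# Made-up data, standing in for DataFrame columns
data = {"Y": [1.0, 2.0, 3.0], "x1": [0.5, 1.5, 2.5], "x2": [4.0, 5.0, 6.0]}
X, y = design_matrix("Y ~ x1 + x2", data)
print(X)
print(y)
```

With X and y in hand, the weights follow from an ordinary least-squares solve.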
Use an anonymous function as an input.
julia> using DataFrames
julia> f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia> x = DataFrame(A = 6)
julia> y = DataFrame(B = 7)
julia> function OSL(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia> OSL(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the Boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want to verify that the weight vector is correct.)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = @formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html

DataFrames.jl Number of rows

I'd like to get the number of rows of a dataframe.
I can achieve that with size(myDataFrame)[1].
Is there a cleaner way?
If you are using DataFrames specifically, then you can use nrow():
julia> df = DataFrame(Any[1:10, 1:10]);
julia> nrow(df)
10
Alternatively, you can specify the dimension argument for size:
julia> size(df, 1)
10
This also works for arrays, so it's a bit more general:
julia> my_array = rand(4, 3)
4×3 Array{Float64,2}:
0.980798 0.873643 0.819478
0.341972 0.34974 0.160342
0.262292 0.387406 0.00741398
0.512669 0.81579 0.329353
julia> size(my_array, 1)
4