overloading subarray operator in julia - operators

I know that in Julia creating binary operator overloads is easy, e.g.
+(x,y) = x*y
I also know that a[i] is shorthand for getindex and setindex!
I would like to know how to overload subarray operators, such as
a[i,j,:,3:]
I believe this is just a function call, but I am unable to find its name.

The @which macro helps find the right method to overload:
julia> sample=rand(3,4,5);
julia> @which(sample[1,1,1])
getindex(A::Array{T,N}, i1::Real, i2::Real, I::Real...) at array.jl:283
julia> @which(sample[1,1,:])
getindex(A::AbstractArray{T,N}, I...) at abstractarray.jl:487

Related

Raku: return Type

I want to write a function that returns an array all of whose subarrays must have a length of two.
For example, the return value will be [[1, 2], [3, 4]].
I define:
(1) subset TArray of Array where { .all ~~ subset :: where [Int, Int] };
and
sub fcn(Int $n) of TArray is export(:fcn) {
[[1, 2], [3, 4]];
}
I find (1) over-complicated. Is there something simpler?
Stepping back first
subset TArray of Array where { .all ~~ subset :: where [Int, Int] };
Is there something simpler?
Before we go there, let's step back. Even ignoring your code's "overly-complicated" nature based on just looking at it, it's also potentially problematic and complicated for various reasons that may not be so obvious. I'll highlight three:
This subset will accept an Array containing Arrays, with each of those arrays containing two Ints. But it doesn't mandate an Array[Array[Int]]. The outer Array's type may be just a generic Array, rather than being an Array[Array], let alone an Array[Array[Int]]. Indeed it will be unless you deliberately introduce strongly typed values. I will cover strong typing in the last section of this answer.
What about an empty Array? Your subset will accept that. Is that your intent? If not, what about requiring at least one pair of Ints?
The outer where clause uses a common Raku idiom of the form .all ~~ ..., with a junction on the left hand side of the ~~ smart match operator. Astonishingly, per an issue I just filed, this may be a problem. What alternatives are there?
Starting simple
Raku does a decent job of keeping simple things simple. If we put aside any artificial desire for strong typing, and focus on simple tools for tightening code up, a simple subset I would have suggested in the past would be:
subset TArray where .all == 2; # BAD despite being idiomatic???
This has all of the problems your original code has, plus in addition it accepts data that has non-integers where integers belong.
But it does have the redeeming qualities that it does a useful check (that the inner arrays each have two elements) and it's significantly simpler than your code.
Now I've reminded myself that I need to view .all on the left hand side of ~~ as possibly a problem, I'll instead write it as:
subset TArray where 2 == .all; # Potentially the new idiomatic.
This version reads more poorly, but, while readability is important, basic correctness is more important.
Still fairly simple, and fewer problems
Here are two variants I came up with:
subset TArray where all .map: * ~~ (Int,Int);
subset TArray where .elems == .grep: (Int,Int);
These both avoid the junction/smartmatch problem. (The first where expression does have a junction to the left of a smart match, but it's not an example of the problem.)
The second version isn't so obviously correct (think of it as checking that the count of subarrays is the same as the count of subarrays that match (Int,Int)) but it nicely lends itself to fixing the problem of matching if there are zero subarrays, if that were to need fixing:
subset TArray where 0 < .elems == .grep: (Int,Int);
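For readers more at home in Python, the counting logic of that last where clause can be sketched as below. This is only an analogy, not Raku: the function name is_pair_array is hypothetical, and isinstance(x, int) is assumed here as a rough stand-in for matching against Int.

```python
def is_pair_array(value):
    """Return True if value is a non-empty list whose items are all [int, int]."""
    if not isinstance(value, list):
        return False
    # Count the sub-lists that look like a pair of ints
    # (the role played by `.grep: (Int,Int)` in the Raku subset).
    matching = sum(
        1
        for item in value
        if isinstance(item, list)
        and len(item) == 2
        and all(isinstance(x, int) for x in item)
    )
    # Chained comparison, mirroring `0 < .elems == .grep: ...`:
    # there is at least one sub-list, and every sub-list matched.
    return 0 < len(value) == matching


print(is_pair_array([[1, 2], [3, 4]]))  # True
print(is_pair_array([]))                # False
print(is_pair_array([[1, 2], [3]]))     # False
```

Dropping the `0 <` part of the comparison gives the variant that also accepts an empty outer list.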
Strong typing solutions
The solutions thus far don't deal with strong typing. Perhaps that's desirable. Perhaps not.
To understand what I mean by this, let's first look at literals:
say WHAT 1; # (Int)
say WHAT [1,2]; # (Array)
say WHAT [[1,2],[3,4]]; # (Array)
These values have types determined by their literal constructors.
The last two are just Arrays, generic over their elements.
(The second is not an Array[Int], which might be expected. Similarly the last one is not an Array[Array[Int]].)
Current built in Raku literal forms for composite types (arrays and hashes) all construct generic Arrays which do not constrain the types of their elements.
See the PR Introduce [1,2,3]:Int syntax #4406 for a proposal/PR regarding element typed composite literals and a related issue I just posted in response to your Q here about an alternative and/or complementary approach to that PR. (There have been discussions over the years about this aspect of the type system but it seems like it's time for Rakoons to look at addressing it.)
What if you wanted to build a strongly typed data structure as the value to return from your routine, and to have the return type check that?
Here's one way one might build such a strongly typed value:
my Array[Array[Int]] $result .= new: Array[Int].new(1,2), Array[Int].new(3,4);
Super verbose! But now you could write the following for your sub's return type check and it'll work:
subset TArray of Array[Array[Int]] where 0 < .elems == .grep: (Int,Int);
sub fcn(Int $n) of TArray is export(:fcn) {
my Array[Array[Int]] $result .= new: Array[Int].new(1,2), Array[Int].new(3,4);
}
Another way to build a strongly typed value is to specify not only the strong typing in a variable's type constraint, but also coercion typing to bridge from a loosely typed value to a strongly typed target.
We keep the exact same subset (that establishes the strongly typed target data structure and adds "refinement typing" checks):
subset TArray of Array[Array[Int]] where 0 < .elems == .grep: (Int,Int);
But instead of using a verbose correct-by-construction initialization value, using full type names and news, we introduce additional coercion typing and then just use ordinary literal syntax:
constant TArrayInitialization = TArray(Array[Array[Int]()]());
sub fcn(Int $n) of TArray is export(:fcn) {
my TArrayInitialization $result = [[1,2],[3,4]];
}
(I could have written the TArrayInitialization declaration as another subset, but it would be a slight overkill to have done so. A constant does the job with less fuss.)
I gather that the aim is to restrict the type of the inner Array to [Int,Int] ... the closest I can get to this is to declare two subsets, one based on the other...
subset IArray where * ~~ [Int, Int];
subset TArray where .all ~~ IArray;
Otherwise, the anonymous subset form you use seems to be the briefest, although as @raiph points out you can drop the 'of Array' piece.
If you wanted to impose this sort of constraint on a function's parameter (rather than its return type) you could do so with something like:
sub fcn(@a where {all .map: * ~~ [Int, Int]}) {...}
As the other answers have mentioned, there currently isn't great syntax for similarly constraining the return type, but there's a proposal to add support for similar syntax for return types. In fact, as mentioned in that issue, someone has volunteered to work on an implementation but hasn't yet made any progress as far as I know. (And I guess I should know, since I was that volunteer… oops)
So, for now, a subset is the best option – but hopefully the future will have even better ways to write that.

Writing an `__array_ufunc__` for string dtypes

I'm implementing a class that mixes in NDArrayOperatorsMixin using the approach described here.
This works well for numbers, but doesn't work with string dtypes. For example,
x = MyNewArrayClass(np.array(["a", "b", "c"]))
x == "a"
raises the following UFuncTypeError:
numpy.core._exceptions.UFuncTypeError: ufunc 'equal' did not contain a loop with signature matching types (dtype('<U1'), dtype('<U1')) -> dtype('bool')
How can I modify the implementation suggested in the docs to support str dtypes?

How to subset Julia DataFrame by condition, where column has missing values

This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.
How to do row subsetting
There are two ways to solve your problem:
Use isequal instead of ==, as == implements three-valued logic, so writing any one of the following will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
If you want to use ==, wrap the comparison in coalesce, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(in general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base; except for some corner cases where it is not possible - the major differences come from the fact that DataFrame has heterogeneous column element types while Matrix in Julia Base has homogeneous element type)
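The three-valued logic behind the == versus isequal distinction can be illustrated with a small Python analogy. This is a sketch only, not how Julia or DataFrames.jl is implemented: None is assumed as a stand-in for missing, and the helper names eq3, isequal and coalesce are defined here just to mimic the behaviour described above.

```python
MISSING = None  # stand-in for Julia's `missing` (an assumption of this sketch)


def eq3(a, b):
    """Three-valued equality: comparing with MISSING yields MISSING."""
    if a is MISSING or b is MISSING:
        return MISSING
    return a == b


def isequal(a, b):
    """Two-valued equality: always True or False; MISSING only equals MISSING."""
    if a is MISSING or b is MISSING:
        return a is b
    return a == b


def coalesce(value, default):
    """Replace MISSING with a default value."""
    return default if value is MISSING else value


col = [MISSING, 1, 2]

mask_eq = [eq3(v, 2) for v in col]                          # contains MISSING, so not a pure Boolean mask
mask_isequal = [isequal(v, 2) for v in col]                 # pure Boolean mask
mask_coalesce = [coalesce(eq3(v, 2), False) for v in col]   # pure Boolean mask

rows = [v for v, keep in zip(col, mask_isequal) if keep]
print(mask_eq)        # [None, False, True]
print(mask_isequal)   # [False, False, True]
print(rows)           # [2]
```

The first mask is the one that cannot be used for row selection, because one of its entries is neither true nor false.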
How to do indexing
DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually your example df[df[:col].==2] shows why we disallow single-dimensional indexing. In df[:col] you try to use a single dimensional index to subset columns, but in outer df[df[:col].==2] you want to subset rows using a single dimensional index.
The easiest way to get a column from a data frame is df.col or df."col" (the second form is usually used when the column name contains characters such as spaces). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you want a copy of the column, write df[:, :col].
A side note - more advanced indexing
Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
All the rules for what is allowed when indexing a DataFrame are described in detail here. Also during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).

What is the lambda function doing in the info_dict parameter of the summary_col in this code?

I'm running summary statistics for a group of standard OLS regressions. The code was written by my professor and I'm trying to figure out what's going on specifically in a portion of the code.
summary_col(
[reg0,reg1,reg2,reg3],
stars=True,
float_format='%0.2f',
info_dict = {
'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.2f}".format(x.rsquared)
})
I looked up lambda functions. I have a fairly decent understanding of how they work. Aspects of the code that I do understand:
info_dict is a dictionary of values that can be called if you wish to include them in your summary statistics
lambda function work by calling an anonymous function "lambda x" then you place the : and list what operation you want to take place (i.e. x + 5) and then if you already know what parameters you want it to run you can put in a list after a second ":".
{0:d} will round to integers which makes perfect sense for observations. Although I don't know why you can't just say {%.f}. Maybe it's because the former returns an explicit int and the latter returns a float that looks like an int.
{:.2f} will return a float with 2 decimal places
What I don't fully understand is what somestring.format() does. Somehow x is getting defined as the results from the regression I believe and x.nobs is the variable "number of observations". Similar for x.rsquared.
Could someone fill in the gaps for me about what's going on in the formula? What exactly about the lambda function is enabling it to fetch data for each individual regression?
Let's break this out a little bit to make it obvious what is happening:
summary_col(
[reg0,reg1,reg2,reg3],
stars=True,
float_format='%0.2f',
info_dict={
'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.2f}".format(x.rsquared)
}
)
summary_col is a function that takes some input, the first argument being a list of regression objects, [reg0,reg1,reg2,reg3]. Then there are three named arguments: stars, float_format, and info_dict. info_dict is just a dictionary with two keys, N and R2, each of which maps to a function that returns a string. The lambdas do not fetch anything by themselves. When we pass in the list of regression objects as the first argument, I believe summary_col calls each function in info_dict once for every regression object in the list, passing that object in as x. That is why x.nobs and x.rsquared refer to the members of each individual regression: it is due to the context in which the lambdas are called.
If you try to use lambda in that line of code on something that does not exist in the regression objects, you'll almost certainly get an error. The key is in the context against which the lambda is applied.
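The calling pattern described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's actual code: FakeResult and collect_info are hypothetical names, and collect_info only imitates the way a function such as summary_col might apply each entry of info_dict to each regression object.

```python
class FakeResult:
    """Hypothetical stand-in for a fitted regression result."""

    def __init__(self, nobs, rsquared):
        self.nobs = nobs          # assumed to be stored as a float
        self.rsquared = rsquared


def collect_info(results, info_dict):
    """Call every function in info_dict once per result object."""
    table = {}
    for label, func in info_dict.items():
        # Each result is passed in as the lambda's argument `x`.
        table[label] = [func(res) for res in results]
    return table


info_dict = {
    'N': lambda x: "{0:d}".format(int(x.nobs)),
    'R2': lambda x: "{:.2f}".format(x.rsquared),
}

results = [FakeResult(100.0, 0.4567), FakeResult(250.0, 0.9)]
print(collect_info(results, info_dict))
# {'N': ['100', '250'], 'R2': ['0.46', '0.90']}
```

Here str.format simply substitutes its argument into the placeholder: {0:d} formats an integer (which is why the value is converted with int first, since the d format code rejects floats), and {:.2f} formats a number with two decimal places.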
A good example on the context of lambda functions is iterating over a dictionary and sorting by key and value.
# sort the dict by value first, and key second...
# x is each (key, value) tuple from my_dict.items()
for key, value in sorted(my_dict.items(), key=lambda x: (x[1], x[0])):
    print(key, value)
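A self-contained version of this loop, using a hypothetical sample dictionary (the data is made up for illustration), shows what sorted passes to the lambda: each (key, value) tuple from my_dict.items() becomes x in turn.

```python
my_dict = {"pear": 2, "apple": 3, "fig": 2}  # hypothetical sample data

# The lambda receives one (key, value) tuple at a time and returns
# (value, key), so items are ordered by value first and key second.
ordered = sorted(my_dict.items(), key=lambda x: (x[1], x[0]))

for key, value in ordered:
    print(key, value)
# fig 2
# pear 2
# apple 3
```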

Fminbox Constrained Optimisation Julia

Either fminbox or the Optim.autodiff function appear to create a vector of type Array{Dual{Float64},1} when I run the code below, since I get the error "fbellmanind has no method matching...Array{Dual{Float64},1}". I've specified the function fbellmanind to accept Array{Any,1} but with no luck. Any ideas?
function fbargsolve(x::Vector)
fbellmanind(probc,EV,V,Ind,x,V0,VUnemp0,Vnp,Vp,q,obj,assets,EmpState,i)
fbellmanfirm(probc,poachedwage,minw,x,jfirm1,jfirm0,Ind,i)
@inbounds for ia in 1:na
Vnp[ia]=V[ia]
Indnp[ia]=Ind[ia]
firmratio[ia]=jfirm1[ia]/jfirmres[ia]
hhratio[ia]=((Vnp[ia]-VUnemp0[ia])/(Vp[ia]-VUnemp0[ia]))
end
Crit_bwr=vnormdiff(firmratio,hhratio,Inf)
return Crit_bwr
end
f=fbargsolve
df = Optim.autodiff(f, Float64, na)
x0=vec(bargwage0)
l=vec(max(reswage,minw))
u=vec(poachedwage*ones(na))
sol=fminbox(df,x0,l,u)
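Refer to this very important point from the Julia documentation:
Julia’s type parameters are invariant....
Because of this, Array{Dual{Float64},1} is not a subtype of Array{Any,1}, even though Dual{Float64} is a subtype of Any, so a method declared for Array{Any,1} will not accept it. There are at least two possible solutions:
1- Change your function declaration. The best option is to explicitly use the right data type, Array{Dual{Float64},1}, but if you prefer a generic approach: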
Use a parametric data type:
julia> function fbellmanind{T}(::Array{T,1})
"OK"
end
julia> fbellmanind(["test"])
"OK"
2- Type cast your arguments
julia> function fbellmanind(::Array{Any,1})
"OK"
end
julia> fbellmanind(Any["test"])
"OK"