I think that for a pandas Series s, s[0:2] is equivalent to s.iloc[0:2]: in both cases two rows are returned. Recently, though, I ran into trouble. The first picture shows the expected output, but I don't know what went wrong in the second one: there, s[0] raises an error and I don't know why.
I can try to explain this behavior a bit. In Pandas, you have selection by position or by label, and it's important to remember that every single row/column always has both a position "name" and a label name. In the case of columns, this distinction is often easy to see, because we usually give columns string names. The difference is also obvious when you explicitly use .iloc vs .loc slicing. Finally, s[X] is indexing, while s[X:Y] is slicing, and the behaviour of the two actions is different.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b': [3,3,4]})
df.iloc[:,0]
df.loc[:,'a']
both will return
0 1
1 2
2 3
Name: a, dtype: int64
Now, what happened in your case is that you overwrote the index names when you declared s.index = [11,12,13,14]. You can see that by inspecting the index before and after this change. Before, if you run s.index, you see that it is a RangeIndex(start=0, stop=4, step=1). After you change the index, it becomes Int64Index([11, 12, 13, 14], dtype='int64').
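The change in the index can be checked directly. This is a minimal sketch with made-up Series values (the question's data is only visible in the pictures); note that the class reported for the new index depends on the pandas version (Int64Index in older releases, a plain Index with an integer dtype in pandas 2.x).

```python
import pandas as pd

# Hypothetical data; only the index matters here.
s = pd.Series([10, 20, 30, 40])
index_before = s.index          # default labels 0..3, stored as a RangeIndex

s.index = [11, 12, 13, 14]      # overwrite the labels; positions are unchanged
index_after = s.index

print(type(index_before).__name__)   # RangeIndex
print(list(index_after))             # [11, 12, 13, 14]
print(s.iloc[0])                     # 10: position 0 still holds the first value
```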
Why does this matter? Because although you overrode the labels of the index, the position of each one of them remains the same as before. So, when you call
s[0:2]
you are slicing by position (this section in the documentation explains that it's equivalent to .iloc). However, when you run
s[0]
Pandas thinks you want to select by label, so it starts looking for the label 0, which doesn't exist anymore, because you overrode it. Think of the square-bracket selection in the context of selecting a dataframe column: you would say df["column"] (so you're asking for the column by label), and the same applies to a Series.
In summary, what happens with Series indexing is the following:
If you use string index labels and you index by a string, Pandas will look up the string label.
If you use string index labels and you index by an integer, Pandas will fall back to indexing by position (that's why your example in the comment works).
If you use integer index labels and you index by an integer, Pandas will always try to index by the "label" (that's why the first case doesn't work, as you have overridden the label 0).
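The integer-label case can be reproduced with a short sketch. The Series values are invented for illustration. It uses .iloc and .loc to make the intent explicit, and shows that plain s[0] raises a KeyError once the label 0 is gone. The positional fallback for string-labelled Series is left out on purpose, because, to my knowledge, recent pandas releases have deprecated it.

```python
import pandas as pd

# Hypothetical values with the integer labels from the question.
s = pd.Series([10, 20, 30, 40], index=[11, 12, 13, 14])

first_two = s.iloc[0:2]     # positional slice: rows at positions 0 and 1
by_label = s.loc[11]        # label lookup: the row labelled 11
by_position = s.iloc[0]     # position lookup: the same row as above

try:
    s[0]                    # integer index, integer key: treated as a label
    lookup_failed = False
except KeyError:
    lookup_failed = True    # label 0 does not exist in the index

print(list(first_two), by_label, by_position, lookup_failed)
```

Writing .loc or .iloc explicitly avoids depending on which rule plain square brackets apply.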
Here are two articles explaining this bizarre behavior:
Indexing Best Practices in Pandas.series
Retrieving values in a Series by label or position
This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.
How to do row subsetting
There are two ways to solve your problem:
Use isequal instead of ==, because == implements three-valued logic; any one of the following will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
If you want to use ==, wrap the comparison in coalesce, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(in general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base; except for some corner cases where it is not possible - the major differences come from the fact that DataFrame has heterogeneous column element types while Matrix in Julia Base has homogeneous element type)
How to do indexing
DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually your example df[df[:col].==2] shows why we disallow single-dimensional indexing. In df[:col] you try to use a single dimensional index to subset columns, but in outer df[df[:col].==2] you want to subset rows using a single dimensional index.
The easiest way to get a column from a data frame is df.col or df."col" (the second way is usually used if you have characters like spaces in the column name). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you want a copy of the column, write df[:, :col].
A side note - more advanced indexing
Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
All the rules for what is allowed when indexing a DataFrame are described in detail here. Also, during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).
I need to select values from a single column in a Julia dataframe based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code so far (it doesn't work):
### Initialize source dataframe for PCA
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
### Extract last column as labels
arLabels=dfSource[1:2:end,3]
### Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)
output> MethodError: no method matching...
At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to use of the & logic. What have I done wrong?
A small example of alternative implementation (maybe you will find it useful to see what DataFrames.jl has in-built):
# avoid materialization if dfSource is large
dfSourceHalf = @view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]
(this is not optimized for speed but is meant as a showcase of what is possible; still, it is already several times faster than your code)
This code works:
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
arLabels=dfSource[1:2:end,3]
datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)
output> [0,10,0]
Note the parentheses around (arLabels.>=0.2) and (arLabels.<0.7), as well as the .>= and .< syntax (which applies the comparison to each element of the collection). The parentheses are needed because & binds more tightly than the comparison operators. Finally, and most crucially (since it's the part most people miss), note the use of .& in place of just &. The dot operator makes all the difference!
Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
Appending a vector to one column of df. Note: in my whole code, I add a more complicated Vector{Int} than ones' return.
However, @code_warntype test() does return:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which, I suppose, means this isn't efficient. I can't work out what this @code_warntype output means. More generally, how can I understand what @code_warntype reports and fix it? This is a recurring and unclear issue for me.
EDIT: @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths) or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different for push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).
I have an ordered collection that I would like to convert into a literal array. Below is the ordered collection and the desired result, respectively:
an OrderedCollection(1 2 3)
#(1 2 3)
What would be the most efficient way to achieve this?
The message asArray will create an Array from the OrderedCollection:
anOrderedCollection asArray
and this is probably what you want.
However, given that you say that you want a literal array it might happen that you are looking for the string '#(1 2 3)' instead. In that case I would use:
^String streamContents: [:stream | aCollection asArray storeOn: stream]
where aCollection is your OrderedCollection.
In case you are not yet familiar with streamContents: this could be a good opportunity to learn it. What it does in this case is equivalent to:
stream := '' writeStream.
aCollection asArray storeOn: stream.
^stream contents
in the sense that it captures the pattern:
stream := '' writeStream.
<some code here>
^stream contents
which is fairly common in Smalltalk.
UPDATE
Maybe it would help if we clarify a little bit what we mean by literal arrays in Smalltalk. Consider the following two methods:
method1
^Array with: 1 with: 2 with: 3
method2
^#(1 2 3)
Both methods answer with the same array, the one with entries 1, 2 and 3. However, the two implementations are different. In method1 the array is created dynamically (i.e., at runtime). In method2 the array is created statically (i.e., at compile time). In fact, when you accept (and therefore compile) method2, the array is created and saved into the method. In method1, instead, there is no stored array and the result is created every time the method is invoked.
Therefore, you would only need to create the string '#(1 2 3)' (i.e., the literal representation of the array) if you were generating Smalltalk code dynamically.
You cannot convert an existing object into a literal array. To get a literal array you'd have to write it using the literal array syntax in your source code.
However, I believe you just misunderstood what literal array means, and you are in fact just looking for an array.
A literal array is just an array that (in Pharo and Squeak [1]) is created at compile time, that is, when you accept the method.
To turn an ordered collection into an array you use asArray.
Just inspect the results of
#(1 2 3).
(OrderedCollection with: 1 with: 2 with: 3) asArray.
You'll see that both are equal.
[1]: see here for an explanation: https://stackoverflow.com/a/29964346/1846474
In Pharo 5.0 (a beta release) you can do:
| oc ary |
oc := OrderedCollection new: 5.
oc addAll: #( 1 2 3 4 5).
Transcript show: oc; cr.
ary := oc asArray.
Transcript show: ary; cr.
The output on the transcript is:
an OrderedCollection(1 2 3 4 5)
#(1 2 3 4 5)
the literalArray encoding is a kind of "poor man's" persistency encoding to get a representation, which can reconstruct the object from a compilable literal array. I.e. an Array of literals, which by using decodeAsLiteralArray reconstructs the object.
It is not a general mechanism, but was mainly invented to store UI specifications in a method (see UIBuilder).
Only a small subset of classes support this kind of encoding/decoding, and I am not sure if OrderedCollection does it in any dialect.
In the one I use (ST/X), it does not, and I get a doesNotUnderstand.
However, it would be relatively easy to add the required encoder/decoder and make it possible.
But, as I said, its intended use is for UI specs, not as a general persistency (compiled-object persistency) mechanism. So I would rather not recommend using it for that.