I've been using pandas for a little while now and I've realised that I use
df.col
df['col']
interchangeably. Are they actually the same or am I missing something?
Following on from the link in the comments.
df.col
Simply refers to an attribute of the dataframe, similar to say
df.shape
Now if 'col' is a column name in the dataframe then accessing this attribute returns the column as series. This sometimes will be sufficient but
df['col']
will always work, and can also be used to add a new column to a dataframe.
I think this is kind of obvious.....
You cannot use df.col if the column name 'col' has a space in it. But df['col'] always works.
e.g,
df['my col'] works but df.my col will not work.
I'll note there's a difference in how some methods consume data. For example, in the LifeTimes library if I use dataframe.col with some methods, the method will consider the column to be an ndarray and throw an exception that the data must be 1-dimensional.
If however I use dataframe['col'] then the method will consume the data as expected.
Related
Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
start::Int # column number
stop::Int # column number
function ColumnRange(a::Int, b::Int)
#assert a <= b "Invalid ColumnRange. Start column must be located before end column."
return new(a, b)
end
end
So it looks to me like only consecutive columns are working.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get #Nils Gudat's answer to work you need to add the ... operator to give
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])
This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.
How to do row subsetting
There are two ways to solve your problem:
use isequal instead of ==, as == implements 3-valued logic., so just writing one of will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
if you want to use == use coalesce on top of it, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(in general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base; except for some corner cases where it is not possible - the major differences come from the fact that DataFrame has heterogeneous column element types while Matrix in Julia Base has homogeneous element type)
How to do indexing
DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually your example df[df[:col].==2] shows why we disallow single-dimensional indexing. In df[:col] you try to use a single dimensional index to subset columns, but in outer df[df[:col].==2] you want to subset rows using a single dimensional index.
The easiest way to get a column from a data frame is df.col or df."col" (the second way is usually used if you have characters like spaces in the column name). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you would want to copy the column write df[:, :col].
A side note - more advanced indexing
Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
All the rules what is allowed for indexing a DataFrame are described in detail here. Also during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).
To me, it seems df.take() has the same functionality as the more common df.iloc[]. I've checked the documentation but couldn't find a difference. Are there cases where take() is preferable to iloc[]?
There are some pretty important differences:
.iloc is of type <class 'pandas.core.indexing._iLocIndexer'>, whereas .take is a method
Most important .iloc can index into rows AND columns at the same time. .take can only select from one or the other.
.take always returns a DataFrame with the same number of levels in both axes. In contrast, .iloc can easily return fewer levels, to the point of totally returning a Series instead of a DataFrame, i.e. df.iloc[3, :] returns a Series with the columns as the index, but df.take([3]) returns a DataFrame with only one row.
Also pretty important, .take always returns a copy. This means that you can also use .iloc for assignment, but this is not the case for .take.
Case:
I received the following error message and finally figured out that it was hanging up on the .iloc.
TypeError: unhashable type: 'list'
Since the iloc returned a list which is mutable, the code wouldn't proceed with the summation. A simple change from .iloc to .take fixed the problem.
#returns list
df3['sum_total']= df3.iloc[:,-30].sum(axis=[1])
#returns tuple and worked
df3['sum_total']= df3.take([-30], axis=1).sum(axis=1)
Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
append!(df.NbAlternative, ones(Int, 5))
df
end
Appending a vector to one column of df. Note: in my whole code, I add a more complicated Vector{Int} than ones' return.
However, #code_warntype test() does return:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which means I suppose, thisn't efficient. I can't manage to get what this #code_warntype error means. More generally, how can I understand errors returned by #code_warntype and fix them, this is a recurrent unclear issue for me.
EDIT: #BogumiłKamiński's answer
Then how one would do the following code ?
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
append!(df.NbAlternative, ones(Int, nb_simulations)*na)
append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is appending a data frame-like object to a DataFrame using append! function (this guarantees that the result has consistent column lengths) or using push! to add a single row to a DataFrame.
Now the reason you have type instability is that DataFrame can hold vector of any type (technically columns are held in a Vector{AbstractVector}) so it is not possible to determine in compile time what will be the type of vector under a given name.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
for i in 1:nb_simulations
# here you have to make sure that compare_smaa returns a scalar
# if it is passed 1 in nb_simulations
push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
end
end
end
And this is how you can use append!:
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
# here you have to make sure that compare_smaa returns a vector
append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
NbMonteCarlo=ones(Int, nb_simulations)*mt,
Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different in push! which also allows to push a row that does not specify column names (in my example above I show that a Tuple can be pushed).
Working with a dataframe df I wanted to create a new column A and assign it to a single value (a string in my case)
df['A'] = value
gave a warning and suggested to use loc
however the solution below still gave the same warning:
df.loc[:,'A']=value
Doing some research I found the solution below which does not generate a warning:
df=df.assign(A =value)
Is it the general accepted way of creating a new column and assigning it to a value? Are there other possibilities using loc?
pandas version '0.20.1'
EDIT: this is the warning message obtained for the 2 first methods
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
As explained by #EdChum and #ScottBoston
Since df was derived using a mask on some original dataframe
df = df_original[boolean_mask]
to avoid the warning with the two first methods, use instead df=df_original[boolean_mask].copy()
df.assign does not need this because it automatically creates a copy of the original dataframe