How to efficiently append a dataframe column with a vector? - dataframe

Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
append!(df.NbAlternative, ones(Int, 5))
df
end
Appending a vector to one column of df. Note: in my whole code, I add a more complicated Vector{Int} than ones' return.
However, #code_warntype test() does return:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which means I suppose, thisn't efficient. I can't manage to get what this #code_warntype error means. More generally, how can I understand errors returned by #code_warntype and fix them, this is a recurrent unclear issue for me.
EDIT: #BogumiłKamiński's answer
Then how one would do the following code ?
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
append!(df.NbAlternative, ones(Int, nb_simulations)*na)
append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
end
end
compare_smaa returns a nb_simulations length vector.

You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is appending a data frame-like object to a DataFrame using append! function (this guarantees that the result has consistent column lengths) or using push! to add a single row to a DataFrame.
Now the reason you have type instability is that DataFrame can hold vector of any type (technically columns are held in a Vector{AbstractVector}) so it is not possible to determine in compile time what will be the type of vector under a given name.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
for i in 1:nb_simulations
# here you have to make sure that compare_smaa returns a scalar
# if it is passed 1 in nb_simulations
push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
end
end
end
And this is how you can use append!:
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
# here you have to make sure that compare_smaa returns a vector
append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
NbMonteCarlo=ones(Int, nb_simulations)*mt,
Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different in push! which also allows to push a row that does not specify column names (in my example above I show that a Tuple can be pushed).

Related

Can anyone tell me what's wrong with my code (I am a newbie in programming, pls do cooperate )

I am trying to write a code which calculates the HCF of two numbers but I am either getting a error or an empty list as my answer
I was expecting the HCF, My idea was to get the factors of the 2 given numbers and then find the common amongst them then take the max out of that
For future reference, do not attach screenshots. Instead, copy your code and put it into a code block because stack overflow supports code blocks. To start a code block, write three tildes like ``` and to end it write three more tildes to close. If you add a language name like python, or javascript after the first three tildes, syntax highlighting will be enabled. I would also create a more descriptive title that more accurately describes the problem at hand. It would look like so:
Title: How to print from 1-99 in python?
for i in range(1,100):
print(i)
To answer your question, it seems that your HCF list is empty, and the python max function expects the argument to the function to not to be empty (the 'arg' is the HCF list). From inspection of your code, this is because the two if conditions that need to be satisfied before anything is added to HCF is never satisfied.
So it could be that hcf2[x] is not in hcf and hcf[x] is not in hcf[x] 2.
What I would do is extract the logic for the finding of the factors of each number to a function, then use built in python functions to find the common elements between the lists. Like so:
num1 = int(input("Num 1:")) # inputs
num2 = int(input("Num 2:")) # inputs
numberOneFactors = []
numberTwoFactors = []
commonFactors = []
# defining a function that finds the factors and returns it as a list
def findFactors(number):
temp = []
for i in range(1, number+1):
if number%i==0:
temp.append(i)
return temp
numberOneFactors = findFactors(num1) # populating factors 1 list
numberTwoFactors = findFactors(num2) # populating factors 2 list
# to find common factors we can use the inbuilt python set functions.
commonFactors = list(set(numberOneFactors).intersection(numberTwoFactors))
# the intersection method finds the common elements in a set.

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
start::Int # column number
stop::Int # column number
function ColumnRange(a::Int, b::Int)
#assert a <= b "Invalid ColumnRange. Start column must be located before end column."
return new(a, b)
end
end
So it looks to me like only consecutive columns are working.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get #Nils Gudat's answer to work you need to add the ... operator to give
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

How to subset Julia DataFrame by condition, where column has missing values

This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.
How to do row subsetting
There are two ways to solve your problem:
use isequal instead of ==, as == implements 3-valued logic., so just writing one of will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
if you want to use == use coalesce on top of it, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(in general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base; except for some corner cases where it is not possible - the major differences come from the fact that DataFrame has heterogeneous column element types while Matrix in Julia Base has homogeneous element type)
How to do indexing
DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually your example df[df[:col].==2] shows why we disallow single-dimensional indexing. In df[:col] you try to use a single dimensional index to subset columns, but in outer df[df[:col].==2] you want to subset rows using a single dimensional index.
The easiest way to get a column from a data frame is df.col or df."col" (the second way is usually used if you have characters like spaces in the column name). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you would want to copy the column write df[:, :col].
A side note - more advanced indexing
Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
All the rules what is allowed for indexing a DataFrame are described in detail here. Also during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).

Selecting Columns Based on Multiple Criteria in a Julia DataFrame

I need to select values from a single column in a Julia dataframe based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (primary component analysis), so I first split the original data into an anlytical matrix and a label array. This is my code, so far (doesn't work):
### Initialize source dataframe for PCA
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
### Extract last column as labels
arLabels=dfSource[1:2:end,3]
### Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)
output> MethodError: no method matching...
At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to use of the & logic. What have I done wrong?
A small example of alternative implementation (maybe you will find it useful to see what DataFrames.jl has in-built):
# avoid materialization if dfSource is large
dfSourceHalf = #view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]
(this is not optimized for speed, but rather as a showcase what is possible, but still it is already several times faster than your code)
This code works:
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
arLabels=dfSource[1:2:end,3]
datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)
output> [0,10,0]
Note the use of parenthetical enclosures (arLabels.>=0.2) and (arLabels<0.7), as well as the use of the .>= and .< syntax (which forces Julia to iterate through a container/collection). Finally, and most crucially (since it's the part most people miss), note the use of .& in place of just &. The dot operator makes all the difference!

Processing pandas data in declarative style

I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To acheive this I use df.groupby('vehicle_id', 'day') and then .apply(algorithm) or .filter(condition) where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declaritive style, as opposed to imperatively looping through the groups, with the goal of the whole thing to look something like:
df.group_by('vehicle_id', 'day').apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem. They return all the data back in a single dataframe, and I find that I have apply .groupby('vehicle_id', 'day') continuously.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
dfg = dfg.do_stuff1() # Perform all needed operations
dfg = do_stuff2(dfg) #
arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then map a single groupby/apply to it:
def all_operations(dfg):
# Do stuff
return result_df
result = df.group_by(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.