This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a DataFrame df in Julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in Julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that form would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as an FYI, I can get what I want by using Query and writing a multi-line SQL-style query just to subset the data, which seems... burdensome.
How to do row subsetting
There are two ways to solve your problem:
use isequal instead of ==, as == implements 3-valued logic, so writing any one of the following will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
if you want to use ==, use coalesce on top of it, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
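For concreteness, here is a minimal sketch applying both approaches to a toy column like the one in the question:
using DataFrames
df = DataFrame(col = [missing, 1, 2])
df[isequal.(df.col, 2), :]            # 1×1 DataFrame: the row where col equals 2
df[coalesce.(df.col .== 2, false), :] # same result; missing comparisons become false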
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(In general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base, except for some corner cases where that is not possible; the major differences come from the fact that a DataFrame has heterogeneous column element types, while a Matrix in Julia Base has a homogeneous element type.)
How to do indexing
DataFrame is a two-dimensional object: it has rows and columns. In Julia, the df[...] notation is normally used to access an object via locations in its dimensions, so df[:col] is not a valid way to index into a DataFrame: you are trying to use a single indexing dimension, while specifying both row and column indices is required. You are getting a warning because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually, your example df[df[:col].==2] shows why we disallow single-dimensional indexing: in df[:col] you try to use a single-dimensional index to select a column, while in the outer df[df[:col].==2] you want to subset rows using a single-dimensional index.
The easiest way to get a column from a data frame is df.col or df."col" (the second form is usually used if the column name contains characters like spaces). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you want a copy of the column, write df[:, :col].
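To make the copying behavior concrete, here is a minimal sketch with toy data:
using DataFrames
df = DataFrame(col = [missing, 1, 2])
v = df[!, :col]   # no copy: v is the stored column vector itself (same as df.col)
c = df[:, :col]   # copy: c is an independent vector
v[2] = 100        # mutates the column inside df
c[3] = 200        # leaves df untouched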
A side note - more advanced indexing
Indeed, in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or a CartesianIndex. Doing df[i], where i is an integer, is not allowed for a DataFrame, as it was judged that following this convention from Julia Base would be too confusing for users (it is tied to the storage mode of arrays, which is not the same as for a DataFrame). You are, though, allowed to write df[i] when i is a CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
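For completeness, a tiny sketch of the CartesianIndex form with toy data:
using DataFrames
df = DataFrame(a = [1, 2], b = [3, 4])
df[CartesianIndex(2, 1)]  # row 2, column 1 -> 2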
All the rules for what is allowed when indexing a DataFrame are described in detail here. Also, during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).
Related
I've been messing with dataframes and lists, trying to understand how they work, and I was wondering if someone could explain something for me about a list I can't seem to make into a dataframe because it's not a 2-d input...
So I am downloading the companies listed on a stock exchange. The stock exchange has about 500 companies. Each company can be in one or more indexes.
import pandas as pd

bovespa = pd.read_csv(r'D:\Libraries\Downloads\IbovParts.csv', sep=';')
This makes a dataframe from a file, which is a list of all the companies listed on the Brazilian B3 exchange, with 4 columns: the company name, the type of stock, the code, and which indexes the stock is part of, for example:
From this dataframe, I want to create a set of smaller dataframes, each of which will contain all the companies in that particular index.
I'm not sure it's the best way, but I found some similar code that creates a dictionary, where the index name is the key and the value is a list of all the stocks in that particular index.
First I manually made a list of the indexes:
list_of_indexes = ['AGFS', 'BDRX', 'GPTW', 'IBOV', 'IBRA', 'IBXL', 'IBXX', 'ICO2', 'ICON', 'IDIV', 'IFIL', 'IFIX', 'IFNC', 'IGCT', 'IGCX', 'IGNM', 'ISEE', 'ITAG', 'IVBX', 'MLCX', 'SMLL', 'UTIL']
Then this is the code that creates a dictionary of keys (index name) and values (empty lists) then fills the lists:
indexes = {key: [] for key in list_of_indexes}
for k in indexes:
    mask = bovespa['InIndexes'].str.contains(k)
    list = bovespa.loc[mask, ['Empresa', 'Code']]  # note: shadows the built-in list
    indexes[k].append(list)  # appends the whole DataFrame as a single list element
This seems to work fine. Checking the printout it does what I want it to do.
Now, I want to choose one of the indexes (for example 'IBOV') and create a new dataframe which contains ONLY the codes of the companies in IBOV. I can then use this list of codes in the yf library to download the financial data for the companies of 'IBOV'.
To do this I tried this code, hoping to get a dataframe with an index, the company name and the company code:
IBOV_codes_df = pd.DataFrame(indexes.get('IBOV'))
and got this error:
ValueError: Must pass 2-d input. shape=(1, 88, 2)
The 'type' of the data I'm using (indexes.get('IBOV')) is a list:
type(indexes.get('IBOV'))
returns list, but pd.DataFrame can't use it. Also, I can't access any of the individual elements in the list. This is what the list looks like (in Jupyter):
indexes.get('IBOV')
At first I thought it was a 'normal' list with 88 rows and 2 columns, then I noticed the second square bracket AFTER the columns, and len(list) told me this list had only one element. I'm still fuzzy on lists and dataframes etc...
Anyway, this error seems to be quite common, and I found a solution here on Stack Overflow:
pd.DataFrame(IBOV_codes[0])
Unfortunately, the post on Stack Overflow just told the original poster to "do this" with no explanation, and it worked. It also worked for me, and created a dataframe that is identical in appearance to the list (but without the brackets, obviously).
Logically, as there is only one element in the list, [0] is the only callable element to use, so it makes sense. My first question is... why?? What the heck's going on? How can Python make a dataframe from a list with only one long, confusing string(?) element? I know it's pretty smart, but seriously? How? Also, if there is only one element, why does Python throw the error shape=(1, 88, 2)? How is that possible? What does shape=(1, 88, 2) mean or look like? I thought the shape would be (1, 1): one row and one column. Very confusing.
My second question is about indexing...
In the original dataframe made from the csv, the list of ALL companies, the index (I assume) is the list of numbers: 0, 1, 2 ... 513.
When I start slicing, and create the final dataframe, using pd.DataFrame(IBOV_codes[0]), the index column is 1, 12,17,24,34... 492, 496, 497, 506, 511. Each company has the same 'index' it had when read from the csv.
The numbers are still sequential, but the index is missing loads of numbers. Are these indexes still integers? Or have they become strings/objects? What would be best practice? To reindex to 0, 1, 2, 3, 4 etc.?
If anyone can clear things up, "Thanks!"
I think that in pandas, for a Series S, S[0:2] is equivalent to S.iloc[0:2]; in both cases two rows are returned. But recently I ran into trouble. The first picture shows the expected output, but I don't know what went wrong in the second picture, where S[0] throws an error; I don't know why.
I can try to explain this behavior a bit. In pandas, you have selection by position or by label, and it's important to remember that every single row/column will always have both a positional "name" and a label name. In the case of columns, this distinction is often easy to see, because we usually give columns string names. The difference is also obvious when you explicitly use .iloc vs. .loc slicing. Finally, s[X] is indexing, while s[X:Y] is slicing, and the behaviour of the two actions is different.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b': [3,3,4]})
df.iloc[:,0]
df.loc[:,'a']
both will return
0 1
1 2
2 3
Name: a, dtype: int64
Now, what happened in your case is that you overwrote the index names when you declared s.index = [11,12,13,14]. You can see that by inspecting the index before and after this change. Before, if you run s.index, you see that it is a RangeIndex(start=0, stop=4, step=1). After you change the index, it becomes Int64Index([11, 12, 13, 14], dtype='int64').
Why does this matter? Because although you overrode the labels of the index, the position of each one of them remains the same as before. So, when you call
s[0:2]
you are slicing by position (this section in the documentation explains that it's equivalent to .iloc). However, when you run
s[0]
Pandas thinks you want to select by label, so it starts looking for the label 0, which doesn't exist anymore, because you overrode it. Think of the square-bracket selection in the context of selecting a dataframe column: you would say df["column"] (so you're asking for the column by label), so the same is in the context of a series.
In summary, what happens with Series indexing is the following:
In the case you use string index labels, and you index by a string, Pandas will look up the string label.
In the case you use string index labels, and you index by an integer, Pandas will fall back to indexing by position (that's why your example in the comment works).
In the case you use integer index labels, and you index by an integer, Pandas will always try to index by the "label" (that's why the first case doesn't work, as you have overridden the label 0).
Here are two articles explaining this bizarre behavior:
Indexing Best Practices in Pandas.series
Retrieving values in a Series by label or position
I need to select values from a single column in a Julia dataframe based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code, so far (doesn't work):
### Initialize source dataframe for PCA
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
### Extract last column as labels
arLabels=dfSource[1:2:end,3]
### Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)
output> MethodError: no method matching...
At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to the use of the & logic. What have I done wrong?
A small example of an alternative implementation (maybe you will find it useful to see what DataFrames.jl has built in):
# avoid materialization if dfSource is large
dfSourceHalf = @view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]
(this is not optimized for speed, but rather is a showcase of what is possible; still, it is already several times faster than your code)
This code works:
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
arLabels=dfSource[1:2:end,3]
datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)
output> [0,10,0]
Note the use of the parenthetical enclosures (arLabels.>=0.2) and (arLabels.<0.7), as well as the use of the .>= and .< syntax (the broadcast forms, which apply the comparison elementwise across a container/collection). Finally, and most crucially (since it's the part most people miss), note the use of .& in place of just &. The dot operator makes all the difference!
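A minimal sketch of why both the dots and the parentheses matter, reusing the labels computed above:
arLabels = [0.2, 0.5, 0.0, 0.2, 0.8]  # the odd-row labels from dfSource
# Plain & binds as tightly as *, so arLabels .>= 0.2 & arLabels .< 0.7
# parses as arLabels .>= (0.2 & arLabels) .< 0.7, and 0.2 & arLabels
# has no method -- hence the MethodError.
mask = (arLabels .>= 0.2) .& (arLabels .< 0.7)  # elementwise AND -> Bool vector
arLabels[mask]  # [0.2, 0.5, 0.2]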
Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
append!(df.NbAlternative, ones(Int, 5))
df
end
Appending a vector to one column of df. Note: in my whole code, I append a more complicated Vector{Int} than what ones returns.
However, @code_warntype test() does return:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which means, I suppose, that this isn't efficient. I can't manage to work out what this @code_warntype output means. More generally, how can I understand the output returned by @code_warntype and fix such problems? This is a recurring, unclear issue for me.
EDIT: following @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
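If this instability matters in hot code, the usual workaround is a function barrier: extract the column once and let an inner function specialize on its concrete type. A minimal sketch with toy data (the function names here are made up for illustration):
using DataFrames
df = DataFrame(a = [1, 2, 3])
colsum_unstable(df) = sum(df.a)          # df.a is only inferred as AbstractVector
barrier(col::AbstractVector) = sum(col)  # specializes on the concrete column type
colsum_stable(df) = barrier(df.a)        # one dynamic dispatch, then stable code
# @code_warntype still flags the outer call, but at run time barrier
# dispatches once and then runs code compiled for Vector{Int}.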
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different from push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).
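A minimal sketch of the difference, with toy columns:
using DataFrames
df = DataFrame(a = Int[], b = Int[])
push!(df, (1, 2))                      # plain Tuple: matched by position, no names
push!(df, (a = 3, b = 4))              # NamedTuple: matched by column name
append!(df, (a = [5, 6], b = [7, 8]))  # append! requires matching column names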
I have a by call which creates SubDataFrames. How do I turn these into a DataFrame, preferably without copying?
My original problem is that I cannot add a new column to a SubDataFrame:
# df[:End] = 1:nrow(merged_islands)
# ERROR: LoadError: Cannot assign to non-existent column: End
# insert!(df, length(df), Array(1:nrow(merged_islands)), :End)
# ERROR: LoadError: MethodError: no method matching insert!(::SubDataFrame{Array{Int64,1}}, ::Int64, ::Array{Int64,1}, ::Symbol)
I am guessing converting it into a DataFrame is the easiest way to do it :)
An interesting question. On the current master (to be tagged very soon) it is enough to write DataFrame(sdf), where sdf is a SubDataFrame. It will create a copy of all vectors, though.
Here is a solution that will create a DataFrame with a view of all vectors contained in SubDataFrame (it should work both on master and on currently tagged release):
function sdf2df(sdf::SubDataFrame)
    p = parent(sdf)
    sel = DataFrames.rows(sdf)
    DataFrame(AbstractVector[view(p[i], sel) for i in 1:ncol(sdf)],
              names(sdf))
end
(I use an AbstractVector container type, as it will be faster on the current master.)
You will not be able to add rows to such a DataFrame while it holds at least one view column.
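A small usage sketch with toy data, assuming the sdf2df helper above and the DataFrames.jl version this answer targets, showing that the resulting columns are views into the parent:
using DataFrames
df = DataFrame(x = 1:4, y = 5:8)
sdf = view(df, 2:3, :)  # a SubDataFrame selecting rows 2 and 3
dfv = sdf2df(sdf)       # its columns are views into df's vectors
df.x[2] = 100           # mutating the parent...
dfv.x                   # ...shows through the view: [100, 3]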
EDIT: as a side note (maybe this was your problem in the end): if you have sdf, a SubDataFrame whose parent is the DataFrame df, then any columns you add to df will be immediately visible in sdf, as a SubDataFrame only selects rows and inherits all columns from its parent.