Example of first rows of the DataFrame with original columns Height and Size:
Desired results are in new additional columns Result_by_row and Result_by_column.
Result_by_row: compare the value in row 2 with the value in row 1 of the same column and, if it is higher, return True in the new column Result_by_row (example: 46 > 17, so return True in row 2 of column Result_by_row).
Result_by_column: the same principle, but comparing across different columns (example: 46 > 32, so return True in row 2 of column Result_by_column).
Thanks.
I'm new to Julia, so I don't know how to do it :)
Here is one way to do it:
julia> using DataFrames
julia> df = DataFrame(Height=[32.0, 35.0, 63.0, 84.0, 72.0], Size=[17.0, 46.0, 18.0, 56.0, 6.0])
5×2 DataFrame
 Row │ Height   Size
     │ Float64  Float64
─────┼──────────────────
   1 │    32.0     17.0
   2 │    35.0     46.0
   3 │    63.0     18.0
   4 │    84.0     56.0
   5 │    72.0      6.0
julia> using ShiftedArrays: lag
julia> df.Result_by_row = df.Height .> lag(df.Height)
5-element Vector{Union{Missing, Bool}}:
missing
true
true
true
false
julia> df.Result_by_column = df.Size .> lag(df.Height)
5-element Vector{Union{Missing, Bool}}:
missing
true
false
false
false
julia> df
5×4 DataFrame
 Row │ Height   Size     Result_by_row  Result_by_column
     │ Float64  Float64  Bool?          Bool?
─────┼───────────────────────────────────────────────────
   1 │    32.0     17.0        missing           missing
   2 │    35.0     46.0           true              true
   3 │    63.0     18.0           true             false
   4 │    84.0     56.0           true             false
   5 │    72.0      6.0          false             false
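If you would rather not add a dependency, the same lagged comparison can be sketched in plain Julia by slicing the vectors. This is only an illustration on bare vectors that mirror the Height and Size columns above; `lagged_gt` is a hypothetical helper name, not part of DataFrames or ShiftedArrays.

```julia
# Hypothetical helper: compare each element of `a` with the previous element of `b`.
# The first position has no predecessor, so it is `missing` (as with `lag`).
lagged_gt(a, b) = vcat(missing, a[2:end] .> b[1:end-1])

height = [32.0, 35.0, 63.0, 84.0, 72.0]
sizes  = [17.0, 46.0, 18.0, 56.0, 6.0]

result_by_row    = lagged_gt(height, height)  # Height vs. previous Height
result_by_column = lagged_gt(sizes, height)   # Size vs. previous Height
```

The resulting vectors could then be assigned as new columns in the same way as above, e.g. df.Result_by_row = result_by_row.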
I want to create a new variable (nota) using an if-else function with the transform! function from the DataFrames package.
Here is my code:
using DataFrames
datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
transform!(datos, :puntaje => ifelse(datos[datos.puntaje .< 9,:],"Suficiente","Excelente") => :nota)
but this error displays:
ERROR: TypeError: non-boolean (BitVector) used in boolean context
Stacktrace:
[1] top-level scope
@ REPL[23]:1
How can I solve it?
Two problems:
You are mixing row-wise (ByRow) logic with a plain transform, which operates on whole columns.
You can't change the element type of the puntaje column in place.
You probably want to do this:
julia> datos.puntaje = ifelse.(datos.puntaje .< 9, "Suficiente", "Excelente")
6-element Vector{String}:
"Excelente"
"Excelente"
"Suficiente"
"Suficiente"
"Excelente"
"Suficiente"
julia> datos
6×3 DataFrame
 Row │ nombre    grupo   puntaje
     │ String    String  String
─────┼──────────────────────────────
   1 │ ANGELICA  A       Excelente
   2 │ BRENDA    A       Excelente
   3 │ LILIANA   B       Suficiente
   4 │ MARCO     B       Suficiente
   5 │ FABIAN    C       Excelente
   6 │ MAURICIO  C       Suficiente
If you prefer using transform!, two syntaxes that work are:
julia> datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
julia> transform!(datos, :puntaje => (p -> ifelse.(p .< 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
 Row │ nombre    grupo   puntaje  nota
     │ String    String  Int64    String
─────┼───────────────────────────────────────
   1 │ ANGELICA  A            10  Excelente
   2 │ BRENDA    A             9  Excelente
   3 │ LILIANA   B             8  Suficiente
   4 │ MARCO     B             8  Suficiente
   5 │ FABIAN    C             9  Excelente
   6 │ MAURICIO  C             7  Suficiente
julia> transform!(datos, :puntaje => ByRow(p -> ifelse(p < 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
 Row │ nombre    grupo   puntaje  nota
     │ String    String  Int64    String
─────┼───────────────────────────────────────
   1 │ ANGELICA  A            10  Excelente
   2 │ BRENDA    A             9  Excelente
   3 │ LILIANA   B             8  Suficiente
   4 │ MARCO     B             8  Suficiente
   5 │ FABIAN    C             9  Excelente
   6 │ MAURICIO  C             7  Suficiente
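The difference between a whole-column function and a row-wise one can also be seen on a plain vector, without DataFrames. In this sketch (variable names are illustrative only), `ifelse.` broadcasts over the whole vector, much like the first transform! call, while `map` applies a scalar function to one element at a time, much like ByRow:

```julia
puntaje = [10, 9, 8, 8, 9, 7]

# Whole-column style: broadcast the comparison and ifelse over the vector.
nota_broadcast = ifelse.(puntaje .< 9, "Suficiente", "Excelente")

# Row-wise style: a scalar function applied to each element in turn.
nota_by_row = map(p -> p < 9 ? "Suficiente" : "Excelente", puntaje)
```

Both forms should give the same vector of labels; which one reads better is mostly a matter of taste.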
I have the following non-standard file and want to read it into a DataFrame, but I do not know how to handle it efficiently.
m[mi++]="18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53"
m[mi++]="18.06.16 21:55:00|0;24;7;1667;306;306;61|0;21;4;2153;384;166;62|0;19;1;2165;368;185;62"
m[mi++]="18.06.16 21:50:00|0;31;6;1667;394;349;62|0;32;6;2153;402;164;63|0;33;4;2165;380;171;63"
m[mi++]="18.06.16 21:45:00|11;50;19;1667;390;312;63|34;61;9;2153;410;166;63|19;60;8;2165;391;185;63"
m[mi++]="18.06.16 21:40:00|37;63;24;1666;387;313;63|55;80;12;2150;418;169;64|46;79;11;2163;398;186;63"
If I have saved this data in a file a.js the following code works:
b = readtable("a.js")
dat_5m = DataFrame(t=DateTime[], Pac_1=Float64[], Pdc_A1=Float64[], Pdc_B1=Float64[], Etot1=Float64[],
U_A1=Float64[], U_B1=Float64[], T_1=Float64[], Pac_2=Float64[], Pdc_A2=Float64[], Pdc_B2=Float64[],
Etot2=Float64[], U_A2=Float64[], U_B2=Float64[], T_2=Float64[], Pac_3=Float64[], Pdc_A3=Float64[],
Pdc_B3=Float64[], Etot3=Float64[], U_A3=Float64[], U_B3=Float64[], T_3=Float64[])
for r in nrow(b):-1:1
c = split(b[r,1], [';','|','=']);
d = float(c[3:end])'
t = DateTime(c[2], DateFormat("dd.mm.yy HH:MM:SS"))
push!(dat_5m, [t d])
end
But I think it is quite inelegant. Is there a more efficient way?
If possible (IO permitting), I'd recommend cleaning the file first, removing | and any funky " and prefixes. Then you can use CSV.jl to make your life easier.
The process is: read line by line, clean each line and then use CSV to read it back in:
julia> using CSV
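julia> # note: mode "a" appends, so running this block twice duplicates every row in b.js; use "w" to overwrite instead
julia> open("b.js", "a") do new_file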
for line in readlines("a.js")
line = replace(line, "m[mi++]=\"", "")
# remove only trailing quote
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
write(new_file, string(line, "\n"))
end
end
julia> b = CSV.read("b.js",
delim = ";",
header = false, # change this to a string vector to provide column names
dateformat = "dd.mm.yy HH:MM:SS");
julia> # correct times to be in 2016
b[:Column1] += Dates.Year(2000);
julia> head(b)
6×22 DataFrames.DataFrame. Omitted printing of 15 columns
│ Row │ Column1 │ Column2 │ Column3 │ Column4 │ Column5 │ Column6 │ Column7 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
│ 2 │ 2016-06-18T21:55:00 │ 0 │ 24 │ 7 │ 1667 │ 306 │ 306 │
│ 3 │ 2016-06-18T21:50:00 │ 0 │ 31 │ 6 │ 1667 │ 394 │ 349 │
│ 4 │ 2016-06-18T21:45:00 │ 11 │ 50 │ 19 │ 1667 │ 390 │ 312 │
│ 5 │ 2016-06-18T21:40:00 │ 37 │ 63 │ 24 │ 1666 │ 387 │ 313 │
│ 6 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
I cleaned the lines here step by step to make the process easier to understand. If you're a regex master you might want to do that in one line.
There might be a better way of handling the 2 digit years, so if someone knows how to fix that please edit this answer or add it as a comment. Thanks!
Edit:
If you want to do the same without writing to another file, here's a hack: apply the cleaning function to each line, join the results, and wrap them in an IOBuffer so that CSV.read() can be reused:
function cleaner(line)
line = replace(line, "m[mi++]=\"", "")
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
println(line)
string(line, "\n")
end
c = CSV.read(
Base.IOBuffer(
string(cleaner.(readlines("a.js"))...)),
delim = ";",
header = false,
dateformat = "dd.mm.yy HH:MM:SS");
c[:Column1] += Dates.Year(2000);
This gives the same result as the other solution.
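The answer above was written for an older Julia version. On Julia 1.x, `replace` takes a `pattern => replacement` pair instead of two positional arguments. As a rough sketch of the same cleaning steps using only Base and the Dates standard library (the function name `parse_record` is hypothetical, and the two-digit year is still corrected by adding 2000 years, as above):

```julia
using Dates

# Hypothetical helper: turn one raw line into a timestamp and a vector of numbers.
function parse_record(line::AbstractString)
    line = replace(line, "m[mi++]=\"" => "")   # drop the JavaScript prefix
    line = replace(line, r"\"$" => "")         # drop only the trailing quote
    fields = split(line, ['|', ';'])           # both characters act as delimiters
    # "yy" parses 16 as year 0016, so shift the result into the 2000s
    t = DateTime(fields[1], dateformat"dd.mm.yy HH:MM:SS") + Year(2000)
    vals = parse.(Float64, fields[2:end])
    return t, vals
end

sample = "m[mi++]=\"18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53\""
t, vals = parse_record(sample)
```

Each parsed record could then be pushed onto a DataFrame row by row, or the timestamps and values collected into columns first.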
In some quick exploratory work, IndexedTables seemed much faster than DataFrames for working on individual elements (e.g. select or "update"), but DataFrames has a nicer ecosystem of functionality, e.g. plotting and exporting.
So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_tn = DataFrame(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
Columns(
String["price","price","price","price","waterContent","waterContent"],
String["banana","banana","apple","apple","banana", "apple"],
Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_t = DataFrame(
x1 = String["price","price","price","price","waterContent","waterContent"],
x2 = String["banana","banana","apple","apple","banana", "apple"],
x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can find the individual "row" values by iterating over the table with pairs():
for (i,pair) in enumerate(pairs(tn))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
println(rowValues)
end
However, I can't get the column names and types, and I guess working by column would be more efficient anyway.
EDIT : I did manage to get the "column" types with the above code, I just need now to get the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
push!(colTypes, eltype(item))
end
for item in tn.data.columns
push!(colTypes, eltype(item))
end
EDIT2: As requested, this is an example of an IndexedTable for which the conversion of column names fails with Dan Getz's (current) answer, because the "index" columns are a named tuple while the "data" columns are a normal tuple:
t_named_idx = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
)
)
The problem seems to be in IndexedTable API, and specifically in columns(t) function, that doesn't distinguish between index and values.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6 with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:
toDataFrame(t::IndexedTable) =
hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
And then, the mixed case gives a result like:
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Ugly, quick and dirty "solution" (I hope it is doable in another way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Just install IterableTables and then
using IterableTables
df = DataFrames.DataFrame(it)
Here is an initial attempt at a conversion function. It keeps the column names and types. It would be nice if it could be cleaned up and implemented in either the DataFrame or the IndexedTable package as convert(DataFrame,t::IndexedArray).
function toDataFrame(t::IndexedTable)
# Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
# Getting the column types.. this is independent if it is a keyed or normal IndexedArray
colTypes = Union{Union,DataType}[]
for item in t.index.columns
push!(colTypes, eltype(item))
end
if(typeof(t.data) <: Vector) # The Data part is a simple Array
push!(colTypes, eltype(t.data))
else # The data part is a Tuple
for item in t.data.columns
push!(colTypes, eltype(item))
end
end
# Getting the column names.. this change if it is a keyed or normal IndexedArray
colNames = Symbol[]
lIdx = length(t.index.columns)
if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in 1:lIdx]
else # NamedTuple
for (k,v) in zip(keys(t.index.columns), t.index.columns)
push!(colNames, k)
end
end
if(typeof(t.data) <: Vector) # The Data part is a simple single Array
push!(colNames, Symbol("x",lIdx+1))
else
lData = length(t.data.columns)
if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
else # NamedTuple
for (k,v) in zip(keys(t.data.columns), t.data.columns)
push!(colNames, k)
end
end
end
# building an empty DataFrame..
df = DataFrame()
for i in 1:length(colTypes)
df[colNames[i]] = colTypes[i][]
end
# and finally filling the df with values..
for (i,pair) in enumerate(pairs(t))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
push!(df, rowValues)
end
return df
end
EDIT 20210106:
Solution for NDSparse indexed table with a single value column:
# NDSparse creation...
content = [["banana","banana","apple","apple","orange"],["us",missing,"us","eu","us"],[1.1,2.2,3.3,missing,5.5]]
dimNames = ["item","region"]
t = NDSparse(content...,names=Symbol.(dimNames))
# NDSparse conversion to df...
names = vcat(keys(keys(t)[1])...,:value)
cols = columns(t)
df = DataFrame(map((n,v) -> Pair(n,v), names, cols))
I'm trying to reproduce the following R code in Julia:
library(dplyr)
women_new <- rbind(women, c(NA, 1), c(NA, NA))
women_new %>%
filter(height %>% complete.cases) %>%
mutate(sector = character(n()),
sector = replace(sector, height >= 0 & height <= 60, "1"),
sector = replace(sector, height >= 61 & height <= 67, "2"),
sector = replace(sector, height >= 68 & height <= 72, "3"))
My attempts in Julia are the following:
using DataFrames
using DataFramesMeta
using Lazy
using RDatasets
women = @> begin
"datasets"
dataset("women")
DataArray()
vcat([[NA NA]; [NA NA]])
end
women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
women_new[16, 2] = 1;
My first question: is there a way to input 1 directly, as in vcat([[NA 1]; [NA NA]]), just like in R? If I do so, it returns the following error:
MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
in macro expansion at multidimensional.jl:431 [inlined]
in macro expansion at cartesian.jl:64 [inlined]
in macro expansion at multidimensional.jl:429 [inlined]
in _unsafe_batchsetindex!(::Array{Int64,2}, ::Base.Repeated{DataArrays.NAtype}, ::UnitRange{Int64}, ::UnitRange{Int64}) at multidimensional.jl:421
in setindex!(::Array{Int64,2}, ::DataArrays.NAtype, ::UnitRange{Int64}, ::UnitRange{Int64}) at abstractarray.jl:832
in cat_t(::Int64, ::Type{T}, ::DataArrays.NAtype, ::Vararg{Any,N}) at abstractarray.jl:1098
in hcat(::DataArrays.NAtype, ::Int64) at abstractarray.jl:1180
in include_string(::String, ::String) at loading.jl:441
in include_string(::String, ::String, ::Int64) at eval.jl:30
in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
in withpath(::Function, ::String) at eval.jl:38
in macro expansion at eval.jl:49 [inlined]
in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60
My second question: is there a way to convert a DataArray to a DataFrame? In that case the column names would become X1, X2, ... or some other default name, since a DataArray has no column names. I think that would be neater than typing the following:
women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
I wish I could simply do convert(DataFrame, women) and then rename the columns, but that conversion does not work. The following is my attempt at the transformation (the mutate step in R):
@> begin
women_new
@where !isna(:Height)
@transform(
Sector = NA,
Sector = ifelse(:Height .>= 0 & :Height .<= 60, 1,
ifelse(:Height .>= 61 & :Height .<= 67, 2,
ifelse(:Height .>= 68 & :Height .<= 72, 3, NA)))
)
end
But this will return:
15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Sector│
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 1 │
│ 5 │ 62 │ 126 │ 1 │
│ 6 │ 63 │ 129 │ 1 │
│ 7 │ 64 │ 132 │ 1 │
│ 8 │ 65 │ 135 │ 1 │
│ 9 │ 66 │ 139 │ 1 │
│ 10 │ 67 │ 142 │ 1 │
│ 11 │ 68 │ 146 │ 1 │
│ 12 │ 69 │ 150 │ 1 │
│ 13 │ 70 │ 154 │ 1 │
│ 14 │ 71 │ 159 │ 1 │
│ 15 │ 72 │ 164 │ 1 │
which is not equivalent to R, I also tried the following:
@> begin
women_new
@where !isna(:Height)
@transform(
Sector = NA,
Sector = :Height .>= 0 & :Height .<= 60 ? 1 :
:Height .>= 61 & :Height .<= 67 ? 2 :
:Height .>= 68 & :Height .<= 72 ? 3 :
NA;
)
end
But returns the following error:
TypeError: non-boolean (DataArrays.DataArray{Bool,1}) used in boolean context
in (::###469#303)(::DataArrays.DataArray{Int64,1}) at DataFramesMeta.jl:55
in (::##298#302)(::DataFrames.DataFrame) at DataFramesMeta.jl:295
in #transform#38(::Array{Any,1}, ::Function, ::DataFrames.DataFrame) at DataFramesMeta.jl:270
in (::DataFramesMeta.#kw##transform)(::Array{Any,1}, ::DataFramesMeta.#transform, ::DataFrames.DataFrame) at <missing>:0
in include_string(::String, ::String) at loading.jl:441
in include_string(::String, ::String, ::Int64) at eval.jl:30
in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
in withpath(::Function, ::String) at eval.jl:38
in macro expansion at eval.jl:49 [inlined]
in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60
I would appreciate it if you could help me figure this out. Finally, my last question: is there a way to shorten my code to something like the R version while keeping it elegant?
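I got it. The problem was operator precedence: & binds more tightly than the comparison operators, so each comparison needs its own parentheses. I had thought the parentheses were not needed.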
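To make the precedence issue concrete, here is a small sketch in plain, current Julia on a scalar and on a plain vector (variable names are illustrative). Without parentheses the expression is parsed as a chained comparison around `0 & x`, which is not what was intended:

```julia
x = 65

# `&` binds tighter than comparisons, so this parses as x >= (0 & x) <= 60,
# a chained comparison, not as (x >= 0) & (x <= 60).
unparenthesized = x >= 0 & x <= 60
parenthesized   = (x >= 0) & (x <= 60)
chained         = 0 <= x <= 60   # the usual way to write a range check on a scalar

# Element-wise version on a vector, using dotted operators throughout.
heights = [58, 65, 70]
in_first_sector = (heights .>= 0) .& (heights .<= 60)
```

Here the unparenthesized form evaluates to true for 65, which is consistent with every row ending up in sector 1 in the question.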
using DataFrames
using DataFramesMeta
using Lazy
using RDatasets
women = dataset("datasets", "women");
women_new = vcat(
women,
DataFrame(Height = [NA; NA], Weight = @data [1; NA])
)
@> begin
women_new
@where !isna(:Height)
@transform(
Class = NA,
Class = ifelse((:Height .>= 0) & (:Height .<= 60), 1,
ifelse((:Height .>= 61) & (:Height .<= 67), 2,
ifelse((:Height .>= 68) & (:Height .<= 72), 3, NA)))
)
end
Update: The above code can be further simplified into:
@> begin
women_new
@where !isna(:Height)
@transform(
Class = @> begin
function (x)
0 <= x <= 60 ? 1 :
61 <= x <= 67 ? 2 :
68 <= x <= 72 ? 3 :
NA
end
map(:Height)
end
)
end
Or an alternative is to use Query.jl as follows:
using DataFrames
using Query
using RDatasets
women = dataset("datasets", "women");
women_new = vcat(
women,
DataFrame(Height = [NA; NA], Weight = @data [1; NA])
)
@from i in women_new begin
@where !isnull(i.Height)
@select {
i.Height, i.Weight,
class = 0 <= i.Height <= 60 ? 1 :
61 <= i.Height <= 67 ? 2 :
68 <= i.Height <= 72 ? 3 :
0
}
@collect DataFrame
end
The output is now correct:
15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 2 │
│ 5 │ 62 │ 126 │ 2 │
│ 6 │ 63 │ 129 │ 2 │
│ 7 │ 64 │ 132 │ 2 │
│ 8 │ 65 │ 135 │ 2 │
│ 9 │ 66 │ 139 │ 2 │
│ 10 │ 67 │ 142 │ 2 │
│ 11 │ 68 │ 146 │ 3 │
│ 12 │ 69 │ 150 │ 3 │
│ 13 │ 70 │ 154 │ 3 │
│ 14 │ 71 │ 159 │ 3 │
│ 15 │ 72 │ 164 │ 3 │
If we don't want to filter out NAs and instead work with the complete data, then the best I can do is the following:
@> begin
women_new
@transform(
Height_New = NA,
Height_New = ifelse(isna(:Height), -1, :Height))
@transform(
Class = NA,
Class = ifelse(:Height_New == -1, NA,
ifelse((:Height_New .>= 0) & (:Height_New .<= 60), 1,
ifelse((:Height_New .>= 61) & (:Height_New .<= 67), 2,
ifelse((:Height_New .>= 68) & (:Height_New .<= 72), 3, NA))))
)
delete!(:Height_New)
end
Update: The above code can be further simplified into:
@> begin
women_new
@transform(
Class = @> begin
function (x)
isna(x) ? NA :
0 <= x <= 60 ? 1 :
61 <= x <= 67 ? 2 :
68 <= x <= 72 ? 3 :
NA
end
map(:Height)
end
)
end
Or an alternative is to use Query.jl as follows:
@from i in women_new begin
@select {
i.Height, i.Weight,
class = 0 <= i.Height <= 60 ? 1 :
61 <= i.Height <= 67 ? 2 :
68 <= i.Height <= 72 ? 3 :
0
}
@collect DataFrame
end
The output:
17×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 2 │
│ 5 │ 62 │ 126 │ 2 │
│ 6 │ 63 │ 129 │ 2 │
│ 7 │ 64 │ 132 │ 2 │
│ 8 │ 65 │ 135 │ 2 │
│ 9 │ 66 │ 139 │ 2 │
│ 10 │ 67 │ 142 │ 2 │
│ 11 │ 68 │ 146 │ 3 │
│ 12 │ 69 │ 150 │ 3 │
│ 13 │ 70 │ 154 │ 3 │
│ 14 │ 71 │ 159 │ 3 │
│ 15 │ 72 │ 164 │ 3 │
│ 16 │ NA │ 1 │ NA │
│ 17 │ NA │ NA │ NA │
In this case, the code becomes messy because there is no method yet for handling NAs in ifelse's first argument.
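In current Julia versions, where `missing` has replaced `NA`, the same classification can be sketched without any package by handling the missing case explicitly in a scalar function. This is only an illustration on a plain vector; `classify` is a hypothetical helper name, and the class boundaries are the ones used above.

```julia
# Hypothetical helper: map a height to its class, propagating missing values.
function classify(x)
    ismissing(x) && return missing
    0 <= x <= 60 ? 1 :
    61 <= x <= 67 ? 2 :
    68 <= x <= 72 ? 3 :
    missing          # heights outside all ranges get no class
end

heights = [58, 64, 70, missing]
classes = map(classify, heights)
```

Because the missing check lives inside the scalar function, no placeholder column such as Height_New is needed.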