Specific case in Julia DataFrame

I need to do something quite specific, and I want to do it the right way; in particular, I want it to be optimized.
I have a DataFrame that looks like this:
│ Row │ USER_ID │ GENRE_MAIN │ ALBUM_NAME │ ALBUM_ARTIST_NAME │ TOTAL_LISTENED │ TOTAL_COUNT_LISTENED │
│ │ String │ String │ String │ String │ DecFP.Dec128 │ DecFP.Dec128 │
├──────────┼─────────┼───────────────────┼─────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┼────────────────┼──────────────────────┤
│ 1 │ 9s2dsdd6 │ ROCK │ The Thought's Boy │ AARON MADISON │ 5912.0 │ 91.0 │
│ 1 │ 9s2dsdd6 │ ROCK │ The wqeqwewe │ AARON MADISON │ 3212.0 │ 91.0 │
│ 2 │ 11sdasd63 │ ROCK │ Down On The Upside │ SOUNDGARDEN │ 3354.0 │ 14.0 │
│ 3 │ 112sds334 │ CLASSICAL │ Beethoven: Symphonies Nos. 1 & 2 - C.P.E. Bach: Symphonies, Wq 175 & 183/17 │ AKADEMIE FÜR ALTE MUSIK BERLIN │ 1372.0 │ 4.0 │
│ 4 │ 145sdsd42 │ POP │ My Life in the Bush of Ghosts │ BRIAN ENO │ 3531.0 │ 17.0 │
I want to aggregate it by user (I have many rows per USER_ID) and run several calculations.
I'm doing the aggregation like this:
gdf = DataFrames.groupby(df, :USER_ID)
combine(gdf,:TOTAL_LISTENED => sum => :TOTAL_SECONDS_LISTENED,
:TOTAL_COUNT_LISTENED => sum => :TOTAL_TRACKS_LISTENED)
I need to compute the top 1 to 5 genres, album names, and artist names per USER_ID, and the result has to look like this:
USER_ID │ ALBUM1_NAME │ ALBUM2_NAME ......│ GENRE1 │ GENRE2
One line per USER_ID.
So I tried to do it with a countmap: sort it, keep only the top 5, and assign each value to a column in a DataFrame:
transposed = sort(countmap(targetId[targetCol]), byvalue=true, rev=true)
for (i, g) in enumerate(eachcol(transposed))
rVal["ALBUM$(i)_NAME"] = g[1]
rVal["ALBUM$(i)_ARTIST"] = g[3]
rVal["ALBUM$(i)_TIME"] = g[2]
rVal["ALBUM$(i)_ID"] = "ID"
rVal["USER_ID"] = id
end
but it doesn't work inside a combine, it is very ugly, and I'm sure there is a better way.
I hope this is understandable; any help would be appreciated.
Thank you
EDIT: A way to reproduce the DataFrame:
v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])

You have not provided an easy way to reproduce your source data, so I am writing the solution from memory and hope I have not made any typos (note that you need DataFrames.jl 0.22 for this to work, while you seem to be on an older version of the package):
using DataFrames, Pipe, Random, Pkg
Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")
Random.seed!(1234)
df = DataFrame(USER_ID=rand(1:10, 80),
GENRE_MAIN=rand(string.("genre_", 1:6), 80),
ALBUM_NAME=rand(string.("album_", 1:6), 80),
ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))
function top5(sdf, col, prefix)
return @pipe groupby(sdf, col) |>
combine(_, nrow) |>
sort!(_, :nrow, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
DataFrame([string(prefix, i) for i in 1:5] .=> _)
end
@pipe groupby(df, :USER_ID) |>
combine(_,
x -> top5(x, :GENRE_MAIN, "genre"),
x -> top5(x, :ALBUM_NAME, "album"),
x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))
The code is a bit complex because we have to handle the case where a group has fewer than 5 distinct entries.
It produces under Julia 1.5.3:
10×16 DataFrame
Row │ USER_ID genre1 genre2 genre3 genre4 genre5 album1 album2 album3 album4 album5 artist1 artist2 artist3 artist4 artist5
│ Int64 String String String String? String? String String String String String? String String String String? String?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 genre_1 genre_3 genre_5 genre_2 genre_4 album_3 album_5 album_6 album_1 album_4 artist_1 artist_4 artist_3 artist_2 artist_6
2 │ 4 genre_1 genre_3 genre_6 genre_2 missing album_2 album_4 album_5 album_6 missing artist_4 artist_5 artist_2 missing missing
3 │ 8 genre_2 genre_1 genre_6 genre_5 missing album_1 album_5 album_4 album_2 missing artist_5 artist_6 artist_4 artist_1 artist_3
4 │ 2 genre_1 genre_5 genre_2 genre_4 genre_3 album_6 album_3 album_4 album_2 album_1 artist_4 artist_2 artist_6 artist_1 artist_5
5 │ 10 genre_5 genre_3 genre_6 genre_4 genre_2 album_2 album_3 album_1 album_5 album_4 artist_1 artist_6 artist_2 artist_5 artist_3
6 │ 7 genre_5 genre_3 genre_2 genre_4 genre_1 album_2 album_4 album_3 album_5 missing artist_4 artist_1 artist_3 artist_5 missing
7 │ 9 genre_3 genre_4 genre_2 missing missing album_1 album_3 album_4 album_2 missing artist_4 artist_2 artist_6 artist_3 missing
8 │ 5 genre_2 genre_3 genre_4 genre_6 missing album_2 album_1 album_3 album_4 missing artist_6 artist_5 artist_4 artist_1 missing
9 │ 3 genre_6 genre_5 genre_4 genre_2 genre_1 album_3 album_4 album_1 album_5 missing artist_4 artist_3 artist_6 artist_5 missing
10 │ 6 genre_3 genre_4 genre_1 genre_6 missing album_2 album_4 album_5 album_3 missing artist_4 artist_6 artist_5 missing missing
which I assume you wanted?
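The counting and padding step inside top5 can also be sketched with Base Julia only, which may help when checking the logic on a single user's values. This is a minimal illustration, not the answer's method: top_n is a hypothetical helper name, and ties are broken alphabetically here, which is an assumption the original code does not make.

```julia
# Hypothetical helper: return the n most frequent items, padded with `missing`
# when there are fewer than n distinct items.
function top_n(items, n::Integer = 5)
    counts = Dict{eltype(items), Int}()
    for v in items
        counts[v] = get(counts, v, 0) + 1
    end
    # Sort by descending count; break ties by value so the result is deterministic.
    ranked = sort!(collect(counts); by = p -> (-p.second, p.first))
    top = Union{eltype(items), Missing}[p.first for p in ranked[1:min(n, end)]]
    return vcat(top, fill(missing, n - length(top)))
end

genres = ["ROCK", "POP", "ROCK", "CLASSICAL", "ROCK", "POP"]
result = top_n(genres, 5)
println(result)
```

With only three distinct genres in the sample vector, the last two slots come back as missing, which mirrors the missing cells in the output table above.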

Related

How do I get the distance to a previous occurrence of a value in Polars dataframe?

I want to efficiently find the distance from the current row to the previous occurrence. I know polars doesn't have indexes, but the formula would roughly be:
if prior_occurrence {
(current_row_index - prior_occurrence_index - 1)
} else {
-1
}
This is the input dataframe:
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
].unwrap();
println!("{}", df_a);
a (i32)  b (str)
1        c
2        a
2        b
1        c
4        c
1        a
Wanted output:
a (i32)  b (str)  b_dist (i32)
1        c        -1
2        a        -1
2        b        -1
1        c        2
4        c        0
1        a        3
What's the most efficient way to go about this?
python
(df
.with_row_count("idx")
.with_columns([
((pl.col("idx") - pl.col("idx").shift()).cast(pl.Int32).fill_null(0) - 1)
.over("a").alias("a_distance_to_a")
])
)
rust
fn func1() -> PolarsResult<()> {
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
]?;
let out = df_a
.lazy()
.with_row_count("idx", None)
.with_columns([((col("idx") - col("idx").shift(1))
.cast(DataType::Int32)
.fill_null(0)
- lit(1))
.over("a")
.alias("a_distance_to_a")])
.collect()?;
Ok(())
}
output
shape: (6, 4)
┌─────┬─────┬─────┬─────────────────┐
│ idx ┆ a ┆ b ┆ a_distance_to_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i32 │
╞═════╪═════╪═════╪═════════════════╡
│ 0 ┆ 1 ┆ c ┆ -1 │
│ 1 ┆ 2 ┆ a ┆ -1 │
│ 2 ┆ 2 ┆ b ┆ 0 │
│ 3 ┆ 1 ┆ c ┆ 2 │
│ 4 ┆ 4 ┆ c ┆ -1 │
│ 5 ┆ 1 ┆ a ┆ 1 │
└─────┴─────┴─────┴─────────────────┘

How can I create a new variable in Julia using an ifelse condition dataframe?

I want to create a new variable (nota) using an if-else function with the transform! function from the DataFrames package.
Here is my code:
using DataFrames
datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
transform!(datos, :puntaje => ifelse(datos[datos.puntaje .< 9,:],"Suficiente","Excelente") => :nota)
but this error displays:
ERROR: TypeError: non-boolean (BitVector) used in boolean context
Stacktrace:
[1] top-level scope
@ REPL[23]:1
How can I solve it?
Two problems:
You are mixing row-wise logic (ByRow) with a normal transform, which works on whole columns.
You can't change the type of the puntaje column in place.
You probably want to do this:
julia> datos.puntaje = ifelse.(datos.puntaje .< 9, "Suficiente", "Excelente")
6-element Vector{String}:
"Excelente"
"Excelente"
"Suficiente"
"Suficiente"
"Excelente"
"Suficiente"
julia> datos
6×3 DataFrame
Row │ nombre grupo puntaje
│ String String String
─────┼──────────────────────────────
1 │ ANGELICA A Excelente
2 │ BRENDA A Excelente
3 │ LILIANA B Suficiente
4 │ MARCO B Suficiente
5 │ FABIAN C Excelente
6 │ MAURICIO C Suficiente
If you prefer using transform!, two syntaxes that work are:
julia> datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
julia> transform!(datos, :puntaje => (p -> ifelse.(p .< 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
julia> transform!(datos, :puntaje => ByRow(p -> ifelse(p < 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
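The difference between the failing call and the working ones can be reproduced on a plain vector, without DataFrames. The sketch below is only an illustration under that assumption; the exact exception type for the non-broadcast call depends on the Julia version (a TypeError in the question, possibly a MethodError on newer releases), so the code only records that some exception was thrown.

```julia
puntaje = [10, 9, 8, 8, 9, 7]

# `ifelse` without the dot expects a single Bool as its first argument,
# so passing the whole BitVector `puntaje .< 9` throws.
err = try
    ifelse(puntaje .< 9, "Suficiente", "Excelente")
    nothing
catch e
    e
end

# Broadcasting with the dot applies `ifelse` element by element.
nota = ifelse.(puntaje .< 9, "Suficiente", "Excelente")

println(typeof(err))
println(nota)
```

This is the same ifelse.(...) expression used in the answer; transform! simply hands the :puntaje column to the anonymous function as the vector p.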

how to read a .js file into a DataFrame

I have the following non-standard file and want to read it efficiently into a DataFrame, but I do not know how to handle it.
m[mi++]="18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53"
m[mi++]="18.06.16 21:55:00|0;24;7;1667;306;306;61|0;21;4;2153;384;166;62|0;19;1;2165;368;185;62"
m[mi++]="18.06.16 21:50:00|0;31;6;1667;394;349;62|0;32;6;2153;402;164;63|0;33;4;2165;380;171;63"
m[mi++]="18.06.16 21:45:00|11;50;19;1667;390;312;63|34;61;9;2153;410;166;63|19;60;8;2165;391;185;63"
m[mi++]="18.06.16 21:40:00|37;63;24;1666;387;313;63|55;80;12;2150;418;169;64|46;79;11;2163;398;186;63"
If I have saved this data in a file a.js the following code works:
b = readtable("a.js")
dat_5m = DataFrame(t=DateTime[], Pac_1=Float64[], Pdc_A1=Float64[], Pdc_B1=Float64[], Etot1=Float64[],
U_A1=Float64[], U_B1=Float64[], T_1=Float64[], Pac_2=Float64[], Pdc_A2=Float64[], Pdc_B2=Float64[],
Etot2=Float64[], U_A2=Float64[], U_B2=Float64[], T_2=Float64[], Pac_3=Float64[], Pdc_A3=Float64[],
Pdc_B3=Float64[], Etot3=Float64[], U_A3=Float64[], U_B3=Float64[], T_3=Float64[])
for r in nrow(b):-1:1
c = split(b[r,1], [';','|','=']);
d = float(c[3:end])'
t = DateTime(c[2], DateFormat("dd.mm.yy HH:MM:SS"))
push!(dat_5m, [t d])
end
But I think it is quite inelegant. Is there a more efficient way?
If possible (IO permitting), I'd recommend cleaning the file first, removing | and any funky " and prefixes. Then you can use CSV.jl to make your life easier.
The process is: read line by line, clean each line and then use CSV to read it back in:
julia> using CSV
julia> open("b.js", "a") do new_file
for line in readlines("a.js")
line = replace(line, "m[mi++]=\"", "")
# remove only trailing quote
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
write(new_file, string(line, "\n"))
end
end
julia> b = CSV.read("b.js",
delim = ";",
header = false, # change this to a string vector to provide column names
dateformat = "dd.mm.yy HH:MM:SS");
julia> # correct times to be in 2016
b[:Column1] += Dates.Year(2000);
julia> head(b)
6×22 DataFrames.DataFrame. Omitted printing of 15 columns
│ Row │ Column1 │ Column2 │ Column3 │ Column4 │ Column5 │ Column6 │ Column7 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
│ 2 │ 2016-06-18T21:55:00 │ 0 │ 24 │ 7 │ 1667 │ 306 │ 306 │
│ 3 │ 2016-06-18T21:50:00 │ 0 │ 31 │ 6 │ 1667 │ 394 │ 349 │
│ 4 │ 2016-06-18T21:45:00 │ 11 │ 50 │ 19 │ 1667 │ 390 │ 312 │
│ 5 │ 2016-06-18T21:40:00 │ 37 │ 63 │ 24 │ 1666 │ 387 │ 313 │
│ 6 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
I cleaned the lines here step by step to make the process easier to understand. If you're a regex master you might want to do that in one line.
There might be a better way of handling the 2 digit years, so if someone knows how to fix that please edit this answer or add it as a comment. Thanks!
Edit:
If you want to do the same without writing to another file, here's a hack that applies the cleaning function and reuses CSV.read() by converting the readlines array to an IOBuffer again:
function cleaner(line)
line = replace(line, "m[mi++]=\"", "")
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
println(line)
string(line, "\n")
end
c = CSV.read(
Base.IOBuffer(
string(cleaner.(readlines("a.js"))...)),
delim = ";",
header = false,
dateformat = "dd.mm.yy HH:MM:SS");
c[:Column1] += Dates.Year(2000);
This gives the same result as the other solution.
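For readers on a current Julia version, the cleaning and date handling can be sketched with the Dates standard library alone, without CSV.jl. This is a minimal sketch under assumptions: parse_line is a hypothetical helper, the two sample lines are copied from the question, and every numeric field is parsed as Float64.

```julia
using Dates

# Two sample lines copied from the question.
lines = [
    "m[mi++]=\"18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53\"",
    "m[mi++]=\"18.06.16 21:55:00|0;24;7;1667;306;306;61|0;21;4;2153;384;166;62|0;19;1;2165;368;185;62\"",
]

const FMT = dateformat"dd.mm.yy HH:MM:SS"

# Hypothetical helper: turn one line into (timestamp, numeric readings).
function parse_line(line::AbstractString)
    # Keep only the text between the first and the last double quote.
    body = match(r"\"(.*)\"", line)[1]
    fields = split(body, ['|', ';'])
    # "yy" parses "16" as the year 16, so shift the result into the 2000s.
    t = DateTime(fields[1], FMT) + Year(2000)
    readings = parse.(Float64, fields[2:end])
    return t, readings
end

rows = parse_line.(lines)
println(rows[1][1])
```

Each element of rows is a tuple of a DateTime and a 21-element vector, which can then be pushed into a DataFrame with the column layout shown in the question.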

How to convert an IndexedTable to a DataFrame in Julia?

In a quick exploratory test, IndexedTables seem much faster than DataFrames when working on individual elements (e.g. select or "update"), but DataFrames have a nicer ecosystem of functionality, e.g. plotting and exporting.
So, at a certain point in the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_tn = DataFrame(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
Columns(
String["price","price","price","price","waterContent","waterContent"],
String["banana","banana","apple","apple","banana", "apple"],
Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_t = DataFrame(
x1 = String["price","price","price","price","waterContent","waterContent"],
x2 = String["banana","banana","apple","apple","banana", "apple"],
x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can find the individual "row" values by iterating over the table with pairs():
for (i,pair) in enumerate(pairs(tn))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
println(rowValues)
end
However, I can't get the column names and types, and I guess working by column would be more efficient.
EDIT: I did manage to get the "column" types with the following code; now I just need the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
push!(colTypes, eltype(item))
end
for item in tn.data.columns
push!(colTypes, eltype(item))
end
EDIT2: As requested, this is an example of an IndexedTable that would fail conversion of columns names using (current) Dan Getz answer, as the "index" column(s) are named tuple but the "data" column(s) are normal tuples:
t_named_idx = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
)
)
The problem seems to be in IndexedTable API, and specifically in columns(t) function, that doesn't distinguish between index and values.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6 with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:
toDataFrame(t::IndexedTable) =
hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
And then, the mixed case gives a result like:
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
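The naming rule used above (keep existing names, otherwise generate prefixed defaults) can be illustrated with plain tuples of vectors in current Julia, independent of IndexedTables. The sketch below is an assumption-laden illustration: name_columns is a hypothetical helper, and the short sample columns are abbreviated from the question.

```julia
# Hypothetical helper: named columns are kept as they are,
# unnamed ones get default names built from a prefix (x1, x2, ...).
name_columns(cols::NamedTuple, prefix = "x") = cols
function name_columns(cols::Tuple, prefix = "x")
    names = ntuple(i -> Symbol(prefix, i), length(cols))
    return NamedTuple{names}(cols)
end

index_cols = (param = ["price", "price"], item = ["banana", "apple"])
data_cols  = ([2.8, 1.1], [3.2, 1.2])   # unnamed value columns

# merge keeps the index columns first, followed by the value columns.
table = merge(name_columns(index_cols, "y"), name_columns(data_cols))
println(keys(table))
```

The result is a NamedTuple of equal-length vectors, which current DataFrames.jl versions accept directly in the DataFrame constructor. Using different prefixes for the index and value parts, as the answer does, avoids name clashes when both are unnamed.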
Ugly, quick and dirty "solution" (I hope it is doable in other way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Just install IterableTables and then
using IterableTables
df = DataFrames.DataFrame(it)
Here is an initial attempt at a conversion function. It keeps the column names and types. It would be nice if it could be cleaned up and implemented in either the DataFrame or the IndexedTable package as convert(DataFrame,t::IndexedArray).
function toDataFrame(t::IndexedTable)
# Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
# Getting the column types.. this is independent if it is a keyed or normal IndexedArray
colTypes = Union{Union,DataType}[]
for item in t.index.columns
push!(colTypes, eltype(item))
end
if(typeof(t.data) <: Vector) # The Data part is a simple Array
push!(colTypes, eltype(t.data))
else # The data part is a Tuple
for item in t.data.columns
push!(colTypes, eltype(item))
end
end
# Getting the column names.. this change if it is a keyed or normal IndexedArray
colNames = Symbol[]
lIdx = length(t.index.columns)
if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in 1:lIdx]
else # NamedTuple
for (k,v) in zip(keys(t.index.columns), t.index.columns)
push!(colNames, k)
end
end
if(typeof(t.data) <: Vector) # The Data part is a simple single Array
push!(colNames, Symbol("x",lIdx+1))
else
lData = length(t.data.columns)
if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
else # NamedTuple
for (k,v) in zip(keys(t.data.columns), t.data.columns)
push!(colNames, k)
end
end
end
# building an empty DataFrame..
df = DataFrame()
for i in 1:length(colTypes)
df[colNames[i]] = colTypes[i][]
end
# and finally filling the df with values..
for (i,pair) in enumerate(pairs(t))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
push!(df, rowValues)
end
return df
end
EDIT 20210106:
Solution for NDSparse indexed table with a single value column:
# NDSparse creation...
content = [["banana","banana","apple","apple","orange"],["us",missing,"us","eu","us"],[1.1,2.2,3.3,missing,5.5]]
dimNames = ["item","region"]
t = NDSparse(content...,names=Symbol.(dimNames))
# NDSparse conversion to df...
names = vcat(keys(keys(t)[1])...,:value)
cols = columns(t)
df = DataFrame(map((n,v) -> Pair(n,v), names, cols))

Julia: DataFramesMeta Transformation

I'm trying to reproduce the following R code in Julia:
library(dplyr)
women_new <- rbind(women, c(NA, 1), c(NA, NA))
women_new %>%
filter(height %>% complete.cases) %>%
mutate(sector = character(n()),
sector = replace(sector, height >= 0 & height <= 60, "1"),
sector = replace(sector, height >= 61 & height <= 67, "2"),
sector = replace(sector, height >= 68 & height <= 72, "3"))
My attempts in Julia are the following:
using DataFrames
using DataFramesMeta
using Lazy
using RDatasets
women = @> begin
"datasets"
dataset("women")
DataArray()
vcat([[NA NA]; [NA NA]])
end
women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
women_new[16, 2] = 1;
My first question here is that, is there a way to input 1 immediately on vcat([[NA 1]; [NA NA]]) just like in R? It returns the following error if I do so:
MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
in macro expansion at multidimensional.jl:431 [inlined]
in macro expansion at cartesian.jl:64 [inlined]
in macro expansion at multidimensional.jl:429 [inlined]
in _unsafe_batchsetindex!(::Array{Int64,2}, ::Base.Repeated{DataArrays.NAtype}, ::UnitRange{Int64}, ::UnitRange{Int64}) at multidimensional.jl:421
in setindex!(::Array{Int64,2}, ::DataArrays.NAtype, ::UnitRange{Int64}, ::UnitRange{Int64}) at abstractarray.jl:832
in cat_t(::Int64, ::Type{T}, ::DataArrays.NAtype, ::Vararg{Any,N}) at abstractarray.jl:1098
in hcat(::DataArrays.NAtype, ::Int64) at abstractarray.jl:1180
in include_string(::String, ::String) at loading.jl:441
in include_string(::String, ::String, ::Int64) at eval.jl:30
in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
in withpath(::Function, ::String) at eval.jl:38
in macro expansion at eval.jl:49 [inlined]
in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60
My second question is, is there a way to convert DataArray to DataFrame? In this case the column names become X1, X2, ... or any default name in DataFrame since DataArray does not have column names. I think it is neater than typing the following:
women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
I wish I could simply do convert(DataFrame, women) and simply rename the column names. But that conversion does not work. And the following is my attempt on transformation or mutation in the case of R.
@> begin
women_new
@where !isna(:Height)
@transform(
Sector = NA,
Sector = ifelse(:Height .>= 0 & :Height .<= 60, 1,
ifelse(:Height .>= 61 & :Height .<= 67, 2,
ifelse(:Height .>= 68 & :Height .<= 72, 3, NA)))
)
end
But this will return:
15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Sector│
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 1 │
│ 5 │ 62 │ 126 │ 1 │
│ 6 │ 63 │ 129 │ 1 │
│ 7 │ 64 │ 132 │ 1 │
│ 8 │ 65 │ 135 │ 1 │
│ 9 │ 66 │ 139 │ 1 │
│ 10 │ 67 │ 142 │ 1 │
│ 11 │ 68 │ 146 │ 1 │
│ 12 │ 69 │ 150 │ 1 │
│ 13 │ 70 │ 154 │ 1 │
│ 14 │ 71 │ 159 │ 1 │
│ 15 │ 72 │ 164 │ 1 │
which is not equivalent to R, I also tried the following:
@> begin
women_new
@where !isna(:Height)
@transform(
Sector = NA,
Sector = :Height .>= 0 & :Height .<= 60 ? 1 :
:Height .>= 61 & :Height .<= 67 ? 2 :
:Height .>= 68 & :Height .<= 72 ? 3 :
NA;
)
end
But returns the following error:
TypeError: non-boolean (DataArrays.DataArray{Bool,1}) used in boolean context
in (::###469#303)(::DataArrays.DataArray{Int64,1}) at DataFramesMeta.jl:55
in (::##298#302)(::DataFrames.DataFrame) at DataFramesMeta.jl:295
in #transform#38(::Array{Any,1}, ::Function, ::DataFrames.DataFrame) at DataFramesMeta.jl:270
in (::DataFramesMeta.#kw##transform)(::Array{Any,1}, ::DataFramesMeta.#transform, ::DataFrames.DataFrame) at <missing>:0
in include_string(::String, ::String) at loading.jl:441
in include_string(::String, ::String, ::Int64) at eval.jl:30
in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
in withpath(::Function, ::String) at eval.jl:38
in macro expansion at eval.jl:49 [inlined]
in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60
I would appreciate any help figuring this out. Finally, my last question: is there a way to shorten my code, as in R, while keeping it elegant?
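I got it. The problem was operator precedence: & binds more tightly than the comparison operators, so the parentheses around each comparison are required; I had assumed they were not needed.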
using DataFrames
using DataFramesMeta
using Lazy
using RDatasets
women = dataset("datasets", "women");
women_new = vcat(
women,
DataFrame(Height = [NA; NA], Weight = @data [1; NA])
)
@> begin
women_new
@where !isna(:Height)
@transform(
Class = NA,
Class = ifelse((:Height .>= 0) & (:Height .<= 60), 1,
ifelse((:Height .>= 61) & (:Height .<= 67), 2,
ifelse((:Height .>= 68) & (:Height .<= 72), 3, NA)))
)
end
Update: The above code can be further simplified into:
@> begin
women_new
@where !isna(:Height)
@transform(
Class = @> begin
function (x)
0 <= x <= 60 ? 1 :
61 <= x <= 67 ? 2 :
68 <= x <= 72 ? 3 :
NA
end
map(:Height)
end
)
end
Or an alternative is to use Query.jl as follows:
using DataFrames
using Query
using RDatasets
women = dataset("datasets", "women");
women_new = vcat(
women,
DataFrame(Height = [NA; NA], Weight = @data [1; NA])
)
@from i in women_new begin
@where !isnull(i.Height)
@select {
i.Height, i.Weight,
class = 0 <= i.Height <= 60 ? 1 :
61 <= i.Height <= 67 ? 2 :
68 <= i.Height <= 72 ? 3 :
0
}
@collect DataFrame
end
The output is now correct:
15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 2 │
│ 5 │ 62 │ 126 │ 2 │
│ 6 │ 63 │ 129 │ 2 │
│ 7 │ 64 │ 132 │ 2 │
│ 8 │ 65 │ 135 │ 2 │
│ 9 │ 66 │ 139 │ 2 │
│ 10 │ 67 │ 142 │ 2 │
│ 11 │ 68 │ 146 │ 3 │
│ 12 │ 69 │ 150 │ 3 │
│ 13 │ 70 │ 154 │ 3 │
│ 14 │ 71 │ 159 │ 3 │
│ 15 │ 72 │ 164 │ 3 │
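The precedence problem mentioned at the start of this answer can be checked on a single integer in current Julia. The snippet below is a small illustration, not part of the original solution; the value 70 is chosen only because it makes the two readings disagree.

```julia
x = 70

# Parsed as 70 >= (61 & 70) <= 67. Since 61 & 70 == 4, this is true by accident.
unparenthesized = x >= 61 & x <= 67

# The intended test: is x between 61 and 67?
parenthesized = (x >= 61) & (x <= 67)

# A chained comparison expresses the same range check without any &.
chained = 61 <= x <= 67

println((unparenthesized, parenthesized, chained))
```

The chained form is the one the simplified versions of the code rely on, which is why they need neither & nor extra parentheses.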
If we don't want to filter out the NAs and instead work with the complete data, then the best I can do is the following:
@> begin
women_new
@transform(
Height_New = NA,
Height_New = ifelse(isna(:Height), -1, :Height))
@transform(
Class = NA,
Class = ifelse(:Height_New == -1, NA,
ifelse((:Height_New .>= 0) & (:Height_New .<= 60), 1,
ifelse((:Height_New .>= 61) & (:Height_New .<= 67), 2,
ifelse((:Height_New .>= 68) & (:Height_New .<= 72), 3, NA))))
)
delete!(:Height_New)
end
Update: The above code can be further simplified into:
@> begin
women_new
@transform(
Class = @> begin
function (x)
isna(x) ? NA :
0 <= x <= 60 ? 1 :
61 <= x <= 67 ? 2 :
68 <= x <= 72 ? 3 :
NA
end
map(:Height)
end
)
end
Or an alternative is to use Query.jl as follows:
@from i in women_new begin
@select {
i.Height, i.Weight,
class = 0 <= i.Height <= 60 ? 1 :
61 <= i.Height <= 67 ? 2 :
68 <= i.Height <= 72 ? 3 :
0
}
@collect DataFrame
end
The output:
17×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1 │ 58 │ 115 │ 1 │
│ 2 │ 59 │ 117 │ 1 │
│ 3 │ 60 │ 120 │ 1 │
│ 4 │ 61 │ 123 │ 2 │
│ 5 │ 62 │ 126 │ 2 │
│ 6 │ 63 │ 129 │ 2 │
│ 7 │ 64 │ 132 │ 2 │
│ 8 │ 65 │ 135 │ 2 │
│ 9 │ 66 │ 139 │ 2 │
│ 10 │ 67 │ 142 │ 2 │
│ 11 │ 68 │ 146 │ 3 │
│ 12 │ 69 │ 150 │ 3 │
│ 13 │ 70 │ 154 │ 3 │
│ 14 │ 71 │ 159 │ 3 │
│ 15 │ 72 │ 164 │ 3 │
│ 16 │ NA │ 1 │ NA │
│ 17 │ NA │ NA │ NA │
In this case, the code becomes messy because there is no method yet for handling NAs in ifelse's first argument.
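In current Julia, NA has been replaced by missing, and the same classification can be sketched as a small scalar function that is then broadcast over the column. This is a minimal sketch assuming plain vectors rather than the DataArrays used above; sector is a hypothetical helper name borrowed from the R code in the question.

```julia
# Hypothetical helper: classify one height, passing missing values through.
sector(h) = ismissing(h)  ? missing :
            0  <= h <= 60 ? 1 :
            61 <= h <= 67 ? 2 :
            68 <= h <= 72 ? 3 :
            missing

heights = [58, 64, 72, missing, 80]

# The dot broadcasts the scalar function over every element.
classes = sector.(heights)
println(classes)
```

Because the missing check happens first inside the function, there is no need for the temporary -1 column used in the ifelse version.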