I want to efficiently find the distance from the current row to the previous occurrence of the same value. I know polars doesn't have indexes, but the formula would roughly be:
if prior_occurrence {
(current_row_index - prior_occurrence_index - 1)
} else {
-1
}
This is the input dataframe:
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
].unwrap();
println!("{}", df_a);
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪═════╡
│ 1   ┆ c   │
│ 2   ┆ a   │
│ 2   ┆ b   │
│ 1   ┆ c   │
│ 4   ┆ c   │
│ 1   ┆ a   │
└─────┴─────┘
Wanted output:
shape: (6, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ b_dist │
│ --- ┆ --- ┆ ---    │
│ i32 ┆ str ┆ i32    │
╞═════╪═════╪════════╡
│ 1   ┆ c   ┆ -1     │
│ 2   ┆ a   ┆ -1     │
│ 2   ┆ b   ┆ -1     │
│ 1   ┆ c   ┆ 2      │
│ 4   ┆ c   ┆ 0      │
│ 1   ┆ a   ┆ 3      │
└─────┴─────┴────────┘
What's the most efficient way to go about this?
python
(df
.with_row_count("idx")
.with_columns([
((pl.col("idx") - pl.col("idx").shift()).cast(pl.Int32).fill_null(0) - 1)
.over("a").alias("a_distance_to_a")
])
)
rust
fn func1() -> PolarsResult<()> {
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
]?;
let out = df_a
.lazy()
.with_row_count("idx", None)
.with_columns([((col("idx") - col("idx").shift(1))
.cast(DataType::Int32)
.fill_null(0)
- lit(1))
.over("a")
.alias("a_distance_to_a")])
.collect()?;
    println!("{}", out);
    Ok(())
}
output
shape: (6, 4)
┌─────┬─────┬─────┬─────────────────┐
│ idx ┆ a ┆ b ┆ a_distance_to_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i32 │
╞═════╪═════╪═════╪═════════════════╡
│ 0 ┆ 1 ┆ c ┆ -1 │
│ 1 ┆ 2 ┆ a ┆ -1 │
│ 2 ┆ 2 ┆ b ┆ 0 │
│ 3 ┆ 1 ┆ c ┆ 2 │
│ 4 ┆ 4 ┆ c ┆ -1 │
│ 5 ┆ 1 ┆ a ┆ 1 │
└─────┴─────┴─────┴─────────────────┘
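To sanity-check the windowed expression, the same distance-to-previous-occurrence logic can be sketched in plain Python (a reference implementation, not polars):

```python
def dist_to_prev(values):
    """For each element, the number of rows between it and the previous
    occurrence of the same value (exclusive), or -1 if there is none."""
    last_seen = {}  # value -> row index of its most recent occurrence
    out = []
    for i, v in enumerate(values):
        out.append(i - last_seen[v] - 1 if v in last_seen else -1)
        last_seen[v] = i
    return out

# grouping by column "a" reproduces the polars output above
print(dist_to_prev([1, 2, 2, 1, 4, 1]))            # [-1, -1, 0, 2, -1, 1]
# grouping by column "b" gives the originally wanted b_dist
print(dist_to_prev(["c", "a", "b", "c", "c", "a"]))  # [-1, -1, -1, 2, 0, 3]
```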
I want to create a new variable (nota) using an if-else function with the transform! function from the DataFrames package.
Here is my code:
using DataFrames
datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
transform!(datos, :puntaje => ifelse(datos[datos.puntaje .< 9,:],"Suficiente","Excelente") => :nota)
but this error displays:
ERROR: TypeError: non-boolean (BitVector) used in boolean context
Stacktrace:
[1] top-level scope
# REPL[23]:1
How can I solve it?
Two problems:
ifelse expects a single Bool condition, not a whole BitVector, so it has to be broadcast with ifelse.
In transform!, the transformation must be passed as a function (plain or wrapped in ByRow) in the source => fun => dest spec; you cannot evaluate it eagerly against datos.
You probably want to do this:
julia> datos.puntaje = ifelse.(datos.puntaje .< 9, "Suficiente", "Excelente")
6-element Vector{String}:
"Excelente"
"Excelente"
"Suficiente"
"Suficiente"
"Excelente"
"Suficiente"
julia> datos
6×3 DataFrame
Row │ nombre grupo puntaje
│ String String String
─────┼──────────────────────────────
1 │ ANGELICA A Excelente
2 │ BRENDA A Excelente
3 │ LILIANA B Suficiente
4 │ MARCO B Suficiente
5 │ FABIAN C Excelente
6 │ MAURICIO C Suficiente
If you prefer using transform!, two syntaxes that work are:
julia> datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
julia> transform!(datos, :puntaje => (p -> ifelse.(p .< 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
julia> transform!(datos, :puntaje => ByRow(p -> ifelse(p < 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
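The column-wise vs row-wise distinction above is not Julia-specific; a minimal plain-Python sketch of the same two styles (variable names mirror the question):

```python
puntaje = [10, 9, 8, 8, 9, 7]

# row-wise, like ByRow(p -> ifelse(p < 9, ...)): operate on one score at a time
nota_rowwise = [("Suficiente" if p < 9 else "Excelente") for p in puntaje]

# column-wise, like :puntaje => (p -> ifelse.(p .< 9, ...)): the function
# receives the whole column and must map the comparison over it itself
def nota_column(col):
    return [("Suficiente" if p < 9 else "Excelente") for p in col]

assert nota_rowwise == nota_column(puntaje)
print(nota_rowwise)
# ['Excelente', 'Excelente', 'Suficiente', 'Suficiente', 'Excelente', 'Suficiente']
```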
I need to do something quite specific, and I'm trying to do it the right way; in particular, I want it to be optimized.
This is what my DataFrame looks like:
│ Row │ USER_ID │ GENRE_MAIN │ ALBUM_NAME │ ALBUM_ARTIST_NAME │ TOTAL_LISTENED │ TOTAL_COUNT_LISTENED │
│ │ String │ String │ String │ String │ DecFP.Dec128 │ DecFP.Dec128 │
├──────────┼─────────┼───────────────────┼─────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┼────────────────┼──────────────────────┤
│ 1 │ 9s2dsdd6 │ ROCK │ The Thought's Boy │ AARON MADISON │ 5912.0 │ 91.0 │
│ 2 │ 9s2dsdd6 │ ROCK │ The wqeqwewe │ AARON MADISON │ 3212.0 │ 91.0 │
│ 3 │ 11sdasd63 │ ROCK │ Down On The Upside │ SOUNDGARDEN │ 3354.0 │ 14.0 │
│ 4 │ 112sds334 │ CLASSICAL │ Beethoven: Symphonies Nos. 1 & 2 - C.P.E. Bach: Symphonies, Wq 175 & 183/17 │ AKADEMIE FÜR ALTE MUSIK BERLIN │ 1372.0 │ 4.0 │
│ 5 │ 145sdsd42 │ POP │ My Life in the Bush of Ghosts │ BRIAN ENO │ 3531.0 │ 17.0 │
I want to aggregate it by user (I have many rows per USER_ID) and do many calculations.
I'm doing the aggregation with this:
gdf = DataFrames.groupby(df, :USER_ID)
combine(gdf,:TOTAL_LISTENED => sum => :TOTAL_SECONDS_LISTENED,
:TOTAL_COUNT_LISTENED => sum => :TOTAL_TRACKS_LISTENED)
I need to calculate the top 1, 2, 3, 4, 5 genre, album name, and artist name per USER_ID, and the output has to look like this:
USER_ID │ ALBUM1_NAME │ ALBUM2_NAME ......│ GENRE1 │ GENRE2
One line per USER_ID.
So I tried to do it with a countmap, then sorting it, keeping only the top 5, and assigning each value to a column in a DataFrame:
transposed = sort(countmap(targetId[targetCol]), byvalue=true, rev=true)
for (i, g) in enumerate(eachcol(transposed))
rVal["ALBUM$(i)_NAME"] = g[1]
rVal["ALBUM$(i)_ARTIST"] = g[3]
rVal["ALBUM$(i)_TIME"] = g[2]
rVal["ALBUM$(i)_ID"] = "ID"
rVal["USER_ID"] = id
end
but it doesn't work inside a combine; it's just very ugly, and I'm sure I can do it a better way.
I hope this is understandable; if someone can help me, please =)
Thank you
EDIT : A way to reproduce the DataFrame:
v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])
You have not provided an easy way to reproduce your source data, so I am writing the solution from my head and hope I have not made any typo (note that you need DataFrames.jl 0.22 for this to work, while you seem to be on some older version of the package):
using Pkg
Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")
using DataFrames, Pipe, Random
Random.seed!(1234)
df = DataFrame(USER_ID=rand(1:10, 80),
GENRE_MAIN=rand(string.("genre_", 1:6), 80),
ALBUM_NAME=rand(string.("album_", 1:6), 80),
ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))
function top5(sdf, col, prefix)
return @pipe groupby(sdf, col) |>
combine(_, nrow) |>
sort!(_, :nrow, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
DataFrame([string(prefix, i) for i in 1:5] .=> _)
end
@pipe groupby(df, :USER_ID) |>
combine(_,
x -> top5(x, :GENRE_MAIN, "genre"),
x -> top5(x, :ALBUM_NAME, "album"),
x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))
The code is a bit complex as we have to handle the fact that there might be less than 5 entries per group.
It produces under Julia 1.5.3:
10×16 DataFrame
Row │ USER_ID genre1 genre2 genre3 genre4 genre5 album1 album2 album3 album4 album5 artist1 artist2 artist3 artist4 artist5
│ Int64 String String String String? String? String String String String String? String String String String? String?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 genre_1 genre_3 genre_5 genre_2 genre_4 album_3 album_5 album_6 album_1 album_4 artist_1 artist_4 artist_3 artist_2 artist_6
2 │ 4 genre_1 genre_3 genre_6 genre_2 missing album_2 album_4 album_5 album_6 missing artist_4 artist_5 artist_2 missing missing
3 │ 8 genre_2 genre_1 genre_6 genre_5 missing album_1 album_5 album_4 album_2 missing artist_5 artist_6 artist_4 artist_1 artist_3
4 │ 2 genre_1 genre_5 genre_2 genre_4 genre_3 album_6 album_3 album_4 album_2 album_1 artist_4 artist_2 artist_6 artist_1 artist_5
5 │ 10 genre_5 genre_3 genre_6 genre_4 genre_2 album_2 album_3 album_1 album_5 album_4 artist_1 artist_6 artist_2 artist_5 artist_3
6 │ 7 genre_5 genre_3 genre_2 genre_4 genre_1 album_2 album_4 album_3 album_5 missing artist_4 artist_1 artist_3 artist_5 missing
7 │ 9 genre_3 genre_4 genre_2 missing missing album_1 album_3 album_4 album_2 missing artist_4 artist_2 artist_6 artist_3 missing
8 │ 5 genre_2 genre_3 genre_4 genre_6 missing album_2 album_1 album_3 album_4 missing artist_6 artist_5 artist_4 artist_1 missing
9 │ 3 genre_6 genre_5 genre_4 genre_2 genre_1 album_3 album_4 album_1 album_5 missing artist_4 artist_3 artist_6 artist_5 missing
10 │ 6 genre_3 genre_4 genre_1 genre_6 missing album_2 album_4 album_5 album_3 missing artist_4 artist_6 artist_5 missing missing
which I assume you wanted?
I have the following non-standard file and want to read it efficiently into a DataFrame, but I don't know how to handle it well.
m[mi++]="18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53"
m[mi++]="18.06.16 21:55:00|0;24;7;1667;306;306;61|0;21;4;2153;384;166;62|0;19;1;2165;368;185;62"
m[mi++]="18.06.16 21:50:00|0;31;6;1667;394;349;62|0;32;6;2153;402;164;63|0;33;4;2165;380;171;63"
m[mi++]="18.06.16 21:45:00|11;50;19;1667;390;312;63|34;61;9;2153;410;166;63|19;60;8;2165;391;185;63"
m[mi++]="18.06.16 21:40:00|37;63;24;1666;387;313;63|55;80;12;2150;418;169;64|46;79;11;2163;398;186;63"
If I have saved this data in a file a.js the following code works:
b = readtable("a.js")
dat_5m = DataFrame(t=DateTime[], Pac_1=Float64[], Pdc_A1=Float64[], Pdc_B1=Float64[], Etot1=Float64[],
U_A1=Float64[], U_B1=Float64[], T_1=Float64[], Pac_2=Float64[], Pdc_A2=Float64[], Pdc_B2=Float64[],
Etot2=Float64[], U_A2=Float64[], U_B2=Float64[], T_2=Float64[], Pac_3=Float64[], Pdc_A3=Float64[],
Pdc_B3=Float64[], Etot3=Float64[], U_A3=Float64[], U_B3=Float64[], T_3=Float64[])
for r in nrow(b):-1:1
c = split(b[r,1], [';','|','=']);
d = float(c[3:end])'
t = DateTime(c[2], DateFormat("dd.mm.yy HH:MM:SS"))
push!(dat_5m, [t d])
end
But I think it is quite inelegant. Is there a more efficient way?
If possible (IO permitting), I'd recommend cleaning the file first, removing | and any funky " and prefixes. Then you can use CSV.jl to make your life easier.
The process is: read line by line, clean each line and then use CSV to read it back in:
julia> using CSV
julia> open("b.js", "a") do new_file
for line in readlines("a.js")
line = replace(line, "m[mi++]=\"", "")
# remove only trailing quote
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
write(new_file, string(line, "\n"))
end
end
julia> b = CSV.read("b.js",
delim = ";",
header = false, # change this to a string vector to provide column names
dateformat = "dd.mm.yy HH:MM:SS");
julia> # correct times to be in 2016
b[:Column1] += Dates.Year(2000);
julia> head(b)
6×22 DataFrames.DataFrame. Omitted printing of 15 columns
│ Row │ Column1 │ Column2 │ Column3 │ Column4 │ Column5 │ Column6 │ Column7 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
│ 2 │ 2016-06-18T21:55:00 │ 0 │ 24 │ 7 │ 1667 │ 306 │ 306 │
│ 3 │ 2016-06-18T21:50:00 │ 0 │ 31 │ 6 │ 1667 │ 394 │ 349 │
│ 4 │ 2016-06-18T21:45:00 │ 11 │ 50 │ 19 │ 1667 │ 390 │ 312 │
│ 5 │ 2016-06-18T21:40:00 │ 37 │ 63 │ 24 │ 1666 │ 387 │ 313 │
│ 6 │ 2016-06-18T22:00:00 │ 0 │ 3 │ 1 │ 1667 │ 49 │ 49 │
I cleaned the lines here step by step to make the process easier to understand. If you're a regex master you might want to do that in one line.
There might be a better way of handling the 2 digit years, so if someone knows how to fix that please edit this answer or add it as a comment. Thanks!
Edit:
If you want to do the same without writing to another file, here's a hack that applies the cleaning function and reuses CSV.read() by converting the readlines array back into an IOBuffer:
function cleaner(line)
line = replace(line, "m[mi++]=\"", "")
line = replace(line, r"\"$", "")
line = replace(line, "|", ";")
println(line)
string(line, "\n")
end
c = CSV.read(
Base.IOBuffer(
string(cleaner.(readlines("a.js"))...)),
delim = ";",
header = false,
dateformat = "dd.mm.yy HH:MM:SS");
c[:Column1] += Dates.Year(2000);
This gives the same result as the other solution.
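The same clean-then-parse round trip can be sketched with Python's stdlib alone (the regexes mirror the m[mi++]="..." prefix from the file; reading from disk is replaced by an inline sample):

```python
import csv
import io
import re
from datetime import datetime

raw = [
    'm[mi++]="18.06.16 22:00:00|0;3;1;1667;49;49;12|0;17;3;2153;328;153;57|0;18;0;2165;284;156;53"',
]

def clean(line):
    line = re.sub(r'^m\[mi\+\+\]="', "", line)  # strip the JS assignment prefix
    line = re.sub(r'"$', "", line)              # strip only the trailing quote
    return line.replace("|", ";")               # one delimiter everywhere

buf = io.StringIO("\n".join(clean(l) for l in raw))
rows = []
for rec in csv.reader(buf, delimiter=";"):
    # %y maps two-digit years 00-68 into the 2000s, so "16" becomes 2016
    # with no manual +2000 correction needed
    t = datetime.strptime(rec[0], "%d.%m.%y %H:%M:%S")
    rows.append([t] + [float(x) for x in rec[1:]])

print(rows[0][0], rows[0][1:4])
```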
In some quick exploratory work, IndexedTables seems much faster than DataFrames for working on individual elements (e.g. select or "update"), but DataFrames has a nicer ecosystem of functionality, e.g. plotting and exporting.
So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_tn = DataFrame(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
Columns(
String["price","price","price","price","waterContent","waterContent"],
String["banana","banana","apple","apple","banana", "apple"],
Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_t = DataFrame(
x1 = String["price","price","price","price","waterContent","waterContent"],
x2 = String["banana","banana","apple","apple","banana", "apple"],
x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can find the individual "row" values by iterating over the table with pairs():
for (i,pair) in enumerate(pairs(tn))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
println(rowValues)
end
I can't, however, get the column names and types, and I guess working by column would be more efficient anyway.
EDIT: I did manage to get the "column" types with the following code; I just need now to get the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
push!(colTypes, eltype(item))
end
for item in tn.data.columns
push!(colTypes, eltype(item))
end
EDIT2: As requested, this is an example of an IndexedTable whose column names would fail to convert using the (current) answer by Dan Getz, as the "index" columns are a named tuple but the "data" columns are a normal tuple:
t_named_idx = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
)
)
The problem seems to be in the IndexedTable API, specifically in the columns(t) function, which doesn't distinguish between index and values.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6 with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:
toDataFrame(t::IndexedTable) =
hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
And then, the mixed case gives a result like:
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Ugly, quick-and-dirty "solution" (I hope it is doable in another way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Just install IterableTables.jl and then:
using IterableTables
df = DataFrames.DataFrame(it)
where it is the IndexedTable to convert (e.g. tn from the question).
Here is an initial attempt at a conversion function. It keeps column names and types; it would be nice if it could be cleaned up and implemented in either the DataFrames or the IndexedTables package as convert(DataFrame, t::IndexedTable).
function toDataFrame(t::IndexedTable)
# Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
# Getting the column types.. this is independent if it is a keyed or normal IndexedArray
colTypes = Union{Union,DataType}[]
for item in t.index.columns
push!(colTypes, eltype(item))
end
if(typeof(t.data) <: Vector) # The Data part is a simple Array
push!(colTypes, eltype(t.data))
else # The data part is a Tuple
for item in t.data.columns
push!(colTypes, eltype(item))
end
end
# Getting the column names.. this change if it is a keyed or normal IndexedArray
colNames = Symbol[]
lIdx = length(t.index.columns)
if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in 1:lIdx]
else # NamedTuple
for (k,v) in zip(keys(t.index.columns), t.index.columns)
push!(colNames, k)
end
end
if(typeof(t.data) <: Vector) # The Data part is a simple single Array
push!(colNames, Symbol("x",lIdx+1))
else
lData = length(t.data.columns)
if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
else # NamedTuple
for (k,v) in zip(keys(t.data.columns), t.data.columns)
push!(colNames, k)
end
end
end
# building an empty DataFrame..
df = DataFrame()
for i in 1:length(colTypes)
df[colNames[i]] = colTypes[i][]
end
# and finally filling the df with values..
for (i,pair) in enumerate(pairs(t))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
push!(df, rowValues)
end
return df
end
EDIT 20210106:
Solution for NDSparse indexed table with a single value column:
# NDSparse creation...
content = [["banana","banana","apple","apple","orange"],["us",missing,"us","eu","us"],[1.1,2.2,3.3,missing,5.5]]
dimNames = ["item","region"]
t = NDSparse(content...,names=Symbol.(dimNames))
# NDSparse conversion to df...
names = vcat(keys(keys(t)[1])...,:value)
cols = columns(t)
df = DataFrame(map((n,v) -> Pair(n,v), names, cols))
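The naming logic in the conversion above (keep names where a column has them, synthesize x1, x2, … otherwise, numbering across both halves) can be sketched generically in Python; here dicts stand in for named tuples and lists for plain tuples, which is an analogy rather than IndexedTable's real representation:

```python
def column_names(index_cols, data_cols, prefix="x"):
    """index_cols/data_cols: dict (named tuple analogue) or list (plain tuple).
    Returns one flat list of column names, generating prefixN for unnamed ones
    so numbering continues across the index/data boundary."""
    names = []
    for cols in (index_cols, data_cols):
        if isinstance(cols, dict):       # named tuple: keep the given names
            names.extend(cols.keys())
        else:                            # plain tuple: synthesize xN names
            base = len(names)            # continue numbering after existing names
            names.extend(f"{prefix}{base + i + 1}" for i in range(len(cols)))
    return names

# mixed case from EDIT2: named index columns, unnamed data columns
print(column_names({"param": [], "item": [], "region": []}, [[], []]))
# fully unnamed case, as in table t
print(column_names([[], [], []], [[], []]))
```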