I want to efficiently find the distance from the current row to the previous occurrence. I know polars doesn't have indexes, but the formula would roughly be:
if prior_occurrence {
(current_row_index - prior_occurrence_index - 1)
} else {
-1
}
This is the input dataframe:
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
].unwrap();
println!("{}", df_a);
a - i32
b - str
1
c
2
a
2
b
1
c
4
c
1
a
Wanted output:
a - i32
b - str
b_dist - i32
1
c
-1
2
a
-1
2
b
-1
1
c
2
4
c
0
1
a
3
What's the most efficient way to go about this?
python
(df
.with_row_count("idx")
.with_columns([
((pl.col("idx") - pl.col("idx").shift()).cast(pl.Int32).fill_null(0) - 1)
.over("a").alias("a_distance_to_a")
])
)
rust
fn func1() -> PolarsResult<()> {
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
]?;
let out = df_a
.lazy()
.with_row_count("idx", None)
.with_columns([((col("idx") - col("idx").shift(1))
.cast(DataType::Int32)
.fill_null(0)
- lit(1))
.over("a")
.alias("a_distance_to_a")])
.collect()?;
Ok(())
output
shape: (6, 4)
┌─────┬─────┬─────┬─────────────────┐
│ idx ┆ a ┆ b ┆ a_distance_to_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i32 │
╞═════╪═════╪═════╪═════════════════╡
│ 0 ┆ 1 ┆ c ┆ -1 │
│ 1 ┆ 2 ┆ a ┆ -1 │
│ 2 ┆ 2 ┆ b ┆ 0 │
│ 3 ┆ 1 ┆ c ┆ 2 │
│ 4 ┆ 4 ┆ c ┆ -1 │
│ 5 ┆ 1 ┆ a ┆ 1 │
└─────┴─────┴─────┴─────────────────┘
I want to create a new variable (nota) using an if-else function with the transform! function from the DataFrames package.
Here is my code:
using DataFrames
datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
transform!(datos, :puntaje => ifelse(datos[datos.puntaje .< 9,:],"Suficiente","Excelente") => :nota)
but this error displays:
ERROR: TypeError: non-boolean (BitVector) used in boolean context
Stacktrace:
[1] top-level scope
# REPL[23]:1
How can I solve it?
Two problems:
You are mixing ByRow with normal transform which is by column
You can't mutate type of puntaje column
You probably want to do this:
julia> datos.puntaje = ifelse.(datos.puntaje .< 9, "Suficiente", "Excelente")
6-element Vector{String}:
"Excelente"
"Excelente"
"Suficiente"
"Suficiente"
"Excelente"
"Suficiente"
julia> datos
6×3 DataFrame
Row │ nombre grupo puntaje
│ String String String
─────┼──────────────────────────────
1 │ ANGELICA A Excelente
2 │ BRENDA A Excelente
3 │ LILIANA B Suficiente
4 │ MARCO B Suficiente
5 │ FABIAN C Excelente
6 │ MAURICIO C Suficiente
If you prefer using transform!, two syntaxes that work are:
julia> datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
julia> transform!(datos, :puntaje => (p -> ifelse.(p .< 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
julia> transform!(datos, :puntaje => ByRow(p -> ifelse(p < 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
In a quick explanatory work, IndexedTables seem much faster than DataFrames to work on individual elements (e.g. select or "update"), but DataFrames have a nicer ecosystem of functionalities, e.g. plotting, exporting..
So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_tn = DataFrame(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
Columns(
String["price","price","price","price","waterContent","waterContent"],
String["banana","banana","apple","apple","banana", "apple"],
Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_t = DataFrame(
x1 = String["price","price","price","price","waterContent","waterContent"],
x2 = String["banana","banana","apple","apple","banana", "apple"],
x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can find the individual "row" values interacting over the table with pair():
for (i,pair) in enumerate(pairs(tn))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
println(rowValues)
end
I can't however get the columns names and types, and I guess working by column would instead be more efficient.
EDIT : I did manage to get the "column" types with the above code, I just need now to get the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
push!(colTypes, eltype(item))
end
for item in tn.data.columns
push!(colTypes, eltype(item))
end
EDIT2: As requested, this is an example of an IndexedTable that would fail conversion of columns names using (current) Dan Getz answer, as the "index" column(s) are named tuple but the "data" column(s) are normal tuples:
t_named_idx = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
)
)
The problem seems to be in IndexedTable API, and specifically in columns(t) function, that doesn't distinguish between index and values.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6 with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
EDIT: As noted by #Antonello the case for mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:
toDataFrame(t::IndexedTable) =
hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
And then, the mixed case gives a result like:
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Ugly, quick and dirty "solution" (I hope it is doable in other way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Just install IterableTables and then
using IterableTables
df = DataFrames.DataFrame(it)
Here it is an initial attampt to write a conversion function.. it keeps column names and type.. it would be nice if it could be cleaned up and implemented in either the DataFrame or the IndexedTable package as convert(DataFrame,t::IndexedArray).
function toDataFrame(t::IndexedTable)
# Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
# Getting the column types.. this is independent if it is a keyed or normal IndexedArray
colTypes = Union{Union,DataType}[]
for item in t.index.columns
push!(colTypes, eltype(item))
end
if(typeof(t.data) <: Vector) # The Data part is a simple Array
push!(colTypes, eltype(t.data))
else # The data part is a Tuple
for item in t.data.columns
push!(colTypes, eltype(item))
end
end
# Getting the column names.. this change if it is a keyed or normal IndexedArray
colNames = Symbol[]
lIdx = length(t.index.columns)
if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in 1:lIdx]
else # NamedTuple
for (k,v) in zip(keys(t.index.columns), t.index.columns)
push!(colNames, k)
end
end
if(typeof(t.data) <: Vector) # The Data part is a simple single Array
push!(colNames, Symbol("x",lIdx+1))
else
lData = length(t.data.columns)
if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
else # NamedTuple
for (k,v) in zip(keys(t.data.columns), t.data.columns)
push!(colNames, k)
end
end
end
# building an empty DataFrame..
df = DataFrame()
for i in 1:length(colTypes)
df[colNames[i]] = colTypes[i][]
end
# and finally filling the df with values..
for (i,pair) in enumerate(pairs(t))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
push!(df, rowValues)
end
return df
end
EDIT 20210106:
Solution for NDSparse indexed table with a single value column:
# NDSparse creation...
content = [["banana","banana","apple","apple","orange"],["us",missing,"us","eu","us"],[1.1,2.2,3.3,missing,5.5]]
dimNames = ["item","region"]
t = NDSparse(content...,names=Symbol.(dimNames))
# NDSparse conversion to df...
names = vcat(keys(keys(t)[1])...,:value)
cols = columns(t)
df = DataFrame(map((n,v) -> Pair(n,v), names, cols))