Julia DataFrame - compare row value with previous row

Here is an example of the first rows of the DataFrame, with the original columns Height and Size.
The desired results are in the new additional columns Result_by_row and Result_by_column.
Result_by_row: compare the row-2 value with the row-1 value and, if it is higher, return true in the new column Result_by_row (example: 46 > 17, so return true in row 2 of Result_by_row).
Result_by_column: same principle, but across different columns (example: 46 > 32, so return true in row 2 of Result_by_column).
Thanks.
I'm new to Julia, so I don't know how to do it :)

Here is one way you can do it:
julia> using DataFrames
julia> df = DataFrame(Height=[32.0, 35.0, 63.0, 84.0, 72.0], Size=[17.0, 46.0, 18.0, 56.0, 6.0])
5×2 DataFrame
Row │ Height Size
│ Float64 Float64
─────┼──────────────────
1 │ 32.0 17.0
2 │ 35.0 46.0
3 │ 63.0 18.0
4 │ 84.0 56.0
5 │ 72.0 6.0
julia> using ShiftedArrays: lag
julia> df.Result_by_row = df.Height .> lag(df.Height)
5-element Vector{Union{Missing, Bool}}:
missing
true
true
true
false
julia> df.Result_by_column = df.Size .> lag(df.Height)
5-element Vector{Union{Missing, Bool}}:
missing
true
false
false
false
julia> df
5×4 DataFrame
Row │ Height Size Result_by_row Result_by_column
│ Float64 Float64 Bool? Bool?
─────┼───────────────────────────────────────────────────
1 │ 32.0 17.0 missing missing
2 │ 35.0 46.0 true true
3 │ 63.0 18.0 true false
4 │ 84.0 56.0 true false
5 │ 72.0 6.0 false false
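For intuition, the same row-vs-previous-row comparison can be written out in plain Python (a language-agnostic sketch, not part of the Julia answer); the first element has no predecessor, which mirrors the missing above:

```python
def gt_prev(xs):
    """Compare each element with the previous one; None where there is no previous value."""
    return [None] + [curr > prev for prev, curr in zip(xs, xs[1:])]

def gt_prev_other(xs, ys):
    """Compare ys[i] with xs[i-1] (the cross-column variant)."""
    return [None] + [y > x for x, y in zip(xs, ys[1:])]

height = [32.0, 35.0, 63.0, 84.0, 72.0]
size = [17.0, 46.0, 18.0, 56.0, 6.0]
print(gt_prev(height))               # [None, True, True, True, False]
print(gt_prev_other(height, size))   # [None, True, False, False, False]
```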

Related

How do I get the distance to a previous occurrence of a value in Polars dataframe?

I want to efficiently find the distance from the current row to the previous occurrence. I know polars doesn't have indexes, but the formula would roughly be:
if prior_occurrence {
(current_row_index - prior_occurrence_index - 1)
} else {
-1
}
This is the input dataframe:
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
].unwrap();
println!("{}", df_a);
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪═════╡
│ 1   ┆ c   │
│ 2   ┆ a   │
│ 2   ┆ b   │
│ 1   ┆ c   │
│ 4   ┆ c   │
│ 1   ┆ a   │
└─────┴─────┘
Wanted output:
shape: (6, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ b_dist │
│ --- ┆ --- ┆ ---    │
│ i32 ┆ str ┆ i32    │
╞═════╪═════╪════════╡
│ 1   ┆ c   ┆ -1     │
│ 2   ┆ a   ┆ -1     │
│ 2   ┆ b   ┆ -1     │
│ 1   ┆ c   ┆ 2      │
│ 4   ┆ c   ┆ 0      │
│ 1   ┆ a   ┆ 3      │
└─────┴─────┴────────┘
What's the most efficient way to go about this?
python
(df
.with_row_count("idx")
.with_columns([
((pl.col("idx") - pl.col("idx").shift()).cast(pl.Int32).fill_null(0) - 1)
.over("a").alias("a_distance_to_a")
])
)
rust
fn func1() -> PolarsResult<()> {
let df_a = df![
"a" => [1, 2, 2, 1, 4, 1],
"b" => ["c","a", "b", "c", "c","a"]
]?;
let out = df_a
.lazy()
.with_row_count("idx", None)
.with_columns([((col("idx") - col("idx").shift(1))
.cast(DataType::Int32)
.fill_null(0)
- lit(1))
.over("a")
.alias("a_distance_to_a")])
.collect()?;
Ok(())
}
output
shape: (6, 4)
┌─────┬─────┬─────┬─────────────────┐
│ idx ┆ a ┆ b ┆ a_distance_to_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i32 │
╞═════╪═════╪═════╪═════════════════╡
│ 0 ┆ 1 ┆ c ┆ -1 │
│ 1 ┆ 2 ┆ a ┆ -1 │
│ 2 ┆ 2 ┆ b ┆ 0 │
│ 3 ┆ 1 ┆ c ┆ 2 │
│ 4 ┆ 4 ┆ c ┆ -1 │
│ 5 ┆ 1 ┆ a ┆ 1 │
└─────┴─────┴─────┴─────────────────┘
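The answer above demonstrates the idea on column "a"; grouping over "b" instead reproduces the wanted b_dist. Stripped of the polars expressions, the underlying logic is just "remember the last index where each value was seen", which can be sketched in plain Python:

```python
def dist_to_prev(values):
    """Exclusive distance to the previous occurrence of each value; -1 for a first occurrence."""
    last_seen = {}
    out = []
    for i, v in enumerate(values):
        out.append(i - last_seen[v] - 1 if v in last_seen else -1)
        last_seen[v] = i
    return out

print(dist_to_prev(["c", "a", "b", "c", "c", "a"]))  # [-1, -1, -1, 2, 0, 3]
print(dist_to_prev([1, 2, 2, 1, 4, 1]))              # [-1, -1, 0, 2, -1, 1]
```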

Sort Julia DataFrame in descending Order

I have a DataFrame
using DataFrames
using DataFramesMeta
using Chain
df = DataFrame(a=1:3,b=4:6)
that I want to sort descending by column :a. Doing it ascending is intuitive...
@chain df begin
    @orderby(:a)
end
How can I do this?
Looking for different solutions here, but specifically for a solution that can be used inside @chain blocks.
So, if you have a numeric column, you can simply negate it...
using DataFrames
using DataFramesMeta
using Chain
df = DataFrame(a=1:3,b=4:6)
@chain df begin
    @orderby(-:a)
end
But this logic doesn't work for Dates for example.
You can just use sort with rev=true:
julia> @chain df begin
           sort(:a, rev=true)
       end
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 3 6
2 │ 2 5
3 │ 1 4
Here is an example for multiple columns that uses order; see also the docstring of sort and https://bkamins.github.io/julialang/2021/03/12/sorting.html:
julia> df = DataFrame(a=rand(Bool, 10), b=rand(Bool, 10))
10×2 DataFrame
Row │ a b
│ Bool Bool
─────┼──────────────
1 │ false true
2 │ true false
3 │ true true
4 │ false true
5 │ false true
6 │ false false
7 │ false false
8 │ true false
9 │ true true
10 │ true false
julia> sort(df, [order(:a, rev=true), :b])
10×2 DataFrame
Row │ a b
│ Bool Bool
─────┼──────────────
1 │ true false
2 │ true false
3 │ true false
4 │ true true
5 │ true true
6 │ false false
7 │ false false
8 │ false true
9 │ false true
10 │ false true
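The order(:a, rev=true) trick amounts to building a composite sort key where one column is reversed. A rough Python analogue (illustrative only, not DataFrames.jl API):

```python
# four (a, b) boolean rows, similar to the Julia example above
rows = [(False, True), (True, False), (True, True), (False, False)]
# sort descending on the first field and ascending on the second:
# negating a numeric key (bools count as ints) flips its sort direction
ordered = sorted(rows, key=lambda r: (-r[0], r[1]))
print(ordered)  # [(True, False), (True, True), (False, False), (False, True)]
```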

How can I create a new variable in a Julia DataFrame using an ifelse condition?

I want to create a new variable (nota) using an if-else function with the transform! function from the DataFrames package.
Here is my code:
using DataFrames
datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
transform!(datos, :puntaje => ifelse(datos[datos.puntaje .< 9,:],"Suficiente","Excelente") => :nota)
but this error displays:
ERROR: TypeError: non-boolean (BitVector) used in boolean context
Stacktrace:
[1] top-level scope
# REPL[23]:1
How can I solve it?
Two problems:
- you are mixing ByRow (per-row) logic with a normal transform, which operates on whole columns, so ifelse receives a BitVector instead of a single Bool;
- you can't mutate the element type of the puntaje column in place.
You probably want to do this:
julia> datos.puntaje = ifelse.(datos.puntaje .< 9, "Suficiente", "Excelente")
6-element Vector{String}:
"Excelente"
"Excelente"
"Suficiente"
"Suficiente"
"Excelente"
"Suficiente"
julia> datos
6×3 DataFrame
Row │ nombre grupo puntaje
│ String String String
─────┼──────────────────────────────
1 │ ANGELICA A Excelente
2 │ BRENDA A Excelente
3 │ LILIANA B Suficiente
4 │ MARCO B Suficiente
5 │ FABIAN C Excelente
6 │ MAURICIO C Suficiente
If you prefer using transform!, two syntaxes that work are:
julia> datos = DataFrame(nombre =["ANGELICA","BRENDA","LILIANA","MARCO","FABIAN","MAURICIO"],
grupo = ["A","A","B","B","C","C"],
puntaje = [10,9,8,8,9,7]);
julia> transform!(datos, :puntaje => (p -> ifelse.(p .< 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
julia> transform!(datos, :puntaje => ByRow(p -> ifelse(p < 9, "Suficiente", "Excelente")) => :nota)
6×4 DataFrame
Row │ nombre grupo puntaje nota
│ String String Int64 String
─────┼───────────────────────────────────────
1 │ ANGELICA A 10 Excelente
2 │ BRENDA A 9 Excelente
3 │ LILIANA B 8 Suficiente
4 │ MARCO B 8 Suficiente
5 │ FABIAN C 9 Excelente
6 │ MAURICIO C 7 Suficiente
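The per-row logic that transform! applies here, mapping scores below 9 to "Suficiente" and the rest to "Excelente", is equivalent to a simple comprehension. For comparison, a Python sketch (the nota helper is hypothetical, not DataFrames.jl code):

```python
def nota(puntajes):
    """Label each score: below 9 is 'Suficiente', otherwise 'Excelente'."""
    return ["Suficiente" if p < 9 else "Excelente" for p in puntajes]

print(nota([10, 9, 8, 8, 9, 7]))
# ['Excelente', 'Excelente', 'Suficiente', 'Suficiente', 'Excelente', 'Suficiente']
```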

How to convert an IndexedTable to a DataFrame in Julia?

In a quick exploratory test, IndexedTables seem much faster than DataFrames for working on individual elements (e.g. select or "update"), but DataFrames have a nicer ecosystem of functionality, e.g. plotting and exporting.
So, at a certain point of the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_tn = DataFrame(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
Columns(
String["price","price","price","price","waterContent","waterContent"],
String["banana","banana","apple","apple","banana", "apple"],
Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
)
to >>
df_t = DataFrame(
x1 = String["price","price","price","price","waterContent","waterContent"],
x2 = String["banana","banana","apple","apple","banana", "apple"],
x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can find the individual "row" values by iterating over the table with pairs():
for (i,pair) in enumerate(pairs(tn))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
println(rowValues)
end
However, I can't get the column names and types, and I guess working by column would be more efficient anyway.
EDIT: I did manage to get the "column" types with the code below; I now just need to get the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
push!(colTypes, eltype(item))
end
for item in tn.data.columns
push!(colTypes, eltype(item))
end
EDIT2: As requested, here is an example of an IndexedTable whose column names would fail to convert using the (current) answer by Dan Getz, as the "index" columns are a named tuple while the "data" columns are a plain tuple:
t_named_idx = Table(
Columns(
param = String["price","price","price","price","waterContent","waterContent"],
item = String["banana","banana","apple","apple","banana", "apple"],
region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
),
Columns(
Float64[2.8,2.7,1.1,0.8,0.2,0.7],
)
)
The problem seems to be in the IndexedTables API, and specifically in the columns(t) function, which doesn't distinguish between index and value columns.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6 with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it, we can define:
toDataFrame(t::IndexedTable) =
hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
And then, the mixed case gives a result like:
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Ugly, quick-and-dirty "solution" (I hope it can be done another way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Just install IterableTables and then:
using IterableTables
df = DataFrames.DataFrame(it) # where `it` is your IndexedTable
Here is an initial attempt at a conversion function. It keeps column names and types. It would be nice if it could be cleaned up and implemented in either the DataFrames or the IndexedTables package as convert(DataFrame, t::IndexedTable).
function toDataFrame(t::IndexedTable)
# Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
# Getting the column types.. this is the same whether it is a keyed or a normal IndexedArray
colTypes = Union{Union,DataType}[]
for item in t.index.columns
push!(colTypes, eltype(item))
end
if(typeof(t.data) <: Vector) # The Data part is a simple Array
push!(colTypes, eltype(t.data))
else # The data part is a Tuple
for item in t.data.columns
push!(colTypes, eltype(item))
end
end
# Getting the column names.. this changes depending on whether it is a keyed or a normal IndexedArray
colNames = Symbol[]
lIdx = length(t.index.columns)
if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in 1:lIdx]
else # NamedTuple
for (k,v) in zip(keys(t.index.columns), t.index.columns)
push!(colNames, k)
end
end
if(typeof(t.data) <: Vector) # The Data part is a simple single Array
push!(colNames, Symbol("x",lIdx+1))
else
lData = length(t.data.columns)
if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
[push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
else # NamedTuple
for (k,v) in zip(keys(t.data.columns), t.data.columns)
push!(colNames, k)
end
end
end
# building an empty DataFrame..
df = DataFrame()
for i in 1:length(colTypes)
df[colNames[i]] = colTypes[i][]
end
# and finally filling the df with values..
for (i,pair) in enumerate(pairs(t))
rowValues = []
for (j,section) in enumerate(pair)
for item in section
push!(rowValues,item)
end
end
push!(df, rowValues)
end
return df
end
EDIT 20210106:
Solution for NDSparse indexed table with a single value column:
# NDSparse creation...
content = [["banana","banana","apple","apple","orange"],["us",missing,"us","eu","us"],[1.1,2.2,3.3,missing,5.5]]
dimNames = ["item","region"]
t = NDSparse(content...,names=Symbol.(dimNames))
# NDSparse conversion to df...
names = vcat(keys(keys(t)[1])...,:value)
cols = columns(t)
df = DataFrame(map((n,v) -> Pair(n,v), names, cols))
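Across all these variants, the recurring step is pairing each column with its name and falling back to generated x1..xN names when the tuple is unnamed. That naming logic, isolated in a Python sketch for clarity (a hypothetical helper, independent of IndexedTables):

```python
def name_columns(cols, names=None, prefix="x", start=1):
    """Pair columns with names; generate prefix1..prefixN when no names are given."""
    if names is None:
        names = [f"{prefix}{i}" for i in range(start, start + len(cols))]
    return dict(zip(names, cols))

named = name_columns([[1.1, 0.8], [1.2, 0.8]], names=["value2000", "value2010"])
unnamed = name_columns([[1.1, 0.8], [1.2, 0.8]])  # falls back to {'x1': ..., 'x2': ...}
```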

Unable to remove NaN from panda Series

I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing them directly from the DataFrame, as in In [7] and In [8] of the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html):
In[1]:
df['salary'][:5]
Out[1]:
0 365788
1 267102
2 170941
3 NaN
4 243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0 False
1 False
2 False
3 False
4 False
I was expecting row 3 to show up as True, but it didn't. I extracted the Series from the DataFrame to try again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0 False
1 False
2 False
3 False
4 False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0 365788
1 267102
2 170941
3 NaN
4 243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0 365788
1 267102
2 170941
3 NaN
4 243293
dtype: object
>>> sa1.isnull()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool
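The root cause is that the string "NaN" is ordinary text, not the floating-point NaN that pandas treats as missing; only after conversion to float does it become a real NaN. The same distinction shown in plain Python (no pandas required):

```python
import math

vals = ["365788", "267102", "170941", "NaN", "243293"]
# as strings, "NaN" is just three characters: no element is a floating-point NaN
print([isinstance(v, float) and math.isnan(v) for v in vals])  # all False

# after conversion, float("NaN") becomes a genuine NaN
floats = [float(v) for v in vals]
print([math.isnan(x) for x in floats])  # [False, False, False, True, False]
```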