Writetable is exporting data with "Nullable{Type}(data)" instead of just data in Julia - dataframe

For example, I have a DataFrame like the one seen below, let's call it df.
╔═════╦══════╦══════╦══════╗
║ Row ║ a ║ b ║ c ║
╠═════╬══════╬══════╬══════╣
║ 1 ║ 0.66 ║ 0.55 ║ 0.44 ║
╠═════╬══════╬══════╬══════╣
║ 2 ║ 0.11 ║ 0.22 ║ 0.33 ║
╠═════╬══════╬══════╬══════╣
║ 3 ║ 1.00 ║ 2.00 ║ 3.00 ║
╚═════╩══════╩══════╩══════╝
When I use writetable("output.txt",df) I receive the following type of output for the numerical data in the text file.
"Nullable{Float64}(0.66)"
instead of
0.66
Any thoughts on how to get writetable to export just the data?
EDIT:
I should note that this only occurs after importing data using the ReadStat package. Is it possible to convert an entire data set to just an Array that can be exported properly? That may solve the problem.
EDIT #2:
I just tried running the following code (utilizing a created function converter) but am receiving errors (posted below).
f(a,n)=
if typeof(a[n])==NullableArrays.NullableArray{String,1}
convert(Array{String},a[n])
elseif typeof(a[n])==NullableArrays.NullableArray{Float64,1}
convert(Array{Float64},a[n])
elseif typeof(a[n])==NullableArrays.NullableArray{Int64,1}
convert(Array{Float64},a[n])
end
converter(a)=hcat([f(a,n) for n=1:length(a)]...)
The errors received are as follows:
julia> converter(af)
ERROR: NullException()
in convert at /home/ale/.julia/v0.5/NullableArrays/src/primitives.jl:248 [inlined]
in convert(::Type{Array{Float64,N}}, ::NullableArrays.NullableArray{Float64,1}) at /home/ale/.julia/v0.5/NullableArrays/src/primitives.jl:256
in f(::DataFrames.DataFrame, ::Int64) at ./REPL[6]:5
in collect_to!(::Array{Array{T,1},1}, ::Base.Generator{UnitRange{Int64},##1#2{DataFrames.DataFrame}}, ::Int64, ::Int64) at ./array.jl:340
in collect_to!(::Array{Array{Float64,1},1}, ::Base.Generator{UnitRange{Int64},##1#2{DataFrames.DataFrame}}, ::Int64, ::Int64) at ./array.jl:350
in collect(::Base.Generator{UnitRange{Int64},##1#2{DataFrames.DataFrame}}) at ./array.jl:308
in converter(::DataFrames.DataFrame) at ./REPL[7]:1

Have a look / play a bit with the following:
julia> using DataFrames
julia> a = [Nullable(0.1), Nullable{Float64}(), Nullable(0.3)];
julia> b = [Nullable{Float64}(), Nullable(2.), Nullable(3.)];
julia> df = DataFrame(Any[a,b], [:a,:b])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───────┼───────┤
│ 1 │ 0.1 │ #NULL │
│ 2 │ #NULL │ 2.0 │
│ 3 │ 0.3 │ 3.0 │
julia> c = [df[x] for x in names(df)];
julia> f(x) = [get(y, "Missing") for y in x];
julia> d = Any[f(x) for x in c]; # "Any" required for dataframes (I think)
julia> df2 = DataFrame(d, names(df))
│ Row │ a │ b │
├─────┼───────────┼───────────┤
│ 1 │ 0.1 │ "Missing" │
│ 2 │ "Missing" │ 2.0 │
│ 3 │ 0.3 │ 3.0 │
julia> writetable("/home/tasos/Desktop/output.txt", df2)
Note that for each column, if there is even one missing value, your numbers will also be reported inside quotes, because of the mixed array. If you want it to be all integers, you'll have to choose a different default value to "Missing" to denote your missing values (e.g. -1 if you only expect positive numbers).
If you don't like that, then you're probably better off writing your own "writetable" function; it's not that difficult, it's just a case of opening a file and printing what you want per column.
Also, to address some of our discussion in the comments:
The nullable type has two fields:
julia> fieldnames(Nullable)
2-element Array{Symbol,1}:
:hasvalue
:value
Let's create two instances to show what they mean:
julia> a = Nullable(1, true); b = Nullable(2, false);
julia> a.hasvalue, a.value
(true,1)
julia> b.hasvalue, b.value
(false,2)
You can test for nullity explicitly:
julia> isnull(a)
false
julia> isnull(b)
true
julia> isnull(0), isnull("")
(false, false) # isnull returns false by default if input is not a Nullable Type
Or you can use the "get" function to get a Nullable's value. If you don't define an alternative in the case of null, you get a NullException:
julia> get(a)
1
julia> get(b)
ERROR: NullException()
Stacktrace:
[1] get(::Nullable{Int64}) at ./nullable.jl:92
julia> get(b, "Null Detected")
"Null Detected"
A Nullable instance defined as Nullable(1, false) has a .value of 1, but this is superfluous as it is declared as .hasvalue=false and is therefore effectively null (though you can query the .value if you really want to).
A Nullable instance defined as n = Nullable{Float64}() will give you a nullable instance with .hasvalue=false and a meaningless value, presumably whatever was in memory at that location during instantiation, though interpreted as whatever Type of Nullable you declared (i.e. Float64):
julia> n = Nullable{Float64}()
Nullable{Float64}()
julia> n.hasvalue, n.value
(false, 6.9015724352651e-310)

Related

Add thousands separator to column in dataframe in julia

I have a dataframe with two columns a and b and at the moment both are looking like column a, but I want to add separators so that column b looks like below. I have tried using the package format.jl. But I haven't gotten the result I'm afte. Maybe worth mentioning is that both columns is Int64 and the column names a and b is of type symbol.
a | b
150000 | 1500,00
27 | 27,00
16614 | 166,14
Is there some other way to solve this than using format.jl? Or is format.jl the way to go?
Assuming you want the commas in their typical positions rather than how you wrote them, this is one way:
julia> using DataFrames, Format
julia> f(x) = format(x, commas=true)
f (generic function with 1 method)
julia> df = DataFrame(a = [1000000, 200000, 30000])
3×1 DataFrame
Row │ a
│ Int64
─────┼─────────
1 │ 1000000
2 │ 200000
3 │ 30000
julia> transform(df, :a => ByRow(f) => :a_string)
3×2 DataFrame
Row │ a a_string
│ Int64 String
─────┼────────────────────
1 │ 1000000 1,000,000
2 │ 200000 200,000
3 │ 30000 30,000
If you instead want the row replaced, use transform(df, :a => ByRow(f), renamecols=false).
If you just want the output vector rather than changing the DataFrame, you can use format.(df.a, commas=true)
You could write your own function f to achieve the same behavior, but you might as well use the one someone already wrote inside the Format.jl package.
However, once you transform you data to Strings as above, you won't be able to filter/sort/analyze the numerical data in the DataFrame. I would suggest that you apply the formatting in the printing step (rather than modifying the DataFrame itself to contain strings) by using the PrettyTables package. This can format the entire DataFrame at once.
julia> using DataFrames, PrettyTables
julia> df = DataFrame(a = [1000000, 200000, 30000], b = [500, 6000, 70000])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼────────────────
1 │ 1000000 500
2 │ 200000 6000
3 │ 30000 70000
julia> pretty_table(df, formatters = ft_printf("%'d"))
┌───────────┬────────┐
│ a │ b │
│ Int64 │ Int64 │
├───────────┼────────┤
│ 1,000,000 │ 500 │
│ 200,000 │ 6,000 │
│ 30,000 │ 70,000 │
└───────────┴────────┘
(Edited to reflect the updated specs in the question)
julia> df = DataFrame(a = [150000, 27, 16614]);
julia> function insertdecimalcomma(n)
if n < 100
return string(n) * ",00"
else
return replace(string(n), r"(..)$" => s",\1")
end
end
insertdecimalcomma (generic function with 1 method)
julia> df.b = insertdecimalcomma.(df.a)
julia> df
3×2 DataFrame
Row │ a b
│ Int64 String
─────┼─────────────────
1 │ 150000 1500,00
2 │ 27 27,00
3 │ 16614 166,14
Note that the b column will necessarily be a String after this change, as integer types cannot store formatting information in them.
If you have a lot of data and find that you need better performance, you may also want to use the InlineStrings package:
julia> #same as before upto the function definition
julia> using InlineStrings
julia> df.b = inlinestrings(insertdecimalcomma.(df.a))
3-element Vector{String7}:
"1500,00"
"27,00"
"166,14"
This stores the b column's data as fixed-size strings (String7 type here), which are generally treated like normal Strings, but can be significantly better for performance.

Is it possible to set a chosen column as index in a julia dataframe?

dataframes in pandas are indexed in one or more numerical and/or string columns. Particularly, after a groupby operation, the output is a dataframe where the new index is given by the groups.
Similarly, julia dataframes always have a column named Row which I think is equivalent to the index in pandas. However, after groupby operations, julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as index. I discover the method setindex! reading the documentation. However, I wasn't able to use this method. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a julia dataframe using one or more chosen columns?
DataFrames.jl does not currently allow specifying an index for a data frame. The Row column is just there for printing---it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
The setindex! function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42 is equivalent to setindex!(x, 42, 1). Overloading this method allows you to customize the indexing behavior for types that you create.
The docstrings for Base.setindex! can be found here and here.
If you really need a table with an index, you could try IndexedTables.jl.

Julia: Apply function to every cell within a DataFrame (without loosing column names)

I am diving into Julia, hence my "novice"-question.
Coming from R and Python, I am used to apply simple functions (arithmetic or otherwise) to entire pandas.DataFrames and data.frames, respectively.
#both R and Python
df - 1 # returns all values -1, given all values are numeric
df == "someString" # returns a boolean df
a bit more complex
#python
df = df.applymap(lambda v: v - 1 if v > 1 else v)
#R
df[] <- lapply(df, function(x) ifelse(x>1,x-1,x))
The thing is, I don't know how to do this in Julia, I don't find analogue solutions easily on the web. And Stackoverflow helps a lot when using Google. So here it is. How do I do it in Julia?
Thanks for your help!
PS:
So far I have come up with the following solutions, where I loos my column names.
DataFrame(colwise(x -> x .-1, df))
# seems like to much code for only subtracting 1 and loosing col names
Please update your DataFrames.jl installation to version 1.4.2.
You can do all you want using broadcasting like this:
julia> df = DataFrame(rand(2,3), :auto)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
julia> df .+ 1
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.72026 1.75949 1.9987
2 │ 1.72699 1.56015 1.24398
julia> df .< 0.5
2×3 DataFrame
Row │ x1 x2 x3
│ Bool Bool Bool
─────┼─────────────────────
1 │ false false false
2 │ false false true
julia> df2 = string.(df)
2×3 DataFrame
Row │ x1 x2 x3
│ String String String
─────┼────────────────────────────────────────────────────────────
1 │ 0.7202642575401104 0.7594928463144177 0.9987024771396766
2 │ 0.7269944483236035 0.5601527006649413 0.2439815742224939
julia> parse.(Float64, df2)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
Is this what you wanted?

ERROR: UndefVarError: y not defined

I am trying to run a mixed-effects model in julia (R is too slow for my data), but I keep getting this error.
I have installed DataArrays, DataFrames , MixedModels, and RDatasets packages and am following the tutorial here --> http://dmbates.github.io/MixedModels.jl/latest/man/fitting/#Fitting-linear-mixed-effects-models-1
These are my steps:
using DataArrays, DataFrames , MixedModels, RDatasets
I get these warnings
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in
module Base at nullable.jl:238 overwritten in module NullableArrays at
/home/home/.julia/v0.6/NullableArrays/src/operators.jl:128. WARNING:
Method definition model_response(DataFrames.ModelFrame) in module
DataFrames at
/home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352
overwritten in module MixedModels at
/home/home/.julia/v0.6/MixedModels/src/pls.jl:65. WARNING: Method
definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at
nullable.jl:238 overwritten in module NullableArrays at
/home/home/.julia/v0.6/NullableArrays/src/operators.jl:128. WARNING:
Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module
Base at nullable.jl:238 overwritten in module NullableArrays at
/home/home/.julia/v0.6/NullableArrays/src/operators.jl:128. WARNING:
Method definition model_response(DataFrames.ModelFrame) in module
DataFrames at
/home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352
overwritten in module MixedModels at
/home/home/.julia/v0.6/MixedModels/src/pls.jl:65. WARNING: Method
definition model_response(DataFrames.ModelFrame) in module DataFrames
at /home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352
overwritten in module MixedModels at
/home/home/.julia/v0.6/MixedModels/src/pls.jl:65.
I get an R dataset from lme4 package (used in the tutorial)
inst = dataset("lme4", "InstEval")
julia> head(inst)
6×7 DataFrames.DataFrame
│ Row │ S │ D │ Studage │ Lectage │ Service │ Dept │ Y │
├─────┼─────┼────────┼─────────┼─────────┼─────────┼──────┼───┤
│ 1 │ "1" │ "1002" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 2 │ "1" │ "1050" │ "2" │ "1" │ "1" │ "6" │ 2 │
│ 3 │ "1" │ "1582" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 4 │ "1" │ "2050" │ "2" │ "2" │ "1" │ "3" │ 3 │
│ 5 │ "2" │ "115" │ "2" │ "1" │ "0" │ "5" │ 2 │
│ 6 │ "2" │ "756" │ "2" │ "1" │ "0" │ "5" │ 4 │
I run the model as shown in the tutorial
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
and get
ERROR: UndefVarError: y not defined
Stacktrace:
[1] macro expansion at ./REPL.jl:97 [inlined]
[2] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
The same thing happens when I try it with my own data loaded using "readtable" from DataFrames package
I am running julia 0.6.0 and all packages are freshly installed. My system is arch linux 4.11.7-1 with all the latest packages. Julia installs without a problem, but some packages give warnings (see above).
have a go with the #formula macro:
julia> fit!(lmm(#formula(Y ~ (1 | Dept)), inst), true)
f_1: 250160.38873 [1.0]
f_2: 250175.99074 [1.75]
f_3: 250123.06531 [0.25]
f_4: 250602.3424 [0.0]
f_5: 250137.66303 [0.4375]
f_6: 250129.76244 [0.325]
f_7: 250125.94066 [0.280268]
f_8: 250121.15016 [0.23125]
f_9: 250119.12389 [0.2125]
f_10: 250114.7257 [0.175]
f_11: 250105.61264 [0.1]
f_12: 250602.3424 [0.0]
f_13: 250107.52714 [0.118027]
f_14: 250106.36924 [0.107778]
f_15: 250105.04638 [0.0925]
f_16: 250104.72722 [0.085]
f_17: 250104.93086 [0.0749222]
f_18: 250104.70046 [0.0831588]
f_19: 250104.70849 [0.0839088]
f_20: 250104.69659 [0.0824088]
f_21: 250104.69632 [0.0822501]
f_22: 250104.69625 [0.0821409]
f_23: 250104.69625 [0.0820659]
f_24: 250104.69624 [0.0821118]
f_25: 250104.69624 [0.0821193]
f_26: 250104.69624 [0.082111]
f_27: 250104.69624 [0.0821118]
f_28: 250104.69624 [0.0821117]
f_29: 250104.69624 [0.0821118]
f_30: 250104.69624 [0.0821118]
Linear mixed model fit by maximum likelihood
Formula: Y ~ 1 | Dept
logLik -2 logLik AIC BIC
-1.25052348×10⁵ 2.50104696×10⁵ 2.50110696×10⁵ 2.50138308×10⁵
Variance components:
Column Variance Std.Dev.
Dept (Intercept) 0.011897242 0.10907448
Residual 1.764556375 1.32836605
Number of obs: 73421; levels of grouping factors: 14
Fixed-effects parameters:
Estimate Std.Error z value P(>|z|)
(Intercept) 3.21373 0.029632 108.455 <1e-99
The warnings are the usual "Julia (and the package ecosystem) is still in flux" messages. But I wonder whether the docs always keep pace with the code.

julia create an empty dataframe and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.
The tutorials/documentation on how to do this is sparse (most documentation describes how to analyse imported data).
To append to a nonempty dataframe is straightforward:
df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])
This returns.
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |
But for an empty init I get errors.
df = DataFrame(A = [], B = [])
push!(df, [3, 6])
Error message:
ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2
What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?
A zero length array defined using only [] will lack sufficient type information.
julia> typeof([])
Array{None,1}
So to avoid that problem is to simply indicate the type.
julia> typeof(Int64[])
Array{Int64,1}
And you can apply that to your DataFrame problem
julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame
julia> push!(df, [3 6])
julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 3 | 6 |
using Pkg, CSV, DataFrames
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))
new_iris = similar(iris, nrow(iris))
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ missing │ missing │ missing │ missing │ missing │
# │ 2 │ missing │ missing │ missing │ missing │ missing │
for (i, row) in enumerate(eachrow(iris))
new_iris[i, :] = row[:]
end
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
# │ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
The answer from #waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it. E.g. what if I don't want hard-coded column names?
In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient.
The NamedTuple A = Int64[], B = Int64[] needs to be create dynamically.
Let's assume we have a vector of column names col_names and a vector of column types colum_types from which to create an emptyDataFrame.
col_names = [:A, :B] # needs to be a vector Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)
df = DataFrame(named_tuple) # 0×2 DataFrame
Alternatively, the NameTuple could be created with
# or by doing
named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types )
I think at least in the latest version of Julia you can achieve this by creating a pair object without specifying type
df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])
1×2 DataFrame
Row │ A B
│ Any Any
─────┼──────────
1 │ 5 f
as seen in this post by #Bogumił Kamiński where multiple columns are needed, something like this can be done:
entries = ["A", "B", "C", "D"]
df = DataFrame([ name =>[] for name in entries])
julia> push!(df,[4,5,'r','p'])
1×4 DataFrame
Row │ A B C D
│ Any Any Any Any
─────┼────────────────────
1 │ 4 5 r p
Or as pointed out by #Antonello below if you know that type you can do.
df = DataFrame([name => Int[] for name in entries])
which is also in #Bogumil Kaminski's original post.