ERROR: UndefVarError: y not defined - dataframe

I am trying to run a mixed-effects model in Julia (R is too slow for my data), but I keep getting this error.
I have installed the DataArrays, DataFrames, MixedModels, and RDatasets packages and am following the tutorial here --> http://dmbates.github.io/MixedModels.jl/latest/man/fitting/#Fitting-linear-mixed-effects-models-1
These are my steps:
using DataArrays, DataFrames, MixedModels, RDatasets
I get these warnings:
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/home/.julia/v0.6/NullableArrays/src/operators.jl:128.
WARNING: Method definition model_response(DataFrames.ModelFrame) in module DataFrames at /home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352 overwritten in module MixedModels at /home/home/.julia/v0.6/MixedModels/src/pls.jl:65.
(each of these two warnings appears three times in total)
I load an R dataset from the lme4 package (used in the tutorial):
inst = dataset("lme4", "InstEval")
julia> head(inst)
6×7 DataFrames.DataFrame
│ Row │ S │ D │ Studage │ Lectage │ Service │ Dept │ Y │
├─────┼─────┼────────┼─────────┼─────────┼─────────┼──────┼───┤
│ 1 │ "1" │ "1002" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 2 │ "1" │ "1050" │ "2" │ "1" │ "1" │ "6" │ 2 │
│ 3 │ "1" │ "1582" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 4 │ "1" │ "2050" │ "2" │ "2" │ "1" │ "3" │ 3 │
│ 5 │ "2" │ "115" │ "2" │ "1" │ "0" │ "5" │ 2 │
│ 6 │ "2" │ "756" │ "2" │ "1" │ "0" │ "5" │ 4 │
I run the model as shown in the tutorial:
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
and get
ERROR: UndefVarError: y not defined
Stacktrace:
[1] macro expansion at ./REPL.jl:97 [inlined]
[2] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
The same thing happens when I try it with my own data loaded using readtable from the DataFrames package.
I am running Julia 0.6.0 and all packages are freshly installed. My system is Arch Linux 4.11.7-1 with all the latest packages. Julia installs without a problem, but some packages give warnings (see above).

Have a go with the @formula macro. Without the macro, Julia tries to evaluate y, dept, etc. as ordinary variables, which is why you get the UndefVarError; @formula captures the expression unevaluated. Note also that the column names in InstEval are capitalized (Y, Dept, Service, S, D):
julia> fit!(lmm(@formula(Y ~ (1 | Dept)), inst), true)
f_1: 250160.38873 [1.0]
f_2: 250175.99074 [1.75]
f_3: 250123.06531 [0.25]
f_4: 250602.3424 [0.0]
f_5: 250137.66303 [0.4375]
f_6: 250129.76244 [0.325]
f_7: 250125.94066 [0.280268]
f_8: 250121.15016 [0.23125]
f_9: 250119.12389 [0.2125]
f_10: 250114.7257 [0.175]
f_11: 250105.61264 [0.1]
f_12: 250602.3424 [0.0]
f_13: 250107.52714 [0.118027]
f_14: 250106.36924 [0.107778]
f_15: 250105.04638 [0.0925]
f_16: 250104.72722 [0.085]
f_17: 250104.93086 [0.0749222]
f_18: 250104.70046 [0.0831588]
f_19: 250104.70849 [0.0839088]
f_20: 250104.69659 [0.0824088]
f_21: 250104.69632 [0.0822501]
f_22: 250104.69625 [0.0821409]
f_23: 250104.69625 [0.0820659]
f_24: 250104.69624 [0.0821118]
f_25: 250104.69624 [0.0821193]
f_26: 250104.69624 [0.082111]
f_27: 250104.69624 [0.0821118]
f_28: 250104.69624 [0.0821117]
f_29: 250104.69624 [0.0821118]
f_30: 250104.69624 [0.0821118]
Linear mixed model fit by maximum likelihood
Formula: Y ~ 1 | Dept
logLik -2 logLik AIC BIC
-1.25052348×10⁵ 2.50104696×10⁵ 2.50110696×10⁵ 2.50138308×10⁵
Variance components:
Column Variance Std.Dev.
Dept (Intercept) 0.011897242 0.10907448
Residual 1.764556375 1.32836605
Number of obs: 73421; levels of grouping factors: 14
Fixed-effects parameters:
Estimate Std.Error z value P(>|z|)
(Intercept) 3.21373 0.029632 108.455 <1e-99
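For the full model from the question, the same fix should presumably read as follows (an untested sketch, using the dataset's capitalized column names):
m2 = fit!(lmm(@formula(Y ~ 1 + Dept * Service + (1 | S) + (1 | D)), inst))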
The warnings are the usual "Julia (and the package ecosystem) is still in flux" messages. But I wonder whether the docs always keep pace with the code.


Is it possible to set a chosen column as index in a julia dataframe?

Dataframes in pandas are indexed by one or more numerical and/or string columns. In particular, after a groupby operation, the output is a dataframe whose new index is given by the groups.
Similarly, Julia dataframes always display a column named Row, which I think is equivalent to the index in pandas. However, after groupby operations, Julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as the index. I discovered the method setindex! while reading the documentation. However, I wasn't able to use this method. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a Julia dataframe using one or more chosen columns?
DataFrames.jl does not currently allow specifying an index for a data frame. The Row column is just there for printing; it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
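For instance, a keyed lookup can be emulated without any index; a small sketch, using the income frame built in the question:
# select the 1990 row by boolean indexing
income[income.Year .== 1990, :]
# or build an explicit lookup table from the two columns
lookup = Dict(income.Year .=> income.Income_mean)
lookup[1990]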
The setindex! function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42 is equivalent to setindex!(x, 42, 1). Overloading this method allows you to customize the indexing behavior for types that you create.
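To illustrate (a toy sketch, unrelated to DataFrames), a custom type can define its own indexed assignment by overloading Base.setindex!:
struct Squares
    data::Vector{Int}
end

Base.getindex(s::Squares, i::Int) = s.data[i]
# s[i] = v lowers to setindex!(s, v, i); here we store the square of v
Base.setindex!(s::Squares, v, i::Int) = (s.data[i] = v^2)

s = Squares(zeros(Int, 3))
s[2] = 5   # calls setindex!(s, 5, 2)
s[2]       # returns 25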
The docstrings for Base.setindex! can be found in the Base Julia documentation.
If you really need a table with an index, you could try IndexedTables.jl.

dialog --buildlist option, how to use it?

I've been reading up on the many uses of dialog to create interactive shell scripts, but I'm stumped on how to use the --buildlist option. I've read the man pages, searched Google, searched Stack Overflow, and even read through some old articles of Linux Journal from 1994, to no avail.
Can someone give me a clear example of how to use it properly?
Let's imagine a directory with 5 files which you'd want to select from, to copy to another directory. Can someone give a working example?
Thank you!
Consider the following:
dialog --buildlist "Select a directory" 20 50 5 \
f1 "Directory One" off \
f2 "Directory Two" on \
f3 "Directory Three" on
This will display something like
┌────────────────────────────────────────────────┐
│ Select a directory │
│ ┌─────────────────────┐ ┌────^(-)─────────────┐│
│ │Directory One │ │Directory Two ││
│ │ │ │Directory Three ││
│ │ │ │ ││
│ │ │ │ ││
│ │ │ │ ││
│ └─────────────────────┘ └─────────────100%────┘│
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
├────────────────────────────────────────────────┤
│ <OK> <Cancel> │
└────────────────────────────────────────────────┘
The box is 50 characters wide and 20 rows tall; each column displays 5 items. off/on determines if the item starts in the left or right column, respectively.
The controls:
^ selects the left column
$ selects the right column
Move up and down the selected column with the arrow keys
Move the selected item to the other column with the space bar
Toggle between OK and Cancel with the tab key. If you use the --visit-items option, the tab key lets you cycle through the lists as well as the buttons.
Hit enter to select OK or Cancel.
If you select OK, the tags (f1, f2, etc.) associated with each item in the right column are printed to standard error.
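Putting it together for your file-copying scenario, here is a sketch; the src/ and dest/ directory names are placeholders, and the plain word-splitting of the output means it won't handle filenames containing spaces:
#!/bin/sh
# Offer every file in src/ and copy the chosen ones to dest/.
cd src || exit 1

# Build the tag/item/status triples, one per file, all starting unselected.
set --
for f in *; do
    set -- "$@" "$f" "$f" off
done

# dialog draws its UI on the terminal and prints the selected tags to
# standard error, so swap the streams to capture them in a variable.
chosen=$(dialog --buildlist "Select files to copy" 20 60 10 "$@" 2>&1 >/dev/tty)

for f in $chosen; do
    cp -- "$f" ../dest/
done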

SQL query for converting column breaks in a single column

I have a database in Postgres where one of the columns contains text data with embedded line breaks.
So when I export the data into a CSV file, the columns are jumbled!
I need a query which will ignore the line breaks in that single column and give an output where the data stays in the same column and does not spill over into the next one.
This example table exhibits the problem you are talking about:
test=> SELECT * FROM breaks;
┌────┬───────────┐
│ id │ val │
├────┼───────────┤
│ 1 │ text with↵│
│ │ three ↵│
│ │ lines │
│ 2 │ text with↵│
│ │ two lines │
└────┴───────────┘
(2 rows)
Then you can use the replace function to replace the line breaks with spaces:
test=> SELECT id, replace(val, E'\n', ' ') FROM breaks;
┌────┬───────────────────────┐
│ id │ replace │
├────┼───────────────────────┤
│ 1 │ text with three lines │
│ 2 │ text with two lines │
└────┴───────────────────────┘
(2 rows)
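If the text may also contain carriage returns (e.g., data entered from Windows clients), a regexp_replace covering CR, LF, and CRLF is safer, and the replacement can be applied right in the export. A sketch from psql, assuming an output file named breaks.csv:
-- normalize any flavor of line break to a space, then export as CSV
\copy (SELECT id, regexp_replace(val, E'\r\n|\r|\n', ' ', 'g') AS val FROM breaks) TO 'breaks.csv' WITH (FORMAT csv, HEADER)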

sed: replace many lines from a file with single notification

I'm looking for a solution in GNU sed, but POSIX sed is OK, and awk would be OK but probably more complicated than necessary. I prefer sed for this; it should be easy, but I'm stuck. It seems like a one-liner can do this, with no need to create a Python/Bash script or anything.
my attempted solution
sed -i '218,226140d; 218i ...REMOVED...' psql.log
This deletes the desired rows, but the insert gets lost. If I move the insert to line 217 I get:
sed -i '218,226140d; 217i ...REMOVED...' psql.log
result:
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
...REMOVED...
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
I know - this should be good enough, but I'm annoyed that I can't get this simple one-liner to work right. What am I missing?
the problem
I keep the psql.log file as a reference for work I am doing developing SQL code. It's very useful to see iterations of the query and the results.
The problem is that sometimes I forget to limit the output and the query will generate 100k+ rows of results that aren't a helpful reference, and I'd like to delete them from my file, leaving a note that reminds me the query output has been excised.
It would be nice to match the pattern: say, every output of more than 50 rows could be squashed down to just the first 5 rows and the last 5. However, it's easy for me to mark the line numbers where I've blown up the file, so I'd be happy with just using sed to delete lines N through M and insert the message ...REMOVED... where line N was.
Here is an example log file, added notes are in parentheses. The query text can change and the number of columns can be from 1 to 100 or more:
...
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217)
│ CC6XF2SVWT │ 13182280615 │ [null] │
(many rows)
│ CC6XF2XWDT │ 995086081 │ [null] │
│ CC6XFX3TL1 │ 25195177405 │ [null] │
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
/* another query begins */
select * from table where X = 1 limit 20;
/* well done you remembered to limit the output */
**************************
...
acceptable output
the query text should all be untouched, and the top/bottom three rows of output kept. The annotation ...REMOVED... has been added and rows 218 through 226140 have been deleted:
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
...REMOVED...
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
(etc just like example above)
update
The border comes from my .psqlrc with \pset border 2, so solutions depending on the ┌ character are fragile but OK.
Over time I've learned that manually flagging the line numbers is too time consuming, so the best solution needs a pattern match.
Here is an example of 'every output more than 50 rows I could squash down to just the first 5 rows and the last 5'.
With test input:
$ seq 160 | awk -vstart=10 -vmax=50 -vleft=5 '{if(NR < start) {print; next} {i++; if(i <= left || i > max - left){print}; if(i == left + 1){print "...REMOVED..."}if(i == max){i = 0}}}'
If you'd rather put the script in a file, save this as squash.awk:
BEGIN {
start=10;
max=50;
left=5;
}
{
if(NR < start) {
print;
next
}
i++;
if(i <= left || i > max - left) {
print
}
if(i == left + 1) {
print "...REMOVED...";
}
if(i == max) {
i = 0
}
}
For testing:
$ seq 160 | awk -f squash.awk
The variable start is the line number at which squashing begins.
The variable max is the maximum number of rows to keep intact (in your example, 50).
The variable left is how many rows to keep from the beginning and the end of each max-row block.
if(NR < start) {print; next}: if the line number is less than start (in our case 10), we just print the line and move on to the next one.
Here you can put any condition to skip squashing.
i++ increments the row counter.
if(i <= left || i > max - left){print}: if the row counter is at most 5 or greater than max - 5, print the row.
if(i == left + 1){print "...REMOVED..."}: when we start skipping rows, emit the "...REMOVED..." message.
if(i == max){i = 0}: when the row counter reaches max, reset it to zero.
One in awk:
$ awk '
/^ └/ { # at the end marker
for(j=1;j<=6;j++) # output from the buffer b the wanted records
print b[j]
for(j=(i-2);j<=i;j++)
print b[j]
delete b # reset buffer
i=0 # and flag / counter
}
/^ ┌/ || i { # at the start marker or when flag up
b[++i]=$0 # gather records to buffer
next
} 1' file # print records which are not between the markers
This might work for you (GNU sed):
sed -r '/\o342[^\n]*$/{:a;N;//ba;s/^(([^\n]*\n){6}).*((\n[^\n]*){5})$/\1 ... REMOVED ...\3/}' file
Focus only on table data, which will always contain the octal value 342. Gather up the table lines in the pattern space, substitute the required value ... REMOVED ..., and print. The number of lines kept above and below the inserted string can be altered here: 6 (headings + 3 rows) and 5 (inserted string + 3 rows + table count).
To change a range use:
sed 'm,nc ... REMOVE ...' file # where m and n are the from and to line numbers
or:
sed -e 'ma ...REMOVE ...' -e 'm,nd' file
N.B. the d command terminates any following commands.
The sed man page is more helpful than you might think at first glance. The [addr]c command is exactly what is needed (note the whitespace after c is ignored):
sed -i '218,226140c ...REMOVED...' psql.log
So there is the solution for known line numbers.
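As a quick illustration on a toy file (a sketch):
$ seq 10 > toy.txt
$ sed -i '3,8c ...REMOVED...' toy.txt
$ cat toy.txt
1
2
...REMOVED...
9
10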
Does anyone want to provide a generic solution where the line numbers aren't known? Probably awk would be the better tool but maybe sed can remove output that is too long.

julia create an empty dataframe and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.
The tutorials/documentation on how to do this is sparse (most documentation describes how to analyse imported data).
To append to a nonempty dataframe is straightforward:
df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])
This returns:
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |
But for an empty init I get errors.
df = DataFrame(A = [], B = [])
push!(df, [3, 6])
Error message:
ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2
What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?
A zero length array defined using only [] will lack sufficient type information.
julia> typeof([])
Array{None,1}
So the way to avoid that problem is to simply indicate the type.
julia> typeof(Int64[])
Array{Int64,1}
And you can apply that to your DataFrame problem:
julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame
julia> push!(df, [3 6])
julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 3 | 6 |
Another approach is to preallocate a data frame of the right shape (filled with missing) using similar, and then fill it row by row:
using Pkg, CSV, DataFrames
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))
new_iris = similar(iris, nrow(iris))
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ missing │ missing │ missing │ missing │ missing │
# │ 2 │ missing │ missing │ missing │ missing │ missing │
for (i, row) in enumerate(eachrow(iris))
new_iris[i, :] = row[:]
end
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
# │ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
The answer from @waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it? E.g. what if I don't want hard-coded column names?
In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient.
The NamedTuple A = Int64[], B = Int64[] needs to be created dynamically.
Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.
col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)
df = DataFrame(named_tuple) # 0×2 DataFrame
Alternatively, the NamedTuple could be created with:
# or by doing
named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types )
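Either way, rows can then be appended as usual; a quick sketch with the column names and types assumed above:
push!(df, (A = 1, B = 2.5))   # a NamedTuple row
push!(df, [2, 3.5])           # a positional row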
I think that, at least in recent versions, you can achieve this by creating Pair objects without specifying a type:
df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])
1×2 DataFrame
Row │ A B
│ Any Any
─────┼──────────
1 │ 5 f
As seen in this post by @Bogumił Kamiński, where multiple columns are needed, something like this can be done:
entries = ["A", "B", "C", "D"]
df = DataFrame([ name =>[] for name in entries])
julia> push!(df,[4,5,'r','p'])
1×4 DataFrame
Row │ A B C D
│ Any Any Any Any
─────┼────────────────────
1 │ 4 5 r p
Or, as pointed out by @Antonello below, if you know the type you can do:
df = DataFrame([name => Int[] for name in entries])
which is also in @Bogumił Kamiński's original post.