Understanding the tibble error message when running MICE code - error-handling

I'm running a MICE imputation following the instructions here. I'd like to understand what the error message is telling me and how to correct the tibble error in the code below.
Thank you
full_dataset %>%
  dplyr::filter(vector_x == 1) %>%
  dplyr::mutate_if(
    is.factor,
    fct_explicit_na,
    na_level = "Missing"
  ) %>%
  finalfit(dependent, explanatory) %>%
  knitr::kable(row.names = FALSE, align = c("l", "l", "r", "r", "r", "r"))
Error:
! Tibble columns must have compatible sizes.
Size 42: Existing data.
Size 30: Column at position 5.
i Only values of size one are recycled.
Backtrace:
... %>% finalfit(dependent, explanatory)
finalfit::finalfit(., dependent, explanatory)
finalfit <fn>(...)
finalfit:::fit2df.glm(...)
finalfit:::extract_fit.glm(...)
dplyr::tibble(...)
tibble:::tibble_quos(xs, .rows, .name_repair)
tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])

Error in UseMethod("filter") : no applicable method for 'filter' applied to an object of class "NULL"

I am using the tidymodels package in R to study a multi-class classification problem. I have trained several models using workflow sets, and in my recipe I added a step taken there to replace NA values with a constant. The models I included in the workflow are:
mlp <-
  mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
  set_engine('nnet') %>%
  set_mode('classification')

multinom <-
  multinom_reg(penalty = tune(), mixture = tune()) %>%
  set_engine('glmnet')

rand_forest <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('classification')

tabnet <-
  tabnet(mode = "classification", batch_size = 126, virtual_batch_size = 128, epochs = 1,
         num_steps = tune(), learn_rate = tune()) %>%
  set_engine("torch", verbose = TRUE)
For some models I tried a recipe with SMOTE ("themis" package), PCA, and normalisation (all in the same workflow by adding the steps to the original recipe). Training and testing went pretty well, so I tried an ensemble of these models (using the package "stacks"):
tidymodels_prefer()

stack1 <-
  stacks() %>%
  add_candidates(res_1)

set.seed(2002)
res1_stack <-
  stack1 %>%
  blend_predictions()

ens <- fit_members(res1_stack)
When I run this last operation (fit_members) I receive this error:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "NULL"
Reading this and this on GitHub, I figured out that it was caused by the "constantimpute" step added to the recipe. However, I don't know exactly how to fix it. Can someone help me?
Thank you very much!
Before using the filter function, make sure the table you want to filter is loaded.
Often the View() function has been applied, and this prevents the table from being loaded into memory for use.

R flextable automatic row height adjustment (hrule) not working for PDF output

I'm trying to display several tables in a PDF document generated from an .Rmd file, where the column widths stay the same but the row heights vary with the amount of data in the cells. Despite explicitly setting the column widths and font sizes, and using hrule(rule = "auto", part = "all") to let the row heights vary, the output changes the widths and font sizes to keep the row heights the same.
Contents of the .Rmd:
---
output: pdf_document
---
```{r, echo=FALSE, collapse=TRUE, include=FALSE}
# Load libraries
#webshot::install_phantomjs() # needed for flextable. Don't need to load it as a package.
library(dplyr) # data management
library(flextable) # produce tables
library(OpenRepGrid) # generate random words
# Build data
table1Df <- data_frame(item = paste0(1:5, "."),
                       labels = c(randomSentence(10),
                                  randomSentence(10),
                                  randomSentence(10),
                                  randomSentence(10),
                                  randomSentence(10)),
                       score = 11:15)
table2Df <- data_frame(item = paste0(1:5, "."),
                       labels = c(randomSentence(15),
                                  randomSentence(15),
                                  randomSentence(15),
                                  randomSentence(15),
                                  randomSentence(15)),
                       score = 16:20)
# Function to build flextable
flexPrintFun <- function(df){
  flextable(df) %>%
    width(j = 1, width = 0.25) %>%
    width(j = 2, width = 3) %>%
    width(j = 3, width = 0.5) %>%
    hrule(rule = "auto", part = "all") %>%
    fontsize(part = "header", size = 20) %>%
    fontsize(part = "body", size = 15)
}
```
```{r, echo=FALSE}
flexPrintFun(table1Df)
flexPrintFun(table2Df)
```
Here's what it ends up looking like (after zooming in quite a bit because the columns get crushed way down):
The final document will have the tables stacked on top of each other, so it's important that the column widths line up, font sizes are consistent, etc.
I've looked into the documentation about hrule here: https://davidgohel.github.io/flextable/reference/hrule.html and while it states that it works in Word and HTML but not PowerPoint outputs, it says nothing about PDF.
I've looked into kable and kableExtra, but those don't quite do everything I need because of some other features I don't discuss here.

Pandas 0.21.1 - DataFrame.replace recursion error

I used to run this code with no issue:
data_0 = data_0.replace([-1, 'NULL'], [None, None])
Now, after the update to Pandas 0.21.1, the very same line of code gives me:
RecursionError: maximum recursion depth exceeded
Does anybody experience the same issue, and know how to solve it?
Note: rolling back to pandas 0.20.3 does the trick, but I think it's important to solve this with the latest version.
Thanks
I think this error message depends on what your input data is. Here's an example of input data where this works in the expected way:
data_0 = pd.DataFrame({'x': [-1, 1], 'y': ['NULL', 'foo']})
data_0.replace([-1, 'NULL'], [None, None])
replaces values of -1 and 'NULL' with None:
     x     y
0  NaN  None
1  1.0   foo
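If the recursion error is triggered by the list of None replacement values, a possible workaround (a sketch, assuming the goal is simply to mark both sentinels as missing) is to replace with np.nan instead:
import numpy as np
import pandas as pd

data_0 = pd.DataFrame({'x': [-1, 1], 'y': ['NULL', 'foo']})

# A single scalar replacement value avoids the [None, None] list
# and still marks both sentinels as missing.
data_0 = data_0.replace([-1, 'NULL'], np.nan)
print(data_0)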

PyPlot throws an error when DataFrame-Column has missing values

I have the following problem:
I would like to plot a variable from a DataFrame with missing values, which are denoted as "NA". However, if I just go ahead and use PyPlot:
x = df[df[:country] .== "Belgium",:year]
y = df[df[:country] .== "Belgium",:hpNormLog]
plot(x, y, "b-", linewidth=2)
I get the following error message:
PyError (:PyObject_Call) <class 'TypeError'>
TypeError("float() argument must be a string or a number, not 'PyCall.jlwrap'",)
  File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3154, in plot
    ret = ax.plot(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
    return func(ax, *args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 1425, in plot
    self.add_line(line)
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py", line 1708, in add_line
    self._update_line_limits(line)
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py", line 1730, in _update_line_limits
    path = line.get_path()
  File "C:\Anaconda3\lib\site-packages\matplotlib\lines.py", line 925, in get_path
    self.recache()
  File "C:\Anaconda3\lib\site-packages\matplotlib\lines.py", line 621, in recache
    y = np.asarray(yconv, np.float_)
  File "C:\Anaconda3\lib\site-packages\numpy\core\numeri...
I would be very grateful for a way around this.
Best,
Ilja
I found the following solution. I am not deep enough into how Julia works, so I can only say what works and what does not: Arrays with NaN can be plotted with the code written above, but columns of DataFrames do not permit the same thing. The column needs to be converted to an Array before it can be plotted with missing values. The following code solves the problem:
x = df[df[:country] .== "Belgium",:year]
ytest = df[df[:country] .== "Belgium",:hpNormLog]
y = convert(Array, ytest, NaN)
plot(x, y, "b-", linewidth=2)
x does not contain missing values, so I can keep using the DataFrame, but y does contain missing values, so it needs to be converted to an Array. The third argument of convert specifies what missing values should be converted to, in this case NaN.
Why don't you perform error handling?
try
    plot(x, y, "b-", linewidth=2)
catch err
    # ignore the error and skip this plot
end
Swallow the error when plotting works for most of your input, and simply skip plotting of "NA" values.

Pandas: Location of a row with error

I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:
df['x']=df['x'].astype('int')
...and I get an error: "ValueError: invalid literal for long() with base 10: '1.0692e+06'".
In general, if I have 1000 entries in the dataframe, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e. where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.
The error you are seeing might be due to the value(s) in the x column being strings:
In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'
Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints, not strings, when the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame.
After the fact, the DataFrame could be fixed using applymap:
import ast
df = df.applymap(ast.literal_eval).astype('int')
but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
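For example, if the DataFrame is built from a CSV file, one way (a sketch; the file name and dtype mapping below are hypothetical) is to have read_csv parse the column numerically up front:
import pandas as pd

# Hypothetical file: parse x as float while reading, so values like
# '1.0692e+06' arrive as numbers rather than strings, then cast to int.
df = pd.read_csv('data.csv', dtype={'x': float})
df['x'] = df['x'].astype(int)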
Usually you could drop into a debugger when an exception is raised to inspect the problematic value.
However, in this case the exception is happening inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code does the looping through the values in df['x'], so the Python debugger is not helpful here: it won't allow you to introspect the value on which the exception is raised from within the C-compiled code.
There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.
So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to catch the first error:
df = pd.DataFrame({'x': ['1.0692e+06']})

for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))
yields
ERROR at index 0: '1.0692e+06'
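On a large column, a vectorized alternative (a sketch, assuming the column is meant to be numeric) is to let pd.to_numeric flag the unparseable entries instead of looping. Note that a float string like '1.0692e+06' parses cleanly this way, so only values that are not numbers at all get reported:
import pandas as pd

df = pd.DataFrame({'x': ['1.0692e+06', '7', 'oops']})

# errors='coerce' turns anything unparseable into NaN instead of raising.
as_num = pd.to_numeric(df['x'], errors='coerce')

# Offending rows: the original value is present, but parsing failed.
print(df.loc[as_num.isna() & df['x'].notna()])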
I hit the same problem, and as I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row.
import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2
        # Check if we hit the exception at mid
        result = func(df, mid, mid + 1)
        if result:
            return mid, result
        result = func(df, l, mid)
        if result is None:
            # No exception in the left half, so ignore it
            l = mid + 1
        else:
            r = mid - 1
    # If we reach here, no offending row was found
    return -1, None

def check(df, start, end):
    result = None
    try:
        # In my case, I want to find out which row causes the failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)
    return result

df = pd.read_csv(sys.argv[1])
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
To report all rows that fail to map due to any exception:
df.apply(my_function)  # throws various exceptions at unknown rows

# print the exception, index, and row content for every failing row
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)