pivot_table error - InvalidOperation: [<class 'decimal.InvalidOperation'>] - pandas

The above error is being raised from a pivot_table operation for a variable set to be the column grouping (if it matters, it's failing in the format.py module)
/anaconda/lib/python3.4/site-packages/pandas/core/format.py in __call__(self, num)
2477 sign = 1
2478
-> 2479 if dnum < 0: # pragma: no cover
2480 sign = -1
2481 dnum = -dnum
(pandas v0.17.1)
If I create random values for the 'problem' variable via numpy there is no error.
Whilst I doubt it's an edge case for the pivot_table function, I can't figure out what might be causing the problem on the data side:
i) The variable is the first integer from a modest sized sequence of integers (eg 2 from 246) (via df.var.str[0]).
ii) pd.unique(df.var) returns the expected 1-9 values
iii) There are no NaNs: notnull(df.var).all() returns True
iv) The dtype is int64 (casting the integer to a string, or mapping it to labels, still fails with the same error)
v) a period index is used - and that forms the index for pivot table.
vi) the aggregation is 'count'
If I create another variable with random values with those characteristics (values 1-9 from numpy's random.randint), the pivot_table call works. If I cast it as a string, or use labels, it still works.
Likewise, I've been playing with the data set for a while - usually on some other position in the sequence without issue. But today - the first place is causing a problem.
Possibly it's a data issue, but then why doesn't pivot_table return empty cells or NaNs rather than failing at that point?
But I'm at a loss after a day exploring.
any thoughts on why the above error is being raised would be much appreciated (as it'll help me track down the data issue if that is the case).
thanks
Chris

The simplest solution is to reset the pandas float formatting option with:
pd.set_option('display.float_format', None)
Further details:
I had encountered the same problem. As a workaround, you can also filter the dataframe being pivoted so that the result contains no NaNs.
My problem is related to the use of pd.set_eng_float_format(2, True). Without it, all pivots work well.
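A likely mechanism, sketched here under the assumption that the engineering-notation formatter converts each value with decimal.Decimal(str(num)) before reaching the `dnum < 0` comparison shown in the traceback: a NaN cell in the pivot result becomes Decimal('nan'), and ordering comparisons on a NaN Decimal raise InvalidOperation. The helpers sign_of and try_sign are hypothetical, written only for illustration; they are not pandas functions.

```python
from decimal import Decimal, InvalidOperation


def sign_of(num):
    """Mimic the comparison from the traceback (hypothetical helper)."""
    dnum = Decimal(str(num))  # assumption: how the formatter builds dnum
    return -1 if dnum < 0 else 1


def try_sign(num):
    """Return the sign, or the name of the exception if the comparison fails."""
    try:
        return sign_of(num)
    except InvalidOperation:
        return "InvalidOperation"


print(try_sign(2.5))           # ordinary positive float
print(try_sign(-3.0))          # ordinary negative float
print(try_sign(float("nan")))  # NaN cell, as in an empty pivot cell
```

If this is what happens, it would explain why the random test data works: it presumably fills every cell of the pivot, so no NaN ever reaches the formatter.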


How does DataFrame.interpolate() work in its source code?

Since I could not find the definitions of the individual options for the "method" parameter of DataFrame.interpolate(), I am asking here:
How many rows does pandas' DataFrame.interpolate() consider? Is it just the row before the NaNs and the row right after?
Or is it the whole DataFrame (and how does that work with 1 million rows)?
If you already know where to look, feel free to share a link to the source code, since https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/frame.py#L10916 doesn't include the implementations of the individual methods (for example "polynomial").
I found the attached in core/missing.py.
My interpretation is that interpolation is done either with np.interp or, if the specified method is only available in SciPy, with _interpolate_scipy_wrapper. I could not locate that function, but a reasonable guess is that it wraps SciPy's interpolators.
if method in NP_METHODS:
    # np.interp requires sorted X values, #21037
    indexer = np.argsort(indices[valid])
    yvalues[invalid] = np.interp(
        indices[invalid], indices[valid][indexer], yvalues[valid][indexer]
    )
else:
    yvalues[invalid] = _interpolate_scipy_wrapper(
        indices[valid],
        yvalues[valid],
        indices[invalid],
        method=method,
        fill_value=fill_value,
        bounds_error=bounds_error,
        order=order,
        **kwargs,
    )
yvalues[preserve_nans] = np.nan
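To make the "how many rows" part concrete, here is a minimal pure-Python sketch of piecewise-linear filling. It is a stand-in for the np.interp branch above, not the pandas implementation: all valid points are passed in, but each missing value is computed only from the nearest valid point on either side. The function fill_linear is a hypothetical name, and unlike np.interp this sketch leaves leading and trailing gaps as NaN.

```python
import math
from bisect import bisect_left


def fill_linear(values):
    """Fill interior NaNs linearly from the nearest valid neighbours."""
    valid_x = [i for i, v in enumerate(values) if not math.isnan(v)]
    valid_y = [values[i] for i in valid_x]
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        pos = bisect_left(valid_x, i)  # first valid position to the right of i
        if pos == 0 or pos == len(valid_x):
            continue  # leading/trailing gap: left as NaN in this sketch
        x0, x1 = valid_x[pos - 1], valid_x[pos]
        y0, y1 = valid_y[pos - 1], valid_y[pos]
        filled[i] = y0 + (y1 - y0) * (i - x0) / (x1 - x0)
    return filled


nan = float("nan")
result = fill_linear([1.0, nan, nan, 4.0, nan, 10.0])
print(result)
```

Methods routed to SciPy (splines, polynomials of a given order, and so on) can depend on more than the two bracketing points, so this neighbour-only picture should be read as applying to the linear case.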

Understanding with julia dataframes indexing

I am learning Julia and I've just found this line:
if(any(mach_df[start_slot:(start_slot + task_setup_time), Symbol(machine)].== 0))
What does it mean? As far as I know, any is a function that returns true if every value of the parameter is true, but I just can't understand what is inside the brackets.
Regards
Let us work inside out:
mach_df[start_slot:(start_slot + task_setup_time), Symbol(machine)] selects the rows in the range start_slot:(start_slot + task_setup_time) and the column named Symbol(machine) (Symbol is most likely not needed, but I would need to see your source code to tell you exactly); as a result you get a vector.
mach_df[start_slot:(start_slot + task_setup_time), Symbol(machine)] .== 0 gives you another vector that has true wherever the value in the LHS vector is 0, and false elsewhere.
The any part returns true if at least one of the values in the vector produced above is true (not only if all of them are).
A more advanced (and efficient) way to write it would be:
any(==(0), @view mach_df[start_slot:(start_slot + task_setup_time), Symbol(machine)])
but I am not sure if you need performance in your use case.
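For readers more at home in Python, the same check can be sketched with a plain list standing in for the DataFrame column. This is only an analogy: any_zero is a hypothetical helper, and the slice is shifted because Julia ranges are 1-based and include both endpoints.

```python
def any_zero(column, start_slot, task_setup_time):
    """True if any value in the (1-based, inclusive) slot window equals 0."""
    # Julia's start_slot:(start_slot + task_setup_time) covers
    # task_setup_time + 1 rows; convert it to a 0-based Python slice.
    window = column[start_slot - 1 : start_slot + task_setup_time]
    return any(value == 0 for value in window)


column = [1, 1, 0, 1, 1]
print(any_zero(column, 1, 1))  # window is rows 1..2
print(any_zero(column, 2, 1))  # window is rows 2..3
```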

Pyomo: Unbounded objective function though bounded

I am currently implementing an optimization problem with pyomo, and for some hours now I have been getting the message that my problem is unbounded. While searching for the cause, I came across one term that seems to be unbounded. I excluded this term from the objective function, and it turns out to take a very large negative value, which supports the assumption that it is unbounded towards -Inf.
But I have checked the problem further and it is impossible that the term is unbounded, as following code and results show:
model.nominal_cap_storage = Var(model.STORAGE, bounds=(0,None)) #lower bound is 0
#I assumed very high CAPEX for each storage (see print)
dict_capex_storage = {'battery': capex_battery_storage,
                      'co2': capex_co2_storage,
                      'hydrogen': capex_hydrogen_storage,
                      'heat': capex_heat_storage,
                      'syncrude': capex_syncrude_storage}
print(dict_capex_storage)
>>> {'battery': 100000000000000000, 'co2': 100000000000000000,
'hydrogen': 1000000000000000000, 'heat': 1000000000000000, 'syncrude': 10000000000000000000}
Given these assumptions, it should be impossible for the term to be unbounded towards -Inf, as the capacity has a lower bound of 0 and the CAPEX is a positive fixed value. But now it gets crazy. The following term is the one that appears to be unbounded:
model.total_investment_storage = Var()
def total_investment_storage_rule(model):
    return model.total_investment_storage == sum(model.nominal_cap_storage[storage] * dict_capex_storage[storage]
                                                 for storage in model.STORAGE)
model.total_investment_storage_con = Constraint(rule=total_investment_storage_rule)
If I exclude the term from the objective function, I get following value after the optimization. It seems, that it can take high negative values.
>>>>
Variable total_investment_storage
-1004724108.3426505
So I checked the term regarding the component model.nominal_cap_storage to see the value of the capacity:
model.total_cap_storage = Var()
def total_cap_storage_rule(model):
    return model.total_cap_storage == sum(model.nominal_cap_storage[storage] for storage in model.STORAGE)
model.total_cap_storage_con = Constraint(rule=total_cap_storage_rule)
>>>>
Variable total_cap_storage
0.0
I did the same for the dictionary, but made a mistake: I forgot to delete the model.nominal_cap_storage. But the result is confusing:
model.total_capex_storage = Var()
def total_capex_storage_rule(model):
    return model.total_capex_storage == sum(model.nominal_cap_storage[storage] * dict_capex_storage[storage]
                                            for storage in model.STORAGE)
model.total_capex_storage_con = Constraint(rule=total_capex_storage_rule)
>>>>
Variable total_capex_storage
0.0
So my question is why is the term unbounded and how is it possible that model.total_investment_storage and model.total_capex_storage have different solutions though both are calculated equally? Any help is highly appreciated.
I think you are misinterpreting "unbounded." When the solver says the problem is unbounded, that means the objective function value can be improved without limit given the variables and constraints in the problem. It has nothing to do with bounds on individual variables, unless one of those variable bounds is what prevents the objective from being unbounded.
If you want help with the problem above, you need to edit the question and post the full problem, with the objective function and (if possible) the error. What you have now is a collection of snippets from different variations of a problem, which isn't really informative about the overall issue.
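A toy illustration of that distinction, independent of Pyomo and of the asker's model: minimizing x - y with x >= 0 and y >= 0 is unbounded even though both variables have a lower bound, because y can grow forever. The functions and the sample points below are made up for this sketch.

```python
def objective(x, y):
    """Toy objective to minimize; not taken from the question."""
    return x - y


def is_feasible(x, y):
    """Both variables only have a lower bound of 0."""
    return x >= 0 and y >= 0


# Walk along a feasible ray: x stays at 0 while y grows.
ray = [(0.0, 10.0 ** k) for k in range(4)]
values = [objective(x, y) for x, y in ray if is_feasible(x, y)]
print(values)  # each step is feasible and strictly lower than the last
```

Adding an upper bound on y (or a constraint linking y to x) would make the same problem bounded, which is why a single extra bound can make an "unbounded" message disappear.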
I solved the problem by setting a lower bound to the term, which takes a negative value:
model.total_investment_storage = Var(bounds=(0, None))
I am still not sure why this term can take negative values, but this at least solved my problem.

Summation iterated over a variable length

I have written an optimization problem in pyomo and need a constraint, which contains a summation that has a variable length:
u_i_t[i, t]*T_min_run - sum (tnewnew in (t-T_min_run+1)..t-1) u_i_t[i,tnewnew] <= sum (tnew in t..(t+T_min_run-1)) u_i_t[i,tnew]
T is my actual timeline and N my machines
usually I iterate over t, but I need to guarantee the machines are turned on for certain amount of time.
def HP_on_rule(model, i, t):
    return model.u_i_t[i, t]*T_min_run - sum(model.u_i_t[i, tnewnew] for tnewnew in range((t-T_min_run+1), (t-1))) <= sum(model.u_i_t[i, tnew] for tnew in range(t, (t+T_min_run-1)))
model.HP_on_rule = Constraint(N, rule=HP_on_rule)
I hope you can provide me with the correct formulation in pyomo/python.
The problem is that t is a running variable and I do not know how to implement this in Python. tnew is only a help variable. E.g. t=6 (variable), T_min_run=3 (constant) and u_i_t is binary [00001111100000...] then I get:
1*3 - 1 <= 3
As I said, I do not know how to implement this in my code and the current version is not running.
TypeError: HP_on_rule() missing 1 required positional argument: 't'
It seems like you didn't provide all your arguments to the function rule.
Since t is a parameter of your function, I assume that it corresponds to an element of set T (your timeline).
Then, your last line of your code example should include not only the set N, but also the set T. Try this:
model.HP_on_rule = Constraint(N, T, rule=HP_on_rule)
Please note: When building a Constraint with a "for each" part, you must provide the Pyomo Sets that you want to iterate over at the beginning of the call that constructs the Constraint. As a rule of thumb, your constraint rule function should have one more argument than the number of Pyomo Sets specified in the Constraint initialization line.
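One more detail worth checking, separate from the missing set: Python's range(a, b) stops at b - 1, so the inclusive windows (t-T_min_run+1)..(t-1) and t..(t+T_min_run-1) from the formula correspond to range(t-T_min_run+1, t) and range(t, t+T_min_run). The sketch below evaluates both sides with plain Python on the example from the question (t=6, T_min_run=3), assuming the bit string is indexed from 1. The helper min_run_sides is hypothetical and uses no Pyomo.

```python
def min_run_sides(u, t, t_min_run):
    """Return (lhs, rhs) of the minimum-run-time inequality at time t."""
    # Inclusive window (t - t_min_run + 1)..(t - 1)
    before = sum(u[k] for k in range(t - t_min_run + 1, t))
    # Inclusive window t..(t + t_min_run - 1)
    ahead = sum(u[k] for k in range(t, t + t_min_run))
    return u[t] * t_min_run - before, ahead


bits = "00001111100000"
u = {k: int(b) for k, b in enumerate(bits, start=1)}  # assumption: 1-based timeline
lhs, rhs = min_run_sides(u, t=6, t_min_run=3)
print(lhs, rhs, lhs <= rhs)
```

In the Pyomo rule itself, time steps whose windows reach outside T also need handling, for example by returning Constraint.Skip for those t.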

error in LDA in r: each row of the input matrix needs to contain at least one non-zero entry

I am a starter in text mining topic. When I run LDA() over a huge dataset with 996165 observations, it displays the following error:
Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, :
Each row of the input matrix needs to contain at least one non-zero entry.
I am pretty sure that there are no missing values in my corpus. The tables of NA checks on the "DocumentTermMatrix" (a "simple_triplet_matrix") are:
table(is.na(dtm[[1]]))
#FALSE
#57100956
table(is.na(dtm[[2]]))
#FALSE
#57100956
I am a little confused about where "57100956" comes from. As my dataset is pretty large, I don't know how to check why this error occurs. My LDA command is:
ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Can anyone provide some insights? Thanks.
In my opinion the problem is not the presence of missing values, but the presence of all 0 rows.
To check it (here table stands for your document-term matrix):
row.sum=apply(table,1,FUN=sum) # sum of each row of the table
Then you can delete all rows that are entirely 0 with:
table=table[row.sum!=0,]
Now table should contain only rows with at least one non-zero entry.
I had the same problem. The design matrix (dtm, in your case) had rows with all zeroes because some documents did not contain any of the terms in the matrix (i.e. every frequency in that row was zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, the LDA ran successfully and classified each of the documents. Hope this helps!
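The row-filtering idea is language-independent; here is a minimal Python sketch of it on a small list-of-lists count matrix. The data and the helper drop_empty_rows are made up for illustration. Keeping the indices of the dropped rows is useful afterwards, because the topic assignments then have to be matched back to the remaining documents only.

```python
def drop_empty_rows(matrix):
    """Split a count matrix into non-empty rows and indices of all-zero rows."""
    kept, dropped = [], []
    for index, row in enumerate(matrix):
        if sum(row) > 0:
            kept.append(row)
        else:
            dropped.append(index)  # document with no remaining terms
    return kept, dropped


counts = [[2, 0, 1], [0, 0, 0], [0, 3, 0]]
kept, dropped = drop_empty_rows(counts)
print(kept, dropped)
```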