Pylint: same pylint and pandas version on 2 machines, 1 fails - pandas

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
sql,
con,
index_col=None,
coerce_float=True,
params=None,
parse_dates=None,
chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.

A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed arg if the passed arg is a dataframe, and returns 1 dataframe if the passed arg is an iterable of dataframes. So the ambiguous-type object could be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.real_sql_query, and only about 3 calls caused confusion for Pylint, so we almost used this solution)

Related

Pandas diff-function: NotImplementedError

When I use the diff function in my snippet:
for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
# Get differences
diff_df1 = cus[num_features].diff(1, axis = 0).iloc[[-1]].values.astype(np.float32)
I get:
NotImplementedError
The exact same code did run without any error before (on Colab), whereas now I'm using an Azure DSVM via JupyterHub and I get this error.
I already found this
pandas pd.DataFrame.diff(axis=1) NotImplementationError
but the solution doesnt work for me as I dont have any Date types. Also I did upgrade pandas but it didnt change anything.
EDIT:
I have found that the error occurs when the datatype is 'int16' or 'int8'. Converting the dtypes to 'int64' solves it.
However I leave the question open in case someone can explain it or show a solution that works with int8/int16.

Using Optaplanner for VRPPD

I am trying to run the example "optaplanner-mixedvrp-experiment" developed by Geoffrey De Smet and when I run it it throws me the following error:
Caused by: java.lang.IllegalStateException: The entity (MY) has a
variable (previousStandstill) with value (MUNO) which has a
sourceVariableName variable (nextVisit) with a value (WERBOMONT) which
is not null. Verify the consistency of your input problem for that
sourceVariableName variable.
I have not made any change, I have only cloned and executed it, I import and solve it and it throws me this error.
Do you know what could be happening?
I am applying it in the development of a variant of VRP with multiple deliveries and collections, but it throws me the same error. I have activated the FULL_ASSERT mode and nextVisit, previousStandstill, visitIndex are always null
It's been a long time since I looked at that code, so it's using an old version of optaplanner. Our goal is still to clean it up and offer an out of the box example for VRPPD (and probably remove some boilerplate along the way, using the upcoming #CollectionPlanningVariabe etc). That being said, we have multiple users&customers who used that optaplanner-mixedvrp-experiment to successfully build VRPPD implementations.
Which dataset did you try?
FWIW, that IllegalStateException says that when A.previous = B, the B.next is not A. So either the dataset importer didn't import it correctly - before calling solve() - especially if it fails before the first CH step in FULL_ASSERT. Or one of the custom moves corrupted the model.

Why old nodes are visible even after deleting event files [Tensorflow]?

I have just started learning tensorflow, and wrote the following piece of code in Jupyter-Notebook :
a = tf.placeholder(tf.float32,shape=[3,3],name='X')
b = tf.constant([[5,5,5],[2,3,4],[4,5,6]],tf.float32,name='Y')
c = tf.matmul(a,b)
with tf.Session() as sess:
writer = tf.summary.FileWriter('./graphs',sess.graph)
print (sess.run(c,feed_dict={a:[[2,3,4],[4,5,6],[6,7,8]]}))
writer.close()
Running the tensorboard first time, gives a single X,Y and mult node as follows :
However, when I again compile my code (ctrl+enter), the tensorboard now makes a duplicate of the original graph.
I tried to resolve this( remove the older,dead nodes) by:
1. Deleting the event files.
2. Deleting the whole directory containing multiple event files of the same code.
3. Running fuser 6006/tcp -k before the tensorboard command line call.
But even after that, when I ran tensorboard, it would show the duplicate copies.
The only solution that worked was to reset the graph using tf.reset_default_graph() at the beginning of the code or to shut the notebook down and restart it.
My question is :
1. Why is it that even after deleting the event files, the older dead nodes keep showing up on the tensorboard ?. And yes I even restarted Tensorboard after each try, but duplicates were still there.
2. What are the ways if any, besides the two I listed above, to get rid of the dead nodes ?
The nodes you described are not dead. They are still exist and can be used.
When you run your code the first time, the nodes where created and added to the graph. When you execute the same cell for the second time, they are added one more time with the different names.
The same can be achieved if you will just copy twice the code in your .py file.
Your solution with tf.reset_default_graph() is the right one. Restarting the notebook works because all the information from the memory was removed. The same as rerunning .py file.
Your stuff with removing the even files does not work because nonetheless the files are removed, the nodes added to the graph in the memory are still there.
I was also having the same problem of tensorboard displaying duplicate graphs. I tried several measures like deleting the event files , deleting the log directory which contained the event files, but tensorboard would still remember the nodes and the graph from the previous run and show them as duplicate copies, one after another.
I noticed each time I ran the code tensorflow would create the same nodes but with different names, which means it was still keeping the old nodes in its memory. Looked something like this in the python execution window:
After first run:
After running the code 2nd and 3rd time:
This solved the problem
1) Restarting the kernel before each run solved the problem
2) As mentioned in above posts, adding tf.reset_default_graph() in the beginning of the code also solved the problem.

JVM is not ready after 10 seconds

I configured sparkr normally from the tutorials, and everything was working. I was able to read the database with read.df, but suddenly nothing else works, and the following error appears:
Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds
Why does it appear now suddenly? I've read other users with the same problem, but the solutions given did not work. Below is my code:
Sys.setenv(SPARK_HOME= "C:/Spark")
Sys.setenv(HADOOP_HOME = "C:/Hadoop")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
#initialeze SparkR environment
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(SPARK_MEM="4g")
#Create a spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Try to do few things below:
Check if c:/Windows/System32/ is there in the PATH.
Check if spark-submit.cmd has proper execute permissions.
If both the above things are true and even if it is giving the same error, then delete spark directory and again create a fresh one by unzipping spark gzip file.
I'm a beginner of R, and I have solved the same problem "JVM is not ready after 10 seconds" by installing JDK(version 7+) before installing sparkr in my mac. And it works well now. Hope this can help you with your problem.

Memory error when running medium sized merge function ipython notebook jupyter

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using ipython jupyter notebook
Here is a sample of the data:
timestamp Namecoin_cap
0 2013-04-28 5969081
1 2013-04-29 7006114
2 2013-04-30 7049003
Each frame is around 1000 lines long
Here's the error in detail, I've also include my merge function.
My system is currently using up 64% of it memory
I have searched for similar issues but it seems most are for very large arrays >1GB, my data is relatively small in comparison.
EDIT: Something is suspicious. I wrote a beta program before, this was to test with 4 dataframes, i just exported that through pickle and it is 500kb. Now when i try to export the 100 frames one I get a memory error. It does however export a file that is 2GB. So i suspect somewhere down the line my code has created some kind of loop, creating a very large file. NB the 100 frames are stored in a dictionary
EDIT2: I have exported the scrypt to .py
http://pastebin.com/GqaHr7xc
This is a .xlsx that cointains asset names the script needs
The script fetches data regarding various assets, then cleans it up and saves each asset to a data frame in a dictionary
I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Other wise please advise on what tests I can run.
EDIT3: I'm finding it really hard to understand why this is happening, the code worked fine in the beta, all i have done now is add more assets.
EDIT4: I ran I size check on the object (dict of dfs) and it is 1,066,793 bytes
EDIT5: The problem is in the merge function for coin 37
for coin in coins[:37]:
data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
This is when the error occurs. for coin in coins[:36]:' doesn't produce an error howeverfor coin in coins[:37]:' produces the error, any ideas ?
EDIT6: the 36th element is 'Syscoin', i did coins.remove('Syscoin') however the memory problem still occurs. So it seems to be a problem with the 36th element in coins no matter what the coin is
EDIT7: goCards suggestions seemed to work however the next part of the code:
merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp',axis=1).sum(axis=1)
Produces a memory error. I'm stumped
In regard to storage, I would recommend using a simple csv over pickle. Csv is a more generic format. It is human readable,and you can check your data quality easier especially as your data grows.
file_template_string='%s.csv'
for eachKey in dfDict:
filename = file_template_string%(eachKey)
dfDict[eachKey].to_csv(filename)
If you need to date the files you can also put a timestamp in the filename.
import time
from datetime import datetime
cur = time.time()
cur = datetime.fromtimestamp(cur)
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))
There are some obvious errors in your code.
for coin in coins: #line 61,89
for coin in data: #should be
df = data2['Namecoin'] #line 87
keys = data2.keys()
keys.remove('Namecoin')
for coin in keys:
df = pd.merge(left=df,right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
Same issue happened to me!
"MemoryError:" by notebook on execution of pandas. I have also screen printed quite lot of observations before issued happened.
Reinstalling Anaconda didn't help. Later realized that i was working with IPython notebook instead Jupyter notebook. Switched to Jupyter notebook. Everything worked fine!