Pandas diff function: NotImplementedError

When I use the diff function in my snippet:
for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
    # Get differences
    diff_df1 = cus[num_features].diff(1, axis=0).iloc[[-1]].values.astype(np.float32)
I get:
NotImplementedError
The exact same code did run without any error before (on Colab), whereas now I'm using an Azure DSVM via JupyterHub and I get this error.
I already found this
pandas pd.DataFrame.diff(axis=1) NotImplementationError
but the solution doesn't work for me as I don't have any Date types. I also upgraded pandas, but it didn't change anything.
EDIT:
I have found that the error occurs when the datatype is 'int16' or 'int8'. Converting the dtypes to 'int64' solves it.
However, I'll leave the question open in case someone can explain it or show a solution that works with int8/int16.
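For anyone hitting the same thing, here is a minimal sketch of that workaround, upcasting the int8/int16 columns to int64 just for the diff (df and num_features are the objects from the snippet above):
import numpy as np
from tqdm import tqdm

for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
    # Upcast the narrow integer columns so DataFrame.diff doesn't raise
    # NotImplementedError, then proceed exactly as before.
    diff_df1 = (cus[num_features]
                .astype('int64')
                .diff(1, axis=0)
                .iloc[[-1]]
                .values
                .astype(np.float32))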

Related

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
Python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.
A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed argument if it is a dataframe, and returns a single dataframe if it is an iterable of dataframes, so the ambiguous-type object could be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.read_sql_query and only about 3 calls caused confusion for Pylint, so we almost used this solution.)
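A rough sketch of that wrapper, in case it helps anyone; the name ensure_dataframe is made up here, and collapsing the chunks with pd.concat is only one possible design choice:
from typing import Iterable, Union
import pandas as pd

def ensure_dataframe(obj: Union[pd.DataFrame, Iterable[pd.DataFrame]]) -> pd.DataFrame:
    """Collapse the DataFrame-or-iterator return type of pd.read_sql_query
    into a plain DataFrame so Pylint sees an unambiguous type."""
    if isinstance(obj, pd.DataFrame):
        return obj
    # When chunksize is set, read_sql_query yields chunks; concatenate them
    # (returning only the first chunk would be another option).
    return pd.concat(obj, ignore_index=True)

# Hypothetical usage:
# df = ensure_dataframe(pd.read_sql_query(sql, con))
# df.query("some_column > 0")  # no more no-member warning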

Sparklyr : sql temporary error : argument is not interpretable as logical

Hi, I'm new to sparklyr and I'm essentially running a query to create a temporary object in Spark.
The code is something like
ts_data<-tbl(sc,"db.table") %>% filter(condition) %>% compute("ts_data")
sc is my spark connection.
I have run the same code before and it works but now I get the following error.
Error in if (temporary) sql("TEMPORARY ") : argument is not interpretable as logical
I have tried changing filters, trying new tables, different R versions and snapshots, yet it still gives the same exact error. I am positive there are no syntax errors.
Can someone help me understand how to fix this?
I ran into the same problem. Changing compute("x") to compute(name = "x") fixed it for me.
This was a bug in sparklyr and is fixed in version 1.7.0, so either pass the name by keyword (compute(name = "x")) or update your sparklyr version.

How to index correctly using Python 3.7 and address associated errors

I have been having issues with indexing in Python 3.7 and would greatly appreciate your insights and clarification on this.
I have tried to research and fix the issue, but I am not able to understand what I am doing wrong.
enroll = pd.read_csv('enrollment_forecast.csv')
enroll.columns = ['year','roll','unem','hgrad','inc']
# the correlation between variables
enroll.corr()
enroll_data = enroll.ix[:(2,3)].values
print(enroll_data)
enroll_target = enroll.ix[:,1].values
print(enroll_target)
enroll_data_names = ['unem','hgrad']
Exception has occurred: AssertionError
End slice bound is non-scalar
Just a heads up, the Pandas .ix indexer is deprecated (and removed in current versions).
It's throwing the error because the expression :(2,3) is parsed as a slice whose end bound is the tuple (2, 3):
enroll_data = enroll.ix[:(2,3)].values
A comma is missing between the row selector and the column selector. To select all rows and columns 2 and 3, it should read:
enroll_data = enroll.ix[:,(2,3)].values
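Also, since .ix has been removed from current pandas versions, the same selection can be written positionally with .iloc (assuming the intent is columns 2 and 3 as the features and column 1 as the target):
# all rows, columns 2 ('unem') and 3 ('hgrad')
enroll_data = enroll.iloc[:, [2, 3]].values
# all rows, column 1 ('roll')
enroll_target = enroll.iloc[:, 1].values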

JVM is not ready after 10 seconds

I configured SparkR normally following the tutorials, and everything was working. I was able to read the database with read.df, but suddenly nothing else works and the following error appears:
Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds
Why does it appear now suddenly? I've read other users with the same problem, but the solutions given did not work. Below is my code:
Sys.setenv(SPARK_HOME= "C:/Spark")
Sys.setenv(HADOOP_HOME = "C:/Hadoop")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# initialize SparkR environment
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(SPARK_MEM="4g")
#Create a spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Try the few things below:
Check if c:/Windows/System32/ is there in the PATH.
Check if spark-submit.cmd has proper execute permissions.
If both of the above are true and it still gives the same error, then delete the Spark directory and create a fresh one by unzipping the Spark gzip file again.
I'm a beginner in R, and I solved the same "JVM is not ready after 10 seconds" problem by installing the JDK (version 7+) before installing SparkR on my Mac. It works well now. Hope this helps with your problem.

Memory error when running a medium-sized merge function in an IPython/Jupyter notebook

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using an IPython/Jupyter notebook.
Here is a sample of the data:
timestamp Namecoin_cap
0 2013-04-28 5969081
1 2013-04-29 7006114
2 2013-04-30 7049003
Each frame is around 1000 lines long
Here's the error in detail; I've also included my merge function.
My system is currently using 64% of its memory.
I have searched for similar issues, but it seems most are about very large arrays (>1 GB); my data is relatively small in comparison.
EDIT: Something is suspicious. I wrote a beta program before to test with 4 dataframes; I exported that through pickle and it is 500 KB. Now when I try to export the 100-frame version I get a memory error. It does, however, export a file that is 2 GB. So I suspect somewhere down the line my code has created some kind of loop, creating a very large file. NB: the 100 frames are stored in a dictionary.
EDIT2: I have exported the script to .py
http://pastebin.com/GqaHr7xc
This is a .xlsx that contains the asset names the script needs.
The script fetches data regarding various assets, then cleans it up and saves each asset to a dataframe in a dictionary.
I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Otherwise, please advise on what tests I can run.
EDIT3: I'm finding it really hard to understand why this is happening; the code worked fine in the beta, and all I have done now is add more assets.
EDIT4: I ran a size check on the object (the dict of dataframes) and it is 1,066,793 bytes.
EDIT5: The problem is in the merge function for coin 37
for coin in coins[:37]:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
This is when the error occurs: for coin in coins[:36]: doesn't produce an error, however for coin in coins[:37]: does. Any ideas?
EDIT6: The 36th element is 'Syscoin'. I did coins.remove('Syscoin'), however the memory problem still occurs. So it seems to be a problem with the 36th element in coins, no matter what the coin is.
EDIT7: goCards' suggestions seemed to work, however the next part of the code:
merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp',axis=1).sum(axis=1)
produces a memory error. I'm stumped.
In regard to storage, I would recommend using a simple CSV over pickle. CSV is a more generic format: it is human readable, and you can check your data quality more easily, especially as your data grows.
file_template_string = '%s.csv'
for eachKey in dfDict:
    filename = file_template_string % eachKey
    dfDict[eachKey].to_csv(filename)
If you need to date the files you can also put a timestamp in the filename.
import time
from datetime import datetime
cur = time.time()
cur = datetime.fromtimestamp(cur)
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))
There are some obvious errors in your code.
for coin in coins:  # line 61, 89
for coin in data:   # should be

df = data2['Namecoin']  # line 87
keys = list(data2.keys())  # wrap in list() so .remove() also works in Python 3
keys.remove('Namecoin')
for coin in keys:
    df = pd.merge(left=df, right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
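A further, untested idea (not part of the code review above): if the chain of ~100 merges itself is what exhausts memory, indexing each frame by timestamp and doing a single columnwise concat avoids building a new intermediate frame at every step:
import pandas as pd

# Assumes data2 is the dict of per-asset dataframes described in the question,
# each with a 'timestamp' column, one capitalisation column, and no duplicate
# timestamps within a frame.
frames = [df.set_index('timestamp')
          for name, df in data2.items()
          if name != 'merged']

# One outer concat replaces the ~100 successive pd.merge calls.
merged = pd.concat(frames, axis=1, join='outer')
merged['Total_MC'] = merged.sum(axis=1)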
The same issue happened to me!
The notebook threw "MemoryError:" on execution of pandas code. I had also printed quite a lot of observations to the screen before the issue happened.
Reinstalling Anaconda didn't help. Later I realized that I was working with an IPython notebook instead of a Jupyter notebook. I switched to Jupyter notebook and everything worked fine!