No module named 'pandas.core.computation.expressions' - pandas

I'd never encountered this error before, but after re-downloading my Python packages I started getting it.
I'm computing a list of time differences as follows:
df = (A - B)
where A and B are pandas Series (pandas.core.series.Series), and I received this error:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op, str_rep)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_arithmetic_op(left, right, op, str_rep)
ModuleNotFoundError: No module named 'pandas.core.computation.expressions'
I've tried reinstalling and updating pandas, but it didn't help.
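A hedged diagnostic sketch (an assumption, not a confirmed fix): this error often points to a stale or partially upgraded pandas install, so it can help to check which pandas the interpreter actually imports and whether the reported submodule exists there:
# Check which pandas is being imported and whether the submodule exists.
# If the import below fails while pandas itself imports fine, the install
# is likely stale or mixed (e.g. leftovers from an older version).
import importlib
import pandas

print(pandas.__version__, pandas.__file__)
importlib.import_module("pandas.core.computation.expressions")
If that import fails, fully uninstalling pandas (so the package directory is actually removed) and then reinstalling it is worth trying before anything else.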

Pandas diff-function: NotImplementedError

When I use the diff function in my snippet:
for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
    # Get differences
    diff_df1 = cus[num_features].diff(1, axis=0).iloc[[-1]].values.astype(np.float32)
I get:
NotImplementedError
The exact same code ran without any errors before (on Colab); now I'm running it on an Azure DSVM via JupyterHub and I get this error.
I already found this:
pandas pd.DataFrame.diff(axis=1) NotImplementedError
but the solution doesn't work for me, as I don't have any Date types. I also upgraded pandas, but it didn't change anything.
EDIT:
I have found that the error occurs when the dtype is 'int16' or 'int8'. Converting the dtypes to 'int64' solves it.
However, I'll leave the question open in case someone can explain it or show a solution that works with int8/int16.
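A minimal sketch of the workaround described in the edit, upcasting only the small integer columns before calling diff() (the DataFrame and column names here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int8"),
                   "b": np.array([4, 5, 6], dtype="int16")})

# Upcast only the int8/int16 columns; other dtypes are left untouched.
small_ints = df.select_dtypes(include=["int8", "int16"]).columns
df[small_ints] = df[small_ints].astype("int64")

print(df.diff(1, axis=0))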

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have two environments running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
Python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine lints all the code fine, and has for many months. The same was true of the CI job, except it had been running Python 3.7.8. Now that I've upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some pandas objects. I've tried clearing CI caches, etc.
I wish I could provide something more reproducible. But to check my understanding of what a linter does: is it theoretically possible that a small Python version difference trips up pylint like this? For something like a no-member error on pandas objects, I'd expect the pandas version to be the dominant factor, but those match, so I'm confused!
Update:
I've looked at the pandas code for pd.read_sql_query, which is what's causing the no-member error. Its signature is:
def read_sql_query(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm calling .query on the returned DataFrame). So pylint seems to think this function returns a generator, but it doesn't make that assumption in my other setup. (I've also verified that the SHA sum of pandas/io/sql.py matches in both environments.) This seems similar to this issue, but I am still baffled by the discrepancy between the environments.
A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed argument if it is a DataFrame, and returns a single DataFrame if it is an iterable of DataFrames, so that the ambiguously typed object could be passed through this wrapper to clarify things for pylint; see the sketch below. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.read_sql_query and only about 3 of them confused pylint, so we almost used this solution.)
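A minimal sketch of that wrapper, assuming "returns 1 dataframe" means concatenating the chunks (the name as_dataframe is made up for illustration):
from typing import Iterator, Union

import pandas as pd

def as_dataframe(obj: Union[pd.DataFrame, Iterator[pd.DataFrame]]) -> pd.DataFrame:
    # Narrow the ambiguous return type of pd.read_sql_query so pylint
    # infers a single concrete type at the call site.
    if isinstance(obj, pd.DataFrame):
        return obj
    # Chunked mode: combine the iterator of DataFrames into one.
    return pd.concat(obj, ignore_index=True)

# Usage (illustrative): df = as_dataframe(pd.read_sql_query(sql, con))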

Unable to catch Ray Task Errors when using modin pandas

I am trying to check whether a float column actually contains only integers before converting it to a string column (exact use case: 123.00 needs to become '123', while '123-4' needs to remain '123-4').
Code:
# series: a modin.pandas Series
try:
    # If it's an integer-valued float (123.0)
    return series.astype(int).astype(str)
except Exception:
    # If it's a string ('123-4')
    return series.astype(str)
However, the exception is not caught:
ValueError: invalid literal for int() with base 10: '123-4'
ray.exceptions.RayTaskError(ValueError): ray::apply_func()
I have tried a bare except:, except ValueError:, and except RayTaskError: (after from ray.exceptions import RayTaskError), but none of them catch the error.
Update:
This is now tracked as a GitHub issue: https://github.com/modin-project/modin/issues/3966
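One hedged workaround sketch (my assumption, not a fix confirmed in the issue): because Modin runs the work in Ray tasks asynchronously, the ValueError can surface only after the try block has already returned. Forcing materialization inside the try should make the failure observable where it can be caught; note that _to_pandas() is a private Modin API and may change between versions:
import modin.pandas as mpd

def int_like_to_str(series):
    try:
        result = series.astype(int).astype(str)
        result._to_pandas()  # force eager execution so the error surfaces here
        return result
    except Exception:
        # The remote failure arrives wrapped as ray.exceptions.RayTaskError.
        return series.astype(str)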

DataFrames: df."column name" raises a MethodError

So I have a DataFrame df from CSV.read(...), and it has a column labeled 'Population in thousands (2017)',
so I used the command
df."Population in thousands (2017)"
This used to work... but I installed some packages and created something, and now I get THIS error when I input
df."Population in thousands (2017)"
ERROR: MethodError: no method matching getproperty(::DataFrame, ::String)
Closest candidates are:
getproperty(::AbstractDataFrame, ::Symbol) at C:\Users\jerem\.julia\packages\DataFrames\S3ZFo\src\abstractdataframe\abstractdataframe.jl:295
getproperty(::Any, ::Symbol) at Base.jl:33
Stacktrace:
[1] top-level scope
# REPL[10]:1
Thank you in advance.
I can confirm that this works on the current (at the time of writing) DataFrames release:
(jl_yo71eu) pkg> st
Status `...\AppData\Local\Temp\jl_yo71eu\Project.toml`
[a93c6f00] DataFrames v1.2.2
julia> using DataFrames
julia> df = DataFrame("Population in thousands (2017)" => rand(5));
julia> df."Population in thousands (2017)"
5-element Vector{Float64}:
0.8976467991472025
0.32646068570785514
0.5168819082429569
0.8488198612708232
0.27250141317576815
I'm assuming you're on an outdated version of DataFrames?
Edited to add, following discussion in the comments:
Bogumił can of course read your DataFrames version from the package folder name in the stacktrace, so it appears you really are on an outdated version. You should do add DataFrames@1.2 in the package manager to force an upgrade, which will tell you which packages in your current environment are holding you back.

Unable to load bigquery data in local spark (on my mac) using pyspark

I am getting the error below after executing the code that follows. Am I missing something in the installation? I am using Spark installed locally on my Mac, so I am checking whether I need to install additional libraries for the code below to work and load data from BigQuery.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
13 "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
14 "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15 conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark

sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")

conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}

tableData = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
    conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
    int(x["word_count"]))).reduceByKey(lambda x, y: x + y)
print(tableData.take(10))
The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" seems to hint that a library is missing.
Please try adding the gson jar to your path: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
Highlighting something buried in the connector link in Felipe's response: the bq connector used to be included by default in Cloud Dataproc, but was dropped starting at v 1.3. The link shows you three ways to get it back.
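A minimal sketch of one way to make the missing jars visible to the JVM when creating the SparkContext locally (the jar paths and the connector jar name are illustrative assumptions, not exact requirements):
import pyspark

# spark.jars takes a comma-separated list of jars to add to the driver and
# executor classpaths; gson provides com.google.gson.JsonObject, and the
# BigQuery connector provides the input format used above.
conf = (pyspark.SparkConf()
        .set("spark.jars",
             "/path/to/gson-2.6.1.jar,/path/to/bigquery-connector.jar"))
sc = pyspark.SparkContext(conf=conf)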