Unable to catch Ray Task Errors when using Modin pandas

I am trying to check whether a float column actually holds integer values before converting it to a string column (exact use case: 123.00 needs to become '123', while '123-4' needs to remain '123-4').
Code:
# series: a Modin pandas Series
try:
    # If the values are integral floats (e.g. 123.0)
    return series.astype(int).astype(str)
except Exception:
    # If the values are strings (e.g. '123-4')
    return series.astype(str)
However, the exception is not caught:
ValueError: invalid literal for int() with base 10: '123-4'
ray.exceptions.RayTaskError(ValueError): ray::apply_func()
I have tried a bare except:, as well as except ValueError: and except RayTaskError: (after from ray.exceptions import RayTaskError), and none of them catch it.
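A likely explanation is that Modin executes operations lazily on Ray workers, so the ValueError is raised remotely and only surfaces, wrapped in a RayTaskError, after control has left the try block. Below is a minimal sketch of a possible workaround that sidesteps exception handling entirely by testing convertibility up front with pd.to_numeric (assuming Modin's to_numeric behaves like pandas's; float_col_to_str is a hypothetical helper name):

import modin.pandas as pd

def float_col_to_str(series):
    # Coerce non-numeric values (e.g. '123-4') to NaN instead of raising
    numeric = pd.to_numeric(series, errors='coerce')
    if numeric.notna().all():
        # Every value parsed as a number, e.g. 123.0 -> '123'
        return numeric.astype(int).astype(str)
    # At least one value is not numeric, so keep the column as strings
    return series.astype(str)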
Update:
Present as a github issue: https://github.com/modin-project/modin/issues/3966

Related

Error: ('HY000', 'The driver did not supply an error!') - with string

I am trying to push a pandas DataFrame from Python to Impala but am getting a very uninformative error. The code I am using looks like this:
cursor = connection.cursor()
cursor.fast_executemany = True
cursor.executemany(
    f"INSERT INTO table({', '.join(df.columns.tolist())}) VALUES ({('?,' * len(df.columns))[:-1]})",
    list(df.itertuples(index=False, name=None))
)
cursor.commit()
connection.close()
This works for the first 23 rows and then suddenly throws this error:
Error: ('HY000', 'The driver did not supply an error!')
This doesn't help me locate the issue at all. I've turned all NA values into None for compatibility. Unfortunately, I can't share the data.
Does anyone have any ideas or leads as to how I could solve this? Thanks.
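Since the driver reports nothing useful, one hedged way to narrow it down is to fall back to row-by-row inserts so the failing row (presumably the 24th) identifies itself; insert_sql below is just the same statement built above. It may also be worth retrying with fast_executemany = False, which often yields a more specific driver error:

insert_sql = (
    f"INSERT INTO table({', '.join(df.columns.tolist())}) "
    f"VALUES ({('?,' * len(df.columns))[:-1]})"
)
for i, row in enumerate(df.itertuples(index=False, name=None)):
    try:
        cursor.execute(insert_sql, row)
    except Exception as exc:
        # Print the offending row before re-raising so it can be inspected
        print(f"Row {i} failed: {row!r} ({exc})")
        raise
cursor.commit()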

Pandas diff-function: NotImplementedError

When I use the diff function in my snippet:
for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
    # Get differences between consecutive rows; keep the last row's values
    diff_df1 = cus[num_features].diff(1, axis=0).iloc[[-1]].values.astype(np.float32)
I get:
NotImplementedError
The exact same code ran without any error before (on Colab), whereas now I'm using an Azure DSVM via JupyterHub and get this error.
I already found this:
pandas pd.DataFrame.diff(axis=1) NotImplementationError
but that solution doesn't work for me, as I don't have any date types. I also upgraded pandas, but that didn't change anything.
EDIT:
I have found that the error occurs when the datatype is 'int16' or 'int8'. Converting the dtypes to 'int64' solves it.
However, I'll leave the question open in case someone can explain it or show a solution that works with int8/int16.
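For reference, a minimal sketch of the upcasting workaround from the edit, assuming the NotImplementedError really does come from diff() not handling the narrow integer dtypes on the affected pandas version:

import numpy as np
import pandas as pd

cus_small = pd.DataFrame({'x': np.array([1, 2, 4], dtype='int8')})
# On the affected setup, cus_small.diff() raises NotImplementedError;
# upcasting to int64 first avoids it:
diffs = cus_small.astype('int64').diff(1, axis=0)
print(diffs)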

Getting sqlalchemy.exc.InternalError with exception value of 'cursor.execute(statement, parameters)' when using Pandas to_sql

I am getting the below exception when writing my Pandas dataframe to Redshift using Pandas's to_sql method:
Exception type:
<class 'sqlalchemy.exc.InternalError'>
Exception traceback:
MY-PATH/venv/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
Exception value:
cursor.execute(statement, parameters)
Code:
dataframe.to_sql(TABLE-NAME, DB-CONNECTOR, schema='warehouse', method='multi', chunksize=5000, index=False, if_exists='append')
In my scenario, the issue was in my SQL DDL: one of my columns had a NOT NULL constraint, but some of the values being written to that column were null. This was not at all obvious from the error, but I wanted to call it out in case others hit it while using pandas's to_sql method to create or update a table.
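Building on that, a hedged pre-flight check can make this failure obvious before to_sql runs; required_cols below is a hypothetical list of the columns your DDL declares NOT NULL:

required_cols = ['order_id', 'created_at']  # hypothetical NOT NULL columns
null_counts = dataframe[required_cols].isna().sum()
offending = null_counts[null_counts > 0]
if not offending.empty:
    # Fail fast with a readable message instead of an opaque InternalError
    raise ValueError(f"NOT NULL columns contain nulls:\n{offending}")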

No module named 'pandas.core.computation.expressions'

I've never encountered this error before, but after re-downloading my Python packages, I got it.
I'm calculating a list of time differences as follows:
df = (A - B)
where A and B are pandas Series objects (pandas.core.series.Series), and I received this error:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op, str_rep)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_arithmetic_op(left, right, op, str_rep)
ModuleNotFoundError: No module named 'pandas.core.computation.expressions'
I tried re-downloading and updating pandas, but it didn't help.
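This usually points at a corrupted or partially installed pandas rather than a bug in the calculation. As a quick sanity check (a sketch, assuming a standard pip or conda environment), try importing the missing module directly; if it fails, uninstalling and reinstalling pandas cleanly is the usual fix:

import importlib

# A ModuleNotFoundError here means the installed pandas package files
# are incomplete, so the package itself needs a clean reinstall.
importlib.import_module('pandas.core.computation.expressions')
print('pandas install looks intact')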

Unable to SaveAsTextFile AttributeError: 'list' object has no attribute 'saveAsTextFile'

I have submitted a similar question relating to saveAsTextFile, but I'm not sure if that question will get the same answer, as I now have a new error message:
I have put together the following pyspark.sql code:
#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',
                    inferSchema=True, header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
    PersonType
    ,COUNT(PersonType) AS `Person Count`
FROM Person_Person
GROUP BY PersonType""")
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I am getting the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-9e137ed161cc> in <module>()
----> 1 result.saveAsTextFile("test")
AttributeError: 'list' object has no attribute 'saveAsTextFile'
As I mentioned, I'm trying to send the results of my query to a text file with saveAsTextFile, but I am getting the above error.
Can someone shed some light on how to resolve this issue?
collect() returns all the records of the DataFrame as a list of Row objects, and you are calling saveAsTextFile on that result, which is a plain Python list. A list doesn't have a saveAsTextFile method, so it throws an error:
result = myresults.collect()
result.saveAsTextFile("test")
To save the contents of the DataFrame to a file, you have two options:
1. Convert the DataFrame to an RDD and call saveAsTextFile on it:
myresults.rdd.saveAsTextFile(OUTPUT_PATH)
2. Use DataFrameWriter. In this case the DataFrame must have a single column of string type; each row becomes a line in the output file:
myresults.write.format("text").save(OUTPUT_PATH)
Since your DataFrame has more than one column, proceed with option 1.
Also, by default Spark creates 200 partitions for shuffles, so 200 files will be created in the output path. If you have little data, configure the parameter below according to your data size:
spark.conf.set("spark.sql.shuffle.partitions", 5)  # 5 files will be written to the output folder
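As a design note, if you would rather end up with a single output file than one per shuffle partition, a hedged alternative is to coalesce the RDD to one partition before writing (fine for small results, since everything funnels through one task):

# Collapse to a single partition so the output directory contains just
# one part file; only advisable when the result fits comfortably on one node.
myresults.rdd.coalesce(1).saveAsTextFile("test_single")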