I have tested that neither logger nor print can emit messages from inside a pandas_udf, in either cluster mode or client mode.
Test code:
import sys
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import logging
logger = logging.getLogger('test')
spark = (SparkSession
         .builder
         .appName('test')
         .getOrCreate())

df = spark.createDataFrame(pd.DataFrame({
    'y': np.random.randint(1, 10, (20,)),
    'ds': np.random.randint(1000, 9999, (20,)),
    'store_id': ['a'] * 10 + ['b'] * 7 + ['q'] * 3,
    'product_id': ['c'] * 5 + ['d'] * 12 + ['e'] * 3,
}))
@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])

df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
Also note:
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("#"*50)
You can't use this in a pandas_udf either, because this logger hangs off the Spark context object, and you can't refer to the Spark session/context inside a UDF.
The only way I know of is to raise an Exception, as in the answer I wrote below.
But it is tricky and has drawbacks.
I want to know whether there is any way to simply print messages from inside a pandas_udf.
I have tried every approach I could find on Spark 2.4.
Without logs, it is hard to debug a faulty pandas_udf. The only workable way I know of to surface an error message from a pandas_udf is to raise an Exception. Debugging this way costs time, but I don't know a better approach.
@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    raise Exception('#' * 100)  # the only way I know to surface a message, but it breaks execution
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])
The drawback is that you can't keep the Spark job running after the message is printed; the raised exception kills the execution.
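For completeness, here is a minimal sketch of how the raised message actually surfaces on the driver, assuming the grouped apply from the question above (the job still fails; this only makes the message easy to read):

# Force execution; the Python traceback (including the '#' * 100 message)
# is embedded in the exception Spark re-raises on the driver.
try:
    df.groupby(['store_id', 'product_id']).apply(train_predict).count()
except Exception as e:
    print(e)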
One thing you can do is to put the log message into the DataFrame itself.
For example
@pandas_udf('y int, ds int, store_id string, product_id string, log string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    return pd.DataFrame([[3, 5, 'store123', 'product123', 'My log message']],
                        columns=['y', 'ds', 'store_id', 'product_id', 'log'])
After this, you can select the log column (together with the related key columns) into another DataFrame and write it to a file, then drop it from the original DataFrame.
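A minimal sketch of that split, assuming the grouped apply result is stored in df1 as before (the output path is just a placeholder):

# Keep the diagnostic text in its own DataFrame, then drop it from the result.
log_df = df1.select('store_id', 'product_id', 'log')
log_df.write.mode('overwrite').csv('/tmp/train_predict_logs')  # hypothetical path
df1 = df1.drop('log')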
It's not perfect, but it might be helpful.
I have a PySpark dataframe with a column named "subnet". I want to add a column which is the first IP of that subnet. I've tried many solutions, including:
import ipaddress
from pyspark.sql import functions as F

def get_first_ip(prefix):
    n = ipaddress.IPv4Network(prefix)
    first, last = n[0], n[-1]
    return first

df.withColumn("first_ip", get_first_ip(F.col("subnet")))
But I'm getting this error:
-> 1161 raise AddressValueError("Expected 4 octets in %r" % ip_str)
1162
1163 try:
AddressValueError: Expected 4 octets in "Column<'subnet'>"
I do understand that this is the Column object and I can't use it as a simple string here, but how do I solve my problem with PySpark?
I could do the same in pandas and then convert to PySpark, but I'm wondering whether there's a more elegant way.
It's hard to tell what the issue is when we don't know what the input dataframe looks like, but something is probably wrong with the column values, as @samkart suggested.
Here's an example that I tested:
import ipaddress
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
def get_first_ip(x):
    n = ipaddress.IPv4Network(x)
    return str(n[0])

def get_last_ip(x):
    n = ipaddress.IPv4Network(x)
    return str(n[-1])
first_ip_udf = F.udf(lambda x: get_first_ip(x), StringType())
last_ip_udf = F.udf(lambda x: get_last_ip(x), StringType())
spark = SparkSession.builder.getOrCreate()
data = [
{"IP": "10.10.128.123"},
{"IP": "10.10.128.0/17"},
]
df = spark.createDataFrame(data=data)
df = df.withColumn("first_ip", first_ip_udf(F.col("IP")))
df = df.withColumn("last_ip", last_ip_udf(F.col("IP")))
Outputs:
+--------------+-------------+-------------+
|IP |first_ip |last_ip |
+--------------+-------------+-------------+
|10.10.128.123 |10.10.128.123|10.10.128.123|
|10.10.128.0/17|10.10.128.0 |10.10.255.255|
+--------------+-------------+-------------+
You cannot directly apply a native Python function to a Spark dataframe column. As demonstrated in this answer, you could create a udf from your function.
Since a plain udf is slow for big dataframes, you could use pandas_udf, which is a lot faster.
Input:
import ipaddress
import pandas as pd
from pyspark.sql import functions as F
df = spark.createDataFrame([("10.10.128.123",), ("10.10.128.0/17",)], ["subnet"])
Script:
@F.pandas_udf('string')
def get_first_ip(prefix: pd.Series) -> pd.Series:
    return prefix.apply(lambda s: str(ipaddress.IPv4Network(s)[0]))

df = df.withColumn("first_ip", get_first_ip("subnet"))
df.show()
# +--------------+-------------+
# | subnet| first_ip|
# +--------------+-------------+
# | 10.10.128.123|10.10.128.123|
# |10.10.128.0/17| 10.10.128.0|
# +--------------+-------------+
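If you also need the last IP, as in the udf-based answer above, the same pattern carries over; a sketch under the same assumptions:

@F.pandas_udf('string')
def get_last_ip(prefix: pd.Series) -> pd.Series:
    # n[-1] is the last (broadcast) address of the network
    return prefix.apply(lambda s: str(ipaddress.IPv4Network(s)[-1]))

df = df.withColumn("last_ip", get_last_ip("subnet"))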
I'm trying to ingest a df I created from a JSON response into an existing table (the table is currently empty because I can't seem to get this to work).
The df looks something like the below table:
index | clicks_affiliated
----- | -----------------
0     | 3214
1     | 2221
but I'm seeing the following error:
snowflake.connector.errors.ProgrammingError: 000904 (42000): SQL
compilation error: error line 1 at position 94
invalid identifier '"clicks_affiliated"'
even though the column names in Snowflake match the columns in my dataframe.
This is my code:
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas, pd_writer
from pandas import json_normalize
import requests
df_norm = json_normalize(json_response, 'reports')
# I've also tried adding the below line (and removing it), but I see the same error
df = df_norm.reset_index(drop=True)
def create_db_engine(db_name, schema_name):
    engine = URL(
        account="ab12345.us-west-2",
        user="my_user",
        password="my_pw",
        database="DB",
        schema="PUBLIC",
        warehouse="WH1",
        role="DEV"
    )
    return engine

def create_table(out_df, table_name, idx=False):
    url = create_db_engine(db_name="DB", schema_name="PUBLIC")
    engine = create_engine(url)
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True

print(df.head())
create_table(df, "reporting")
So... it turns out I needed to change the columns in my dataframe to uppercase.
I added the line below after creating the dataframe, and it worked:
df.columns = map(lambda x: str(x).upper(), df.columns)
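In context, the fix might sit right after the dataframe is built and before the write; a sketch of that flow under the same setup as above:

df = df_norm.reset_index(drop=True)
# Snowflake stores unquoted identifiers in uppercase, so make the
# dataframe's column names match the table's uppercase columns.
df.columns = [str(c).upper() for c in df.columns]
create_table(df, "reporting")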
One of the recent problems I encountered with dask was an encoding step that takes a lot of time, and I wanted to speed it up.
Problem: given a dask dataframe (ddf), encode it and return the encoded ddf.
Here is some code to start with:
# !pip install feature_engine
import dask.dataframe as dd
import pandas as pd
import numpy as np
from feature_engine.encoding import CountFrequencyEncoder

df = pd.DataFrame(np.random.randint(1, 5, (100, 3)), columns=['a', 'b', 'c'])

# make it object cols
for col in df.columns:
    df[col] = df[col].astype(str)

ddf = dd.from_pandas(df, npartitions=3)

x_freq = ddf.copy()

for col_idx, col_name in enumerate(x_freq.columns):
    freq_enc = CountFrequencyEncoder(encoding_method='frequency')
    col_to_encode = x_freq[col_name].to_frame().compute()
    encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
    x_freq = dd.concat([x_freq, encoded_col], axis=1)

x_freq.head()
It runs fine, as I would expect: adding a pandas df to a dask df is no problem.
But when I try another ddf, there is an error:
x_freq = x.copy()
# npartitions = x_freq.npartitions
# x_freq = x_freq.repartition(npartitions=npartitions).reset_index(drop=True)

for col_idx, col_name in enumerate(x_freq.columns):
    freq_enc = CountFrequencyEncoder(encoding_method='frequency')
    col_to_encode = x_freq[col_name].to_frame().compute()
    encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
    x_freq = dd.concat([x_freq, encoded_col], axis=1)
    break

x_freq.head()
The error happens during the concat:
ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1
This is how I load the "error" ddf:
ddf = dd.read_parquet(os.path.join(dir_list[0], '*.parquet'), engine='pyarrow').repartition(partition_size='100MB')
I read that I should try repartitioning and/or resetting the index and/or using assign. None of them worked.
x_freq = x.copy()
in the second example plays the same role as
x_freq = ddf.copy()
in the first example, in the sense that x is just some ddf I'm trying to encode; it would be a lot of code to define it here.
Can anyone help, please?
Here's what I think might be going on.
Your parquet file probably doesn't have divisions information stored in it. You thus can't simply dd.concat along axis=1, since it's not clear how the partitions align.
You can check this with:
x_freq.known_divisions # is likely False
x_freq.divisions # is likely (None, None, None, None)
Since unknown divisions are the problem, you can re-create the issue by using the synthetic data in the first example
x_freq = ddf.clear_divisions().copy()
You might solve this problem by re-setting the index:
x_freq.reset_index().set_index(index_column_name)
where index_column_name is the name of the index column.
Consider also saving the data with the correct index afterwards so that it doesn't have to be calculated each time.
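A sketch of that save step, where index_column_name is the column referred to above and the output path is just a placeholder:

# Persist with the index (and hence divisions) computed once,
# so later reads don't have to redo the work.
x_freq = x_freq.reset_index().set_index(index_column_name)
x_freq.to_parquet("data/encoded_with_index")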
Note 1: Parallelization
By the way, since you're computing each column before working with it, you're not really utilizing dask's parallelization abilities. Here is a workflow that might utilize parallelization a bit better:
def count_frequency_encoder(s):
    return s.replace(s.value_counts(normalize=True).compute().to_dict())

frequency_columns = {
    f'{col_name}_freq': count_frequency_encoder(x_freq[col_name])
    for col_name in x_freq.columns}

x_freq = x_freq.assign(**frequency_columns)
Note 2: to_frame
A tiny tip:
x_freq[col_name].to_frame()
is equivalent to
x_freq[[col_name]]
I have a pandas dataframe that has several categorical fields.
SQLAlchemy throws an exception "The type of is not a SQLAlchemy type".
I've tried converting the object fields back to string, but I get the same error.
dfx = pd.DataFrame()
for col_name in df.columns:
    if df[col_name].dtype == 'object':
        dfx[col_name] = df[col_name].astype('str').copy()
    else:
        dfx[col_name] = df[col_name].copy()
    print(col_name, dfx[col_name].dtype)
.
dfx.to_sql('results', con=engine, dtype=my_dtypes, if_exists='append', method='multi', index=False)
The new dfx seems to have the same categoricals, despite being created as a new dataframe with .copy().
Also, as a side note, why does to_sql() generate a CREATE TABLE with CLOBs?
No need to use the copy() function here, and you should not have to convert from 'object' to 'str' either.
Are you writing to an Oracle database? The default output type for text data (including 'object') is CLOB. You can get around it by specifying the dtype to use. For example:
import pandas as pd
from sqlalchemy import types, create_engine
from sqlalchemy.exc import InvalidRequestError
conn = create_engine(...)
testdf = pd.DataFrame({'pet': ['dog', 'cat', 'mouse', 'dog', 'fish', 'pony', 'cat'],
                       'count': [2, 6, 12, 1, 45, 1, 3],
                       'x': [105.3, 98.7, 112.4, 3.6, 48.9, 208.9, -1.7]})

test_types = dict(zip(
    testdf.columns.tolist(),
    (types.VARCHAR(length=20), types.Integer(), types.Float())))
try:
    testdf.to_sql(name="test", schema="myschema",
                  con=conn,
                  if_exists='replace',  # 'append'
                  index=False,
                  dtype=test_types)
    print("Wrote final input dataset to table myschema.test")
except (ValueError, InvalidRequestError):
    print("Could not write to table 'test'.")
If you are not writing to Oracle, please specify your target database - perhaps someone else with experience in that DBMS can advise you.
What @eknumbat said is absolutely correct. For AWS Redshift, you can do the following. Note that you can find all of the SQLAlchemy datatypes here: https://docs.sqlalchemy.org/en/14/core/types.html
import pandas as pd
from sqlalchemy.types import VARCHAR, INTEGER, FLOAT
from sqlalchemy import create_engine
conn = create_engine(...)
testdf = pd.DataFrame({'pet': ['dog', 'cat', 'mouse', 'dog', 'fish', 'pony', 'cat'],
                       'count': [2, 6, 12, 1, 45, 1, 3],
                       'x': [105.3, 98.7, 112.4, 3.6, 48.9, 208.9, -1.7]})

test_types = {'pet': VARCHAR, 'count': INTEGER, 'x': FLOAT}

testdf.to_sql(name="test",
              schema="myschema",
              con=conn,
              if_exists='replace',
              index=False,
              dtype=test_types)
I have a dataframe with the dtypes shown below, and I want to insert it into a Postgres DB, but it fails with the error can't adapt type 'numpy.int64'.
id_code int64
sector object
created_date float64
updated_date float64
How can I convert these types to native Python types, such as from int64 (which is essentially 'numpy.int64') to a classic int, so that they would be acceptable to Postgres via the psycopg2 client?
data['id_code'].astype(np.int) defaults to int64
It is nonetheless possible to convert from one numpy type to another (e.g. from int to float):
data['id_code'].astype(float)
changes to
dtype: float64
The bottom line is that psycopg2 doesn't seem to understand NumPy datatypes. If anyone has ideas on how to convert them to classic types, that would be helpful.
Updated: Insertion to DB
def insert_many():
    """Add data to the table."""
    sql_query = """INSERT INTO classification(
                       id_code, sector, created_date, updated_date)
                   VALUES (%s, %s, %s, %s);"""
    data = pd.read_excel(fh, sheet_name=sheetname)
    data_list = list(data.to_records())
    conn = None
    try:
        conn = psycopg2.connect(db)
        cur = conn.cursor()
        cur.executemany(sql_query, data_list)
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
Add the following somewhere in your code:
import numpy
from psycopg2.extensions import register_adapter, AsIs
def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)
register_adapter(numpy.float64, addapt_numpy_float64)
register_adapter(numpy.int64, addapt_numpy_int64)
I had the same problem here and successfully solved it after transforming the series to an ndarray of int.
You can try the following:
data['id_code'].values.astype(int)
--
Update:
If the values include NaN, it still fails.
It seems that psycopg2 can't interpret the np.int64 format, so the following works for me.
import numpy as np
import psycopg2
from psycopg2.extensions import register_adapter, AsIs

register_adapter(np.int64, AsIs)
I'm not sure why your data_list contains NumPy data types, but the same thing happens to me when I run your code. Here is an alternative way to construct data_list so that integers and floats end up as their native Python types:
data_list = [list(row) for row in data.itertuples(index=False)]
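If your pandas version still hands back NumPy scalars there, an element-wise conversion is another option (a sketch; .item() turns a NumPy scalar into its native Python equivalent):

# Build rows of plain Python ints/floats/strings for psycopg2.
data_list = [
    tuple(v.item() if hasattr(v, "item") else v for v in row)
    for row in data.to_records(index=False)
]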
Alternate approach
I think you could accomplish the same thing in fewer lines of code by using pandas to_sql:
import sqlalchemy
import pandas as pd
data = pd.read_excel(fh, sheet_name=sheetname)
engine = sqlalchemy.create_engine("postgresql://username@hostname/dbname")
data.to_sql('classification', engine, if_exists='append', index=False)
I had the same issue and fixed it using:
df = df.convert_dtypes()