Spark: How to debug pandas-UDF in VS Code

I'm looking for a way to debug a Spark pandas UDF in VS Code and PyCharm Community Edition (place a breakpoint and stop inside the UDF). At the moment the debugger doesn't stop when a breakpoint is placed inside the UDF.
The reference below describes Local mode and Distributed mode.
I'm trying to debug at least in Local mode. In PyCharm/VS Code there should be a way to debug a local environment via "Attach to Local Process", but I can't figure out how.
At the moment I cannot find any answer on how to attach the PySpark debugger to a local process inside a UDF in VS Code (my dev IDE).
I found only the PyCharm examples below.
Attach to local process / How can PySpark be called in debug mode?
When I try to attach to the process I get the message below in PyCharm. In VS Code I get a message that the process cannot be attached.
Attaching to a process with PID=33,692
/home/usr_name/anaconda3/envs/yf/bin/python3.8 /snap/pycharm-community/223/plugins/python-ce/helpers/pydev/pydevd_attach_to_process/attach_pydevd.py --port 40717 --pid 33692
WARNING: The 'kernel.yama.ptrace_scope' parameter value is not 0, attach to process may not work correctly.
Please run 'sudo sysctl kernel.yama.ptrace_scope=0' to change the value temporary
or add the 'kernel.yama.ptrace_scope = 0' line to /etc/sysctl.d/10-ptrace.conf to set it permanently.
Process finished with exit code 0
Server stopped.
pyspark_xray https://github.com/bradyjiang/pyspark_xray
With this package it is possible to debug RDDs running on workers, but I was not able to adapt the package to debug UDFs.
Example code; the breakpoint doesn't stop inside the UDF pandas_function(url_json):
import pandas as pd
import pyspark
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = pyspark.sql.SparkSession.builder.appName("test") \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext

# Create initial dataframe respond_sdf
d_list = [('api_1', "{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }"),
          (' api_2', "{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }")]

schema = StructType([
    StructField('url', StringType(), True),
    StructField('content', StringType(), True)
])

jsons = sc.parallelize(d_list)
respond_sdf = spark.createDataFrame(jsons, schema)

# Pandas UDF
def pandas_function(url_json):
    # Here I want to place a breakpoint
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df

# Pandas UDF transformation applied to respond_sdf; the output schema must
# match the columns the UDF returns
respond_sdf.groupby(F.monotonically_increasing_id()) \
    .applyInPandas(pandas_function, schema='api string, A long, B long') \
    .show()

This example demonstrates how to use the excellent pyspark_xray library to step into UDF functions passed into DataFrame.mapInPandas:
https://github.com/bradyjiang/pyspark_xray/blob/master/demo_app02/driver.py
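For completeness, here is a hedged sketch (not from the original post) of two ways that can get a breakpoint to trigger for the UDF body in local mode; debugpy has to be pip-installed separately, and the port 5678 is an arbitrary choice.

import pandas as pd

def pandas_function(url_json):
    # same body as the UDF in the question
    return pd.DataFrame(eval(url_json['content'][0]))

# Option 1: call the UDF body directly on a small pandas DataFrame, outside
# Spark, so a normal IDE breakpoint inside it works.
sample = pd.DataFrame({'url': ['api_1'],
                       'content': ["{'api': ['api_1'], 'A': [1], 'B': [4]}"]})
print(pandas_function(sample))

# Option 2: attach VS Code to the Python worker from inside the UDF
# (assumes a "Python: Remote Attach" debug configuration on localhost:5678,
# and master('local[1]') so only a single worker calls debugpy.listen once).
def pandas_function_debug(url_json):
    import debugpy
    debugpy.listen(5678)        # start a debug server inside the worker process
    debugpy.wait_for_client()   # block until VS Code attaches
    debugpy.breakpoint()        # execution pauses here once attached
    return pd.DataFrame(eval(url_json['content'][0]))

Option 1 is the least intrusive way to exercise the UDF logic; Option 2 actually stops inside the Spark worker, at the cost of the job blocking until the debugger attaches.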

Related

Fail to stream data from Kafka to BigQuery with Apache Beam

I am trying to load a stream of data from a Kafka topic into a BigQuery table. While I am able to consume the stream from the Kafka topic and run transformations on it, I am stuck on loading the transformed data into a BQ table.
Note: I am using Apache Beam's Python SDK with the DirectRunner (for now) to test things. Here's the code:
import os
import argparse
import json
import logging

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from beam_nuggets.io import kafkaio


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bq_table',
        required=True,
        help=('Output BigQuery table for results specified as: '
              'PROJECT:DATASET.TABLE'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    bootstrap_server = "localhost:9092"
    kafka_topic = 'some_topic'
    consumer_config = {
        'bootstrap_servers': bootstrap_server,
        'group_id': 'etl',
        'topic': kafka_topic,
        'auto_offset_reset': 'earliest'
    }

    with beam.Pipeline(argv=pipeline_args) as p:
        events = (
            p | 'Read from kafka' >> kafkaio.KafkaConsume(
                    consumer_config=consumer_config, value_decoder=bytes.decode)
              | 'toDict' >> beam.MapTuple(lambda k, v: json.loads(v))
              | 'extract' >> beam.Map(lambda e: {'x': e['key1'], 'y': e['key2']})
              # | 'log' >> beam.ParDo(logging.info)  # if I uncomment this (for validation), I see data printed in my console log
        )

        _ = events | 'Load data to BQ' >> WriteToBigQuery(
            known_args.bq_table,
            schema='x:STRING, y:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            ignore_unknown_columns=True, method='STREAMING_INSERTS')

        p.run().wait_until_finish()


if __name__ == "__main__":
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials/file.json"
    logging.getLogger().setLevel(logging.DEBUG)
    logging.getLogger("kafka").setLevel(logging.INFO)
    run()
And this is how I run the above code:
python3 main.py \
--bq_table=<table_id> \
--streaming \
--runner=DirectRunner
I have also tried batch-mode insertion with WriteToBigQuery (method=FILE_LOADS), providing a temporary GCS location, but that didn't help either.
There is no error even though I have enabled debug logs. So, it's getting really difficult to trace the issue to its source. What do you think I am missing?
Edit:
The Python process does not end/exit until I interrupt it.
The Kafka consumer group lag is 0, which indicates that the data is being fetched and processed but not getting loaded to the BQ table.
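For reference, a hedged sketch of the FILE_LOADS variant mentioned above, written so it could replace the STREAMING_INSERTS write inside run(); the GCS bucket path is hypothetical, and with a streaming source FILE_LOADS also needs a triggering_frequency so Beam knows how often to issue load jobs.

        # Sketch only: slots into the pipeline above in place of the
        # STREAMING_INSERTS write; gs://my-temp-bucket/bq_load_tmp is a
        # hypothetical staging location.
        _ = events | 'Load data to BQ (batch loads)' >> WriteToBigQuery(
            known_args.bq_table,
            schema='x:STRING, y:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            method='FILE_LOADS',
            custom_gcs_temp_location='gs://my-temp-bucket/bq_load_tmp',
            triggering_frequency=60)  # seconds between load jobs when streaming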

Testing a Jupyter Notebook

I am trying to come up with a method to test a number of Jupyter notebooks. A test should run when a new notebook is added in a GitHub branch and submitted as a pull request. The tests are not that complicated; they mostly just check that the notebook runs end-to-end without any errors, plus maybe a few asserts. However:
There are certain calls in some cells that need to be mocked, e.g. a call to download the data from a database.
There may be some magic cells in the notebooks which run a pip command or something else.
I am open to using any testing library, such as pytest or unittest, although pytest is preferred.
I looked at a few libraries for testing notebooks such as nbmake, treon, and testbook, but I was unable to make them work. I also tried to convert the notebook to a python file, but the magic cells were converted to a get_ipython().run_cell_magic(...) call which became an issue, since pytest uses python and not ipython, and get_ipython() is only available in ipython.
So, I am wondering what is a good way to test jupyter notebooks with all of that in mind. Any help is appreciated.
One straightforward approach I've already used is to execute the entire notebook with nbconvert.
A notebook failed.ipynb raising an exception will result in a failed run thanks to the --execute option that tells nbconvert to execute the notebook prior to its conversion.
jupyter nbconvert --to notebook --execute failed.ipynb
# ...
# Exception: FAILED
echo $?
# 1
Another correct notebook passed.ipynb will result in a successful export.
jupyter nbconvert --to notebook --execute passed.ipynb
# [NbConvertApp] Converting notebook passed.ipynb to notebook
# [NbConvertApp] Writing 1172 bytes to passed.nbconvert.ipynb
echo $?
# 0
Cherry on the cake, you can do the same through the API and so wrap it in Pytest!
import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor

@pytest.mark.parametrize("notebook", ["passed.ipynb", "failed.ipynb"])
def test_notebook_exec(notebook):
    with open(notebook) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
    try:
        assert ep.preprocess(nb) is not None, f"Got empty notebook for {notebook}"
    except Exception:
        assert False, f"Failed executing {notebook}"
Running the test gives:
pytest test_nbconv.py
# FAILED test_nbconv.py::test_notebook_exec[failed.ipynb] - AssertionError: Failed executing failed.ipynb
# PASSED test_nbconv.py::test_notebook_exec[passed.ipynb]
Notes
There are several output formats; I've used notebook here.
This doesn't convert the notebook to a different format per se; instead, it allows running nbconvert preprocessors on a notebook and/or conversion to other notebook formats.
The Python code example is just a quick draft; it can be largely improved, as in the sketch below.
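A possible refinement (hedged, not from the original answer): discover every notebook in the repository with glob, so newly added notebooks are picked up automatically, and let the execution error propagate so pytest shows the real traceback instead of a bare assert.

import glob

import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor

@pytest.mark.parametrize("notebook", sorted(glob.glob("**/*.ipynb", recursive=True)))
def test_notebook_exec(notebook):
    with open(notebook) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    ep.preprocess(nb)  # raises CellExecutionError if any cell fails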
Here is my own solution using testbook. Let's say I have a notebook called my_notebook.ipynb with the following content:
The trick is to inject a cell before my call to bigquery.Client and mock it:
from testbook import testbook

@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
        import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)
    x = tb.get('x')
    assert x == 7

How to run a pandas-Koalas program using spark-submit (Windows)?

I have a pandas DataFrame (sample program) converted to a Koalas DataFrame, and now I want to execute it on a Spark cluster (Windows standalone). When I run it from the command prompt as
spark-submit --master local hello.py, I get the error ModuleNotFoundError: No module named 'databricks'
import pandas as pd
from databricks import koalas as ks
workbook_loc = "c:\\2020\\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
kdf = ks.from_pandas(df)
print(kdf)
What should I change so that I can make use of Spark cluster features? My actual program, written in pandas, does many things, and I want to use the Spark cluster to see performance improvements.
You should install koalas via the cluster's admin UI (Libraries/PyPI); if you run pip install koalas on the cluster, it won't work.
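For a local/standalone setup like the one in the question, the ModuleNotFoundError usually means spark-submit is using a Python interpreter that doesn't have koalas installed. A minimal, hedged check (run it with the same spark-submit command) is:

import sys

# Print the interpreter the driver is actually running under, then try the
# import that failed; install koalas into this environment (or point
# PYSPARK_PYTHON at an environment that has it) if the import fails.
print("Driver Python:", sys.executable)
try:
    from databricks import koalas as ks
    print("koalas version:", ks.__version__)
except ModuleNotFoundError:
    print("koalas is not installed in this interpreter")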

How to run pure pandas code in Spark and see activity in the Spark web UI?

Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
The issue is that I am able to run it from the command prompt using spark-submit --master spark://sparkcas1:7077 project.py and get results, but I don't see any worker activity (status), nor Running Application / Completed Application status, in the Spark web UI at :7077.
In the pandas program I included only one Spark statement, from pyspark import SparkContext:
import pandas as pd
from pyspark import SparkContext
# reading the Excel workbook
workbook_loc = "c:\\2020\\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
# converting to dict
print(df)
What could be the issue?
Pandas code runs only on the driver and no workers are involved, so there is no point in running plain pandas code inside Spark.
If you are using Spark 3.0 you can run your pandas code in a distributed way by converting it to Koalas, as sketched below.
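A minimal sketch of that idea (assuming Spark 3.0 with the koalas package installed): the pandas read still happens on the driver, but once the frame is converted to Koalas, operations on it run as Spark jobs and therefore show up in the Spark web UI.

import pandas as pd
import databricks.koalas as ks

workbook_loc = "c:\\2020\\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')  # plain pandas, driver-only

kdf = ks.from_pandas(df)   # now backed by a Spark DataFrame
print(kdf.count())         # executes as a Spark job, visible in the web UI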

Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

I'm doing calculations on a cluster, and at the end, when I ask for summary statistics on my Spark dataframe with df.describe().show(), I get an error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
In my Spark configuration I already tried to increase the aforementioned parameter:
spark = (SparkSession
         .builder
         .appName("TV segmentation - dataprep for scoring")
         .config("spark.executor.memory", "25G")
         .config("spark.driver.memory", "40G")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "12")
         .config("spark.driver.maxResultSize", "3g")
         .config("spark.kryoserializer.buffer.max.mb", "2047mb")
         .config("spark.rpc.message.maxSize", "1000mb")
         .getOrCreate())
I also tried to repartition my dataframe using:
dfscoring=dfscoring.repartition(100)
but I still keep getting the same error.
My environment: Python 3.5, Anaconda 5.0, Spark 2.
How can I avoid this error?
I was in the same trouble, then I solved it.
The cause is spark.rpc.message.maxSize, which defaults to 128M; you can change it when launching a Spark client. I work in pyspark and set the value to 1024, so I wrote:
pyspark --master yarn --conf spark.rpc.message.maxSize=1024
That solved it.
I had the same issue and it wasted a day of my life that I am never getting back. I am not sure why this is happening, but here is how I made it work for me.
Step 1: Make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
It turned out that Python on the workers (2.6) was a different version than on the driver (3.6). You should check that the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I fixed it by simply switching my kernel from Python 3 Spark 2.2.0 to Python Spark 2.3.1 in Jupyter. You may have to set it up manually. Here is how to make sure your PySpark is set up correctly: https://mortada.net/3-easy-steps-to-set-up-pyspark.html
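If you are not in a managed notebook environment, a minimal sketch of Step 1 (the interpreter paths here are hypothetical placeholders) is to point both variables at the same Python before the SparkSession is created:

import os

# Hypothetical paths: use the interpreter you actually want on both sides.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"         # Python used by the executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.6"  # Python used by the driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("version-check").getOrCreate()
print(spark.sparkContext.pythonVer)  # confirm the Python version Spark picked up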
Step 2: If that doesn't work, try working around it.
This kernel switch worked for DFs that I hadn't added any columns to:
spark_df -> pandas_df -> back_to_spark_df ... but it didn't work on the DFs where I had added 5 extra columns. So what I tried, and what worked, was the following:
# 1. Select only the new columns:
df_write = df[['hotel_id','neg_prob','prob','ipw','auc','brier_score']]
# 2. Convert this DF into Spark DF:
df_to_spark = spark.createDataFrame(df_write)
df_to_spark = df_to_spark.repartition(100)
df_to_spark.registerTempTable('df_to_spark')
# 3. Join it to the rest of your data:
final = df_to_spark.join(data,'hotel_id')
# 4. Then write the final DF.
final.write.saveAsTable('schema_name.table_name',mode='overwrite')
Hope that helps!
I had the same problem, but using Watson Studio. My solution was:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc.stop()
configura = SparkConf().set('spark.rpc.message.maxSize', '256')
sc = SparkContext.getOrCreate(conf=configura)
spark = SparkSession.builder.getOrCreate()
I hope it helps someone...
I had faced the same issue while converting a Spark DF to a pandas DF.
I am working on Azure Databricks; first check the current value in the Spark config using:
spark.conf.get("spark.rpc.message.maxSize")
Then we can increase the value:
spark.conf.set("spark.rpc.message.maxSize", "500")
For those folks who are looking for an AWS Glue (PySpark-based) way of doing this, the code snippet below might be useful:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark import SparkConf
myconfig = SparkConf().set('spark.rpc.message.maxSize', '256')
# SparkConf can be used directly with its .set method
sc = SparkContext(conf=myconfig)
glueContext = GlueContext(sc)
..
..