I am trying to load a stream of data from a Kafka topic into a BigQuery table. While I am able to source the stream from the Kafka topic and apply transformations to it, I am stuck on loading the transformed data into the BQ table.
Note: I am using Apache Beam's Python SDK with the DirectRunner (for now) to test things. Here's the code:
import os
import argparse
import json
import logging
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from beam_nuggets.io import kafkaio
def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bq_table',
        required=True,
        help=('Output BigQuery table for results specified as: '
              'PROJECT:DATASET.TABLE'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    bootstrap_server = "localhost:9092"
    kafka_topic = 'some_topic'
    consumer_config = {
        'bootstrap_servers': bootstrap_server,
        'group_id': 'etl',
        'topic': kafka_topic,
        'auto_offset_reset': 'earliest'
    }

    with beam.Pipeline(argv=pipeline_args) as p:
        events = (
            p | 'Read from kafka' >> kafkaio.KafkaConsume(
                    consumer_config=consumer_config, value_decoder=bytes.decode)
              | 'toDict' >> beam.MapTuple(lambda k, v: json.loads(v))
              | 'extract' >> beam.Map(lambda e: {'x': e['key1'], 'y': e['key2']})
            # | 'log' >> beam.ParDo(logging.info)  # if I uncomment this (for validation), I see data printed in my console log
        )

        _ = events | 'Load data to BQ' >> WriteToBigQuery(
            known_args.bq_table,
            schema='x:STRING, y:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            ignore_unknown_columns=True, method='STREAMING_INSERTS')

    p.run().wait_until_finish()


if __name__ == "__main__":
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials/file.json"
    logging.getLogger().setLevel(logging.DEBUG)
    logging.getLogger("kafka").setLevel(logging.INFO)
    run()
And this is how I run the above code:
python3 main.py \
--bq_table=<table_id> \
--streaming \
--runner=DirectRunner
I have also tried batch-mode insertion with WriteToBigQuery (method='FILE_LOADS') and a temp GCS location, but that didn't help either.
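For completeness, the FILE_LOADS attempt looked roughly like this (a sketch rather than my exact code; the bucket path is a placeholder, and triggering_frequency is only there because file loads on a streaming pipeline need a flush interval):

_ = events | 'Load data to BQ' >> WriteToBigQuery(
    known_args.bq_table,
    schema='x:STRING, y:STRING',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    method='FILE_LOADS',
    custom_gcs_temp_location='gs://<my-temp-bucket>/bq_load_tmp',  # placeholder bucket
    triggering_frequency=60)  # flush interval (seconds) for the load jobs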
There is no error even though I have enabled debug logs. So, it's getting really difficult to trace the issue to its source. What do you think I am missing?
Edit:
The python process does not end/exit until I interrupt it.
The Kafka consumer group lag is 0, which indicates that the data is being fetched and processed but not getting loaded to the BQ table.
Related
I'm trying to import an Excel sheet into BigQuery from my Downloads folder or Google Drive, but I am unable to do so. Is there any method to import a sheet from Google Drive?
XLSX is not included in the supported data formats for batch loading in BigQuery. A workaround is to convert the XLSX to CSV and then load it into BigQuery from your local data source. I did both in a single Python script using pandas and the BigQuery Python API.
See working code below:
import pandas as pd
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "<your-project>.<your-dataset>.<your-table>"

file_path = "<your-path>/20220621.xlsx"
converted_path = "<your-path>/20220621.csv"

# Converting XLSX to CSV
data_xls = pd.read_excel(file_path, index_col=None)
data_xls.to_csv(converted_path, encoding='utf-8', index=False)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

with open(converted_path, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
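If you want to skip the intermediate CSV entirely, the same client can load the pandas DataFrame directly (a sketch, assuming pandas and pyarrow are installed; table_id and file_path are placeholders as above):

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

table_id = "<your-project>.<your-dataset>.<your-table>"
file_path = "<your-path>/20220621.xlsx"

# Read the XLSX straight into a DataFrame and load it without a CSV round-trip.
data_xls = pd.read_excel(file_path, index_col=None)
job = client.load_table_from_dataframe(data_xls, table_id)  # schema is inferred
job.result()  # Waits for the job to complete.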
I followed the official tutorial from Microsoft: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-score-model-predict-spark-pool
When I execute:
# Bind model within Spark session
model = pcontext.bind_model(
    return_types=RETURN_TYPES,
    runtime=RUNTIME,
    model_alias="Sales",      # This alias will be used in the PREDICT call to refer to this model
    model_uri=AML_MODEL_URI,  # In case of AML, it will be AML_MODEL_URI
    aml_workspace=ws          # This is only for AML. In case of ADLS, this parameter can be removed
).register()
I got: No module named 'azureml.automl'
As per the repro from my end, the code you have shared works as expected and I don't see the error message you are experiencing.
I even tested the same code on a newly created Apache Spark 3.1 runtime and it works as expected.
I would suggest creating a new cluster and checking whether you are able to run the above code there.
I solved it. In my case it works best like this:
Imports
#Import libraries
from pyspark.sql.functions import col, pandas_udf,udf,lit
from notebookutils.mssparkutils import azureML
from azureml.core import Workspace, Model
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.model import Model
import joblib
import pandas as pd
ws = azureML.getWorkspace("AzureMLService")
spark.conf.set("spark.synapse.ml.predict.enabled","true")
Predict function
def forecastModel():
    model_path = Model.get_model_path(model_name="modelName", _workspace=ws)
    modeljob = joblib.load(model_path + "/model.pkl")
    validation_data = spark.read.format("csv") \
        .option("header", True) \
        .option("inferSchema", True) \
        .option("sep", ";") \
        .load("abfss://....csv")
    validation_data_pd = validation_data.toPandas()
    predict = modeljob.forecast(validation_data_pd)
    return predict
I'm looking for a way to debug a Spark pandas UDF in VS Code and PyCharm Community Edition (place a breakpoint and stop inside the UDF). At the moment, when a breakpoint is placed inside the UDF, the debugger doesn't stop.
The reference below describes Local mode and Distributed mode.
I'm trying at least to debug in Local mode. In PyCharm/VS Code there should be a way to debug a local environment via "Attach to Local Process"; I just cannot figure out how.
At the moment I cannot find any answer on how to attach the PySpark debugger to a local process inside a UDF in VS Code (my dev IDE).
I found only the PyCharm examples below:
Attach to local process
How can PySpark be called in debug mode?
When I try to attach to the process, I get the message below in PyCharm. In VS Code I get a message that the process cannot be attached.
Attaching to a process with PID=33,692
/home/usr_name/anaconda3/envs/yf/bin/python3.8 /snap/pycharm-community/223/plugins/python-ce/helpers/pydev/pydevd_attach_to_process/attach_pydevd.py --port 40717 --pid 33692
WARNING: The 'kernel.yama.ptrace_scope' parameter value is not 0, attach to process may not work correctly.
Please run 'sudo sysctl kernel.yama.ptrace_scope=0' to change the value temporary
or add the 'kernel.yama.ptrace_scope = 0' line to /etc/sysctl.d/10-ptrace.conf to set it permanently.
Process finished with exit code 0
Server stopped.
pyspark_xray https://github.com/bradyjiang/pyspark_xray
With this package it is possible to debug RDDs running on workers, but I was not able to adapt the package to debug UDFs.
Example code where the breakpoint doesn't stop inside the UDF pandas_function(url_json):
import pandas as pd
import pyspark
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder.appName("test") \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext

# Create initial dataframe respond_sdf
d_list = [('api_1', "{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }"),
          (' api_2', "{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }")]

schema = StructType([
    StructField('url', StringType(), True),
    StructField('content', StringType(), True)
])

jsons = sc.parallelize(d_list)
respond_sdf = spark.createDataFrame(jsons, schema)

# Schema of the rows returned by the pandas UDF
out_schema = StructType([
    StructField('api', StringType(), True),
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True)
])

# Pandas UDF
def pandas_function(url_json):
    # Here I want to place breakpoint
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df

# Pandas UDF transformation applied to respond_sdf
respond_sdf.groupby(F.monotonically_increasing_id()).applyInPandas(pandas_function, schema=out_schema).show()
This example demonstrates how to use the excellent pyspark_xray library to step into UDF functions passed into the DataFrame.mapInPandas function:
https://github.com/bradyjiang/pyspark_xray/blob/master/demo_app02/driver.py
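Independent of pyspark_xray, one workaround that at least lets a breakpoint fire in the UDF body is to call pandas_function directly on a small pandas DataFrame shaped like one group, outside of Spark (a sketch, not part of the linked demo; it runs as plain Python, so the VS Code or PyCharm debugger stops in it):

import pandas as pd

# One group of respond_sdf as a plain pandas DataFrame (same columns: url, content).
sample_group = pd.DataFrame({
    'url': ['api_1'],
    'content': ["{'api': ['api_1', 'api_1', 'api_1'], 'A': [1, 2, 3], 'B': [4, 5, 6] }"],
})

# No executor is involved, so a breakpoint inside pandas_function is hit normally.
result = pandas_function(sample_group)
print(result)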
I have developed a Spark Streaming app where I have a data stream of JSON strings.
import json

from pyspark import SparkContext
from pyspark import sql
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
# MQTTUtils comes from the external spark-streaming-mqtt connector (Apache Bahir);
# the import path below matches the Spark 1.x module layout.
from pyspark.streaming.mqtt import MQTTUtils


# called on each batch's RDD (see foreachRDD below)
def doSomething(time, rdd):
    data = rdd.toDF().toPandas()


sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)

# batch width in time
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")

# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"

# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)

# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))

# create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))

# in each batch do something
sqlDS.foreachRDD(doSomething)

# run
stream.start()
stream.awaitTermination()
The code above works as expected: I receive some JSON strings and, for each batch, I can convert the RDD to a DataFrame and then to a pandas DataFrame.
So far so good.
The problem comes if I want to add a different schema to the DataFrame.
The method toDF() assumes schema=None when it calls sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), it is obviously not defined. If I try to make it available there with a global variable, I get the typical error that it cannot be serialized.
I have also read that the sqlContext can only be used in the Spark driver and not in the workers.
So the question is: how does toDF() work in the first place, given that it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!
Answering my own question...
define the following:
from pyspark.sql import SparkSession

def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then from the worker just call:
spark = getSparkSessionInstance(rdd.context.getConf())
Taken from the "DataFrame and SQL Operations" section of the Spark Streaming Programming Guide.
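With that helper, doSomething() can rebuild the DataFrame with an explicit schema instead of relying on toDF()'s inference (a sketch; the field names match the Rows above and the types are just an example):

from pyspark.sql.types import StructType, StructField, StringType

# Example schema; adjust the field types to your JSON payload.
mySchema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
])

def doSomething(time, rdd):
    if rdd.isEmpty():
        return
    spark = getSparkSessionInstance(rdd.context.getConf())
    df = spark.createDataFrame(rdd, schema=mySchema)
    data = df.toPandas()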
I'm trying to enqueue a basic job in Redis using python-rq, but it throws this error:
"ValueError: Functions from the main module cannot be processed by workers"
Here is my program:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
Break the provided code into two files:
count_words.py:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())
and main.py (where you'll import the required function):
from rq import Connection, Queue
from redis import Redis
from count_words import count_words_at_url  # added import!

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
I always separate the tasks from the logic that runs those tasks into different files; it's just better organization. Also note that you can define a class of tasks and import/schedule tasks from that class instead of the (over-simplified) structure I suggest above. This should get you going.
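For example, the class-based layout could look something like this (a sketch; the file and class names are made up, and it assumes an RQ version that resolves functions by their qualified name):
tasks.py:

import requests

class WordCountTasks:
    # Groups related jobs; RQ only needs the callable to be importable by path.
    @staticmethod
    def count_words_at_url(url):
        resp = requests.get(url)
        return len(resp.text.split())

main.py:

from redis import Redis
from rq import Queue

from tasks import WordCountTasks

q = Queue(connection=Redis())
job = q.enqueue(WordCountTasks.count_words_at_url, 'http://nvie.com')
print(job)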
Also see here to confirm you're not the first to struggle with this example. RQ is great once you get the hang of it.
Currently there is a bug in RQ which leads to this error. You will not be able to pass functions to enqueue from the same file without explicitly importing them.
Just add from app import count_words_at_url above the enqueue function:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)

from app import count_words_at_url
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
The other way is to have the functions in a separate file and import them.