SQL in Spark Structured Streaming - apache-spark-sql

I am exploring Structured Streaming by doing a small POC. Below is the code that I have written so far. However, I would like to validate some answers that I could not find in the Spark documentation (I may have missed them).
Validated so far:
Can we process a SQL query dynamically or conditionally? Yes, I could pass the SQL query as an argument and start the execution.
Can SQL queries run in parallel? Yes, as per (How does Structured Streaming execute separate streaming queries (in parallel or sequentially)?)
Need to validate:
What are the limitations of the SQL queries? I found that we cannot run every type of SQL statement that we would normally run against a relational database, for example partitioning.
Can execution of a particular SQL query be terminated conditionally?
Can anyone guide me on the limitations I need to consider while generating SQL queries? I know it is a very broad question, but any guidance that points me in the right direction will be very helpful.
The POC code.
"""
Run the example
`$ bin/spark-submit examples/src/main/python/sql/streaming/structured_kafka_SQL_query.py \
host1:port1,host2:port2 subscribe topic1,topic2`
"""
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql import window as w
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("""
        Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
        """, file=sys.stderr)
        sys.exit(-1)
    bootstrapServers = sys.argv[1]
    subscribeType = sys.argv[2]
    topics = sys.argv[3]
    spark = SparkSession\
        .builder\
        .appName("StructuredKafkaWordCount")\
        .getOrCreate()
    spark.sparkContext.setLogLevel('WARN')
    schema = StructType([
        StructField("p1", StringType(), True),
        StructField("p2", StringType(), True),
        StructField("p3", StringType(), True)
    ])
    lines = spark\
        .readStream\
        .format("kafka")\
        .option("kafka.bootstrap.servers", bootstrapServers)\
        .option(subscribeType, topics)\
        .load()\
        .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))\
        .select(col("parsed_value.*"))
    query = "select count(*), p1, p2 from events group by p1, p2"
    query1 = "select count(*), p2, p3 from events group by p2, p3"
    query_list = [query, query1]  # it can be passed as an argument.
    lines.createOrReplaceTempView("events")
    for q in query_list:
        spark.sql(q).writeStream\
            .outputMode('complete')\
            .format('console')\
            .start()
    spark.streams.awaitAnyTermination()
Please let me know if my question is still unclear; I can update it accordingly.

Answering one part of my own question:
Can execution of a particular SQL query be terminated conditionally?
Yes, Spark provides a streaming query management API to stop streaming queries:
StreamingQuery.stop()
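For illustration, here is a minimal sketch of how a particular query could be stopped conditionally while the others keep running, assuming each query is started with queryName() so it can be looked up later (the query names and the stop condition are hypothetical):
for i, q in enumerate(query_list):
    spark.sql(q).writeStream\
        .queryName("events_query_{}".format(i))\
        .outputMode('complete')\
        .format('console')\
        .start()
# later, e.g. when some external condition is met
for active_query in spark.streams.active:
    if active_query.name == "events_query_1":  # hypothetical condition
        active_query.stop()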

Related

Converting a Pandas DataFrame back to a Spark DataFrame after first converting the other way around

We have a lakehouse architecture with Hive metastore tables. I want to do processing on this data, so I select my data as a Spark DataFrame.
The specific processing step I wanted to achieve was parsing date columns in my Spark DataFrames that come in a rather strange format: /Date(1582959943313)/, where the number inside /Date(xx)/ is milliseconds since the epoch. I thought I was being clever by converting my Spark DF with toPandas() and then processing the dates:
df_accounts = spark.sql("SELECT * FROM database.accounts")
df_accounts_pandas = df_accounts.toPandas()
df_accounts_pandas['ControlledDate'] = df_accounts_pandas['ControlledDate'].str.slice(start=6, stop=-2)
df_accounts_pandas['Created'] = df_accounts_pandas['Created'].str.slice(start=6, stop=-2)
df_accounts_pandas['ControlledDate'] = pd.to_datetime(df_accounts_pandas['ControlledDate'], unit='ms', origin='unix')
df_accounts_pandas['Created'] = pd.to_datetime(df_accounts_pandas['Created'], unit='ms', origin='unix')
This works fine. Now the next step would be to convert the df back to a Spark Dataframe, and be done with it.
df_accounts = spark.createDataFrame(df_accounts_pandas)
This throws a ValueError: Some of types cannot be determined after inferring
This SO question tells me to either manually define a schema or drop null columns. But why do I have to make this choice? I don't understand why I was able to do the conversion one way but cannot convert back.
Are there any other workarounds? My tables have hundreds of columns, so I don't want to define a schema manually. I don't know whether columns that are NULL now will still be NULL in the future.
P.S. - I am rather new to the Spark ecosystem - is it even a good idea to do processing in pandas like this? (I would like to, since it has more options than regular PySpark.) Or are there better ways to use pandas functionality on Spark DataFrames?
We had this requirement to transform data back and forth between Spark and pandas, and we achieved it by serialising to parquet files.
We chose this path because toPandas() kept crashing and spark.createDataFrame() had schema mapping issues like the ones you are facing.
For a dataset of size (1M, 300), the Spark write took about an hour, but the rest of the operations were quicker.
Spark DF to Pandas DF:
# sdf is spark dataframe. Serialise to parquet
sdf.repartition(1).write.mode("overwrite").parquet("dbfs:/file/path")
# import in Pandas
pdf = pd.read_parquet("/dbfs/file/path/part-0000xxx.parquet")
Pandas DF to Spark DF:
# pdf is pandas dataframe. Serialize to parquet
pdf.to_parquet("/dbfs/file/path/df.parquet")
# import in Spark
sdf = spark.read.parquet("dbfs:/file/path")
Since you have generated new columns, you can specify the new schema and convert a pandas df back to spark df.
from pyspark.sql.types import *
accounts_new_schema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", IntegerType(), True),
    StructField("col4", IntegerType(), True),
    StructField("col5", StringType(), True),
    StructField("col6", StringType(), True),
    StructField("col7", IntegerType(), True),
    StructField("col8", IntegerType(), True)
])
spdf = spark.createDataFrame(df_accounts_pandas, schema=accounts_new_schema)
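As a side note, the /Date(1582959943313)/ strings can also be parsed without leaving Spark, which avoids the pandas round trip entirely. A minimal sketch, assuming ControlledDate and Created are string columns in exactly that format (column names taken from the question):
from pyspark.sql.functions import regexp_extract, col
df_accounts = spark.sql("SELECT * FROM database.accounts")
for c in ["ControlledDate", "Created"]:
    # extract the millisecond epoch from /Date(...)/ and reinterpret it as a timestamp
    millis = regexp_extract(col(c), r"/Date\((\d+)\)/", 1).cast("long")
    df_accounts = df_accounts.withColumn(c, (millis / 1000).cast("timestamp"))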

toPandas() fails even on small dataset in local pyspark

I have read a lot about Spark's memory usage when doing things like collect() or toPandas() (like here). The common wisdom is to use them only on a small dataset. The question is: how small is small enough for Spark to handle?
I run locally (for testing) with pyspark, with the driver memory set to 20g (I have 32g on my 16-core Mac), but toPandas() crashes even with a dataset as small as 20K rows! That cannot be right, so I suspect I am doing something wrong (probably a setting). This is the simplified code to reproduce the error:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# setting the number of rows for the CSV file
N = 20000
ncols = 7
c_name = 'ABCDEFGHIJKLMNOPQRSTUVXYWZ'
# creating a pandas dataframe (df)
df = pd.DataFrame(np.random.randint(999,999999,size=(N, ncols)), columns=list(c_name[:ncols]))
file_name = 'random.csv'
# export the dataframe to csv using comma delimiting
df.to_csv(file_name, index=False)
## Load the csv in spark
df = spark.read.format('csv').option('header', 'true').load(file_name)#.limit(5000)#.coalesce(2)
## some checks
n_parts = df.rdd.getNumPartitions()
print('Number of partitions:', n_parts)
print('Number of rows:', df.count())
## convert spark df -> pandas via toPandas()
df_p = df.toPandas()
print('With pandas:',len(df_p))
I run this within Jupyter and get errors like:
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.0.104:61536
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
...
My Spark local settings are (everything else is default):
('spark.driver.host', '192.168.0.104')
('spark.driver.memory', '20g')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.app.id', 'local-1618499935279')
('spark.driver.port', '55115')
('spark.ui.showConsoleProgress', 'true')
('spark.app.name', 'pyspark-shell')
('spark.driver.maxResultSize', '4g')
Is my setup wrong, or is it expected that even 20g of driver memory can't handle a small dataframe with 20K rows and 7 columns? Will repartitioning help?
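One thing worth checking (just a sketch, not a confirmed fix) is whether Arrow-based conversion is enabled, since it changes how toPandas() transfers data to the driver. The config name below is the Spark 3.x one, and the memory values simply mirror the question:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.driver.memory", "20g") \
    .config("spark.driver.maxResultSize", "4g") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()
df = spark.read.format('csv').option('header', 'true').load(file_name)
df_p = df.toPandas()  # same conversion as above, now via Arrow if available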

Spark: How to debug pandas-UDF in VS Code

I'm looking for a way to debug Spark pandas UDFs in VS Code and PyCharm Community edition (place a breakpoint and stop inside the UDF). At the moment, when a breakpoint is placed inside the UDF, the debugger doesn't stop.
In the reference below, both Local mode and Distributed mode are described.
I'm trying at least to debug in Local mode. In PyCharm/VS Code there should be a way to debug a local process via "Attach to Local Process"; I just cannot figure out how.
At the moment I cannot find any answer on how to attach the PySpark debugger to a local process inside a UDF in VS Code (my dev IDE).
I found only the examples below, for PyCharm.
Attach to local process: How can PySpark be called in debug mode?
When I try to attach to the process, I get the message below in PyCharm. In VS Code I get a message that the process cannot be attached.
Attaching to a process with PID=33,692
/home/usr_name/anaconda3/envs/yf/bin/python3.8 /snap/pycharm-community/223/plugins/python-ce/helpers/pydev/pydevd_attach_to_process/attach_pydevd.py --port 40717 --pid 33692
WARNING: The 'kernel.yama.ptrace_scope' parameter value is not 0, attach to process may not work correctly.
Please run 'sudo sysctl kernel.yama.ptrace_scope=0' to change the value temporary
or add the 'kernel.yama.ptrace_scope = 0' line to /etc/sysctl.d/10-ptrace.conf to set it permanently.
Process finished with exit code 0
Server stopped.
pyspark_xray https://github.com/bradyjiang/pyspark_xray
With this package it is possible to debug RDDs running on workers, but I was not able to adapt the package to debug UDFs.
Example code; the breakpoint doesn't stop inside the UDF pandas_function(url_json):
import pandas as pd
import pyspark
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = pyspark.sql.SparkSession.builder.appName("test") \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext
# Create initial dataframe respond_sdf
d_list = [('api_1', "{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }"),
          (' api_2', "{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }")]
schema = StructType([
    StructField('url', StringType(), True),
    StructField('content', StringType(), True)
])
jsons = sc.parallelize(d_list)
respond_sdf = spark.createDataFrame(jsons, schema)
# Pandas UDF
def pandas_function(url_json):
    # Here I want to place a breakpoint
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df
# Schema of the frame returned by pandas_function
out_schema = StructType([
    StructField('api', StringType(), True),
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True)
])
# Pandas UDF transformation applied to respond_sdf
respond_sdf.groupby(F.monotonically_increasing_id()).applyInPandas(pandas_function, schema=out_schema).show()
This example demonstrates how to use the excellent pyspark_xray library to step into UDF functions passed into the DataFrame.mapInPandas function:
https://github.com/bradyjiang/pyspark_xray/blob/master/demo_app02/driver.py
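A pragmatic workaround that does not depend on attaching to a Spark worker at all: the function passed to applyInPandas is plain Python that receives a pandas DataFrame, so it can be called directly with a hand-built sample frame and debugged like any other function (the sample data below is made up to mirror one group of respond_sdf):
import pandas as pd
# one group of respond_sdf, built by hand for local debugging
sample = pd.DataFrame({
    "url": ["api_1"],
    "content": ["{'api': ['api_1'], 'A': [1], 'B': [4]}"],
})
# breakpoints inside pandas_function are hit here, because no worker process is involved
result = pandas_function(sample)
print(result)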

Suppress output of SQL statements when calling pandas to_sql()

to_sql() is printing every insert statement within my Jupyter Notebook, and this makes everything run very slowly for millions of records. How can I decrease the verbosity significantly? I haven't found any verbosity setting for this function. I've tried %%capture as described here: How do you suppress output in IPython Notebook? The same method works for another simple test case with print(), but not for to_sql().
from sqlalchemy import create_engine, NVARCHAR
import cx_Oracle
df.to_sql('table_name', engine, if_exists='append',
schema='schema', index=False, chunksize=10000,
dtype={col_name: NVARCHAR(length=20) for col_name in df} )
Inside create_engine(), set echo=False and all logging will be disabled. More detail here: https://docs.sqlalchemy.org/en/13/core/engines.html#more-on-the-echo-flag.
from sqlalchemy import create_engine, NVARCHAR
import cx_Oracle
host='hostname.net'
port=1521
sid='DB01' #instance is the same as SID
user='USER'
password='password'
sid = cx_Oracle.makedsn(host, port, sid=sid)
cstr = 'oracle://{user}:{password}@{sid}'.format(
user=user,
password=password,
sid=sid
)
engine = create_engine(
cstr,
convert_unicode=False,
pool_recycle=10,
pool_size=50,
echo=False
)
Thanks to @GordThompson for pointing me in the right direction!
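If the engine is created elsewhere and cannot easily be rebuilt with echo=False, another option (just a sketch) is to silence SQLAlchemy's engine logger directly:
import logging
# stop SQLAlchemy from emitting a log line per executed statement
logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)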

How to load csv file into SparkSession

I am learning PySpark from an online source. I googled around and found how I could read a csv file into a Spark DataFrame using the following code:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.csv('my_file.csv', header=True)
pandas_df = spark_df.toPandas()
However, the online site I am learning from loads the csv file into the SparkSession somehow, without telling the audience how to do it. That is, when I typed (in the online site's browser):
print(spark.catalog.listTables())
The following output is returned:
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
When I tried to print the catalog the same way on my own machine, I got an empty list back.
Is there any way to put the csv file into the SparkSession? I have tried to google this, but most of what I found is how to load a csv into a Spark DataFrame, as I showed above.
Thanks very much.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("<app name>").getOrCreate()  # type the app name
df = spark.read.csv('invoice.csv', inferSchema=True, header=True)
It turns out that how to do this is covered much later on the online site than it should be.
sdf = spark.read.csv('my_file.csv', header=True)
pdf = sdf.toPandas()
spark_temp = spark.createDataFrame(pdf)
spark_temp.createOrReplaceTempView('my_file')
print(spark.catalog.listTables())
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
One question remains, though: I cannot use pd.read_csv('my_file.csv') directly; it resulted in some merge error or something.
This can work:
df = my_spark.read.csv("my_file.csv",inferSchema=True,header=True)
df.createOrReplaceTempView('my_file')
print(my_spark.catalog.listTables())
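Once the view is registered it shows up in the catalog and can be queried with plain SQL, for example (the aggregate here is just an illustration):
# query the registered view by name
my_spark.sql("SELECT COUNT(*) AS n_rows FROM my_file").show()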