Using UDFs and simple DataFrames in PySpark - apache-spark-sql

I am new to PySpark and am trying to do something like the below: call a function PrintDetails for each cookie and then write the result into a file. The spark.sql query returns the correct data and I can serialize it to a file as well.
Can someone help with the for statement over each cookie? What should the syntax be for calling the UDF, and how can I write the output to a text file?
Any help is appreciated.
Thanks
@udf(returnType=StringType())
def PrintDetails(cookie, timestamps, current_day, current_hourly_threshold, current_daily_threshold):
    # DO SOME WORK
    return "%s\t%d\t%d\t%d\t%d\t%s" % (some_data)

def main(argv):
    spark = SparkSession \
        .builder \
        .appName("parquet_test") \
        .config("spark.debug.maxToStringFields", "100") \
        .getOrCreate()

    inputPath = r'D:\Hadoop\Spark\parquet_input_files'
    inputFiles = os.path.join(inputPath, '*.parquet')
    impressionDate = datetime.strptime("2019_12_31", '%Y_%m_%d')
    current_hourly_threshold = 40
    current_daily_threshold = 200

    parquetFile = spark.read.parquet(inputFiles)
    parquetFile.createOrReplaceTempView("parquetFile")
    cookie_and_time = spark.sql("SELECT cookie, collect_list(date_format(from_unixtime(ts), 'yyyy-MM-dd-HH:mm:ss')) as imp_times FROM parquetFile GROUP BY 1")

    for cookie in cookie_and_time:
        PrintDetails(cookie('cookie'), cookie('imp_times'), impressionDate, current_hourly_threshold, current_daily_threshold)

You can do it like below.
cookie_df = cookie_and_time.withColumn("cookies", PrintDetails(cookie_and_time['cookie'], cookie_and_time['imp_times'], lit(impressionDate), lit(current_hourly_threshold), lit(current_daily_threshold)))
Or you can define all the constants inside the UDF itself and avoid passing them as arguments.
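For the second part of the question (writing the output to a text file), a minimal sketch, assuming the cookies column holds the formatted strings, is to select that single string column and write it with DataFrameWriter.text (the output path below is only an example):
cookie_df.select("cookies") \
    .write \
    .mode("overwrite") \
    .text(r'D:\Hadoop\Spark\udf_output')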

Related

How to escape single quote in sparkSQL

I am new to PySpark and SQL. I am working on the query below:
sqlContext.sql("Select Crime_type, substring(Location,11,100) as Location_where_crime_happened, count(*) as Count\
From street_SQLTB\
where LSOA_name = 'City of London 001F' and \
group by Location_where_crime_happened, Crime_type\
having Location_where_crime_happened = 'Alderman'S Walk'")
I am struggling with the single quote. I need to apply a filter on Alderman'S Walk. It is probably simple, but I am unable to figure it out.
Your help is much appreciated.
Try this:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

simpleData = [("James", "Sales", "NY", 90000, 34, 10000),
              ("Michael", "Sales", "NY", 86000, 56, 20000),
              ("Robert", "Sales", "CA", 81000, 30, 23000),
              ("Maria", "Alderman'S Walk", "CA", 90000, 24, 23000)]
columns = ["employee_name", "department", "state", "salary", "age", "bonus"]

df1 = spark.createDataFrame(data = simpleData, schema = columns)
df1.createOrReplaceTempView('temp')

df = spark.sql("""select * from temp where department = "Alderman'S Walk" """)
display(df)
or
df = spark.sql("select * from temp where department = 'Alderman\\'S Walk' ")
display(df)
Filtered output: only the Maria row ("Maria", "Alderman'S Walk", "CA", 90000, 24, 23000) is returned.
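Not part of the original answer, but another option that sidesteps SQL quoting entirely is the DataFrame API, where the value is just a plain Python string:
from pyspark.sql import functions as F

df = df1.filter(F.col("department") == "Alderman'S Walk")
df.show()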

How to method chain .agg() and .assign() functions in Pandas

I am looking to replicate this dplyr query in Pandas but am having trouble chaining the .agg() and .assign() functions together, and would be grateful for any advice.
Dplyr code:
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
Attempt at the same in Pandas:
Within the .assign() part I am reaching back into the original dataframe, but nothing I have tried works:
counties.\
    groupby('state').\
    agg(total_area = ('land_area', 'sum'),
        total_population = ('population', 'sum')).\
    reset_index().\
    assign(density = counties['total_population'] / counties['total_area']).\
    arrange('density', ascending = False).\
    head()
The problem is that you need a lambda to work on the data already processed by the previous chained methods, so change:
assign(density = counties['total_population'] / counties['total_area'])
to:
assign(density = lambda x: x['total_population'] / x['total_area'])
Another problem is the sorting: instead of
arrange('density', ascending = False)
use the pandas method DataFrame.sort_values:
sort_values('density', ascending = False)
All together, with each chained method starting with a .:
df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
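As a quick check, the chain above runs end to end on a small made-up frame (the numbers are purely illustrative):
import pandas as pd

counties = pd.DataFrame({
    'state': ['NY', 'NY', 'CA'],
    'land_area': [100.0, 50.0, 200.0],
    'population': [1000, 500, 3000],
})

df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
print(df)   # CA density 15.0, NY density 10.0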
With datar, it is easy to port your dplyr code to python code, without learning pandas APIs:
from datar.all import f, group_by, summarize, sum, mutate, arrange, desc

counties_selected >> \
    group_by(f.state) >> \
    summarize(total_area = sum(f.land_area),
              total_population = sum(f.population)) >> \
    mutate(density = f.total_population / f.total_area) >> \
    arrange(desc(f.density))
I am the author of the package. Feel free to submit issues if you have any questions.

Pyspark: How to convert array of strings in a dataframe to array of timestamps

I run a simple query to get cookie as a string and timestamps as an array using PySpark SQL.
I want to pass them to my user-defined function, but the array of timestamps is passed in as an array of unicode strings.
Can someone help me figure this out? Thanks.
@udf(returnType=StringType())
def PrintDetails(cookie, timestamps, current_day, current_hourly_threshold, current_daily_threshold):
    print(type(timestamps[0]))

def main(argv):
    spark = SparkSession \
        .builder \
        .appName("parquet_test") \
        .config("spark.debug.maxToStringFields", "100") \
        .getOrCreate()

    inputPath = r'D:\Hadoop\Spark\parquet_input_files'
    inputFiles = os.path.join(inputPath, '*.parquet')
    impressionDate = datetime.strptime("2019_12_31", '%Y_%m_%d')
    current_hourly_threshold = 40
    current_daily_threshold = 200

    parquetFile = spark.read.parquet(inputFiles)
    parquetFile.createOrReplaceTempView("parquetFile")
    cookie_and_time = spark.sql("SELECT cookie, collect_list(date_format(from_unixtime(ts), 'yyyy-MM-dd-HH:mm:ss')) as imp_times FROM parquetFile GROUP BY 1")

    cookie_df = cookie_and_time.withColumn("cookies", PrintDetails(cookie_and_time['cookie'], cookie_and_time['imp_times'], lit(impressionDate), lit(current_hourly_threshold), lit(current_daily_threshold)))
    cookie_df.show()

if __name__ == "__main__":
    main(sys.argv)
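A sketch of one way to get real timestamps into the UDF (assuming Spark 3.1+ for pyspark.sql.functions.transform; the format string is assumed to match the date_format pattern used above) is to convert the string array before calling the UDF, or simply to parse the strings inside the UDF:
from pyspark.sql import functions as F

# Convert the array of strings to an array of timestamps before calling the UDF.
cookie_and_time = cookie_and_time.withColumn(
    "imp_times",
    F.transform("imp_times", lambda t: F.to_timestamp(t, "yyyy-MM-dd-HH:mm:ss")))

# Or parse each element inside PrintDetails instead:
# from datetime import datetime
# parsed = [datetime.strptime(t, "%Y-%m-%d-%H:%M:%S") for t in timestamps]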

Return KDB query to a pandas dataframe

I would like to extract data from a KDB database and place it into a dataframe. My query runs fine in qpad with no issues; I just need to write the result into my pandas dataframe. My code:
from qpython import qconnection

# Create the connection and save the handle to a variable
q = qconnection.QConnection(host = 'wokplpaxvj003', port = 11503, username = 'pelucas', password = 'Dive2600', timeout = 3.0)
try:
    # initialize connection
    q.open()
    print(q)
    print('IPC version: %s. Is connected: %s' % (q.protocol_version, q.is_connected()))
    df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}')
    df.info()
finally:
    q.close()
It fails on df.info(), raising AttributeError: 'QLambda' object has no attribute 'info', so I guess the call is not successful.
It looks like you've sent only a lambda, with no instruction to execute that lambda. Two options:
Don't make it a lambda
df = q.sendSync('select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id')
Execute the lambda
df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}[]')
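One more thing worth checking (this is an assumption on my part, not from the original answer): qpython only converts query results to pandas objects when the connection is created with pandas=True; otherwise sendSync returns qpython's own table type, which has no .info(). For example:
q = qconnection.QConnection(host = 'wokplpaxvj003', port = 11503, username = 'pelucas', password = 'Dive2600', timeout = 3.0, pandas = True)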

how to store Pyspark dataframe into HBase

I have code that converts PySpark streaming data to a dataframe. I need to store this dataframe into HBase. Please help me write the additional code.
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession

def getSparkSessionInstance(sparkConf):
    if ('sparkSessionSingletonInstance' not in globals()):
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=sparkConf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: sql_network_wordcount.py <hostname> <port> ", file=sys.stderr)
        exit(-1)
    host, port = sys.argv[1:]
    sc = SparkContext(appName="PythonSqlNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(host, int(port))

    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            words = rdd.map(lambda line: line.split(" ")).collect()
            spark = getSparkSessionInstance(rdd.context.getConf())
            linesDataFrame = spark.createDataFrame(words, schema=["lat", "lon"])
            linesDataFrame.show()
        except:
            pass

    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()
You can use the Spark-HBase connector to access HBase from Spark. It provides an API at both the low-level RDD and the DataFrame level.
The connector requires you to define a catalog (schema) for the HBase table. Below is an example catalog for an HBase table named table1, with row key key and a number of columns (col1-col8). Note that the row key also has to be defined in detail as a column (col0), which has a specific column family cf (rowkey).
catalog = """{
    "table":{"namespace":"default", "name":"table1"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
        "col2":{"cf":"cf1", "col":"col2", "type":"double"},
        "col3":{"cf":"cf1", "col":"col3", "type":"float"},
        "col4":{"cf":"cf1", "col":"col4", "type":"int"},
        "col5":{"cf":"cf2", "col":"col5", "type":"bigint"},
        "col6":{"cf":"cf2", "col":"col6", "type":"smallint"},
        "col7":{"cf":"cf2", "col":"col7", "type":"string"},
        "col8":{"cf":"cf2", "col":"col8", "type":"tinyint"}
    }
}"""
Once the catalog is defined according to the schema of your dataframe, you can write the dataframe to HBase using:
df.write \
    .options(catalog=catalog) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()
To read the data from HBase:
df = spark.read \
    .options(catalog=catalog) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .load()
You need to include the Spark-HBase connector package as below when submitting the Spark application:
pyspark --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
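Tying this back to the streaming code in the question, a minimal sketch (with a hypothetical table name coords and a row key derived from lat/lon; the newtable option, as far as I recall, lets the connector create the table if it does not already exist) would define a small catalog for the lat/lon dataframe and reuse the same write call inside process():
from pyspark.sql import functions as F

coords_catalog = """{
    "table":{"namespace":"default", "name":"coords"},
    "rowkey":"key",
    "columns":{
        "key":{"cf":"rowkey", "col":"key", "type":"string"},
        "lat":{"cf":"cf1", "col":"lat", "type":"string"},
        "lon":{"cf":"cf1", "col":"lon", "type":"string"}
    }
}"""

# inside process(), after linesDataFrame is built:
hbase_df = linesDataFrame.withColumn("key", F.concat_ws("_", "lat", "lon"))
hbase_df.write \
    .options(catalog=coords_catalog, newtable="5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()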