Using UDFs and simple DataFrames in PySpark - apache-spark-sql

I am new to PySpark and am trying to do something like the below: call a function PrintDetails for each cookie and then write the result into a file. The spark.sql query returns the correct data and I can serialize it to a file as well.
Can someone help with the for statement over each cookie? What should the syntax be for calling the UDF, and how can I write the output to a text file?
Any help is appreciated.
Thanks
@udf(returnType=StringType())
def PrintDetails(cookie, timestamps, current_day, current_hourly_threshold, current_daily_threshold):
    # DO SOME WORK
    return "%s\t%d\t%d\t%d\t%d\t%s" % (some_data)

def main(argv):
    spark = SparkSession \
        .builder \
        .appName("parquet_test") \
        .config("spark.debug.maxToStringFields", "100") \
        .getOrCreate()

    inputPath = r'D:\Hadoop\Spark\parquet_input_files'
    inputFiles = os.path.join(inputPath, '*.parquet')
    impressionDate = datetime.strptime("2019_12_31", '%Y_%m_%d')
    current_hourly_threshold = 40
    current_daily_threshold = 200

    parquetFile = spark.read.parquet(inputFiles)
    parquetFile.createOrReplaceTempView("parquetFile")
    cookie_and_time = spark.sql("SELECT cookie, collect_list(date_format(from_unixtime(ts), 'yyyy-MM-dd-HH:mm:ss')) as imp_times FROM parquetFile GROUP BY 1")

    for cookie in cookie_and_time:
        PrintDetails(cookie('cookie'), cookie('imp_times'), impressionDate, current_hourly_threshold, current_daily_threshold)

You can do it like below.
cookie_df = cookie_and_time.withColumn("cookies", PrintDetails(cookie_and_time['cookie'], cookie_and_time['imp_times'], lit(impressionDate), lit(current_hourly_threshold), lit(current_daily_threshold)))
Or you can define all the constants inside the UDF itself and avoid passing them as arguments.
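For the second part of the question (writing the output to a text file), a minimal sketch, assuming the cookies column holds the formatted strings, is to select that single string column and write it with DataFrameWriter.text (the output path below is only an example):
cookie_df.select("cookies") \
    .write \
    .mode("overwrite") \
    .text(r'D:\Hadoop\Spark\udf_output')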

Related

How to escape single quote in sparkSQL

I am new to PySpark and SQL. I am working on the query below:
sqlContext.sql("Select Crime_type, substring(Location,11,100) as Location_where_crime_happened, count(*) as Count\
From street_SQLTB\
where LSOA_name = 'City of London 001F' and \
group by Location_where_crime_happened, Crime_type\
having Location_where_crime_happened = 'Alderman'S Walk'")
I am struggling with the single quote. I need to apply a filter on Alderman'S Walk. It is probably simple, but I am unable to figure it out.
Your help is much appreciated.
Try this:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

simpleData = [("James", "Sales", "NY", 90000, 34, 10000),
              ("Michael", "Sales", "NY", 86000, 56, 20000),
              ("Robert", "Sales", "CA", 81000, 30, 23000),
              ("Maria", "Alderman'S Walk", "CA", 90000, 24, 23000)]
columns = ["employee_name", "department", "state", "salary", "age", "bonus"]

df1 = spark.createDataFrame(data = simpleData, schema = columns)
df1.createOrReplaceTempView('temp')

df = spark.sql("""select * from temp where department = "Alderman'S Walk" """)
display(df)
or
df = spark.sql("select * from temp where department = 'Alderman\\'S Walk' ")
display(df)
Filtered output: only the Maria row ("Maria", "Alderman'S Walk", "CA", 90000, 24, 23000) is returned.
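Not part of the original answer, but another option that sidesteps SQL quoting entirely is the DataFrame API, where the value is just a plain Python string:
from pyspark.sql import functions as F

df = df1.filter(F.col("department") == "Alderman'S Walk")
df.show()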

How to method chain .agg() and .assign() functions in Pandas

I am looking to replicate this dplyr query in Pandas but am having trouble chaining the .agg() and .assign() functions together, and would be grateful for any advice.
Dplyr code:
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
Attempt at the same in Pandas:
Within the .assign() part I am reaching back into the original dataframe, but nothing I have tried works:
counties.\
    groupby('state').\
    agg(total_area = ('land_area', 'sum'),
        total_population = ('population', 'sum')).\
    reset_index().\
    assign(density = counties['total_population'] / counties['total_area']).\
    arrange('density', ascending = False).\
    head()
The problem is that you need a lambda to work on the data already processed by the previous chained methods, so change:
assign(density = counties['total_population'] / counties['total_area'])
to:
assign(density = lambda x: x['total_population'] / x['total_area'])
Another problem is the sorting: instead of
arrange('density', ascending = False)
use the pandas method DataFrame.sort_values:
sort_values('density', ascending = False)
All together, with each chained method starting with a .:
df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
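As a quick check, the chain above runs end to end on a small made-up frame (the numbers are purely illustrative):
import pandas as pd

counties = pd.DataFrame({
    'state': ['NY', 'NY', 'CA'],
    'land_area': [100.0, 50.0, 200.0],
    'population': [1000, 500, 3000],
})

df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
print(df)   # CA density 15.0, NY density 10.0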
With datar, it is easy to port your dplyr code to python code, without learning pandas APIs:
from datar.all import f, group_by, summarize, sum, mutate, arrange, desc

counties_selected >> \
    group_by(f.state) >> \
    summarize(total_area = sum(f.land_area),
              total_population = sum(f.population)) >> \
    mutate(density = f.total_population / f.total_area) >> \
    arrange(desc(f.density))
I am the author of the package. Feel free to submit issues if you have any questions.

Pyspark: How to convert array of strings in a dataframe to array of timestamps

I run a simple query to get cookie as a string and timestamps as an array using PySpark SQL.
I want to pass them to my user-defined function, but the array of timestamps is passed in as an array of unicode strings.
Can someone help me figure this out? Thanks.
@udf(returnType=StringType())
def PrintDetails(cookie, timestamps, current_day, current_hourly_threshold, current_daily_threshold):
    print(type(timestamps[0]))

def main(argv):
    spark = SparkSession \
        .builder \
        .appName("parquet_test") \
        .config("spark.debug.maxToStringFields", "100") \
        .getOrCreate()

    inputPath = r'D:\Hadoop\Spark\parquet_input_files'
    inputFiles = os.path.join(inputPath, '*.parquet')
    impressionDate = datetime.strptime("2019_12_31", '%Y_%m_%d')
    current_hourly_threshold = 40
    current_daily_threshold = 200

    parquetFile = spark.read.parquet(inputFiles)
    parquetFile.createOrReplaceTempView("parquetFile")
    cookie_and_time = spark.sql("SELECT cookie, collect_list(date_format(from_unixtime(ts), 'yyyy-MM-dd-HH:mm:ss')) as imp_times FROM parquetFile GROUP BY 1")

    cookie_df = cookie_and_time.withColumn("cookies", PrintDetails(cookie_and_time['cookie'], cookie_and_time['imp_times'], lit(impressionDate), lit(current_hourly_threshold), lit(current_daily_threshold)))
    cookie_df.show()

if __name__ == "__main__":
    main(sys.argv)
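A sketch of one way to get real timestamps into the UDF (assuming Spark 3.1+ for pyspark.sql.functions.transform; the format string is assumed to match the date_format pattern used above) is to convert the string array before calling the UDF, or simply to parse the strings inside the UDF:
from pyspark.sql import functions as F

# Convert the array of strings to an array of timestamps before calling the UDF.
cookie_and_time = cookie_and_time.withColumn(
    "imp_times",
    F.transform("imp_times", lambda t: F.to_timestamp(t, "yyyy-MM-dd-HH:mm:ss")))

# Or parse each element inside PrintDetails instead:
# from datetime import datetime
# parsed = [datetime.strptime(t, "%Y-%m-%d-%H:%M:%S") for t in timestamps]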

Return KDB query to a pandas dataframe

I would like to extract data from a KDB database and place it into a dataframe. My query runs fine in qpad with no issues; I just need to write the result into my pandas dataframe. My code:
from qpython import qconnection

# Create the connection and save the handle to a variable
q = qconnection.QConnection(host = 'wokplpaxvj003', port = 11503, username = 'pelucas', password = 'Dive2600', timeout = 3.0)
try:
    # initialize connection
    q.open()
    print(q)
    print('IPC version: %s. Is connected: %s' % (q.protocol_version, q.is_connected()))
    df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}')
    df.info()
finally:
    q.close()
It fails on df.info(), raising AttributeError: 'QLambda' object has no attribute 'info', so I guess the call is not successful.
It looks like you've sent only a lambda, with no instruction to execute that lambda. Two options:
Don't make it a lambda
df = q.sendSync('select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id')
Execute the lambda
df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}[]')
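One more thing worth checking (this is an assumption on my part, not from the original answer): qpython only converts query results to pandas objects when the connection is created with pandas=True; otherwise sendSync returns qpython's own table type, which has no .info(). For example:
q = qconnection.QConnection(host = 'wokplpaxvj003', port = 11503, username = 'pelucas', password = 'Dive2600', timeout = 3.0, pandas = True)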

how to store Pyspark dataframe into HBase

I have code that converts PySpark streaming data to a dataframe. I need to store this dataframe into HBase. Please help me write the additional code.
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession

def getSparkSessionInstance(sparkConf):
    if ('sparkSessionSingletonInstance' not in globals()):
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=sparkConf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: sql_network_wordcount.py <hostname> <port> ", file=sys.stderr)
        exit(-1)
    host, port = sys.argv[1:]
    sc = SparkContext(appName="PythonSqlNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(host, int(port))

    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            words = rdd.map(lambda line: line.split(" ")).collect()
            spark = getSparkSessionInstance(rdd.context.getConf())
            linesDataFrame = spark.createDataFrame(words, schema=["lat", "lon"])
            linesDataFrame.show()
        except:
            pass

    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()
You can use the Spark-HBase connector to access HBase from Spark. It provides an API at both the low-level RDD and the DataFrame level.
The connector requires you to define a catalog (schema) for the HBase table. Below is an example catalog for an HBase table named table1, with row key key and a number of columns (col1-col8). Note that the row key also has to be defined in detail as a column (col0), which has a specific column family cf (rowkey).
catalog = """{
    "table":{"namespace":"default", "name":"table1"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
        "col2":{"cf":"cf1", "col":"col2", "type":"double"},
        "col3":{"cf":"cf1", "col":"col3", "type":"float"},
        "col4":{"cf":"cf1", "col":"col4", "type":"int"},
        "col5":{"cf":"cf2", "col":"col5", "type":"bigint"},
        "col6":{"cf":"cf2", "col":"col6", "type":"smallint"},
        "col7":{"cf":"cf2", "col":"col7", "type":"string"},
        "col8":{"cf":"cf2", "col":"col8", "type":"tinyint"}
    }
}"""
Once the catalog is defined according to the schema of your dataframe, you can write the dataframe to HBase using:
df.write \
    .options(catalog=catalog) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()
To read the data from HBase:
df = spark.read \
    .options(catalog=catalog) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .load()
You need to include the Spark-HBase connector package as below when submitting the Spark application:
pyspark --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
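Tying this back to the streaming code in the question, a minimal sketch (with a hypothetical table name coords and a row key derived from lat/lon; the newtable option, as far as I recall, lets the connector create the table if it does not already exist) would define a small catalog for the lat/lon dataframe and reuse the same write call inside process():
from pyspark.sql import functions as F

coords_catalog = """{
    "table":{"namespace":"default", "name":"coords"},
    "rowkey":"key",
    "columns":{
        "key":{"cf":"rowkey", "col":"key", "type":"string"},
        "lat":{"cf":"cf1", "col":"lat", "type":"string"},
        "lon":{"cf":"cf1", "col":"lon", "type":"string"}
    }
}"""

# inside process(), after linesDataFrame is built:
hbase_df = linesDataFrame.withColumn("key", F.concat_ws("_", "lat", "lon"))
hbase_df.write \
    .options(catalog=coords_catalog, newtable="5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()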