Stream Event Hub fixed-length data to a streaming DataFrame

Summary: I have an Event Hub as a streaming source, where the data is received in fixed-length format. I want to read this fixed-length stream into a Spark DataFrame.
Note: I can already do this when the fixed-length data comes from a directory, by taking substrings and categorizing them as needed. But how can I do it for a streaming source like Event Hubs, which delivers all of the data in the Body column?
Say my fixed-length file is sample.txt and contains: 00101292017you1235
My code is below:
import org.apache.spark.eventhubs
import org.apache.spark.eventhubs.EventPosition._
import com.microsoft.azure.eventhubs.EventHubClientOptions
import org.apache.spark.eventhubs._
import com.microsoft.azure.eventhubs.impl.StringUtil
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val endpoint = "Endpoint=sb://XXXXXX.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=XXXXXXX"
val eventHub = "XXXX"
val connectionString = ConnectionStringBuilder(endpoint)
  .setEventHubName("XXXXX")
  .build
val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)
  .setMaxEventsPerTrigger(500)
val ehStream = spark.readStream.format("eventhubs").options(ehConf.toMap).load
val messages = ehStream
  .withColumn("Offset", $"offset".cast(LongType))
  .withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
  .withColumn("Timestamp", $"enqueuedTime".cast(LongType))
  .withColumn("Body", $"body".cast(StringType))
  .select("Offset", "Time (readable)", "Timestamp", "Body")
messages.writeStream
  .outputMode("append")
  .option("truncate", false)
  .format("console")
  .start()
  .awaitTermination()
In the above scenario, how can the body received from the Event Hub (00101292017you1235) be split into columns? Something like:
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show
will result in:
+---+--------+------+-------+
| id|    date|string|integer|
+---+--------+------+-------+
|001|01292017|   you|   1234|
|002|01302017|    me|   5678|
+---+--------+------+-------+

This helps:
val messages = ehStream.withColumn("FirstColumn", $"body".substr(1, 3).cast(StringType)).select("FirstColumn")
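The same substring positions can be checked outside Spark before wiring them into the stream. Below is a minimal plain-Python sketch, assuming the 3/8/3/4 character layout from the question's sample; the `FIELDS` table and the `parse_fixed` helper are hypothetical names used only for illustration, and nothing here touches the Event Hubs connector.

```python
# Hypothetical layout: (name, 1-based start, length), mirroring substr(start, length)
FIELDS = [("id", 1, 3), ("date", 4, 8), ("string", 12, 3), ("integer", 15, 4)]

def parse_fixed(body: bytes) -> dict:
    """Decode an event body and slice it into named fixed-width fields."""
    text = body.decode("utf-8")
    row = {name: text[start - 1:start - 1 + length] for name, start, length in FIELDS}
    # The last field is numeric in the question's example, so convert it
    row["integer"] = int(row["integer"])
    return row

record = parse_fixed(b"00101292017you1235")
print(record)
```

In Spark, each entry of such a layout would presumably become one more withColumn call on the body column, in the style of the line above.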

Related

How do I get an Excel file from a DataFrame that is refreshed automatically by the code?

I have tried two methods, and both show a location different from the one I specified.
apikey='abcd'
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import time
ts=TimeSeries(key=apikey,output_format='pandas')
data,metadata=ts.get_intraday(symbol='name',interval='1min',outputsize='full')
data
while True:
    data, metadata = ts.get_intraday(symbol='TCS', interval='1min', outputsize='full')
    data.to_excel('livedat.xlsx')
    time.sleep(60)
The code runs properly, but I don't know how to get the data file in Excel.
Important: the method should give me the file that is updated automatically every minute.
Also, I am using IBM Watson Studio to write the code.
I am not familiar with the alpha_vantage wrapper you are using, but this is how I would approach your problem. The code works and I have included comments.
To read the file back in the Python script, I would use pd.read_excel(filepath).
import requests
import pandas as pd
import time
import datetime
# Your API key and the URL we will request from
API_KEY = "YOUR API KEY"
url = "https://www.alphavantage.co/query?"
def Generate_file(symbol="IBM", interval="1min"):
    # URL parameters
    parameters = {"function": "TIME_SERIES_INTRADAY",
                  "symbol": symbol,
                  "interval": interval,
                  "apikey": API_KEY,
                  "outputsize": "compact"}
    # Get the JSON response from Alpha Vantage
    response = requests.get(url, params=parameters)
    data = response.json()
    # Filter the response to keep only the time series data we want
    time_series_interval = f"Time Series ({interval})"
    prices = data[time_series_interval]
    # Convert the filtered response to a pandas DataFrame
    df = pd.DataFrame.from_dict(prices, orient="index").reset_index()
    df = df.rename(columns={"index": time_series_interval})
    # Create a timestamp for the Excel file name, so that the file is not overwritten with new data each time
    current_time = datetime.datetime.now()
    file_timestamp = current_time.strftime("%Y%m%d_%H.%M")
    filename = f"livedat_{file_timestamp}.xlsx"
    df.to_excel(filename)
# Set a limit on the number of calls we make, to prevent an infinite loop
call_limit = 3
number_of_calls = 0
while number_of_calls < call_limit:
    Generate_file()  # our function
    number_of_calls += 1
    time.sleep(60)
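Because the timestamp format puts year, month, day, hour and minute in that order, the file names sort chronologically, so the newest export can be found with a plain string comparison. Here is a small standard-library sketch of just that naming step; `timestamped_name` is a hypothetical helper and the start time is made up for illustration.

```python
import datetime

def timestamped_name(now: datetime.datetime, prefix: str = "livedat") -> str:
    """Build a file name such as livedat_YYYYMMDD_HH.MM.xlsx for the given time."""
    stamp = now.strftime("%Y%m%d_%H.%M")
    return f"{prefix}_{stamp}.xlsx"

# Illustrative start time only; a real run would use datetime.datetime.now()
start = datetime.datetime(2021, 3, 1, 9, 30)
names = [timestamped_name(start + datetime.timedelta(minutes=i)) for i in range(3)]

# Names sort chronologically, so max() picks the most recent file
latest = max(names)
print(latest)
```

The resulting name is what I would then pass to pd.read_excel to load the most recent data.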

How can I save Kafka data read with Structured Streaming as a DataFrame and apply parsing to it?

I am trying to read real-time streaming data from Kafka topics through Spark Structured Streaming. My understanding is that the stream would need to stop at some point so I can apply my parsing logic to it and push the result to MongoDB. Is there a way to save the streaming data into a separate DataFrame, with or without stopping the stream?
I checked the guide and other blogs, but I can't find a straightforward answer for my requirement.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092, host:9092, host:9092")
  .option("subscribe", "TOPIC_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
  .option("startingOffsets", "earliest")
  .load()
val dfs = df.selectExpr("CAST(value AS STRING)")
val consoleOutput = dfs.writeStream
  .outputMode("append")
  .format("console")
  .start()
consoleOutput.awaitTermination()
consoleOutput.stop()
I need the streaming data to be saved in a DataFrame somehow, either by stopping the stream or without stopping it.
Below is my parsing logic. Instead of loading the dataset from a file path, I need the streamed data to be my new dataset, so that I can apply the rest of my logic and get the output. Saving it to MongoDB is not my primary focus right now.
val log = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("C:\\Users\\raheem_mohammed\\IdeaProjects\\diag.csv")
log.createOrReplaceTempView("logs")
val df = spark.sql("select _raw, _time from logs").toDF
// Adds an Id number to each event
val logs = dfs.withColumn("Id", monotonically_increasing_id() + 1)
// Registers the DataFrame as a temp table
logs.createOrReplaceTempView("logs")
val dfss = spark.sql("select Id, value from logs")
// Extracts columns from the _raw column and flags which compositeName variants are present.
// If true, the compositeName belongs to one of the four possibilities
val extractedDF = dfss.withColumn("managed_server", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]", 2))
  .withColumn("alert_summary", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 3))
  .withColumn("oracle_details", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 5))
  .withColumn("ecid", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)", 1))
  //.withColumn("CompName",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):\s+(\S+)\]""",2))
  .withColumn("CompName", regexp_extract($"_raw", """.*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""", 2))
  .withColumn("composite_name", col("_raw").contains("composite_name"))
  .withColumn("compositename", col("_raw").contains("compositename"))
  .withColumn("composites", col("_raw").contains("composites"))
  .withColumn("componentDN", col("_raw").contains("componentDN"))
// Filters out any NULL values if found
val finalData = extractedDF.filter(
  col("managed_server").isNotNull &&
  col("alert_summary").isNotNull &&
  col("oracle_details").isNotNull &&
  col("ecid").isNotNull &&
  col("CompName").isNotNull &&
  col("composite_name").isNotNull &&
  col("compositename").isNotNull &&
  col("composites").isNotNull &&
  col("componentDN").isNotNull)
finalData.show(false)
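This does not address the streaming part of the question, but the extraction patterns above can be sanity-checked on a single string before they are applied to a stream. A minimal plain-Python sketch follows; the `raw` log line is entirely made up for illustration, and the local `regexp_extract` helper is a hypothetical stand-in that mirrors Spark's behavior of returning an empty string when the pattern does not match.

```python
import re

def regexp_extract(text: str, pattern: str, group: int) -> str:
    """Return the requested capture group, or "" when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""

# Hypothetical log line, invented only to exercise the patterns
raw = ("[2019-01-21T10:00:00] [soa_server1] [ERROR] [SOA-0001] [oracle.soa] "
       "[ecid: abc123,0] [composite_name: OrderFlow]")

parsed = {
    "managed_server": regexp_extract(raw, r"\[(.*?)\] \[(.*?)\]", 2),
    "alert_summary": regexp_extract(raw, r"\[(.*?)\] \[(.*?)\] \[(.*?)\]", 3),
    "ecid": regexp_extract(raw, r"(?<=ecid: )(.*?)(?=,)", 1),
    "CompName": regexp_extract(
        raw, r".*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)", 2),
}
print(parsed)
```

Since a non-matching pattern yields an empty string rather than NULL, a filter on empty strings may be closer to the intent than the isNotNull checks.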

Facing an issue while using a Spark UDF with multiple arguments

I am trying to encrypt the data using SHA-256, passing the algorithm name as an argument to a Spark UDF, but I get the error below. Please find the program snippet and error details below.
Code Snippet:
package com.sample
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import java.security.MessageDigest
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction
import javax.xml.bind.DatatypeConverter;
import org.apache.spark.sql.Column
object Customer {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Customer-data").setMaster("local[2]").set("spark.executor.memory", "1g");
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
    //val hash_algm=sc.getConf.get("halgm")
    val hash_algm = "SHA-256"
    val df = spark.read.format("csv").option("header", "true").load("file:///home/tcs/Documents/KiranDocs/Data_files/sample_data")
    spark.udf.register("encriptedVal1", encriptedVal)
    //calling encription UDF function
    //val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df("customer_id"))).show()
    val resDF2 = df.withColumn(("ssn_number"), encriptedVal(array("customer_id", hash_algm))).show()
    println("data set" + resDF2)
    sc.stop()
  }
  def encriptedVal = udf((s: String, s1: String) => {
    val digest = MessageDigest.getInstance(s1)
    val hash = digest.digest(s.getBytes("UTF-8"))
    DatatypeConverter.printHexBinary(hash)
  })
}
Error details are below:
Exception in thread "main" 2019-01-21 19:42:48 INFO SparkContext:54 - Invoking stop() from shutdown hook
java.lang.ClassCastException: com.sample.Customer$$anonfun$encriptedVal$1 cannot be cast to scala.Function1
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:104)
at org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:85)
at com.sample.Customer$.main(Customer.scala:26)
at com.sample.Customer.main(Customer.scala)
The problem is in how you call the UDF. Passing array(...) hands it a single argument, so Spark tries to treat your two-argument function as a scala.Function1, which is the ClassCastException above. Call it like this instead:
val resDF1 = df.withColumn("ssn_number", encriptedVal(df.col("customer_id"), lit(hash_algm)))
because it takes two Column arguments (both must be of String type, as defined in your UDF).
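To make the two-argument shape concrete, here is a plain-Python sketch of the same hash function outside Spark. It is only an illustration under assumptions: `encrypted_val` is a hypothetical stand-in for the Scala UDF, and hashlib takes the place of MessageDigest, so the JCA-style name "SHA-256" is normalized to hashlib's "sha256" first.

```python
import hashlib

def encrypted_val(value: str, algorithm: str) -> str:
    """Hash a string with the named algorithm and return uppercase hex."""
    # hashlib uses names like "sha256", so normalize JCA-style names such as "SHA-256"
    name = algorithm.replace("-", "").lower()
    digest = hashlib.new(name, value.encode("utf-8"))
    # printHexBinary in the Scala version emits uppercase hex, so match that
    return digest.hexdigest().upper()

hashed = encrypted_val("customer-001", "SHA-256")
print(hashed)
```

Both parameters are ordinary per-call values here, which is the role lit(hash_algm) plays in the Spark call: it turns the constant into a column so that every row receives it as the second argument.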

Unable to load a CSV file as a DataFrame in Spark

I am trying to load a CSV file into a DataFrame, using the first row of the file as the column names. With the code below, I get this error:
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
Code:
def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("SparkSessioncsvExample")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
  val df = spark.read
    .format("csv")
    .option("header", "true") // reading the headers
    .load("D:/Scala/C2ImportCalEventSample.csv")
}
But I'm able to load the file with the code:
val df = spark.sparkContext
  .textFile("D:/Scala/C2ImportCalEventSample1.csv")
  //.flatMap(header='true')
  .map(line => line.split(","))
  // .map(line => line.map()
  .toDF()
With the second snippet the file loads successfully, but the first row is not used as the column names of the DataFrame.
spark version is: spark-2.3.2
scala 2.11.3
jdk1.8.0_20
sbt-1.2.7
Thanks to anyone who can help me with this.
java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only have one version of scala) on the classpath.

Spark SQL: no such method error

I'm new to Spark SQL and am trying to run the examples from the Spark documentation, but I get a "no such method" error (the error and the program were originally posted as screenshots).
What should I do?
Note: I'm using IDEA to edit my program.
All of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._
object SqlTest1 {
  case class Person(name: String, age: Long)
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._
    runBasicDataFrameExample(spark)
  }
  private def runBasicDataFrameExample(spark: SparkSession) = {
    val df = spark.read.json("resorces/people.json")
    df.show()
  }
}
val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)
The Scala version may not match. Check that your Scala version is compatible with your Spark version.