Cannot access Scala value/variable inside RDD foreach function (Null) - dataframe

I have a Spark Structured Streaming job that needs to use rdd.foreach inside the foreachBatch function, as per the code below:
val tableName = "ddb_table"
df
  .writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    batchDF
      .rdd
      .foreach(
        r => updateDDB(r, tableName, "key")
      )
    curDate = LocalDate.now().toString.replaceAll("-", "/")
    prevDate = LocalDate.now().minusDays(1).toString.replaceAll("-", "/")
  }
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "checkPointDir")
  .start()
  .awaitTermination()
What happens is that the tableName variable is not visible inside the rdd.foreach function: the call to the DynamoDB API inside updateDDB raises an exception stating that tableName cannot be null.
The issue is clearly in rdd.foreach and the way it handles variables in closures. I have read about broadcast variables, but I don't have enough experience working with RDDs and Spark at that lower level to be sure which way to go.
Some notes:
I need this to be inside the foreachBatch function because I need to update other variables apart from this write to DDB (in this case the curDate and prevDate variables).
The code runs successfully when I pass the tableName parameter directly in the function call.
I have a class that extends ForeachWriter that works fine when using foreach instead of foreachBatch, but as stated in the first note I need the latter because I have to update several things at each streaming batch.
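For reference, here is a minimal sketch of the broadcast-variable approach mentioned above, assuming a SparkSession named spark and the same df and updateDDB as in the snippet (an illustration, not a confirmed fix):

val tableNameBc = spark.sparkContext.broadcast("ddb_table")
df
  .writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    // executors read the broadcast value instead of closing over a driver-side field
    batchDF.rdd.foreach(r => updateDDB(r, tableNameBc.value, "key"))
  }
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "checkPointDir")
  .start()
  .awaitTermination()

Another commonly suggested variant is to copy the value into a local val at the top of the foreachBatch body before calling rdd.foreach, so that only that local value, rather than the enclosing object, gets captured in the closure.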

Related

Can I use invoke on a stored tabular function in Log Analytics?

I'm trying to clean up some shared functionality across queries and would like to have a number of filter functions as stored Log Analytics functions.
Now, the below works fine if the function is defined in the same place as the query, but when I split the function into a stored LA function, I can't figure out how to get the invoke operator to work.
//function to filter
let remove_robotstxt = (T:(requestUri_s:string)) {
    T
    | where parse_url(requestUri_s).Path != "/robots.txt"
};
//
//
AzureDiagnostics
| where Category == "FrontdoorAccessLog"
| invoke remove_robotstxt()
Passing params such as strings to functions works just fine, but how about tabular functions? What am I missing?
I have tried a union to the function and a number of other things, but my query doesn't seem to see the function as available.
I ended up just creating the function as a query over the table. So in your case:
AzureDiagnostics
| where parse_url(requestUri_s).Path != "/robots.txt"
Then save it as a function named AzureDiagnosticsRemovedRobots.
You can then simply call this function directly:
AzureDiagnosticsRemovedRobots
| where Category == "FrontdoorAccessLog"
This might not be exactly what you're looking for but it kind of works for me.

BigQuery fails to save view that uses functions

We're using BigQuery with their new dialect of "standard" SQL.
The new SQL supports inline functions written in SQL instead of JS, so we created a function to handle date conversion.
CREATE TEMPORARY FUNCTION
STR_TO_TIMESTAMP(str STRING)
RETURNS TIMESTAMP AS (PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', str));
It must be a temporary function; if you try a permanent function, Google returns the error: Only temporary functions are currently supported; use CREATE TEMPORARY FUNCTION.
If you try to save a view with a query that uses the function inline, you get the following error: Failed to save view. No support for CREATE TEMPORARY FUNCTION statements inside views.
If you try to outsmart it and remove the function (hoping to add it at query time), you'll receive this error: Failed to save view. Function not found: STR_TO_TIMESTAMP at [4:7].
Any suggestions on how to address this? We have more complex functions than the example shown.
The issue has since been marked as resolved, and BigQuery now supports permanent registration of UDFs.
In order to use your UDF in a view, you'll need to create it first.
CREATE OR REPLACE FUNCTION `ACCOUNT-NAME11111.test.STR_TO_TIMESTAMP`
(str STRING)
RETURNS TIMESTAMP AS (PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', str));
Note that you must use backticks around the function's name.
There's no TEMPORARY in the statement, as the function will be globally registered and persisted.
Due to the way BigQuery handles namespaces, you must include both the project name and the dataset name (test) in the function's name.
Once it's created and working successfully, you can use it in a view.
create view test.test_view as
select `ACCOUNT-NAME11111.test.STR_TO_TIMESTAMP`('2015-02-10T13:00:00Z') as ts
You can then query your view directly without explicitly specifying the UDF anywhere.
select * from test.test_view
As per the documentation https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_function_statement, the functionality is still in beta but it works. The function is visible in the same dataset it was created in, and the view can then be created.
Please share if that worked fine for you or if you have any findings which would be helpful for others.
Saving a view created with a temp function is still not supported, but what you can do is schedule the SQL query (scheduled queries have already been rolled out in the latest UI) and then save the result as a table. This worked for me, but I guess it depends on the query parameters you want.
#standardSQL
# JS in SQL to extract multiple h.CDs at the same time.
CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY<STRUCT<index INT64, value STRING>>, index INT64)
RETURNS STRING
LANGUAGE js AS """
for (var i = 0; i < cd.length; i++) {
  var item = cd[i];
  if (item.index == index) {
    return item.value;
  }
}
return '';
""";
SELECT DISTINCT
  h.page.pagePath,
  getCustomDimension(h.customDimensions, 20),
  fullVisitorId,
  h.page.pagePathLevel1,
  h.page.pagePathLevel2,
  h.page.pagePathLevel3,
  getCustomDimension(h.customDimensions, 3)
FROM
  `XXX.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  # rolling timeframe
  _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL YY DAY))
  AND h.type = 'PAGE'
Credit for the solution goes to https://medium.com/@JustinCarmony/strategies-for-easier-google-analytics-bigquery-analysis-custom-dimensions-cad8afe7a153

Aerospike limiting records by lexicographic order

Can Aerospike get records by lexicographic order? For example, if you want all the records that start with "a", you would like to search for bin >= "a" AND bin <= "az".
Aerospike supports UDF modules (in Lua and C) https://www.aerospike.com/docs/udf/developing_lua_modules.html which can serve your purpose.
User-Defined Functions written in Lua extend the core functionality of Aerospike. You would create a stream UDF and attach it to a query.
One best practice for stream UDFs in Aerospike is to eliminate as many records as possible before passing the results into the UDF, so in this case I would create another bin to hold a prefix (the first letter, or a substring, depending on your use case) and build a secondary index on it. The idea is that the query portion should reliably return as small a subset as possible. For your example the prefix can be a single character: add a new bin 'firstchar' to the records in the set, then build a secondary index on it.
The stream UDF module would look something like:
local function range_filter(bin_name, substr_from, substr_to)
  -- returns a filter closure that keeps only string bins within [substr_from, substr_to]
  return function(record)
    local val = record[bin_name]
    if type(val) ~= 'string' then
      return false
    end
    if val >= substr_from and val <= substr_to then
      return true
    else
      return false
    end
  end
end

local function rec_to_map(record)
  -- copy every bin of the record into a map so it can flow through the stream
  local xrec = map()
  for i, bin_name in ipairs(record.bin_names(record)) do
    xrec[bin_name] = record[bin_name]
  end
  return xrec
end

function str_between(stream, bin_name, substr_from, substr_to)
  return stream : filter(range_filter(bin_name, substr_from, substr_to)) : map(rec_to_map)
end
In the Python client you'd invoke it as follows:
import aerospike
from aerospike import predicates as p
# instantiate the client and connect to the cluster, then:
query = client.query('test', 'this')
query.where(p.equals('firstchar', 'a'))
query.apply('strrangemod', 'str_between', ['a','az'])

Spark: How to send arguments to Spark foreach function

I am trying to save the contents of a Spark RDD to Redis with the following code
import redis

class RedisStorageAdapter(BaseStorageAdapter):

    @staticmethod
    def save(record):
        # ###--- How do I get action_name ---- ###
        redis_key = #<self.source_action_name>
        redis_host = settings['REDIS']['HOST']
        redis_port = settings['REDIS']['PORT']
        redis_db = settings['REDIS']['DB']
        redis_client = redis.StrictRedis(redis_host, redis_port, redis_db)
        redis_client.sadd(redis_key, record)

    def store_output(self, results_rdd):
        print self.source_action_name
        results_rdd.foreach(RedisStorageAdapter.save)
But I want the Redis key to be different based on what self.source_action_name is initialized to (in BaseStorageAdapter).
How do I pass source_action_name to the RedisStorageAdapter.save function? The foreach function only takes the function to be executed, with no parameter list.
Also - if there is a better way to move data from RDD to Redis, let me know
Of course, foreach takes a function, not a function name, so you can pass it a lambda function:
results_rdd.foreach(lambda x: RedisStorageAdapter.save(x, self.source_action_name))

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly from Spark SQL, so I was thinking of writing a user defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing a UDF?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
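For instance, a hypothetical sketch of chaining two .where() calls with this language-integrated API, reusing the entries and myDateFilter defined above (the second predicate is made up for illustration):

val augustNotFour = entries
  .where('when)(myDateFilter)
  .where('name)((name: String) => name != "four")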
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
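If you would rather stay in the DataFrame API than write SQL text, the same function can also be wrapped with org.apache.spark.sql.functions.udf; a minimal sketch reusing convert2Years and moviesDf from the snippet above:

import org.apache.spark.sql.functions.{col, udf}

// wrap the Scala function as a DataFrame UDF and apply it column-wise
val convert2YearsUdf = udf(convert2Years _)
val yearsDf = moviesDf.select(convert2YearsUdf(col("releaseDate")).as("releaseYear"))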
In PySpark 1.5 and above, we can easily achieve this with a builtin function.
Following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)
]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])

from pyspark.sql.functions import *

Day_Material_revenue_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")