Spark: How to send arguments to Spark foreach function - redis

I am trying to save the contents of a Spark RDD to Redis with the following code
import redis

class RedisStorageAdapter(BaseStorageAdapter):

    @staticmethod
    def save(record):
        ### --- How do I get action_name? --- ###
        redis_key = # <self.source_action_name>
        redis_host = settings['REDIS']['HOST']
        redis_port = settings['REDIS']['PORT']
        redis_db = settings['REDIS']['DB']
        redis_client = redis.StrictRedis(redis_host, redis_port, redis_db)
        redis_client.sadd(redis_key, record)

    def store_output(self, results_rdd):
        print self.source_action_name
        results_rdd.foreach(RedisStorageAdapter.save)
But I want the Redis key to be different based on what self.source_action_name is initialized to (in BaseStorageAdapter).
How do I pass source_action_name to the RedisStorageAdapter.save function? foreach only accepts the function to be executed, with no parameter list.
Also, if there is a better way to move data from an RDD to Redis, let me know.

Of course, foreach takes a function, not a function name, so you can pass it a lambda:
results_rdd.foreach(lambda x: RedisStorageAdapter.save(x, self.source_action_name))
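If you prefer not to capture self in the closure (which would mean serializing the whole adapter object to the workers), here is a minimal sketch of the same idea that binds the key to a local variable first:

import redis

class RedisStorageAdapter(BaseStorageAdapter):

    @staticmethod
    def save(record, redis_key):
        # one connection per record, as in the original code
        redis_client = redis.StrictRedis(settings['REDIS']['HOST'],
                                         settings['REDIS']['PORT'],
                                         settings['REDIS']['DB'])
        redis_client.sadd(redis_key, record)

    def store_output(self, results_rdd):
        key = self.source_action_name  # a plain string, cheap to ship to the workers
        results_rdd.foreach(lambda record: RedisStorageAdapter.save(record, key))

For larger datasets, results_rdd.foreachPartition would also let you open a single Redis connection per partition instead of one per record; the lambda/partial trick for passing the key stays the same.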

Related

Cannot access scala value/variable inside RDD foreach function (Null)

I have a Spark Structured Streaming job that needs to use rdd.foreach inside the foreachBatch function, as per the code below:
val tableName = "ddb_table"
df
  .writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    batchDF
      .rdd
      .foreach(
        r => updateDDB(r, tableName, "key")
      )
    curDate = LocalDate.now().toString.replaceAll("-", "/")
    prevDate = LocalDate.now().minusDays(1).toString.replaceAll("-", "/")
  }
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "checkPointDir")
  .start()
  .awaitTermination()
What happens is that the tableName variable is not recognized inside the rdd.foreach function: the call to the DynamoDB API inside updateDDB raises an exception stating that the tableName cannot be null.
The issue is clearly in rdd.foreach and the way it handles variables. I have read a bit about broadcast variables, but I don't have enough experience working with RDDs and Spark at that lower level to be sure which way to go.
Some notes:
I need this to be inside the foreachBatch function because I need to update other variables apart from this write to DDB (in this case the curDate and prevDate variables).
The code runs successfully when I pass the tableName parameter directly in the function call.
I have a class that extends ForeachWriter that works fine when using foreach instead of foreachBatch, but as stated in the first note I need the latter because I have to update several things at each streaming batch.

How can I access a value in a sequence type?

There are the following attributes in client_output:
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I turned client_output into a sequence through a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn, a), and I declared
@tff.tf_computation()  # append
def selecting_fn(a):
    # TODO
    return round_model_delta
In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate the same way as, for example, a client dataset is usually handled in a method decorated with tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.
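A minimal sketch of that flow, assuming an older TFF release where tff.federated_collect is still available, and using a scalar float in place of the question's client_output structure just to show the plumbing:

import tensorflow as tf
import tensorflow_federated as tff

# The element type here is an assumption; substitute the real client_output type.
@tff.tf_computation(tff.SequenceType(tf.float32))
def selecting_fn(collected):
    # Inside the tf_computation the SequenceType argument behaves like a
    # tf.data.Dataset, so filter/map/reduce from tf.data are all available.
    return collected.reduce(tf.constant(0.0), lambda acc, x: acc + x)

@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def collect_and_select(client_values):
    a = tff.federated_collect(client_values)   # SequenceType placed at tff.SERVER
    return tff.federated_map(selecting_fn, a)  # executed on the server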

How to properly iterate over a for loop using Dask?

When I run a loop like the one below using Dask and pandas, only the last field in the list gets evaluated. Presumably this is because of lazy evaluation.
import pandas as pd
import dask.dataframe as ddf

df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list)
If I add .compute() to the last line:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()
it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one-shot via compute()?
You will want to call .compute() at the end of your computation to trigger the work. Warning: .compute assumes that your result will fit in memory.
Also, watch out: lambdas late-bind in Python, so the field value may end up being the same for all of your columns.
Here's one way to do it, where string_check is just a sample function that returns True/False. The issue was the late binding of lambda functions.
from functools import partial

def string_check(string, search):
    return search in string

search_terms = ['foo', 'bar']

for s in search_terms:
    string_check_partial = partial(string_check, search=s)
    df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)
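Applied back to the original Dask loop, here is a sketch along those lines (the toy data and the meta spec are assumptions) that builds every column lazily and then materialises them all with a single compute():

import pandas as pd
import dask.dataframe as ddf
from functools import partial

def keep_matches(values, field):
    # keep only the elements equal to `field`
    return [v for v in values if v == field]

# toy input, just for the sketch
df = pd.DataFrame({"column": [["a", "b"], ["b", "c"], ["a", "a"]]})
df_dask = ddf.from_pandas(df, npartitions=2)

fields = ["a", "b", "c"]
for field in fields:
    df_dask["column__{}".format(field)] = df_dask["column"].apply(
        partial(keep_matches, field=field),
        meta=("column__{}".format(field), "object"))

# nothing has run yet; one compute() triggers all of the new columns at once
result = df_dask.compute()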

Aerospike limiting records by lexicographic order

Can Aerospike get records in lexicographic order? For example, if you want all the records that start with "a", you would like to search for bin >= "a" AND bin <= "az".
Aerospike supports UDF modules (in Lua and C) https://www.aerospike.com/docs/udf/developing_lua_modules.html which can serve your purpose.
User-Defined Functions written in Lua extend the core functionality of Aerospike. You would create a stream UDF and attach it to a query.
One best practice for stream UDFs in Aerospike is to eliminate as many records as possible before passing the results into the UDF, so in this case I would create another bin to hold a prefix (first letter, or a substring, depending on your use case) and build a secondary index on it. The idea is that the query portion should reliably return as small a subset as possible. For your example the prefix can be a single character: add a new bin 'firstchar' to the records in the set, then build a secondary index on it.
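As a rough sketch of that preparation step with the Python client (the host, record key, bin names and index name are all assumptions here):

import aerospike

# connect to the cluster
client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

# store the one-character prefix in its own bin next to the original string bin
key = ('test', 'this', 'record1')
client.put(key, {'mystr': 'apple', 'firstchar': 'a'})

# secondary index on the prefix bin, so the query narrows the stream before the UDF runs
client.index_string_create('test', 'this', 'firstchar', 'idx_firstchar')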
The stream UDF module would look something like:
local function range_filter(bin_name, substr_from, substr_to)
  return function(record)
    local val = record[bin_name]
    if type(val) ~= 'string' then
      return false
    end
    if val >= substr_from and val <= substr_to then
      return true
    else
      return false
    end
  end
end

local function rec_to_map(record)
  local xrec = map()
  for i, bin_name in ipairs(record.bin_names(record)) do
    xrec[bin_name] = record[bin_name]
  end
  return xrec
end

function str_between(stream, bin_name, substr_from, substr_to)
  return stream : filter(range_filter(bin_name, substr_from, substr_to)) : map(rec_to_map)
end
In the Python client you'd invoke it as follows:
import aerospike
from aerospike import predicates as p
# instantiate the client and connect to the cluster, then:
query = client.query('test', 'this')
query.where(p.equals('firstchar', 'a'))
query.apply('strrangemod', 'str_between', ['a','az'])
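To actually run the query and collect whatever the stream UDF emits, something like the following should work (a sketch):

# executes the query with the stream UDF applied and gathers the emitted maps
records = query.results()
for rec in records:
    print(rec)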

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly from Spark SQL, so I was thinking of writing a user defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
In PySpark 1.5 and above, we can easily achieve this with a builtin function.
Following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])
from pyspark.sql.functions import *
Day_Material_revenue_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
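Since the original question was about fetching the month from a string date, the builtin month function covers that as well (a sketch, reusing the column names from the example above):

from pyspark.sql.functions import month, to_date

Month_Material_revenue_df = Time_Material_revenue_df.select(
    month(to_date("Sold_time")).alias("Sold_month"), "Material", "Revenue")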