I have Spark 2.4 / Scala 2.11 code that does some work using a foreach over a DataFrame. In unit tests I'm using a local SparkSession object.
The issue is that inside the foreach loop I'm trying to create and modify a DataFrame using the Seq.toDF method. For some weird reason the Spark session inside the loop is not instantiated properly, and I get a NullPointerException from the SQLContext when trying to do this manipulation.
Any ideas how I can ensure that the SparkSession object inside the loop is valid, so that the unit test runs properly?
Additional opinion
It seems Spark holds a connection to a file, because of which it cannot be serialized, hence the failure in JUnit.
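For illustration, a minimal Scala sketch (names such as ForeachSketch are made up) of why this happens and a common workaround: the closure passed to DataFrame.foreach runs on executors, where no SparkSession or SQLContext is available, so driver-side constructs like Seq.toDF throw a NullPointerException there. Collecting the rows to the driver first avoids the problem, assuming the data is small enough.

import org.apache.spark.sql.SparkSession

object ForeachSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("value")

    // This fails: the closure runs on executors, where the session is null,
    // so Seq.toDF inside it throws a NullPointerException.
    // df.foreach { row => Seq(row.getInt(0)).toDF("copy").show() }

    // Workaround: bring the rows back to the driver and loop there,
    // where the SparkSession is fully initialized.
    df.collect().foreach { row =>
      val inner = Seq(row.getInt(0)).toDF("copy")
      inner.show()
    }

    spark.stop()
  }
}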
I would like to run code on n processes, and have the logs from each process in a separate file.
I tried, naively, something like this:
from multiprocessing import Process
import logging

class Worker(Process):
    def __init__(self, logger_name, log_file):
        super().__init__()
        self.logger = logging.getLogger(logger_name)
        self.log_file = log_file
        self.logger.addHandler(logging.FileHandler(log_file))
        print("from init", self.logger, self.logger.handlers)

    def run(self) -> None:
        print("from run", self.logger, self.logger.handlers)

if __name__ == '__main__':
    p1 = Worker("l1", "log1")
    p1.start()
(tried in Python 3.9 and 3.11)
but for some reason, the handler is gone. This is the output:
from init <Logger l1 (WARNING)> [<FileHandler log1 (NOTSET)>]
from run <Logger l1 (WARNING)> []
Why is the FileHandler gone? Should I call addHandler within the run method instead -- is that the correct way?
I was trying to use this answer but couldn't really make it work.
For the moment, I solved it by defining the handlers in run, but that seems like a dirty hack to me...
UPDATE: This happens on my MacBook Python installations. On a Linux server, I couldn't reproduce this. Very confusing.
In either case, the question is probably:
"Is this the correct way to log to files, with several copies of one
process?"
I found the reason for the observed behavior. It has to do with pickling of objects when they are transferred between Processes.
In the standard library's implementation of Logger, a __reduce__ method is defined. This method is used in cases where an object cannot be reliably pickled. Instead of trying to pickle the object itself, the pickle protocol uses the value returned from __reduce__. In the case of Logger, __reduce__ returns a function (getLogger) and a string (the name of the Logger being pickled) to be used as its argument. In the unpickling procedure, the unpickling protocol makes a function call (logging.getLogger(name)); the result of that function call becomes the unpickled Logger instance.
The original Logger and the unpickled Logger will have the same name, but perhaps not much else in common. The unpickled Logger will have the default configuration, whereas the original Logger will have any customization you may have performed.
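You can see this mechanism directly; here is a small sketch (not from the original post) that prints what Logger.__reduce__ hands to pickle:

import logging

logger = logging.getLogger("l1")
logger.addHandler(logging.FileHandler("log1"))

# __reduce__ tells pickle to recreate the object as getLogger("l1"),
# discarding everything else (handlers, level, ...). Prints roughly:
# (<function getLogger at 0x...>, ('l1',))
print(logger.__reduce__())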
In Python, Process objects do not share an address space (at least, not on Windows). When a new Process is launched, its instance variables must somehow be "transferred" from one Process to another. This is done by pickling/unpickling. In the example code, the instance variables declared in the Worker.__init__ function do indeed appear in the new Process, as you can verify by printing them in Worker.run. But under the hood Python has actually pickled and unpickled all of the instance variables, to make it look like they magically have migrated to the new Process. In the vast majority of cases this works just fine. But not necessarily if one of those instance variables defines a __reduce__ method.
A logging.FileHandler cannot, I suspect, be pickled since it uses operating system resources (a file). This is probably the reason (or at least one of the reasons) why Logger objects can't be pickled.
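As a practical consequence, the usual pattern (a minimal sketch, not from the original post) is to store only plain, picklable data in __init__ and create the FileHandler inside run, in the child process where it will actually be used:

from multiprocessing import Process
import logging

class Worker(Process):
    def __init__(self, logger_name, log_file):
        super().__init__()
        # Only picklable data here; no handler objects.
        self.logger_name = logger_name
        self.log_file = log_file

    def run(self) -> None:
        # Runs in the child process: getLogger returns a fresh Logger there,
        # and the FileHandler is created where it is actually used.
        logger = logging.getLogger(self.logger_name)
        logger.setLevel(logging.INFO)
        logger.addHandler(logging.FileHandler(self.log_file))
        logger.info("logging from the worker process")

if __name__ == '__main__':
    workers = [Worker(f"l{i}", f"log{i}") for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()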
I have a very simple question: is it possible to load modules on demand in Julia? That is, can modules be loaded when they are actually needed, instead of being loaded at "parse time" at the top level?
The use case I have in mind is that I have some code that can do plotting using PyPlot, but that code is far from always executed.
At the moment this means that I have a top-level statement like using PyPlot, which takes quite a long time to load.
(Yes, I know: one should not restart Julia too often, etc., but nevertheless this is a point of annoyance.)
Is there a way to ensure that PyPlot is only loaded if it is actually needed?
The simplest idea would have been to put the using PyPlot inside the function that actually does the plotting:
function my_plot()
    using PyPlot
    plot(1:10, 1:10)
end
but this results in a syntax error:
ERROR: syntax: "using" expression not at top level
So, is there another way to achieve this?
The "using" statement runs when the line of code is encountered, and does not have to be at the top of the file. It does need to be in global scope, which means that the variables in the module loaded with "using" will be available to all functions in your program after the "using" statement is executed, not just a single function as might happen in the local scope of a function.
If you call the using statement as an expression within a Julia eval statement, the code is automatically run in global scope, even if the eval is syntactically called within a function's local scope. So if you use the macro @eval
function my_plot()
    @eval using PyPlot   # or without the macro, as eval(:(using PyPlot))
    plot(1:10, 1:10)
end
this acts as if using PyPlot were done outside a function, and so avoids the syntax error.
While printing the schema of a SQL database, I am getting the following error:
Use SparkSession instead of SQLContext. So do:
from pyspark.sql import SparkSession

sqlContext = SparkSession.builder.master("local[*]").appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse").getOrCreate()
The rest of your code should work normally.
You may want to rename the variable from sqlContext to reflect the reference it now holds.
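For illustration, a minimal PySpark sketch (the sample data is made up) of printing a schema through the session created as above, with no separate SQLContext involved:

from pyspark.sql import SparkSession

# Same builder as above; the variable holds a SparkSession despite the name.
sqlContext = SparkSession.builder.master("local[*]").appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse").getOrCreate()

# A throwaway DataFrame just to show printSchema working via the session.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.printSchema()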
It seems I'm unable to use a method reference bound to an object instance in Kotlin. This feature exists in Java.
For example, in Java, if I were looping through a string to append each character to a writer:
string.forEach(writer::append);
But in Kotlin using the same syntax does not work because:
For now, Kotlin only supports references to top-level and local functions and members of classes, not individual instances. See the docs here.
So, you can say Writer::append and get a function Writer.(Char) -> Writer, but taking a writer instance and saying writer::append to get a function (Char) -> Writer is not supported at the moment.
Starting from Kotlin 1.1, writer::append is a perfectly valid bound callable reference.
However, you still cannot write string.forEach(writer::append), because the Writer#append method returns a Writer instance while forEach expects a function that returns Unit.
I am using Kotlin 1.3 and while referencing a Java method I got a very similar error. As mentioned in this comment, making a lambda and passing it to the forEach method is a good option.
key.forEach { writer.append(it) }
Here, it is the implicit name of the single lambda parameter.
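To pull the pieces together, here is a small self-contained Kotlin sketch (StringWriter is just a convenient Writer for the example); the comments reflect the behavior described in the answers above:

import java.io.StringWriter

fun main() {
    val writer = StringWriter()
    val string = "hello"

    // writer::append is a valid bound reference since Kotlin 1.1, but its type is
    // (Char) -> Writer, while forEach wants (Char) -> Unit, so according to the
    // answers above string.forEach(writer::append) was rejected at the time.

    // A lambda discards the return value, so this compiles and works:
    string.forEach { writer.append(it) }

    println(writer.toString())  // prints "hello"
}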
I'm using Selenium client driver 2.4.0. When running tests using the WebDriverBackedSelenium object, e.g.
final FirefoxDriver driver = new FirefoxDriver();
selenium = new WebDriverBackedSelenium(driver, baseUrl);
how do I inject a JavaScript array into my tests that can retain scope across different pages? That is, I want to create a JS var "myArray" that I can access (using selenium.getEval) when I open "http://mydomain.com/page1.html" and can then reference when I open a different page ("http://mydomain.com/page2.html") within the same Selenium test.
Thanks, - Dave
I don't think it is possible out of the box.
A workaround should work: add to the page a library that can deserialize from JSON (e.g. Dojo), use it to load an array definition into a JavaScript variable, and before leaving the page read it back, storing it outside the scope of the page request.
But I must say this is a rather unusual request - what are you trying to do?
You can do it with casting. Execute JavaScript to return an array. The JS array must contain only one type, which must be primitive.
For example, execute a script which returns an array of Strings:
ArrayList<String> strings = (ArrayList<String>) js.executeScript(returnArrayOfStrings);
If you need an array of any other type, you can build it from those strings. For example, if you need an array of WebElements, design your JS to return locators, and then iterate through, finding elements and building a new array:
ArrayList<String> xpaths = (ArrayList<String>) js.executeScript(getLocators);
ArrayList<WebElement> elements = new ArrayList<WebElement>();
for (String xpath : xpaths) {
    WebElement element = driver.findElement(By.xpath(xpath));
    elements.add(element);
}
You have the array in Java, so you can keep it in memory when your tests go to different pages and still reference the Java array.
The only catch is that if your JS array changes on the client side, your Java array won't automatically update itself (the JavascriptExecutor only returns once per execution). But that's not a big deal: instead of referencing the Java array directly, you can access it through a getter that first executes the JS again to get a fresh array, which you can use to replace the previous one (or merge them) before returning the new/updated array to your test code.
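For illustration, a minimal Java sketch of that getter idea (the class name ArrayHolder and the window.myArray variable are made up; the pattern is simply to re-run the JS on each access so the Java copy stays current):

import org.openqa.selenium.JavascriptExecutor;
import java.util.ArrayList;
import java.util.List;

public class ArrayHolder {
    private final JavascriptExecutor js;
    private List<String> values = new ArrayList<>();

    public ArrayHolder(JavascriptExecutor js) {
        this.js = js;
    }

    @SuppressWarnings("unchecked")
    public List<String> getValues() {
        // window.myArray is assumed to have been created earlier by the test.
        Object result = js.executeScript("return window.myArray || [];");
        values = new ArrayList<>((List<String>) result);  // replace the previous copy
        return values;
    }
}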