How does Python interact with the JVM inside Spark?

I am writing Python code to develop some Spark applications. I am really curious how Python interacts with the running JVM, so I started reading the Spark source code.
I can see that, in the end, all the Spark transformations/actions end up calling certain JVM methods in the following way.
self._jvm.java.util.ArrayList(),
self._jvm.PythonAccumulatorParam(host, port))
self._jvm.org.apache.spark.util.Utils.getLocalDir(self._jsc.sc().conf())
self._jvm.org.apache.spark.util.Utils.createTempDir(local_dir, "pyspark") \
.getAbsolutePath()
...
As a Python programmer, I am really curious what is going on with this _jvm object. However, I have briefly read all the source code under pyspark and only found _jvm to be an attribute of the Context class; beyond that, I know nothing about _jvm's attributes or methods.
Can anyone help me understand how pyspark translates into JVM operations? Should I read some Scala code and see if _jvm is defined there?

It uses py4j. There is a special protocol to translate Python calls into JVM calls. You can find all of this in the PySpark code; see java_gateway.py.
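To make the idea of "translating Python calls into JVM calls" more concrete, here is a minimal pure-Python sketch of how an object like _jvm could turn attribute chains into commands. It is not py4j's actual implementation or wire protocol: the classes CommandRecorder and JavaNameProxy and the "call ..." command format are hypothetical, and the recorder only stores commands where a real gateway would send them to the JVM.

```python
class CommandRecorder:
    """Hypothetical stand-in for a gateway connection.

    It records commands in a list; a real gateway would send them to the JVM.
    """

    def __init__(self):
        self.sent = []

    def send(self, command):
        self.sent.append(command)
        return f"result of {command}"


class JavaNameProxy:
    """Builds a dotted JVM name one attribute at a time, then 'calls' it."""

    def __init__(self, connection, name=""):
        self._connection = connection
        self._name = name

    def __getattr__(self, attr):
        # __getattr__ only runs for names that normal lookup did not find,
        # so _connection and _name are never routed through here.
        dotted = f"{self._name}.{attr}" if self._name else attr
        return JavaNameProxy(self._connection, dotted)

    def __call__(self, *args):
        # Render the call as a text command (made-up format for illustration).
        rendered = ", ".join(repr(a) for a in args)
        return self._connection.send(f"call {self._name}({rendered})")


recorder = CommandRecorder()
jvm = JavaNameProxy(recorder)
jvm.java.util.ArrayList()
jvm.org.apache.spark.util.Utils.createTempDir("/tmp", "pyspark")
print(recorder.sent)
```

This would explain why reading the pyspark sources reveals no explicit attributes on _jvm: with this kind of design, names such as java.util.ArrayList are resolved dynamically at attribute-access time rather than being defined anywhere in Python.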

Related

How do I use such a line in Kotlin?

I use Python, but I don't know how this works in Kotlin. This is an example:
exec("""print("hello")""") output => hello

Kotlin supports JSR-223. You can use the JVM scripting engine to evaluate Kotlin script (kts) code.
import javax.script.ScriptEngineManager
val engine = ScriptEngineManager().getEngineByExtension("kts")
engine.eval("""print("hello")""")
You need the JSR-223 library dependency. Refer to example
implementation("org.jetbrains.kotlin:kotlin-scripting-jsr223:$kotlinVersion")

Short answer: this isn't practical in Kotlin.
Technically, there may be ways, but they're likely to be far more trouble than they're worth; you're far better looking for a different approach to your problem.
Unlike a dynamic (‘scripting’) language like Python, Kotlin is statically-compiled. In the case of Kotlin/JVM, you run the Kotlin compiler to generate .class files with Java bytecode, which is then run by a JVM.
So if you really need to convert a string into code and run it, you'd have to find a way to ensure that a Kotlin compiler is available on the platform where your code is running (which it often won't be; compiled bytecode can run on any platform with a JVM, and most of those won't have Kotlin installed too). You'd then have to find a way to run the compiler; this will probably mean writing your source code out to a file, starting up the compiler program as a separate process (as I don't think there's an API for calling it directly), and checking the results. Then you'd have to find the resulting bytecode and load it into the JVM, which will probably mean setting up a separate classloader instance.
All of which is likely to be slow, fragile, and very awkward.
(See these previous questions which cover some of the same ground.)
(The details will be different for Kotlin/JS and Kotlin/Native, but I think the principles are roughly the same.)
In general, each computer language has its own approach, its own mind-set and ways of doing things, and it's best to try to understand that and accept that patterns and techniques from one language don't always translate well into another. (In the Olden Days™, it used to be said that a determined programmer could write FORTRAN programs in any language — but only in satire.)
Perhaps if you could explain why you want to do this, and what sort of problem you're trying to solve (probably as a separate question), we might be able to suggest more natural solutions in Kotlin.

Ironpython - Issues attaching to an instance of an already running program

Ok folks this is a long one, so please bear with me. I'll preface this by stating that I am -for all intents and purposes- a noob.
I'm trying to link to a running instance of a program (ETABS) using IronPython. The program has an API and decent documentation on how one can go about hooking into the running instance (EXAMPLE). However, their examples are for Python, C#, VB.net but not IronPython.
No biggie I thought, the Marshal module can be used to hook into it. So I tried this:
from System.Runtime.InteropServices import Marshal
csiApp = Marshal.GetActiveObject("CSI.ETABS.API.ETABSObject")
SapModel=csiApp.SapModel
Unfortunately I get errors on that last line - "ETABSObject has no attribute SapModel".
And yes, I've tried running it with csiApp.SapModel() as well with the same results.
So I delved deeper into it and apparently the object needs to be cast into another type - at least that's the way it's been done for the C# example (LINK). Since - to my knowledge - we can't really cast objects around in Python (and yes, I've already tried clr.Convert) I came to the conclusion that the object being returned to IronPython is a few abstractions removed from the object that I really need. Apparently comtypes can handle this automatically in the background (seeing as the Python example works flawlessly). The code block below shows the object types returned to IronPython and to pure Python respectively:
Ipy : <System.MarshalByRefObject object at 0x000000000000002B [CSI.ETABS.API.ETABSObject]>
Python with comtypes : <POINTER(cOAPI) ptr=0x2e68d17f7c8 at 2e690b36a48>
I'm working on IronPython 2.7.3 and can't really update it (for several reasons not relevant to this post). Would love to have advice on how to fix this or on how to install comtypes on Ipy.

So I think I've found the reason why this is happening - IronPython cannot directly use MarshalByRefObjects (source) since Reflection doesn't work on these. It seems I'll need to create a C# class which can cast this object into the one I want, compile it into a DLL and load that into my Ipy code.
I'll leave this here in case someone with more knowledge has a better answer.

How to retrieve help for Pandas methods using '??'

I am new to Pandas, trying to learn the basics from lecture videos. In one of these the presenter demonstrates that one can call help on methods using ??.
For example, if I have loaded a dataframe df, then typing df.__getitem__?? should print the docstring as well as the source code to the console. This would be really great to have, but it doesn't work for me! I tried different variants of the command and also tried to find a comment online on this, without success.
What do I need to type in order to retrieve the docstring as well as the source code of a Pandas method? Thanks a lot for your help !
(I am using Python 3.5 and PyCharm in case that makes a difference)

I believe that your lecturer was using ipython, as this does support dynamic object information. For instance, when you run df.__getitem__?? in ipython you see the following:
I strongly recommend ipython for interactive Python development. You'll find a lot of devs using it for data exploration and analysis, and the notebook is really useful for saving your commands and the output.
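Outside IPython (for example in a plain Python console in PyCharm), the standard library's inspect module can retrieve roughly the same information. The sketch below is not what ?? does internally; show_help is a hypothetical helper name, and json.dumps is used only so the example runs without pandas installed.

```python
import inspect
import json


def show_help(obj):
    """Return (docstring, source), roughly what `obj??` displays in IPython."""
    doc = inspect.getdoc(obj) or "<no docstring>"
    try:
        source = inspect.getsource(obj)
    except (OSError, TypeError):
        # Built-in or C-implemented objects have no Python source to show.
        source = "<source not available>"
    return doc, source


doc, source = show_help(json.dumps)
print(doc.splitlines()[0])     # first line of the docstring
print(source.splitlines()[0])  # first line of the source code
```

With a dataframe loaded, the same helper should work as show_help(df.__getitem__), since inspect.getsource accepts bound methods as long as the method is implemented in Python.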

What is a good workflow for developing Julia modules with IPython/Jupyter?

I find myself frequently developing new Julia modules while at the same time using those modules for my work. So I'll have an IPython (Jupyter) notebook, with something like:
using DataFrames
using MyModule
Then I'll do something like:
x = myfunction(7, 3)
But I'll have to modify that function, and unfortunately by that point I can't simply do
using MyModule
again. I'm not really sure why; I thought that calling using simply declares available modules in order to make the global scope aware of them, and then when a name is actually needed, the runtime searches for the definition among the currently loaded modules (starting with Main).
So shouldn't using MyModule simply just refresh the definitions of the items in the already declared module? Why do I have to completely stop and restart the kernel in order to use my updated functions? (Is it because names are bound only once to functions that are declared using the function keyword?)
I've looked at Julia Workflow Tips, but I don't find the whole Tmp, tst.jl system very simple or elegant... at least for a notebook.
Any suggestions?

I think there's a lot of truth in this statement attributed to one of the Juno developers: Jupyter notebook is for working with data. Juno IDE is for working with code.
Jupyter is great for using modules in a notebook style, so that the output you're getting is reproducible. Juno and the REPL have less overhead, which lets you keep starting new sessions (faster testing, and it fixes the problem you noted), keep multiple tabs open to follow code around a complex module, and use the debugger (in v0.5). They address different development issues at different stages of use. I think you're pushing against the tide if you're using the wrong tool for the job.

Key binding to interactively execute commands from Python interpreter history in order?

I sometimes test Python modules as I develop them by running a Python interactive prompt in a terminal, importing my new module and testing out the functionality. Of course, since my code is in development there are bugs, and frequent restarts of the interpreter are required. This isn't too painful when I've only executed a couple of interpreter lines before restarting: my key sequence when the interpreter restarts looks like Up Up Enter Up Up Enter... but extrapolate it to 5 or more statements to be repeated and it gets seriously painful!
Of course I could put my test code into a script which I execute with python -i, but this is such a scratch activity that it doesn't seem quite "above threshold" for opening a text editor :) What I'm really pining for is the Ctrl-r behaviour from the bash shell: executing a sequence of 10 commands in sequence in bash involves finding the command in history (repeated Up or Ctrl-r for a search -- both work in the Python interpreter shell) and then just pressing Ctrl-o ten times. One of my favourite bash shell features.
The problem is that while lots of other readline binding functionality like Ctrl-a, Ctrl-e, Ctrl-r, and Ctrl-s work in the Python interpreter, Ctrl-o does not. I've not been able to find any references to this online, although perhaps the readline module can be used to add this functionality to the python prompt. Any suggestions?
Edit: Yes, I know that using the interactive interpreter is not a development methodology that scales beyond a few lines! But it is convenient for small tests, and IMO the interactiveness can help to work out whether a developing API is natural and convenient, or too heavy. So please confine the answers to the technical question of whether readline history-stepping can be made to work in python, rather than the side-opinion of whether one should or shouldn't choose to (sometimes) work this way!
Edit: Since posting I realised that I am already using the readline module to make some Python interpreter history functions work. But the Ctrl-o binding to the operate-and-get-next readline command doesn't seem to be supported, even if I put readline.parse_and_bind("Control-o: operate-and-get-next") in my PYTHONSTARTUP file.

I often test Python modules as I develop them by running a Python interactive prompt in a terminal, importing my new module and testing out the functionality.
Stop using this pattern and start writing your test code in a file and your life will be much easier.
No matter what, running that file will be less trouble.
If you make the checks automatic rather than reading the results, it will be quicker and less error-prone to check your code.
You can save that file when you're done and run it whenever you change your code or environment.
You can perform metrics on the tests, like making sure you don't have parts of your code you didn't test.
Are you familiar with the unittest module?
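As a rough illustration of the points above, here is a minimal unittest sketch. The function parse_version is a hypothetical stand-in for whatever the module under development provides; the checks are automatic, so nothing has to be read off the screen.

```python
import unittest


def parse_version(text):
    """Hypothetical function under test: '1.2.3' -> (1, 2, 3)."""
    return tuple(int(part) for part in text.split("."))


class ParseVersionTest(unittest.TestCase):
    def test_three_parts(self):
        self.assertEqual(parse_version("1.2.3"), (1, 2, 3))

    def test_rejects_non_numeric(self):
        # int("x") raises ValueError, which the test expects.
        with self.assertRaises(ValueError):
            parse_version("1.x")


# Load and run the tests explicitly so the outcome is available as an object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseVersionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a standalone test file, the last two lines are usually replaced by a call to unittest.main() under an `if __name__ == "__main__":` guard, so that rerunning every check after a restart is a single command.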

Answering my own question, after some discussion on the python-ideas list: despite contradictory information in some readline documentation it seems that the operate-and-get-next function is in fact defined as a bash extension to readline, not by core readline.
So that's why Ctrl-o neither behaves as hoped by default when importing the readline module in a Python interpreter session, nor when attempting to manually force this binding: the function doesn't exist in the readline library to be bound.
A Google search reveals https://bugs.launchpad.net/ipython/+bug/382638, on which the GNU readline maintainer gives reasons for not adding this functionality to core readline and says that it should be implemented by the calling application. He also says "its implementation is not complicated", although it's not obvious to me how (or whether it's even possible) to do this as a pure Python extension to the readline module behaviour.
So no, this is not possible at the moment, unless the operate-and-get-next function from bash is explicitly implemented in the Python readline module or in the interpreter itself.
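A possible workaround, which is not a key binding and not an implementation of operate-and-get-next, is a small helper that re-executes a run of history lines. The sketch below assumes the readline module is available and that each history entry is a complete single-line statement; replay and history_lines are hypothetical names.

```python
def replay(lines, namespace, count):
    """Re-execute the last `count` entries of `lines` in `namespace`, oldest first."""
    if count <= 0:
        return []
    selected = list(lines)[-count:]
    for line in selected:
        # "single" mode mimics the interactive prompt (expression values are echoed).
        exec(compile(line, "<history>", "single"), namespace)
    return selected


def history_lines():
    """Collect the session history through the readline module."""
    import readline  # imported lazily: not available on every platform

    total = readline.get_current_history_length()
    # readline history indexes are 1-based.
    return [readline.get_history_item(i) for i in range(1, total + 1)]


# Demonstration with an explicit list instead of real readline history.
demo_namespace = {"x": 1}
ran = replay(["x = 5", "y = x + 1", "z = y * 10"], demo_namespace, 2)
print(ran, demo_namespace["z"])
```

At a prompt this might be used as replay(history_lines()[:-1], globals(), 5), where the slice drops the most recent entry on the assumption that it is the replay call itself. Multi-line statements are stored line by line in the history and would not survive this approach.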

This isn't exactly an answer to your question, but if that is your development style you might want to look at DreamPie. It is a GUI wrapper for the Python terminal that provides various handy shortcuts. One of these is the ability to drag-select across the interpreter display and copy only the code (not the output). You can then paste this code in and run it again. I find this handy for the type of workflow you describe.

Your best bet is to check out this project: http://ipython.org
This is an example with a history search with Ctrl+R:
EDIT
If you are running Debian or a derivative:
sudo apt-get install ipython
