How to execute an untrusted function efficiently in a cross-platform way?

I am writing an open-source, cross-platform application in C++ that targets Windows, Mac, and Linux on x86 CPUs. The application produces a stream of data (integers) that needs to be validated, and my application will perform actions depending on the validation result. There are multiple validators, which we shall call "modules", and they can be swapped out for one another.
Anybody can write and share modules with other users, so my application has to ensure that maliciously-written modules cannot harm the user in any way (perhaps except via high CPU usage, in which case my application should be able to kill the module after some amount of time - this can be done by using a surrogate process). Furthermore, the stream of data is being sent at a high rate (up to 100kB/s).
Fortunately, the code in these modules is usually just simple arithmetic operations on data in the stream (usually processing each incoming integer in constant time), and the modules do not need to make any system calls (not even heap allocation).
I've considered the following possibilities (all of them with some drawbacks):
Kernel-based sandboxing
On Linux, we can use secure computing mode (seccomp), which prevents a process from making any system calls except reading and writing with already-open file descriptors. Module creators would write their modules as a single function that takes input and output file descriptors (in a language like C or C++), compile it into a shared object, and distribute that shared object.
My application will probably prepare the input and output file descriptors, then fork() itself or exec() a surrogate process, and this child process uses dlopen() and dlsym() to get a pointer to the untrusted function. Strict secure computing mode is then enabled before the untrusted function is executed.
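A rough sketch of the child-process side under these assumptions (the exported symbol name run_module and the helper function below are hypothetical; error handling is minimal):

    // Rough sketch of the child-process side (Linux only). The module is
    // assumed to export a hypothetical function:
    //   extern "C" void run_module(int in_fd, int out_fd);
    #include <dlfcn.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int run_untrusted(const char* so_path, int in_fd, int out_fd) {
        // NOTE: dlopen() already runs the library's constructors, and this
        // happens *before* the sandbox is enabled (see drawback below).
        void* handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) return 1;

        using module_fn = void (*)(int, int);
        auto fn = reinterpret_cast<module_fn>(dlsym(handle, "run_module"));
        if (!fn) return 1;

        // Strict seccomp: from now on only read(2), write(2), _exit(2) and
        // sigreturn(2) are permitted; any other system call kills the process.
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) return 1;

        fn(in_fd, out_fd);   // the untrusted code, now confined

        // exit_group(2) is not on the strict-mode whitelist, so bypass the
        // usual exit path and issue the plain exit(2) syscall directly.
        syscall(SYS_exit, 0);
        return 0;            // not reached
    }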
Drawbacks: There's the problem that dlopen() will actually run the constructor function from the shared library. This would have to be properly sandboxed as well, and I can't think of a way to do so. Also, of course, this thing will only work on Linux. As far as I know, there is no way to ban WinNT system calls on Windows, so a similar solution on Windows won't be very secure.
Application-level sandboxing
[[ Any form of application-level sandboxing means that we cannot run untrusted machine code of any form. An untrusted function can overwrite its return value or data outside its call stack, thereby compromising the whole application (and effectively acquiring any permissions that the original application had). ]]
Make modules use a simple scripting language that does not support any system calls - just pure arithmetic operations and perhaps the ability to read an input stream. My application would contain an interpreter for this language.
Drawbacks: Unfortunately I have not found this scripting language. Many scripting languages have extensive functionality (e.g. Python), and a sandbox (e.g. PyPy's sandbox) simply filters OS system calls. I would be shipping a lot of useless interpreter code with my application, and a full interpreter is arguably more prone to security issues due to bugs than a language with simply no functionality beyond simple calculations and control-flow instructions (basically a function that does not make any system calls). Furthermore, marshalling the data between C++ (machine code) and the scripting language is usually a slow process.
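For concreteness, the core of such an interpreter could be as small as a stack machine over a handful of arithmetic opcodes. Everything below (the opcodes, names, and fuel limit) is a hypothetical sketch rather than an existing language:

    // Hypothetical arithmetic-only stack machine: the module has no way to
    // issue system calls, and a fuel counter bounds how long it may run.
    #include <cstdint>
    #include <vector>

    enum Op : uint8_t { PUSH, ADD, SUB, MUL, LOAD_INPUT, RET };
    struct Insn { Op op; int64_t imm; };

    // Returns the module's result for one input value, or 0 on a malformed
    // program or when the instruction budget is exhausted.
    int64_t run(const std::vector<Insn>& prog, int64_t input, uint64_t fuel = 10000) {
        std::vector<int64_t> stack;
        for (size_t pc = 0; pc < prog.size() && fuel-- > 0; ++pc) {
            const Insn& i = prog[pc];
            switch (i.op) {
                case PUSH:       stack.push_back(i.imm);  break;
                case LOAD_INPUT: stack.push_back(input);  break;
                case ADD: case SUB: case MUL: {
                    if (stack.size() < 2) return 0;   // malformed program
                    int64_t b = stack.back(); stack.pop_back();
                    int64_t a = stack.back(); stack.pop_back();
                    stack.push_back(i.op == ADD ? a + b : i.op == SUB ? a - b : a * b);
                    break;
                }
                case RET: return stack.empty() ? 0 : stack.back();
            }
        }
        return 0;
    }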
Distribute modules with a 'safe' compiled language that again does not support any system calls. My application would contain a JIT for this language.
Marshalling won't be necessary because my application would call directly into the JITted machine code of the untrusted module, so calls across this boundary should be fast. The untrusted module would not be able to corrupt the stack, attempt return-oriented programming, or perform any other malicious actions, thanks to the restrictions and checks of the 'safe' language. WebAssembly is the first and only language that comes to mind (if it can be called a language). (As far as I can tell, WebAssembly seems to provide the security guarantees for my use case, right?)
Drawbacks: The existing implementations of WebAssembly seem to be all browser-based, so I would have to steal an implementation from an open source browser. This does seem like a lot of work, considering that I would have to uncouple it from all the JavaScript and other browser bits. However, a standalone WebAssembly JIT based on LLVM seems to be under development.
Question:
What is the best way to execute an untrusted function efficiently that works on Windows, Mac, and Linux?
Right now, I think that the scripting language way would probably be the safest, and be the easiest for module writers. But for a more efficient solution, WebAssembly is probably better. Am I right, or are there better or easier solutions that I have not thought of?
(Remark: I think several pairs of tags used in this question have never been seen together before!)

Regarding WebAssembly:
Unfortunately, there is no production-quality stand-alone implementation yet. I expect some to show up in the future, but it hasn't happened yet.
For historical reasons, existing production implementations are all part of a JavaScript VM. Fortunately, none of these VMs is tied to a browser. If you don't mind including some unused JS baggage, you can embed them as they are (ripping out the JS would be very hard). One problem, though, is that these VMs don't yet provide embedding interfaces for Wasm specifically. You have to go through JS, which is stupid.
There is an initial design for a C and C++ API for WebAssembly, which would give direct access to an embedded Wasm VM. It is meant to be VM-neutral, i.e., could be implemented by any existing VM (the repo contains a prototype implementation on top of V8). This may evolve into a standard, but I cannot promise any timeline. Right now it's only for the brave.
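For a flavour of what embedding through that proposed API looks like, here is a rough sketch against the proposal's C header. The wasm_* names follow the prototype wasm-c-api ("wasm.h") and may well change as the API evolves; binary loading, error handling, and cleanup are omitted, so treat this as illustrative pseudocode:

    // Sketch only: follows the prototype wasm-c-api headers ("wasm.h");
    // names and signatures may differ between revisions of the proposal.
    #include "wasm.h"

    void run_validator(const wasm_byte_vec_t* binary, int32_t sample) {
        wasm_engine_t* engine = wasm_engine_new();
        wasm_store_t*  store  = wasm_store_new(engine);

        // Compile and instantiate the untrusted module. With no imports, the
        // module has no way to reach the host except through its exports.
        wasm_module_t*   module   = wasm_module_new(store, binary);
        wasm_instance_t* instance = wasm_instance_new(store, module, nullptr, nullptr);

        // Grab the first export, assumed here to be the validator function.
        wasm_extern_vec_t exports;
        wasm_instance_exports(instance, &exports);
        wasm_func_t* validate = wasm_extern_as_func(exports.data[0]);

        // Call it; a misbehaving module traps instead of corrupting the host.
        wasm_val_t args[1];
        args[0].kind   = WASM_I32;
        args[0].of.i32 = sample;
        wasm_val_t results[1];
        wasm_func_call(validate, args, results);

        // ... wasm_*_delete cleanup elided ...
    }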

Related

Run MATLAB program with Web Server inputs

I have a MATLAB application that I want to execute on a Linux box with inputs from a web server. Requests to the server would all come from the local network.
Searching for different solutions, I've seen recommendations to host a Django server that serves an HTML form where users could input all the various data needed by the application. When a user fills out the form and submits it, the data would be sent through an API to the MATLAB application, which would serve up the report in a network shared drive.
Would this work well? Is there a different/easier solution available?
Need more details to know if this would "work well". But in terms of the general outline you presented, seems feasible.
When you say "the data would be sent through an API to the MATLAB application", what exactly do you mean here? What API are we talking about? And what is "the Matlab application"? Do you mean just installing regular Matlab on this server machine, and then having the Django or other web application server run the matlab command to run a Matlab program, as a distinct process (probably a single matlab -batch execution) that services each request? Two issues here. One, Matlab is a large program with a slow startup time; Matlab Production Server and similar solutions handle this by maintaining a pool of already-running, "warmed up" Matlab worker processes to service incoming requests. Two, licensing: the "regular" Matlab licenses are aimed towards interactive use by humans, so running Matlab like that on the server side to handle requests for a web app used by multiple humans may not be covered. Talk to your organization's lawyer or IT licensing expert before doing this.
#Will is right here: The Matlab Production Server is the product or "solution" that MathWorks provides for this scenario. And it's relatively easy to use. But ain't cheap. (On the other hand, when you're talking about Matlab, what is?)
If you have someone who can do a bit of system programming for you, there's a more affordable alternative: use the Matlab Compiler to build your Matlab code into a "CTF" DLL, and write a thin custom server wrapper on top of that, which can accept service calls for the particular Matlab things you need done, and dispatches them to your code. (Running that in a pool of multiple processes, if you want to be able to service multiple concurrent clients.) "Compiled" Matlab libraries that run against the Matlab Runtime do not require any additional licenses for their runtime execution.
Big questions here are: Do you want this to Go Fast? How many clients are you going to have, and how often are they going to be sending requests? What kind of data will be contained in the inputs and outputs to this Matlab code?
Have a look at the -batch option to the matlab command. Have a look at the various deployment options supported by the Matlab Compiler. And talk to your organization's lawyer.
If you decide to go the matlab -batch route, you probably do not want to pass the inputs to your Matlab code as command-line arguments. Command lines and environment variables can only pass simple strings, and parsing those sucks, especially once you get into nontrivial numerics. Bundle up all your inputs as JSON files, MAT files, or something similar, and then pass just a reference to those files (or SQL blobs, or similar) on the command line.
Also, depending on what your Matlab code is like, GNU Octave (https://gnu.org/software/octave/index) may be an option for you. Octave is many years behind Matlab in terms of functionality and stability, and doesn't have equivalents of all the Matlab Toolboxes, so it isn't a drop-in replacement in general terms. But for simple stuff, it works. And it is unencumbered by licensing, and has faster startup times in command-line mode.
An easier solution (though possibly not the cheapest if your existing MATLAB license doesn't already cover it) is to use MATLAB Production Server which basically exists for this category of problem. It has a RESTful API that would straightforwardly handle your user input use case.

Call Metatrader MQL4/MQL5 function from imported DLL

I would like to call MQL4 or MQL5 function from my own imported DLL in Metatrader.
Is it possible?
Forest,
As far as I have experienced during the past 2 years working with MetaTrader, there is no real way to call MQL functions from an external DLL. But there are some custom-built APIs that closely resemble what you want to achieve:
MT4 API
MetaTrader™ Java / .Net API
These APIs do somewhat allow you to use MQL functionality out-of-the-box
Principle
After several hundred man-years in the FX domain, there is another approach to orchestrating smooth and elegant co-operation between the MT4 Terminal and other processes than trying to push water uphill, or paying USD 500+ for a kit that will stop working right at the next shock, once Build 524 -> Build 562 -> Build 586 -> Build 600 -> Build 609 -> Build 624 -> ... moves again.
A non-existent toy
Yes, the MT4 architecture does not expose its own interface that would allow itself to be "disturbed" by a non-deterministic obligation to handle external low-level calls via DLLs et al.
How to fix it
Nevertheless, it is possible to reverse the architecture and make the MT4 Terminal act as a lightweight thin client, operating a smart messaging library through which the MT4 functions are exposed for remote calls (RPC).
Example
This way a Python Node may collect MT4 data for numerical processing,
same way a PHP Node may in parallel handle remote-syslog-s,
same way a C++ Node may integrate another task,
same way another Python Node may act as a CLI terminal interface with a Custom-specific scripting-syntax language to command MetaTrader-side activities via command-line / stdio
simply -- whatever your application infrastructure needs can be done this way
(One may even improve the poor real-time features of the native MT4 threads to gain much better soft-real-time predictability and a low-latency, massively parallel architecture .. and still be on the safer side, protected from being torpedoed by any next "new"-MQL4.)
Nota bene: just to illustrate the invisible threat, the headbang collision in "new"-MQL4.56789 is, besides others, that string, while being syntax-proposed as a string, is not in fact a string but a struct. All your previous DLL-related work then simply has to be re-worked and wrapped around to emulate a string-as-struct, or a new DLL interface has to be designed for cases that return a value in a buffered ArrayOfBYTEs, which the MQL4.56789 side can receive and process but cannot free on its own, so memory leaks.
If it's acceptable for your DLL to be a .NET DLL, then you could try this MT4 .NET integration library called NQuotes. With this library it's possible to access any MQL4 function from your DLL.

.NET Framework FxCop rule CA1401 PInvokesShouldNotBeVisible rule - why does this rule exist?

This rule indicates that P/Invokes should not be made public. My question is why? A caller can trivially create their own declaration within their own assembly to make the exact same call. A caller could just write a C library to call the API. What benefit, security or otherwise, is gained by making these declarations internal?
Well, in the .NET security model, it's possible for your assembly to have permission to do P/Invokes but for your caller not to. (AllowPartiallyTrustedCallersAttribute, which permits code running as partially trusted to call into an assembly that is fully trusted, exists to enable this.)
Which is, essentially, what you want when the library you are writing exists to provide safe access, or limited access, to some system facility that you don't want sandboxed applications of one type or another to have arbitrary access to.
On another note, it's politeness as well as security. P/Invokes are de facto unsafe, inasmuch as calling them badly can result in all kinds of interesting ways to crash that you generally don't run into in the comfortable .NET world. Wrapping some error-checking and general safety code around them, along with such things as mangling input data into the Win32 API (or whatever's) format, translating its error codes into the appropriate .NET exceptions, etc., etc., is just plain courtesy to your library's future users, IMO.

what are the pros and cons of using a DLL?

Windows still uses DLLs and Mac programs seem not to use DLLs at all. Are there benefits or disadvantages of using either technique?
If a program installation includes all the DLLs it requires so that it will work 100% well, will it be the same as statically linking all the libraries?
MacOS X, like other flavours of Unix, uses shared libraries, which are just another form of DLL.
And yes, both are advantageous, as the DLL or shared-library code can be shared between multiple processes: the OS loads the DLL or shared library once and maps it into the virtual address space of the processes that use it.
On Windows, you have to use dynamically-loaded libraries because the GDI and USER libraries are available only as DLLs. You can't link either of those in or talk to them using a protocol that doesn't involve dynamic loading.
On other OSes, you want to use dynamic loading anyway for complex apps; otherwise your binary would bloat for no good reason, and it increases the probability that your app would be incompatible with the system in the long run (however, in the short run static linking can somewhat shield you from tiny breaking changes in libraries). And you can't link in proprietary libraries on OSes which rely on them.
Windows still uses DLLs and Mac programs seem not to use DLLs at all. Are there benefits or disadvantages of using either technique?
Any kind of modularization is good, since it makes updating the software easier, i.e. you do not have to update the whole program binary when a bug is fixed. If the bug appears in some DLL, only that DLL needs to be updated.
The only downside, IMO, is that you introduce extra complexity into the development of the program, e.g. if a DLL is a C or C++ DLL you have to deal with different calling conventions, etc.
If a program installation includes all the DLLs it requires, will it be the same as statically linking all the libraries?
More or less, yes. It depends on whether you are calling functions in a DLL that you link against at build time (via an import library). The DLL could just as well be a "free-standing" dynamic library that you can only access via LoadLibrary() and GetProcAddress(), etc.
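For reference, explicit run-time loading of such a "free-standing" DLL looks roughly like this on Windows (the library name plugin.dll and the function compute are made up for illustration):

    // Explicit run-time linking on Windows; "plugin.dll" and "compute"
    // are placeholder names for illustration only.
    #include <windows.h>
    #include <cstdio>

    int main() {
        HMODULE lib = LoadLibraryA("plugin.dll");
        if (!lib) { std::fprintf(stderr, "could not load plugin.dll\n"); return 1; }

        using compute_fn = int (*)(int);
        auto compute = reinterpret_cast<compute_fn>(GetProcAddress(lib, "compute"));
        if (!compute) { FreeLibrary(lib); return 1; }

        std::printf("compute(21) = %d\n", compute(21));

        FreeLibrary(lib);   // unload the DLL when no longer needed
        return 0;
    }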
One big advantage of shared libraries (DLLs on Windows or .so on Unix) is that you can rebuild the library and its consumers separately while with static libraries you have to rebuild the library and then relink all the consumers which is very slow on Unix systems and not very fast on Windows.
MacOS software uses "DLLs" as well; they are just named differently (shared libraries).
DLLs make sense if you have code you want to reuse in different components of your software. Mostly this makes sense in big software projects.
Static linking makes sense for small single-component applications, when there is no need for code reuse. It simplifies distribution since your component has no external dependencies.
Besides memory/disk space usage, another important advantage of using shared libraries is that updates to the library will be automatically picked up by all programs on the system which use the library.
When there was a security vulnerability in the InfoZIP ZIP libraries, an update to the DLL/.so automatically made all software that used them safe. Software that was linked statically had to be recompiled.
Windows still uses DLLs and Mac programs seem not to use DLLs at all. Are there benefits or disadvantages of using either technique?
Both use shared libraries, they just use a different name.
If a program installation includes all the DLLs it requires so that it will work 100% well, will it be the same as statically linking all the libraries?
Somewhat. When you statically link libraries into a program, you get a single, very big file; with DLLs, you will have many files.
The statically linked file won't need the "resolve shared libraries" step (which happens while the program loads). A long time ago, loading a program meant that the whole program was first loaded into RAM and then the "resolve shared libraries" step happened. Today, only the parts of the program which are actually executed are loaded, on demand. So with a static program, you don't need to resolve the DLLs; with DLLs, you don't need to load them all at once. Performance-wise, they should be on par.
Which leaves the "DLL Hell". Many programs on Windows bring all the DLLs they need and write them into the Windows directory. The net effect is that the last installed program works and everything else might be broken. But there is a simple workaround: install the DLLs into the same directory as the EXE. Windows searches the application's own directory before the various system paths. This way, you'll waste a bit of disk space, but your program will work and, more importantly, you won't break anything else.
One might argue that you shouldn't install DLLs which already exist (with the same version) in the Windows directory but then, you're again vulnerable to some bad app which overwrites the version you need with something that breaks your neck. The drawback is that you must distribute security fixes for your app yourself; you can't rely on Windows Update or similar things to secure your code. This is a tight spot; crackers are making lots of money from security issues and people will not like you when someone steals their banking data because you didn't issue security fixes soon enough.
If you plan to support your application very tightly for many years (say, 20), installing all DLLs in the program directory is for you. If not, then write code which checks that suitable versions of all DLLs are installed and tells the user about it, so they know why your app suddenly starts to crash.
Yes, see this text:
Dynamic linking has the following advantages:
Saves memory and reduces swapping. Many processes can use a single DLL simultaneously, sharing a single copy of the DLL in memory. In contrast, Windows must load a copy of the library code into memory for each application that is built with a static link library.
Saves disk space. Many applications can share a single copy of the DLL on disk. In contrast, each application built with a static link library has the library code linked into its executable image as a separate copy.
Upgrades to the DLL are easier. When the functions in a DLL change, the applications that use them do not need to be recompiled or relinked as long as the function arguments and return values do not change. In contrast, statically linked object code requires that the application be relinked when the functions change.
Provides after-market support. For example, a display driver DLL can be modified to support a display that was not available when the application was shipped.
Supports multilanguage programs. Programs written in different programming languages can call the same DLL function as long as the programs follow the function's calling convention. The programs and the DLL function must be compatible in the following ways: the order in which the function expects its arguments to be pushed onto the stack, whether the function or the application is responsible for cleaning up the stack, and whether any arguments are passed in registers.
Provides a mechanism to extend the MFC library classes. You can derive classes from the existing MFC classes and place them in an MFC extension DLL for use by MFC applications.
Eases the creation of international versions. By placing resources in a DLL, it is much easier to create international versions of an application. You can place the strings for each language version of your application in a separate resource DLL and have the different language versions load the appropriate resources.
A potential disadvantage to using DLLs is that the application is not self-contained; it depends on the existence of a separate DLL module.
From my point of view, a shared component has some advantages that are sometimes realized as disadvantages.
A shared component defines interfaces in your process, so you are forced to decide which components/interfaces are visible outside and which are hidden. This automatically defines which interfaces have to be stable and which do not have to be stable and can be refactored without affecting any code outside the component.
Memory management in the case of C++ and Windows must be thought through carefully. Normally, memory allocated inside a DLL should also be freed inside that same DLL, rather than handed to code outside it to free. If you do not follow this rule, your component may fail when different runtimes or compiler versions are used.
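One common way to honour that rule, sketched here with hypothetical names, is to have the DLL export both the allocating and the freeing function, so ownership never crosses the runtime boundary:

    // Hypothetical DLL interface: the DLL both allocates and frees the object,
    // so the consumer never calls delete/free on memory from a foreign runtime.
    struct Widget;  // opaque to the consumer

    extern "C" __declspec(dllexport) Widget* widget_create();
    extern "C" __declspec(dllexport) void    widget_destroy(Widget* w);

    // Consumer side (EXE or another DLL):
    //   Widget* w = widget_create();
    //   ... use w ...
    //   widget_destroy(w);   // never "delete w;" here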
So I think that using shared components helps the software to be better organized.

Is this a reasonable "Application entry point"?

I have recently come across a situation where code is dynamically loading some libraries, wiring them up, then calling what is termed the "application entry point" (one of the libraries must implement IApplication.Run()).
Is this a valid "Application entry point"?
I would always have considered the application entry point to be before the loading of the libraries and found the IApplication.Run() being called after a considerable amount of work slightly misleading.
The terms "application" and "system" are used so widely and so diversely that you need to agree up front with your conversation partner on what they mean. E.g. sometimes an application is something with a UI, and a system is 'UI-less'. In general it's just a case of you say potato, I say potato.
As for the example you use: that's just what a runtime (e.g. .NET or Java) does: loading a set of libraries and calling the application entry point, i.e. the "main" method.
So in your case, the code loading the libraries is doing just the same, and probably calling a method on an interface; you could then consider the loading code to be the runtime for that application. It's just a matter of perspective.
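As a concrete (and entirely made-up) illustration of that perspective: in the sketch below the host is effectively the runtime, and the loaded library supplies the entry point.

    // Hypothetical host: it loads a library, obtains its IApplication, and
    // hands over control -- from the library's point of view, Run() is "main".
    struct IApplication {
        virtual int Run() = 0;
        virtual ~IApplication() = default;
    };

    // The library is assumed to export a factory with this signature:
    //   extern "C" IApplication* CreateApplication();
    using create_fn = IApplication* (*)();

    int host_main(create_fn CreateApplication) {
        IApplication* app = CreateApplication();  // the wiring-up step
        return app->Run();                        // the "application entry point"
    }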
The term "application" can mean whatever you want it to mean. "Application" merely means a collection of resources (libraries, code, images, etc) that work together to help you solve a problem.
So to answer your question, yes, it's a valid use of the term 'application'.
On its own, "application" actually means nothing. It is often used by people to talk about computer programs that provide some value to the user. A more correct term is "application software", and this has the following definition:
Application software is a subclass of computer software that employs the capabilities of a computer directly and thoroughly to a task that the user wishes to perform. This should be contrasted with system software which is involved in integrating a computer's various capabilities, but typically does not directly apply them in the performance of tasks that benefit the user. In this context the term application refers to both the application software and its implementation.
And since application really means application software, and software is any piece of code that performs any kind of task on a computer, I'd say also a library can be an application.
Most terms are of an artificial nature anyway. Is a plugin not an application? Is the Flash plugin of your browser not an application? People say no, it's just a plugin. Why? Because it can't run on its own, it needs to be loaded into a real process. But there is no definition saying only things that "can run on their own" are applications. The same holds true for a library. The core application could just be an empty container, and all logic and functionality, even the interaction with the user, could be performed by plugins or libraries; in that case the plugins would be more of an application than the empty container that just provides some context for the application to run.
Compare this to Java. A Java application can't run on its own; it must run within a Java Virtual Machine (JVM). Does that mean the JVM is the application and the Java code is just... well, what? Isn't the Java code the real application, and the JVM just an empty runtime environment that provides nothing to the end user without the loaded Java code?
I think in this context "application entry point" means "the point at which the application (your code) enters the library".
I think probably what you're referring to is the main() function in C/C++ code or WinMain in a Windows app. That is, it's the point where execution is normally started in an app. Your question is pretty broad and vague--for example, which OS are you running this on--but this may be what you're looking for. This might also address the question.
Bear in mind when you're asking questions, details are your friend. People can give you a much better, more informed answer when you provide them with details.
EDIT:
In a broader context, consider what has to happen from the standpoint of the OS. When the user specifies that they want to run an app, the OS has to load the app from the hard drive, and when the app is loaded into memory, it has to pass control to some point in the memory block occupied by the newly loaded app to continue execution. That would be the "Application Entry Point". When an app is constructed with dynamically linked code, the OS has to load all that dynamically linked code in order to get the correct app image into memory. Loading up those shared bits of code does not change the fact that the OS must have a point to which to pass control when the app is loaded into memory.