feasibility on data mining program call stack using AOP - aop

I am reading an article in IEEE Computer magazine about using data mining on applications.
The part that is intriguing to me is the idea that we can have software that can monitor the execution flow of an program, and put the data into a database, where we can do some data mining.
This data could then be used by a data mining tool to look for information, such as if there is certain patterns that may be called that may lead to changing the API, and, ideally, it may also be able to determine bugs, in that if you have to call functions in some order, it can help detect that.
There are probably other uses, but this would be a start.
So, would such a tool be useful?
I am thinking that AOP may be the only way to really do this on a dynamic application, as you could then track the flow of every call and save the stack, and perhaps gather some other information, such as parameters.
Unfortunately software engineers don't tend to be experts on data mining, and those that do data mining may not be an expert on writing complex applications.
For me, where this would get interesting is to then start to analyze distributed applications, or those using cloud computing, but that may be very complicated.
Second question, is this type of question that should be a community wiki?

Yes, I think it would be useful.
No, it shouldn't be a community wiki.
Check out the book "Programming Collective Intelligence" by Segaran for some good programmatic use of data mining strategies.

Related

Domain specific language to perform data extraction and transformation in ETL pipeline

Does anyone of domain specific languages (DSL) that facilitate data extraction and transformation as part of an Extract-Transform-Load (ETL) pipeline?
I'd like to extract data from a 3rd party SQL database and transform the data to an already defined JSON format to store it into my application. There are many different possible database schemata to extract data from, so I was wondering whether there is already a way to configure this through the help of a (commonly used) extraction language (ideally that language is also agnostic to other data sources such as web services, etc).
I had a look around, but other than a few research papers I couldn't find much in terms of agreed standards for ETL (minus the 'L' which I've got covered) and I don't want to reinvent the wheel.
I'd appreciate any pointers in the right direction.
Creating a good, all-encompassing DSL for ETL is I believe not just hard, it's a bit of a fool's errand. To handle the many real-world ETL complexities, you end up re-creating a general-purpose language.
And ETL "without programming skill" as this research paper attempts will struggle with the messiness of cleaning and conforming disparate source systems.
Using a general-purpose language by itself is of course possible but very time consuming due to the low abstraction level, and all the infrastructure code you'd have to implement.
Graphical ETL tools and some ETL DSLs address this by adding scripts or calling out to external programs. While useful and essential, this does have the disadvantage of employing multiple different programming models, with associated mental and technical friction when moving between them.
A different and I believe a better approach is to instead add ETL capabilities to a general-purpose language. Done well, you combine the benefits of ETL specific functionality and a high abstraction level with the power of a general-purpose language and its large eco-system, all delivered via a single programming model.
As one example of this latter approach, my company provides actionETL, a cross-platform .NET ETL library that combines an ETL mindset with the advantages of modern application development. For example, it provides familiar control flow and dataflow ETL capabilities, and uses internal DSLs in several places to simplify configuration. Do try it out if it sounds like a good fit.
actionETL now also has a free Community edition.
Cheers,
Kristian

Is BPMN right for my purpose?

Intro
The company I work in (it is an intern-like position though, until I am done with university) recently implemented an automated warehouse solution, where goods are transported by means of autonomous shuttles. The basic functions of the shuttles are controlled by onboard electronics (microcontroller), routing through the warehouse racking is done by software solution which in turn communicates with our ERP solution. Effectively the ERP solution handles the whole warehousing.
Task
There are well documented processes for every of the four layers (operator who loads the the shuttles, shuttle itself, routing, ERP) individually. But since we kind of puzzled all four of them together to one solution (which was kind of new to all of the participating companies), there are only vague, on-the-flyish process descriptions involving all four layers available.
Now I have been tasked to come up with a solution to illustrate the processes at work.
Example
ERP signals goods in demand at assembly station A1
Warehouse operator looks at screen and starts loading boxes to be picked up by
shuttle
Warehouse operator puts in details into ERP, such as count/weight, box number,
...
Warehouse operator clears boxes for pick-up (by confirming inputs in ERP)
ERP generates transport order
ERP sends transport order to routing software
Routing software sends telegram to shuttle control
Shuttle control turns wheels and asks for directions to pick up boxes
...
Question
As mentioned, I have to graphically represent the kind of processes similar to the one shown in the (easy and not complete) example above. I need to incorporate the operator's actions as well as basic communication between shuttle, routing software and ERP.
Since I attended a course on BPMN at university it came to mind immediately. But now, after immersing myself into information about BPMN for several hours I still can't conclusively tell if BPMN helps my efforts or just further complicates the whole thing.
Is BPMN the right tool for my purpose?
Disclaimer
I am not a Business Analyst. I have looked at alternatives to BPMN (simple flowcharts, activity diagrams, ...) but they don't seem to fit.
Just putting together the existing processes for every respective layer yields no result, owing to the different and sometimes too detailed process descriptions.
Edit
The ERP is SAP ERP 6.0 EHP7 with integrated WMS component.
TL;DR: use the notation you would be implement process in, i.e. choose BPMS, not BPMN.
The notation itself means nothing unless it has proper tool for modelling and further process implementation aka BPMS. You can find dozens of comparisons (e.g. BPMN vs EPC or BPMN vs BPEL), however they won't help you unless you have clear understanding where and how you will be implement you modeled process.
Generally speaking, EPC is used for more high-level view of the process, whereas BPMN is utilized for more fine-grained view, where all technical details of communications between peers can be described. However, it depends.
I also recommend you to review this table
and answer the question to yourself whether your process changes (in)frequently or not, and whether you need separate BPM tool.
How I see it from your description: you have four participants (four layers), which are four lanes in BPMN terms, and they are collaborating/communicating with each other during the process. Generally speaking, this fits to BPMN application area, but personally I feel that you should stick your ERP tooling. I don't know which ERP you use, but every serious ERP solution includes tool for process customization. For example, SAP has Workflow,
which can widely enhance and extend existing processes within SAP. Probably, your ERP have it too.
Again, it's not clear which warehouse management system you use and if it is integrated to your ERP. It seems to be not, and it seems to be some old legacy system, because of which you start re-modelling the stuff. In this particular case it might me wiser to acquire special advanced warehouse management package (take a look at SAP's EWM features as an example) which can cover most of your requirements.

Integrating my RESTful web app with clients' SAP installations

My company runs a couple of B2B apps (written in Rails) dealing with parts and inventory and we've been trying to figure out the best way to integrate with some of our bigger users. We already offer the REST-style API that comes with Rails, but that, of course requires an IT Department on their end to decide to integrate it, so we'd like to lower that barrier if possible.
From what we've found, most of them are on SAP systems. Now, pretty much all I know about SAP is it's 1) expensive, 2) huge, 3) and does everything and anything you could ever need for your gigantic business to run. Naturally, this is all a bit imposing, and the resources on the site are a cross between impenetrable buzz-word laden sales material, and impenetrable jargon laden advanced technical material with little for the new, but technically competent user to be able to sink his teeth into.
So what I'm wondering is: as a 3rd party, that's not running a SAP installation, is there a way for us to offer access to our site's data through a web service or other API? Is it just a matter of providing or implementing a certain WSDL (and what would that be)? Is this feasible for someone without in-depth experience with SAP? Or is this a complete non-starter?
I'd say it's not possible without someone who knows the SAP system. You probably won't need to hire someone with in-depth SAP knowledge, but at least for the initial implementation, you'll need both the knowledge and a working system you can develop against. Technically speaking, it's not really that hard, but considering the fact that SAP systems are designed to handle multiple organizations, countries, legal systems, localizations and several thousands of users simultaneously, things are bound to be a bit more complex than almost any other software around - and most of the time not even bloated, it's just easy to get lost in that kind of flexibility.
My recommendation would be to find a customer (or a prospective customer) who has someone in their IT department with the necessary technical and processual knowledge and who is interested in conducting a development project. This way, you'd get access to a real system (testing of course) and someone who can explain to you the basics of the system. But, as I said, be prepared for complexity.
vwegert makes some excellent points.
As to this part of your question:
So what I'm wondering is: as a 3rd
party, that's not running a SAP
installation, is there a way for us to
offer access to our site's data
through a web service or other API? Is
it just a matter of providing or
implementing a certain WSDL (and what
would that be)?
Technically it is possible to expose any of your system's services as web-services to a client's SAP system. In order to do this you do not need any prior knowledge of SAP. (SAP should be able to import a WSDL, although there may be some limitations in the earlier pre-ECC5 systems).
For example a service that provides meter reads, airport departure schedules, industry trends etc is not dependend of what is in the user's system or how they set it up. However as soon as there is a need to initiate updates to the client system's data is when you need access to more specialised SAP knowledge.
Also note that many SAP functions can also be exposed as web services, but generally you do need someone with SAP (ABAP) knowledge to do this.
The ABAP language is actually fairly simple, but there is a huge learning curve to understand the data model and the myriad of configurable options in SAP.

When its enough for a programming language that you need to switch to another?

I have wonder that many big applications (e.g. social websites such as facebook) are build with many languages into its platform.
They usually start with AJAX browser support, then scale down to PHP scripting, then move towards a powrful OOP technologie such as Java or .NET, and finally a primitive language to increase performance in crucial operations such as C.
My question is how should I determinate the edge of the layers between languages. When PHP, when Java, when C and so on. And the other question is if should those languages integrate in a vertcal fashion for simplicity and maintanance, or could it be cases when you decide to program on module of your app in Java and the other in native C.
What are the context variables that push me to move to a better performance language? (e.g. concurrency issues due increase of users)
Don't tell me that PHP overlaps .NET and Java Technologies. In a starter point it does, but when the network is overload you start seeing the diferences. I mean how can I achieve Multithreading in PHP as in Java with the same performance. The thing it's hard to answer my wuestion is becasue there is not so much reading about this. You maybe find some good books covering PHP, but few telling how when and why integrate different languages.
Each language was created for different purposes, Python is strong with string operations, Perl very powerful in batch scripting, PHP a very reliable application web server, C the mother of most popular languages.
Best,
Demian.
On one end of the scale, you move to a higher performance language whenever your profiling and measurements tell you that you have a bottleneck that can't be fixed with better algorithms, data structures, or other optimisation.
At the other end, you move to a higher level language (ie. more abstraction, better libraries) whenever your management allow you to do so. ;)
I believe most teams simply use what they are best familiar with.
There are also questions of licensing that can influence the decision.
That is, if you're talking about technologies that compare to each other and solve the problem on the same level (for example ASP.NET/JSF/JSP/PHP...). But you can't compare .NET with C++ for example, they are meant to solve different problems on different abstraction levels.
My criterion for any programming language is "does it help me to get the job done or does it just get in the way?" If the latter, then it's time to move on.
From an economical point of view the answer is easy: on a regular basis just look what will be cheaper. Either continue with the current technology and maybe stretch the envelope a bit more. Or switch to something new. When you compare the two alternatives the cost of the investment already done is not important anymore since you've already spent that money/effort. You only have to look ahead: cost of licenses, education, etc.
Of course this is easier said then done, but just sitting down with a few people, thinking about it, and maybe try to come up with some numbers already helps a lot. I have seen too many projects that continued with technology that really wasn't suited for the job anymore.
Also hard numbers don't tell the whole story. There will be resistance because of unfamiliar technology, experts who are losing their status, etc.
Identify the bottleneck
Solve bottleneck
Go to 1
I'm sure you can imagine that step 2 is the one where decisions like "What programming language do we use" and "where do we put the coffee machine" come into play. That's the basic rule.

BPMS or just plain programming?

What do you prefer (from your developer's point of view) when it comes to implement a business process?
A Business Process Management System (BPMS) or just your favorite IDE with the needed tools and frameworks (a reporting tool for example)?
What is from your point of view the greatest Benefit of a BPMS compared to an IDE with your personal tools and frameworks?
OK. Maybe I should be more specific... I got to know one specific BPMS which should make it easy to implement a business process by configuring rules. But for me as a developer it is hard to work with the system. I would like to work with text files which I can refactor and I would like to be able to choose the right technology or framework for the job I have to do. Instead the system forces me to configure.
There are rules where I can use java, but even then I have to stick to the systems editor without intellisense etc.
So this leads me to the answer of my own question - I would like to use the tools I am used to instead of having to learn how to work with a BPMS (at least the one I know) because it limits me more than it helps. The BPMS I know is a framework from which it is hard to escape! At this time, I would prefer a framework like Grail over any BPMS I know.
So maybe the more specific question is: do you feel the same or are there BPMSes which support you in beeing a developer and think like a developer or do most of them force you to do your job a different way?
In my experience the development environments provided by BPMS systems are third rate, unproductive, and practically force you to write hard to maintain, poorly designed code (due to their limitations). Almost all the "features" (UI, integrations, etc) provided by the BPMS system I'm familiar with (the one sold by that company named for its database) were not worth the money we paid.
If you're forced to use BPMS, as a developer, my advice would be to build as much of your application in a conventional development environment, such as Java or .Net, build as little as possible in the BPMS environment itself, and integrate the two. The only things that should go in the BPMS is the minimum to make the business process work.
Not sure what exactly you ask, but the choice BPM vs. plain programming will depend on the requirements. A "business process" is a relatively vague term in software engineering.
Here are a few criterion to evaluate your needs:
complexity of the rules - Are the decisions/rules embodied in your process simple, complicated, configurable, hard-coded?
volatility of the process - How frequently does your process change? Who should be able to make the change?
integration need - Is your process realized using multiple heterogenous services, or is all implemented in the same language?
synchronous/asynchrounous - Is your process "long-running" with the need to handle asynchronous actions?
human tasks - Does your process involves human interaction, with task being assigned/routed to people according to their roles/responsibilities?
monitoring of the process - What is the level of control you want on the existing process instances being executed? Do you need to audit the actions, etc. ?
error handling - Depending on the previous points, how do you plan to deal with errors, or retry of faulty process execution?
Depending on the answer to these questions, you may realize that your process is closer to a simple state chart with a few actions and decisions that can be executed in a sequence, or you may realize that you need something more elaborated, and that you don't want to re-implement all that yourself.
Between plain programming and a full-fledge BPM solution (e.g. Oracle BPM suite which contains BPEL, rule engine, etc.), there are intermediate solutions such as jBPM or Windows Workflow Foundation and probably a lot of others. These intermediate solution are frequently good trade-off.
I have worked with Biztalk in the past and more recently with JBPM. My opinion is biased against BPMs for the following reasons:
Steep learning curve : To make a process work, I have to understand how the system and the editor works. It is hard enough for a developer to understand the system, let alone a business user. The drag and drop and visual representation is a great demo tool. It certainly impresses managers (who ultimately pay for it), but a developer's productivity just drops.
Non developers changing the workflow : I haven't seen one BPM solution do it flawlessly. Though it doesn't look like code, right click on the box and you do have to put some code, otherwise it is not going to work. So you definitely need a developer to do it. The best part is that it is neither developer friendly nor business user friendly, just demo user friendly.
Testablity and refactoring : It is virtually impossible to test drive a BPMS. You do have 'unit test frameworks' advertised, but most of them are hacks and hard to use. Recently I tried the JBPM one; I ended up writing a lot of glue code and fake workflow handlers to make it work. The deal breaker for me though is refactoring. If the business radically changes it's mind about how a business process should look, then good luck re-arranging the boxes, because just re-arranging them won't work, all the variables bound to the boxes also need to be re-arranged. I would prefer the power of the IDE and tests to refactor my business process.
If your application has workflow, then you could try a workflow library (with or without persistent state). It will still manage your workflows without all the bloat that comes with a BPM. If a business user needs to understand the code, then let the business prepare good process flowcharts and translate them into good domain driven code. Use cucumber style acceptance tests to make bring the developers and business together. A BPM is just something that tries to do too many things and ends up doing all those things badly.
BPMS-- a lot of common business case, use case are already implemented. So you just have to know how to use it. For common workflow, you don't even need to write a single line of code, though mostly you would have to write some scripts to cover things that are not yet implemented.
Plain programming-- just use the IDE to hack out the code. The positive side: more control. The negative? A lot of times are spent on rewriting boilerplate code. And you have to maintain them.
So in a nutshell, I would prefer a Business Process Management System. One that I would recommend is ProcessMaker. It features an intuitive process designer that allows you to design workflow with drag and drop. And you can always write trigger to extend the process functionalities. It's open source as well.