Domain-specific language to perform data extraction and transformation in an ETL pipeline - sql

Does anyone know of domain-specific languages (DSLs) that facilitate data extraction and transformation as part of an Extract-Transform-Load (ETL) pipeline?
I'd like to extract data from a 3rd-party SQL database and transform it into an already-defined JSON format for storage in my application. There are many different possible database schemata to extract data from, so I was wondering whether there is already a way to configure this with the help of a (commonly used) extraction language (ideally one that is also agnostic to other data sources, such as web services).
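To make this concrete, here is roughly the kind of declarative, schema-configurable mapping I am hoping already exists as a DSL; this Python sketch is made up purely for illustration:

# Hypothetical sketch: a declarative mapping from one source schema to my
# target JSON format. The mapping is plain data, so it could be swapped
# out per source schema.
import json
import sqlite3  # stand-in for the 3rd-party SQL database

MAPPING = {
    "source_query": "SELECT id, full_name, created FROM customers",
    "fields": {
        "customerId": lambda row: row["id"],
        "name":       lambda row: row["full_name"].strip().title(),
        "since":      lambda row: row["created"][:10],  # keep the date part only
    },
}

def extract_transform(conn, mapping):
    # Run the source query and shape each row into the target JSON format.
    conn.row_factory = sqlite3.Row
    for row in conn.execute(mapping["source_query"]):
        yield {field: fn(row) for field, fn in mapping["fields"].items()}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id, full_name, created)")
conn.execute("INSERT INTO customers VALUES (1, ' ada lovelace ', '1843-01-01T00:00:00')")
print(json.dumps(list(extract_transform(conn, MAPPING)), indent=2))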
I had a look around, but other than a few research papers I couldn't find much in terms of agreed standards for ETL (minus the 'L', which I've got covered), and I don't want to reinvent the wheel.
I'd appreciate any pointers in the right direction.

Creating a good, all-encompassing DSL for ETL is, I believe, not just hard; it's a bit of a fool's errand. To handle the many real-world ETL complexities, you end up re-creating a general-purpose language.
And ETL "without programming skill" as this research paper attempts will struggle with the messiness of cleaning and conforming disparate source systems.
Using a general-purpose language by itself is of course possible, but very time-consuming due to the low abstraction level and all the infrastructure code you'd have to implement.
Graphical ETL tools and some ETL DSLs address this by adding scripts or calling out to external programs. While useful and essential, this does have the disadvantage of employing multiple different programming models, with associated mental and technical friction when moving between them.
A different, and I believe better, approach is to instead add ETL capabilities to a general-purpose language. Done well, you combine the benefits of ETL-specific functionality and a high abstraction level with the power of a general-purpose language and its large ecosystem, all delivered via a single programming model.
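To illustrate, here is a hedged Python sketch of what an internal ETL DSL embedded in a general-purpose language can look like (all names invented for the example):

# Minimal sketch of an "internal DSL" for ETL: each step is an ordinary
# function, and the fluent interface composes them into a pipeline.
class Pipeline:
    def __init__(self, source):
        self.source = source
        self.steps = []

    def transform(self, fn):
        self.steps.append(fn)
        return self  # returning self gives the fluent, DSL-like feel

    def run(self, sink):
        for record in self.source:
            for step in self.steps:
                record = step(record)
            sink(record)

rows = [{"name": " ada "}, {"name": "grace"}]
(Pipeline(rows)
    .transform(lambda r: {**r, "name": r["name"].strip()})
    .transform(lambda r: {**r, "name": r["name"].title()})
    .run(print))

Because the steps are plain functions, anything the host language can do (logging, branching, calling web services) slots into the same programming model.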
As one example of this latter approach, my company provides actionETL, a cross-platform .NET ETL library that combines an ETL mindset with the advantages of modern application development. For example, it provides familiar control flow and dataflow ETL capabilities, and uses internal DSLs in several places to simplify configuration. Do try it out if it sounds like a good fit.
actionETL now also has a free Community edition.
Cheers,
Kristian

Related

Business rules in DMN or database table?

I'm learning Camunda, the workflow engine. I understand that for some long-running processes, process modeling brings many tactical and strategic benefits, such as expressiveness, fault tolerance, and observability, with additional overhead of course.
The book I'm reading also advocates the use of DMN (decision tables) to bundle business rules inside the process model. The motive is to centralize maintenance and decouple configuration from the code. I'm taking this advice with a grain of salt, as decision tables smell somewhat clunky to work with: there is no strong typing and no powerful IDE features. I'm used to implementations where business parameters are stored in a database table and consumed by the application. The implementation also provides an admin GUI to maintain these parameters at runtime.
For what reason should I favor DMN over a more solid database-based solution?
You are moving the business logic to BPMN to get it out of the code, make it transparent in a graphical model, accessible to all stakeholders, support business-IT alignment, empower the business to own their business processes/logic, support multi-version enactment at runtime, and more...
The same reasoning applies to business rules, which are too complex to be modeled as graphs in BPMN diagrams. The DMN standard is also aimed at business people, and the expression language used, the "Friendly Enough Expression Language" (FEEL), is intentionally kept simpler than an Excel formula. So you see where this is going.
Database tables
are not easily accessible to business users
do not allow flexible changes to the table structure(s)/schema at runtime
usually do not support multi-version enactment at runtime
do not support a graphical, logical decomposition of rules (into DRDs) unless you work with multiple tables - but DB schemas are not flexible
cannot be easily deployed to many systems
cannot be easily tested in unit tests
likely do not automatically generate audit data that is accessible for audits and analytics
These are just a few points. So, for business rules, definitely DMN over DB tables.

Any advantages when using an ORM Tool (Framework)?

I searched over .... I see many advantages, but it seems that all the advantages come from a comparison with in-line SQL. I know in-line SQL is bad, but why compare with a bad one to show the other is better?
If stored procedures are used (possibly exclusively), it seems none of the advantages still exist. Stored procedures definitely provide advantages in terms of security and performance (if an ORM can outrun a stored procedure, then the stored procedure is badly written), and a well-written stored procedure is an automatic repository (pattern). Stored procedures can definitely provide better transaction and transaction-isolation control.
I would really appreciate an answer on how an ORM is better than a well-architected application using stored procedures.
--- Thanks for all the answers that I have received so far ... It seems that the advantages still come from comparing an ORM's "dynamically generated SQL" with "statically written in-line SQL" in the code. Yes, that has advantages. But it is not the question.
The question is better stated as the following:
If you use stored procedures to implement your business logic (SPs can be written to be very advanced and also very efficient), then in the application code (.NET, Java) you have a very thin wrapper layer over the stored procedures, organized by business need. My question is how an ORM outperforms this architecture (a well-designed one, of course).
ORM tools make it possible to develop an abstraction layer between the database and the model in the OO environment. The main advantage of this layer is that developers who are not familiar with SQL can work with the model.
I have been seeking a good answer myself. Here is what I feel makes the difference:
1) ORM increases developer productivity - mapping a domain class to the database is easier.
2) Stored procs can potentially contain business logic - this is difficult to test, mainly because of the lack of tools/mocking frameworks.
3) ORM frameworks are well tested and give you features like caching out of the box - no need to reinvent the wheel - and most applications I've seen that do not use an ORM end up writing an in-house data layer that an ORM offers out of the box.
That being said, ORMs do add some overhead as well, and they require developers to learn a new platform - writing efficient mappings comes with practice, so there is a learning curve.
In the modern-day setup, network bandwidth isn't as precious as rapid development and good-quality (well-tested) code. I guess this makes ORMs well suited for database-driven apps.
An ORM is a tool that can be used to build what you call a "well-architected system". The idea is that when you are developing in a non-relational language, there will be an impedance mismatch between the relational operation set provided by SQL/stored procedures and the language that you are using to build the rest of your application.
For developers using an object-oriented language (whether it is C++, C#, or Java) there are many considerations when mapping a complex relational schema into a rich domain model. It is certainly possible to perform all of this mapping in your own code, but the more complex your interactions in this "no-man's-land" between the OO and relational paradigms become, the more useful an ORM engine and its associated tooling can be.
Some considerations as you plan out your mapping layer (a sketch follows the list):
Do you need to manage single-table or multi-table inheritance?
Do you want to leverage lazy loading?
Do you want to manually keep classes and tables synchronized or are you planning on using a tool to generate per-table classes (such as with a DataSet)?
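As a simplified illustration of those considerations, here is what a declarative mapping with lazy loading can look like in SQLAlchemy, one widely used ORM (the schema is invented for the example):

# A simplified SQLAlchemy mapping: classes describe tables once, and the
# ORM handles the OO/relational translation, including lazy loading.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # lazy="select" defers loading orders until the attribute is accessed
    orders = relationship("Order", lazy="select")

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Customer(name="Acme", orders=[Order(), Order()]))
    session.commit()
    acme = session.query(Customer).filter_by(name="Acme").one()
    print(len(acme.orders))  # the orders query is issued lazily, here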
Another consideration, especially when working in a team, is that when relational to domain layer mapping is performed by hand, there can be a great deal of variation in the way developers write the mapping. This can lead to inconsistencies, overlapping, and gaps that are difficult to detect. The selection of an ORM (especially a well known / solidly established ORM) can have an enormous (hopefully positive) impact on the solution and the pre-existing community surrounding that ORM will shape how you conceive of the mapping layer (you will find that there are significant cultural differences between Spring.NET and Entity Framework users, for instance).
Does an ORM make a good architecture? No. Are there systems whose architectures would be better off with an ORM? Definitely. Are there projects that have been crippled by the unnecessary addition of an ORM? I'm guessing that there are many.
I suggest approaching this question from a different angle and applying it to the specific application you are working on. Do you have any pain points using SQL and/or stored procedures that an ORM might solve? Do you see any risks, or have any concerns over problems that the introduction of an ORM might cause? Only by weighing the answers to these questions will you be able to determine whether an ORM is a good fit for any given solution.

What are the most practical Object-oriented software modeling methods in real world projects?

I want to develop a big project, but I really don't know the best way to model it. Do I even need to model my project?
What are the most practical OOP software modeling methods in real world projects? What are the best and most useful ones?
Many times it's necessary to capture the complex structure of the classes in your OO system, so class diagrams from UML are used for modeling. You may also want to describe the interactions of classes; for that, sequence diagrams are useful. There are also other UML diagrams, and each has its purpose.
If you are looking for an approach to modeling, try the Unified Process, a development method created by the authors of UML. It uses UML quite heavily and also describes how UML can be used.
Agile methodologies are currently what is recommended. If you add a slice of UML, all the better :-)
Modeling (design) is the most important part of every project.
In fact, as time goes by, we sacrifice performance to gain a higher level of design.
Why is the .NET Framework popular (compared to older tools)? In most cases its libraries are wrappers over traditional Win32 APIs, a waste of performance, but in exchange it provides a better design, which makes it easy to learn and use.
So if your project has a good design, it will be easy to understand, develop, debug, maintain and extend.
Another example is OOP itself, which has classes, interfaces... and a bunch of constructor/destructor calls. OOP concepts are borrowed from psychiatry and the way human beings see the world.
Here are two different concepts:
1) Design methodology
2) Project management methodology
There are many of each, and I won't label them good or bad; each fits a scenario.
Regarding design methodology, I prefer DDD (Domain-Driven Design), as it maps to the industry's domain terminology and concepts. So if you have a decision problem about what to do when A->B->C happens, you can simply ask a domain professional, and he will tell you what they do in the real world. DDD is good for industries old enough to have accumulated wisdom. I'm not going to write more about design, since we don't know anything about the project.
Project management methodologies (like agile) are the way you build the building from the map (design). The goal of project management is to use resources (time, money, people...) optimally. This is done through a work breakdown structure and making work as parallel as possible. The best-known project management methodology is the traditional one, in which we do everything in sequence, as civil engineers do (foundation, structure, walls...). This was good for many centuries, until recent decades (the software industry), since in traditional project management you know where you are, where you want to go, and how to get there. This way you can buy the furniture for a home that is still just a plot of land!
The software industry has seen very rapid changes in tools and methods because it was new, and no best practices had yet been founded on thousands of failed projects. Many times a project changes after it has started because of changes in development tools and frameworks. Another source of change is the scope of the project (where to go). Software is an intangible product, so you fall into the trap of time estimation easily. For software development, the best practice is iterative methodologies.
Iterative methodologies suggest a working but incomplete solution, which you make more complete in the next iteration, rather than a non-working, partially complete one. This has a time overhead; in exchange, you are sure the solution works, and if there is any problem, you find it in the early stages. That's why we have nightly builds!
The best is Visual Studio 2010 Ultimate; others are too cumbersome. Otherwise, use lightweight tools like yUML; see http://askuml.com for samples.

Feasibility of data mining a program call stack using AOP

I am reading an article in IEEE Computer magazine about using data mining on applications.
The part that is intriguing to me is the idea that we can have software that monitors the execution flow of a program and puts the data into a database, where we can do some data mining.
This data could then be used by a data mining tool to look for information, such as whether there are certain call patterns that might suggest changing the API. Ideally, it might also be able to detect bugs: if functions have to be called in some order, it can help detect violations of that order.
There are probably other uses, but this would be a start.
So, would such a tool be useful?
I am thinking that AOP may be the only way to really do this on a dynamic application, as you could then track the flow of every call and save the stack, and perhaps gather some other information, such as parameters.
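As a rough sketch of the instrumentation side, a decorator can stand in for an aspect (the logging schema here is invented):

# Poor man's AOP in Python: a decorator "aspect" records each call's
# function name, caller, and arguments, building a trace to mine later.
import functools
import inspect

CALL_LOG = []  # in a real system this would be written to a database

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        caller = inspect.stack()[1].function  # who invoked us
        CALL_LOG.append({"callee": fn.__name__, "caller": caller,
                         "args": args, "kwargs": kwargs})
        return fn(*args, **kwargs)
    return wrapper

@traced
def open_file(name): return f"handle:{name}"

@traced
def read(handle): return f"data from {handle}"

def client():
    read(open_file("a.txt"))  # calls in the expected order

client()
for entry in CALL_LOG:
    print(entry)  # mine these rows for patterns like open-before-read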
Unfortunately, software engineers don't tend to be experts in data mining, and those who do data mining may not be experts in writing complex applications.
Where this would get interesting for me is then starting to analyze distributed applications, or those using cloud computing, but that may be very complicated.
Second question: is this the type of question that should be a community wiki?
Yes, I think it would be useful.
No, it shouldn't be a community wiki.
Check out the book "Programming Collective Intelligence" by Segaran for some good programmatic use of data mining strategies.

Are Databases and Functional Programming at odds?

I've been a web developer for some time now, and have recently started learning some functional programming. Like others, I've had significant trouble applying many of these concepts to my professional work. For me, the primary reason is that FP's goal of remaining stateless seems quite at odds with the fact that most of the web development work I've done has been heavily tied to databases, which are very data-centric.
One thing that made me a much more productive developer on the OOP side of things was the discovery of object-relational mappers like MyGeneration d00dads for .NET, Class::DBI for Perl, ActiveRecord for Ruby, etc. This allowed me to stay away from writing insert and select statements all day and to focus on working with the data easily as objects. Of course, I could still write SQL queries when their power was needed, but otherwise it was abstracted nicely behind the scenes.
Now, turning to functional programming, it seems that many of the FP web frameworks, like Links, require writing a lot of boilerplate SQL code, as in this example. Weblocks seems a little better, but it seems to use a kind of OOP model for working with data, and still requires code to be written manually for each table in your database, as in this example. I suppose you could use some code generation to write these mapping functions, but that seems decidedly un-Lisp-like.
(Note: I have not looked at Weblocks or Links extremely closely; I may just be misunderstanding how they are used.)
So the question is: for the database-access portions (which I believe are pretty large) of a web application, or of other development requiring an interface with a SQL database, we seem to be forced down one of the following paths:
Don't Use Functional Programming
Access data in an annoying, un-abstracted way that involves manually writing a lot of SQL or SQL-like code, a la Links
Force our functional language into a pseudo-OOP paradigm, thus removing some of the elegance and stability of true functional programming.
Clearly, none of these options seem ideal. Has anyone found a way to circumvent these issues? Is there really even an issue here?
Note: I personally am most familiar with Lisp on the FP front, so if you want to give examples and know multiple FP languages, Lisp would probably be the preferred choice.
PS: For issues specific to other aspects of web development, see this question.
Coming at this from the perspective of a database person, I find that front-end developers try too hard to find ways to make databases fit their model, rather than considering the most effective ways to use a database, which are not object-oriented or functional but relational, using set theory. I have seen this generally result in poorly performing code. Furthermore, it creates code that is difficult to performance-tune.
When considering database access, there are three main considerations: data integrity (which is why all business rules should be enforced at the database level, not through the user interface), performance, and security. SQL is written to manage the first two considerations more effectively than any front-end language, because it is specifically designed to do that. The task of a database is far different from the task of a user interface. Is it any wonder that the type of code that is most effective at managing the task is conceptually different?
And databases hold information critical to the survival of a company. Is it any wonder that businesses aren't willing to experiment with new methods when their survival is at stake? Heck, many businesses are unwilling even to upgrade to new versions of their existing database. So there is an inherent conservatism in database design, and it is deliberately that way.
I wouldn't try to write T-SQL or use database design concepts to create your user interface, so why would you try to use your interface language and design concepts to access my database? Because you think SQL isn't fancy (or new) enough? Or because you don't feel comfortable with it? Just because something doesn't fit the model you feel most comfortable with doesn't mean it is bad or wrong. It means that it is different, and probably different for a legitimate reason. You use a different tool for a different task.
First of all, I would not say that CLOS (the Common Lisp Object System) is "pseudo-OO". It is first-class OO.
Second, I believe that you should use the paradigm that fits your needs.
You cannot store data statelessly, while a function is a flow of data and does not really need state.
If you have several needs intermixed, mix your paradigms. Do not restrict yourself to only using the lower right corner of your toolbox.
You should look at the paper "Out of the Tar Pit" by Ben Moseley and Peter Marks, available here: "Out of the Tar Pit" (Feb. 6, 2006)
It is a modern classic which details a programming paradigm/system called Functional-Relational Programming. While not directly relating to databases, it discusses how to isolate interactions with the outside world (databases, for example) from the functional core of a system.
The paper also discusses how to implement a system where the internal state of the application is defined and modified using a relational algebra, which obviously is related to relational databases.
This paper will not give you an exact answer to how to integrate databases and functional programming, but it will help you design a system that minimizes the problem.
Functional languages do not have the goal of remaining stateless; they have the goal of making the management of state explicit. For instance, in Haskell, you can consider the State monad as the heart of "normal" state, and the IO monad as a representation of state that must exist outside of the program. Both of these monads allow you to (a) explicitly represent stateful actions and (b) build stateful actions by composing them using referentially transparent tools.
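A rough Python analogue of the idea (explicit state threading through pure functions, not real monads):

# State made explicit: each step takes the current state and returns
# (result, new_state) instead of mutating anything in place.
def deposit(amount, balance):
    return None, balance + amount

def withdraw(amount, balance):
    if amount > balance:
        return "declined", balance
    return "ok", balance - amount

balance = 100
_, balance = deposit(50, balance)
status, balance = withdraw(120, balance)
print(status, balance)  # ok 30; every state change is visible in the code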
You reference a number of ORMs which, per their name, abstract databases as sets of objects. Truly, this is not what the information in a relational database represents! Per its name, it represents relational data. SQL is an algebra (language) for handling relationships in a relational data set, and is actually quite "functional" itself. I bring this up to point out that (a) ORMs are not the only way to map database information, (b) SQL is actually a pretty nice language for some database designs, and (c) functional languages often have relational algebra mappings that expose the power of SQL in an idiomatic (and, in the case of Haskell, typechecked) fashion.
I would say most Lisps are a poor man's functional language. Lisp is fully capable of being used according to modern functional practices, but since it doesn't require them, the community is less likely to use them. This leads to a mixture of methods, which can be highly useful but certainly obscures how pure functional interfaces can still use databases meaningfully.
I don't think the stateless nature of FP languages is a problem in connecting to databases. Lisp is a non-pure functional programming language, so it shouldn't have any problem dealing with state. And pure functional programming languages like Haskell have ways of dealing with input and output that can be applied to using databases.
From your question, it seems your main problem lies in finding a good way to abstract away the record-based data you get back from your database into something Lisp-y (Lisp-ish?) without having to write a lot of SQL code. This seems more like a problem with the tooling/libraries than with the language paradigm. If you want to do pure FP, maybe Lisp isn't the right language for you. Common Lisp seems more about integrating good ideas from OO, FP and other paradigms than about pure FP. Maybe you should be using Erlang or Haskell if you want to go the pure-FP route.
I do think the 'pseudo-OO' ideas in Lisp have their merit too. You might want to try them out. If they don't fit the way you want to work with your data, you could try creating a layer on top of Weblocks that lets you work with your data the way you want. This might be easier than writing everything yourself.
Disclaimer: I'm not a Lisp expert. I'm mostly interested in programming languages and have been playing with Lisp/CLOS, Scheme, Erlang, Python and a bit of Ruby. In daily programming life I'm still forced to use C#.
If your database doesn't destroy information, then you can work with it in a functional manner consistent with "pure functional" programming values by working in functions of the entire database as a value.
If at time T the database states that "Bob likes Suzie", and you had a function likes which accepted a database and a liker, then so long as you can recover the database at time T you have a pure functional program that involves a database. e.g.
# Start: Time T
likes(db, "Bob")
=> "Suzie"
# Change who bob likes
...
likes(db "Bob")
=> "Alice"
# Recover the database from T
db = getDb(T)
likes(db, "Bob")
=> "Suzie"
To do this you can't ever throw away information you might use (which in all practicality means you cannot throw away information), so your storage needs will increase monotonically. But you can start to work with your database as a linear series of discrete values, where subsequent values are related to the prior ones through transactions.
This is the major idea behind Datomic, for example.
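A toy sketch of that idea (nothing like Datomic's actual API, just the database-as-a-series-of-values notion, in Python):

# The database is a list of immutable snapshots; a "transaction" produces
# a new snapshot instead of destroying the old one, so any time T remains
# queryable.
history = [{}]  # snapshot 0: an empty database

def transact(change):
    new = {**history[-1], **change}  # copy-and-extend, never mutate
    history.append(new)
    return len(history) - 1          # the new "time"

def likes(db, liker):
    return db.get(liker)

t = transact({"Bob": "Suzie"})
transact({"Bob": "Alice"})
print(likes(history[-1], "Bob"))  # Alice (the present)
print(likes(history[t], "Bob"))   # Suzie (recovered from time t)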
Not at all. There is a genre of databases known as 'functional databases', of which Mnesia is perhaps the most accessible example. The basic principle is that functional programming is declarative, so it can be optimised. You can implement a join using list comprehensions on persistent collections, and the query optimiser can automagically work out how to implement the disk access.
Mnesia is written in Erlang and there is at least one web framework (Erlyweb) available for that platform. Erlang is inherently parallel with a shared-nothing threading model, so in certain ways it lends itself to scalable architectures.
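To make the list-comprehension-as-a-join idea concrete, here is a toy in Python (not Mnesia syntax; the relations are plain in-memory collections):

# A relational join written declaratively as a comprehension over two
# "tables"; a functional database can optimise this form behind the scenes.
people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
orders = [{"person_id": 1, "item": "book"}, {"person_id": 1, "item": "pen"}]

joined = [(p["name"], o["item"])
          for p in people
          for o in orders
          if o["person_id"] == p["id"]]
print(joined)  # [('Ann', 'book'), ('Ann', 'pen')]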
A database is the perfect way to keep track of state in a stateless API. If you subscribe to REST, then your goal is to write stateless code that interacts with a datastore (or some other backend) that keeps track of state information in a transparent way so that your client doesn't have to.
The idea of an object-relational mapper, where you import a database record as an object and then modify it, is just as applicable and useful in functional programming as it is in object-oriented programming. The one caveat is that functional programming does not modify the object in place, whereas the database API can allow you to modify the record in place. The control flow of your client would look something like this:
Import the record as an object (the database API can lock the record at this point),
Read the object and branch based on its contents as you like,
Package a new object with your desired modifications,
Pass the new object to the appropriate API call which updates the record on the database.
The database will update the record with your changes. Pure functional programming might disallow reassigning variables within the scope of your program, but your database API can still allow in-place updates.
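In code, that flow might look like this (a hedged Python sketch; db_get/db_put stand in for whatever database API you use):

# Functional-style record update: read, build a *new* value, and hand it
# back to the database API. The client never reassigns a record in place.
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # frozen: the record object itself is immutable
class Account:
    id: int
    balance: int

STORE = {1: Account(1, 100)}  # toy stand-in for the database

def db_get(acct_id):
    return STORE[acct_id]

def db_put(acct):
    STORE[acct.id] = acct  # the in-place update lives here, and only here

acct = db_get(1)                                        # 1. import the record
if acct.balance >= 40:                                  # 2. branch on contents
    updated = replace(acct, balance=acct.balance - 40)  # 3. package new object
    db_put(updated)                                     # 4. API updates record
print(db_get(1))  # Account(id=1, balance=60)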
I'm most comfortable with Haskell. The most prominent Haskell web framework (comparable to Rails and Django) is called Yesod. It seems to have a pretty cool, type-safe, multi-backend ORM. Have a look at the Persistent chapter in their book.
Databases and Functional Programming can be fused.
For example:
Clojure is a functional programming language based on relational database theory.
Clojure -> DBMS, Super FoxPro
STM -> transaction, MVCC
persistent collections -> db, table, col
hash-map -> indexed data
watch -> trigger, log
spec -> constraint
core API -> SQL, built-in functions
function -> stored procedure
metadata -> system table
Note: in the latest spec-alpha2, spec is even more like an RMDB.
see: spec-alpha2 wiki: Schema-and-select
I advocate building a relational data model on top of hash-maps to combine the advantages of NoSQL and an RMDB. This is effectively a reverse implementation of PostgreSQL.
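A minimal Python sketch of what "a relational model on top of hash-maps" means (names invented; Clojure's persistent maps would make each snapshot cheap):

# A "table" is a hash-map from primary key to row (itself a hash-map);
# an index and a constraint are just more hash-map bookkeeping.
table = {}        # id -> row
name_index = {}   # name -> id, i.e. indexed data on top of a hash-map

def insert(row):
    assert row["name"] not in name_index, "constraint: unique name"
    table[row["id"]] = row
    name_index[row["name"]] = row["id"]

def select_by_name(name):
    return table.get(name_index.get(name))

insert({"id": 1, "name": "ann"})
print(select_by_name("ann"))  # {'id': 1, 'name': 'ann'}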
Duck Typing: If it looks like a duck and quacks like a duck, it must be a duck.
If Clojure's data model is like an RMDB's, Clojure's facilities are like an RMDB's, and Clojure's data manipulation is like an RMDB's, then Clojure must be an RMDB.
Clojure is a functional programming language based on relational database theory.
Everything is an RMDB.
Implement a relational data model and programming on top of hash-maps (NoSQL).