I've inherited an application that accesses several websites and parses HTML to store data locally.
The HTML parsing is a breeze, but in an effort to make it easier to add additional site "parsers" in the future, I'm working on the overall design.
The area that I'm struggling with is how to encapsulate the parsing and data "conversion" or "mapping" for each "parser" so that I can create a standard convention for adding new ones.
The overall structure is as follows:
-- A scheduled task kicks off every 15 minutes, and runs what is essentially a method in a central controller.
-- The system loops over a list of "parsers" that are to be executed
-- Each parser goes to its specific site, downloads the page, and parses it for data
-- The "columns" of data from each site do not "line up" with the local database table, so a translation of sorts needs to happen
-- After translation, the data is stored locally.
My initial thought was that each parser should return ONLY a recordset representing THAT site's data, after which some other translator would turn it into a recordset to be stored locally. But because the translation would be different for each site, I quickly started leaning toward each parser returning the same thing: a recordset, properly formatted to the local database schema's standards, ready for storage.
The prior version of the app actually had each parser writing to a CSV file, which was then used to import the data. The designer was doing something close to the way I'm leaning, only I believe all of this can be done in memory without writing to a CSV file.
So should each parser have the job of retrieving the data and creating a generic, locally-relevant recordset for storage? If the local database structure changes, each parser would have to be touched, and touched pretty deep. If I had a mapping "convention" at the start of each parser that said which column numbers from the remote site map to which column numbers of the local table, then when one or the other changed it might not be so difficult to update all the parsers, and creating new parsers would be easier because a "format" for their structure would already be in place.
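To illustrate, the mapping convention I have in mind might be as simple as a lookup table. This is just a sketch in C# (I'm keeping the question language-agnostic, so C# is only for illustration, and the column numbers and names are invented):

using System.Collections.Generic;

// Hypothetical mapping declared at the top of each parser:
// remote column index -> local column name.
// If either schema changes, only this table needs editing.
var columnMap = new Dictionary<int, string>
{
    { 0, "vendor_sku" },  // remote column 0 feeds local column "vendor_sku"
    { 2, "price" },       // remote column 2 feeds local column "price"
    { 5, "quantity" }     // remote column 5 feeds local column "quantity"
};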
OO-wise, I envision having a ParserBase object, which each specific site would extend (ParserVendorX, ParserVendorY, etc). The base (or perhaps abstract) parser would define all the methods that must be in each specific parser, and off the top of my head I'd say the following private methods would be required:
retrieveData
parseData
translateData
and the only public method might be "getData", which would simply return a recordset for a data object to use when storing the data in the database.
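Roughly, I picture something like the following (again a C# sketch just for illustration; RawPage, SiteRecords, and LocalRecordset are placeholder types standing in for whatever the real app would use):

// A minimal sketch of the proposed base class, using the Template Method pattern.
public class RawPage { }
public class SiteRecords { }
public class LocalRecordset { }

public abstract class ParserBase
{
    // The single public entry point: fetch, parse, translate, return.
    public LocalRecordset GetData()
    {
        var page = RetrieveData();
        var records = ParseData(page);
        return TranslateData(records);
    }

    protected abstract RawPage RetrieveData();              // download this site's page
    protected abstract SiteRecords ParseData(RawPage page); // extract the site-specific rows
    protected abstract LocalRecordset TranslateData(SiteRecords records); // map to local schema
}

public class ParserVendorX : ParserBase
{
    protected override RawPage RetrieveData() { /* fetch VendorX's page */ return new RawPage(); }
    protected override SiteRecords ParseData(RawPage page) { /* parse VendorX's HTML */ return new SiteRecords(); }
    protected override LocalRecordset TranslateData(SiteRecords records) { /* apply VendorX's column map */ return new LocalRecordset(); }
}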
So I'm looking for either a recommendation on a pattern that might apply here, and/or real-world solutions others may have implemented for projects similar to what I'm working on.
For the record, I'm purposely not mentioning which language I'm working in unless it's absolutely essential... this is a high-level question so someone's solution in any other language would still be considered relevant.
Thanks!
As a start, have a look at the Strategy Pattern:
The strategy pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable.
In your case you would have a family of parsers and many implementations that do different things.
For example:
public interface BaseParser
{
    bool Parse(SomeRequest request);
}
With many implementations:
public class Html5Parser : BaseParser
{
    public bool Parse(SomeRequest request)
    {
        // ... parse the request's content as HTML5 here
        return true;
    }
}

public class XHtmlParser : BaseParser
{
    public bool Parse(SomeRequest request)
    {
        // ... parse the request's content as XHTML here
        return true;
    }
}
Then you could execute them like this:
foreach (var parser in ParserList)
{
    parser.Parse(myRequest);
}
As for what to do within each one, apply the Single Responsibility Principle from S.O.L.I.D. and you should be able to figure it out: neatly encapsulate each piece of work into a separate component, and compose those components to make up the whole.
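For example, a single-responsibility split of one parser might look like the following sketch; every interface and type name here is invented for illustration:

// Each responsibility gets its own small component; all names are hypothetical.
public class LocalRecord { }

public interface IPageRetriever    { string Retrieve(string url); }              // downloads the page
public interface IHtmlExtractor    { string[][] Extract(string html); }          // pulls raw rows out of the HTML
public interface IRecordTranslator { LocalRecord[] Translate(string[][] rows); } // maps rows to the local schema

// The parser itself just composes the three pieces.
public class SiteParser
{
    private readonly IPageRetriever retriever;
    private readonly IHtmlExtractor extractor;
    private readonly IRecordTranslator translator;

    public SiteParser(IPageRetriever retriever, IHtmlExtractor extractor, IRecordTranslator translator)
    {
        this.retriever = retriever;
        this.extractor = extractor;
        this.translator = translator;
    }

    public LocalRecord[] GetData(string url)
    {
        var html = retriever.Retrieve(url);
        var rows = extractor.Extract(html);
        return translator.Translate(rows);
    }
}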
Related
In OOP, everything is an object with its own attributes and methods. However, you often want to run a process that spans multiple steps which must be executed in sequence. For example, you might need to download an XML file, parse it, and run business actions accordingly. That involves at least three steps: downloading, unmarshalling, and interpreting the decoded request.
In a really bad design you would do all of this in one method. In a slightly better design you would put the individual steps into methods or, much better, into new classes. Since you want to test and reuse the individual classes, they shouldn't know about each other. In my case, a central control class runs them all in sequence, taking the output of one step and passing it to the next. I have noticed that such command-and-control classes tend to grow quickly and are neither flexible nor extensible.
My question therefore is: what OOP patterns can be used to implement a business process and when to apply which one?
My research so far:
The mediator pattern seems to be what I'm using right now, but some definitions say it's only managing "peer" classes. I'm not sure it applies to a chain of isolated steps.
You could probably call it a strategy pattern when more than one of the aforementioned mediators is used. I guess this would solve the problem of the mediator not being very flexible.
Using events (probably related to the Chain of Responsibility pattern) I could make the single steps listen for special events and send different events. This way the pipeline is super-flexible, but also hard to follow and to control.
Chain of Responsibility is the best fit for this case. It is pretty much the definition of CoR.
If you are using Spring, you can consider this interesting Spring-based implementation of the pattern:
https://www.javacodegeeks.com/2012/11/chain-of-responsibility-using-spring-autowired-list.html
Obviously, without Spring it is very similar.
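For instance, a bare-bones hand-rolled chain might look like this. The original context is Java/Spring, but the shape is the same in any OO language; here is a minimal C# sketch, with step names taken from the question (Context and the class names are otherwise invented):

// A minimal Chain of Responsibility sketch.
public class Context { /* carries the work-in-progress data between steps */ }

public abstract class Handler
{
    private Handler next;
    public Handler SetNext(Handler next) { this.next = next; return next; }

    public void Handle(Context context)
    {
        Process(context);
        next?.Handle(context); // pass the result on to the next step, if any
    }

    protected abstract void Process(Context context);
}

public class DownloadStep : Handler
{
    protected override void Process(Context context) { /* download the XML file */ }
}

public class UnmarshalStep : Handler
{
    protected override void Process(Context context) { /* decode the downloaded bytes */ }
}

public class InterpretStep : Handler
{
    protected override void Process(Context context) { /* run the business actions */ }
}

// Wiring and running the chain:
// var chain = new DownloadStep();
// chain.SetNext(new UnmarshalStep()).SetNext(new InterpretStep());
// chain.Handle(new Context());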
Is dependency injection not sufficient? It makes your code reusable and testable (as you requested), with no need for a complicated design pattern.
import java.io.File;

public final class SomeBusinessProcess {

    private final Server server;
    private final Marshaller marshaller;
    private final Codec codec;

    // The three collaborators are injected, so each can be tested and reused on its own.
    public SomeBusinessProcess(Server server, Marshaller marshaller, Codec codec) {
        this.server = server;
        this.marshaller = marshaller;
        this.codec = codec;
    }

    public Foo retrieve(String filename) {
        File f = server.download(filename);        // step 1: download
        byte[] content = marshaller.unmarshal(f);  // step 2: unmarshal
        return codec.decode(content);              // step 3: decode/interpret
    }
}
I believe that a Composite Command (a variation of the Command Pattern) would fit what you describe. This pattern is applied frequently in Eclipse.
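To make that concrete, here is a minimal C# sketch of a composite command; the step commands named in the usage comment are hypothetical stand-ins for the download/unmarshal/interpret work:

using System.Collections.Generic;

// Command interface plus a composite that runs its children in order.
public interface ICommand
{
    void Execute();
}

public class CompositeCommand : ICommand
{
    private readonly List<ICommand> children = new List<ICommand>();

    public void Add(ICommand command) => children.Add(command);

    public void Execute()
    {
        // Running the composite runs every child command in sequence.
        foreach (var command in children)
            command.Execute();
    }
}

// Usage (DownloadCommand etc. are hypothetical ICommand implementations):
// var process = new CompositeCommand();
// process.Add(new DownloadCommand());
// process.Add(new UnmarshalCommand());
// process.Add(new InterpretCommand());
// process.Execute();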
I'm currently developing a ZF2 application and want to implement it much like the ZF2 example Blog application. The DAL will probably be replaced by Doctrine in the future, but in the first version the model should work as in the Blog application: Service + Mapper + Data Objects.
In the Blog application, the method ZendDbSqlMapper#save(...) receives (like every other public method of a Mapper) a Data Object as its argument, extracts it, and then writes the data to the database. But my real-world case is a bit more complex, and I don't understand (but want to) whether the approach is still applicable to it, and how.
The application should primarily deal with saving and retrieving requests/orders for some technical services. (In the next step they are processed manually by an employee and implemented.) So the common case will be the saving (updating/creating) of an Order.
The physical model looks like this:
As you can see, the Order has some dependencies, which in turn have their own dependencies, and so on. On creating an Order I have to create a LogicalConnection first. For a LogicalConnection, an (abstract) PhysicalConnection and a concrete physical connection variant like PhysicalConnectionX are needed. (It implements Class Table Inheritance.) Furthermore, a LogicalConnection needs a new Customer (to simplify: every new order gets a new customer) and an Endpoint with a concrete endpoint variant like EndpointA (also a CTI implementation). The tables on the left side of the data model are just basic data that should not or cannot be changed. (Of course updating is even a bit more complicated, since for every related object I have to check whether it already exists, to avoid e.g. creating multiple customers for the same endpoint.)
My first idea was to implement it like this:
transform the input the model gets from the form (I don't use Zend\Collection, because my form is structured completely differently from my objects and the database);
hydrate the Order object for it (recursive hydration is already implemented);
create a Mapper for every object type;
and let every Mapper#save(...)
call save(...) on the mappers of the objects it depends on;
and then care only for its object.
Pseudocode:
class MyDataObjectA {
    private $id;
    private $myObjectB;

    public function getObjectB() {
        return $this->myObjectB;
    }
}

class MyDataObjectB {
    private $id;
}

class MapperA {
    public function save($dataObjectA) {
        // save $dataObjectA itself...
        // ...then delegate the dependent object to its own mapper
        $this->mapperB->save($dataObjectA->getObjectB());
    }
}

class MapperB {
    public function save($dataObjectB) {
        // save $dataObjectB
    }
}
It's a lot of code, and every case has to be handled manually. (And I'm not sure, but I might run into problems with context-dependent saving, since this approach doesn't consider the context.) In any case, I don't believe it's a recommended solution.
Well, it might smack of an ORM. But what about the model structure from the ZF2 Blog tutorial? Is it applicable to such a case? Or is it only useful for very simple structures and almost never for a real-world application? (Then I would ask: do we really need this tutorial, if it shows an approach that can almost never be used in a real application?) Or maybe I just misunderstand something and there is a better (more efficient, more elegant, etc.) approach?
I have created an application, using ARC, that parses data from an online XML file. I am able to get everything I need using one class and one call to the API; the API provides the XML data. Because the XML file is large, I have a lot of variables, IBOutlets, and IBActions associated with this class.
But there are two approaches to this:
1) create a class which parses the XML data and also implements that data for your application, i.e. one class that does everything (as I have already done)
or
2) create a class which parses the XML data and create other classes which handle the data obtained from the XML parser class, i.e. one class does the parsing and another class implements that data
Note that some APIs that provide XML data track the number of calls per minute or per day to their service. So you would not want several classes calling the API; it is better to make one request that receives all the data you need.
So is it better to use several smaller classes to handle the XML data, or is it fine to just use one large class to do everything?
When in doubt, smaller classes are better.
2) create a class which parses the XML data and create other classes which handle the data obtained from the XML parser class, i.e. one class does the parsing and another class implements that data
One key advantage of this is that the thing that the latter class models is separate from the parsing work that the former class does. This becomes important:
As Peter Willsey said, when your XML parser changes. For example, if you switch from stream-based to document-based parsing, or vice versa, or if you switch from one parsing library to another.
When your XML input changes. If you want to add support for a new format or a new version of a format, or kill off support for an obsolete format, you can simply add/remove parsing classes; the model class can remain unchanged (or receive only small and obvious improvements to support new functionality in new/improved formats).
When you add support for non-XML inputs. For example, JSON, plists, keyed archives, or custom proprietary formats. Again, you can simply add/remove parsing classes; the model class need not change much, if at all.
Even if none of these things ever happen, they're still better separated than mashed together. Parsing input and modeling the user's data are two different jobs; mashing them together makes them hard or impossible to reason about separately. Keep them separate, and you can change one without having to step around the other.
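To illustrate the separation (the question is Objective-C, but the shape is language-agnostic; this is a C# sketch with invented names):

using System.Collections.Generic;

// The model knows nothing about where its data came from.
public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Each input format gets its own parser that produces model objects.
public interface IProductParser
{
    List<Product> Parse(string input);
}

public class XmlProductParser : IProductParser
{
    public List<Product> Parse(string input)
    {
        // ... walk the XML and build Product instances
        return new List<Product>();
    }
}

public class JsonProductParser : IProductParser
{
    public List<Product> Parse(string input)
    {
        // ... walk the JSON and build Product instances
        return new List<Product>();
    }
}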
I guess it depends on your application. Something to consider: what if you have to change the XML parser you are using? You would have to rewrite your monolithic class, and you could break a lot of unrelated functionality. If you abstracted the XML parser, it would just be a matter of rewriting that particular class's implementation. Or what if the scope of your application changes and suddenly you have several views? Would you be able to reuse code elsewhere without violating the DRY (Don't Repeat Yourself) principle?
What you want to strive for is low coupling and high cohesion, meaning classes should not depend on each other unnecessarily, and each class should have well-defined responsibilities with highly related methods.
I'm currently refactoring a bunch of old Matlab code (from before Matlab's new OOP system), and the GUI code is a mess.
The GUI is basically a bunch of separate Matlab figures, each of which must present the same data in a different way.
The old code handles this problem by using a global struct to hold all of the data to be displayed, along with the metadata (mainly information about the size needed to graph the current data).
My question is whether or not this is the right way, stylistically, to do this in the current version of Matlab. I've considered bundling the data into a handle class and the meta-data into another, and then passing both to every figure in the GUI, but I don't know if the added encapsulation is worth the messiness of the added arguments.
Are there any general style rules for making such decisions in Matlab GUI programming?
There are a couple of ways to do this. You can use getappdata and setappdata to associate data with an individual figure:
%# Associate some data to the main figure handle...
setappdata(main_FH, 'myData', data);
%# Retrieve that data from the main figure handle
myData = getappdata(main_FH, 'myData');
%# check if some app data exists for main_FH
validAppData = isappdata(main_FH, 'myData');
You could also use set(FH, 'UserData', myData) (and get() too), although there's only one UserData property for every handle; you could set it to a struct and use isfield() rather than isappdata() to see if the field in myData exists.
Finally, there's guidata, but this is essentially a wrapper for ___appdata for GUIDE GUIs.
There's a summary of the ways to pass data like this on The MathWorks website.
Have you ever structured your source code based on your user interface parts? For example, if your UI consists of:
GridView for showing some properties
3D rendering panel
panel for choosing active tools
, then you name and group your variables and functions more or less in the following way:
class Application
{
    string PropertiesPanel_dataFile;
    DataSet PropertiesPanel_dataSet;
    string[] PropertiesPanel_dataIdColumn;
    void PropertiesPanel_OpenXml();
    void PropertiesPanel_UpdateGridView();

    string ThreeDView_modelFile;
    Panel ThreeDView_panel;
    PointF[] ThreeDView_objectLocations;
    void ThreeDView_RenderView();

    string ToolPanel_configurationFile;
    Button[] ToolPanel_buttons;
    void ToolPanel_CreateButtons();
}
What are your opinions on this? Can this architecture work in the long run?
PS. Even though this solution might remind you of the Front-Ahead-Design April Fools' joke (http://thedailywtf.com/Articles/FrontAhead-Design.aspx), my question is a serious one.
EDIT
Have been maintaining and extending this kind of code for half a year now. The application has grown to over 3000 lines in the main .cs file, and about 2000 lines spread out into smaller files (containing general-purpose helper functions and classes). There are many parts of the code that should be generalized and taken out of the main file, and I'm constantly working on that, but in the end it doesn't really matter. The structure and subdivision of the code is so simple that it's really easy to navigate through it. Since the UI contains fewer than 7 major components, there's no problem fitting the whole design in your head at once. It's always pleasant to return to this code (after some break) and know immediately where to start.
I guess one of the reasons this gigantic procedural-like structure works in my case is the event-driven nature of UI programming in C#. For the most part, all this code does is implement different kinds of events that are really specific to this project. Even though some event functions immediately grow into monsters a couple of pages long, the coupling between event handlers is not that tight, so it is easier to refactor and compress them afterwards. That's why I am intentionally leaving generalization and refactoring for later, when other projects start to require the same parts of the implementation that this project uses.
PS: to make it possible to navigate through 3000 lines of code I'm using the FindNextSelection and FindPrevSelection macros in Visual Studio. After left-clicking on some variable, I press F4 to jump to the next instance of it, and F2 to the previous instance. It's also possible to select part of a variable name and jump between partial-name matches. Without these shortcuts I would most definitely have lost my way a long time ago :)
That looks very procedural in concept and completely bypasses the value of OOD. The sensible approach would be to create an object for each of your elements; the values you have given would then be properties of those objects, i.e.:
class PropertiesPanel
{
    public string DataFile { get; set; }
    public DataSet DataSet { get; set; }
    public string[] DataIDColumn { get; set; }
    // etc...
}
I think you get the idea, so I'm not going to type the whole lot out. That's a first stage, and there may be further work you could do to structure your application appropriately.
The best advice I ever received for OOD was to look for the smallest object that each logical branch of your app can be distilled to; it probably only has native types for properties (with .NET there's no point in reinventing Framework objects either, so they can be in your base class). Then use inheritance, polymorphism, and encapsulation to expand on those base classes until you have an object that encapsulates the logical branch.
At the time I was writing an app that pushed data to an I2C device, so I started with a class that put a bit onto an I2C bus; that was inherited by a class that put a byte onto the bus, inherited by a class that put an array of bytes onto the bus, and finally a class that put an address and an array of bytes. This is rather extreme OOD, but it produced very clean code, with each class being very small and very easy to debug.
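That layering might look roughly like this C# sketch; the class and method names are invented, and the bus-driving details are left as placeholders:

// Each layer adds one small capability on top of the previous one.
public class BitWriter
{
    public virtual void WriteBit(bool bit) { /* drive one bit onto the I2C bus */ }
}

public class ByteWriter : BitWriter
{
    public void WriteByte(byte value)
    {
        // A byte is just eight bits, most significant first.
        for (int i = 7; i >= 0; i--)
            WriteBit(((value >> i) & 1) == 1);
    }
}

public class BufferWriter : ByteWriter
{
    public void WriteBuffer(byte[] buffer)
    {
        foreach (var b in buffer)
            WriteByte(b);
    }
}

public class AddressedWriter : BufferWriter
{
    public void Write(byte address, byte[] buffer)
    {
        WriteByte(address);   // address first...
        WriteBuffer(buffer);  // ...then the payload
    }
}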
It's possibly more work up front in thinking about the problem, but in the long run it saves soooooo much time it's just not funny.
It's OK to structure your user interface code based on your UI parts, but the non-UI logic of your program should be kept separate.
But even on the UI side you shouldn't just smash everything into one class. Instead, divide your UI code into several classes, so that each class deals with only one UI component and doesn't need to know about the others:
class Application
{
    PropertiesPanel propertiesPanel;
    ThreeDView threeDView;
    ToolPanel toolPanel;
}

class PropertiesPanel
{
    string dataFile;
    DataSet dataSet;
    string[] dataIdColumn;
    void OpenXml();
    void UpdateGridView();
}

class ThreeDView
{
    string modelFile;
    Panel panel;
    PointF[] objectLocations;
    void RenderView();
}

class ToolPanel
{
    string configurationFile;
    Button[] buttons;
    void CreateButtons();
}
What are your opinions on this?
It’s a mess.
Can this architecture work in the long run?
No. (At least not without a lot of sweat.)
The names are insanely verbose. If you think about it, the long name prefixes are there to create a kind of separate ‘namespace’ to group related things together. There is already a better language construct for exactly this kind of thing: classes. But the main problem is elsewhere.
User interfaces change often; concepts change seldom. If your code structure mirrors the user interface, you are locked into this particular interface, which makes reusing and refactoring the code quite hard. If you structure the code around the base concepts of the problem domain instead, you have a better chance of reusing existing code as the software develops, because the design will adapt to changes. And changes always happen in software.
(I hope that the ‘base concepts from the problem domain’ part is clear. For example, if you create a system for a local theater, you should base your design on the concepts of a Movie, Visitor, Seat, and so on, instead of structuring it around MovieList, SeatMap, TheaterPlan and such.)
Most of the time it is a good idea to decouple the core code from the GUI as much as possible (this is exactly what the Model-View-Controller pattern is about). This is not an academic exercise, nor is it required only when the interface is going to change. A great example of decoupling the GUI from the rest of the code is GUI programming on Mac OS X, with Interface Builder, bindings and such. Unfortunately it takes a while to get into; you cannot simply skim the docs on the web and be enlightened.
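As a minimal illustration of that decoupling, here is a C# sketch built on the theater example above (all names are invented); the model knows nothing about the view, and the controller mediates between the two:

// Model: pure domain concept, no UI dependencies.
public class Seat
{
    public int Row { get; set; }
    public int Number { get; set; }
    public bool Reserved { get; private set; }
    public void Reserve() => Reserved = true;
}

// View: only knows how to display; user actions are forwarded to the controller.
public interface ISeatView
{
    void ShowSeat(Seat seat);
}

// Controller: updates the model and tells the view to refresh.
public class SeatController
{
    private readonly Seat seat;
    private readonly ISeatView view;

    public SeatController(Seat seat, ISeatView view)
    {
        this.seat = seat;
        this.view = view;
    }

    public void OnReserveClicked()
    {
        seat.Reserve();      // update the model...
        view.ShowSeat(seat); // ...and refresh the view
    }
}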