I have run into a problem when designing my software.
My software consists of a few classes, Bot, Website, and Scraper.
Bot is the most abstract, executive class responsible for managing the program at a high-level.
Website is a class which contains scraped data from that particular website.
Scraper is a class which may have multiple instances per Website. Each instance is responsible for a different part of a single website.
Scraper has a function scrape_data() which returns the JSON data associated with the Website. I want to pass this data into the Website somehow, but can't find a way since Scraper sits on a lower level of abstraction. Here are the ideas I've tried:
# In this idea, Website would have to poll Scraper. Scraper is already polling Server, so this seems messy and inefficient
class Website:
    def __init__(self):
        self.scrapers = list()
        self.data = dict()
    def add_scraper(self, scraper):
        self.scrapers.append(scraper)
    def add_data(self, type, json):
        self.data[type] = json
    ...
# The problem here is that Scraper has no awareness of the dict of websites. It cannot pass the data returned by Scraper into the respective Website
class Bot:
    def __init__(self):
        self.scrapers = list()
        self.websites = dict()
How can I solve my problem? What sort of more fundamental rules or design patterns apply to this problem, so I can use them in the future?
As soon as you start talking about a many-to-many parent/child relationship, you should be thinking about compositional patterns rather than traditional inheritance, specifically the Decorator pattern. Your add_scraper method is a tip-off that you're essentially looking to build a handler stack.
The classic example for this pattern is a set of classes responsible for producing the price of a coffee. You start with a base component "coffee", and you have one class per ingredient, each with its own price modifier. A class for whole milk, one for skim, one for sugar, one for hazelnut syrup, one for chocolate, etc. And all the ingredients as well as the base components share an interface that guarantees the existence of a 'getPrice' method. As the user places their order, the base component gets injected into the first ingredient/wrapper-class. The wrapped object gets injected into subsequent ingredient-wrappers and so-on, until finally getPrice is called. And each instance of getPrice should be written to first pull from the previously injected one, so the calculation reaches throughout all layers.
The benefits are that new ingredients can be added without impacting the existing menu, existing ones can have their price changed in isolation, and ingredients can be added to multiple types of drinks.
In your case, the data-struct being decorated is the Website object. The ingredient classes would be your Scrapers, and the getPrice method would be scrape_data. And the scrape_data method should expect to receive an instance of Website as a parameter, and return it after hydration. Each Scraper needs no awareness of how the other scrapers work, or which ones to implement. All it needs to know is that a previous one exists and adheres to an interface guaranteeing that it too has a scrape_data method. And all will ultimately be manipulating the same Website object, so that what gets spit back out to your Bot has been hydrated by all of them.
This puts the onus of knowing what Scrapers to apply to what Website on your Bot class, which is essentially now a Service class. Since it lives in the upper abstraction layer, it has the high-level perspective needed to know this.
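A minimal sketch of the handler stack described above, under stated assumptions: PriceScraper, ReviewScraper, the hydrate hook, Bot.run and the hard-coded data are all hypothetical names and stand-ins, not part of the original design. Each scraper first delegates to the one it wraps, then adds its own data to the shared Website.

```python
class Website:
    """Plain data holder that the scrapers fill in."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Scraper:
    """Base wrapper: delegates to the previously injected scraper, if any."""

    def __init__(self, previous=None):
        self.previous = previous

    def scrape_data(self, website):
        # Let the wrapped scraper hydrate the website first.
        if self.previous is not None:
            website = self.previous.scrape_data(website)
        self.hydrate(website)
        return website

    def hydrate(self, website):
        raise NotImplementedError


class PriceScraper(Scraper):  # hypothetical scraper
    def hydrate(self, website):
        website.data["prices"] = {"widget": 9.99}  # stand-in for real scraping


class ReviewScraper(Scraper):  # hypothetical scraper
    def hydrate(self, website):
        website.data["reviews"] = ["great", "ok"]  # stand-in for real scraping


class Bot:
    """Knows which scrapers apply to which website."""

    def __init__(self):
        self.websites = {}

    def run(self, name, scraper_classes):
        stack = None
        for cls in scraper_classes:
            stack = cls(previous=stack)  # wrap the stack built so far
        website = Website(name)
        self.websites[name] = stack.scrape_data(website) if stack else website
        return self.websites[name]


bot = Bot()
site = bot.run("example.com", [PriceScraper, ReviewScraper])
```

Only Bot decides which scrapers are stacked for a given site; neither scraper knows about the other, and both write into the same Website instance.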
One way to go about this, taking inspiration from node-based structures, is to give the Scraper class an attribute that directly references its Website. If I understand correctly, you described a one-to-many relationship (one Website can have multiple Scrapers). Then, when a Scraper needs to pass its data to its Website, it can do so directly through that attribute:
class Website:
    def __init__(self):
        self.scrapers = list()  # You can remove this list of scrapers, since each
                                # scraper references its master website, not the other way around
        self.data = dict()  # I'm not sure how you want the data to be stored;
                            # it could be a list, a dict, etc.
    def add_scraper(self, scraper):
        self.scrapers.append(scraper)
    def add_data(self, type, json):
        self.data[type] = json
class Scraper:
    def __init__(self, master_website):
        # Respective code
        self.master = master_website  # This way you have a direct reference to the website.
                                      # master_website is a Website object
        ...
    def scrape_data(self):
        json = ...  # placeholder for the scraped data in JSON format
        self.master.add_data(type, json)  # `type` stands for the key this scraper's data is stored under
I don't know how efficient this would be, or whether you need to know at any moment which scrapers are linked to which website, though.
I am using Django REST framework to build an API web service that exposes many already-trained machine learning models. Some models can predict with a batch size of 1, or one image at a time. Others need a history of data (timelines) to be able to predict or forecast. These timelines are usually too large to be passed as a parameter. Given that, we want to give the requester the ability to make a request by either:
sending the data (small batches) to predict as parameter.
passing a database id/reference as parameter then the API will query the database and do the predictions.
So the question is: what would be the best API design for identifying which approach the requester chose? Some considered approaches:
Add /db to the path of the endpoint, e.g. POST models/<X>/db. The problem with this approach is that twice as many endpoints (2x) are generated for each model.
Add a boolean parameter db to each request. The problem with this approach is that it adds overhead to each request just to check which approach was chosen. It also makes the code less readable.
Set a global variable for each requester when they sign up for the API token. The problem is that this restricts the requester to one mode, which is not convenient.
What would be the best approach for this case?
The fact that you currently have more than one source would cause me to seriously consider abstracting the "source" component as much as possible, to allow all manner of sources. For example, suppose that future users would like to pull data out of MongoDB instead of whatever database you are currently using? Or from some other storage structure? Or pull from a third party? Or, or, or....
In any case the question is now "how much do they all have in common, and what should they all implement?"
class Source(object):
    def __get_batch__(self, batch_size=1):
        raise NotImplementedError()  # each source needs to implement this on its own

# http_library.POST_endpoint("/db")
class DBSource(Source):
    def __init__(self, post_data):
        if post_data["table"] in ["data1", "data2"]:
            self.table = post_data["table"]
        else:
            raise Exception("Must use predefined table to prevent SQL injection")
    def __get_batch__(self, batch_size=1):
        return sql_library.query("SELECT * FROM {} LIMIT ?".format(self.table), batch_size)

# http_library.POST_endpoint("/local")
class LocalSource(Source):
    def __init__(self, post_data):
        self.data = post_data["data"]
        self.i = 0  # index of the next item that has not been returned yet
    def __get_batch__(self, batch_size=1):
        data = self.data[self.i:self.i + batch_size]
        self.i += batch_size
        return data
This is just an example. However, if a fixed part of your path designates "the source", then you have left yourself open to scale this indefinitely.
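To illustrate how a fixed path segment can select the source, here is a rough, framework-free sketch. The SOURCES registry, handle_predict, the get_batch method name (renamed from the dunder-style name above for brevity) and the in-memory FAKE_DB are all assumptions made for this example; a real service would wire this into its own routing and database layer.

```python
class Source:
    def get_batch(self, batch_size=1):
        raise NotImplementedError


class LocalSource(Source):
    """Data arrives in the request body."""

    def __init__(self, post_data):
        self.data = post_data["data"]
        self.i = 0

    def get_batch(self, batch_size=1):
        batch = self.data[self.i:self.i + batch_size]
        self.i += batch_size
        return batch


class DBSource(Source):
    """Data is looked up by reference; a dict stands in for the database."""

    FAKE_DB = {"series-1": [10, 20, 30, 40]}  # hypothetical stored timelines

    def __init__(self, post_data):
        self.rows = self.FAKE_DB[post_data["id"]]
        self.i = 0

    def get_batch(self, batch_size=1):
        batch = self.rows[self.i:self.i + batch_size]
        self.i += batch_size
        return batch


# The fixed path segment selects the source; a new source is one new entry.
SOURCES = {"local": LocalSource, "db": DBSource}


def handle_predict(path, post_data, batch_size=2):
    """Handle POST models/<model>/<source>; return the first batch to predict on."""
    _, model_name, source_name = path.strip("/").split("/")
    if source_name not in SOURCES:
        raise ValueError("unknown source: " + source_name)
    source = SOURCES[source_name](post_data)
    return model_name, source.get_batch(batch_size)


by_reference = handle_predict("models/forecast/db", {"id": "series-1"})
by_value = handle_predict("/models/classifier/local", {"data": [1, 2, 3]}, batch_size=1)
```

Supporting another source then means adding one class and one registry entry, while the per-model prediction code only ever sees get_batch.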
Add /db to the path of the endpoint, e.g. POST models/<X>/db. The problem with this approach is that twice as many endpoints (2x) are generated for each model.
Inevitable. DRY out common code to sub-methods.
Add a boolean parameter db to each request. The problem with this approach is that it adds overhead to each request just to check which approach was chosen. It also makes the code less readable.
There won't be any additional overhead (that's what your underlying framework does to match a URL to a function/method anyway). However, these are 2 separate functionalities, I would keep them separate, so I would prefer the first approach.
Set a global variable for each requester when they sign up for the API token. The problem is that this restricts the requester to one mode, which is not convenient.
Yikes! That only works if you provide a UI that lets users select their preference and apply it globally (I don't think any UX designer will agree to that).
That being said, the API design should be driven by asking who is mastering (or owning) the data. If it's the application, and the user already knows the ID of that entity, then you shouldn't be asking the user for the data.
If it's the user, and the data won't fit in a POST body, then I would say a real-time API may not be the right solution; think about message queues or pub/sub-based systems.
If you need a hybrid solution as you asked in the question, then, I would prefer the 1st approach.
First, thanks for any advice. I am new to all of this and apologize for any obvious blunders.
Second, the question:
In an interface for entering clients that often possess a number of roles, it seemed efficient to create a set of inputs whose visual characteristics and associated data binding are both based simply on the input's name.
For example, inquirerfirstname would be any caller or emailer who contacted our company.
The name would dictate a label, placeholder, and the location in firebase where the data would be stored.
The single name could be used, I thought, with a relational table (a state machine or a series of nested ifs) to define the properties of the input and change its outward appearance and inner bindings through property manipulation.
I created a set of nested ifs and console-logged the property changes in the inputs, but their representation in the host element (a collection of inputs that generated messages to clients as well as messages to sales staff) remained unaffected.
I attempted using the ready callback. I forced the state change with a button.
I was unable to use var name = new MyInput(name). I believe this method would be the most effective, but I am unsure how to "stamp" the JavaScript into a heavyweight stamped parent element.
An example of a more complicated and dynamic use of a constructor and a factory implementation that can read database (JSON) objects and generate HTML elements in response would be awesome.
In vanilla JavaScript a forEach would seem to do the trick, but definitions and structure as well as binding would not be organic (read: it might be easier just to stamp the inputs in Polymer by hand in HTML).
I would be really grateful for any help. I have looked for a week and failed to find one example that took data binding, physical appearance, attribute swapping, property binding and object reading into account.
I guess it's a lot, but each piece independently (save the use of the constructor) I think I get.
Thanks again.
Jason
PS: I am aware that stamping the element seems to preclude dynamic property, attribute and binding assignments. I was hoping a computed attribute mixed with a factoryImpl would be an option (with a nice example).
I have a multi-step form where the user fills out info on several different pages. In conventional rails, you keep each resource separate in its own controller and you use the REST actions to manipulate the data.
In the conventional system I would have 3-5 different controllers (some steps are optional) for a single multi-step form. There's no real sense of "order" in the controllers if I do it the conventional way. A new developer coming on to the project has to learn what steps map to what steps and so forth.
On the other hand, I have thought about breaking convention and having a single controller that organizes the entire multi-step form. This controller would be full of methods like:
def personal_info
  # code...
end

def personal_info_update
  # code...
end

def residence_info
  # code...
end

def residence_info_update
  # code...
end

# many more coupled methods like the above...
This single controller will get fairly long, but it's essentially a bunch of coupled methods: one for showing the step (form) and the other for updating and redirecting to the next step.
This would be breaking Rails convention and I would have to set up my own routing.
But I'm curious how others have solved this problem? I know both CAN work, but I would like to know which is easier to maintain and code with in the long run.
A resource does not equal a page. I suspect that both ways would break a REST constraint.
All of your concerns are with the view domain, which resides in your browser. If you want to display a single form in multiple parts, you should do so using HTML, CSS, etc.
Otherwise you're just creating temporary storage on your servers for the form's progress.
I did something like this with https://github.com/pluginaweek/state_machine
The idea was to have one state per step of the form and simply render a different form partial depending on which state the actual resource is in. The above gem lets you specify validations and callbacks for each state.
Like this, you can use the standard REST controller actions.
I need some help figuring out the best way to proceed with creating a Rails 3 engine (or plugin, and/or gem).
Apologies for the length of this question...here's part 1:
My company uses an email service provider to send all of our outbound customer emails. They have created a SOAP web service and I have incorporated it into a sample Rails 3 app. The goal of creating an app first was so that I could then take that code and turn it into a gem.
Here's some of the background: The SOAP service has 23 actions in all and, in creating my sample app, I grouped similar actions together. Some of these actions involve uploading/downloading mailing lists and HTML content via the SOAP WS and, as a result, there is a MySQL database with a few tables to store HTML content and lists as a sort of "staging area".
All in all, I have 5 models to contain the SOAP actions (they do not inherit from ActiveRecord::Base) and 3 models that interact with the MySQL database.
I also have a corresponding controller for each model and a view for each SOAP action that I used to help me test the actions as I implemented them.
So...I'm not sure where to go from here. My code needs a lot of DRY-ing up. For example, the WS requires that the user authentication info be sent in the envelope body of each request. So, that means each method in the model has the same auth info hard coded into it which is extremely repetitive; obviously I'd like for that to be cleaner. I also look back now through the code and see that the requests themselves are repetitive and could probably be consolidated.
All of that I think I can figure out on my own, but here is something that seems obvious but I can't figure out: how can I create methods that can be used in all of my models (thinking specifically of the user auth part of the equation)?
Here's part 2:
My intention from the beginning has been to extract my code and package it into a gem in case any of my ESP's other clients could use it (plus I'll be using it in several different apps). However, I'd like it to be very configurable. There should be a default minimal configuration (i.e. just models that wrap the SOAP actions) created just by adding the gem to a Gemfile. However, I'd also like there to be some tools available (like generators or Rake tasks) to get a user started. What I have in mind is options to create migration files, models, controllers, or views (or the whole nine yards if they want).
So, here's where I'm stuck on knowing whether I should pursue the plugin or engine route. I read Jordan West's series on creating an engine and I really like the thought of that, but I'm not sure if that is the right route for me.
So if you've read this far and I haven't confused the hell out of you, I could use some guidance :)
Thanks
Let's answer your question in parts.
Part One
Ruby's flexibility means you can share code across all of your models extremely easily. Are they extending any sort of class? If they are, simply add the methods to the parent object like so:
class SOAPModel
  def request(action, params)
    # Request code goes in here
  end
end
Then it's simply a case of calling request in your respective models. Alternatively, you could access this method statically with SOAPModel.request. It's really up to you. Otherwise, if (for some bizarre reason) you can't touch a parent object, you could define the methods dynamically:
[User, Post, Message, Comment, File].each do |model|
  model.send :define_method, :request, proc { |action, params|
    # Request code goes in here
  }
end
It's Ruby, so there are tons of ways of doing it.
Part Two
Gems are more than flexible enough to handle your problem; both Rails and Rake are pretty smart and will look inside your gem (as long as it's in your environment file and Gemfile). Create a generators directory and a /name/name_generator.rb, where name is the name of your generator. Then just run rails g name and you're there. The same goes for Rake tasks.
I hope that helps!
Let me begin with an illustrative example (assume the implementation is in a statically typed language such as Java or C#).
Assume that you are building a content management system (CMS) or something similar. The data is hierarchically organised into Folders. Each folder has a collection of children; a child may be a Page or a Folder. All items are stored within a root folder. No cycles are allowed. We have an acyclic graph.
The system will have a remote API and instances of Folder and Page must be serialized / de-serialized across the network. With a typical implementation of folder, in which a folder's children are a List, serialization of the root node would send the entire graph. This is unacceptable for obvious reasons.
I am interested to hear how people have solved this problem in the past.
I have two potential suggestions:
Navigation by query: Change the domain model so that the folder class contains only a list of IDs for each child. To access a child we must query for it. Serialisation is now trivial since the graph ends at a well defined point. The major downside is that we lose type safety - the ID could be for something other than a folder/child.
Stop and re-attach: During serialization stop whenever we detect a reference to a folder or page, send the ID instead. When de-serializing we must then look up the corresponding object for each ID and re-attach it at the relevant position in the nascent object.
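The "stop and re-attach" idea could look roughly like the sketch below. It is written in Python rather than Java or C# for brevity, and the JSON wire format, the "kind" tag and the lookup callback are assumptions of this example rather than part of the question. Tagging each reference with its kind is one way to recover some of the type information that a bare ID loses.

```python
import json


class Page:
    def __init__(self, id, title):
        self.id = id
        self.title = title


class Folder:
    def __init__(self, id, name, children=None):
        self.id = id
        self.name = name
        self.children = children or []  # Page or Folder instances


def serialize(folder):
    """Serialize one folder; children are cut off and sent as typed IDs."""
    return json.dumps({
        "id": folder.id,
        "name": folder.name,
        "children": [
            {"kind": type(child).__name__, "id": child.id}
            for child in folder.children
        ],
    })


def deserialize(payload, lookup):
    """Rebuild a folder, re-attaching each child via lookup(kind, id)."""
    raw = json.loads(payload)
    children = [lookup(ref["kind"], ref["id"]) for ref in raw["children"]]
    return Folder(raw["id"], raw["name"], children)


# Usage: a dict stands in for whatever store resolves IDs on the receiving side.
readme = Page(3, "readme")
index = Page(4, "index")
docs = Folder(2, "docs", [readme])
root = Folder(1, "root", [docs, index])
store = {("Folder", 2): docs, ("Page", 3): readme, ("Page", 4): index}

wire = serialize(root)
copy = deserialize(wire, lambda kind, id: store[(kind, id)])
```

The payload for root mentions docs only by ID, so the size of the message no longer depends on how deep the tree below it is.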
I don't know what kind of API you are trying to build, but your suggestion #1 sounds like it is close to what is recommended for REST style services and APIs. Basically, a Folder object would contain a list of URLs to its children.
The navigation by query solution was used for NFS. Reading through your question, it looks to me as if you're trying to implement a kind of file system yourself.
If you're looking specifically into sending objects over the network there is always CORBA. Aside from that there is DCOM and the newer WCF. But wait there is more like RMI. Furthermore there are Web Services. I'll stop here now.
Suppose you model the whole tree with every element being a Node, the specialisations of Node being Folder and, umm, Leaf. You have a "root" Node. Nodes have the methods
canHaveChildren()
getChildren()
Leaf nodes have the obvious behaviours (they never even need to hit the network).
A Folder's getChildren() fetches the next set of nodes.
I did devise a system with Restful services along these lines. Seemed to be reasonably easy to program to.
I would not do it by the navigation by query method, simply because I would like to stick with the domain model where folders contain folders or pages.
Customizing the serialization might also be tricky, bug-prone and difficult to change or understand.
I would suggest that you introduce an object like FolderBrowser in your model, which takes an ID and gives you a list of the contents of the folder. That will make your service operations simpler.
Cheers,
Unmesh
The classical solution is probably to use a proxy pattern, where some of the graph is sent over the network and some of the folders are replaced by proxies that will not have their lists of children populated until they are queried. A round trip to the server takes a significant amount of time, and making all folders proxies will probably result in too many requests (a new request each time the contents of a folder are inspected), so you want some trade-off between the size of each chunk of data and the number of server requests needed in a typical scenario. This is of course application specific, but sending the contents of all child folders down to, for instance, depth 2 might be a useful strategy...
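A small sketch of that trade-off, again in Python for brevity and with several assumptions: export, build, find, fetch, the dict-based TREE and the requests log are hypothetical stand-ins for the real server and transport, and pages are left out so that every node is a folder. The server sends children down to a fixed depth, and anything cut off becomes a proxy that fetches its children on first access.

```python
class Folder:
    def __init__(self, id, name, children=None):
        self.id = id
        self.name = name
        self._children = children

    def get_children(self):
        return self._children


class FolderProxy(Folder):
    """Stands in for a folder whose children were not sent."""

    def __init__(self, id, name, fetch):
        super().__init__(id, name, children=None)
        self._fetch = fetch  # callable: folder id -> list of children

    def get_children(self):
        if self._children is None:  # first access costs one round trip
            self._children = self._fetch(self.id)
        return self._children


def export(folder, depth):
    """Server side: plain dicts, children included down to `depth` levels."""
    node = {"id": folder["id"], "name": folder["name"]}
    if depth > 0:
        node["children"] = [export(c, depth - 1) for c in folder["children"]]
    return node


def build(node, fetch):
    """Client side: real folders where children were sent, proxies elsewhere."""
    if "children" not in node:
        return FolderProxy(node["id"], node["name"], fetch)
    return Folder(node["id"], node["name"],
                  [build(c, fetch) for c in node["children"]])


# A fake server tree and transport, so the sketch runs on its own.
TREE = {"id": 1, "name": "root", "children": [
    {"id": 2, "name": "a", "children": [
        {"id": 3, "name": "b", "children": [
            {"id": 4, "name": "c", "children": []}]}]}]}
requests = []  # records every simulated round trip


def find(node, id):
    if node["id"] == id:
        return node
    for child in node["children"]:
        hit = find(child, id)
        if hit is not None:
            return hit
    return None


def fetch(id):
    requests.append(id)
    sent = export(find(TREE, id), depth=2)
    return [build(c, fetch) for c in sent["children"]]


root = build(export(TREE, depth=2), fetch)
a = root.get_children()[0]  # sent with the first chunk, no request
b = a.get_children()[0]     # a proxy: its children were cut off
c = b.get_children()[0]     # triggers exactly one request for folder 3
```

Raising the depth makes each response larger but reduces the number of round trips; the right value depends on how users typically browse the tree.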
Long story short: What will probably work best is your solution #1 with the exception that you want to send more than one folder at a time because of the overhead of a round trip to the server...