How would you model a read/follow system? - oop

So I have the following domain model:
Article, which is basically a blog post and currently an Entity.
Now, I'd like to add the following features:
When a user views the article (in their browser), an API call is made to "flag" the blog post as read.
Now, if I do some computation, I should be able to determine which articles haven't been read yet.
When a user posts a comment on an article, an API call is made to "flag" the blog post as followed.
Now, if I do some computation, I should be able to determine whether any new comments have been posted since the user's latest comment.
Basically, both features (read & follow) share the same attributes: an article id, a user id and a read/action date.
Note that if an Article is followed and then read, the read date should be used.
Therefore, I thought I could use the same object, adding an extra attribute to mark it as followed.
Do you have any design ideas?
Note that there are many articles & users. I'm using Doctrine2 and MySQL, but this applies to any language.
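To make the idea concrete, here is a minimal sketch of that single-entity design (plain Python, all names invented - the Doctrine2 mapping would just mirror these fields):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ArticleInteraction:
    # One row per (user, article) pair, shared by both features.
    article_id: int
    user_id: int
    action_date: datetime              # follow date, or read date once read
    is_followed: bool = False          # the extra attribute marking a follow
    last_read_date: Optional[datetime] = None

    def mark_followed(self, when: datetime) -> None:
        self.is_followed = True
        if self.last_read_date is None:  # a read must win (see note above)
            self.action_date = when

    def mark_read(self, when: datetime) -> None:
        self.last_read_date = when
        self.action_date = when          # read date takes precedence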

To ensure your application scales well, I'd do your computations locally when the events are triggered, i.e. someone adds a comment and the system immediately checks who has an investment in that new comment. Otherwise you end up with a scheduled task processing all the data, which will run fine at first but will have an ever-increasing workload as the relations between users, articles and comments grow.
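As a rough illustration of that event-driven approach (every name here is hypothetical, including the repo object):

from datetime import datetime, timezone

def on_comment_posted(article_id: int, author_id: int, repo) -> None:
    # Do the bookkeeping at event time, not in a batch job:
    # flag every follower of the article as having unread comments.
    now = datetime.now(timezone.utc)
    for interaction in repo.interactions_for_article(article_id):
        if interaction.user_id == author_id:
            continue  # the commenter already knows about their own comment
        interaction.has_new_comments = True
        interaction.last_activity = now
    repo.flush()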
You can also look into using the Map/Reduce pattern; Ayende has a good introductory article on this, which is almost in the same application domain as you describe (articles, comments, etc.).
As for the event of marking an article or comment as read by a particular user, this is neither an article thing nor a user thing. If you were using a document database and wanted to store this data against a user, it could build up quite a bit of data over time. I'd be more tempted to store the data either in a new entity or against the article (as in theory an article will see an initial burst of interest and then dip to a level representing its popularity).
Hopefully some of that might help.

Related

2sxc Knowledge Management solution hurdles

I'm evaluating 2sxc as a possible platform for implementing a knowledge management solution but we're in a bit of a rush. Our alternative is DNN Live Articles.
So far I really like the look of 2sxc, but I have questions regarding our possible use of it.
The main questions I have are around hierarchical lists like nested Categories and permissions.
From the look of some of the apps I've installed, like FAQs, there are Categories, but I can't find anything yet where they are nested. I tried creating a Content Type and adding fields where the first is the Category Name and the second is Parent Category. I created a new Content Type Field with a Data Type of Entity, but the only options for Input Type are default and Content Block Items. It works, but when you create a new category, the content that comes up in the Parent Category field covers just about everything - I'm not sure I understand the concept behind this.
Then the second issue is permissions. Does this system incorporate permissions somehow? We'd like to lock down knowledge articles by category, but I haven't seen any implementations that showcase how one would do this.
Regarding #1 I don't understand your question, sorry :)
Regarding #2: there is no rule-based security, so you can't say "items with category X may be edited, but category Y may not"
BUT: you can easily implement this in your UI, if your main concern is user guidance and not "bad people with very good IT skills".

phpBB3: is there any way to restrict visibility at the topic level?

My BB has an "archives" forum. I want to write a mod that will make it so that, if a topic is in the archives, then it can be viewed only by users who joined before the topic was posted. Is this feasible?
Yes, it is feasible. Using the Topics Only Visible to OP mod, you even have a fairly close approximation of what you need to happen, functionality-wise.
From this mod, a few things would need to change:
The viewtopic.php instructions would need to account for post creation date and user creation date instead of being for the original poster
The viewforum.php instructions would need to account for first post creation and user creation date instead of being for the original poster
After looking through the installation instructions and the changes you'd need to make, it seems those are the two biggest changes required. The ACP changes appear to be mostly wording, and the variable names could probably be made more appropriate, since your mod won't be about only the OP seeing the post.
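The core check itself is tiny. A language-agnostic sketch (Python here, though the mod itself would be PHP; user_regdate and topic_time are phpBB3's Unix-timestamp columns, and the moderator bypass is my assumption, not part of the mod):

def can_view_archived_topic(user_regdate: int, topic_time: int,
                            is_moderator: bool = False) -> bool:
    # Visible only to users who registered before the topic was posted;
    # moderators presumably keep full access (an assumption).
    return is_moderator or user_regdate < topic_time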

API object versioning

I'm building an API and I have a question about how to represent objects.
Imagine we have a system with Articles that have a bunch of properties. Some of these properties are complex; for example, the Author of the Article refers to another object. We have a URL to fetch all the articles in the system, and another URL to fetch a particular Article.
My first approach to implementing this would be to create two representations of the same Article object, because when you request all the articles, it makes sense not to retrieve all the information about them - for example, just the title, the date and the name of the author (instead of the whole Author object), excluding other properties like tags or the content. The idea behind this is to make the response for all the Articles a little lighter.
Now I'm moving to the client side, and I decide to implement an SDK for Android, for example. So the first step would be to create the objects that store the information I retrieve from the API. Now a problem pops up: I want to define the Article object, but I would need two versions of it, which is not only more difficult to implement but also more difficult to use.
So my question is: when defining an API, is it good practice to have multiple versions of the same object (maybe a light one, and a full one) to save some bandwidth when sending the result of a request, at the cost of a more difficult-to-use service? Or is it not worth it, and should you always return the same version of the object, generating heavier responses but making the service easier to use?
I work at a company that deals with Articles as well and we also have a REST API to expose the data.
I think you're on the right track, but I'll take it even one step further. These are the three potential calls for large entities in an API:
Index. For the articles, this would be something like /articles. It just returns a list of article ids. You can add parameters to filter, sort, etc. It's very lightweight and I've found it to be very useful.
Header/Mini/Light version. These are only the crucial fields that you think will meet the widest variety of use cases. For us, we have a lot of use cases where we might want to display the top 5 articles, and in those cases we only need the title, author and maybe the publication date. Those fields belong in a "header" article, or a "light" article. This is especially useful for AJAX calls, as you don't want to return the entire article (for us the object is quite large).
Full version. This is the full article. All the text/paragraphs/image references - everything. It's a heavy call to make, but you will be guaranteed to get whatever is available.
Then it just takes discipline to leave the objects the way they are. Ideally users are able to get the version described in (2) to save time over the wire, but if they have to, they go with (3).
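For illustration, the three levels might look something like this (a sketch with made-up field names, treating stored articles as plain dicts):

def index_articles(articles: list) -> list:
    # 1. Index: ids only - cheap to produce and to transfer.
    return [a["id"] for a in articles]

def light_article(article: dict) -> dict:
    # 2. Header/light: only the crucial fields.
    return {k: article[k] for k in ("id", "title", "author_name", "published")}

def full_article(article: dict) -> dict:
    # 3. Full: everything that is stored, heavy fields included.
    return dict(article)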
I've considered having a dynamic way to return only the fields people are interested in, but it would be a lot of implementation work. Basically, the idea was to let the user go to /article and show them a sample JSON result. The user could then click on the fields they wanted returned and get a token. They'd pass the token as a parameter to the API, and the API would then know which fields to return.
This creates a dynamic schema. It's lots of work and I never got around to it, but you can see that if you want to be creative, you can.
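A much lighter variant of the same idea, skipping the token step, is a plain fields parameter (again just a sketch):

def select_fields(article: dict, fields: str = "") -> dict:
    # Supports e.g. GET /article/42?fields=title,author_name
    if not fields:
        return dict(article)          # default: the full object
    wanted = {f.strip() for f in fields.split(",")}
    return {k: v for k, v in article.items() if k in wanted}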
Consider whether your data (for one API client) is changing a lot or not. If it's possible to cache data on the client, that'll improve performance by not contacting the API as much. Otherwise I think it's a good idea to have a light-weight and full-scale object type (or more like two views of the same object type).
In the client you should implement it as one object type (to keep it DRY: Don't Repeat Yourself) with all the properties. When fetching a light-weight object, you only store a few of the properties, the rest being null (or a similar “undefined” value for the given property type). It should be possible to determine whether all of the properties are loaded or only a partial subset.
When making API requests in the client on a given model (i.e. authors), you should be explicit about whether the light-weight or full-scale object is needed and whether cached data is acceptable. This makes it possible to control the data in the UI layer. For example, a list of authors might only need to display a name and the number of articles connected with that author. When displaying the author screen, more properties are needed. Also, if using cached data, you should provide a way for the user to refresh it.
When the app works you can start to implement optimizations like: don't fetch light-weight data if full-scale data is already known, and don't fetch data at all if a recent cached copy exists. I think it's best to look at the actual use cases and improve performance where it has the highest value for the user.
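A sketch of that single client-side type with partial loading (field names are invented):

from dataclasses import dataclass, field

@dataclass
class Article:
    id: int
    title: str = None
    author: str = None
    content: str = None
    loaded_fields: set = field(default_factory=set)  # filled by the API layer

    @property
    def fully_loaded(self) -> bool:
        # Lets the UI decide whether a follow-up full fetch is needed.
        return {"title", "author", "content"} <= self.loaded_fields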

How to analyse Wikipedia's article database with R?

This is a "big" question, that I don't know how to start, so I hope some of you can give me a direction. And if this is not a "good" question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let's say the English one) and do statistics. For example, I am interested in how many active editors (a term which would need to be defined) Wikipedia had at each point in time (let's say over the last 2 years).
I don't know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
What tools do I need for this (besides basic R)? MySQL on my computer? An RODBC database connection?
How do you start planning for such a project?
You'll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
# 2010-03-17 04:33:50 done Log events to all pages.
* This contains the log of actions performed on pages.
* pages-logging.xml.gz 1.0 GB
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You'll then import the XML into MySQL. Generating a histogram of users per day, week, year, etc. won't require R; you'll be able to do that with a single MySQL query, something like:
select DAYOFYEAR(wiki_edit_timestamp), count(*)
from page_logs
group by DAYOFYEAR(wiki_edit_timestamp)
order by DAYOFYEAR(wiki_edit_timestamp);
etc.
(I'm not sure what their actual schema is, but it'll be something like that.)
You'll run into issues, no doubt, but you'll learn a lot too. Good luck!
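Before (or instead of) importing into MySQL, you can also sanity-check the dump by streaming it. A sketch - the element names follow the MediaWiki XML export format (<logitem>, <timestamp>), but verify them against the actual dump:

import gzip
from collections import Counter
from xml.etree.ElementTree import iterparse

events_per_day = Counter()
with gzip.open("enwiki-20100312-pages-logging.xml.gz") as f:
    for _, elem in iterparse(f):
        if elem.tag.endswith("timestamp"):
            events_per_day[elem.text[:10]] += 1   # "YYYY-MM-DD" prefix
        elem.clear()                              # keep memory flat on 1 GB

print(events_per_day.most_common(10))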
You could
work with the wikipedia database dumps, as already mentioned
work with the live MediaWiki API - see this minimal example at Rosettacode, my unfinished approach with an S3 class, or this package by Peter Konings (a small sketch against the live API follows this list)
work with DBpedia, an effort to extract knowledge from Wikipedia into a knowledge base. They offer online SPARQL access, which I don't know much about, and also datasets as N-Triples for download. See this Python script, which might be a starting point for an R script. This approach might be useful for accessing the content stored in Wikipedia (such as the infoboxes), but I am not sure whether information on contributors to Wikipedia is available.
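For the live-API option, here is a minimal sketch counting distinct editors per day from recent changes (action=query and list=recentchanges are real MediaWiki API parameters; note it only reaches back about a month, so it illustrates API access rather than a 2-year study):

from collections import defaultdict
import requests

params = {"action": "query", "list": "recentchanges", "format": "json",
          "rcprop": "user|timestamp", "rclimit": "500"}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)

editors_per_day = defaultdict(set)
for change in resp.json()["query"]["recentchanges"]:
    editors_per_day[change["timestamp"][:10]].add(change.get("user"))

for day, users in sorted(editors_per_day.items()):
    print(day, len(users))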
Try WikiXRay (Python/R) and Zotero.

Generate webpages directly from database or cache?

[I'm not asking about the architecture of SO, but it would be helpful to the question.]
On SO, when a user clicks on his/her name and clicks on "responses" they see other users responses to comment threads, questions, and answers in which they have participated. I've had the sneaking suspicion that I've missed certain responses out there, which made me wonder: if you had to build that thing, would you pull everything dynamically from the database every time a user requested it? Or would you modify it when there is new related activity in the application? Or would you build it in a nightly daemon process?
I imagine that the real answer is that it's dynamically constructed every time, but that the tables are denormalized in such a way so as to make the thing less time-consuming. How would you build it?
I'm asking about any platform, of course, not only on .Net.
I would pull it dynamically from the database every time. I think this gives you the best result from a user-experience standpoint, and then I would apply the principle that premature optimization is evil. Later, if there were performance issues, I would look into caching.
I think doing it as a daemon/push process would actually result in more overall work being done. That is, the updates would happen more frequently than the users request the info.
Obviously, when an answer or comment is posted, you'll want to identify the user that should be informed in their responses tab. Then just add a row to a responses table containing the response text, timestamp, and the user to which it belongs. That way you can dynamically generate the tab with a simple
select * from responses where user=<userid> order by time desc limit 30
or something like that.
p.s. Extra credit to anyone that can write a query that will remove old responses - assume that each person should have the last 30 responses in their responses tab.
I expect that userid would be a natural option for the clustered index. If you have an "Active" boolean field then you don't need to worry much about locks; the table could be write-only except to update the (unindexed) Active column. I bet it already works that way, since it appears that everything is recoverable.
Don't need no stinking extra-credit response remover.
I would assume this is denormalized in the database. The Comment table probably has both an answer_id and an answer_uid, so the SQL to find comments on your answers just runs against the Comment table. The same setup would work for the Answer table: each answer has a question_id and a question_uid.
Having said that, these are probably the same table: you have response_to_id and response_to_uid, which makes lots of code simpler and makes the "recent" tab a single select as well. In fact, the difference between the two selects is that one uses the uid and the other uses the response_to_uid.
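A sketch of that unified-table variant (table and column names are my guesses; conn is any DB-API connection with the pyformat paramstyle, e.g. MySQLdb or psycopg2):

RECENT_RESPONSES = """
    SELECT *
      FROM responses
     WHERE response_to_uid = %(uid)s
  ORDER BY created_at DESC
     LIMIT 30
"""

def recent_responses(conn, uid: int):
    # One select serves the "responses" tab; swapping the WHERE column
    # to uid would serve the "recent" tab, as described above.
    cur = conn.cursor()
    cur.execute(RECENT_RESPONSES, {"uid": uid})
    return cur.fetchall()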
I'd say that your UI and your database should both be driven by your Application Domain, so they will reflect each other based on their common provenance there.
Some quick notes to illustrate, using simplified Object Role Modeling as discussed by Fowler et al.
Entities
Users
Questions
Answers
Comments
Entity Roles
(Note: In Object Role Modeling, most Roles are reflexive. Some, e.g. booleans here, are monopolar)
Question has User
Question has QuestionVersions
Question has Answers
Question has Comments
Answer has AnswerVersions
Answer has Comments
QuestionVersion has Text
QuestionVersion has Timestamp
QuestionVersion has IsDeleted (could be inferred from a non-NULL DeletedTimestamp, e.g.)
QuestionVersion has DeletedByUser
QuestionVersion has DeletedTimestamp
Answer has User
AnswerVersion has Text
AnswerVersion has Timestamp
AnswerVersion has IsDeleted
AnswerVersion has DeletedByUser
AnswerVersion has DeletedTimestamp
Comment has Text
Comment has User
Comment has Timestamp
Comment has IsDeleted (boolean)
(note - no versions on comments)
I think that's the basics. These assertions drive ERDs in ORM. Hopefully it's self-evident how they drive the User Stories as well.
I don't think an implementation of a normalized design like this would require denormalization - especially since I think it's clear (from behavior) that queries => UI displays are cached to be refreshed 1X per minute.
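One way those assertions might translate to code (a sketch only; Question shown, Answer would mirror it):

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Comment:
    user_id: int
    text: str
    timestamp: datetime
    is_deleted: bool = False          # note - no versions on comments

@dataclass
class QuestionVersion:
    text: str
    timestamp: datetime
    is_deleted: bool = False          # could be inferred from deleted_timestamp
    deleted_by_user_id: Optional[int] = None
    deleted_timestamp: Optional[datetime] = None

@dataclass
class Question:
    user_id: int
    versions: list = field(default_factory=list)   # QuestionVersion items
    comments: list = field(default_factory=list)   # Comment items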