Logging visits performance-wise - SQL

So I'm developing a website in PHP with the Symfony framework (not that it matters for the question, though).
My website has articles of a sort, i.e. pages that will be created over time.
I'd like to have visit counts by day, week, etc., not only for my personal stats but also to display something like an "article of the day" on the homepage.
The way I would do it is: each time someone visits an article, insert a record into a visit_log table with the date and the id of the article.
An ON DUPLICATE KEY UPDATE (or equivalent) would be interesting, so the per-day count gets updated instead of a new row being inserted for every visit.
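Roughly what I have in mind, as a sketch only (the table and column names are invented, and I use SQLite's UPSERT syntax just to keep the snippet self-contained; on MySQL the same idea would be ON DUPLICATE KEY UPDATE):

# Sketch: one row per (article, day); repeat visits bump the counter.
# Requires SQLite >= 3.24 for ON CONFLICT ... DO UPDATE; on MySQL the
# equivalent clause is ON DUPLICATE KEY UPDATE visits = visits + 1.
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visit_log (
        article_id INTEGER NOT NULL,
        day        TEXT    NOT NULL,
        visits     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (article_id, day)
    )
""")

def record_visit(article_id):
    conn.execute(
        """
        INSERT INTO visit_log (article_id, day, visits)
        VALUES (?, ?, 1)
        ON CONFLICT (article_id, day) DO UPDATE SET visits = visits + 1
        """,
        (article_id, date.today().isoformat()),
    )
    conn.commit()

record_visit(42)
record_visit(42)
print(conn.execute("SELECT * FROM visit_log").fetchall())  # [(42, '<today>', 2)]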
That's pretty simple and it works, but I can't help wondering: is it the right way to do it? I'm thinking mostly of performance, but also of whether it's well designed. I know the table can get big over time, but I guess a cron job to clean it up regularly would do the trick.
Any thoughts? I guess this is a super common thing that people have already worked out.

Are you facing performance issues with the current implementation?
If you're, then one approach would be: instead of inserting each view directly into the database as it takes place, use an in-memory store to keep a list of pages and their view count {'Id': 'count'} for some interval (e.g. 5-10 mins) or till the store reaches a certain size, and after the store reaches its time or size limits, insert the counts into the database.
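A rough sketch of that first approach (all names here are illustrative, not a real API; in a PHP/Symfony app the buffer would live in something like APCu or Redis rather than a Python dict):

# Sketch: accumulate view counts in memory, flush them to the database in bulk.
import time
import threading

_counts = {}                 # {article_id: views accumulated since the last flush}
_lock = threading.Lock()
_last_flush = time.monotonic()
FLUSH_INTERVAL = 300         # seconds, i.e. every 5 minutes...
MAX_BUFFERED_PAGES = 1000    # ...or earlier, if the buffer grows too big

def flush_to_db(counts):
    # Placeholder: in a real app this would be one bulk INSERT/UPDATE
    # (e.g. MySQL's INSERT ... ON DUPLICATE KEY UPDATE) instead of a print.
    print("flushing", counts)

def record_view(article_id):
    global _last_flush
    pending = None
    with _lock:
        _counts[article_id] = _counts.get(article_id, 0) + 1
        interval_over = time.monotonic() - _last_flush >= FLUSH_INTERVAL
        if interval_over or len(_counts) >= MAX_BUFFERED_PAGES:
            pending = dict(_counts)
            _counts.clear()
            _last_flush = time.monotonic()
    if pending:
        flush_to_db(pending)   # one bulk write instead of one write per view

for _ in range(3):
    record_view(42)            # nothing hits the database until a flush is due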
Another approach is to periodically parse the web server logs (most web servers log this information by default) and, again, do bulk inserts into the database.

Related

Refactoring my database schema

I'm refactoring my current schema, and it's getting too abstract for me.
I monitor my servers with homemade monitoring software. This software sends HTTP requests to a Rails web server with about ten fields' worth of information, so I can get a quick overview of everything.
My current implementation:
server [id, name, created_date, edited_date, ..., etc ]
status_update [id, server_id, field1, field2, field3, created_date, edited_date, ..., etc]
I treat the servers like Users and the status updates like Tweets. I delete any status_update for a server_id older than the tenth most recent one, just to keep the table from growing to infinity.
I'm starting to run into a few complications, though. I need to display information from the most recent status_update on the index page, I need to sort the servers based on status_update info, and I need to store info from certain status_updates that may be far older than the ten most recent. It also seems like I'm going to need to store information from status_updates in both the server and status_update tables, which would mean hitting the DB multiple times on an insert. Thus, I am looking to refactor.
My requirements:
I only need to display information from the most recent update.
Having the next 9 status_updates helps debug if the system goes offline.
I need to be able to sort based on some info from most recent status_update.
I need the database to remain small (Heroku free).
Ideal performance, i.e. not hitting the database more than once unless necessary.
Non-Complicated DB structure so I can pass it along.
Edit: Additional info => I am looking to ultimately monitor about 150-200 servers (a lot for a hobby dev, but I'm cheap). Each monitoring service posts every five minutes or so unless something goes wrong, so the worst-case scenario has me reaching max capacity every four hours.
I was thinking it would be nice to track when the last time X event happened and what the result was. Tracking that information would have to move to the server model itself, since I'm wiping out old records and would lose it after an hour or so. Though in retrospect, I could just keep that info in memory in the monitoring service and send it up every five minutes, or only when it changes. I could also simply update that information only when it changes, so as to process less on each request. Hm!
Efficiency
All ORMs, including ActiveRecord, are designed and built around certain tradeoffs. It's commonplace for ORMs to use several simple SELECT statements to do what a SQL developer would do with a single SELECT statement. You're probably not overwhelming Heroku with your queries.
There's no reasonable structural solution to this problem.
Size
Your "status_update" table should be able to hold an enormous number of rows. Heroku's hobby-dev plan allows 10,000 rows. How many servers do you seriously expect to monitor on a free plan? If I were you, I would delete old rows from it no more than once a day, or when I got a permission error. (On Heroku, certain permission errors mean you're over the row limit.)
It also seems like I'm going to start needing to store information from status_updates in both the server and status_update, which would cause hitting the DB multiple times on an insert.
This really makes little sense. Tweets don't require updates to the user account; status updates shouldn't require updates to the server record. This might suggest refactoring is in order, but I'd want to see either your models or your CREATE TABLE statements to be sure. (You can paste those into your question and leave a comment here.)
Alternatives
I'd seriously consider running this Rails app on a local machine, writing data to a database on the local machine, especially if you intend to target 200 web servers. This would eliminate all Heroku row limits, and you don't really need to run it 24 hours a day if this is just a hobby. If you're doing this professionally, your income from it should easily cover the cost of a hobby-basic plan on Heroku. (Currently $9.00/month.) But even then I'd think hard about hosting this locally.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would happen only once a day, but more than one push might be needed to communicate (infrequent) price corrections. Even in the price-correction scenario, all data would be delivered in the file. What are the problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way (apart from the fact that CSV might not be the best way to do so).
There is nothing stopping you from making these requests from the web server to the database server. With as little data as this you will not find performance an issue, and even if it became one as everything grows, there is a lot to be gained on the database side (indexes, etc.) that will let you carry on this way for the next 100 years.
The number of requests from your users (also extremely small) does not need any special treatment either, so again, a direct query would be best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day the chance of a 1-in-10,000 failure might seem to favour the once-a-day approach. Still, these issues should not come up very often, and when they do you should be able to handle them (retry the request, show a message to the user). This is what enormous numbers of websites do, so trust me when I say it will not be an issue. Also, think about what it would mean if your daily update failed: that would be the bigger problem!
Performance issues
As said, given the amount of data and the number of requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (not CSV) on the database server, use a caching system on the web server, or fix your indexes to keep performance from becoming a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it certainly could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise;
your second database (even on your web server) gets all updates from the master and is read-only: you can only query it for data;
your web server cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now you never have a moment where you have to update the queried database yourself (the master-slave replication keeps it up to date), and there is no chance that queries from the web server put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If that is the only input, just ensure the field really is a date, and then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue, but a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
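For illustration, validating the dates and using query placeholders looks roughly like this (a sketch only; I'm assuming a Python backend and inventing the table and column names, your stack will differ):

# Sketch: validate the user-supplied dates and pass them as parameters, so
# input like "2024-01-01'; DROP TABLE ..." never reaches the SQL as code.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_prices (product TEXT, price_date TEXT, price REAL)")
conn.executemany("INSERT INTO daily_prices VALUES (?, ?, ?)",
                 [("Widget", "2024-01-02", 9.99), ("Widget", "2024-01-03", 10.25)])

def parse_date(value):
    # Raises ValueError for anything that is not a plain YYYY-MM-DD date.
    return datetime.strptime(value, "%Y-%m-%d").date().isoformat()

def prices_between(product, start, end):
    start, end = parse_date(start), parse_date(end)
    return conn.execute(
        "SELECT price_date, price FROM daily_prices "
        "WHERE product = ? AND price_date BETWEEN ? AND ? ORDER BY price_date",
        (product, start, end),
    ).fetchall()

print(prices_between("Widget", "2024-01-01", "2024-01-31"))
# parse_date("1; DROP TABLE daily_prices") raises ValueError before any query runs.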
As for the CSV, I would say querying the database per user request, especially for a feature only used ~10 times weekly, would be much more efficient. The CSV is overkill: with only ~10 users trying to get some information, exporting an updated CSV every day is too much work for so little payoff.
EDIT:
Also, if an attack is a big concern (which really depends on the nature of the business, the data being stored, and the visitors you receive), you could always keep a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that even with the best security an attack could happen. That mainly depends on whether attackers want the information you have.

Fetching few sorted records from huge table

Let's pretend I have a huge website and a huge table (with some millions of entries) with a few columns ("ID", "AuthorID", "Message", "Time", for example) to hold twitter-like messages.
I want to execute the following simple query:
SELECT * FROM HugeTable ORDER BY Time DESC LIMIT 1,10;
This query will be executed many times (tens per second). How do I make sure that it is very fast?
I thought memcached could be a solution, but new posts are added very quickly, and with memcached I would be serving "old" messages to users.
Assume that I have only one MySQL server, and that it is good enough to handle all the traffic.
My problem is that the server has to take the whole table, sort it (huge bottleneck here), and then return only the first 10. So, what is the best optimisation I could do? Partitioning, maybe? Also, newer posts are appended at the bottom of the table, so it's safe to assume that a new post will have "ID" and "Time" >= those of any previous post.
Thanks in advance.
P.S.: I'm not a MySQL expert (though I know the basics), and I have no clue about NoSQL approaches. If you believe NoSQL is the way to handle my task, I'm open to learning something new :)
As you surmise, caching is the way to go. Either by creating a parallel table containing just the ten records you want (each time you do an insert, you remove the oldest one), or by doing the same thing further up the stack, in memory. It's all about how you manage what's in the cache.
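A minimal sketch of the parallel-table idea (the table names are made up, and the same maintenance could be done with triggers or a queue instead of application code):

# Sketch: keep a tiny "latest messages" table next to the huge one, so the
# front page reads 10 rows from a 10-row table instead of sorting millions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HugeTable      (ID INTEGER PRIMARY KEY, AuthorID INT, Message TEXT, Time INT);
    CREATE TABLE LatestMessages (ID INTEGER PRIMARY KEY, AuthorID INT, Message TEXT, Time INT);
""")

def add_message(author_id, message, time_):
    with conn:  # both tables are updated in one transaction
        cur = conn.execute(
            "INSERT INTO HugeTable (AuthorID, Message, Time) VALUES (?, ?, ?)",
            (author_id, message, time_))
        conn.execute("INSERT INTO LatestMessages VALUES (?, ?, ?, ?)",
                     (cur.lastrowid, author_id, message, time_))
        # Evict everything beyond the newest 10 so the cache table stays tiny.
        conn.execute("""
            DELETE FROM LatestMessages
            WHERE ID NOT IN (SELECT ID FROM LatestMessages ORDER BY Time DESC LIMIT 10)
        """)

for i in range(25):
    add_message(author_id=1, message=f"post {i}", time_=i)

print(conn.execute("SELECT Message FROM LatestMessages ORDER BY Time DESC").fetchall())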
Not an answer to your question, but an answer to your problem: I wouldn't use a query at all; I would use a websocket solution to push posts to clients as they arrive. Connected clients would always receive the latest posts the moment they arrive, and a websocket solution should have less overhead.

Web Leaderboard

I'm implementing a leaderboard in my Django web app and don't know the best way to do it. Currently, I'm just using SQL to order my users and building the leaderboard from that; however, this creates two main problems:
Performance is shocking. I've only tried scaling it to a few hundred users, but I can tell that calculating rankings is slow, and heavy caching is annoying since I need users to see their ranking right after they are added to the leaderboard.
It's near-impossible to tell users what position they are in without performing the whole leaderboard calculation again.
I haven't deployed yet, but I estimate about 5% of requests updating the leaderboard vs 95% reading it (probably more, actually). So my latest idea is to recalculate the leaderboard each time a user is added, storing a position field I can easily sort by, with no need to recalculate just to display a user's ranking.
However, could this be a problem if multiple users are committing at the same time? Will locking be enough, or will the rankings get messed up? Additionally, I plan to put this on a separate database used solely for these leaderboards; which one is best? I hear good things about Redis...
Any better ways to solve this problem? (Does anyone know how SO builds its leaderboards?)
I've written a number of leaderboard libraries that could help you out here. The one of immediate use is python-leaderboard, which is based on the reference-implementation leaderboard Ruby gem. Using Redis sorted sets, your leaderboard will be ranked in real time, and there is a specific section on the leaderboard page about performance metrics for inserting a large number of members at once. You can expect to rank one million members in around 30 seconds if you're pipelining writes.
If you're worried about the data changing too often in real time, you could operate Redis in a master-slave configuration and have the leaderboards pull data from the slave, which would only poll the master periodically.
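If you'd rather see the moving parts, the raw Redis commands that such a library builds on look roughly like this (a sketch using the redis-py client; it assumes a Redis server on localhost, and the key and member names are invented):

# Sketch: a leaderboard as a Redis sorted set; Redis keeps it ordered for you.
import redis

r = redis.Redis()

# Add or update scores; each update is O(log N), no full recalculation needed.
r.zadd("leaderboard", {"alice": 3450, "bob": 2900, "carol": 5100})
r.zincrby("leaderboard", 150, "bob")             # bob just earned 150 points

top10 = r.zrevrange("leaderboard", 0, 9, withscores=True)   # highest first
alice_rank = r.zrevrank("leaderboard", "alice")              # 0-based position

print(top10, alice_rank)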
Hope this helps!
You will appreciate the concept of sorted sets in Redis.
Don't miss the paragraph which describes your problem :D
Make a table that stores user id and user score. Then just pull the leaderboard using
ORDER BY user_score DESC
and join the main table for the user name or whatever else you need.
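Concretely, something like this (a sketch; the table and column names are invented):

# Sketch: leaderboard straight from SQL, ordered by score and joined for names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE scores (user_id INTEGER REFERENCES users(id), user_score INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO scores VALUES (1, 3450), (2, 2900), (3, 5100);
""")

leaderboard = conn.execute("""
    SELECT u.name, s.user_score
    FROM scores AS s
    JOIN users  AS u ON u.id = s.user_id
    ORDER BY s.user_score DESC
    LIMIT 10
""").fetchall()

print(leaderboard)   # [('carol', 5100), ('alice', 3450), ('bob', 2900)]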
Unless the total number of tests is a variable in your equation, the calculation from your ranking system should stay the same for each user, so just update individual entries.

Optimising a query for Top 5% of users

On my website, there is a group of 'power users' who are fantastic and add lots of content to my site.
However, their prolific activity has led to their profile pages slowing down a lot. For the other 95% of users, the SPROC that returns the data is very quick. It's only for this group of power users that the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: OK, I think I have been a bit too vague. To rephrase the question: how can I optimise my site to improve performance for these 5% of users? Given that this SPROC is the same one used for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities at the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. An average user adds about 200 items to my site; these power users add over 10,000. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
Implement paging so that it only fetches 100 at a time or something?
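For example (a sketch only; the item table, page size, and column names are invented):

# Sketch: fetch one page of a power user's items at a time instead of all 10,000.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INT, title TEXT)")
conn.executemany("INSERT INTO items (user_id, title) VALUES (?, ?)",
                 [(1, f"item {n}") for n in range(10_000)])

PAGE_SIZE = 100

def profile_page(user_id, page):
    # OFFSET paging is the simplest form; keyset paging (WHERE id < last_seen)
    # scales better for very deep pages.
    return conn.execute(
        "SELECT id, title FROM items WHERE user_id = ? "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (user_id, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

print(len(profile_page(user_id=1, page=0)))   # 100 rows per request, not 10,000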
Well, you can't optimise a query for a specific result set and leave the query unchanged for everyone else, if you know what I mean. I'm guessing there's only one query to change, so you will be optimising it for every type of user. This optimisation scenario is therefore no different from any other: figure out what the problem is. Is too much data being returned? Are calculations taking too long because of the amount of data? Where exactly is the slowdown? Those are the questions you need to ask yourself.
However, I see you talking about profile pages being slow. Since you believe the query that returns that information is already optimised (because it works for 95% of users), you might consider some form of caching of the profile page content. In general, profile pages do not have to show real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example: you could work with a summary (temp) table. Your 'profile query' returns information from that table, information that is already calculated. Because that query is simple, it won't take much time to execute. Meanwhile, you make sure the table periodically gets refreshed.
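A bare-bones sketch of that idea (all names invented; the refresh would be triggered by a scheduled job, not called inline like this):

# Sketch: a precomputed summary table that the profile page reads; a scheduled
# job rebuilds it periodically so the per-request query stays trivial.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items           (id INTEGER PRIMARY KEY, user_id INT, title TEXT);
    CREATE TABLE profile_summary (user_id INTEGER PRIMARY KEY, item_count INT);
""")

def refresh_profile_summaries():
    with conn:
        conn.execute("DELETE FROM profile_summary")
        conn.execute("""
            INSERT INTO profile_summary (user_id, item_count)
            SELECT user_id, COUNT(*) FROM items GROUP BY user_id
        """)

conn.executemany("INSERT INTO items (user_id, title) VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c")])
refresh_profile_summaries()
print(conn.execute("SELECT * FROM profile_summary").fetchall())  # [(1, 2), (2, 1)]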
Just a couple of ideas. I hope they're useful to you.
Edit:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
An obvious help here would be to limit the number of results inside the query, or to apply a form of pagination (in the DAL, not the UI/BLL!).
You could also limit the profile display to only the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items; at that point, they would expect a slower response.
Partition or separate out the data for those users, so that the tables in question are used only by them.
In a clustered environment I believe SQL Server recognises this and spreads the load to compensate; in a single-server environment, however, I'm not entirely sure how it handles the optimisation.
So essentially (greatly simplified, of course)...
If you have a table called "Articles", have two tables: "Articles" and "Top5PercentArticles".
Because the data is now separated into two smaller subsets, the indexes are smaller and the read and write load on any single table in the database will drop.
It's not ideal from a business-layer point of view, as you would then need some way to track which data is stored in which table, but that's a completely separate problem.
Failing that, your only option beyond tuning execution plans is to scale up your server platform.