Efficient way to update all rankings of a table? - sql

Hope this question makes sense. Suppose we have a table of the top 1,000,000 most visited websites on the internet. The table would look something like this.
Name            Address            Visits        Ranking
Example Site    example.com        1000000000    1
Stack Overflow  stackoverflow.com  900000000     2
...             ...                ...           ...
Small Site      smallsite.com      100           999999
Tiny Site       tinysite.com       1             1000000
Each week we check how many visits each site had and update its ranking accordingly. If a site blew up in the past week and went from, say, rank 900,000 to rank 1,000, what is the most efficient way to update all the rows affected in the table? Every site it passed would need its ranking number increased by 1.
Is there a concept in SQL that would let us make large batch changes like that without locking the table?
Obviously this doesn't take into consideration all the other ranking changes, but I wanted to keep this as simple as possible.
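One hedged approach, rather than shifting each affected row by one, is to recompute every ranking from the visit counts in a single statement using a window function. This is only a sketch: it assumes a table named sites with the columns shown above and MySQL 8+ / MariaDB 10.2+ syntax; other engines would use their own UPDATE ... FROM form.
-- Rebuild all rankings from the current visit counts in one pass.
UPDATE sites AS s
JOIN (
    SELECT address,
           ROW_NUMBER() OVER (ORDER BY visits DESC, address) AS new_rank  -- address breaks ties deterministically
    FROM sites
) AS ranked ON ranked.address = s.address
SET s.ranking = ranked.new_rank;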

Related

Best Practice for SQL Replication / Load Management

I'm currently running an Ubuntu server with MariaDB on it. It serves all SQL requests for a website with a good amount of traffic.
A few times a day we import large CSV files into the database to update our data. The problem is that those CSV imports hammer the database (an import takes around 15 minutes).
The import seems to use only one core out of four, but the website (or rather its SQL requests during that time) still gets ridiculously slow. So my question is: what can I do here so the website is affected as little as possible, ideally not at all?
I was considering database replication to a different server, but I expect that to use the same amount of resources during the import, so no real benefit there, I guess?
The other thing I considered is to run two SQL database servers and switch all requests to the other server during an import. I would basically do each import twice: once on server 1 (while server 2 serves the site), and once that's done, switch the website to server 1 and run the import on server 2. While that would work, it seems like quite a lot of effort for an imperfect solution (for example, how are requests handled during the switch from server 1 to server 2, and so on).
So what solutions exist here, preferably somewhat affordable ones?
All ideas and hints are welcome.
Thanks in advance
Best Regards
Menax
Is the import replacing an entire table? If so, load it into a separate table, then swap it into place. That gives essentially zero downtime; the only pause is during the RENAME TABLE. For details, see http://mysql.rjweb.org/doc.php/deletebig or possibly http://mysql.rjweb.org/doc.php/staging_table
If the import is doing something else, please provide details.
One connection uses one core, no more.
More (from Comments)
SELECT id, marchants_id
FROM products
WHERE links LIKE '%https://www.merchant.com/productsite_5'
LIMIT 1;
That is hard to optimize because of the leading wildcard in the LIKE. Is that really what you need? As it stands, that query must scan the table.
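If what you actually need is "links that start with this URL", dropping the leading wildcard lets an index on links be used. A hedged sketch, assuming MySQL/MariaDB:
SELECT id, marchants_id
FROM products
WHERE links LIKE 'https://www.merchant.com/productsite\_5%'  -- anchored prefix; \_ because a bare _ matches any single character
LIMIT 1;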
SELECT id, price
FROM price_history
WHERE product_id = 5
ORDER BY id DESC
LIMIT 1;
That would benefit from INDEX(product_id, id, price) -- in that order. With that index, the query will be as close to instantaneous as possible.
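A sketch of creating that index in MySQL/MariaDB syntax; the index name is an arbitrary choice:
ALTER TABLE price_history
    ADD INDEX idx_product_latest_price (product_id, id, price);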
Please provide the rest of the transaction with the Update and Insert, plus SHOW CREATE TABLE. There is quite possibly a way to "batch" the actions rather than doing one product price at a time. This may speed it up 10-fold.
Flip-flopping between two servers -- only if the data is readonly. If you are otherwise modifying the tables, that would be a big nightmare.
For completely replacing a table:
CREATE a new TABLE
Populate it
RENAME TABLE to swap the new table into place
(But I still don't understand your processing well enough to say whether this is best. When you say "switch the live database", are you referring to a server (computer), a DATABASE (schema), or just one TABLE?)
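A sketch of those three steps in MySQL/MariaDB syntax, assuming the CSV fully replaces a single table called products; the file path and CSV options are illustrative:
CREATE TABLE products_new LIKE products;

LOAD DATA LOCAL INFILE '/path/to/import.csv'
INTO TABLE products_new
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;      -- skip a header row, if the file has one

-- Atomic swap: readers see either the old table or the new one, never a half-loaded table.
RENAME TABLE products TO products_old, products_new TO products;
DROP TABLE products_old;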

Counting frequently changing values on the database

I am building a social network site for educational purposes and was wondering how I should efficiently count the "tweets" of a user (if, for example, he has 100k entries in the tweets table). Should I do it via SQL COUNT(), or should I add a "number of tweets" field per user that is easy to fetch and gets updated when the user adds or deletes a tweet? Or if there is a better approach, I'd deeply appreciate your input.
How about counting the total number of characters of these "tweets"? Is SUM(CHAR_LENGTH(arg)) efficient, or is caching an updated value better?
Let's say these values (number of tweets and number of characters across all tweets) are always being requested because they are displayed publicly on user profiles; on average, these numbers are read about once per second and their values change once every 30 seconds.
I am just experimenting with algorithms involving large data, so a big thanks if you can help!
One more thing: are PHP and MariaDB a good fit for this kind of workload, or is another stack better?
Use the count queries first. They will most likely be efficient enough given the amount of data (100K entries). It will also be much less work and maintenance. 100K is a very small number when it comes to database size; when you get into the hundreds of millions or billions of rows, then things get a bit interesting.
The main consideration for performance here is the database itself, and yes, MariaDB or MySQL is fine. Which programming language you use is a broader question; PHP will work fine.
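A sketch of the on-demand counts, assuming a hypothetical tweets table with user_id and body columns; at this size, an index on user_id is enough to keep it fast:
-- Number of tweets and total characters for one user.
SELECT COUNT(*)                            AS tweet_count,
       COALESCE(SUM(CHAR_LENGTH(body)), 0) AS total_characters
FROM tweets
WHERE user_id = 123;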

Convert multiple rows into single column

I have a database table, UserRewards, that has 30+ million rows. Each row has a userID and a rewardID (along with other fields).
There is a Users table (with around 4 million unique users) whose primary key is userID, plus other fields.
For performance reasons, I want to move the rewardIDs per user from UserRewards into a concatenated field in Users (a new nvarchar(4000) field called Rewards).
I need a script that can do this as fast as possible.
I have a cursor which joins up the rewards using the script below, but it only processes around 100 users per minute, which would take far too long to get through the roughly 4 million unique users I have.
SET @rewards = ( SELECT REPLACE( (SELECT rewardsId AS [data()] FROM userrewards
                                  WHERE UsersID = @users_Id AND BatchId = @batchId
                                  FOR XML PATH('') ), ' ', ',') )
Any suggestions to optimise this? I am about to try a WHILE loop to see how that works, but any other ideas would be greatly received.
EDIT:
My site does the following:
We have around 4 million users who have been pre-assigned 5-10 "awards". This relationship is stored in the UserRewards table.
A user comes to the site, we identify them, and lookup in the database the rewards assigned to them.
Issue is, the site is very popular, so I have a large number of people hitting the site at the same time requesting their data. The above will reduce my joins, but I understand this may not be the best solution. My database server goes up to 100% CPU usage within 10 seconds of me turning the site on, so most people's requests time out (they are shown an error page), or they get results, but not in a satisfactory time.
Is anyone able to suggest a better solution to my issue?
There are several reasons why I think the approach you are attempting is a bad idea. First, how are you going to maintain the comma delimited list in the users table? It is possible that the rewards are loaded in batch, say at night, so this isn't really a problem now. Even so, one day you might want to assign the rewards more frequently.
Second, what happens when you want to delete a reward or change the name of one of them? Instead of updating one table, you need to update the information in two different places.
If you have 4 million users, with thousands of concurrent accesses, then small inconsistencies due to timing will be noticeable and may generate user complaints. A call from the CEO on why complaints are increasing is probably not something you want to deal with.
An alternative is to build an index on UserRewards(UserId, BatchId, RewardsId). Presumably, each field is a few bytes, so 30 million records should easily fit into 8 GB of memory (be sure that SQL Server is allocated almost all the memory!). The query that you want can be satisfied strictly by this index, without having to bring the UserRewards table into memory. So only the index needs to be cached, and it will be optimized for this query.
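A sketch of that index and the lookup it covers, in SQL Server syntax; the index name is arbitrary, and the NOLOCK hint relates to the locking point below:
CREATE INDEX IX_UserRewards_User_Batch
    ON UserRewards (UserId, BatchId, RewardsId);

-- Satisfied entirely from the index, so the base table never needs to be touched.
SELECT RewardsId
FROM UserRewards WITH (NOLOCK)
WHERE UserId = @UserId AND BatchId = @BatchId;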
One thing that might be slowing everything down is the frequency of assigning rewards. If these are being assigned at even 10% of the read rate, you could have the inserts/updates blocking the reads. You want to run the read queries with the NOLOCK hint to avoid this problem. You also want to be sure that locking occurs at the record or page level, to avoid conflicts with the reads.
Maybe too late, but using uniqueidentifiers as keys will not only quadruple your storage space (compared to using ints as keys), but slow your queries by orders of magnitude. AVOID!!!

How to best store and aggregate daily, weekly, monthly visits for quick retrieval?

I am using SQL Server 2008 and ColdFusion 9.
I need to log visits to my web site. This will be for users who are logged in. I need to be able to retrieve how many times they have logged in this week, this month, and this year, as well as how many consecutive days, very much like StackExchange does it. I want to be able to show a calendar for any month and display the days on which the visitor visited.
I am not sure of the best way to store this data or retrieve it. My initial thought is to create a daily or weekly table that records every hit by every user. I would store the UserID and timestamp like this.
TABLE_VISITS_LAST_SEVEN_DAYS
UserID  VisitDateTime
101     2012-10-06 01:23:00
101     2012-10-06 01:24:00
101     2012-10-07 01:25:00
102     2012-10-07 01:23:00
102     2012-10-07 01:24:00
102     2012-10-07 01:25:00
At the end of each day, I would determine who visited the site and aggregate the visits, essentially removing duplicate info. So I would delete the data above and insert it into a table that stores only this:
TABLE_VISITS_ALL_TIME
UserID  VisitDate
101     2012-10-06
101     2012-10-07
102     2012-10-07
This data would be easy to query and wouldn't store any unnecessary data. I'd have all of the data that I need to determine how frequently the user visits my site with not much effort.
Is this a good plan? Is there an easier or better way? Does my plan have a gaping hole in it? Ideas would be appreciated.
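A hedged sketch of that nightly step, assuming the two tables above and SQL Server syntax, run shortly after midnight; the NOT EXISTS check simply guards against aggregating the same day twice:
INSERT INTO TABLE_VISITS_ALL_TIME (UserID, VisitDate)
SELECT DISTINCT v.UserID, CAST(v.VisitDateTime AS date)
FROM TABLE_VISITS_LAST_SEVEN_DAYS v
WHERE NOT EXISTS (SELECT 1 FROM TABLE_VISITS_ALL_TIME a
                  WHERE a.UserID = v.UserID
                    AND a.VisitDate = CAST(v.VisitDateTime AS date));

-- Purge the detail rows that have been rolled up, keeping today's.
DELETE FROM TABLE_VISITS_LAST_SEVEN_DAYS
WHERE VisitDateTime < CAST(GETDATE() AS date);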
You could change the VisitDateTime column declaration in TABLE_VISITS_LAST_SEVEN_DAYS to VisitDate as Date, and then log each visit in a manner like this:
INSERT INTO TABLE_VISITS_LAST_SEVEN_DAYS (UserID, VisitDate)
SELECT @UserID, @VisitDate
WHERE NOT EXISTS (
    SELECT 1 FROM TABLE_VISITS_LAST_SEVEN_DAYS (NOLOCK)
    WHERE UserID = @UserID AND VisitDate = @VisitDate
)
(@VisitDate is a date-typed variable)
I don't understand the need for the two tables. The second one is simply a de-duplicated version of the first; any aggregate queries you do will still have to do the same index scans, just on a slightly smaller table.
Personally I think it would make more sense if you created your first table but put a unique index on UserID and the yyyy-mm-dd part of VisitDateTime (though VisitDate might now be a more appropriate name). If you get a duplicate entry, catch the exception and ignore it.
Then your first table becomes your second by definition and you don't need to do any extra work in the background.
The major problem with this method is that if you ever wanted to count the number of times someone logged on in a single day, you couldn't.
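A sketch of that unique-index variant, in SQL Server syntax; the table and constraint names are illustrative, and the application catches and ignores the duplicate-key error on repeat visits within the same day:
CREATE TABLE UserVisits (
    UserID    int  NOT NULL,
    VisitDate date NOT NULL,
    CONSTRAINT PK_UserVisits PRIMARY KEY (UserID, VisitDate)
);

-- Run on every hit; a second visit on the same day violates the primary key and is ignored.
INSERT INTO UserVisits (UserID, VisitDate)
VALUES (@UserID, CAST(GETDATE() AS date));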
Why not just store each visit and if you need daily/weekly/whatever statistics create a query that aggregates as needed? It all depends on how many visits you're expecting and what time period you want to retain statistics for.
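For example, a hedged sketch of aggregating on demand, assuming a single hypothetical Visits(UserID, VisitDateTime) table indexed on (UserID, VisitDateTime):
-- Visits per day for one user over the last 30 days.
SELECT CAST(VisitDateTime AS date) AS VisitDate,
       COUNT(*)                    AS Visits
FROM Visits
WHERE UserID = @UserID
  AND VisitDateTime >= DATEADD(day, -30, GETDATE())
GROUP BY CAST(VisitDateTime AS date)
ORDER BY VisitDate;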
Edit:
It sounds like you're suggesting that designing it poorly is just fine as
long as I've got a fast server. Is that right?
That's not what I'm saying at all. Your first solution is not a poor solution. Your second solution is not "better". If anything, it is somewhat denormalized.
There is no "best way" to do what you've described. There are multiple possible solutions, some of which may be adequate for your needs and some of which may not.
If you are interested in statistics, like how often individual users visit your site and how many times a day and when, your first table tells you that. This comes with some additional overhead when doing aggregation.
If all you will ever care about is whether a user visited your site on a given day, why not store just that information? Insert one row on a user's first visit that day and don't do that again until tomorrow.
Whether or not the additional overhead of recording one row per visit is too much will depend on your exact application. A small site that gets a few thousand hits per month is not the same thing as a massive site like Amazon.
Furthermore, there are multiple ways to do even the first solution (how the indexes are set up, etc.). Why not just do it and see if it works? Create a table, insert what you think will be a typical amount of data, and give it a try. If it's not performant enough, then worry about aggregating tables, nightly jobs, and such.
... premature optimization is the root of all evil. -- Donald Knuth

How much is performance improved when using LIMIT in a SQL statement?

Let's suppose I have a table in my database with 1,000,000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if I have that table with 1000 records and just do:
SELECT * FROM [Table]
?
I'm not asking whether it will take exactly the same time; I just want to know whether the first one will take much more time to execute than the second one.
I said 1,000,000 records, but it could be 20,000,000. That was just an example.
Edit:
Of course, when comparing a query with LIMIT to one without it on the same table, the one using LIMIT should execute faster, but I'm not asking that...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5,000,000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM, and LINQ to SQL to make the queries.
I need to perform basically three kinds of non-simple queries, and I was thinking about showing the user a limited number of records (just like lots of websites do). If the user wants to see more records, the option he/she has is to narrow the search further.
So the question came up because I was deciding between doing this (limiting each query to X records) and storing only X results in the database (the most recent ones), which would require doing some deletions in the database, but I was just thinking...
So that table could contain 5,000,000 records or more, and what I don't want is to show the user only 1,000 or so and still have the query be as slow as if it were returning all 5,000,000 rows.
TAKE 1000 from a table of 1,000,000 records will be roughly 1,000,000/1,000 (= 1,000) times faster, because it only needs to look at (and return) 1,000 of the 1,000,000 records. Since it does less work, it is naturally faster.
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of the two cases below becomes true:
- The ORDER BY clause follows an index: the above statement is still true.
- The ORDER BY clause cannot use any index: it will be only marginally faster than without the TAKE, because it has to inspect ALL records and sort them by the ORDER BY, then deliver only a subset (the TAKE count). So it is not faster in the first step, but the second step involves less I/O and network traffic than returning ALL records.
If you TAKE 1000 records from a table of 1000 records, it will be roughly equivalent (with few significant differences) to taking 1000 records from a table of 1 billion, as long as you are in the case of (1) no ORDER BY, or (2) ORDER BY against an index.
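A sketch of the two cases, assuming MySQL/MariaDB syntax and a hypothetical sites table with an index on visits:
-- ORDER BY follows INDEX(visits): the engine reads the first 1000 index entries and stops.
SELECT * FROM sites ORDER BY visits DESC LIMIT 1000;

-- ORDER BY on an unindexed expression: every row must be read and sorted before the LIMIT applies.
SELECT * FROM sites ORDER BY LOWER(name) LIMIT 1000;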
Assuming both tables are equivalent in terms of index, row-sizing and other structures. Also assuming that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run at similar speed, since the query engine will be going through the records in exactly the same order -- basically a table scan -- for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then the speed is also similar.
However, if Y >> X (meaning Y has many, many more rows than X), then the LIMIT version MAY be slower. Not because of the query plan -- again, it should be the same -- but simply because the internal structure of the data layout may have several more levels. For example, if data is stored as leaves on a tree, there may be more tree levels, so it may take slightly more time to access the same number of pages.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say. 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with more than a few hundred rows; transporting them would take a long time, and they couldn't scroll through them in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarily for the exact purpose of creating chunks of a list for paging, as you describe.
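A hedged sketch of that kind of paging, with illustrative table and column names; note that a large OFFSET still makes the server walk past all the skipped rows, so very deep pages get progressively slower:
-- Page 3 at 50 rows per page: rows 101-150.
SELECT id, name
FROM results
ORDER BY id
LIMIT 50 OFFSET 100;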
It's way counterproductive to start thinking about schema design and record purging until you've used up this and a bunch of other strategies. In this case don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.