Counting frequently changing values on the database - sql

I am building a social network site for educational purposes and was wondering how I should efficiently count a user's "tweets" (if, for example, he has 100k entries in the tweets table). Should I do it via SQL COUNT(), or should I add a "no. of tweets" field per user that is easy to fetch and is updated whenever the user adds/deletes a tweet? If there is a better approach, I'd deeply appreciate your input.
How about counting the total no. of characters across these "tweets"? Is SUM(CHAR_LENGTH(arg)) efficient, or is caching an updated value better?
Let's say these values (no. of tweets and no. of characters across all tweets) are constantly being requested because they are displayed publicly on users' profiles; on average, these numbers are requested once per second and their values change once every 30 seconds.
I am just experimenting on algorithms involving large data, so a big thanks if you can help!
Another thing: are PHP & MariaDB a good fit for this kind of processing, or is another stack better?

Use the count queries first. They will most likely be efficient enough given the amount of data (100K entries), and they are much less work and maintenance. 100K is a very small number when it comes to database size; when you get to hundreds of millions or billions of rows or more, then things get a bit interesting.
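For reference, a minimal sketch of what those on-demand queries could look like, assuming a hypothetical tweets table with user_id and body columns (names invented) and an index on user_id so MariaDB never scans the whole table:

-- hypothetical schema: tweets(id, user_id, body, created_at)
CREATE INDEX idx_tweets_user_id ON tweets (user_id);

-- number of tweets for one user (user id 42 is just a placeholder)
SELECT COUNT(*) AS tweet_count
FROM tweets
WHERE user_id = 42;

-- total characters across that user's tweets
SELECT COALESCE(SUM(CHAR_LENGTH(body)), 0) AS total_chars
FROM tweets
WHERE user_id = 42;

Note that SUM(CHAR_LENGTH(body)) still has to read every matching row's body, so it is heavier than COUNT(*), but at 100K rows per user both should comfortably handle a once-per-second read rate; if they ever don't, that is the point to add a cached counter.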
The main consideration for performance is the database itself, and yes, MariaDB or MySQL is fine. What you use for the programming language is a larger question, but PHP will work fine.

Related

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop and we want to keep a record of how many views these products had over a 1-year period, recording the views at least every 2 hours. The question is what structure to use for this task.
Right now we have tried keeping stats for the last 30 days in records that have 2 columns, classified_id and stats, where stats is a stripped-down JSON with the format date:views,date:views... For example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is to have a massive table with 3 columns, classified_id, date and views, and store each recording in its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7, every 2 hours, for one year, we need
1,800,000 * 365 * 12 = 7,884,000,000 (billions with a B) rows, which, while way inside the theoretical limit of Postgres, makes me imagine that queries on it (say, for updating the views), even with the correct indices, will take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current work we store metrics data for websites and the total number of rows we have is much higher. In a previous job I worked with a Postgres database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
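As a rough illustration (not from the original answer), day-based partitioning in PostgreSQL's declarative syntax (version 10+) could look like this, with invented table and column names:

-- hypothetical raw views table, partitioned by day
CREATE TABLE product_views (
    classified_id bigint    NOT NULL,
    viewed_at     timestamp NOT NULL,
    views         integer   NOT NULL
) PARTITION BY RANGE (viewed_at);

-- one partition per day, normally created ahead of time by a scheduled job
CREATE TABLE product_views_2016_05_12
    PARTITION OF product_views
    FOR VALUES FROM ('2016-05-12') TO ('2016-05-13');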
Another question is how quickly query responses will be needed, and which granularities (sums over hours/days/weeks etc.) you will allow users to query. You may even need to make some pre-aggregations for granularities like week, month or quarter.
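A pre-aggregation for the coarser granularities might then be nothing more than a scheduled INSERT ... SELECT into a small rollup table (again only a sketch against the hypothetical product_views table above):

-- daily totals, refreshed once a day for the previous day
CREATE TABLE product_views_daily (
    classified_id bigint NOT NULL,
    day           date   NOT NULL,
    views         bigint NOT NULL,
    PRIMARY KEY (classified_id, day)
);

INSERT INTO product_views_daily (classified_id, day, views)
SELECT classified_id, viewed_at::date, SUM(views)
FROM   product_views
WHERE  viewed_at >= CURRENT_DATE - 1
  AND  viewed_at <  CURRENT_DATE
GROUP BY classified_id, viewed_at::date;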
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23,000 records per second using bulk inserts with the COPY command; every bulk load was several thousand records. Raw data were partitioned by minute. To avoid disk waits the DB had 4 tablespaces on 4 different disks/arrays and partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array - not just a different filesystem; it must be separate hardware. SSDs I can recommend only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, then, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables, but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
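For completeness, those two storage tweaks look roughly like this in PostgreSQL (invented table; remember that an UNLOGGED table is emptied after a crash, so it only suits data you can afford to lose or re-collect):

-- append-only hit log: no write-ahead logging, no space reserved for updates
CREATE UNLOGGED TABLE raw_hits (
    classified_id bigint    NOT NULL,
    hit_at        timestamp NOT NULL
) WITH (fillfactor = 100);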
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly using SQL.
Since it queries plain text files, you may be able to just send it unfiltered web logs and query them through JDBC.

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then the smaller column set would be extremely beneficial performance-wise, since the index can be read without reading any rows themselves (a sketch follows after this answer).
Size of columns. Basically, compare how big the entire row is vs. the size of only the columns in the smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
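A small sketch of the covering-index case mentioned above (SQL Server-flavoured, with invented table and column names): the list page only needs a few small columns, so an index that contains exactly those columns can answer the query without ever touching the rows that hold the full blog text.

-- index ordered by date, carrying the small columns the list page needs
CREATE INDEX ix_posts_recent
    ON posts (published_at DESC)
    INCLUDE (id, title, excerpt);

-- served entirely from the index above; the blog body is never read
SELECT TOP (10) id, title, excerpt, published_at
FROM posts
ORDER BY published_at DESC;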
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.

How to reuse results with a schema for end of day stock-data

I am creating a database schema to be used for technical analysis like top-volume gainers, top-price gainers etc. I have checked answers to questions here, like the design question. Having taken the hint from boe100's answer there, I have a schema modeled pretty much on it, thusly:
Symbol - char 6 //primary
Date - date //primary
Open - decimal 18, 4
High - decimal 18, 4
Low - decimal 18, 4
Close - decimal 18, 4
Volume - int
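Spelled out as DDL, that schema might look like the following (a sketch only; MySQL-flavoured since MySQL is mentioned below, with the price columns renamed open_price etc. to avoid keyword clashes):

CREATE TABLE eod_quotes (
    symbol      CHAR(6)       NOT NULL,
    trade_date  DATE          NOT NULL,
    open_price  DECIMAL(18,4),
    high_price  DECIMAL(18,4),
    low_price   DECIMAL(18,4),
    close_price DECIMAL(18,4),
    volume      INT,
    PRIMARY KEY (symbol, trade_date)
);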
Right now this table containing End Of Day (EOD) data will be about 3 million rows for 3 years. Later, when I get/need more data, it could be 20 million rows.
The front end will be asking requests like "give me the top price gainers on date X over Y days". That request is one of the simpler ones, and as such is not too costly, time wise, I assume.
But a request like "give me top volume gainers for the last 10 days, with the previous 100 days acting as a baseline" could prove 10-100 times costlier. The result of such a request would be a float which signifies how many times the volume has grown, etc.
One option I have is adding a column for each such result. If the user asks for volume gain over 10 days vs. 20 days, that would require another column. The total number of such columns could easily cross 100, especially if I start adding other results as columns, like MACD-10 and MACD-100, each of which will require its own column.
Is this a feasible solution?
Another option is to keep the results in cached HTML files and present them to the user. I don't have much experience in web development, so to me it looks messy; but I could be wrong (ofc!). Is that an option too?
Let me add that I am/will be using mod_perl to present the response to the user, with much of the work on the MySQL database being done using Perl. I would like to have a response time of 1-2 seconds.
You should keep your data normalised as much as possible, and let the RDBMS do its work: efficiently performing queries based on the normalised data.
Don't second-guess what will or will not be efficient; instead, only optimise in response to specific, measured inefficiencies as reported by the RDBMS's query explainer.
Valid tools for optimisation include, in rough order of preference:
Normalising the data further, to allow the RDBMS to decide for itself how best to answer the query.
Refactoring the specific query to remove the inefficiencies reported by the query explainer. This will give good feedback on how the application might be made more efficient, or might lead to a better normalisation of relations as above.
Creating indexes on attributes that turn out, in practice, to be used in a great many transactions. This can be quite effective, but it is a trade-off of slowdown on most write operations as indexes are maintained, to gain speed in some specific read operations when the indexes are used.
Creating supplementary tables to hold intermediary pre-computed results for use in future queries. This is rarely a good idea, not least because it totally breaks the DRY principle; you now have to come up with a strategy of keeping duplicate information (the original data and the derived data) in sync, when the RDBMS will do its job best when there is no duplicated data.
None of those involve messing around inside the tables that store the primary data.
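To make that concrete: a figure like "volume gain over the last 10 days against a 100-day baseline" can be computed on demand from the normalised table alone, with nothing precomputed. A rough MySQL-style sketch, assuming the hypothetical eod_quotes DDL sketched under the question and a reference date held in @as_of:

SET @as_of = '2010-06-30';  -- placeholder reference date

SELECT symbol,
       AVG(CASE WHEN trade_date >  DATE_SUB(@as_of, INTERVAL 10 DAY) THEN volume END)
     / AVG(CASE WHEN trade_date <= DATE_SUB(@as_of, INTERVAL 10 DAY) THEN volume END)
         AS volume_gain
FROM   eod_quotes
WHERE  trade_date >  DATE_SUB(@as_of, INTERVAL 110 DAY)
  AND  trade_date <= @as_of
GROUP BY symbol
ORDER BY volume_gain DESC
LIMIT 20;

Whether this needs an index on trade_date at 20 million rows is exactly the kind of question the query explainer answers.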

C# with SQL database schema question

Is it acceptable to dynamically generate the total of the contents of a field using up to 10k records instead of storing the total in a table?
I have some reasons to prefer on-demand generation of a total, but how bad is the performance price on an average home PC? (There would be some joins, ORM-managed, involved in figuring the total.)
Let me know if I'm leaving out any info important to deciding the answer.
EDIT: This is a stand-alone program on a user's PC.
If you have appropriate indexing in place, it won't be too bad to do on demand calculations. The reason that I mention indexing is that you haven't specified whether the total is on all the values in a column, or on a subset - if it's a subset, then the fields that make up the filter may need to be indexed, so as to avoid table scans.
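As an illustration of that point (invented order_items table and column names): if the total is over a filtered subset, indexing the filter column keeps the on-demand calculation cheap even on a modest home PC.

-- hypothetical: total value of one customer's items, computed on demand
CREATE INDEX ix_order_items_customer ON order_items (customer_id);

SELECT COALESCE(SUM(quantity * unit_price), 0) AS order_total
FROM   order_items
WHERE  customer_id = 17;   -- placeholder customer id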
Usually it is totally acceptable and even recommended to recalculate values. If you start storing calculated values, you'll face some overhead ensuring that they are always up to date, usually using triggers.
That said, if your specific calculation query turns out to take a lot of time, you might need to go that route, but only do that if you actually hit a performance problem, not upfront.
Using a SQL query you can quickly and inexpensively compute the total with an aggregate function (SUM for a field total; COUNT(*) if all you need is the number of records).
It is better to generate the total than to keep it as a record, the same way you would keep a person's birth date and derive their age rather than storing the age.
How often, and by how many users, must this total value be fetched, and how often is the data the total depends on updated?
Maybe the only thing you need is to run this big query once a day (or just once), save the result somewhere in the DB, and then update it when the data your total is built from changes.
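That could be as little as a one-row summary table refreshed by a scheduled job or after each change (a sketch reusing the invented order_items table from above):

CREATE TABLE cached_totals (
    name        VARCHAR(50) PRIMARY KEY,
    total       DECIMAL(18,4),
    computed_at TIMESTAMP
);

-- run once a day, or whenever the underlying data changes
DELETE FROM cached_totals WHERE name = 'grand_total';
INSERT INTO cached_totals (name, total, computed_at)
SELECT 'grand_total', SUM(quantity * unit_price), CURRENT_TIMESTAMP
FROM   order_items;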
You "could" calculate the total with SQL (I am assuming you do not want total number of records ... the price total or whatever it is). SQL is quite good at mathematics when it gets told to do so :) No storing of total.
But, as it is all run on the client machine, I think my preference would be to total using C#. Then the business rules for calculating the total are kept out of the DB/SQL. By that I mean, if you had a complex calculation for the total that required adding, say, 5% to orders below £50, and the "business" changed it to add 10% to orders below £50, it is done in your "business logic" code rather than in your storage medium (in this case SQL).
Kindness,
Dan
I think that it should not take long, probably less than a second, to generate a sum from 8000-10000 records. Even on a single PC the query plan for this query should be dominated by a single table scan, which will generate mostly sequential I/O.
Proper indexing should make any joins reasonably efficient unless the schema is deeply flawed and unless you have (say) large blob fields in the table the total data volume for the rows should not be very large at all. If you still have performance issues going through an O/R mapper, consider re-casting the functionality as a report where you can control the SQL.

Web users searching for too much data

We currently have a search on our website that allows users to enter a date range. The page calls a stored procedure that queries for the date range and returns the appropriate data. However, a lot of our tables contain 30m to 60m rows. If a user entered a date range of a year (or some large range), the database would grind to a halt.
Is there any solution that doesn't involve putting a time constraint on the search? Paging is already implemented to show only the first 500 rows, but the database is still getting hit hard. We can't put a hard limit on the number of results returned because the user "may" need all of them.
If the user's input date range is too large, have your application do the search in small date-range steps, possibly using a slow-start approach: the first search is limited to, say, a one-month range, and if it brings back fewer than 500 rows, search the preceding months until you have 500 rows.
You will want to start with most recent dates for descending order and with oldest dates for ascending order.
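In SQL terms every step of that slow start is the same query over a narrower window; the loop lives in the application. A sketch (SQL Server-style, invented names):

-- step 1: only the most recent month of the requested range
SELECT TOP (500) *
FROM   search_data
WHERE  event_date >= @window_start   -- initially one month before @range_end
  AND  event_date <  @range_end
ORDER BY event_date DESC;
-- if fewer than 500 rows come back, the application repeats the query for the
-- preceding month (shifting @window_start and @range_end back) and appends the
-- results, until it has collected 500 rows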
It sounds to me like this is a design and not a technical problem. No one ever needs millions of records of data on the fly.
You're going to have to ask yourself some hard questions: Is there another way of getting people their data than the web? Is there a better way you can ask for filtering? What exactly is it that the users need this information for and is there a way you can provide that level of reporting instead of spewing everything?
Reevaluate what it is that the users want and need.
"We can't put a hard limit on the number of results returned because the user 'may' need all of them."
You seem to be saying that you can't prevent the user from requesting large datasets for business reasons. I can't see any technical way around that.
Index your date field and force a query to use that index:
CREATE INDEX ix_mytable_mydate ON mytable (mydate);

SELECT TOP 100 *
FROM mytable WITH (INDEX (ix_mytable_mydate))  -- index hint forces the range index
WHERE mydate BETWEEN @start AND @end;
It seems that the optimizer chooses FULL TABLE SCAN when it sees the large range.
Could you please post the query you use and execution plan of that query?
Don't know which of these are possible:
Use a search engine rather than a database?
Don't allow very general searches
Cache the results of popular searches
Break the database into shards on separate servers, and combine the results in your application.
Do multiple queries with smaller date ranges internally
It sounds like you really aren't paging. I would have the stored procedure take a range (which you calculated) for the pages and then only get those rows for the current page. Assuming that the data doesn't change frequently, this would reduce the load on the database server.
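With OFFSET/FETCH (SQL Server 2012 and later; invented names), the stored procedure can return just the requested page of the date-range search:

-- page @page_number of the results, 500 rows per page
SELECT *
FROM   search_data
WHERE  event_date BETWEEN @start AND @end
ORDER BY event_date DESC
OFFSET (@page_number - 1) * 500 ROWS
FETCH NEXT 500 ROWS ONLY;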
How is your table data physically structured, i.e. partitioned, split across filegroups and disk storage, etc.?
Are you using table partitioning? If not you should look into using aligned partitioning. You could partition your data by date, say a partition for each year as an example.
Were I to request a query spanning three years, on a multiprocessor system I could access all three partitions concurrently, thereby improving query performance.
How are you implementing the paging?
I remember I faced a problem like this a few years back and the issue was to do with how I implemented the paging. However the data that I was dealing with was not as big as yours.
Parallelize, and put it in RAM (or a cloud). You'll find that once you want to access large amounts of data at the same time, an RDBMS becomes the problem instead of the solution. Nobody doing visualizations uses an RDBMS.