We are looking at restructuring our database. We currently list about 60,000 boats and track views per boat by month; the count is updated on each boat page view. The current table looks like this:
BoatID Year Month Views
1554 2013 2 124
1554 2013 3 1542
We would like to store this information daily in the structure below. Will this put any strain on the database? (In one year we would have a minimum of 60,000 x 365 = 21,900,000 rows.)
BoatID Date Views
1554 01/02/2013 20
1554 02/02/2013 142
About our site - we receive around 6,000,000 to 7,000,000 page views a month. We have a dedicated database server running SQL Server 2008, with two quad-core 2.2 GHz CPUs and 24 GB of RAM.
The new design looks fine.
If I calculated correctly, 6-7 million page views a month averages out to roughly 2-3 requests per second. Even if we double that (traffic is concentrated in waking hours, since people sleep at night), that's still only around 5 requests per second.
I have a strong feeling that the database server you mentioned will be able to handle it.
Some ideas:
check whether your application can cache view hits and flush them to the database in batches, say every 10 views (sketched below)
make sure the table is indexed correctly so that finding and updating a single row is cheap
if traffic is low at night, you can run a job that pre-inserts the new day's row for each boat; inserts into heavily indexed tables can be costly, so doing them off-peak may help. It's just an idea, I haven't tested or used it myself...
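To make the first two ideas concrete, here is a minimal sketch (SQL Server syntax; the table, column, and parameter names are assumptions, not your actual schema) of a daily table keyed on (BoatID, Date) plus an upsert that flushes a batch of cached hits:

-- Sketch only: dbo.BoatDailyViews, @BoatID and @HitCount are assumed names
-- (e.g. parameters of a stored procedure called by the application).
CREATE TABLE dbo.BoatDailyViews (
    BoatID  int  NOT NULL,
    [Date]  date NOT NULL,
    Views   int  NOT NULL,
    CONSTRAINT PK_BoatDailyViews PRIMARY KEY CLUSTERED (BoatID, [Date])
);

-- Add @HitCount cached views for one boat for today,
-- inserting the day's row if it does not exist yet.
MERGE dbo.BoatDailyViews AS target
USING (SELECT @BoatID AS BoatID, CAST(GETDATE() AS date) AS [Date]) AS src
    ON target.BoatID = src.BoatID AND target.[Date] = src.[Date]
WHEN MATCHED THEN
    UPDATE SET Views = target.Views + @HitCount
WHEN NOT MATCHED THEN
    INSERT (BoatID, [Date], Views) VALUES (src.BoatID, src.[Date], @HitCount);

With the clustered key on (BoatID, Date), each upsert touches a single row, so a few writes per second is trivial for the hardware you describe.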
The problem with your existing structure is that you are limited to a high-level overview of the views for each boat. You can only show the views by month/year. If you ever want to drill down into the data to see which days have more views or activity, you can't.
The second structure gives you more flexibility when it comes to reporting, date comparisons, etc.
As far as query performance goes, you will need to use execution plans and query tuning to determine how your queries will perform.
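For example, daily rows can still be rolled back up to the original month/year numbers; a sketch, assuming the daily structure from the question stored in a table called dbo.BoatDailyViews:

SELECT BoatID,
       YEAR([Date])  AS [Year],
       MONTH([Date]) AS [Month],
       SUM(Views)    AS Views
FROM dbo.BoatDailyViews
GROUP BY BoatID, YEAR([Date]), MONTH([Date]);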
I work on a database where we store sales for about 300 stores. There is one table per store, and the total number of rows is about 120 million (4 million in the biggest table).
The machine is a Windows Server 2008 R2 Citrix virtual machine with 65 GB of memory, and the SQL Server version is 2014.
Rows are added from the stores to the database via a web service every minute, so that customers (the store owners) can view their stats almost in real time.
Christmas is close and the number of sales per day is increasing; it is now something like 100k rows per day.
Monitoring shows about 100-200 queries per second; they all serve the stores' statistics and therefore read a lot of data.
Database I/O is about 0.1-0.5 MB/s.
CPU goes from 10% to 50%.
Often, the database server stops responding (no new connections are possible) for about 30 seconds to 2 minutes, and I don't know why.
Is there any way I can find out why?
Should I upsize the server, or do something else?
As the data is not relational at all, could I move to a NoSQL solution for better availability?
We use SQL Server and it can handle that much data. SQL Server Profiler should give you some useful information.
If the data is not relational, NoSQL will be faster. Depending on your needs, the most recent version of MongoDB is worth checking out.
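For the freezes specifically, one thing you can try the next time it happens (a sketch, run from an already-open session or the dedicated administrator connection) is to look at what active requests are waiting on, and at the cumulative wait statistics:

-- What is every active request waiting on right now?
SELECT r.session_id, r.status, r.wait_type, r.wait_time,
       r.blocking_session_id, t.text AS current_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;

-- Cumulative waits since the last restart; look for I/O- or lock-related wait types.
SELECT TOP (20) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

In a case like this, heavy I/O-related waits would have been a hint toward the disk problem described in the follow-up below.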
Actually, it was a hardware problem.
Everything is back to normal after changing the hard drive.
I'd like some input on designing the SQL data layer for a service that should store and provide the latest N entries for a specific user. The idea is to track each user (id), the time of an event and then the event id.
The service should only respond with the last X events for each user, and should only contain events that occurred during the last Y days.
The service also needs to scale to large amounts of updates and reads.
I'm considering just a simple table with the fields:
ID | USERID | EVENT | TIMESTAMP
============================================
1 | 1 | created file Z | 2014-03-20
2 | 2 | deleted dir Y | 2014-03-20
3 | 1 | created dir Y | 2014-03-20
But how would you consider solving the temporal requirements? I see two alternatives here:
1) On inserts and/or reads for a user, also remove outdated events and everything but the last X events for that user. This affects latency, since you need to perform a select, a delete, and an insert on each request, but it keeps disk usage to a minimum.
2) Let the service filter at query time and do the pruning as a separate batch job (sketched below) with some SQL that:
First removes all obsolete events, irrespective of user, based on the timestamp.
Then does some join that removes all but the last X events for each user.
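A minimal sketch of that batch job, assuming SQL Server syntax, the table and columns from the example above (called dbo.UserEvents here), and the 10-event / 10-day limits mentioned in the edit below:

-- Step 1: remove all obsolete events, irrespective of user.
DELETE FROM dbo.UserEvents
WHERE [TIMESTAMP] < DATEADD(DAY, -10, SYSUTCDATETIME());

-- Step 2: keep only the newest 10 events per user.
WITH ranked AS (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY USERID ORDER BY [TIMESTAMP] DESC) AS rn
    FROM dbo.UserEvents
)
DELETE FROM ranked
WHERE rn > 10;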
I have looked for design principles for these requirements, which seem fairly common, but I haven't yet found a perfect match.
It is at the moment NOT a requirement to query for all users that have performed a specific type of event.
Thanks in advance!
Edit:
The service is meant to scale to millions of requests per hour, so I've been playing around with the idea of denormalizing this for performance reasons. Given that the requirements are set in stone:
10 last events
No events older than 10 days
I'm actually considering a pivoted table like this:
USERID | EV_1 | TS_1 | EV_2 | TS_2 | EV_3 | TS_3 | etc up to 10...
======================================================================
1 | Create | 2014.. | Del x | 2013.. | etc.. | 2013.. |
This way I can probably shift the events with a MERGE combined with a SELECT, and I get eviction for "free". Then I only have to purge all records where TS_1 is older than 10 days. I can also filter in my application logic, after the trivial selects, to show only events newer than 10 days.
The caveat is events coming in "out of order". The idea above works if I can always guarantee that the events are ordered from "left to right". I'll probably have to think a bit about that one.
Aside from the fact that it is basically a big departure from the relational data model, do you think I'm on the right track here when it comes to prioritizing performance above all?
Your table design is good. Consider also the indexes you want to use. In practice, you will need a multi-column index on (userid, timestamp) to quickly answer queries for the last N events of a given userid, and a single-column index on (timestamp) to efficiently delete old events.
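For example, in SQL Server syntax, with dbo.UserEvents as a placeholder name for your table:

-- Serves "last N events for a given user, newest first" straight from the index.
CREATE INDEX IX_UserEvents_User_Time ON dbo.UserEvents (USERID, [TIMESTAMP] DESC);

-- Lets the purge of old events find expired rows without scanning the whole table.
CREATE INDEX IX_UserEvents_Time ON dbo.UserEvents ([TIMESTAMP]);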
How many events are you planning to store, and how many will you retrieve per query? In other words, does the size of the table exceed the available RAM? Are you using traditional spinning hard disks or solid-state disks? If the table is larger than RAM and you are on traditional HDDs, note that each row returned by a query can take about 5-15 milliseconds due to seek time.
If your system supports batch jobs, I would use a batch job to delete old events instead of deleting old events at each query. The reason is that batch jobs do not slow down the interactive code path, and can perform more work at once provided that you execute the batch job rarely enough.
If your system doesn't support batch jobs, you could use a probabilistic approach and delete old events with, say, 1% probability whenever events are queried. Alternatively, you could keep a helper table storing the timestamp of the last purge of old events; check that timestamp, and if it's old enough, run a new delete and update the timestamp. The helper table is so small that it will always stay in the cache.
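A sketch of the helper-table variant (SQL Server syntax; dbo.PurgeState and dbo.UserEvents are hypothetical names), run as part of the normal query path:

-- At most one purge per hour, piggybacking on regular traffic.
IF EXISTS (SELECT 1 FROM dbo.PurgeState
           WHERE LastPurge < DATEADD(HOUR, -1, SYSUTCDATETIME()))
BEGIN
    DELETE FROM dbo.UserEvents
    WHERE [TIMESTAMP] < DATEADD(DAY, -10, SYSUTCDATETIME());

    UPDATE dbo.PurgeState SET LastPurge = SYSUTCDATETIME();
END;

Two concurrent sessions could both pass the check; if that matters, wrap the check and the update in a transaction or take an application lock first.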
My inclination is not to delete data. I would just store the data in your structure and have an interface (perhaps a view or table functions) that runs a query such as:
select s.*
from simple s
where s.timestamp >= CURRENT_DATE - interval 'n days' and
s.UserId = $userid
order by s.timestamp desc
fetch first 10 row only;
(Note: this uses standard syntax because you haven't specified the database, but there is similar functionality in any database.)
For performance, you want an index on simple(UserId, timestamp). This will do most of the work.
If you really want, you can periodically delete older rows. However, keeping all the rows is advantageous for responding to changing requirements ("Oh, we now want 60 days instead of 30 days") or other purposes, such as investigations into user behaviors and changes in events over time.
There are out-of-the-ordinary situations where you might want a different approach. For instance, there could be legal restrictions on how long you can hold the data; in that case, use a job that deletes old data and run it every day. Or, if your database technology were an in-memory database, you might want to restrict the size of the table so old data doesn't occupy much memory. Or, if you had really high transaction volumes and lots of users (like millions of users with thousands of events), you might be more concerned with data volume affecting performance.
So I'm looking into data warehousing and partitioning and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium-scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (each with the requisite transaction details, 4-50 rows?), and have about 5 years of data. I would like to query and build trend information off of that, for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using SQL Server 2012, but I'm curious how others view this question in terms of the core concept rather than the implementation (unless this isn't one of those cases where you can separate the two).
Things to consider:
What type of database are you using? This is really important; there are different strategies for Oracle vs. SQL Server vs. IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your WHERE clause; what are you filtering on?
Does it make sense to create/use aggregate tables? It seems like a monthly aggregate would save you some time (see the sketch after this list).
There are lots of options based on the hardware and storage available to you; more details would be needed to make a specific recommendation.
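To illustrate the aggregate-table idea from above, here is a sketch in SQL Server syntax with assumed table and column names (your fact table will differ):

-- Pre-aggregated monthly sales, rebuilt (or incrementally maintained) by a nightly job.
SELECT StoreID,
       YEAR(SaleDate)  AS SaleYear,
       MONTH(SaleDate) AS SaleMonth,
       ItemID,
       SUM(Quantity)   AS TotalQuantity,
       SUM(Amount)     AS TotalAmount
INTO dbo.MonthlySales
FROM dbo.SalesFact
GROUP BY StoreID, YEAR(SaleDate), MONTH(SaleDate), ItemID;

Trend questions like "which items are becoming less popular year over year" then read a few thousand pre-aggregated rows instead of scanning five years of transaction detail.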
Here is an example: an MS SQL Server 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analysis.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of about 320 partitions, comfortably below the old limit of 1,000 partitions per scheme (SQL Server 2012 actually allows up to 15,000). The maximum size of one partition in one table is now approximately 10 GB, which is much easier to handle than the 3 TB total.
A new file in the partition scheme is used for each new year; the resulting 500 GB data files are a manageable size for backup and deletion.
When calculating data for one month, the four processors work in parallel, each handling one partition.
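A sketch of that setup in SQL Server DDL; the object names and boundary values are illustrative, and in practice the boundary list is generated programmatically with one value per week:

-- Week-based partitioning on an integer YearWeek column (e.g. 201501, 201502, ...).
CREATE PARTITION FUNCTION pf_YearWeek (int)
    AS RANGE RIGHT FOR VALUES (201501, 201502, 201503 /* ...one boundary per week... */);

CREATE PARTITION SCHEME ps_YearWeek
    AS PARTITION pf_YearWeek ALL TO ([PRIMARY]);   -- or map each year to its own filegroup

CREATE TABLE dbo.TransactionFact (
    TransactionID bigint        NOT NULL,
    YearWeek      int           NOT NULL,
    SaleDate      date          NOT NULL,
    Amount        decimal(18,2) NOT NULL
) ON ps_YearWeek (YearWeek);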
I'm a PhD student from Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project that needs the historical events from GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.
I just found that the price Google BigQuery shows on the console is not updated in real time... After I had been running the program for a few hours, the fee was only 4 dollars and change, so I thought the price was reasonable and kept the program running. After 1-2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...
It is my fault that I didn't realize I would need to pay this much money for executing queries and obtaining data from Google BigQuery.
This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.
Thank you very much & Best Regards,
Lisa
A year later update:
Please note some big developments since this situation:
Query prices are down 85%.
GitHub Archive is publishing daily and yearly tables now - so while developing your queries, always test them on the smaller datasets.
BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in seconds.
Pricing scaling linearly is a feature: Most (or all?) other databases I know of would require exponentially more expensive resources, or are just not able to handle these amounts of data - at least not in a reasonable time frame.
That said, linear scaling means that a query over a terabyte is 1,000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers a "dry run" flag that lets you see exactly how much data a query will read before running it - and adjust accordingly.
In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries will quickly amount to a terabyte of data, and so on.
There are ways to make these same queries consume much less data:
Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you scan, so unnecessary columns add unnecessary cost.
For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making this query more than 10 times less expensive.
Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.
The GitHub Archive table contains data for all time. Partition it if you only want to look at one month of data.
For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than the big one!
SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at BETWEEN '2014-01-01' and '2014-01-02'
-> save this into a new table 'timeline_201401'
Combining these methods you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.
(I'm working with GitHub Archive's owner to get them to store monthly data instead of one monolithic table, to make this even easier.)
I have a database table with 700 million plus rows of time-based data (growing exponentially).
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables grouping this data into Days, Months, and Years, which contain the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job. The situation has arisen whereby the tables will need to be updated on the fly when the data in the base table changes. This can be up to 2.5 million rows at a time (not very often; typically around 200-500k rows as frequently as every 5 minutes). Is this possible without causing massive performance hits, and what would be the best method for achieving it?
N.B
The daily, monthly, and yearly tables can be changed if needed. They are used to speed up queries such as 'get the monthly totals for these 5 IDs for the last 5 years'; in raw data that is about 13 million rows, but from the monthly table it's 300 rows.
I do have SSIS available to me.
I can't afford to lock any tables during the process.
700M records in 5 months means 8.4B in 5 years (assuming data inflow doesn't grow).
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if your data is growing linearly rather than exponentially, it'll keep you afloat for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start looking into multidimensional data. With good design, OLAP cubes will take you a long way. They may even be enough to handle billions of records, and you'll be able to stop there for several years (been there, done that; it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right, this allows you to scale almost linearly: as the data grows, add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
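A minimal sketch against the base table from the question (dbo.BaseData and the view name are assumptions; this assumes Value is declared NOT NULL, since indexed views disallow SUM over a nullable expression, and the usual restrictions apply: SCHEMABINDING, COUNT_BIG(*) with GROUP BY, deterministic expressions only):

CREATE VIEW dbo.vMonthlyTotals
WITH SCHEMABINDING
AS
SELECT ID,
       YEAR([TimeStamp])  AS Yr,
       MONTH([TimeStamp]) AS Mo,
       SUM(Value)         AS TotalValue,
       COUNT_BIG(*)       AS RowCnt      -- required when the view uses GROUP BY
FROM dbo.BaseData
GROUP BY ID, YEAR([TimeStamp]), MONTH([TimeStamp]);
GO

-- Materializes the view; the optimizer can then answer monthly-total queries from it.
CREATE UNIQUE CLUSTERED INDEX IX_vMonthlyTotals
    ON dbo.vMonthlyTotals (ID, Yr, Mo);

One caveat: automatic matching of queries to an indexed view is an Enterprise Edition feature; on other editions you reference the view directly with the NOEXPAND hint.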
Why don't you create monthly tables, just to store the info you need for each month? It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2, or the like), just work with multidimensionality; that works well for time-period problems like yours. At the moment I don't have enough info to give you more, but you can learn a lot about it just by googling.
Just as an idea.