Best practice for potentially large amount of data in single table SQL

I have a table with three columns: user_id, post_id and voting.
Each post in the application can be voted on by multiple users, but each user gets only one vote per post.
Imagine we end up with 1 billion posts and an average of 1000 votes on each. Would it still be good practice to store all this information in a single table (performance-wise)?
Would it be better to split the information into different tables, maybe one per user? (There could be several thousand users, so that many tables.)

As a best practice, create a non-clustered index on user_id. It will help with operations such as fetching all rows for a particular user.
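A minimal sketch of the single-table layout, using Python's sqlite3 module so it runs anywhere. The composite primary key on (post_id, user_id) is an assumption taken from the "one voting per user" rule in the question; the table name votes and the index name idx_votes_user_id are hypothetical. SQLite has no clustered/non-clustered distinction, so a plain secondary index stands in for the non-clustered index here.

```python
import sqlite3


def create_votes_schema(conn):
    """Create the three-column table plus a secondary index on user_id."""
    conn.executescript("""
        CREATE TABLE votes (
            post_id INTEGER NOT NULL,
            user_id INTEGER NOT NULL,
            voting  INTEGER NOT NULL,
            -- Assumption: one vote per user per post, enforced by the key.
            PRIMARY KEY (post_id, user_id)
        );
        -- Secondary index so per-user lookups do not scan the whole table.
        CREATE INDEX idx_votes_user_id ON votes (user_id);
    """)


def cast_vote(conn, post_id, user_id, voting):
    """Return True if the vote was stored, False if the user already voted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes (post_id, user_id, voting) VALUES (?, ?, ?)",
                (post_id, user_id, voting),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def posts_voted_by(conn, user_id):
    """Return the ids of all posts the given user has voted on."""
    rows = conn.execute(
        "SELECT post_id FROM votes WHERE user_id = ? ORDER BY post_id",
        (user_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_votes_schema(conn)
cast_vote(conn, 1, 42, 1)
cast_vote(conn, 2, 42, -1)
print(cast_vote(conn, 1, 42, -1))  # False: user 42 already voted on post 1
print(posts_voted_by(conn, 42))
```

The primary key already serves lookups by post_id, so the extra index is only needed for the per-user access path.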
If the table eventually grows very large, say to half the size you mentioned, you can shard the database to improve overall performance. The idea is to distribute data that can't fit on a single node across a cluster of database nodes. Refer to your vendor's documentation for how to set this up.
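The sharding idea can be sketched roughly as follows. This is an illustration only: each "node" is an in-memory SQLite database, the shard key (post_id modulo the shard count) is an assumption, and the names ShardedVotes and shard_for are hypothetical. A real deployment would rely on the vendor's own sharding support.

```python
import sqlite3

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(post_id, num_shards=NUM_SHARDS):
    """Pick a shard from the post id (simple modulo routing)."""
    return post_id % num_shards


class ShardedVotes:
    """Routes each vote to one of several independent databases."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for shard in self.shards:
            shard.execute(
                "CREATE TABLE votes ("
                " post_id INTEGER NOT NULL,"
                " user_id INTEGER NOT NULL,"
                " voting INTEGER NOT NULL,"
                " PRIMARY KEY (post_id, user_id))"
            )

    def _shard(self, post_id):
        return self.shards[shard_for(post_id, len(self.shards))]

    def add_vote(self, post_id, user_id, voting):
        conn = self._shard(post_id)
        with conn:
            conn.execute(
                "INSERT INTO votes VALUES (?, ?, ?)", (post_id, user_id, voting)
            )

    def votes_for_post(self, post_id):
        """All votes for one post live on a single shard."""
        return self._shard(post_id).execute(
            "SELECT user_id, voting FROM votes WHERE post_id = ? ORDER BY user_id",
            (post_id,),
        ).fetchall()

    def row_counts(self):
        """Number of rows held by each shard, in shard order."""
        return [
            shard.execute("SELECT COUNT(*) FROM votes").fetchone()[0]
            for shard in self.shards
        ]


store = ShardedVotes()
for post_id in range(8):
    store.add_vote(post_id, 1, 1)
print(store.row_counts())
```

With post_id as the shard key, a query for one post touches one node, while "all votes by one user" has to ask every shard. Which key to shard on depends on the dominant query.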

Related

Auto-deleting particular records in one table of Oracle database using SQL

I have a question about automatically deleting particular records from one table of an Oracle database using SQL.
I am working on a small academic project, a database for a private clinic, and I have to design the Oracle database and a client application in Java.
One of my ideas is a "Visits" table that stores all past patient visits for history purposes. That table will grow quickly, so I expect searches on it to be slow.
So the idea is to add a smaller table called "currentVisits" that holds only appointments for future visits, because searching ~1000 records will be much faster than searching a few million after a few years.
My question is how to automatically delete records from the temporary "currentVisits" table in SQL once the visits have taken place.
Both tables will store fields like dateOfVisit, patientName, doctorID etc.
Is there a simple way to make this work? For example, using triggers?
I am quite new to this topic, so thanks for every answer.

Don't worry about the data size. Millions of records is not particularly large for a database on modern computing hardware. You will need an appropriate data structure, however.
In this case, you will want an index on the column that indicates current records. In all likelihood, the current records will be appended onto the end of the table, so they will tend to be congregating on a handful of data pages. This is a good thing.
If you have a heavy deletion load on the table, or you are using a clustered index, then the pages with the current records might be spread throughout the database. In that case, you want to include the "current" column in the clustered index.
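As a rough illustration of the single-table approach, the sketch below keeps every visit in one Visits table and finds the current ones through an index. It assumes dateOfVisit is the column that indicates current records and stores it as an ISO 8601 string. The visitID column and the index name are hypothetical, and SQLite is used only so the example is runnable; the same idea applies to Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Visits (
        visitID     INTEGER PRIMARY KEY,  -- hypothetical surrogate key
        dateOfVisit TEXT NOT NULL,        -- ISO 8601 date, e.g. '2030-01-15'
        patientName TEXT NOT NULL,
        doctorID    INTEGER NOT NULL
    );
    -- Index on the column that separates past visits from future ones.
    CREATE INDEX idx_visits_date ON Visits (dateOfVisit);
""")

conn.executemany(
    "INSERT INTO Visits (dateOfVisit, patientName, doctorID) VALUES (?, ?, ?)",
    [
        ("2020-03-01", "Ann Lee", 1),  # already took place
        ("2030-01-15", "Bob Ray", 2),  # upcoming
        ("2030-02-01", "Cy Moe", 1),   # upcoming
    ],
)


def current_visits(conn, today):
    """Return visits on or after `today` (an ISO date string), soonest first."""
    return conn.execute(
        "SELECT dateOfVisit, patientName, doctorID FROM Visits"
        " WHERE dateOfVisit >= ? ORDER BY dateOfVisit",
        (today,),
    ).fetchall()


print(current_visits(conn, "2025-01-01"))
```

Nothing has to be deleted or moved by a trigger: a visit stops being "current" simply because its date falls behind the date passed to the query.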

Hard limit on number of tables in a BQ project

I've got some highly partitionable data that I'd like to store in BigQuery, where each partition would get its own table. My question is if BQ will support the number of tables I'll need.
With my data set, I'd be creating approximately 2,000 new tables daily. All tables would have a 390 day (13 month) expiration, so eventually there'd be a constant count of ~ 2,000 tables * 390 days = ~780,000 tables in this particular project.
I'd test this myself, but BQ only supports a max of 10,000 load jobs per project per day.
Does anyone have experience with this sort of table count? Is there any official table limit provided by Google?

There are projects with that number of distinct tables today. There is not currently a hard cap on the number of distinct tables.
Some related considerations that come to mind when you're contemplating representations that use that many tables:
A query (including referenced views) can currently only reference 1000 tables.
Datasets with large numbers of tables may exhibit problematic behavior when using table wildcard functions.
You may be oversharding. Rather than lots of individual tables, you may simply want to use a wider schema and fewer tables.
If you're heavily dependent on time intervals as a sharding consideration, you may also want to look at table decorators as a way of limiting the scope of data scans.
You may also want to collapse data over time into fewer, larger tables as they age and are less frequently accessed. For example, copy jobs can append multiple source tables into a single destination table.
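To see what the 1000-table limit means for the layout in the question, here is a small sketch that generates hypothetical table names (the prefix_YYYYMMDD_NNNN scheme is an assumption, not a BigQuery convention) and splits them into batches that each stay within the limit.

```python
from datetime import date, timedelta

MAX_TABLES_PER_QUERY = 1000  # limit quoted in the answer above


def daily_table_names(prefix, start, days, tables_per_day):
    """Build names like prefix_20160101_0000 for each day and daily table."""
    names = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for number in range(tables_per_day):
            names.append(f"{prefix}_{day:%Y%m%d}_{number:04d}")
    return names


def batches(names, size=MAX_TABLES_PER_QUERY):
    """Split the names into consecutive groups of at most `size` tables."""
    return [names[i:i + size] for i in range(0, len(names), size)]


# One day at 2,000 tables per day already needs two separate queries.
one_day = daily_table_names("events", date(2016, 1, 1), 1, 2000)
print(len(batches(one_day)))  # 2
```

At roughly 780,000 tables, a query over the full retention window would need about 780 such batches, which is one reason the answer suggests fewer, wider tables.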

Most limits can be raised in BigQuery, as long as you are using BigQuery right - the limits are there to prevent abuse and misuse.
A critical question here - how much data will each table handle? Having 780,000 tables with 10 rows isn't a good idea.
How many tables do you want to handle per query? There's a hard limit of 1,000 tables per query.
If you have an interesting use case that requires higher limits, getting a support contract and their advice is the best way of having default limits raised.
https://cloud.google.com/support/

Partitioning by date?

We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing). However, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery's architecture).
What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:
SELECT foo from
mytable_2012-09-01,
mytable_2012-09-02,
mytable_2012-09-03;
Is there a better way than this???

BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x

Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.
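The per-day sharding pattern can be sketched locally as below. This is not BigQuery code: it uses SQLite so it runs as-is, and the sessions_YYYYMMDD naming scheme and the helper names are hypothetical. The point is that a query for a date range only reads the tables for those days.

```python
import sqlite3
from datetime import date, timedelta


def shard_name(day):
    """Table name for one day of data, e.g. sessions_20120901."""
    return f"sessions_{day:%Y%m%d}"


def add_rows(conn, day, values):
    """Create the day's table if needed and append the given foo values."""
    # Names come from dates, never from user input, so formatting is safe.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {shard_name(day)} (foo INTEGER)")
    conn.executemany(
        f"INSERT INTO {shard_name(day)} (foo) VALUES (?)",
        [(value,) for value in values],
    )


def query_range(conn, start, end):
    """Select foo from every daily table between start and end, inclusive."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    sql = " UNION ALL ".join(f"SELECT foo FROM {shard_name(d)}" for d in days)
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
first_day = date(2012, 9, 1)
for offset, values in enumerate([[1, 2], [3], [4, 5]]):
    add_rows(conn, first_day + timedelta(days=offset), values)

# Only the tables for 1 and 2 September are read; 3 September is untouched.
print(sorted(query_range(conn, date(2012, 9, 1), date(2012, 9, 2))))
```

A single-day query then scans one small table instead of the whole history, which is the cost saving the answer describes.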

Why does WordPress have separate 'usermeta' and 'users' SQL tables? Why not combine them?

Alongside the users table, WordPress has a usermeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usermeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table as columns?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?

Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20, as stated in the question), but there can be any number. If all users have one thousand metadata entries, there are simply one thousand rows in the usermeta table for each user; the database structure places no limit on the number of metadata entries a user can have. It also permits one user to have one thousand metadata entries while all others have 20, or any permutation thereof, and still stores the data efficiently.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first option: it is not a good idea to grant blanket access to alter the schema of your database tables, and that is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com, or on a shared host where the db user has no ALTER permission). MySQL also has a hard limit of 4096 columns per table and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, and along the way creates a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a metadata value.
WordPress is quite tied to MySQL, and therefore changing datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
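A simplified sketch of the EAV layout described above, using SQLite from Python. It is not WordPress's actual schema or code: the table definitions are cut down to the columns listed in the question, and add_meta and users_with_meta are hypothetical helpers standing in for what a call like add_user_meta does conceptually.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL
    );
    -- One row per (user, attribute): new attributes need no schema change.
    CREATE TABLE usermeta (
        meta_id    INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users (user_id),
        meta_key   TEXT NOT NULL,
        meta_value TEXT
    );
    CREATE INDEX idx_usermeta_key ON usermeta (meta_key);
""")


def add_meta(conn, user_id, meta_key, meta_value):
    """Attach one attribute to a user by inserting a usermeta row."""
    conn.execute(
        "INSERT INTO usermeta (user_id, meta_key, meta_value) VALUES (?, ?, ?)",
        (user_id, meta_key, meta_value),
    )


def users_with_meta(conn, meta_key, meta_value):
    """Return usernames whose attribute `meta_key` equals `meta_value`."""
    rows = conn.execute(
        "SELECT u.username FROM users u"
        " JOIN usermeta m ON m.user_id = u.user_id"
        " WHERE m.meta_key = ? AND m.meta_value = ?"
        " ORDER BY u.username",
        (meta_key, meta_value),
    )
    return [row[0] for row in rows]


conn.executemany(
    "INSERT INTO users (user_id, username) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
add_meta(conn, 1, "favorite_color", "blue")
add_meta(conn, 2, "favorite_color", "red")
add_meta(conn, 2, "newsletter", "yes")  # bob has an attribute alice lacks

print(users_with_meta(conn, "favorite_color", "blue"))
```

Note that users can carry different numbers of attributes, and the users table itself never changes shape.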

This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer
Since there is a huge literature about this, and the approaches differ a lot, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep the tables separate: space could be an issue, the speed of queries may depend on it, or it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading
This is a deep topic and you may want to learn more; there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database: "Database design: one huge table or separate tables?" and "First-time database design: am I overengineering?", or at "Database Normalization Basics" on About.com.

Improving query performance of a database table with a large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5 - 6 tables, but the query service built on top of it uses Hibernate and we are getting very poor performance. Loading the data into the cache takes in the range of 10 - 15 hours.
So we have decided to create a denormalized table of 50 - 60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC so that the time to build our cache is lower. This gives us much better performance, and we can now build the cache in around an hour, but this still does not meet our requirement of building the cache within 5 minutes. The denormalized table is queried using the following query:
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from users where user_location in (...)
and created a non-unique index on user_location, but that did not help either.
So is there a way we can make the queries faster? If not, we are also open to considering NoSQL solutions.
Which NoSQL solution would suit our needs? Apart from the large table, we would be making around 1mm updates to the table daily.
I have read about MongoDB and it seems that it might work, but no one has posted any experience of MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.

The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations and based on the information given there is no reason it will not work either.
