I'm writing a web application that needs to periodically collect data from an API and perform analysis on these stats to produce a dashboard for individual users. There are 236 unique 'stats' coming in from the API per user, which are essentially key-value pairs where the value is either a string or a number (or a time duration or a percentage).
I'm trying to figure out how best to store this data. The simplest option I thought of was to store the raw JSON response against a userId and perform all analysis from that JSON. The obvious issue with this is that I need to be able to query the data easily and do things like ordering different users by one of the 236 unique stats. The other option would be to store it in a relational database.
If I were to go the relational route, how is it best to store snapshots of data like this? I imagine creating a column for each of the 236 stats would be a bit of a mess, and annoying to add to in the future. I've looked at other relatively similar questions but haven't found anything right for me.
My thoughts so far:
Create a StatsType(id, typename) table containing 236 rows,
and a UserStats(statid, userid, typeid, value, date_added) table containing 236 rows for each user update from the API.
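Roughly, that would look something like this (the column types are just placeholders - in practice the value column would have to hold strings, numbers, durations and percentages as text):

CREATE TABLE StatsType (
    id        INT PRIMARY KEY,        -- 236 rows, one per stat
    typename  VARCHAR(100) NOT NULL
);

CREATE TABLE UserStats (
    statid      BIGINT PRIMARY KEY,   -- surrogate key for each stored value
    userid      INT NOT NULL,
    typeid      INT NOT NULL REFERENCES StatsType(id),
    value       VARCHAR(255),         -- stored as text, cast when analysed
    date_added  DATETIME NOT NULL
);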
Would this end up being too huge as the app grows? (Think 200,000+ users.) Thoughts would be much appreciated.
The different value types are an argument for separate columns. Your requirement to order users by a stat also suggests keeping a single row per user.
You could create a kind of data dictionary to keep your code clean and adaptable, with future changes in mind.
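For illustration only (every name here is invented), a single-row-per-user layout plus a small data dictionary might look like:

CREATE TABLE UserStatsSnapshot (
    userid          INT NOT NULL,
    date_added      DATETIME NOT NULL,
    stat_score      INT,                 -- one typed column per stat ...
    stat_play_time  TIME,                -- ... using a type that fits the value
    stat_win_pct    DECIMAL(5,2),
    -- ... the remaining stat columns ...
    PRIMARY KEY (userid, date_added)
);

CREATE TABLE StatsDictionary (           -- the data dictionary: one row describing each stat column
    column_name   VARCHAR(100) PRIMARY KEY,
    display_name  VARCHAR(100) NOT NULL,
    data_type     VARCHAR(20)  NOT NULL  -- e.g. 'int', 'duration', 'percent'
);

Ordering users by any stat is then a plain ORDER BY on the relevant column, with no pivoting needed.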
I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. This scraper will pull some data that is static and some data that varies with time. Therefore there will be one table for each type of data (static/variable). I'm trying to decide if there should be a separate table for variable data that is scraped compared to variable data that is pulled through an API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), and so they will both be mapped with the same ProductID and date. So, it seems like I could just combine the schema of both tables to save on the join time during queries for processing the data later. The obvious downside to this is managing whether rows need to be created or updated whenever one of the processes runs (i.e. deciding which of the scraper or API tools creates vs. updates rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, to summarize: separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a time stamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically amount to reading a single record from each of your tables, each time via a primary key, which is fast. But reading 3 records is still more effort than reading 2, or only one.
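(For example, with separate tables the daily read is just primary-key lookups - table and column names here are invented, and I'm assuming the date is part of the key of the variable tables:

SELECT s.Address, v.Price, a.PageHits
FROM product_static  s
JOIN product_scraped v ON v.ProductID = s.ProductID AND v.snapshot_date = @day
JOIN product_api     a ON a.ProductID = s.ProductID AND a.snapshot_date = @day
WHERE s.ProductID = @productId;

With one combined table, the same information comes back from a single-row read.)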
There is no general design principle saying you should have a separate table for each way to collect data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard of the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only gets inserted for an object that already has static data.
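If this were, say, MySQL, the insert-or-update decision can even be pushed onto the database with an upsert, assuming a unique key on (ProductID, snapshot_date); whichever of the scraper or API process runs first creates the row, and the other just fills in its own columns (names again invented):

INSERT INTO product_daily (ProductID, snapshot_date, Price, PricePerSqFt)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    Price        = VALUES(Price),
    PricePerSqFt = VALUES(PricePerSqFt);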
I am going to be maintaining a local copy of a database on BigQuery. I will be using the API and tabledata:list. This database is not my own and is regularly updated by the maintainers by appending new data (say, every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started, in the order they are in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether BigQuery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote. However, if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy, as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append it to the end of the table data list. However, BigQuery may periodically coalesce data, which does not preserve ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you have updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.
Alongside the users table, WordPress has a usermeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usermeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20 stated in the question), but there can be any number. If all users have one thousand meta-data entries, there are simply one thousand entries in the usermeta table for each user - the database structure places no limit on the number of meta-data entries a user can have. It also permits one user to have one thousand meta-data entries whilst all others have 20, and still store the data efficiently - or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and this is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com or on a shared host where the db user has no ALTER permissions). MySQL also has a hard limit of 4096 columns and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, along the way creating a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
WordPress is quite tied to MySQL, and therefore changing the datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins, it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial, and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
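With the default wp_ table prefix, that lookup is a single query against the meta table, something like:

SELECT u.ID, u.user_login
FROM wp_users u
JOIN wp_usermeta m ON m.user_id = u.ID
WHERE m.meta_key = 'favorite_color'
  AND m.meta_value = 'blue';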
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer: Since there is a huge literature about this, and there are a lot of differences, I will just give some examples of why this might happen - it boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep them separate - space could be an issue, speed of queries may depend on this, it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading: This is a deep topic and you may want to learn more - there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database: Database design: one huge table or separate tables? and First-time database design: am I overengineering?; or see Database Normalization Basics on About.com.
I have an application that sends data based on user interaction (not user input). The data sent could be an Integer, String, Date, or Boolean value. There are 140 keys. We may get anywhere from 1 key value pair to all 140 at a time.
We want to store everything but will only be using 20 out of the 140 keys within the application. The remaining keys will be used for an audit trail later on - so we still need to store them.
This data is used by the application to decide where the user needs to go, so it needs to access the record by student ID and pull the 20 or so options within milliseconds. There could be billions of rows of data (it is an upgrade to an existing application with over 20,000 users), so performance is critical. The user generates a new row each time they access the application.
EXAMPLE DATA:
Score:1
ID:3212
IsLast:False
Action:Completed
I have 2 ideas on how to do this and am looking for some help on which is best, or whether a third option is a better choice.
OPTION 1:
My first idea is to store the value as a string in a single column, then have a look-up table of possible data types to use when the value needs to be cast for use.
value | dataType
-----------------------
"1" | int
"Completed" | string
While the data being sent is not user-generated, I know there must be a gotcha somewhere in this method. The only reason for doing this is that we don't know which key:value pairs will be sent (outside of date and id) and we're trying to avoid more than a few columns.
The SO question How to Handle Unknown Data Type in one Table uses a similar idea.
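For example, the read side might look something like this (table and column names are placeholders):

SELECT CAST(kv.value AS int) AS Score
FROM StudentKeyValue kv
JOIN DataTypes dt ON dt.dataTypeId = kv.dataTypeId
WHERE kv.StudentId = @StudentId
  AND kv.[key]     = 'Score'
  AND dt.dataType  = 'int';      -- cast only happens for rows whose recorded type is int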
OPTION 2:
The other solution is to have 140 columns - one for each key. However, the amount of data generated is very large (billions of rows), so I don't think querying this data would be fast enough.
Technical Details:
This is using SQL Server 2008 (not R2) with .NET C# and Reporting Services.
Am I missing something here - what is the best way to create this table for performance?
Vertically segment your data. Put the 20 keys that are necessary for navigational control in one table, all 20 in one row, with a primary key that identifies the user interaction (call it, say, InteractionId). Put the other 120 values in a second table with a composite primary key based on the PK of the first table (InteractionId), plus a KeyTypeId identifying which of the 120 possible key-value pairs the value is for. Store all the values in this second table as strings. In a third lookup table called, say, KeyTypes, store the KeyTypeId, KeyTypeName, and KeyValueDataType, so your code knows how to cast the string value and output it properly as either a string, datetime, integer, or decimal value, or whatever...
The first table will be accessed much more often, and so it contains only those values which the application's navigational functionality needs frequent access to, keeping the rows narrower, which allows more rows per page and minimizes disk IO. Putting all 20 values in one row also keeps the row count smaller (~1/20th as large), minimizing the depth of the index seeks that will need to be performed for each access.
The other table with the remaining 120 key-values will not be accessed as frequently, so its structure can probably be optimized for logical simplicity rather than for performance.
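A rough sketch of that layout (every name below is just illustrative):

CREATE TABLE KeyTypes (                   -- lookup: tells the application how to cast each value
    KeyTypeId         INT          NOT NULL PRIMARY KEY,
    KeyTypeName       VARCHAR(100) NOT NULL,
    KeyValueDataType  VARCHAR(20)  NOT NULL   -- 'int', 'string', 'datetime', 'bit', ...
);

CREATE TABLE Interaction (                -- the "hot" table: the 20 navigation keys, one row per interaction
    InteractionId  BIGINT       NOT NULL PRIMARY KEY,
    StudentId      INT          NOT NULL,
    Score          INT          NULL,
    IsLast         BIT          NULL,
    [Action]       VARCHAR(50)  NULL
    -- ... the rest of the 20 navigation columns ...
);

CREATE TABLE InteractionKeyValue (        -- the "cold" table: the other 120 keys, stored as strings
    InteractionId  BIGINT       NOT NULL REFERENCES Interaction(InteractionId),
    KeyTypeId      INT          NOT NULL REFERENCES KeyTypes(KeyTypeId),
    Value          VARCHAR(255) NULL,
    PRIMARY KEY (InteractionId, KeyTypeId)
);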
Actually, you might merge the suggestions offered so far:
Create a table with the 20 keys necessary for navigational control, plus one column for a primary key, plus one column of the XML data type to store the rest of the possible data. You could then define an XML schema (in SQL Server, an XML schema collection rather than a DTD) that handles the data types for each key, plus constraints on certain keys as needed.
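Concretely, the hot table sketched above could just gain one xml column (names invented; sketch only, the xml type is available in SQL Server 2008):

ALTER TABLE Interaction ADD ExtraData XML NULL;   -- the ~120 non-navigation key/value pairs as one XML document

-- reading one (hypothetical) key back out later, e.g. for the audit trail; @InteractionId is a parameter:
SELECT ExtraData.value('(/stats/SomeAuditKey)[1]', 'varchar(50)') AS SomeAuditKey
FROM Interaction
WHERE InteractionId = @InteractionId;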
Well, it should be simple enough to test both ideas, but a variation on option 1 looks favoured to me. RDBMSs like SQL Server prefer long, narrow tables (i.e. fewer columns but lots of rows).
I won't go any further because it appears Charles has beaten me to it, with a perfectly sensible suggestion.
I am working on a project involving monitoring a large number of RSS/Atom feeds. I want to use HBase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (the last 100 posts from all feeds in reverse chronological order).
Currently I am using two tables:
Feeds: column families Content and Meta : raw feed stored in Content:raw
Urls: column families Content and Meta : raw post version stored in Content:raw and the rest of the data found in the RSS stored in Meta
I need some sort of index table for the aggregated feed. How should I build that? Is HBase a good choice for this kind of application?
Question update: Is it possible (in HBase) to design a schema that could efficiently answer queries like the one listed below?
SELECT data FROM Urls ORDER BY date DESC LIMIT 100
Peter Rietzler's answer on the hbase-user mailing list:
Hi

In our project we are handling event lists where we have similar requirements. We do ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):

eventListName/yyyyMMddHHmmssSSS-000[-111]

where eventListName is the name of the event list, 000 is a three-digit instance id to disambiguate between different running instances of the application, and -111 is optional to disambiguate events that occurred in the same millisecond on one instance.

We additionally insert an artificial row for each day with the id

eventListName/yyyyMMddHHmmssSSS

This allows us to start scanning at the beginning of each day without searching through the event list.

You need to be aware of the fact that if you have a very high load of inserts, then always one HBase region server is busy inserting while the others are idle ... if that's a problem for you, you have to find different keys for your purpose.

You could also use an HBase index table, but I have no experience with it and I remember an email on the mailing list saying that this would double all requests because the API would first look up the index table and then the original table ??? (please correct me if this is not right ...)

Kind regards, Peter
Thanks Peter.