Feed aggregator using HBase. How to design the schema?

I am working on a project that involves monitoring a large number of RSS/Atom feeds. I want to use HBase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (the last 100 posts from all feeds in reverse chronological order).
Currently I am using two tables:
Feeds: column families Content and Meta; the raw feed is stored in Content:raw
Urls: column families Content and Meta; the raw post version is stored in Content:raw and the rest of the data found in the RSS is stored in Meta
I need some sort of index table for the aggregated feed. How should I build that? Is HBase a good choice for this kind of application?
Question update: is it possible (in HBase) to design a schema that could efficiently answer queries like the one listed below?
SELECT data FROM Urls ORDER BY date DESC LIMIT 100

Peter Rietzler's answer on the hbase-user mailing list:
Hi,
In our project we are handling event lists where we have similar requirements. We do ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):
eventListName/yyyyMMddHHmmssSSS-000[-111]
where eventListName is the name of the event list, 000 is a three-digit instance id to disambiguate between different running instances of the application, and -111 is optional to disambiguate events that occurred in the same millisecond on one instance.
We additionally insert an artificial row for each day with the id
eventListName/yyyyMMddHHmmssSSS
This allows us to start scanning at the beginning of each day without searching through the event list.
You need to be aware of the fact that if you have a very high load of inserts, then one HBase region server is always busy inserting while the others are idle; if that's a problem for you, you have to find different keys for your purpose.
You could also use an HBase index table, but I have no experience with it, and I remember an email on the mailing list saying that this would double all requests because the API would first look up the index table and then the original table (please correct me if this is not right ...).
Kind regards, Peter
Thanks Peter.
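A minimal sketch of that key scheme using the happybase Python client; the table name, column family, instance id, and the reverse-timestamp variant at the end (one common way to get the newest-first, LIMIT-100 ordering asked about above) are assumptions for illustration, not anything stated in the thread:

import time
import happybase

# Assumed connection and naming; adjust to your own cluster and tables.
connection = happybase.Connection('localhost')
events = connection.table('events')          # hypothetical table

def event_key(list_name, ts_millis, instance_id, seq=None):
    # Builds eventListName/yyyyMMddHHmmssSSS-000[-111] as described above.
    stamp = time.strftime('%Y%m%d%H%M%S', time.gmtime(ts_millis / 1000.0))
    stamp += '%03d' % (ts_millis % 1000)
    key = '%s/%s-%03d' % (list_name, stamp, instance_id)
    if seq is not None:
        key += '-%03d' % seq
    return key.encode('utf-8')

now_ms = int(time.time() * 1000)
events.put(event_key('myfeed', now_ms, 0), {b'Content:raw': b'<entry>...</entry>'})

# Scan one day's worth of events by key prefix (keys sort ascending by time).
for key, data in events.scan(row_prefix=b'myfeed/20240101', limit=100):
    print(key, data)

# Alternative layout for the "last 100 posts, newest first" query: store a
# zero-padded reverse timestamp in the key, so a plain forward scan with a
# limit returns the most recent rows first.
MAX_MS = 10 ** 13
def reverse_key(list_name, ts_millis):
    return ('%s/%013d' % (list_name, MAX_MS - ts_millis)).encode('utf-8')

# events.scan(row_prefix=b'myfeed/', limit=100) would then yield the
# 100 newest posts in reverse chronological order.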

Related

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables of hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp+JSON columns, I would add one partitioning column and a few clustering columns as well (BigQuery allows up to four), and eventually even use yearly suffixed tables. This way you have several dimensions for scanning only a limited number of rows during rematerialization.
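For illustration, creating such a partitioned and clustered table with the google-cloud-bigquery Python client might look like the sketch below; the project, dataset, table, and column names are made up, and you would pick clustering columns that match your most common filters:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset, table, and columns; the raw JSON goes into a STRING column.
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("page", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.events.raw_events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")   # the partition column
table.clustering_fields = ["event_type", "user_id", "page"]             # clustering columns (max 4)
client.create_table(table)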
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. This pipeline would be able to create tables on the fly with the schema you code into your engine.
By the way, you can remove columns via rematerialization: you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
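Rematerialization here is just rewriting the table in place with a query; a sketch with the same Python client, using an illustrative deprecated column name:

from google.cloud import bigquery

client = bigquery.Client()

# Overwrite the table with a copy of itself minus a deprecated column
# (table and column names are illustrative).
table_ref = bigquery.TableReference.from_string("my-project.events.raw_events")
job_config = bigquery.QueryJobConfig(
    destination=table_ref,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = "SELECT * EXCEPT(old_form_field) FROM `my-project.events.raw_events`"
client.query(sql, job_config=job_config).result()

# The same pattern with SELECT DISTINCT * removes duplicate rows.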
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The Dataflow steps would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types), and you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations output and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
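A rough sketch of those steps with the Apache Beam Python SDK; the Pub/Sub subscription path, the one-table-per-event-type naming convention, and the flat schema are all assumptions for illustration, and the failed-insert handling described above is omitted:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def route_to_table(row):
    # Hypothetical convention: one BQ table per event type.
    return 'my-project:events.%s' % row['event_type']

def flatten_event(event):
    # Keep only the fields we want to land in BigQuery.
    return {
        'event_ts': event.get('timestamp'),
        'event_type': event.get('type', 'unknown'),
        'payload': json.dumps(event),
    }

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/events-sub')
     | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'Flatten' >> beam.Map(flatten_event)
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           table=route_to_table,  # dynamic destination, chosen per element
           schema='event_ts:TIMESTAMP,event_type:STRING,payload:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))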

Structuring a large DynamoDB table with many searchable attributes?

I've been struggling with the best way to structure my table. It's intended to hold many, many GBs of data (I haven't been given a more detailed estimate). The table will hold claims data (example here) with the partition key being the resourceType and the sort key being the id (although these could potentially be changed). The end user should be able to search by a number of attributes (institution, provider, payee, etc., totaling ~15).
I've been toying with combining global and local indices in order to achieve this functionality on the backend. What would be the best way to structure the table to allow a user to search the data according to 1 or more of these attributes in essentially any combination?
If you use resourceType as a partition key you are essentially throwing away the horizontal scaling features that DynamoDB provides out of the box.
The reason to partition your data is such that you distribute it across many nodes in order to be able to scale without incurring a performance penalty.
It sounds like you're looking to put all claim documents into a single partition so you can do "searches" by arbitrary attributes.
You might be better off combining your DynamoDB table with something like ElasticSearch for quick, arbitrary search capabilities.
Keep in mind that DynamoDB can only accommodate approximately 10GB of data in a single partition and that a single partition is limited to up to 3000 reads per second, and up to 1000 writes per second (reads + 3 * writes <= 3000).
Finally, you might consider storing your claim documents directly into ElasticSearch.
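For instance, mirroring each claim into Elasticsearch and then querying by any combination of attributes could look roughly like this (the index name, field names, and 8.x-style Python client calls are assumptions; you would still fetch the full item from DynamoDB by its key when needed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Mirror each claim into Elasticsearch whenever it is written to DynamoDB.
claim = {
    "id": "claim-123",
    "resourceType": "Claim",
    "institution": "General Hospital",
    "provider": "Dr. Smith",
    "payee": "patient",
}
es.index(index="claims", id=claim["id"], document=claim)

# Search by any combination of the ~15 attributes.
query = {
    "bool": {
        "filter": [
            {"term": {"institution.keyword": "General Hospital"}},
            {"term": {"payee.keyword": "patient"}},
        ]
    }
}
hits = es.search(index="claims", query=query)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["id"])   # then fetch the full item from DynamoDB by key if needed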

Storing key value pairs in SQL / Stats Aggregator

I'm writing a web application that needs to periodically collect data from an API and perform analysis on these stats to produce a dashboard for unique users. There are 236 unique 'stats' coming in from the API per user which are essentially key value pairs, where the value consists of either a string or number (or time duration or percent).
I'm trying to figure out how best to store this data. One option I thought of which would be the simplest approach was to store the raw JSON response against a userId and perform all analysis from that JSON. The obvious issue with this is that I need to be able to query the data easily and do things like ordering different users by one of the 236 unique stats. The other option would be in a relational database.
If I were to go the relational route, how is it best to store snapshots of data like this? I imagine creating a column for each of the 236 stats would be a bit of a mess, and annoying to add to in the future. I've looked at other relatively similar questions but haven't found anything right for me.
My thoughts so far:
Create a StatsType(id, typename) table containing 236 rows,
and a UserStats(statid, userid, typeid, value, date_added) table containing 236 rows for each user update from the API.
Would this end up being too huge as the app grows? (Think 200,000+ users) Thoughts would be much appreciated
Different value types are an argument for different columns. Your requirement to order users also suggests having a single row per user.
You may create a kind of data dictionary to keep your code clean and adaptable to future changes.
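For comparison, here is a small sqlite3 sketch of the proposed StatsType/UserStats layout and the kind of join-plus-latest-snapshot query it needs just to order users by one stat (the stat name and contents are illustrative); this query burden is part of what a single-row-per-user layout avoids:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StatsType (
        id       INTEGER PRIMARY KEY,
        typename TEXT NOT NULL UNIQUE
    );
    CREATE TABLE UserStats (
        statid     INTEGER PRIMARY KEY,
        userid     INTEGER NOT NULL,
        typeid     INTEGER NOT NULL REFERENCES StatsType(id),
        value      TEXT,
        date_added TEXT NOT NULL
    );
""")

# Order users by one named stat, using only each user's latest snapshot.
rows = conn.execute("""
    SELECT us.userid, CAST(us.value AS REAL) AS v
    FROM UserStats us
    JOIN StatsType st ON st.id = us.typeid
    WHERE st.typename = 'games_played'
      AND us.date_added = (SELECT MAX(date_added)
                           FROM UserStats
                           WHERE userid = us.userid AND typeid = us.typeid)
    ORDER BY v DESC
""").fetchall()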

Pulling 50 RSS feeds and checking uniqueness of posts against database [Efficiency Advice]

I want to pull several RSS feeds into a database as efficiently as possible.
My site will scrape 50 RSS feeds every 4 hours, and I only want to add unique posts to the database. I am a bit stuck on how to check post uniqueness. I am thinking I can compare the URL value and, when the first post is not unique, break out of the loop.
So my question is: is it better to pull all of the posts and check each feed's posts against an array until one duplicate comes up (and then break)? Or is it better to search the database for the first post from each RSS feed, store only that in an array, and check the newly scraped posts against it?
I guess while I'm here I might as well ask if I should create a table for each separate feed. I am only storing 5 posts at a time so for now it is only one table and I just have a column to identify each feed.
Thanks.
Let the DBMS handle uniqueness: insert all the potential records using where not exists to screen out the existing ones.
Nothing in what you said suggests to me that you need a table for each feed.
No one can answer an efficiency question without knowing the DBMS in question. As a rule of thumb, though, you're better off with bigger chunks of work and fewer transactions. Storing 5 rows at a time may be a lot slower than, say, 5000 rows at a time.
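A small sqlite3 sketch of that approach, assuming an illustrative posts table: the WHERE NOT EXISTS clause lets the database screen out URLs it already has, and the whole scrape is inserted in one transaction rather than row by row:

import sqlite3

conn = sqlite3.connect("feeds.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id      INTEGER PRIMARY KEY,
        feed_id INTEGER NOT NULL,
        url     TEXT NOT NULL,
        title   TEXT,
        fetched TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

scraped = [
    (1, "http://example.com/post-1", "First post"),
    (1, "http://example.com/post-2", "Second post"),
]

# One statement per post, all in a single transaction: WHERE NOT EXISTS
# skips URLs that are already stored, so the DBMS handles uniqueness.
with conn:
    conn.executemany("""
        INSERT INTO posts (feed_id, url, title)
        SELECT ?, ?, ?
        WHERE NOT EXISTS (SELECT 1 FROM posts WHERE url = ?)
    """, [(f, u, t, u) for (f, u, t) in scraped])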

Why does WordPress have separate 'usersmeta' and 'users' SQL tables? Why not combine them?

Alongside the users table, WordPress has a usersmeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usersmeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20 stated in the question), but there can be any number. If all users have one thousand meta-data entries, there are simply one thousand entries in the usermeta table for each user; in terms of the database structure there is no limit to the number of meta-data entries a user can have. It also permits one user to have one thousand meta-data entries while all others have 20, and still stores the data efficiently, or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and this is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com or on a shared host where the db user has no ALTER permissions). MySQL also has a hard limit of 4096 columns and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, along the way creating a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
WordPress is quite tied to MySQL, and therefore changing the datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
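The lookup that makes possible is just a join against the meta table; roughly, sketched here with Python's sqlite3 and simplified stand-ins for the wp_users/wp_usermeta tables rather than WordPress's own helpers:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wp_users    (ID INTEGER PRIMARY KEY, user_login TEXT);
    CREATE TABLE wp_usermeta (meta_id INTEGER PRIMARY KEY, user_id INTEGER,
                              meta_key TEXT, meta_value TEXT);
    INSERT INTO wp_users VALUES (1, 'tom');
    INSERT INTO wp_usermeta (user_id, meta_key, meta_value)
        VALUES (1, 'favorite_color', 'blue');
""")

# Find users by favorite color without ever altering the users table.
rows = conn.execute("""
    SELECT u.ID, u.user_login
    FROM wp_users u
    JOIN wp_usermeta m ON m.user_id = u.ID
    WHERE m.meta_key = 'favorite_color' AND m.meta_value = 'blue'
""").fetchall()
print(rows)   # [(1, 'tom')]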
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer: since there is a huge literature about this, and there are a lot of differences, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep them separate: space could be an issue, speed of queries may depend on it, or it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading: this is a deep topic, and you may want to learn more; there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database: Database design: one huge table or separate tables?, or First-time database design: am I overengineering?, or Database Normalization Basics on About.com.