What is the best way to store market data for an Algorithmic Trading setup?

I am building an algorithmic trading setup for trade automation. Currently, I have a broker API which gives me historical data for all the stocks I'm interested in.
I am wondering how to store all the data: in a file system or in a database (SQL-based or NoSQL)? The data comes through a REST API, if that's relevant.
My use case is to query historical data to make trading decisions in the live market. I will also need to develop a backtesting framework that queries the historical data to check how a strategy would have performed.
I am looking at a frequency of 5-minute to 1-hour candles and mostly intraday trading strategies. Thanks.

As you say, there are many options and as STLDeveloper says this is kind of off topic since it is opinion based... anyway...
A simple strategy which I used in my own Python back-testing engine is to use pandas DataFrame objects, and save/load them to disk in an HDF5 file using to_hdf() and read_hdf(). The primary advantage (for me) of HDF5 is that it loads/saves far more quickly than CSV.
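For illustration, here is a minimal sketch of that save/load round trip, assuming pandas with the PyTables backend (the "tables" package) installed; the file name, key, and candle values are placeholders.

```python
import pandas as pd

# Toy 5-minute candles; in practice these come from the broker's REST API.
candles = pd.DataFrame(
    {
        "open": [101.2, 101.6],
        "high": [101.9, 102.0],
        "low": [100.8, 101.1],
        "close": [101.5, 101.8],
        "volume": [12000, 9500],
    },
    index=pd.to_datetime(["2023-01-02 09:30", "2023-01-02 09:35"]),
)

# Persist to HDF5 (much faster round trip than CSV for large frames).
candles.to_hdf("candles.h5", key="AAPL_5min", mode="w")

# Reload for backtesting or live decision making.
loaded = pd.read_hdf("candles.h5", "AAPL_5min")
print(loaded.tail())
```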
Using the above approach I easily manage several years of 1 minute data for back testing purposes, and data access certainly is not my performance bottleneck.
You will need to determine for yourself if your chosen data management approach is fast enough for live trading, but in general I think if your strategy is based on 5-min candles then any reasonable database approach is going to be sufficiently performant for your purposes.

Related

Data model guidance, database choice for aggregations on changing filter criteria

Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use case, meaning there are no inserts, deletes, or updates; instead, every X weeks a new complete dump is imported to replace the old data.
We would like an upper bound of 10 seconds to answer our queries; the faster the better, and a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries, growth rate is linear.
The system has to serve a limited amount of concurrent users, max 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy to configure graphs and filters). BI tools are often used together with a data warehouse: The data is modeled, cleaned, and stored inside of the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would give you a variety of filtering and calculation options while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the amount of interactions, grouped and/or filtered by any of those columns.
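To make the "group and filter by any of those columns" point concrete, here is a small, hypothetical pandas sketch over a denormalized interactions table shaped like the one above; the column names and values are invented for illustration.

```python
import pandas as pd

# Toy denormalized interactions table matching the columns sketched above.
interactions = pd.DataFrame(
    {
        "date": pd.to_datetime(["2018-03-01", "2018-03-02", "2018-06-10"]),
        "initiator": ["alice", "alice", "hans"],
        "initiator_country": ["US", "US", "DE"],
        "responder_country": ["DE", "DE", "US"],
        "topic": ["music", "music", "sports"],
        "interaction_type": ["like", "comment", "like"],
        "responder_top_interest": ["ACDC", "ACDC", "football"],
    }
)

# "In 2018, which American interacted most with German friends that like ACDC?"
mask = (
    (interactions["date"].dt.year == 2018)
    & (interactions["initiator_country"] == "US")
    & (interactions["responder_country"] == "DE")
    & (interactions["responder_top_interest"] == "ACDC")
)
top = interactions[mask].groupby("initiator").size().sort_values(ascending=False)
print(top.head())
```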
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)

Are there any REAL advantages to NoSQL over RDBMS for structured data on one machine?

So I've been trying hard to figure out if NoSQL is really bringing that much value outside of auto-sharding and handling UNSTRUCTURED data.
Assuming I can fit my STRUCTURED data on a single machine OR have an effective 'auto-sharding' feature for SQL, what advantages do any NoSQL options offer? I've determined the following:
Document-based (MongoDB, Couchbase, etc.) - Outside of its 'auto-sharding' capabilities, I'm having a hard time understanding where the benefit is. Linked objects are quite similar to SQL joins, while embedded objects significantly bloat document size and create a replication challenge (a comment could belong to both a post AND a user, so the data would be redundant). Also, the loss of ACID guarantees and transactions is a big disadvantage.
Key-value based (Redis, Memcached, etc) - Serves a different use case, ideal for caching but not complex queries
Columnar (Cassandra, HBase, etc ) - Seems that the big advantage here is more how the data is stored on disk, and mostly useful for aggregations rather than general use
Graph (Neo4j, OrientDB, etc) - The most intriguing, the use of both edges and nodes makes for an interesting value-proposition, but mostly useful for highly complex relational data rather than general use.
I can see the advantages of key-value, columnar and graph DBs for specific use cases (caching, social-network relationship mapping, aggregations), but can't see any reason to use something like MongoDB for STRUCTURED data outside of its 'auto-sharding' capabilities.
If SQL has a similar 'auto-sharding' ability, would SQL be a no-brainer for structured data? It seems to me it would be, but I would like the community's opinion...
NOTE: This is in regards to a typical CRUD application like a Social Network, E-Commerce site, CMS etc.
If you're starting off on a single server, then many of the advantages of NoSQL go out the window. The biggest advantages of the most popular NoSQL databases are high availability and less downtime. Eventual-consistency requirements can lead to performance improvements as well. It really depends on your needs.
Document-based - If your data fits well into a handful of small buckets of data, then a document-oriented database is a good fit. For example, on a classifieds site we have Users, Accounts and Listings as the core data. The bulk of search and display operations are against the Listings alone. With the legacy database we have to do nearly 40 join operations to get the data for a single listing; with NoSQL it's a single query. With NoSQL we can also create indexes against nested data, again with results queried without joins (a rough sketch of this appears after this list). In this case, we're actually mirroring data from SQL to MongoDB for purposes of search and display (there are other reasons), with a longer-term migration strategy being worked on now. ElasticSearch, RethinkDB and others are great databases as well. RethinkDB actually takes a very conservative approach to the data, and ElasticSearch's out-of-the-box indexing is second to none.
Key-value store - Caching is an excellent use case here: when you are running a medium- to high-volume website where data is mostly read, a good caching strategy alone can get you 4-5 times the users handled by a single server. Key-value stores (RocksDB, LevelDB, Redis, etc.) are also very good options for graph data, as individual mappings can be held as subject-predicate-target values, which can be very fast for graphing options layered on top (see the second sketch after this list).
Columnar - Cassandra in particular can be used to distribute significant amounts of load for even single-value lookups. Cassandra's scaling is very linear to the number of servers in use. Great for heavy read and write scenarios. I find this less valuable for live searches, but very good when you have a VERY high load and need to distribute. It takes a lot more planning and may well not fit your needs. You can tweak settings to suit your CAP needs, and even handle distribution to multiple data centers in the box. NOTE: Most applications emphatically do NOT need this level of use. ElasticSearch may be a better fit in most scenarios where you would consider HBase/Hadoop or Cassandra.
Graph - I'm not as familiar with graph databases, so can't comment here (beyond using a key-value store as underlying option).
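As a rough illustration of the "index nested data, fetch a listing in one query" point from the document-based paragraph above, here is a hypothetical pymongo sketch; the collection layout and field names are invented, not the actual classifieds schema.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["classifieds"]

# One denormalized listing document instead of ~40 joined rows.
db.listings.insert_one(
    {
        "title": "Mountain bike",
        "price": 250,
        "seller": {"name": "alice", "rating": 4.8},
        "location": {"city": "Denver", "state": "CO"},
        "attributes": [{"k": "frame", "v": "aluminium"}, {"k": "size", "v": "M"}],
    }
)

# Index nested fields directly, then search and display without any joins.
db.listings.create_index([("location.city", ASCENDING), ("price", ASCENDING)])
for doc in db.listings.find({"location.city": "Denver", "price": {"$lte": 300}}):
    print(doc["title"], doc["price"])
```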
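And for the key-value-as-graph idea, a small sketch using redis-py; the key naming scheme (subject and predicate in the key, target nodes as set members) is just one possible convention.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Store graph edges as "subject:predicate" keys whose members are the target nodes.
r.sadd("user:alice:follows", "bob", "carol")
r.sadd("user:bob:follows", "carol", "dave")

# "Who do alice and bob both follow?" is a set intersection rather than a join.
common = r.sinter("user:alice:follows", "user:bob:follows")
print(sorted(m.decode() for m in common))
```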
Given that you then comment on MongoDB specifically vs. SQL... even if both auto-shard: PostgreSQL in particular has made a lot of strides in terms of making unstructured data usable (JSON/JSONB types), not to mention the power you can get from something like PLV8; it's probably the best suited to handling the types of loads you might throw at a document store, with the advantages of NoSQL. Where it happens to fall down is that replication, sharding and failover are bolted-on solutions, not really in the box.
For small to medium loads, sharding really isn't the best approach. Most scenarios are mostly reads, so having a replica set with additional read nodes is usually better when you have 3-5 servers. MongoDB is great in this scenario: the master node is automagically elected, and failover is pretty fast. The only weirdness I've seen was when Azure went down in late 2014 and only one of the servers came up at first; the other two came up almost 40 minutes later. With replication any given read request can be handled in whole by a single server. Your data structures become simpler, and your chances of data loss are reduced.
Again, in my own example above, for a medium-sized classifieds site, the vast majority of data belongs to a single collection... it is searched against, and displayed from, that collection. With this use case a document store works much better than structured/normalized data. The way the objects are stored is much closer to their representation in the application. There's less of a cognitive disconnect and it simply works.
The fact is that SQL JOIN operations kill performance, especially when aggregating data across those joins. For a single query for a single user it's fine, even with a dozen of them. When you get to dozens of joins with thousands of simultaneous users, it starts to fall apart. At this point you have several choices...
Caching - caching is always a great approach, and the less often your data changes, the better the approach. This can be anything from a set of memcache/redis instances to using something like MongoDB, RethinkDB or ElasticSearch to hold composite records. The challenge here comes down to updating or invalidating your cached data.
Migrating - migrating your data to a data store that better represents your needs can be a good idea as well. If you need to handle massive writes or truly massive read scenarios, no SQL database can keep up. You could NEVER handle the likes of Facebook or Twitter on SQL.
Something in between - As you need to scale it depends on what you are doing and where your pain points are as to what will be the best solution for a given situation. Many developers and administrators fear having data broken up into multiple places, but this is often the best answer. Does your analytical data really need to be in the same place as your core operational data? For that matter do your logins need to be tightly coupled? Are you doing a lot of correlated queries? It really depends.
Personal Opinions Ahead
For me, I like the safety net that SQL provides, so having it as the central store for core data is my first choice. I tend to treat RDBMSs as dumb storage; I don't like being tied to a given platform. I feel that many people try to over-normalize their data. Often I will add an XML or JSON field to a table so additional pieces of data can be stored without bloating the schema, specifically if it's unlikely to ever be queried... I'll then have properties in my objects in the application code that store in those fields. A good example may be a payment: if you are currently using one system, or multiple systems (one for CC along with PayPal, Google, Amazon, etc.), then the details of the transaction really don't affect your records, so why create 5+ tables to store this detailed data? You can even use JSON for primary storage and have computed columns derived and persisted from that JSON for broader query capability and indexing where needed. Databases like PostgreSQL and MySQL (IIRC) offer direct indexing against JSON data as well.
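A hypothetical sketch of that "JSON side field plus derived, indexable columns" pattern in PostgreSQL, driven from Python with psycopg2; the table, column names, and connection string are made up, and the generated column requires PostgreSQL 12+.

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=shop user=app")  # placeholder connection details
cur = conn.cursor()

# Core columns stay relational; provider-specific payment details go into one JSONB field.
cur.execute("""
    CREATE TABLE IF NOT EXISTS payments (
        id       bigserial PRIMARY KEY,
        user_id  bigint NOT NULL,
        amount   numeric(10,2) NOT NULL,
        details  jsonb,
        -- Promote one JSON key to a persisted, indexable column (PostgreSQL 12+).
        provider text GENERATED ALWAYS AS (details->>'provider') STORED
    );
    CREATE INDEX IF NOT EXISTS payments_details_gin ON payments USING gin (details);
""")

cur.execute(
    "INSERT INTO payments (user_id, amount, details) VALUES (%s, %s, %s)",
    (42, 19.99, Json({"provider": "paypal", "txn_id": "abc123"})),
)
conn.commit()
```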
When data is a natural fit for a document store, I say go for it... if the vast majority of your queries are for something that fits better to a single record or collection, denormalize away. Having this as a mirror to your primary data is great.
For write-heavy data you want multiple systems in play... It depends heavily on your needs here... Do you need fast hot-query performance? Go with ElasticSearch. Do you need absolutely massive horizontal scale? HBase or Cassandra.
The key takeaway here is not to be afraid to mix it up... there really isn't a one-size-fits-all. As an aside, I feel that if PostgreSQL comes up with a good in-the-box solution (for the open-source version) for even just replication and automated failover, they'll be in a much better position than most at that point.
I didn't really get into it, but I feel I should mention that there are a number of SaaS solutions and other providers that offer hybrid SQL systems. You can develop against MySQL/MariaDB locally and deploy to a system with SQL on top of a distributed storage cluster. I still feel that HBase or ElasticSearch are better for logging and analytical data, but the SQL-on-top solutions are also compelling.
More: http://www.mongodb.com/nosql-explained
Schema-less storage (or schema-free). Ability to modify the storage (basically add new fields to records) without having to modify the storage 'declared' schema. RDBMSs require the explicit declaration of said 'fields' and require explicit modifications to the schema before a new 'field' is saved. A schema-free storage engine allows for fast application changes, just modify the app code to save the extra fields, or rename the fields, or drop fields and be done.
Traditional RDBMS folk consider schema-free a disadvantage because they argue that in the long run one needs to query the storage, and handling heterogeneous records (some have some fields, some have other fields) makes that difficult. But for a start-up, schema-free is overwhelmingly alluring, as fast iteration and time-to-market are all that matter (and often rightly so).
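A tiny, purely illustrative sketch of that trade-off in application code: writes need no schema change, but reads have to tolerate records written before a field existed.

```python
# Writing: newer records simply carry extra fields; no ALTER TABLE equivalent is needed.
old_user = {"_id": 1, "name": "alice"}
new_user = {"_id": 2, "name": "bob", "referral_code": "SPRING24", "plan": "pro"}

# Reading: the application must cope with heterogeneous records.
def plan_for(user: dict) -> str:
    return user.get("plan", "free")  # sensible default for legacy records

for u in (old_user, new_user):
    print(u["name"], plan_for(u))
```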
You asked us to assume that either the data can fit on a single machine, OR your database has an effective auto-sharding feature.
Going with the assumption that your SQL data has an auto-sharding feature, that means you're talking about running a cluster. Any time you're running a cluster of machines you have to worry about fault-tolerance.
For example, let's say you're using the simplest approach of sharding your data by application function, and are storing all of your user account data on server A and your product catalog on server B.
Is it acceptable to your business if server A goes down and none of your users can login?
Is it acceptable to your business if server B goes down and no one can buy things?
If not, you need to worry about setting up data replication and high-availability failover. Doable, but not pleasant or easy for SQL databases. Other types of sharding strategies (key, lookup service, etc) have the same challenges.
Many NoSQL databases will automatically handle replication and failovers. Some will do it out of the box, with very little configuration. That's a huge benefit from an operational point of view.
Full disclosure: I'm an engineer at FoundationDB, a NoSQL database that automatically handles sharding, replication, and failover with very little configuration. It also has a SQL layer, so you don't have to give up structured data.

Optimized storage for large integer series

I'm currently designing the back end of a start-up from scratch. We scrape time series from the Internet: a large number of integers every minute, stored as timestamped rows in CSV files.
We haven't started to exploit the data properly yet, as we are still in the design phase. I was wondering: what would be the optimal storage for several years of integer series? We started to look toward loading it into Postgres, but is SQL suited to exploiting time series?
I was expecting to find some miracle software that would be optimal for handling this kind of specific dataset, and would be glad to hear any suggestion that would enable:
Persistent large storage
Averaging/grouping calculation, possibly other R-like features
Gain in performance, power or ease of use compared to raw sql database storage
Every minute, 8,000 values translates into 11.5 million values per day or 4 billion rows per year. This is a heavy load. Just the insert load (using whatever ACID-compliant method) is noticeable -- over 100 inserts per second. This is definitely manageable in modern database systems, but it is not trivial.
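Spelled out, using the ~8,000 values per minute figure above:

```python
values_per_minute = 8_000
per_day = values_per_minute * 60 * 24          # 11,520,000 values per day
per_year = per_day * 365                       # ~4.2 billion rows per year
inserts_per_second = values_per_minute / 60    # ~133 row inserts per second
print(per_day, per_year, round(inserts_per_second))
```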
It is quite likely that Postgres can handle this load, with appropriate indexes and partitioning schemes. The exact nature of this solution depends on the queries that you need to run, but Postgres does have the underlying tools to support it.
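For concreteness, here is a hypothetical sketch of what such indexing and partitioning could look like, again via psycopg2; the table layout and the monthly partition granularity are assumptions, and declarative partitioning needs PostgreSQL 10+.

```python
import psycopg2

conn = psycopg2.connect("dbname=metrics user=app")  # placeholder connection details
cur = conn.cursor()

# Range-partition the raw series by time; one partition per month shown here.
cur.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        ts        timestamptz NOT NULL,
        series_id integer     NOT NULL,
        value     bigint      NOT NULL
    ) PARTITION BY RANGE (ts);

    CREATE TABLE IF NOT EXISTS samples_2024_01
        PARTITION OF samples FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    CREATE INDEX IF NOT EXISTS samples_2024_01_idx
        ON samples_2024_01 (series_id, ts);
""")
conn.commit()

# Averaging/grouping queries constrained on ts touch only the relevant partitions.
cur.execute("""
    SELECT series_id, date_trunc('hour', ts) AS hour, avg(value)
    FROM samples
    WHERE ts >= '2024-01-10' AND ts < '2024-01-11'
    GROUP BY series_id, hour
""")
print(cur.fetchall())
```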
However, your requirements are (in my opinion) bigger than Stack Overflow can provide. If you are designing such a system, you should enlist the help of a professional Postgres DBA. I might add that you could consider looking at cloud-based solutions such as Amazon Redshift or Microsoft Azure because these allow you to easily scale the system "just" by paying more money.

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (<2 seconds),
(2) data can be aggregated on a per-hour and per-day basis, and
(3) a large amount of data can still be stored in the warehouse.
Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.
Is it possible to get real-time query performance from a "simple" data warehouse like this, and still have it store a lot of data (it would be nice to never have to throw away any data)?
Are there ways to aggregate the data into higher-level, pre-aggregated tables?
I have a feeling this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions for this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this:
dimensions:
client (ip-address)
server
url
facts:
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable answer and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, capped collections (search for "mongodb capped collections" for more info) and the fast atomic increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
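A hypothetical pymongo sketch of those two features (capped collections and atomic increments); the collection and field names are illustrative.

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]

# Capped collection: fixed size, insertion-ordered, old documents age out automatically.
if "recent_hits" not in db.list_collection_names():
    db.create_collection("recent_hits", capped=True, size=100 * 1024 * 1024)
db.recent_hits.insert_one({"url": "/home", "ts": "2010-05-01T12:00:00Z"})

# Atomic $inc: cheap real-time counters, e.g. page views per URL per hour.
db.hourly_counts.update_one(
    {"url": "/home", "hour": "2010-05-01T12"},
    {"$inc": {"hits": 1}},
    upsert=True,
)
```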
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)
Also, look into partitioning, especially if your queries mostly access the latest data; you could, for example, set up weekly partitions of ~5.5M rows.
If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
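As a sketch of the "date-keyed fact table partitioned to match" idea, here is some hypothetical MySQL DDL and a pruned query, driven from Python with mysql-connector-python; the table names, keys, and weekly boundaries are invented.

```python
import mysql.connector

conn = mysql.connector.connect(user="app", database="weblogs")  # placeholder credentials
cur = conn.cursor()

# Small date dimension keyed by an integer date key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key      INT PRIMARY KEY,
        calendar_date DATE NOT NULL
    )
""")

# Fact table partitioned on the same date key, so filters on date_key
# let the optimizer prune whole partitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_access (
        date_key   INT NOT NULL,
        time_key   INT NOT NULL,
        client_key INT NOT NULL,
        server_key INT NOT NULL,
        url_key    INT NOT NULL,
        bytes      BIGINT NOT NULL
    )
    PARTITION BY RANGE (date_key) (
        PARTITION p_w01 VALUES LESS THAN (20240108),
        PARTITION p_w02 VALUES LESS THAN (20240115)
    )
""")

# No HOUR()/DATE() functions on the fact column: filter on the date key itself.
cur.execute("""
    SELECT d.calendar_date, SUM(f.bytes)
    FROM fact_access f
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE f.date_key BETWEEN 20240101 AND 20240107
    GROUP BY d.calendar_date
""")
print(cur.fetchall())
```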
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data, especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
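As a toy sketch of "the aggregate is just a group-by of the fact table", here is a hypothetical pandas version of that ETL step; in practice it would read from the log pipeline or the detail table rather than an inline frame.

```python
import pandas as pd

# Toy fact rows at request grain.
fact = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2024-01-10 12:01:05", "2024-01-10 12:07:40", "2024-01-10 13:02:11"]
        ),
        "url": ["/home", "/home", "/search"],
        "bytes": [1200, 800, 450],
    }
)

# Hourly aggregate: one group-by, materialized for fast time-series graphing.
hourly = (
    fact.assign(hour=fact["ts"].dt.floor("h"))
        .groupby(["hour", "url"], as_index=False)
        .agg(hits=("url", "size"), bytes=("bytes", "sum"))
)
print(hourly)
```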
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try PostgreSQL rather than MySQL. The reason is primarily optimizer maturity: PostgreSQL handles the kinds of queries you're likely to run better. MySQL is more likely to get confused on five-way joins, or to go bottom-up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like DB2, Oracle, or SQL Server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, and additional optimizer sophistication.

In terms of today's technology, are these meaningful concerns about data size?

We're adding extra login information to an existing database record on the order of 3.85KB per login.
There are two concerns about this:
1) Is this too much on-the-wire data added per login?
2) Is this too much extra data we're storing in the database per login?
Given today's technology, are these valid concerns?
Background:
We don't have concrete usage figures, but we average about 5,000 logins per month. We hope to scale to larger customers; however, still in the tens of thousands of logins per month, not thousands per second.
In the US (our market) broadband has 60% market adoption.
Assuming you have ~80,000 logins per month, you would be adding ~ 3.75 GB per YEAR to your database table.
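For reference, the arithmetic behind that estimate (80,000 logins per month at 3.85 KB each):

```python
logins_per_month = 80_000
kb_per_login = 3.85
gb_per_year = logins_per_month * kb_per_login * 12 / 1_000_000
print(round(gb_per_year, 2))  # roughly 3.7 GB of new data per year
```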
If you are using a decent RDBMS like MySQL, PostgreSQL, SQLServer, Oracle, etc... this is a laughable amount of data and traffic. After several years, you might want to start looking at archiving some of it. But by then, who knows what the application will look like?
It's always important to consider how you are going to be querying this data, so that you don't run into performance bottlenecks. Without those details, I cannot comment very usefully on that aspect.
But to answer your concern, do not be concerned. Just always keep thinking ahead.
How many users do you have? How often do they have to log in? Are they likely to be on fast connections, or damp pieces of string? Do you mean you're really adding 3.85K per time someone logs in, or per user account? How long do you have to store the data? What benefit does it give you? How does it compare with the amount of data you're already storing? (i.e. is most of your data going to be due to this new part, or will it be a drop in the ocean?)
In short - this is a very context-sensitive question :)
Given that storage and hardware are SOOO cheap these days (relatively speaking, of course), this should not be a concern. Obviously if you need the data then you need the data! You can use replication to several locations so that the added data doesn't need to move as far over the wire (such as a server on the west coast and one on the east coast). You can manage your data by separating it by state to minimize the size of your tables (similar to what banks do: choose your state as part of the login process so that they look in the right data store). You can use horizontal partitioning to minimize the number of records per table and keep your queries speedy. There are lots of ways to keep large data optimized. Also check into Lucene if you plan to do lots of reads on this data.
In terms of today's average server technology it's not a problem. In terms of your server technology it could be a problem. You need to provide more info.
In terms of storage, this is peanuts, although you want to eventually archive or throw out old data.
In terms of network traffic, this is not much on the server end, but it will affect how quickly your website appears to load and function for a good portion of customers. Although many have broadband, someone somewhere will try it on EDGE or a modem, or while using BitTorrent heavily; your site will appear slow or malfunction altogether and you'll get loud complaints all over the web. Does it matter? If your users really need your service, they can surely wait; if you are developing the new Twitter, the increase in page load time is hardly acceptable.