I'm creating an app that has to store historical financial data for various stocks.
I currently have a stock table whose columns are the stock symbol and stock name, along with numerical data that I'm trying to decide how to store.
For example, for the column stockprice, I want to store an entire hash where the key is the date as a string and the value is the stock price. This information should be easily accessible (fast random access). I've read a bit about serializing, but I wonder if this is the best option (or if it's even applicable at all). Is there a way to instead automatically generate an SQLite table for each stock entered, with columns representing the date and rows representing the stock price?
I'd appreciate any insight into this matter, and perhaps some clarification on whether this is exactly the kind of case where I should use serialization, or whether there is a better alternative.
EDIT 1: Is ActiveModel Serialization relevant? (http://railscasts.com/episodes/409-active-model-serializers)
EDIT 2: Is it advisable to consider instead creating a StockPrice model and table, where the model belongs_to a stock and a stock has_many stock prices? The stock_prices table would have a regular id, a stock_id (for the stock it belongs to), a date column, and a price value column. I'd appreciate some analysis of the run-time and memory usage of this compared to serialization, and how to analyze it myself in the future.
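For reference, a minimal sketch of what that second approach might look like; the model, table, and column names here are only assumptions based on the description above:

class Stock < ActiveRecord::Base
  has_many :stock_prices
end

class StockPrice < ActiveRecord::Base
  belongs_to :stock
end

# hypothetical migration for the stock_prices table
def self.up
  create_table :stock_prices do |t|
    t.references :stock                              # stock_id foreign key
    t.date :date                                     # trading day
    t.decimal :price, :precision => 12, :scale => 4  # price on that day
  end
  add_index :stock_prices, [:stock_id, :date]        # supports fast per-stock, per-date lookups
end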
You are correct, it is possible to store it as a hash. I don't have any metrics for serialize but I would suggest starting this way and optimizing your data storage later if you begin to notice a significant impact on your application.
Your migration would look something like this (be sure to use the text data type):
def self.up
  add_column :stocks, :price, :text
end
In your model you will need to add
serialize :price
You will be able to create price as a hash and store it directly.
stock = Stock.new
stock.price = { :date => "#{Time.now}", :amount => 25.2 }
stock.save
EDIT: I would start with the serialization unless you have designed functionality that is specific to stock_price. Since the convention in Rails is to have one model per table, you would end up with a class for stock_price. It isn't necessary to dedicate a class to stock_price unless you have specific methods in mind for that class. Also, depending on your design, you may be able to keep the Stock class more cohesive by keeping the stock price as an attribute of stocks.
You mentioned 'This information should be easily accessible (fast random access)', in which case a serialized column is not a good option. Let's say you keep 20 years of data; that would be 20*365 key-value pairs in the serialized price. But you might be interested in only a subset of this for a use case, say plotting the last 6 months. If you go with the serialize option, that entire data (for the price field) will be transported from the db to the ruby process and deserialized, and then you still need to filter the price hash in the ruby process. Whereas with a separate table for prices, the db can do the filtering for you and you can get a fast response with good indices.
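To illustrate the difference, assuming the separate StockPrice table and the [:stock_id, :date] index sketched above, the date-range filtering happens in the database and only the slice you need is loaded:

# only the last ~6 months of rows cross the wire, not the whole 20-year hash
stock.stock_prices.where(:date => 6.months.ago.to_date..Date.today).order(:date)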
Have you explored any time series dbs?
Related
I'm currently retrieving data from a vehicle fleet (fuel used, distance traveled...) through the manufacturer API. For each set of data, there is the date when the metric was measured. I will retrieve the data every day through a cron job and store it in my DB. The purpose is to keep the history for each vehicle so that I could later get every metric measured for a vehicle between one date and another.
I first thought of this :
But it doesn't seem right because of the "1,1" cardinality, so I transformed the 3-way relationship into 2 normal relationships:
At this point I wondered whether I could not simply store the date field in the metric entity (because I noticed the API gives me a datetime, so it's very unlikely that two metrics will be measured at the same time):
And finally, I was wondering if putting everything in a single data entity would not be even easier (but it feels kind of wrong):
So I'm completely lost as to what would be the best way to do this. Could someone tell me which way is best, or even if there is a better way, and why?
What is the best way to model the following scenario? User has multiple portfolios, each with multiple stocks.
I have come up with the following:
Stocks will be in a hash as below
stk:1 {name:A, ticker:val, sector:val ..}
stk:2 {name:B, ticker:val, sector:val ..}
Users can be a hash as below (is it better to store portfolios for a user separately as a set?):
user:1 {k1:val1, k2:val2, portfolios:"value|growth|short term"}
user:2 {k1:val3, k2:val4, portfolios:"value|defensive|penny"}
Stocks in a portfolio can be sets
user:1:value (1,3)
user:2:value (2,3,4)
user:1:short term (1,5)
user:2:penny (4)
In order to add/remove a portfolio for a user, it's required to do an 'HGET user:n portfolios' followed by an HSET.
Is this a good way to model as the number of users and portfolios grow?
If a user can have multiple portfolio types then it would be best to separate them into their own sets.
sadd user:1:portfolios value growth "short term"
This makes removing a portfolio from a user as simple as calling srem user:1:portfolios value on the set (and of course removing the "user:ID:TYPE" set).
When you want to look up stocks for a user based on portfolio type, you can do so using the sunionstore and sort commands (example in Ruby):
keys = redis.smembers('user:1:portfolios').map do |type|
  "user:1:#{type}"
end

redis.multi do |r|
  r.sunionstore "user:1:stocks:_tmp", *keys
  r.sort "user:1:stocks:_tmp", get: ["stk:*->name", "stk:*->ticker"]
  r.del "user:1:stocks:_tmp"
end
stk:*->name will return only the hash values for name. If you want to get all entries in the hash, specify each of them using the 'KEY->HASHKEY' syntax.
http://redis.io/commands/sort
There is no best way to model something: it all depends on your access paths.
For instance, your proposal will work well if you systematically access the data from the user perspective. It will be very poor if you want to know which users have a certain stock in their portfolios. So my suggestion would be to list all the expected access paths and check they are covered by the data structure.
Supposing you only need the user perspective, I would rather materialize the portfolios as a separate set, instead of storing a serialized list in the user hash: they will be easier to maintain. Because you can use pipelining (or scripting) to run multiple commands in a single roundtrip, there is no real overhead.
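As a minimal sketch of that suggestion (using the redis-rb gem, with key names that are only assumptions following the scheme above), adding a portfolio and its stocks is just two set writes sent in a single roundtrip:

redis.pipelined do |r|
  r.sadd "user:1:portfolios", "value"   # register the portfolio type for the user
  r.sadd "user:1:value", [1, 3]         # add stock ids to that portfolio's set
end

Removing a portfolio is the mirror image: srem its name from user:1:portfolios and del the user:1:value set.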
When I add a new document to my collection X, I need to get the last document that was inserted in that same collection, because some values of that document must influence the document I am currently inserting.
Basically as a simple example I would need to do that:
class X
  include Mongoid::Document
  include Mongoid::Timestamps

  before_save :set_sum

  def set_sum
    self.sum = X.last.sum + self.misc
  end

  field :sum, :type => Integer
  field :misc, :type => Integer
end
How can I make sure this kind of process will never break if there are concurrent inserts? I must make sure that when self.sum = X.last.sum + self.misc is calculated, X.last.sum absolutely represents the last document inserted in the collection.
This is critical to my system. It needs to be thread safe.
Alex
PS: this also needs to be performant; when there are 50k documents in the collection, it can't take long to get the last value...
This kind of behavior is equivalent to having an auto-incrementing id.
http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field
The cleanest way is to have a side collection with one (or more) docs representing the current total values.
Then in your client, before inserting the new doc, do a findAndModify() that atomically updates the totals AND retrieves the current total doc.
Part of the current doc can be an auto increment _id, so that even if there are concurrent inserts, your document will then be correctly ordered as long as you sort by _id.
Only caveat: if your client app dies after findAndModify and before insert, you will have a gap in there.
Either that's ok or you need to add extra protections like keeping a side log.
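For illustration, a rough sketch of that counter pattern with the current MongoDB Ruby driver, where find_one_and_update is the driver-level wrapper around findAndModify; client is assumed to be a Mongo::Client, and the collection and field names are made up for this example:

# atomically bump the running totals and get the updated counter back
totals = client[:x_totals].find_one_and_update(
  { :_id => "x" },
  { "$inc" => { :seq => 1, :sum => misc } },
  :upsert => true, :return_document => :after
)

# then insert the new document, ordered by the sequence we just reserved
client[:xs].insert_one(:seq => totals["seq"], :misc => misc, :sum => totals["sum"])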
If you want to be 100% safe you can also get inspiration from 2-phase commit
http://www.mongodb.org/display/DOCS/two-phase+commit
Basically it is the proper way to do transactions with any db that spans more than 1 server (even SQL wouldn't help there).
best
AG
If you need to keep a running sum, this should probably be done on another document in a different collection. The best way to keep this running sum is to use the $inc atomic operation. There's really no need to perform any reads while calculating this sum.
You'll want to insert your X document into its collection, then also $inc a value on a different document that is meant for keeping this tally of all the misc values from the X documents.
Note: This still won't be transactional, because you're updating two documents in two different collections separately, but it will be highly performant, and fully thread safe.
For more info, check out all the MongoDB Atomic Operations.
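A quick sketch of what that might look like with the Ruby driver (collection and field names are assumptions; note this variant keeps a single running total in a side document rather than a per-document sum):

# write the new document, then bump the tally kept in a separate totals collection
client[:xs].insert_one(:misc => misc)
client[:totals].update_one(
  { :_id => "misc_sum" },
  { "$inc" => { :sum => misc } },
  :upsert => true
)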
I have a User model with many fields, and I would like to display a table as a matrix of 2 of those fields:
- created_at
- type
For the created_at I simply used a group_by, like so:
User.where(:type => "blabla").all.group_by { |item|
  item.send(:created_at).strftime("%Y-%m-%d")
}.sort.each do |creation_date, users|
  # ...
end
This gives me a nice array of all the users per creation_date, so the lines of my table are ok. However, I want to display multiple lines, each representing the sub-selection of the users per type.
So for the moment, I am performing one request per line (per type, simply replacing the "blabla").
For the moment it's ok because I have just a few types, but this number will soon increase a lot, and this will not be efficient, I am afraid.
Any suggestions on how I could achieve my expected result?
Thanks,
Alex
The general answer here is to perform a Map / Reduce. Generally, you do not perform the map-reduce in real time due to performance constraints. Instead you run the map-reduce on a schedule and query against the results directly.
Here's a primer on map-reduce for Ruby. Here's another example using Mongoid specifically.
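As a rough sketch of that idea (assuming a recent Mongoid where criteria respond to map_reduce; the output collection name and the date formatting are made up for this example), you could emit one count per (day, type) pair and sum them in the reduce step:

map = <<-JS
  function() {
    emit({ day: this.created_at.toISOString().slice(0, 10), type: this.type }, 1);
  }
JS

reduce = <<-JS
  function(key, values) {
    return Array.sum(values);
  }
JS

# run this on a schedule; the table cells can then be read straight
# from the user_counts_by_day_and_type collection
User.all.map_reduce(map, reduce).out(:replace => "user_counts_by_day_and_type").to_a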
What are your thoughts on this? I'm working on integrating some new data that's in a tab-delimited text format, and all of the decimal columns are kept as single integers; in order to determine the decimal amount you need to multiply the number by .01. It does this for things like percentages, weight and pricing information. For example, an item's price is expressed as 3259 in the data files, and when I want to display it I would need to multiply it in order to get the "real" amount of 32.59.
Do you think this is a good or bad idea? Should I be keeping my data structure identical to the one provided by the vendor, or should I make the database columns true decimals and use SSIS or some sort of ETL process to automatically multiply the integer columns into their decimal equivalent? At this point I haven't decided if I am going to use an ORM or Stored Procedures or what to retrieve the data, so I'm trying to think long term and decide which approach to use. I could also easily just handle this in code from a DTO or similar, something along the lines of:
public class Product
{
    // ...
    private int _price;

    public decimal Price
    {
        get
        {
            // vendor stores prices in hundredths, e.g. 3259 means 32.59
            return this._price * 0.01m;
        }
        set
        {
            this._price = (int)(value * 100m);
        }
    }
}
But that seems like extra and unnecessary work on the part of a class. How would you approach this, keeping in mind that the data is provided in integer format by a vendor that you will regularly need to get updates from?
"Do you think this is a good or bad idea?"
Bad.
"Should I be keeping my data structure identical to the one provided by the vendor?"
No.
"Should I make the database columns true decimals?"
Yes.
It's so much simpler to do what's right. Currently, the data is transmitted with no "." to separate the whole part from the fractional part; that doesn't have any real significance.
The data is decimal. Decimal math works. Use the decimal math provided by your language and database. Don't invent your own version of Decimal arithmetic.
Personally I would much prefer to have the data stored correctly in my database and just do a simple conversion every time an update comes in.
Pedantically: they aren't kept as ints either. They are strings that require parsing.
Philosophically: you have information in the file and you should write data into the database. That means transforming the information in any ways necessary to make it meaningful/useful. If you don't do this transform up front, then you'll be doomed to repeat the transform across all consumers of the database.
There are some scenarios where you aren't allowed to transform the data, such as being able to answer the question: "What was in the file?". Those scenarios would require the data to be written as string - if the parse failed, you wouldn't have an accurate representation of the file.
In my mind the most important facet of using Decimal over Int in this scenario is maintainability.
Data stored in the tables should be clearly meaningful without the need for arbitrary manipulation. If manipulation is required, it should be clearly evident that it is (such as from the field name).
I recently dealt with data where days of the week were stored as the values 2-8. You cannot imagine the fallout this caused (testing didn't show the problem for a variety of reasons, but live use did cause political explosions).
If you do ever run into such a situation, I would be absolutely certain to ensure data cannot be written to or read from the table without the use of stored procedures or views. This enables you to ensure the necessary manipulation is both enforced and documented. If you don't have both of these, some poor sod who follows you in the future will curse your very name.