I have the following locations table:
----------------------------------------------------------
| ID | zoneID | storeID | address | latitude | longitude |
----------------------------------------------------------
and the phones table:
-----------------------
| locationID | number |
-----------------------
Now, keep in mind that any given store can have up to five phone numbers, tops. Order doesn't matter.
Recently we needed to add another table that would contain store-related info, which would also include phone numbers.
Now, locationID doesn't apply to this new table, so we can't store its phones in the existing phones table.
Keeping the DB normalized would require, in the end, 2 new tables and a total of 4 joins to retrieve the data. Denormalizing it would render the old table like:
----------------------------------------------------------------------------------
| ID | zoneID | storeID | address | latitude | longitude | phone1 | ... | phone5 |
----------------------------------------------------------------------------------
and having a total of 2 tables and 2 joins.
I'm not a fan of having data1, data2, data3 fields as it can be a huge pain. So, what's your opinion?
My opinion, for what it's worth, is that de-normalisation is something you do to gain performance if, and only if, you actually have a performance problem. I always design for 3NF and only revert if absolutely necessary.
It's not something you do to make your queries look nicer. Any decent database developer would not fear a moderately complex SQL statement although I do have to admit I've seen some multi-hundred-line statements that gave me the shivers - mind you, these were from customers who had no control over the schema: a DBA would have first re-engineered the schema to avoid such a monstrosity.
But, as long as you're happy with the limitations imposed by de-normalisation, you can do whatever you want. It's not as if there's a band of 3NF police roaming the planet looking for violators :-)
The immediate limitations (there may be others) that I can see are:
You'll be limited (initially, without a schema change) to five phone numbers per location. From your description, it doesn't appear you see this as a problem.
You'll waste space storing data that doesn't have to be there. In other words, every row uses space for five numbers regardless of what they actually have, although this impact is probably minimal (e.g., if they're varchar and nullable).
Your queries to look up a phone number will be complicated since you'll have to check five different columns. Whether that's one of your use cases, I don't know, so it may be irrelevant.
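For illustration only - assuming the denormalised locations layout above and a bind variable :wanted_number - such a lookup might end up looking like this sketch:
-- find which location (and store) a given phone number belongs to
select l.ID, l.storeID, l.address
from locations l
where :wanted_number in (l.phone1, l.phone2, l.phone3, l.phone4, l.phone5);
Every extra phone column you add later means touching every query of this kind.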
You should probably choose one way or the other though (I'm not sure if that's your intent here). I'd be particularly annoyed if I came across a schema that had phone numbers in both the store table and a separate phone numbers table, especially if they disagreed with each other. Even when I de-normalise, I tend to use insert/update triggers to ensure data consistency is maintained.
I think your problem stems from an erroneous model.
Why do you have a location id and a store id? Can a store occupy more than one location?
Is the phone number tied to a geographic location?
Just key everything by StoreId and your problems will disappear.
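As a rough sketch of that - assuming a store table keyed by StoreId exists, which your post doesn't show, so treat the names as illustrative - the phones would simply hang off the store:
-- one row per phone number, keyed by store instead of location
create table store_phone (
    store_id integer     not null references store (id),
    number   varchar(20) not null,
    primary key (store_id, number)
);
-- the five-numbers-per-store cap would need a check or trigger if you want to enforce it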
Just try to relate your new table to the old locations table; since both tables represent the store, you should be able to find some way to relate them. If you can do that, your problem is solved, because then you can keep using the phones table as before.
Relating the new table to the old locations table will help you beyond just getting phone numbers.
I have a large dataset
Table: id | info1 | info2 | ...
There are multiple processes accessing the data heavily.
Is there a built-in way in PostgreSQL, or an extension, to know the number of times a row is accessed with SELECT?
I was thinking of writing a stored procedure and managing it manually.
No, there is none in any DB I know of, and it would be hard to implement too. Sometimes you don't need to physically access the data on the drive at all, because you either find it in the cache or take it from an index. Such a procedure - if it existed and worked as required - would hit the performance of the DB hard.
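If you do go down the manual route you mention, a minimal sketch could be a side table that the application (or a wrapper function) bumps on every read - the table and column names here are made up, and ON CONFLICT needs PostgreSQL 9.5 or later:
-- side table holding a hit counter per row of the main table
create table row_access_stats (
    row_id bigint primary key,
    hits   bigint not null default 0
);

-- run together with (or instead of) the plain SELECT
insert into row_access_stats (row_id, hits)
values (:id, 1)
on conflict (row_id) do update set hits = row_access_stats.hits + 1;

select * from big_table where id = :id;
Note that this turns every read into a write as well, which is exactly the kind of performance hit described above, and hot rows will contend on the counter update.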
Currently I am working on a problem where I have to log data in an Oracle 10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point; these share a bit of information and the rest is device specific.
So I could create arrays for every device-specific column, and if a device is in use the corresponding array field gets populated.
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space by doing it like that, because the arrays would be only sparsely populated.
The other method I can think of would be to just use the same ID for a multi-row entry and then put as many rows into the table as I have devices in use.
ID TIMESTAMP BOARD DEVICE_ID ERROR_CNT TEMP MORE_DATA
1 437892 1 1 100 25 xxx
1 437892 1 2 50 28 yyy
Now the shared information is stored multiple times in the database and the data is scattered across multiple rows.
Another issue is that some columns might only be used by a subset of the devices while others do not carry that information, so there might be even more unused fields. So maybe it would be best to create multiple tables, split the devices into groups according to the information they have, and log their data in the corresponding tables.
I appreciate any help; maybe I am just being paranoid about wasted DB space and should not worry about it and simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;

-- a collection type, and a table with a nested-table column
create or replace type error_count_type is table of number;

create table device(id number, error_count error_count_type)
nested table error_count store as error_count_table;

insert into device values(1, error_count_type(10, 20));
commit;

-- unnest the collection with TABLE(...) and sum its elements
select sum(t.column_value) error_count
from device d
cross join table(d.error_count) t;
Not many people or tools understand creating types, store as, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
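For what it's worth, a minimal sketch of that multi-table layout - the names are only illustrative - would be a header table for the shared values and a detail table with one row per reporting device:
-- shared values, recorded once per logging event
create table log_event (
    event_id number    primary key,
    log_time timestamp not null,
    board    number    not null
);

-- one row per device that actually reported during the event
create table log_event_device (
    event_id  number not null references log_event (event_id),
    device_id number not null,
    error_cnt number,
    temp      number,
    more_data varchar2(100),
    primary key (event_id, device_id)
);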
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with alter table statements and that might affect queries you have already written.
I'd like some input on designing the SQL data layer for a service that should store and provide the latest N entries for a specific user. The idea is to track each user (id), the time of an event and then the event id.
The service should only respond with the last X events for each user, and should only include events that occurred during the last Y days.
The service also needs to scale to large amounts of updates and reads.
I'm considering just a simple table with the fields:
ID | USERID | EVENT | TIMESTAMP
============================================
1 | 1 | created file Z | 2014-03-20
2 | 2 | deleted dir Y | 2014-03-20
3 | 1 | created dir Y | 2014-03-20
But how would you consider solving the temporal requirements? I see two alternatives here:
1) On inserts and/or reads for a user, also remove outdated events and all but the last X events for that user. This affects latency, as you need to perform a select, a delete and an insert on each request, but it keeps the disk usage to a minimum.
2) Let the service filter on query and do pruning as a separate batch job with some sql that:
First removes all obsolete events, irrespective of user, based on the timestamp.
Then does some join that removes all but the last X events for each user.
I have looked for design principles regarding these requirements, which seem fairly common, but I haven't yet found a perfect match.
It is at the moment NOT a requirement to query for all users that have performed a specific type of event.
Thanks in advance!
Edit:
The service is meant to scale to millions of requests / hour so I've been playing around with the idea of denormalizing this for performance reasons. Given that the requirements are set in stone:
10 last events
No events older than 10 days
I'm actually considering a pivoted table like this:
USERID | EV_1 | TS_1 | EV_2 | TS_2 | EV_3 | TS_3 | etc up to 10...
======================================================================
1 | Create | 2014.. | Del x | 2013.. | etc.. | 2013.. |
This way I can probably shift the events with a MERGE with SELECT and I get eviction for "free". Then I only have to purge all records where TS_1 is older than 10 days. I can also filter in my application logic to only show the events that are newer than 10 days after doing the trivial selects.
The caveat is if events come in "out of order". The idea above works if I can always guarantee that the events are ordered from "left to right". I probably have to think a bit on that one.
Aside from the fact that it basically breaks with the relational data model, do you think I'm on the right track here if the goal is to prioritize performance above all?
Your table design is good. Consider also the indexes you want to use. In practice, you will need a multi-column index on (userid, timestamp) to quickly respond to queries that query the last N events having a certain userid. Then you need a single-column index on (timestamp) to efficiently delete old events.
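As a sketch, using the table from your question (called simple here, matching the query in the next answer), the two indexes would look something like:
-- serves "last N events for user X ordered by time" queries
create index ix_simple_user_ts on simple (userid, timestamp);

-- serves the bulk delete of events older than Y days
create index ix_simple_ts on simple (timestamp);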
How many events are you planning to store, and how many events are you planning to retrieve per query? I.e., does the size of the table exceed the available RAM? Are you using traditional spinning hard disks or solid-state disks? If the size of the table exceeds the available RAM and you are using traditional HDDs, note that each row returned for the query takes about 5-15 milliseconds due to slow seek times.
If your system supports batch jobs, I would use a batch job to delete old events instead of deleting old events at each query. The reason is that batch jobs do not slow down the interactive code path, and can perform more work at once provided that you execute the batch job rarely enough.
If your system doesn't support batch jobs, you could use a probabilistic approach to delete old events, i.e. delete old events with, say, a 1% probability whenever events are queried. Or alternatively, you could have a helper table into which you store the timestamp of the last deletion of old events; check that timestamp, and if it's old enough, perform a new delete job and update the timestamp. The helper table should be so small that it will always stay in the cache.
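A rough sketch of the helper-table variant - the table names are made up, the interval syntax varies by database, and the two statements should run in one transaction:
-- single-row table remembering when old events were last purged
create table prune_log (
    last_pruned timestamp not null
);

-- only purge if the last purge happened long enough ago
delete from simple
where timestamp < current_timestamp - interval '10 days'
  and (select last_pruned from prune_log) < current_timestamp - interval '1 hour';

update prune_log
set last_pruned = current_timestamp
where last_pruned < current_timestamp - interval '1 hour';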
My inclination is not to delete data. I would just store the data in your structure and have an interface (perhaps a view or table functions) that runs a query such as;
select s.*
from simple s
where s.timestamp >= CURRENT_DATE - interval 'n days' and
s.UserId = $userid
order by s.timestamp desc
fetch first 10 row only;
(Note: this uses standard syntax because you haven't specified the database, but there is similar functionality in any database.)
For performance, you want an index on simple(UserId, timestamp). This will do most of the work.
If you really want, you can periodically delete older rows. However, keeping all the rows is advantageous for responding to changing requirements ("Oh, we now want 60 days instead of 30 days") or other purposes, such as investigations into user behaviors and changes in events over time.
There are situations that are out-of-the-ordinary where you might want a different approach. For instance, there could be legal restrictions on the amount of time you could hold the data. In that case, use a job that deletes old data and run it every day. Or, if your database technology were an in-memory database, you might want to restrict the size of the table so old data doesn't occupy much memory. Or, if you had really high transaction volumes and lots of users (like millions of users with thousands of events), you might be more concerned with data volume affecting performance.
I have recently been given the assignment of modelling a database fit to store stock prices for over 140 companies. The data will be collected every 15 min for 8.5 h each day from all these companies. The problem I'm facing right now is how to set up the database to achieve fast search/fetch given this data.
One solution would be to store everything in one table with the following columns:
| Company name | Price | Date | Etc... |
Or I could create a table for each company and just store the price and the date when the data was collected (and other parameters not known atm).
What are your thoughts on these kinds of solutions? I hope the problem was explained in sufficient detail; if not, please let me know.
Any other solution would be greatly appreciated!
I take it you're concerned about performance given the large number of records you're likely to generate - 140 companies * 4 data points / hour * 8.5 hours * 250 trading days / year means you're looking at around 1.2 million data points per year.
Modern relational database systems can easily handle that number of records in a single table, subject to some important considerations - I don't see an issue with storing 100 years of data points.
So, yes, your initial design is probably the best:
Company name | Price | Date | Etc... |
Create indexes on Company name and date; that will allow you to answer questions like:
what was the highest share price for company x
what was the share price for company x on date y
on date y, what was the highest share price
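A minimal sketch of that single-table design, with the two indexes and one of the sample queries - column names and types are just placeholders:
create table stock_price (
    company_name varchar(100)   not null,
    price        decimal(18, 4) not null,
    price_date   timestamp      not null
    -- etc...
);

create index ix_stock_price_company on stock_price (company_name, price_date);
create index ix_stock_price_date    on stock_price (price_date);

-- e.g. the share price for company x on date y
select price
from stock_price
where company_name = :x
  and price_date = :y;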
To help prevent performance problems, I'd build a test database, and populate it with sample data (tools like dbMonster make this easy), and then build the queries you (think you) will run against the real system; use the tuning tools for your database system to optimize those queries and/or indices.
On top of what has already been said, I'd like to say the following thing: Don't use "Company name" or something like "Ticker Symbol" as your primary key. As you're likely to find out, stock prices have two important characteristics that are often ignored:
some companies can be quoted on multiple stock exchanges, and therefore have different quote prices on each stock exchange.
some companies are quoted multiple times on the same stock exchange, but in different currencies.
As a result, a properly generic solution should use the (ISIN, currency, stock exchange) triplet as identifier for a quote.
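A sketch of a quote table keyed that way - names and types are only illustrative:
-- one row per quote, identified by security, currency and exchange
create table quote (
    isin       char(12)       not null,  -- ISIN is a 12-character identifier
    currency   char(3)        not null,  -- ISO 4217 currency code
    exchange   varchar(10)    not null,  -- e.g. a MIC exchange code
    quote_time timestamp      not null,
    price      decimal(18, 4) not null,
    primary key (isin, currency, exchange, quote_time)
);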
The first, more important question is: what are the types and usage patterns of the queries that will be executed against this table? Is this an Online Transaction Processing (OLTP) application, where the great majority of queries are against a single record, or at most a small set of records? Or is it an Online Analytical Processing (OLAP) application, where most queries will need to read, and process, significantly large sets of data to generate aggregations and do analysis? These two very different types of systems should be modeled in different ways.
If it is the first type of app, (OLTP), your first option is a better one, but the usage patterns and types of queries would still be important to determine the types of indices to place on the table.
If it is an OLAP application (and a system storing billions of stock prices sounds more like an OLAP app), then the data structure you set up might be better organized to store pre-aggregated data values, or even go all the way and use a multi-dimensional database like an OLAP cube, based on a star schema.
Put them into a single table. Modern DB engines can easily handle those volumes you specified.
rowid | StockCode | priceTimeInUTC | PriceCode | AskPrice | BidPrice | Volume
rowid: Identity UniqueIdentifier.
StockCode instead of Company. Companies have multiple types of stocks.
PriceTimeInUTC is to standardize any datetime into a specific timezone.
Also use datetime2 (it is more accurate).
PriceCode is used to identify what kind of price it is: Options/Futures/CommonStock, PreferredStock, etc.
AskPrice is the buying price.
BidPrice is the selling price.
Volume (for buy/sell) might be useful for you.
Separately, have a StockCode table and a PriceCode table.
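A sketch of those lookup tables and the price table referencing them - the syntax is generic and the types are placeholders:
create table StockCode (
    StockCode   varchar(12)  primary key,
    CompanyName varchar(100) not null
);

create table PriceCode (
    PriceCode   varchar(10) primary key,
    Description varchar(50) not null  -- Options/Futures/CommonStock, PreferredStock, ...
);

create table StockPrice (
    RowId          bigint      primary key,  -- the identity / uniqueidentifier column
    StockCode      varchar(12) not null references StockCode (StockCode),
    priceTimeInUTC timestamp   not null,     -- datetime2 on SQL Server
    PriceCode      varchar(10) not null references PriceCode (PriceCode),
    AskPrice       decimal(18, 4),
    BidPrice       decimal(18, 4),
    Volume         bigint
);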
That is a Brute Force approach. The second you add searchable factors it can change everything. A more flexible and elegant option is a star schema, which can scale to any amount of data. I am a private party working on this myself.
First, I'd like to start out expressing that I am not trying to just have someone create my table schema for me. I have spent some time weighing the options between the two possibilities in my design and I wanted to get some advice before I go and run wild with my current idea.
Here is my current schema, I will put a ? next to columns I'm considering using.
Key:
table_name
----------
col1 | col2 | col3
tax_zone
---------
tax_zone_id | tax_rate | description
sales_order
-----------
sales_order_id | tax_zone_id (FK) or tax_rate (?)
sales_order_item
-----------
sales_order_item_id | sales_order_id (FK) | selling_price | amount_tax or tax_rate (?)
So, if it wasn't already clear, the dilemma is whether or not I should store the tax data in the individual rows for an order, or use a join to pull the tax_zone information and then do something in my query like (tz.tax_rate * so.order_amount) as order_total.
At present, I was thinking of using the method I just described. There is a problem I see with this methodology, though, that I can't seem to figure out how to remedy. Tax rates for specific zones are subject to change. This means that if a tax rate changes for a zone and I'm using a foreign key reference, the change in the rate will be reflected in past orders that were placed at a different rate. This causes an issue because at present I'm using the data in this table to store both orders that have been processed and orders that are still open; therefore, if someone were to go re-print a past order, the total amount for the order would have changed.
My problem with storing the specific rate or tax amount is that it means every time someone edits an order, I would have to update that row again with the new values.
In the process of writing this, I'm starting to move towards the latter idea being the better of the two.
Perhaps if someone can just provide me the answer to the following questions so I can go research them myself some more.
Is this a known problem in database modeling?
Are there any well known "authorities" on the subject that have published a book / article?
Any help is much appreciated, thanks!
Well, versioning and history are a well-known problem in database modelling. Your solution is very common.
For a simple enumeration like VAT rates, a simple "foreign key tax_id referencing taxtable(id)" will do. The tax table should never be updated: once a tax_id is entered, it should stay there forever. If the tax rates change at the end of the year, a new record should be entered into the tax table even if records with the new value already exist.
The best search phrase for search engines is probably "temporal database".
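A sketch of such an append-only tax table, reusing the tax_zone and sales_order names from the question (the rest is illustrative):
-- append-only: rows are inserted, never updated or deleted
create table tax_zone_rate (
    tax_rate_id integer       primary key,
    tax_zone_id integer       not null references tax_zone (tax_zone_id),
    tax_rate    decimal(6, 4) not null,
    valid_from  date          not null
);

-- each order then points at the exact rate row that applied when it was placed
alter table sales_order
    add tax_rate_id integer references tax_zone_rate (tax_rate_id);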
UPDATE:
http://www.google.nl/url?sa=t&source=web&cd=2&ved=0CCMQFjAB&url=http%3A%2F%2Fwww.faapartners.com%2Fdownloads%2Foverige-publicaties%2Fpresentatie-over-tijd-in-databases%2Fat_download%2Ffile&rct=j&q=veldwijk%20temporal&ei=HQdxTtimCcKr-QansM28CQ&usg=AFQjCNEg9puU8WR1KIm90voSDp13WmE0-g&cad=rja
In the situation you describe, you will eventually have to store the tax rate in the orders table, because you will need the rate at which the order was closed.
Therefore the cleanest solution has to be to calculate the tax rate each time an order is updated unless it is closed. You could use a trigger to do this.
(Ben's answer popped up as I was writing this - seems we disagree, which is probably not helpful :-)
Two points. First, you're going to have to store the tax rate somewhere, or you're not going to be able to add it to sales_order or anywhere else. Secondly, the tax rate can change over time, so you don't want to update each time.
So you have two options.
Store tax rate in a reference table and update each order with the correct tax rate at the time of entry into the table.
Calculate everything every time you access it.
Personally I would go for option 1, BUT with a start time as part of the primary key in the reference table, since if you ever do need to change the tax rate you may need to know what the correct rate was at the time the order was placed.
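As a sketch of that - a reference table with the start time in the primary key, plus the lookup for the rate in force when a given order was placed (names are illustrative):
create table tax_rate_history (
    tax_zone_id integer       not null,
    start_time  timestamp     not null,
    tax_rate    decimal(6, 4) not null,
    primary key (tax_zone_id, start_time)
);

-- rate in force for a zone at the time the order was placed
select t.tax_rate
from tax_rate_history t
where t.tax_zone_id = :zone_id
  and t.start_time = (select max(t2.start_time)
                      from tax_rate_history t2
                      where t2.tax_zone_id = :zone_id
                        and t2.start_time <= :order_time);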