Database model to describe IT environment [closed]

I'm looking at writing a Django app to help document fairly small IT environments. I'm getting stuck on how best to model the data, as the number of attributes per device can vary, even between devices of the same type. For example, a SAN will have 1 or more arrays, and 1 or more volumes. The arrays will then have attributes of Name, RAID Level, Size, and Number of disks, and the volumes will have attributes of Size and Name. Different SANs will have a different number of arrays and volumes.
Same goes for servers, each server could have a different number of disks/partitions, all of which will have attributes of Size, Used space, etc, and this will vary between servers.
Another device type may be a switch, which won't have arrays or volumes, but will have a number of network ports, some of which may be gigabit, others 10/100, others 10Gigabit, etc.
Further, I would like the ability to add device types in the future without changing the model. A new device type may be a phone system, which will have its own unique attributes which may vary between different phone systems.
I've looked into EAV database designs but it seems to get very complicated very quickly, and I'm unclear on whether it's the best way to go about this. I was thinking something along the lines of the model as shown in the picture.
http://i.stack.imgur.com/ZMnNl.jpg
A bonus would be the ability to create 'snapshots' of environments at a particular time, making it possible to view changes to the environment over time. Adding a date column to the attributes table may be a way to solve this.
For the record, this app won't need to scale very much (at most 1000 devices), so massive scalability isn't a big concern.

Since your attributes are per model instance and differ for each instance, I would suggest going with a completely free schema:
from django.db import models

class ITEntity(models.Model):
    name = models.CharField(max_length=255)

class ITAttribute(models.Model):
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)
    entity = models.ForeignKey(ITEntity, on_delete=models.CASCADE, related_name="attrs")
This is a very simple model and you can do the rest, like templates (i.e. switch template, router template, etc.), in your app code - it's much more straightforward than using a complicated model like EAV (I do like EAV, but this doesn't seem like the use case for it).
Adding history is also simple - just add a timestamp to ITAttribute. When changing an attribute, create a new one instead. Then, when fetching an attribute, pick the one with the latest timestamp. That way you always have a point-in-time view of your environment.
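A minimal sketch of that history idea, assuming a hypothetical created timestamp field on ITAttribute and an illustrative helper name (attribute_as_of); plain Django ORM, nothing beyond what the model above already implies:

from django.db import models

class ITAttribute(models.Model):
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)
    entity = models.ForeignKey("ITEntity", on_delete=models.CASCADE, related_name="attrs")
    created = models.DateTimeField(auto_now_add=True)  # rows are never updated, only added

def attribute_as_of(entity, attr_name, when):
    # Latest value of attr_name for this entity at (or before) `when`.
    return (entity.attrs
            .filter(name=attr_name, created__lte=when)
            .order_by("-created")
            .first())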

If you are more comfortable with something along the lines of the image you posted, below is a slightly modified version (sorry I can't upload an image, don't have enough rep).
+-------------+
| Device Type |
|-------------|
| type        |
+-------------+
        ^
        |
+---------------+     +--------------------+     +-----------+
|    Device     |----<| DeviceAttributeMap |>----| Attribute |
|---------------|     |--------------------|     |-----------|
| name          |     | Device             |     | name      |
| DeviceType    |     | Attribute          |     +-----------+
| parent_device |     | value              |
| Site          |     +--------------------+
+---------------+
        |
        v
+-------------+
|    Site     |
|-------------|
| location    |
+-------------+
I added a linker table, DeviceAttributeMap, so you have more control over an Attribute catalog; this allows querying for devices that share the same Attribute but have differing values. I also added a parent_device field to the Device model, intended as a self-referential foreign key capturing a device's relationship to its parent device. You'll likely want to make this field optional; to make the foreign key parent_device optional in Django, set the field's null and blank attributes to True.
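A rough Django translation of that diagram, as a sketch only - the field names and on_delete choices are my assumptions, and parent_device is made optional with null=True, blank=True as described:

from django.db import models

class DeviceType(models.Model):
    type = models.CharField(max_length=100)

class Site(models.Model):
    location = models.CharField(max_length=200)

class Device(models.Model):
    name = models.CharField(max_length=200)
    device_type = models.ForeignKey(DeviceType, on_delete=models.PROTECT)
    parent_device = models.ForeignKey("self", null=True, blank=True,
                                      on_delete=models.SET_NULL)
    site = models.ForeignKey(Site, on_delete=models.PROTECT)

class Attribute(models.Model):
    name = models.CharField(max_length=100)

class DeviceAttributeMap(models.Model):
    device = models.ForeignKey(Device, on_delete=models.CASCADE)
    attribute = models.ForeignKey(Attribute, on_delete=models.CASCADE)
    value = models.CharField(max_length=255)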

You could try a document based NoSQL database, like MongoDB. Each document can represent a device with as many different fields as you like.
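For illustration only, a minimal PyMongo sketch assuming a local MongoDB instance; the database, collection, and field names are made up to match the SAN/switch examples in the question:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
devices = client["it_inventory"]["devices"]

# Each device is a free-form document; fields vary per device type.
devices.insert_one({
    "type": "san",
    "name": "san01",
    "arrays": [{"name": "array1", "raid": "RAID5", "size_gb": 2048, "disks": 6}],
    "volumes": [{"name": "vol1", "size_gb": 500}],
})
devices.insert_one({"type": "switch", "name": "sw01",
                    "ports": [{"speed": "1G"}, {"speed": "10G"}]})

# Query across heterogeneous documents.
for d in devices.find({"type": "switch"}):
    print(d["name"], len(d.get("ports", [])))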

Related

Adapting Machine Learning Algorithms to my Problem

I'm working on a project and need your ideas and advice.
First of all, let me describe my problem.
A machine has a power button and some other keys, and only one user is authorized to use it. There are no other authentication methods, and the machine is in a public area of a company.
The machine is operated by pressing the power button in combination with some other keys.
The order of the key presses is secret, but we don't trust that alone: anybody could learn the password and access the machine.
I can measure key hold times and some other metrics, such as the time differences between key presses (horizontal or vertical key press timings), and I can also measure hold time, etc.
All of this means I have some inputs, and now I'm trying to build a user profile by analysing them.
My idea is to have the authorized user enter the password n times and derive a threshold or something similar from that.
This approach could be described as a form of biometrics: anyone else who knows the button combination can try the password, but if they fall outside this range they cannot get access.
How can I adapt this into an algorithm? Where should I start?
I don't want to delve deep into machine learning, and I can see that on my first try the false positive and false negative rates may be really high, but I can manage that by changing my inputs.
Thanks.
To me this seems like a good candidate for a classification problem. You have two classes (correct password input / incorrect), and your data could be the times (from time 0) at which buttons were pressed. You could train a learning algorithm by giving it several examples of correct password data and incorrect password data. Once your classifier is trained and working satisfactorily, you could use it to predict whether new password input attempts are correct.
You could try out several classifiers from Weka, a GUI based machine learning tool http://www.cs.waikato.ac.nz/ml/weka/
What you need is your data in a simple table format for experimenting in Weka, something like the following:
Attempt No | 1st button time | 2nd button time | 3rd button time | is_correct
-----------|-----------------|-----------------|-----------------|------------
         1 |             1.2 |             1.5 |             2.4 | YES
         2 |             1.3 |             1.8 |             2.2 | YES
         3 |             1.1 |             1.9 |             2.0 | YES
         4 |             0.8 |             2.1 |             2.9 | NO
         5 |             1.2 |             1.9 |             2.2 | YES
         6 |             1.1 |             1.8 |             2.1 | NO
This would be a training set. The outcome (which is known) is the class is_correct. You would run this data through Weka, selecting a classifier (Naive Bayes, for example). This would produce a classifier (for example, a set of rules) which could be used to predict future entries.
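If you'd rather prototype in code than in the Weka GUI, here is a minimal sketch using scikit-learn instead (my own substitution, not something the answer above uses), fed with the toy timings from the table:

from sklearn.naive_bayes import GaussianNB

# Toy training data from the table: button press times (seconds) per attempt.
X = [[1.2, 1.5, 2.4],
     [1.3, 1.8, 2.2],
     [1.1, 1.9, 2.0],
     [0.8, 2.1, 2.9],
     [1.2, 1.9, 2.2],
     [1.1, 1.8, 2.1]]
y = ["YES", "YES", "YES", "NO", "YES", "NO"]  # is_correct labels

clf = GaussianNB().fit(X, y)

# Predict whether a new attempt's timing profile looks like the real user.
print(clf.predict([[1.2, 1.7, 2.3]]))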
The key to this sort of problem is devising good metrics. Once you have a vector of input values, you can use one of a number of machine learning algorithms to classify it as authorised or declined. So the first step should be to determine which metrics (of those you mention) will be the most useful and pick a small number of them (5-10). You can probably benefit by collapsing some by means of averaging (for example, the average length of any key press, rather than a separate value for every key). Then you will need to pick an algorithm. A good one for classifying vectors of real numbers is the support vector machine (SVM) - at this point you should read up on it, particularly on what the "kernel" function is, so you can choose one to use. Then you will need to gather a set of learning examples (vectors with a known result), train the algorithm with them, and test the trained SVM on a fresh set of examples to see how it performs. If the performance is poor with a simple kernel (e.g. linear), you may choose to use a higher-dimensional one. Good luck!
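To make the SVM suggestion concrete, a sketch with scikit-learn's SVC (again my own choice of tooling, not part of the original answer), showing the kernel choice and a simple cross-validated evaluation on the same toy timings:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# One row of timing metrics per attempt; 1 = authorised, 0 = declined.
X = [[1.2, 1.5, 2.4], [1.3, 1.8, 2.2], [1.1, 1.9, 2.0],
     [0.8, 2.1, 2.9], [1.2, 1.9, 2.2], [1.1, 1.8, 2.1]]
y = [1, 1, 1, 0, 1, 0]

svm = SVC(kernel="linear")   # try kernel="rbf" if a linear kernel performs poorly
scores = cross_val_score(svm, X, y, cv=2)   # 2-fold cross-validation on the toy data
print("accuracy per fold:", scores)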

Should these be 3 SQL tables or one?

This is a new question which arose out of this question
Due to answers, the nature of the question changed, so I think posting a new one is ok(?).
You can see my original DB design below. I have 3 tables, and now I need a query to get all the records for a specific user for running_balances calculations.
Transactions are between users, like mutual credit. So units get swapped between users.
Inventarizations are physical stuff brought into the system; a user gets units for this.
Consumations are physical stuff consumed; a user has to pay units for this.
|---------+----------------------+--------------------+--------------------|
| type    | transactions         | inventarizations   | consumations       |
|---------+----------------------+--------------------+--------------------|
| columns | date                 | date               | date               |
|         | creditor (FK user)   | creditor (FK user) |                    |
|         | debitor (FK user)    |                    | debitor (FK user)  |
|         | service (FK service) |                    |                    |
|         |                      | asset (FK asset)   | asset (FK asset)   |
|         | amount               | amount             | amount             |
|         |                      |                    | price              |
|---------+----------------------+--------------------+--------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. It's outside the scope to explain why, but these are the fields.)
The question is: "Can/should this be in one table or be multiple tables (as I have it for now)?" I like the 3-table solution because it makes more sense semantically. But then I need a rather complicated select statement (with possibly negative performance implications) for the running_balances. The original question in the link above asked for that statement; here I am asking whether the DB design is appropriate (apologies for the double posting, hope it's ok).
This same question arises when you try to implement a general ledger system for single entry bookkeeping. What you have called "transactions" corresponds to "transfers", like from savings to checking. What you have called "inventarizations" corresponds to "income", like depositing a paycheck. What you have called "consumations" corresponds to "expenses", like when you pay the electric bill. The only difference is that in bookkeeping, everything has been reduced to dollar (or other currency) value. So you don't have to worry about identifying assets, because one dollar is as good as another.
So the question arises whether you need to have separate columns for "debit amount" and "credit amount" or alternatively, whether you can just have one column for "amount", and enter a positive number for debits and a negative amount for credits. Essentially the same question arises if you are implementing double entry bookkeeping rather than single entry bookkeeping.
In terms of internal arithmetic and internal data handling, things are far simpler when you adopt the single-column approach. For example, to test whether a given transaction is in balance, all you have to do is ask whether sum(amount) is equal to zero.
The complications arise when people require the traditional bookkeeping format for data entry forms, on-screen retrievals, and published reports. The traditional format requires two separate columns, marked "Debit" and "Credit", which contain only positive numbers or blanks, with the constraint that every item must have an entry in either debit or credit but not both, and the other column must be left blank. These transformations require a certain amount of programming between the external format and the internal format.
It's really a matter of choice. Is it better to retain the traditional bookkeeping format of side-by-side debit and credit columns, or is it better to move forward to a format that uses negative numbers in a meaningful way? There are circumstances that favor each of these design choices.
In your case, it's going to depend on how you intend to use the data. I would build prototypes with each of the two designs, and then start working on the fundamental CRUD processing for each. Whichever one works out easier in your environment is the one to choose.
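A small illustrative sketch (plain Python, my own toy example rather than anything from the question) of the single signed-amount representation, the sum-to-zero balance check, and the conversion to the traditional two-column display:

# Internal form: one signed amount per entry (positive = debit, negative = credit).
entries = [
    {"account": "checking", "amount": 120.00},
    {"account": "savings",  "amount": -120.00},
]

# A transaction is in balance when the signed amounts sum to zero.
assert abs(sum(e["amount"] for e in entries)) < 1e-9

# External form: traditional side-by-side Debit / Credit columns.
for e in entries:
    debit = e["amount"] if e["amount"] > 0 else ""
    credit = -e["amount"] if e["amount"] < 0 else ""
    print(f'{e["account"]:10} {debit!s:>10} {credit!s:>10}')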
You said that the amounts are in different units, so I think you should keep each table for itself.
I personally hate a DB design that has "different rules" for filling a table based on the type of entity stored in a row. It just gets messy, and it's hard to keep your constraints enforced properly on a table like that.
Just create an indexed view that will answer your balance questions, to keep your queries "simple".
There's no definitive answer to this and answers will largely be down to the database design methodologies adopted by the answerer.
My advice would be to trial both ways and see which one has the best compromise between querying, performance and maintenance/usability.
You could always set up a view that returns all 3 tables as one table for querying and has a type field for the type of process a row relates to.
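A sketch of that unified view, using SQLite via Python for illustration (a plain view rather than an indexed/materialized one, which is engine-specific); the column layout is assumed from the question's three tables:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions     (date TEXT, creditor INT, debitor INT, service INT, amount REAL);
CREATE TABLE inventarizations (date TEXT, creditor INT, asset INT, amount REAL);
CREATE TABLE consumations     (date TEXT, debitor INT, asset INT, amount REAL, price REAL);

-- One unified view with a 'type' column, so balance queries stay simple.
CREATE VIEW ledger AS
    SELECT 'transaction'     AS type, date, creditor, debitor, amount FROM transactions
    UNION ALL
    SELECT 'inventarization' AS type, date, creditor, NULL,    amount FROM inventarizations
    UNION ALL
    SELECT 'consumation'     AS type, date, NULL,     debitor, amount FROM consumations;
""")

rows = conn.execute("SELECT type, date, amount FROM ledger ORDER BY date").fetchall()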

Should I initialize my AUTO_INCREMENT id column to 2^32+1 instead of 0?

I'm designing a new system to store short text messages [sic].
I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.
Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.
Should I "UPDATE message SET id=2^32+1;" before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?
If you wanted to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs to go, and it won't fit in a 32-bit value, signed or otherwise.
Of course, documenting the value's range and providing guidance to your API or data customers is the only right solution. Someone's always going to try to stick a long into a char and wonder why it doesn't work (always).
What if you provided a set of test suites or a test service that used messages in the "high but still valid" range and persuade your service users to use it to validate their code is proper? Starting at an arbitrary value for defensive reasons is a little weird to me; providing sanity tests rubs me right.
Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.
If everyone alive on the planet sent one message per second every second non-stop, your counter wouldn't wrap until the year 2050 using 64 bit integers.
Probably just starting at 1 would be sufficient.
(But if you did start at the lower bound, it would extend into the start of 2092.)
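A quick back-of-the-envelope check of that claim (my own arithmetic, assuming roughly 6.7 billion people and counting from when the question was asked):

SIGNED_64_BIT_MAX = 2**63 - 1
people = 6.7e9                                  # rough world population, an assumption
messages_per_year = people * 60 * 60 * 24 * 365

years_to_wrap = SIGNED_64_BIT_MAX / messages_per_year
print(round(years_to_wrap, 1), "years")         # ~44 years, in the ballpark of the estimate above
print(round(2 * years_to_wrap, 1), "years")     # roughly double if you start at the signed lower bound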
Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.
If you want to ensure that messages have an order, implement the ordering like a linked list:
---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway
(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)
One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using Guids as the primary key. I say do one better. Use a 64bit Int for your table ID/Primary Key, but also use a GUID, or something similar, as your publicly exposed ID.
An example Message table:
Name        | Data Type
------------+---------------------
Id          | BigInt - Primary Key
Code        | Guid
Message     | Text
DateCreated | DateTime
Then your data looks like:
Id | Code                                 | Message | DateCreated
---+--------------------------------------+---------+---------------------------
 1 | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
 2 | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
 3 | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
 4 | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
 5 | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00
Where Code is what you would expose to the public whether it be a URL, Service, CSV, Xml, etc.
Don't want to be the next Twitter, eh? lol
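A rough Django sketch of that internal-id / public-code split (my own rendering of the table above; field names are illustrative):

import uuid
from django.db import models

class Message(models.Model):
    # Internal surrogate key: never exposed outside the application.
    id = models.BigAutoField(primary_key=True)
    # Public identifier used in URLs, API responses, exports, etc.
    code = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)
    message = models.TextField()
    date_created = models.DateTimeField(auto_now_add=True)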
If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.
They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.
Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).

Designing file processing that handles many file formats, parsing, validation, and persistence

If you had to design a file processing component/system that could take in a wide variety of file formats (including proprietary formats such as Excel), parse/validate them, and store this information to a DB... how would you do it?
NOTE : 95% of the time 1 line of input data will equal one record in the database, but not always.
Currently I'm using some custom software I designed to parse/validate/store customer data to our database. The system identifies a file by its location in the file system (from an FTP drop) and then loads an XML "definition" file. (The correct XML is loaded based on where the input file was dropped off.)
The XML specifies things like file layout (Delimited or Fixed Width) and field specific items (Length, Data Type(numeric, alpha, alphanumeric), and what DB column to store the field to).
<delimiter><![CDATA[ ]]></delimiter>
<numberOfItems>12</numberOfItems>
<dataItems>
    <item>
        <name>Member ID</name>
        <type>any</type>
        <minLength>0</minLength>
        <maxLength>0</maxLength>
        <validate>false</validate>
        <customValidation/>
        <dbColumn>MembershipID</dbColumn>
    </item>
Because of this design the input files must be text (fixed width or delimited) and have a 1 to 1 relation from input file data field to DB column.
I'd like to extend the capabilities of our file processing system to take in Excel, or other file formats.
There are at least a half dozen ways I can proceed but I'm stuck right now because I don't have anyone to really bounce the ideas off of.
Again: If you had to design a file processing component that could take in a wide variety of file formats (including proprietary formats such as Excel), parse/validate them, and store this information to a DB... how would you do it?
Well, a straightforward design is something like...
+-----------+
|  reader1  |----
+-----------+    \
                  \    +----------------+      +-------------+
                   >---|   validation   |------|      DB     |
                  /    +----------------+      +-------------+
+-----------+    /
|  reader2  |----
+-----------+
Readers take care of file validation (does the data exist?) and parsing, the Validation section takes care of any business logic, and the DB... is a DB.
So part of what you'd have to design is the Generic ReaderToValidator data container. That's more of a business logic kind of container. I suspect you want the same kind of data regardless of the input format, so G.R.2.V. is not going to be too hard.
You can make this polymorphic by designing a GR2V superclass with the Validator method and the data members; each reader then subclasses GR2V and fills up the data with its own ReadParseFile method. That's going to introduce a bit more coupling, though, than having a strict procedural approach. I'd go procedural for this, since the data is being procedurally processed in the conceptual design.
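A rough Python sketch of that reader-to-validator shape, with made-up class, method, and column names (GenericRecord standing in for the GR2V container, one reader subclass per input format):

import csv

class GenericRecord(dict):
    """Format-neutral container handed from readers to validation."""

class Reader:
    def read(self, path):
        """Parse one input file into GenericRecords; implemented per format."""
        raise NotImplementedError

class DelimitedReader(Reader):
    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def read(self, path):
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh, delimiter=self.delimiter):
                yield GenericRecord(row)

def validate(records):
    # Business-logic validation happens here, independent of input format.
    for rec in records:
        if rec.get("MembershipID"):
            yield rec

# records = validate(DelimitedReader("|").read("members.txt"))
# ...hand the surviving records to the DB layer.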
You may want to start a blog, then perhaps if you are on something like LinkedIn you can point the discussion to your blog, or start a discussion on LinkedIn, as some of the discussions there go on for a while.
SO is good for specifics, it seems like true discussion is not so easily done here. Comments are too small for interchange of ideas. I would tend to go elsewhere.
Although such discussions should be technology-agnostic, I suspect that you'll probably find that the Java and .Net camps don't meet too much. I would look at The Server Side but I do Java and hence look for Java stuff.

What's the best way to handle categories, sub-categories - hierarchical data?

Duplicate:
SQL - how to store and navigate hierarchies
If I have a database where the client requires categories, sub-categories, sub-sub-categories and so on, what's the best way to do that? If they only needed three, and always knew they'd need three I could just create three tables cat, subcat, subsubcat, or the like. But what if they want further depth? I don't like the three tables but it's the only way I know how to do it.
I have seen the "sql adjacency list" but didn't know if that was the only way possible. I was hoping for input so that the client can have any level of categories and subcategories. I believe this means hierarchical data.
EDIT: I was hoping for the SQL to get the list back out, if possible.
Thank you.
table categories: id, title, parent_category_id
 id | title | parent_category_id
----+-------+--------------------
  1 | food  | NULL
  2 | pizza | 1
  3 | wines | NULL
  4 | red   | 3
  5 | white | 3
  6 | bread | 1
I usually do a select * and assemble the tree algorithmically in the application layer.
You might have a look at Joe Celko's book, or this previous question.
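A sketch of the "assemble the tree in the application layer" step (plain Python; the row data is the toy example above, the helper name is my own):

from collections import defaultdict

# Result of: SELECT id, title, parent_category_id FROM categories
rows = [(1, "food", None), (2, "pizza", 1), (3, "wines", None),
        (4, "red", 3), (5, "white", 3), (6, "bread", 1)]

children = defaultdict(list)
for cat_id, title, parent_id in rows:
    children[parent_id].append((cat_id, title))

def print_tree(parent_id=None, depth=0):
    # Recursively walk from the roots (parent_category_id IS NULL) downwards.
    for cat_id, title in children[parent_id]:
        print("  " * depth + title)
        print_tree(cat_id, depth + 1)

print_tree()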
Creating a table with a relation to itself is the best way to do this. It's easy and flexible to whatever extent you want, without any limitation. I don't think I need to repeat the structure you should use, since that has already been suggested in the first answer.
I have worked with a number of methods, but I still stick to the plain "id, parent_id" intra-table relationship, where root items have parent_id = 0. If you need to query the items in a tree a lot, especially when you only need 'branches' (all underlying elements of one node), you could use a second table, "id, path_id, level", holding a reference to each node in the upward path of each node. This might look like a lot of data, but it drastically improves branch lookups, and it is quite manageable to maintain with triggers.
Not a recommended method, but I have seen people use dot-notation on the data.
Food.Pizza or Wines.Red.Cabernet
You end up doing lots of LIKE or mid-string queries, which don't use indices terribly well. And you end up parsing things a lot.