Database: best way to model a spreadsheet

I am trying to figure out the best way to model a spreadsheet (from the database point of view), taking into account:
The spreadsheet can contain a variable number of rows.
The spreadsheet can contain a variable number of columns.
Each column holds a single type of value, but that type is not known in advance (integer, date, string).
It has to be easy (and performant) to generate a CSV file containing the data.
I am thinking about something like:
from django.db import models

class Cell(models.Model):
    # the value is stored as a string; Column.type says how to interpret it
    column = models.ForeignKey("Column", on_delete=models.CASCADE)
    row_number = models.IntegerField()
    value = models.CharField(max_length=100)

class Column(models.Model):
    spreadsheet = models.ForeignKey("Spreadsheet", on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    type = models.CharField(max_length=100)

class Spreadsheet(models.Model):
    name = models.CharField(max_length=100)
    creation_date = models.DateField()
Can you think of a better way to model a spreadsheet? My approach stores every value as a string, and I am worried that this will make generating the CSV file too slow.
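For what it's worth, generating the CSV from this layout does not have to be slow if you fetch all the cells for a spreadsheet in one query and pivot them in memory. A minimal sketch, assuming the Django models above plus Python's standard csv module (the helper name and the ordering choices are my own, not part of the question):

import csv
import io

def spreadsheet_to_csv(spreadsheet):
    # Assumes this sits next to the Spreadsheet/Column/Cell models above.
    columns = list(spreadsheet.column_set.order_by("id"))
    # One query for every cell of this spreadsheet, bucketed by row number.
    rows = {}
    for cell in Cell.objects.filter(column__spreadsheet=spreadsheet):
        rows.setdefault(cell.row_number, {})[cell.column_id] = cell.value
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([c.name for c in columns])  # header row
    for row_number in sorted(rows):
        row = rows[row_number]
        writer.writerow([row.get(c.id, "") for c in columns])
    return out.getvalue()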

from a relational viewpoint:
Spreadsheet <-->> Cell : RowId, ColumnId, ValueType, Contents
There is no requirement for Row and Column to be separate entities, but you can model them that way if you like.

Databases aren't designed for this. But you can try a couple of different ways.
The naive way to do it is a version of One Table To Rule Them All. That is, create a giant generic table, all types being (n)varchars, that has enough columns to cover any foreseeable spreadsheet. Then you'll need a second table to store metadata about the first, such as what Column1's spreadsheet column name is, what type it stores (so you can cast in and out), etc. Then you'll need triggers to run against inserts that check the incoming data against the metadata to make sure the data isn't corrupt, etc etc etc. As you can see, this way is a complete and utter cluster. I'd run screaming from it.
The second option is to store your data as XML. Most modern databases have XML data types and some support for XPath within queries. You can also use XSDs to provide some kind of data validation, and XSLTs to transform that data into CSVs. I'm currently doing something similar with configuration files, and it's working out okay so far. No word on performance issues yet, but I'm trusting Knuth on that one.
The first option is probably much easier to search and faster to retrieve data from, but the second is probably more stable and definitely easier to program against.
It's times like this I wish Celko had a SO account.
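To make the second option a little more concrete: the whole sheet can live in a single XML value and be flattened to CSV in application code. This is only a rough sketch using Python's ElementTree rather than an in-database XSLT, and the XML shape shown here is an assumption for illustration, not a standard:

import csv
import io
import xml.etree.ElementTree as ET

SHEET_XML = """
<sheet>
  <row><cell type="string">Alice</cell><cell type="integer">30</cell></row>
  <row><cell type="string">Bob</cell><cell type="integer">25</cell></row>
</sheet>
"""

def xml_sheet_to_csv(xml_text):
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    for row in root.findall("row"):
        writer.writerow([cell.text or "" for cell in row.findall("cell")])
    return out.getvalue()

print(xml_sheet_to_csv(SHEET_XML))   # Alice,30 / Bob,25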

You may want to study EAV (Entity-attribute-value) data models, as they are trying to solve a similar problem.
Entity-Attribute-Value - Wikipedia

The best solution greatly depends on the way the database will be used. Try to identify a couple of the top use cases you expect, and then decide on the design. For example, if there is no use case for fetching the value of a single cell from the database (the data is always loaded at row level, or even in groups of rows), then there is no need to store a 'cell' as such.

That is a good question that calls for many answers depending on how you approach it, and I'd love to share an opinion with you.
This topic is one of several we have researched at Zenkit; we even wrote an article about it, and we'd love your opinion on it: https://zenkit.com/en/blog/spreadsheets-vs-databases/

Related

Is using a comma-separated field good or not?

I have a table named buildings
Each building has zero to n images.
I have two solutions.
The first one (the classic solution) uses two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
The second solution uses only one table:
buildings(id, name, address, image_urls_csv)
Given that I obviously won't need to search by image URL, I think the second solution (the image_urls_csv column) is easier to use: there is no need to create another table just to hold the images, and I avoid the hassle of multiple queries or joins.
The question is: if I don't really want to filter, search, or group by the field value, can I just make it CSV?
On the one hand, yes: simply having a single column with the list of image URLs avoids joins or multiple queries. A single round trip to the DB is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put on that field, breaking the whole thing. Or someone will want to attach extra metadata to each entry ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
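To make the parsing point concrete, here is a small Python illustration (the URLs are made up): a naive split on commas falls apart as soon as one URL contains a comma, while a real CSV parser copes.

import csv
import io

image_urls_csv = '"http://example.com/a,front.jpg",http://example.com/b.jpg'

naive = image_urls_csv.split(",")
# ['"http://example.com/a', 'front.jpg"', 'http://example.com/b.jpg']  -- wrong

proper = next(csv.reader(io.StringIO(image_urls_csv)))
# ['http://example.com/a,front.jpg', 'http://example.com/b.jpg']       -- correct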
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.

Database design for storing survey (questionnaire) data

My English is not good, sorry about that.
I want to write a web app for surveys (questionnaires). It must be a site where a user can give answers to different questions. For example, a question can have a text answer, a checkbox, or a lookup (combo box).
My problem is the database architecture. I have read a lot about the Entity-Attribute-Value pattern, and also about the One True Lookup Table. But these patterns have problems (with MS SQL) when it comes to building a SQL query to select the data for a report.
I hope somebody can give me a good suggestion and tell me what I can do about this problem.
Thanks!
Almost everything can be represented as a string. Why not store a string in the database, e.g. the text answer, "true"/"false", or the combo box value? Then simply convert the value from the database if necessary, either at runtime or in SQL when writing a query.
I feel the Entity-Attribute-Value pattern is meant more for entities that can have dynamic fields added, not so much for the problem you've posed here.
If necessary, you could also add an additional column to the table to specify the "type" of data being stored. You could then base your convert statements in queries on that column.
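A minimal sketch of that "store a string plus its type" idea, using Django models in the spirit of the question at the top of this page; all of the names here are my own, not from the question:

from django.db import models

class Answer(models.Model):
    TYPE_CHOICES = [("text", "text"), ("bool", "bool"), ("choice", "choice")]
    question_id = models.IntegerField()          # simplified; could be a ForeignKey
    value = models.CharField(max_length=255)     # everything stored as a string
    value_type = models.CharField(max_length=10, choices=TYPE_CHOICES)

    def typed_value(self):
        # Convert the stored string back into a usable value at runtime.
        if self.value_type == "bool":
            return self.value == "true"
        return self.value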

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains privacy information and writing a custom script to 'scrub' the data, is there any tool or script which can scrub the data but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create views that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
    NAME,
    LEFT(CreditCard, 5) + '****' AS CreditCard  -- or don't show this column at all
    ....
FROM customer
Firstly, I need to state a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are typically found, but you also need to find the "hidden" data, where the data is either embedded in a standard format (e.g. string-name-string, under a column name like "reference code") or sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Decide what form the "scrubbed" data needs to take. Changing name fields to random characters causes problems when testing, as users focus on the garbled text rather than on functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs to have a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.

Importing/Pasting Excel data with changing fields into SQL table

I have a table called Animals. I pull data from this table to populate another system.
I get Excel data with lists of animals that need to go in the Animals table.
The Excel data will also have other identifiers, like Breed, Color, Age, Favorite Toy, Veterinarian, etc.
These identifiers will change with each new Excel file. Some may repeat, others are brand new.
Because the fields change, and I never know what new fields will come with each new Excel file, my Animals table only has Animal Id and Animal Name.
I've created a Values table to hold all the other identifier fields. That table is structured like this:
AnimalId
Value
FieldId
DataFileId
And then I have a Fields table that holds the key to each FieldId in the Values table.
I do this because the alternative is to keep a big table with fields that may not even be used each time I need to add data. A big table with a lot of null columns.
I'm not sure my way is a good way either. It can seem overly complex.
But, assuming it is a good way, what is the best way to get this Excel data into my Values table? The list of animals is easy to add to my Animals table. But for each identifier (Breed, Color, etc.) I have to copy or import the values and then update the table to assign a matching FieldId (or create a new FieldId in the Fields table if it doesn't exist yet).
It's a huge pain to load new data if there are a lot of identifiers. I'm really struggling and could use a better system.
Any advice, help, or just pointing me in a better direction would be really appreciated.
Thanks.
Depending on your client (eg, I use SequelPro on a Mac), you might be able to import CSVs. This is generally pretty shaky, but you can also export your Excel document as a CSV... how convenient.
However, this doesn't really help with your database structure. Granted, using foreign keys is a good idea, but importing that data unobtrusively (and easily) is something that will likely need to be done a row at a time.
However, you could try modifying something like this to suit your needs, by first exporting your Excel document as a CSV, removing the header row (the first one), and then using regular expressions on it to change it into a big chunk of SQL. For example:
Your CSV:
myval1.1,myval1.2,myval1.3,myval1.4
myval2.1,myval2.2,myval2.3,myval2.4
...
At which point, you could do something like:
myCsvText.replace(/^(.+),(.+),(.+),(.+)$/mg, 'INSERT INTO table_name(col1, col2, col3, col4) VALUES($1, $2, $3, $4)')
where you know the number of columns, their names, and how their values are organized (via the regular expression & replacement).
Might be a good place to start.
Your table looks OK. Since you have a variable number of fields, it seems logical to expand vertically. Although you might want to make it easier on yourself by changing DataFileID and FieldID into FieldName and DataFileName, unless you will use them in a lot of other tables too.
Getting data from Excel into SQL Server is unfortunately not as easy as you would expect from two Microsoft products interacting with each other. There are several routes that I know of that you can take:
1) Work with CSV files instead of Excel files. Excel can edit CSV files just as easily as Excel files, but CSV is an infinitely more reliable data source when it comes to importing. You don't get problems with different file formats for different Excel versions, with Excel having to be installed on the computer that will run the script, or with quirks of automatic datatype recognition. A CSV can be read with the BCP command-line tool, the BULK INSERT command or with SSIS. Then use stored procedures to convert the data from a horizontal bulk of columns into a pure vertical format (a rough sketch of that unpivot step follows at the end of this answer).
2) Use SSIS to read the data directly from the Excel file(s). It is possible to make a package that loops over several Excel files. A drawback is that the column format and the sheet name of the Excel file have to be known beforehand, so a different template (with a separate loop) has to be made each time a new Excel format arrives. There are third-party SSIS components that claim to be more flexible, but I haven't tested them yet.
3) Write a Visual C# program or PowerShell script that grabs the Excel file, extracts the data and writes it into your SQL table. Visual C# is a fairly easy language with powerful interfaces into Office and SQL Server. I don't know how steep the learning curve is to get started, but once you do, it will be a pretty easy program to write. I have also heard good things about PowerShell.
4) Create an Excel macro that uses VBA code to open other Excel files, loop through their data and write the results either to a predefined sheet or as CSV to disk. Once everything is in a standard format, it will be easy to import the data using one of the above methods.
Since I have had headaches with 1) and 2) before, I would advise either 3) or 4). Because of my greater experience with VBA than with Visual C# or PowerShell, I'd go for 4) if I were in a hurry. But I think 3) is the better investment for the long term.
(You could also get adventurous and use another scripting language, such as Python, as I once did because Python is cool; unfortunately Python offers pretty slow and limited interfaces to SQL Server and Excel.)
Good luck!
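Since Python came up, here is a rough sketch of the "horizontal bulk of columns into a pure vertical format" step from option 1), done with nothing but the standard csv module. The Animals/Values/Fields names come from the question; the "Animal Name" header and everything else here is an assumption:

import csv

def unpivot(csv_path, data_file_id):
    """Yield (animal_name, field_name, value, data_file_id) tuples from a wide CSV."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)            # the header row supplies the field names
        for row in reader:
            animal = row.pop("Animal Name")   # assumed header for the name column
            for field_name, value in row.items():
                if value:                     # skip empty identifier cells
                    yield (animal, field_name, value, data_file_id)

Each tuple can then be matched to an AnimalId and a FieldId (inserting new Fields rows as needed) before being inserted into the Values table.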

Manipulate data in the DB query or in the code

How do you decide on which side to perform your data manipulation when you can do it either in the code or in the query?
When you need to display a date in a specific format, for example, do you produce the desired format directly in the SQL query, or do you retrieve the date and then format it in the code?
What helps you decide: performance, best practice, preference for SQL vs. the code language, complexity of the task...?
All things being equal, I prefer to do any manipulation in code. I try to return data as raw as possible so it's usable by a larger base of consumers. If it's very specialized, maybe a report, then I may do the manipulation on the SQL side.
Another instance where I prefer to do manipulation on the SQL side is when it can be done set-based.
If it's not set-based and looping would be involved, then I would do the manipulation in code.
Basically, let the database do what it's good at; otherwise do it in code.
Formatting is a UI issue, it is not 'manipulation'.
My answer is the reverse of everyone else's.
If you are going to have to apply the same formatting logic (the same holds true for calculation logic) in more than one place in your application, or in separate applications, I would encapsulate the formatting in a view inside the database and SELECT from the view. You do not need to hide the original data; that can also remain available. But by putting the logic into a database view you make it trivially easy to have consistent formatting across modules and applications.
For instance, a Customer table would have an associated view CustomerEx with a MailingAddress derived column that would format the various parts of the address as required, combining city, state, and zip and compressing out blank lines, etc. My application code SELECTs against the CustomerEx view for addresses. If I extend my data model with, say, an Apt# field or to handle international addresses, I only need to change that single view. I do not need to change, or even recompile, my application.
I would never (ever) specify any formatting in the query itself. It is up to the consumer to decide how to format. All data manipulation should be done on the client side, except for bulk operations.
If it is just formatting and will not always need to be the same formatting, I'd do it in the application which is likely to do this faster.
However, the fastest formatting is the formatting that is done only once, so if it is a standard format that I always want to use (say, displaying American phone numbers as (###)###-####), then I'll store the data in the database in that format (this may still involve the application code, but on the insert, not the select). This is especially true if you might need to reformat a million records for a report. If you have several formats, you might consider calculated columns (we have one for full name and one for lastname, firstname; our raw data is firstname, middlename, lastname, suffix) or triggers to persist the data. In general, I say store the data the way you need to see it, as long as you can keep it in the appropriate data type for the real manipulations you need to do, such as date math, or regular math for money values.
About the only thing that I do in a query that could probably be done in code also is converting the datetimes to the user's time zone.
MySQL's CONVERT_TZ() function is easy to use and accurate. I store all of my datetimes in UTC and retrieve them in the user's time zone. Daylight saving rules change, and this is especially important for client applications, since relying on the native library means relying on the user having updated their OS.
Even for server side code, like a web server, I only have to update a few tables to get the latest time zone data instead of updating the OS on the server.
Other than those types of issues, it's probably best to distribute most functions to the application server or client rather than making your database the bottleneck. Application servers are easier to scale than database servers.
If you can write a stored procedure or something similar that starts with a large dataset and does some inexpensive calculation or simple iteration to return a single row or value, then it probably makes sense to do it on the server to avoid sending large datasets over the wire. So, if the processing is inexpensive, why not have the database return just what you need?
In the case of the date column, I'd save the full date in the DB, and when I return it I'd specify in code how I'd like to show it to the user. That way you can ignore the time part or even change the order of the date parts when you show it in a data grid, for example: mm/dd/yyyy, dd/mm/yyyy, or only mm/yyyy.
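A small Python illustration of that last point (the variable name and values are made up): keep the full datetime in the database and choose the display format only when showing it.

from datetime import datetime

created_at = datetime(2012, 4, 17, 14, 30)   # the value as it comes back from the DB

print(created_at.strftime("%m/%d/%Y"))   # 04/17/2012
print(created_at.strftime("%d/%m/%Y"))   # 17/04/2012
print(created_at.strftime("%m/%Y"))      # 04/2012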