How do you decide on which side to perform your data manipulation when you can do it either in the code or in the query?
When you need to display a date in a specific format, for example: do you produce the desired format directly in the SQL query, or do you retrieve the date and then format it in code?
What helps you decide: performance, best practice, preference for SQL vs. the code language, complexity of the task...?
All things being equal I prefer to do any manipulation in code. I try to return data as raw as possible so it's usable by a larger base of consumers. If it's very specialized, maybe for a report, then I may do the manipulation on the SQL side.
Another instance where I prefer to do manipulation on the SQL side is when it can be done set-based.
If it's not set-based, and looping would be involved, then I would do the manipulation in code.
Basically, let the database do what it's good at; otherwise, do it in code.
Formatting is a UI issue; it is not 'manipulation'.
My answer is the reverse of everyone else's.
If you are going to have to apply the same formatting logic (the same holds true for calculation logic) in more than one place in your application, or in separate applications, I would encapsulate the formatting in a view inside the database and SELECT from the view. You do not need to hide the original data, that can also be available. But by putting the logic into the database view you're making it trivially easy to have consistent formatting across modules and applications.
For instance, a Customer table would have an associated view CustomerEx with a MailingAddress derived column that would format the various parts of the address as required, combining city, state, and zip and compressing out blank lines, etc. My application code SELECTs against the CustomerEx view for addresses. If I extend my data model with, say, an Apt# field or to handle international addresses, I only need to change that single view. I do not need to change, or even recompile, my application.
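To make the idea concrete, a minimal sketch of such a view might look like the following (SQL Server flavor; the column names and the NULL handling are illustrative assumptions, not the poster's actual schema):

CREATE VIEW CustomerEx AS
SELECT c.CustomerId,
       c.FirstName,
       c.LastName,
       -- Derived, consistently formatted address; NULLIF/ISNULL compress
       -- out blank parts so empty lines never appear in the output.
       ISNULL(NULLIF(c.Street, '') + CHAR(13) + CHAR(10), '')
         + c.City + ', ' + c.State + ' ' + c.Zip AS MailingAddress
FROM dbo.Customer AS c;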
I would never (ever) specify any formatting in the query itself. That is up to the consumer to decide how to format. All data manipulation should be done at the client side, except for bulk operations.
If it is just formatting and will not always need to be the same formatting, I'd do it in the application, which is likely to do it faster.
However, the fastest formatting is the one that is done only once, so if it is a standard format that I always want to use (say, displaying American phone numbers as (###)###-####), then I'll store the data in the database in that format (this may still involve the application code, but on the insert, not the select). This is especially true if you might need to reformat a million records for a report. If you have several formats, you might consider calculated columns (we have one for full name and one for lastname, firstname, and our raw data is firstname, middlename, lastname, suffix) or triggers to persist the data. In general I say store the data the way you need to see it, as long as you can keep it in the appropriate data type for the real manipulations you need to do, such as date math or regular math for money values.
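As a rough illustration of the persisted calculated-column approach (SQL Server syntax; the table and column names are hypothetical):

-- The formatted names are computed once and stored with the row, so a
-- million-row report never has to reformat them at read time.
ALTER TABLE dbo.Person
    ADD FullName  AS (FirstName + ISNULL(' ' + MiddleName, '') + ' ' + LastName
                      + ISNULL(' ' + Suffix, '')) PERSISTED,
        LastFirst AS (LastName + ', ' + FirstName) PERSISTED;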
About the only thing that I do in a query that could probably be done in code also is converting the datetimes to the user's time zone.
MySQL's CONVERT_TZ() function is easy to use and accurate. I store all of my datetimes in UTC and retrieve them in the user's time zone. Daylight saving rules change, and this matters especially for client applications, since relying on the OS's native time zone library means trusting that the user has kept their OS updated.
Even for server side code, like a web server, I only have to update a few tables to get the latest time zone data instead of updating the OS on the server.
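A minimal sketch of the UTC-in, local-out pattern (the orders table, column names, and target zone are just examples; CONVERT_TZ needs MySQL's time zone tables to be loaded):

-- Stored in UTC; converted to the user's zone only at read time.
SELECT order_id,
       CONVERT_TZ(created_at_utc, 'UTC', 'America/Chicago') AS created_local
FROM orders;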
Other than those types of issues, it's probably best to distribute most functions to the application server or client rather than making your database the bottleneck. Application servers are easier to scale than database servers.
If you can write a stored procedure (or something similar) that starts with a large dataset and does some inexpensive calculation or simple iteration to return a single row or value, then it probably makes sense to do it on the server, to avoid sending large datasets over the wire. So, if the processing is inexpensive, why not have the database return just what you need?
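For example, instead of pulling every row back and summing in application code, something like this (hypothetical table and columns) sends only two numbers over the wire:

-- The database scans the rows; only the aggregates cross the network.
SELECT COUNT(*)   AS order_count,
       SUM(total) AS revenue
FROM dbo.Orders
WHERE order_date >= '2023-01-01';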
In the case of the date column, I'd save the full date in the DB and when I return it I specify in code how I'd like to show it to the user. This way you can ignore the time part or even change the order of the date parts when you show it in a datagrid for example: mm/dd/yyyy, dd/mm/yyyy or only mm/yyyy.
I have a table named buildings
Each building has zero to n images.
I have two solutions:
the first one (the classic solution) using two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
and the second solution using only one table:
buildings(id, name, address, image_urls_csv)
Given I won't need to search by image URL obviously,
I think the second solution (using the image_urls_csv column) is easier to use: there is no need to create another table just to keep the images, and I avoid the hassle of multiple queries and joins.
The question is: if I don't really want to filter, search, or group by the field value, can I just make it CSV?
On the one hand, yes, simply having a column with a list of image URLs avoids joins and multiple queries. A single round-trip to the db is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put in for that field, breaking the whole thing. Or add some extra fields to each for extra metadata ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
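For comparison, the "keep them separate" option from the question is only a few lines of DDL, and one LEFT JOIN still fetches a building with all of its images in a single round-trip (the types and sizes below are assumptions):

CREATE TABLE buildings (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    address VARCHAR(200)
);

CREATE TABLE building_images (
    id          INT PRIMARY KEY,
    building_id INT NOT NULL REFERENCES buildings(id),
    image_url   VARCHAR(500)
);

-- One query, one round-trip, one row per (building, image) pair.
SELECT b.id, b.name, bi.image_url
FROM buildings b
LEFT JOIN building_images bi ON bi.building_id = b.id
WHERE b.id = 42;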
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.
I was bored and looking at old code that runs like molasses on a cold day. I found a group of tables in our accounting system - each with 500,000 records of ~20 datapoints - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .NET ETL project is grabbing all records, doing a bunch of substrings on each record to set an object's corresponding attributes, then sending the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there were a way to use the TextFieldParser to parse a static string instead of a file/stream, that might be a valid idea. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than split, which I would assume is comparable to the string manipulation our project is currently doing with substring. So there may be something to that idea.
Or perhaps the whole old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL? Or can it import directly from the table?
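Another thought I had: since the parsing is just fixed-width substrings, it could presumably also be done set-based inside SQL Server before (or instead of) the .NET hop. Something like the sketch below, where the offsets, widths, and names are completely made up and would have to come from the real record layout:

-- Illustrative only: pretend the concatenated column is called raw_record.
SELECT SUBSTRING(raw_record,  1, 10)                         AS account_no,
       SUBSTRING(raw_record, 11, 30)                         AS description,
       CAST(SUBSTRING(raw_record, 41, 12) AS DECIMAL(12, 2)) AS amount
FROM dbo.LegacyAccountingTable;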
Thank you in advance!
I've written code in SQL Server to create an XML output. However, this exports with no carriage returns.
I initially built a workaround with a replace statement around the entire XML output code that would embed carriage returns between the nodes, but because that only allows me to export a small amount of data at a time, it's not sufficient long-term. When I try to run this on larger datasets, it truncates the text around 65000 characters.
I've tried to cast the entire statement as nvarchar(max) to increase the output size but that doesn't seem to work either. Does anybody have any recommendations for how to do this that isn't just find+replace once the file has already been output from SQL?
First, I would educate the client. I would imagine the request is to make the output human readable, but it also expands the size of the returned set. They will likely stick to their guns, but education often stops people from spending money on stupid crap.
Second, I would not do this in SQL Server. This is a user interface type of task (including service endpoints as "user" interface here) and not a task to be done in the database. Doing it outside of SQL Server gives you better access to the XML DOM, which can help if they are truly CRLF and not the &#__; numeric equivalents. If the latter, you will have to use a replace function.
If you HAVE to do this in SQL Server, grab the XML result and then replace. I would do this the easy way and replace > with >CRLF and see if that is acceptable, as it is less time consuming. Without the DOM it is difficult to know the difference between open tags and end tags. You can find the right tag using regex, if you want to go that far, but SQL Server's implementation is not as good as many programming languages, so this will be time consuming.
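If it does have to stay in SQL Server, a rough sketch of the cast-then-replace approach might look like this (the source table is hypothetical; replacing '><' rather than every '>' is a small variation that avoids touching text content, and casting to NVARCHAR(MAX) before any string work is what sidesteps the ~65,000-character truncation):

DECLARE @xml NVARCHAR(MAX) =
    CAST((SELECT id, name
          FROM dbo.SomeTable            -- hypothetical source
          FOR XML PATH('row'), ROOT('rows')) AS NVARCHAR(MAX));

-- Inject a CRLF between adjacent tags.
SELECT REPLACE(@xml, '><', '>' + CHAR(13) + CHAR(10) + '<') AS pretty_xml;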
Ultimately, if they are willing to pay you for something that does not make a difference, then that is their baby, but it is a useless exercise IMO.
Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains private information and writing a custom script to 'scrub' the data, is there any tool or script which can scrub the data but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create views that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
NAME,
LEFT(CreditCard,5) + '****' As CreditCard -- OR, don't show this column at all
....
FROM customer
Firstly, I need to state a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are usually found, but you also need to find the "hidden" data, where either the data is embedded in a standard format (e.g. string-name-string) under a column name like "reference code", or it sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Decide what context the "scrubbed" data needs to be in. Changing name fields to random characters causes problems when testing, as users focus on text errors rather than functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs to have a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
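Not what these tools do internally, but as a bare-bones illustration of the "valid prefix, invalid rest" idea from step 2, a hand-rolled T-SQL pass over a hypothetical column might look like:

-- Keep a plausible prefix, overwrite the rest, preserve the original length.
UPDATE dbo.Customer
SET CreditCard = LEFT(CreditCard, 4) + REPLICATE('9', LEN(CreditCard) - 4)
WHERE CreditCard IS NOT NULL
  AND LEN(CreditCard) > 4;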
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.
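For reference, on SQL Server the values in question are the character data types, so the filter would look something like this (adjust the list to taste):

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext');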
I am trying to figure out the best way to model a spreadsheet (from the database point of view), taking into account:
The spreadsheet can contain a variable number of rows.
The spreadsheet can contain a variable number of columns.
Each column can contain one single value, but its type is unknown (integer, date, string).
It has to be easy (and performant) to generate a CSV file containing the data.
I am thinking about something like:
from django.db import models

class Cell(models.Model):
    # String reference, since Column is defined further down in the module.
    column = models.ForeignKey('Column', on_delete=models.CASCADE)
    row_number = models.IntegerField()
    value = models.CharField(max_length=100)

class Column(models.Model):
    spreadsheet = models.ForeignKey('Spreadsheet', on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    type = models.CharField(max_length=100)

class Spreadsheet(models.Model):
    name = models.CharField(max_length=100)
    creation_date = models.DateField()
Can you think of a better way to model a spreadsheet? My approach stores every value as a string. I am worried about it being too slow when generating the CSV file.
From a relational viewpoint:
Spreadsheet <-->> Cell : RowId, ColumnId, ValueType, Contents
There is no requirement for row and column to be entities, but you can model them that way if you like.
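Concretely, that relation could be sketched like this, with a grouped query to emit one CSV line per row (PostgreSQL syntax for string_agg; the 100-character limit mirrors the question and is otherwise arbitrary; real CSV output would also need quoting of values that contain commas):

CREATE TABLE spreadsheet (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE cell (
    spreadsheet_id INT NOT NULL REFERENCES spreadsheet(id),
    row_id         INT NOT NULL,
    column_id      INT NOT NULL,
    value_type     VARCHAR(20),
    contents       VARCHAR(100),
    PRIMARY KEY (spreadsheet_id, row_id, column_id)
);

-- One CSV line per spreadsheet row, columns in order.
SELECT row_id,
       string_agg(contents, ',' ORDER BY column_id) AS csv_line
FROM cell
WHERE spreadsheet_id = 1
GROUP BY row_id
ORDER BY row_id;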
Databases aren't designed for this. But you can try a couple of different ways.
The naive way to do it is to do a version of One Table To Rule Them All. That is, create a giant generic table, all types being (n)varchars, that has enough columns to cover any foreseeable spreadsheet. Then you'll need a second table to store metadata about the first, such as what Column1's spreadsheet column name is, what type it stores (so you can cast in and out), etc. Then you'll need triggers to run against inserts that check the incoming data and the metadata to make sure the data isn't corrupt, etc. etc. etc. As you can see, this way is a complete and utter cluster. I'd run screaming from it.
The second option is to store your data as XML. Most modern databases have XML data types and some support for XPath within queries. You can also use XSDs to provide some kind of data validation, and XSLTs to transform that data into CSVs. I'm currently doing something similar with configuration files, and it's working out okay so far. No word on performance issues yet, but I'm trusting Knuth on that one.
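A tiny sketch of what the XML option can look like on SQL Server (the element names are invented; .value() pulls a single cell back out with an XQuery path):

CREATE TABLE spreadsheet_doc (
    id  INT PRIMARY KEY,
    doc XML NOT NULL
);

-- Extract the first cell of the first row from each stored sheet.
SELECT id,
       doc.value('(/sheet/row[1]/cell[1]/text())[1]', 'varchar(100)') AS first_cell
FROM spreadsheet_doc;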
The first option is probably much easier to search and faster to retrieve data from, but the second is probably more stable and definitely easier to program against.
It's times like this I wish Celko had a SO account.
You may want to study EAV (Entity-attribute-value) data models, as they are trying to solve a similar problem.
Entity-Attribute-Value - Wikipedia
The best solution greatly depends on the way the database will be used. Try to find a couple of the top use cases you expect and then decide on the design. For example, if there is no use case for getting the value of a single cell from the database (the data is always loaded at row level, or even in groups of rows), then there is no need to store a 'cell' as such.
That is a good question that calls for many answers, depending on how you approach it; I'd love to share an opinion with you.
This topic is one of several we have looked into at Zenkit; we even wrote an article about it, and we'd love your opinion on it: https://zenkit.com/en/blog/spreadsheets-vs-databases/