Using integers and requiring multiplication vs. using decimals as a data type - what are your thoughts? - sql

What are your thoughts on this? I'm working on integrating some new data that's in a tab-delimited text format, and all of the decimal columns are kept as single integers; in order to determine the decimal amount you need to multiply the number by .01. It does this for things like percentages, weight and pricing information. For example, an item's price is expressed as 3259 in the data files, and when I want to display it I would need to multiply it in order to get the "real" amount of 32.59.
Do you think this is a good or bad idea? Should I be keeping my data structure identical to the one provided by the vendor, or should I make the database columns true decimals and use SSIS or some sort of ETL process to automatically multiply the integer columns into their decimal equivalent? At this point I haven't decided if I am going to use an ORM or Stored Procedures or what to retrieve the data, so I'm trying to think long term and decide which approach to use. I could also easily just handle this in code from a DTO or similar, something along the lines of:
public class Product
{
// ...
private int _price;
public decimal Price
{
get
{
return (this._price * .01);
}
set
{
this._price = (value / .01);
}
}
}
But that seems like extra and unnecessary work on the part of a class. How would you approach this, keeping in mind that the data is provided in the integer format by a vendor that you will regularly need to get updates from.

"Do you think this is a good or bad idea?"
Bad.
"Should I be keeping my data structure identical to the one provided by the vendor?"
No.
"Should I make the database columns true decimals?"
Yes.
It's so much simpler to do what's right. Currently, the data is transmitted with no "." to separate the whole numbers from the decimals; that doesn't any real significance.
The data is decimal. Decimal math works. Use the decimal math provided by your language and database. Don't invent your own version of Decimal arithmetic.

Personally I would much prefer to have the data stored correctly in my database and just do a simple conversion every time an update comes in.

Pedantically: they aren't kept as ints either. They are strings that require parsing.
Philisophically: you have information in the file and you should write data into the database. That means transforming the information in any ways necessary to make it meaningful/useful. If you don't do this transform up front, then you'll be doomed to repeat the transform across all consumers of the database.
There are some scenarios where you aren't allowed to transform the data, such as being able to answer the question: "What was in the file?". Those scenarios would require the data to be written as string - if the parse failed, you wouldn't have an accurate representation of the file.

In my mind the most important facet of using Decimal over Int in this scenario is maintainability.
Data stored in the tables should be clearly meaningful without need for arbitrary manipulation. If manipulation is required is should be clearly evident that it is (such as from the field name).
I recently dealt with data where days of the week were stored as values 2-8. You can not imagine the fall out this caused (testing didn't show the problem for a variety of reasons, but live use did cause political explosions).
If you do ever run in to such a situation, I would be absolutely certain to ensure data can not be written to or read from the table without use of stored procedures or views. This enables you to ensure the necessary manipulation is both enforced and documented. If you don't have both of these, some poor sod who follows you in the future will curse your very name.

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and enlisted
Meantime,
though wouldn't really matter if using a query-builder ...
based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)
While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example
So, my thoughts are:
When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.
Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support
Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Does BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
ANY_VALUE(SAFE_CAST(age AS INT64)) > 10
and ANY_VALUE(SAFE_CAST(age AS INT64)) < 30
It is normal that with less columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in future, you'll need to convert string to date to get the age and then be able to do the manupulation.
Data size: The data size has broader impacts that can not be seen at the start. For example if you have N parallel test teams which require own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables it will cost you considerable time. For example typically telco operators have a couple of billions of rows data per month.
If your code complexity increase, you'll need to replicate conversions in multiple places.
Even single of above items should push one to distance from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation, remember that the BigQuery Engine will have to deal with a CAST operation for each value per row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However if we execute the same query adding the the CAST operator.
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, that might cause problems when escalating the operation size.
Also, remember that each time that you want to use the data type properties of each data type, you will have to cast your value, and deal with the compute operation time required.
Finally, referring to the storage performance, as you mentioned Strings do not have a fixed size, and that might cause a size increase.

What is the benefit of the one "DATE" datatype over another in Laravel/SQLite?

In my app, I'm just using a SQLlite database for development. Now in the migration, I declare a DATE datatype which laravel seems to handle without any problem, and in the database itself creates it as a varchar.
According to this nice article (http://www.sqlitetutorial.net/sqlite-date/) SQLite has basically got three options for handling dates:
Using the TEXT storage class for storing SQLite date and time Using
REAL storage class to store SQLite date and time values
Using INTEGER to store SQLite date and time values
So as I'm trying to formulate my approach, I'm thinking ahead that I will likely end up, at some point, need to step up and move to a higher performance SQL database (mySQL / Postgres / etc. ) And then may have datatype translation challenges.
But then also, at the application layer, Laravel itself has some manipulations.
Now, the question I'm asking is this, What is the benefit of one type over another? Is there some kind of reason to choose one type over another? My thinking is that TEXT is nice and human-readable for backend support, but it may require addiotnal coding to manipulate strings.
INTEGERS are probably more efficient, and would be translatable to a bigger SQL server easier than text.
Does anyone know of a comparison of the pro's and con's of various choices?
Any advice? Thanks in advance.
The size of integer is 4 bytes. The size of a letter in text is 1 byte.
To represent date and time you need 1 UTC number when you use integer. So its much better to user 4 bytes of integer than using 8 bytes of text. I dont see how real can be better than integer for the exact same reason. I would say you should use integer.

How to speed up table-retrieval with MATLAB and JDBC?

I am accessing a PostGreSQL 8.4 database with JDBC called by MATLAB.
The tables I am interested in basically consist of various columns of different datatypes. They are selected through their time-stamps.
Since I want to retrieve big amounts of data I am looking for a way of making the request faster than it is right now.
What I am doing at the moment is the following:
First I establish a connection to the database and call it DBConn. Next step would be to prepare a Select-Statement and execute it:
QUERYSTRING = ['SELECT * FROM ' TABLENAME '...
WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];
QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();
Then I store the columntypes in variable COLTYPE (1 for FLOAT, -1 for BOOLEAN and 0 for the rest - nearly all columns contain FLOAT). Next step is to process every row, column by column, and retrieve the data by the corresponding methods. FNAMES contains the fieldnames of the table.
m=0; % Variable containing rownumber
while RESULTSET.next()
m = m+1;
for n = 1:length(FNAMES)
if COLTYPE(n)==1 % Columntype is a FLOAT
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
elseif COLTYPE(n)==-1 % Columntype is a BOOLEAN
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
else
DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
end
end
end
When I am done with my request I close the statement and the connection.
I don´t have the MATLAB database toolbox so I am looking for solutions without it.
I understand that it is very ineffective to request the data of every single field. Still, I failed on finding a way to get more data at once - for example multiple rows of the same column. Is there any way to do so? Do you have other suggestions of speeding the request up?
Summary
To speed this up, push the loops, and then your column datatype conversion, down in to the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column data types like strings and dates.
(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)
Details
Push loops down to Java to get block fetching
The most straightforward way to get this faster is to push the loops over the rows and columns down in to the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code is going to incur (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're incurring that overhead on every single column of every row, and that's probably the majority of your execution time right now.
The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down in to the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since they implement their lower-level cursor stuff in Java. (Probably for precisely this reason.) But if you can't have a database toolbox dependency, you can just write your own thin Java layer to do so, and call that from your M-code. (Probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it.) Make the Java code and Matlab code share a block size, buffer up the whole block on the Java side, using primitive arrays instead of object arrays for column buffers wherever possible, and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.
Pseudocode for the Matlab layer:
colBufs = repmat( {{}}, [1 nCols] );
while (cursor.hasMore())
cursor.fetchBlock();
for iCol = 1:nCols
colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
end
end
for iCol = 1:nCols
colResults{iCol} = cat(2, colBufs{iCol}{:});
end
Twiddle JDBC DBMS driver knobs
Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults to setting the default fetch buffer size (the one inside their JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on the "Go Fast" switch that had shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
Optimize column datatype transfer
In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. After you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and passing strings back from Java to Matlab. Push the date parsing down in to Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculation of repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side. And cellstrs are a big bottleneck - a single char column could swamp the cost of several float columns. Return narrow CHAR columns as 2-d chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[]converts to cellstr less efficiently.) And you can optimize the retrieval of low-cardinality character columns by passing them back as "symbols" - on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as an primitive array of numeric codes along with that map of number -> string; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand it to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization could be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at about primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation. (Database Toolbox may support some of this stuff now, too. I haven't actually used it in a couple years.)
After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.
You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.
If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s), possibly parallelizing per-column date and string processing if you have a large cursor block size, while you're doing the M-code level processing for the previous block. This really bumps up the difficulty though, and ideally is a small performance win because you've already pushed the expensive data processing down in to the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
Miscellaneous
If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...). E.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win because you're still paying for the Java/Matlab data conversion on each call.
Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're doing a custom Java layer, defining your own #connection and #preparedstatement M-code classes is a decent way to do this. So it looks like this.
QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);
This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
It occurs to me that to speed up the query to the table, you have to remove if, for that in the JDBC ResulSetMetadata there that give you the information about the data type of the column and the same name, so you save time of if, else if
  ResultSetMetaData RsmD = (ResultSetMetaData) rs.getMetaData ();
        int cols=rsMd.getColumnCount();
        
  while (rs.next)
for i=1 to cols
row[i]=rs.getString(i);
My example is pseudocode becouse i´m not matlab programmer.
I hope you find it useful JDBC, anything let me know!

How best to represent rational numbers in SQL Server?

I'm working with data that is natively supplied as rational numbers. I have a slick generic C# class which beautifully represents this data in C# and allows conversion to many other forms. Unfortunately, when I turn around and want to store this in SQL, I've got a couple solutions in mind but none of them are very satisfying.
Here is an example. I have the raw value 2/3 which my new Rational<int>(2, 3) easily handles in C#. The options I've thought of for storing this in the database are as follows:
Just as a decimal/floating point, i.e. value = 0.66666667 of various precisions and exactness.
Pros: this allows me to query the data, e.g. find values < 1.
Cons: it has a loss of exactness and it is ugly when I go to display this simple value back in the UI.
Store as two exact integer fields, e.g. numerator = 2, denominator = 3 of various precisions and exactness.
Pros: This allows me to precisely represent the original value and display it in its simplest form later.
Cons: I now have two fields to represent this value and querying is now complicated/less efficient as every query must perform the arithmetic, e.g. find numerator / denominator < 1.
Serialize as string data, i.e. "2/3". I would be able to know the max string length and have a varchar that could hold this.
Pros: I'm back to one field but with an exact representation.
Cons: querying is pretty much busted and pay a serialization cost.
A combination of #1 & #2.
Pros: easily/efficiently query for ranges of values, and have precise values in the UI.
Cons: three fields (!?!) to hold one piece of data, must keep multiple representations in sync which breaks D.R.Y.
A combination of #1 & #3.
Pros: easily/efficiently query for ranges of values, and have precise values in the UI.
Cons: back down to two fields to hold one piece data, must keep multiple representations in sync which breaks D.R.Y., and must pay extra serialization costs.
Does anyone have another out-of-the-box solution which is better than these? Are there other things I'm not considering? Is there a relatively easy way to do this in SQL that I'm just unaware of?
If you're using SQL Server 2005 or 2008, you have the option to define your own CLR data types:
Beginning with SQL Server 2005, you
can use user-defined types (UDTs) to
extend the scalar type system of the
server, enabling storage of CLR
objects in a SQL Server database. UDTs
can contain multiple elements and can
have behaviors, differentiating them
from the traditional alias data types
which consist of a single SQL Server
system data type.
Because UDTs are accessed by the
system as a whole, their use for
complex data types may negatively
impact performance. Complex data is
generally best modeled using
traditional rows and tables. UDTs in
SQL Server are well suited to the
following:
Date, time, currency, and extended numeric types
Geospatial applications
Encoded or encrypted data
If you can live with the limitations, I can't imagine a better way to map data you're already capturing in a custom class.
I would probably go with Option #4, but use a calculated column for the 3rd column to avoid the sync/DRY issue (and also means you actually only store 2 columns, avoiding the "three fields" issue).
In SQL server, calculated column is defined like so:
CREATE TABLE dbo.Whatever(
Numerator INT NOT NULL,
Denominator INT NOT NULL,
Value AS (Numerator / Denominator) PERSISTED
)
(note you may have to do some type conversion and verification that Denominator is not zero, etc).
Also, SQL 2005 added a PERSISTED calculated column that would get rid of the calculation at query time.
How much precision do you need?
The language, C# or otherwise, will round 2/3rds at a given position in the precision. If it's acceptable for whatever you are working on to use decimal values of say scientific notation of 10, then set the precision accordingly in the db.
If the precision is really a concern, then separate the numerator & denominator. This would ensure you always have access to whatever precision you want, and you can use a computed column to represent the value for quick filtering:
numerator INT,
denominator INT,
result AS CASE WHEN denominator > 0 THEN numerator / denominator ELSE NULL END
I have experimented a little bit with using the geometry data type in SQL Server 2008 to store and manipulate rational numbers. Basically, I assume that the numerator goes in the X slot and the denominator goes in the Y slot of a fictitious geometry point.
This was good for my needs, but it might be useless for yours. That will depend on what your priorities are (performance, code readability, etc.). I personally found that T-SQL for geometry data manipulation is hard to write and read.
how much precision are you looking at ? double/float provide decent precision(in my opinion). Am pretty sure scientific/astronomical data need a lot more precision that that. I do know that libraries like matlab and mathematica are good at these. I found that you can use mathematica with your .net program. Here is the link
Edit: adding more links and quotes
"When Mathematica operates on rational numbers, it gives an exact result no matter how many digits are required" from here
Another good read, but you would have to implement it I guess

Database vs. Front-End for Output Formatting

I've read that (all things equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when shortened field is so significantly shorter that we want to do it at database level to reduce result set size)
two separate columns to let front-end case the output as required
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line on how certain layers could rotate out in place for other implementations. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
Usually, data formatting is better done on client side, especially culture-specific formatting.
Dynamic pivoting (i. e. variable columns) is also an example of what is better done on client side
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can use additional data which is also kept in the database. Like, the coloring info for each row can be stored in additional table.
You should then correspond the color to each row on database side, but wrap it into the tags on PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front-end instead. I don't stick to it religously, but in general I think it's better as it provides greater flexibility - e.g. 1 sproc can service n different requirements for data, each of which can format the data as each individually needs. Otherwise, you end up either with multiple queries returning the same data with slightly different formatting from the DB (from a SQL Server point of view, thus reducing execution plan caching benefits - therefore negative impact on performance).
Leave output formatting to the web server