Is condensing the number of columns in a database beneficial? - sql

Say you want to record three numbers for every Movie record...let's say, :release_year, :box_office, and :budget.
Conventionally, using Rails, you would just add those three attributes to the Movie model and just call #movie.release_year, #movie.box_office, and #movie.budget.
Would it save any database space or provide any other benefits to condense all three numbers into one umbrella column?
So when adding the three numbers, it would go something like:
def update
...
#movie.umbrella = params[:movie_release_year]
+ "," + params[:movie_box_office] + "," + params[:movie_budget]
end
So the final #movie.umbrella value would be along the lines of "2015,617293,748273".
And then in the controller, to access the three values, it would be something like
#umbrella_array = #movie.umbrella.strip.split(',').map(&:strip)
#release_year = #umbrella_array.first
#box_office = #umbrella_array.second
#budget = #umbrella_array.third
This way, it would be the same amount of data (actually a little more, with the extra commas) but stored only in one column. Would this be better in any way than three columns?

There is no benefit in squeezing such attributes in a single column. In fact, following that path will increase the complexity of your code and will limit your capabilities.
Here's some of the possible issues you'll face:
You will not be able to add indexes to increase the performance of lookup of records with a specific attribute value or sort the filtering
You will not be able to query a specific attribute value
You will not be able to sort by a specific column value
The values will be stored and represented as Strings, rather than Integers
... and I can continue. There are no advantages, only disadvantages.

Agree with comments above, as an example try to use pg_column_size() to compare results:
WITH test(data_txt,data_int,data_date) AS ( VALUES
('9999'::TEXT,9999::INTEGER,'2015-01-01'::DATE),
('99999999'::TEXT,99999999::INTEGER,'2015-02-02'::DATE),
('2015-02-02'::TEXT,99999999::INTEGER,'2015-02-02'::DATE)
)
SELECT pg_column_size(data_txt) AS txt_size,
pg_column_size(data_int) AS int_size,
pg_column_size(data_date) AS date_size
FROM test;
Result is :
txt_size | int_size | date_size
----------+----------+-----------
5 | 4 | 4
9 | 4 | 4
11 | 4 | 4
(3 rows)

Related

How to sort string data that represents numbers

My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new |
+------+------------+
| 3 | 0000000003 |
| 44 | 0000000044 |
| 105 | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some method.
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack but works as long as the "numbers" in your string column are natural, non-negative numbers only.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.

Spitting long column values to managable size for presenting data neatly

Hi I was wondering if there is a way to split long column values in this case I am using SSRS to get the distinct values with the number of product ID against a category into a matrix/pivot table in SSRS. The problem lies with the amount of distinct category makes it a nightmare to make the report look pretty shall we say. Is there a dynamic way to split the columns in say groups of 10 to make the table look nicer and easy to read. I was thinking of using in operator then the list of values but that means managing the data every time a new category gets added. Is there a dynamic way to present the data in the best way possible? There are 135 distinct category values
Also I am open to suggestions to make the report to nicer if anyone has any thoughts. I am new to SSRS and trying to get to grips with its.
Here is an example of my problem
enter image description here
Are your column names coming back from the database under the SubCat field you note in the comments above? If so I imagine your dataset looks something like this
Subcat | Logno
---------+---------------
SubCatA | 34
SubCatB | 65
SubCatC | 120
SubCatD | 8
SubCatE | 19
You can edit this so that there is an index of each individual category being returned also, using the Row_Number() function. Add the field
ROW_NUMBER() OVER (ORDER BY SubCat ASC) AS ColID
To your query. This will result in the following.
Subcat | LogNo | ColID
-----------+--------------+----------
SubCatA | 34 | 1
SubCatB | 65 | 2
SubCatC | 120 | 3
SubCatD | 8 | 4
SubCatE | 19 | 5
Now there is a numeric identifier for each column you can perform some logic on it to arrange itself nicely on the page.
This solution involves a Tablix, nested inside a Matrix nested inside a Matrix as follows
First create a Matrix (Matrix1), and set it’s datasource to your dataset. Set the Row Group Properties to group on the following expression where ‘4’ is the number of columns you wish to display horizontally.
=CInt(Floor((Fields!ColID.Value - 1) / 4))
Then in the data section of the Matrix (bottom right corner) insert a rectangle and on this insert a new Matrix (Matrix 2). Remove the leftmost row. Set the column header to be the Column Name SubCat. This will automatically set the column grouping to be SubCat.
Finally, in the Data Section of Matrix 2 add a new Rectangle and Add a Tablix on it. Remove the Header Row, and set it to be one column wide only. Set the Data to be the information you wish to display, i.e. LogNo.
Finally, delete the Leftmost and Topmost rows/columns from Matrix 1 to make it look tidier (Note Delete Column Row only! Not associated groups!)
Then when the report is run it should look similar to the following. Note in my example SubCat = ColName, and LogNo = NumItems, and I have multiple values per SubCat.
Hopefully you find this helpful. If not, please ask for clarification.
Can you do something like this:
The following gives the steps (in two columns, down then across)

comma separated column in linq where clause

i have string of value like "4,3,8"
and i had comma separated column in table as below.
ID | PrdID | cntrlIDs
1 | 1 | 4,8
2 | 2 | 3
3 | 3 | 3,4
4 | 4 | 5,6
5 | 5 | 10,14,18
i want only those records from above table which match in above mention string
eg.
1,2,3 this records will need in output because its match with the passing string of "4,3,8"
Note : i need this in entity framework LINQ Query.
string[] arrSearchFilter = "4,3,8".Split(',');
var query = (from prdtbl in ProductTables
where prdtbl.cntrlIDs.Split(',').Any(x=> arrSearchFilter.Contains(x))
but its not working and i got below error
LINQ to Entities does not recognize the method 'System.String[] Split(Char[])' method, and this method cannot be translated into a store expression.
LINQ to Entities tries to convert query expressions to SQL. String.Split is not one of the supported methods. See http://msdn.microsoft.com/en-us/library/vstudio/bb738534(v=vs.100).aspx
Assuming you are unable to redesign the database structure, you have to bypass the SQL filter and obtain ALL records and then apply the filter. You can do this by using ProductTables.ToList() and then using this in second query with the string split, e.g.
string[] arrSearchFilter = "4,3,8".Split(',');
var products = ProductTables.ToList();
var query = (from prdtbl in products
where prdtbl.cntrlIDs.Split(',').Any(x=> arrSearchFilter.Contains(x))
This is not a good idea if the Product table is large, as you are losing a key benefit of SQL and loading ALL the data before filtering.
Redesign
If that is a problem and you can change the database structure, you should create a child table that replaces the comma-separated values with a proper normalised structure. Comma separated variables might look like a convenient shortcut but they are not a good design and as you have found, are not easy to work with in SQL.
SQL
If the design cannot be changed and the table is large, then your only other option is to hand-roll the SQL and execute this directly, but this would lose some of the benefits of having Linq.

SQL: Find highest number if its in nvarchar format containing special characters

I need to pull the record containing the highest value, specifically I only need the value from that field. The problem is that the column is nvarchar format that contains a mix of numbers and special characters. The following is just an example:
PK | Column 2 (nvarchar)
-------------------
1 | .1.1.
2 | .10.1.1
3 | .5.1.7
4 | .4.1.
9 | .10.1.2
15 | .5.1.4
Basically, because of natural sort, the items in column 2 are sorted as strings. So instead of returning the PK for the row containing ".10.1.2" as the highest value i get the PK for the row that contains ".5.1.7" instead.
I attempted to write some functions to do this but it seems what I've written looked way more complicated than it should be. Anyone got something simple or complicated functions are the only way?
I want to make clear that I'm trying to grab the PK of the record that contains the highest Column 2 value.
This query might return what you desire
SELECT MAX(CAST(REPLACE(Column2, '.', '') as INT)) FROM table

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to create another column in the table which is a rolling sum of your weights, then pull your records by generating a random number between 0 and the total of all your weights, and pull the row with the highest rolling sum value less than the random number.
For example, if you had four rows with the following weights:
+---+--------+------------+
|row| weight | rollingsum |
+---+--------+------------+
| a | 3 | 3 |
| b | 3 | 6 |
| c | 4 | 10 |
| d | 1 | 11 |
+---+--------+------------+
Then, choose a random number n between 0 and 11, inclusive, and return row a if 0<=n<3, b if 3<=n<6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with 3 tables users, data and user-data. User-data contains which rows should be prefered for each user.
Then create one view based on the data rows that are prefered by the the user.
Create a second view that has the none prefered data.
Create a third view which is a union of the first 2. The union should select more rows from the prefered data.
Then finally select random rows from the third view.