My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database; the client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on).
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new        |
+------+------------+
| 3    | 0000000003 |
| 44   | 0000000044 |
| 105  | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some form.
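To see why left padding answers the "new rows must sort correctly without regenerating anything" requirement, here is a minimal Python sketch. The helper name sort_key and the width of 10 are assumptions for illustration (the question only promises at most 5 digits); each key depends only on its own row's value.

```python
def sort_key(value: str, width: int = 10) -> str:
    """Left-pad a string of digits with zeros to a fixed width (hypothetical helper)."""
    return value.strip().zfill(width)

# Existing rows: each key is derived from that row alone.
rows = ["105", "3", "44"]
keys = {v: sort_key(v) for v in rows}

# Sorting on the padded strings (plain string comparison) gives numeric order.
ordered = sorted(rows, key=keys.get)

# A new row gets its own key; no other row's key has to be recomputed.
rows.append("20")
keys["20"] = sort_key("20")
ordered_after_insert = sorted(rows, key=keys.get)

print(ordered)
print(ordered_after_insert)
```

Because every padded string has the same length, comparing them character by character is the same as comparing the numbers.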
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack, but it works as long as the "numbers" in your string column are non-negative integers with no leading zeros, signs, or surrounding spaces.
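As a rough check of the length-then-value ordering, the query can be run against an in-memory SQLite database from Python. SQLite is only a stand-in here for whatever DBMS you actually use; table1 and column1 are the placeholder names from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT)")
conn.executemany(
    "INSERT INTO table1 (column1) VALUES (?)",
    [("105",), ("3",), ("44",), ("9",)],
)

# Shorter strings first; strings of equal length are compared as text,
# which matches numeric order when there are no leading zeros or signs.
rows = [
    r[0]
    for r in conn.execute(
        "SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC"
    )
]
print(rows)
```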
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.
Say you want to record three numbers for every Movie record...let's say, :release_year, :box_office, and :budget.
Conventionally, using Rails, you would just add those three attributes to the Movie model and just call @movie.release_year, @movie.box_office, and @movie.budget.
Would it save any database space or provide any other benefits to condense all three numbers into one umbrella column?
So when adding the three numbers, it would go something like:
def update
  ...
  # Join the three submitted values into one comma-separated string
  @movie.umbrella = [params[:movie_release_year],
                     params[:movie_box_office],
                     params[:movie_budget]].join(",")
end
So the final @movie.umbrella value would be along the lines of "2015,617293,748273".
And then in the controller, to access the three values, it would be something like
@umbrella_array = @movie.umbrella.strip.split(',').map(&:strip)
@release_year = @umbrella_array.first
@box_office = @umbrella_array.second
@budget = @umbrella_array.third
This way, it would be the same amount of data (actually a little more, with the extra commas) but stored only in one column. Would this be better in any way than three columns?
There is no benefit in squeezing such attributes in a single column. In fact, following that path will increase the complexity of your code and will limit your capabilities.
Here are some of the issues you'll face:
You will not be able to add indexes to speed up lookups, filtering, or sorting on a specific attribute value
You will not be able to query a specific attribute value
You will not be able to sort by a specific attribute value
The values will be stored and represented as Strings, rather than Integers
... and I can continue. There are no advantages, only disadvantages.
I agree with the comments above. As an example, try using pg_column_size() to compare the results:
WITH test(data_txt,data_int,data_date) AS ( VALUES
('9999'::TEXT,9999::INTEGER,'2015-01-01'::DATE),
('99999999'::TEXT,99999999::INTEGER,'2015-02-02'::DATE),
('2015-02-02'::TEXT,99999999::INTEGER,'2015-02-02'::DATE)
)
SELECT pg_column_size(data_txt) AS txt_size,
pg_column_size(data_int) AS int_size,
pg_column_size(data_date) AS date_size
FROM test;
The result is:
 txt_size | int_size | date_size
----------+----------+-----------
        5 |        4 |         4
        9 |        4 |         4
       11 |        4 |         4
(3 rows)
I need to select x unique rows randomly from a table with n rows, while excluding 1 specific row. (x is small, 3 for example) This can be done in several queries if needed and I can also compute anything in programming language (Java). The one important thing is that it must be done faster than O(n), consuming O(x) memory and indefinite looping (retrying) is also undesirable.
Probability of selection should be equal for all rows (except the one which is excluded, of course)
Example:
| id | some data |
|————|———————————|
| 1 | … |
| 2 | … |
| 3 | … |
| 4 | … |
| 5 | … |
The algorithm is run with arguments (x = 3, exclude_id = 4), so
it should select 3 different random rows from the rows with id in 1,2,3,5.
(end of example)
I tried the following approach:
get row count (= n);
get the position of the excluded row by something like select count(*) where id < excluded_id, assuming id is monotonically increasing;
select the numbers from 0..n, obeying all the rules, by using some "clever" algorithms, it's something like O(x), in other words fast enough;
select these x rows one by one by using limit(index, 1) SQL clause.
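The third step above (choosing the positions) could be sketched as follows. This is only one possible way to do it, not necessarily the "clever" algorithm the post refers to; the function name pick_positions and its parameters are made up for illustration. It samples from a virtual range of n - 1 slots and shifts every slot at or above the excluded position up by one, so the excluded row can never be chosen and no retry loop is needed.

```python
import random

def pick_positions(n, excluded_pos, x):
    """Return x distinct row positions in 0..n-1, never excluded_pos.

    Every allowed position is equally likely. range() is lazy, so for
    large n the memory used is roughly proportional to x, not n.
    """
    if not 0 <= excluded_pos < n:
        raise ValueError("excluded_pos must be a valid row position")
    if x > n - 1:
        raise ValueError("not enough rows to choose from")
    # Sample x distinct slots out of the n - 1 allowed ones.
    slots = random.sample(range(n - 1), x)
    # Slots at or after the excluded position map to the next row.
    return [s + 1 if s >= excluded_pos else s for s in slots]

# 5 rows, the row at position 3 (id 4 in the example) is excluded.
positions = pick_positions(n=5, excluded_pos=3, x=3)
print(positions)
```

Each returned position can then be fetched with the limit(index, 1) query from the last step. As the post goes on to explain, this only works if the excluded row's position is computed against the same ordering that the LIMIT queries use.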
However, it turned out that it's possible for rows to change positions (I'm not sure why), so the auto-generated ids are not monotonically increasing. In this case the second step (get the position of the excluded row) produces a wrong result and the algorithm fails to do its job correctly.
How can this be improved?
Also, if this is vastly easier with a different SQL-like database system, it would be interesting, too. (the DB is on a server and I can install any software there as long as it's compatible with Ubuntu 14.04 LTS)
(I'm sorry for a bit of confusion.) The algorithm I used is actually correct if the id is monotonically increasing. I just forgot that the id was not itself auto-generated in this table: it was taken from another table where it is auto-generated, and it was possible to add these rows in a different order.
So I added another id for this table, which is auto-generated, and used it for row selection, and now it works as it should.
I know that the "select record corresponding to the maximum value for a field" has been exhaustively answered, but I was wondering why nobody suggested using an ORDER BY clause to get the right row.
For example, I have this table:
| other_field | target_field |
| 1 | 15 |
| 2 | 25 |
| 3 | 20 |
and I want to find the other_field value corresponding to the maximum target_field (e.g. in this case, I want to find 2).
Many people suggested using GROUP and JOIN, however my first idea was to use:
SELECT other_field FROM table ORDER by target_field DESC LIMIT 1;
Is there anything wrong with this? The only problem I can think of is that maybe ordering takes longer than just finding the maximum (although on the other hand the JOIN might also take a while).
Thanks!
EDIT: sorry guys for the late replies, I'm new here and I was expecting to get some e-mails for notifications :)
Yes.
It actually has to sort every record before it can return any data. It's highly inefficient. It will return what you want, but not in the best possible way. Aggregate functions tend to do it much better, and much quicker.
With your current query, once you reached a much higher data load, it would take ages to process and materialize. (With smaller data sets, you should be fine)
If you need a single value from one or more tables, go for MAX with GROUP BY.
If you are querying only one table and need multiple columns, it is OK to use ORDER BY ... DESC.
If you need a single value from a single table, MAX is preferred there too.
I hope that makes my points clear.
You can try to use the following query :
select top 1 other_field from tester order by target_field desc;
It works well in Sybase. Not sure of other databases.
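For comparison, here is a small sketch that runs both styles discussed in this thread against the example table, using an in-memory SQLite database from Python. SQLite and the table name t are assumptions for illustration; the syntax for "first row only" (LIMIT, TOP, and so on) differs between database systems, and so does performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (other_field INTEGER, target_field INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 15), (2, 25), (3, 20)])

# Style 1: order descending and keep the first row.
by_order = conn.execute(
    "SELECT other_field FROM t ORDER BY target_field DESC LIMIT 1"
).fetchone()[0]

# Style 2: find the maximum with an aggregate, then select the matching row(s).
by_max = [
    r[0]
    for r in conn.execute(
        "SELECT other_field FROM t "
        "WHERE target_field = (SELECT MAX(target_field) FROM t)"
    )
]

print(by_order, by_max)
```

One behavioural difference worth noting: if several rows share the maximum target_field, the aggregate version returns all of them, while the ORDER BY version returns just one of them.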
I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number  User_Assigned  PendingInfo
1           user1          01,02
There is another table named Pending, that stores what those values actually represent. 01=Not enough info, 02=Not enough time, 03=Waiting Review. Example:
Pending_Num  PendingWord
01           Not Enough Info
02           Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. Here is what my limited skills have produced so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number  User_Assigned  PendingInfo  PendingWord      PendingInfo  PendingWord
1           User1          01           Not Enough Info  02           Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find ( ',' || fld || ',', ',02,' ) > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and end with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
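The comma-wrapping trick can be tried out with SQLite from Python, where the string search function happens to be called instr and concatenation is ||. Both of those are SQLite specifics used here as an assumption; your dialect will have its own names. The sample rows are invented to show that a value such as 102 does not produce a false match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobid (Job_Number INTEGER, PendingInfo TEXT)")
conn.executemany(
    "INSERT INTO jobid VALUES (?, ?)",
    [(1, "01,02"), (2, "102"), (3, "02"), (4, "01,03")],
)

# Wrap the column in commas, then search for the wrapped value ',02,'.
matches = [
    r[0]
    for r in conn.execute(
        "SELECT Job_Number FROM jobid "
        "WHERE instr(',' || PendingInfo || ',', ',02,') > 0 "
        "ORDER BY Job_Number"
    )
]
print(matches)
```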
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only on row insert/update, not on every single select, amortising that cost efficiently.
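A rough sketch of that trigger idea, using SQLite from Python. Trigger syntax varies a lot between database systems, so treat this only as an outline; the column names reason1 and reason2 and the trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobid (
    Job_Number  INTEGER PRIMARY KEY,
    PendingInfo TEXT,
    reason1     TEXT,  -- hypothetical new column
    reason2     TEXT   -- hypothetical new column
);

-- Split the comma-separated value once, when the row is written.
CREATE TRIGGER jobid_split_insert AFTER INSERT ON jobid
BEGIN
    UPDATE jobid
    SET reason1 = CASE WHEN instr(NEW.PendingInfo, ',') > 0
                       THEN substr(NEW.PendingInfo, 1, instr(NEW.PendingInfo, ',') - 1)
                       ELSE NEW.PendingInfo END,
        reason2 = CASE WHEN instr(NEW.PendingInfo, ',') > 0
                       THEN substr(NEW.PendingInfo, instr(NEW.PendingInfo, ',') + 1)
                       END
    WHERE Job_Number = NEW.Job_Number;
END;

-- Same split when the original column is changed later.
CREATE TRIGGER jobid_split_update AFTER UPDATE OF PendingInfo ON jobid
BEGIN
    UPDATE jobid
    SET reason1 = CASE WHEN instr(NEW.PendingInfo, ',') > 0
                       THEN substr(NEW.PendingInfo, 1, instr(NEW.PendingInfo, ',') - 1)
                       ELSE NEW.PendingInfo END,
        reason2 = CASE WHEN instr(NEW.PendingInfo, ',') > 0
                       THEN substr(NEW.PendingInfo, instr(NEW.PendingInfo, ',') + 1)
                       END
    WHERE Job_Number = NEW.Job_Number;
END;
""")

# The application keeps writing only the original column.
conn.execute("INSERT INTO jobid (Job_Number, PendingInfo) VALUES (1, '01,02')")
conn.execute("INSERT INTO jobid (Job_Number, PendingInfo) VALUES (2, '03')")
conn.execute("INSERT INTO jobid (Job_Number, PendingInfo) VALUES (3, '03')")
conn.execute("UPDATE jobid SET PendingInfo = '02,01' WHERE Job_Number = 2")

rows = conn.execute(
    "SELECT Job_Number, reason1, reason2 FROM jobid ORDER BY Job_Number"
).fetchall()
print(rows)
```

Queries can then filter on reason1 and reason2 directly instead of taking the string apart on every select.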
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with a support table, like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
If you need backward compatibility for your software, you can add a view to achieve it.
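As an illustration of the three-table layout, the sketch below builds it in an in-memory SQLite database from Python, joins it, and adds a view that rebuilds the old comma-separated column. The view name jobid_legacy and the use of SQLite's group_concat are assumptions made for this example; other systems have their own string aggregation functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (jobID INTEGER PRIMARY KEY, userID TEXT);
CREATE TABLE pending (pendingID TEXT PRIMARY KEY, pendingText TEXT);
CREATE TABLE job_pending (jobID INTEGER, pendingID TEXT);

INSERT INTO jobs VALUES (1, 'user13'), (2, 'user32');
INSERT INTO pending VALUES ('01', 'Not Enough Info'), ('02', 'Not Enough Time');
INSERT INTO job_pending VALUES (1, '01'), (1, '02'), (2, '01');

-- Hypothetical view that looks like the old single-column layout.
CREATE VIEW jobid_legacy AS
SELECT j.jobID  AS Job_Number,
       j.userID AS User_Assigned,
       group_concat(jp.pendingID, ',') AS PendingInfo
FROM jobs j
JOIN job_pending jp ON jp.jobID = j.jobID
GROUP BY j.jobID, j.userID;
""")

# One row per (job, reason), with the reason text looked up by a plain join.
reasons = conn.execute("""
    SELECT j.jobID, j.userID, p.pendingText
    FROM jobs j
    JOIN job_pending jp ON jp.jobID = j.jobID
    JOIN pending p ON p.pendingID = jp.pendingID
    ORDER BY j.jobID, p.pendingID
""").fetchall()

# Old-style rows, for software that still expects the combined column.
legacy = {
    job: info
    for job, _user, info in conn.execute("SELECT * FROM jobid_legacy")
}
print(reasons)
print(legacy)
```

The order of the values inside the rebuilt PendingInfo string is not guaranteed by group_concat, so code that depends on it should not rely on a particular order.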
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value like "3,4,15,6" into a ViewState array, so I can use it anywhere in code.
The other procedure does the opposite: it collects any options you checked and converts it in
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs, then extract the appropriate part of the string for each copy.
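The same "one output row per value in the list" idea, written out in plain Python rather than SQL, may make the query easier to follow. The sample data comes from the question; the function name expand is made up for this sketch.

```python
# Rows from the job table: (job number, user, comma-separated reasons).
jobs = [(1, "user1", "01,02"), (2, "user2", "03")]

# Lookup table from the Pending table.
pending = {
    "01": "Not Enough Info",
    "02": "Not Enough Time",
    "03": "Waiting Review",
}

def expand(jobs, pending):
    """Emit one row per (job, reason), like the Split CTE joined to Pending."""
    rows = []
    for job_number, user, info in jobs:
        # A job with k reasons is repeated k times, once per extracted piece.
        for code in info.split(","):
            code = code.strip()
            if code in pending:  # inner-join behaviour: unknown codes are dropped
                rows.append((job_number, user, code, pending[code]))
    return rows

result = expand(jobs, pending)
for row in result:
    print(row)
```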
While I agree with the DBA perspective that you should not store multiple values in a single field, it is doable, as shown below, and can be practical for application logic and in some performance scenarios. Let's say you have 10,000 user groups, each with 1,000 members on average. You may want a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,') with each number being a memberID and all of them separated by commas. Also add a comma at the start and end of the data. Then your select would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar) and you want, say, the rows containing 01, you can still use "select ... WHERE col LIKE '01,%' OR col LIKE '%,01' OR col LIKE '%,01,%' OR col = '01'" (col standing for your column name), which ensures a match at the start, at the end, or inside the list, or when 01 is the only value, while avoiding similar numbers (e.g. 101).
I have been working for a while on a reporting application where I use Hibernate to define my queries. However, more and more I get the feeling that this is not the best approach for reporting use cases.
The queries return only partial columns, and thus not typed objects (unless you cast all fields in Java).
It is hard to express queries without going straight into SQL or HQL.
My current problem is that I want to get the top N per group, for example the last 5 days per element in a group, where on each day I display the amount of visitors.
The result should look like:
| RowName | 1-1-2009 | 2-1-2009 | 3-1-2009 | 4-1-2009 | 5-1-2009
| SomeName| 1 | 42 | 34 | 32 | 35
What is the best approach to transform the data which is stored per day per row to an output like this? Is it time to fall back on regular sql and work with untyped data?
I really want to use typed objects for my results but java makes my life pretty hard for that. Any suggestions are welcome!
Using the Criteria API, you can do this:
Session session = ...;
Criteria criteria = session.createCriteria(MyClass.class);
criteria.setFirstResult(0); // row offsets are zero-based; 1 would skip the first row
criteria.setMaxResults(5);
... any other criteria ...
List topFive = criteria.list();
To do this in vanilla SQL (and to confirm that Hibernate is doing what you expect) check out this SO post: