here is how my tables are currently setup:
Dataset
|
- Dataset_Id - Int
|
- Timestamp - Timestamp
Flowrate
|
-Flowrate_id - int
|
-Dataset_id - ALL NULL (INT)
|
-TimeStamp - TimeStamp
|
-FlowRate - FLoat
I want to update the flowrate dataset_id column so that its ids corespond to the dataset dataset_ids. The Dataset table has over close to 400000 rows.... How can I do this so that it does not take forever. This data came from different data loggers and that's why I need to link them with their timestamps....
UPDATE
Flowrate JOIN Dataset ON (Flowrate.TimeStamp = Dataset.Timestamp)
SET Flowrate.Dataset_id = Dataset.Dataset_Id
completely independent from Python of course (what a weird tag to put here -- as if MySql cared what language you're using to send fixed SQL statements to it?!). Will be fast if and only if the tables are properly indexed, of course.
Absolutely weird capitalization irregularities you have in your schema, BTW -- would drive me absolutely bonkers if anybody used lowercase vs uppercase at random spots of column names that are so obviously "meant to" be identical! Nevertheless I've tried to reproduce it exactly, but I hope you reconsider this absurd style choice.
Related
My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new |
+------+------------+
| 3 | 0000000003 |
| 44 | 0000000044 |
| 105 | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some method.
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack but works as long as the "numbers" in your string column are natural, non-negative numbers only.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.
Short background: I have an SQLite database, a couple of gb in size and growing. It contains a bunch of very simple tables. Each table consists of a 64-bit integer primary index field (TStamp) and a value field (Val). The field TStamp is actually an long-int representation of a date-time. The tables have widely varying row-counts and somewhat variable content types, but that shouldn't matter. A master table (tbIndDate) holds a full range of dates, has the same primary index (TStamp) as the other tables, and holds human-readable date-time in its Val field. For instance,
The master index table, named tbIndDate:
TStamp Val
634082688000000000 5/1/2010 0:00:00
634082691000000000 5/1/2010 0:05:00
634082694000000000 5/1/2010 0:10:00
634082697000000000 5/1/2010 0:15:00
etc etc
A sample table for automation tag 6FI1.PV, named tb6FI1%PV:
TStamp Val
634085793000000000 41.7
634085796000000000 42.83
634085799000000000 41.44
634085802000000000 40.43
634085805000000000 39.78
etc etc
Getting data into the tables is handled by a little vb.net program, and when a new automation tag is added to the capture list then the program creates a new table using the automation tag name, and begins populating it. That all works real slick.
OK. I've started building a tool for getting data out of the database. It works great for inner joins:
SELECT [tbIndDate].[Val] AS 'Timestamp',[tb6FI1%PV].[Val] AS '6FI1.PV',
[tb6FI34%PV].[Val] AS '6FI34.PV',[tb6AI32%PV].[Val] AS '6AI32.PV'
FROM [tbIndDate],[tb6FI1%PV],[tb6FI34%PV],[tb6AI32%PV]
WHERE [tbIndDate].[TStamp]=[tb6FI1%PV].[TStamp]
AND [tbIndDate].[TStamp]=[tb6FI34%PV].[TStamp]
AND [tbIndDate].[TStamp]=[tb6AI32%PV].[TStamp];
This returns:
Timestamp 6FI1.PV 6FI34.PV 6AI32.PV
1/1/2013 0:00:00 42.4679 1.499 0.8439
1/1/2013 0:05:00 40.3628 1.5048 0.8435
1/1/2013 0:10:00 38.2652 1.5028 0.8436
1/1/2013 0:15:00 37.8582 1.5029 0.8436
Yay! :)
I've also gotten some averaging and time-interval queries working.
However, because tag data is not all available for all dates, I'd like to create the option to list all dates in the master index even if some of the tag tables do not have matching data.
A SELECT query with a left outer join, in other words. Everyone knows that. The data might look like:
Timestamp 6FI1.PV 6FI34.PV 6AI32.PV
1/1/2013 0:00:00 42.4679 1.499 NULL
1/1/2013 0:05:00 40.3628 1.5048 NULL
1/1/2013 0:10:00 38.2652 NULL NULL
1/1/2013 0:15:00 37.8582 NULL 0.8436
Trouble is, none of the SQL I've tried has worked. Here's one that didn't go:
SELECT [tbIndDate].[Val] AS 'Timestamp',[tb6FI1%PV].[Val] AS '6FI1.PV',
[tb6FI34%PV].[Val] AS '6FI34.PV'
FROM [tbIndDate],[tb6FI1%PV],[tb6FI34%PV]
LEFT JOIN [tbIndDate] ON [tbIndDate].[TStamp]=[tb6FI1%PV].[TStamp]
LEFT JOIN [tbIndDate] ON [tbIndDate].[TStamp]=[tb6FI34%PV].[TStamp];
The error was "SQL error or missing database, ambiguous column name: tbIndDate.Val"
I've tried copying the syntax from several examples, but none are exactly the same and my attempts fail.
Am I doing the aliases wrong? The square brackets to accommodate special characters in table names? I'm a complete SQL beginner, so don't hold back with the advice.
It looks like the problem is that you're trying to join [tbIndDate] several times. Try this:
SELECT [tbIndDate].[Val] AS 'Timestamp',[tb6FI1%PV].[Val] AS '6FI1.PV',
[tb6FI34%PV].[Val] AS '6FI34.PV'
FROM [tbIndDate]
LEFT JOIN [tb6FI1%PV] ON [tbIndDate].[TStamp]=[tb6FI1%PV].[TStamp]
LEFT JOIN [tb6FI34%PV] ON [tbIndDate].[TStamp]=[tb6FI34%PV].[TStamp];
Suppose I have a large table to store ranges of integers. I can do this with two fields:
start|end
10 |210 (represents 10 to 210)
5 |55 (represents 5 to 55)
(quick to select by end column), or:
start|length
10 | 200 (represents 10 to 210)
5 | 50 (represents 5 to 55)
(quick to select by length column).
What if sometimes I need to select by end, and sometimes by length, and both queries need to be fast? I could store both:
start|length|end
10 | 200 |210
5 | 50 |55
But then this is not normalised and everyone has to remember to update both fields, and is just bad design.
I know I can select by start + length or end - start but for a very large table, isn't this extremely slow?
How can I query by calculated values quickly without storing redundant data - or should I just store the extra column?
Depending on the database type you are using, you might want to use a trigger to calculate the derived field. That way, they can never get out of synch.
This means that the field (length) could be re-calculated every time start or end changes.
I'd store the length, but I'd make sure the calculation was done in my insert and update sprocs so that as long as everyone uses your sprocs there is no more overhead for them.
Unfortunately neither of your target databases support computed columns. I would do the following:
First, determine whether you really have a performance problem. It is true that WHERE end - start = ? will perform more slowly than WHERE length = ?, but you don't define what a "really big table" is in your application, nor what the required performance is. No need to optimize away a problem that may not exist.
Determine whether you can support any latency in your searches. If so, you can add the calculated column to the table but dedicate a separate task, running every five minutes, each hour, or whatever, to fill in the values.
In PostgreSQL you could consider a materialized view, which I believe are supported at the engine level. (See Catcall's comment, below).
Finally, if all else fails, consider using a trigger to maintain the calculated column.
I'm going to start work on a medium sized application, and i'm planning it's db design.
One thing that I'm not sure about is this.
I will have many tables which will need internationalization, such as: "membership_options, gender_options, language_options etc"
Each of these tables will share common i18n fields, like:
"title, alternative_title, short_description, description"
In your opinion which is the best way to do it?
Have an i18n table with the same fields for each of the tables that will need them?
or do something like:
Membership table Gender table
---------------- --------------
id | created_at id | created_at
1 - 22.03.2001 1 - 14.08.2002
2 - 22.03.2001 2 - 14.08.2002
General translation table
-------------------------
record_id | table_name | string_name | alternative_title| .... |id_language
1 - membership regular null 1 (english)
1 - membership normale null 2 (italian)
1 - gender man null 1(english)
1 -gender uomo null 2(italian)
This would avoid me repeating something like:
membership_translation table
-----------------------------
membership_id | name | alternative_title | id_lang
1 regular null 1
1 normale null 2
gender_translation table
-----------------------------
gender_id | name | alternative_title | id_lang
1 man null 1
1 uomo null 2
and so on, so i would probably reduce the number of db tables, but i'm not sure about performance.I'm not much of a DB designer, so please let me know.
The most common way I've seen this done is with two tables, membership and membership_ml, with one storing the base values and the ml table storing the localized strings. This is similar to your second option. Most of the systems I see like this are made that way because they weren't designed with internationalization in mind from the get go, so the extra _ml tables were "tacked on" later.
What I think is a better option is similar to your first option, but a little bit different. You would have a central table for storing all the translations, but instead of putting the table name and field name in there, you would use tokens and a central "Content" table to store all the translations. That way you can enforce some kind of RI between the tokens in the base table and the translations in the Content table if you want as well.
I actually asked a question about this very thing a while back, so you can have a look at that for some more info (rather than repasting the schema examples here).
I also think the best solution is to keep translations on different table. This approach use Open Cart which is open source and you can take a look the way it deals with the problem. Another source of information is here "http://www.gsdesign.ro/blog/multilanguage-database-design-approach/" especially on the comments sections
I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric,01,02,03,etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending, that stores what those values actually represent. 01=Not enough info, 02=Not enough time, 03=Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. What my limited skills have so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find ( ',' |fld| ',', ',02,' ) > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and start with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only on row insert/update, not on _every single select`, amortising that cost efficiently.
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider to use a different approach using a support table like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query this tables using JOIN or subqueries.
If you need retro-compatibility on your software you can add a view to reach this goal.
I have a tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code
One procedure converts the table field (eventTypeIds) value like "3,4,15,6" into a ViewState array, so I can use it any where in code.
This procedure does the opposite it collects any options your checked and converts it in
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs. Then, extract the appropriate part of the string
While I agree with DBA perspective not to store multiple values in a single field it is doable, as bellow, practical for application logic and some performance issues. Let say you have 10000 user groups, each having average 1000 members. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,') each number being a memberID, all separated with a comma. Add also a comma at the start and end of the data. Then your select would use like '%,someID,%'.
If you can not change your data ('01,02,03') or similar, let say you want rows containing 01 you still can use " select ... LIKE '01,%' OR '%,01' OR '%,01,%' " which will insure it match if at start, end or inside, while avoiding similar number (ie:101).