I have a table T of records and fields. I want to create a new field and populate it with the result of a lookup into another table L. This means that I will use one or more fields in T as a foreign key. In SQL, I can UPDATE the newly created field in table T using a JOIN with table L. In contrast, MATLAB cannot update an existing table as part of a join; a join creates a whole new table, which is then used to replace the original table T. That seems like a lot of data replication just to populate one field. Is this avoided under the hood? Is there a code design pattern or idiom that avoids it, but is still reasonably readable and doesn't compromise too much on code compactness?
While I asked this question in the context of join, I'd be interested more generally in strategies for avoiding table replication in all variants of MATLAB's join functions.
I'll describe an example of how, for each record in Table1, the ForeignKey is used to look up Data in Table2.
Table1
-----------------------------
SomeField NewField ForeignKey
--------- -------- ----------
someData1 dummy    a
someData2 dummy    b
someData3 dummy    a
someData4 dummy    b
someData5 dummy    a
Table2
--------
Key Data
--- ----
a   apple
b   banana
The following SQL code performs the lookup. The entry in the Data field is then concatenated with the content of field SomeField in Table1 and stored into field NewField.
UPDATE Table1 INNER JOIN Table2
ON Table1.ForeignKey = Table2.Key
SET Table1.NewField = Table1.SomeField & Table2.Data
The updated Table1 is:
Table1
------------------------------------
SomeField NewField        ForeignKey
--------- --------------- ----------
someData1 someData1apple  a
someData2 someData2banana b
someData3 someData3apple  a
someData4 someData4banana b
someData5 someData5apple  a
Interesting to note that Table1 INNER JOIN Table2 isn't actually materialized. It is only "virtually" created to enable the calculation with which to update Table1. In contrast, MATLAB's join creates the actual joined table, and a separate operation is needed to do the calculation.
The operation that you're doing must look something like this:
table1 = table({'someData1';'someData2';'someData3';'someData4';'someData5'},...
{'a';'b';'a';'b';'a'},'VariableNames',{'SomeField','ForeignKey'});
table2 = table({'a';'b'},{'apple';'banana'},'VariableNames',{'Key','Data'});
table3 = join(table1,table2,'LeftKeys','ForeignKey','RightKeys','Key')
This produces the following table:
SomeField     ForeignKey   Data
___________   __________   ________
'someData1'   'a'          'apple'
'someData2'   'b'          'banana'
'someData3'   'a'          'apple'
'someData4'   'b'          'banana'
'someData5'   'a'          'apple'
And then you are applying some sort of operation on the columns SomeField and Data.
I think this join function has been added to make usage easy for those familiar with SQL but less familiar with MATLAB syntax.
If you are still worried about copying large amounts of data (as I mentioned, this is not the case because of lazy copying), you can obtain the column Data above using the following set-based operations:
[~,index] = ismember(table1.ForeignKey,table2.Key);
data = table2.Data(index);
Here, data is a cell array identical to table3.Data. In any case, the index values created here are what SQL would create internally for this JOIN operation. If a value of table1.ForeignKey is not present in table2.Key, the corresponding index value is 0 (MATLAB indexing starts at 1). In that case you cannot use index directly for indexing; you need an additional level of indexing to keep only the valid rows:
[valid,index] = ismember(table1.ForeignKey,table2.Key);
data1 = table1.SomeField(valid);
data2 = table2.Data(index(valid));
Note that table/join uses ismember in this exact same way, then copies the left table (which causes the copy to reference the data in the input table due to lazy copying), and adds columns to it for the right table.
I’m no expert on tables, but I can give insight into how MATLAB operates with this type of data.
A MATLAB table object stores an array (e.g. a matrix or cell array) for each column.
Copying a matrix in MATLAB does not copy the data. MATLAB uses lazy copying. This means that the copy references the same data as the original (until the copy or original is changed, at which point a copy is made). This behavior is well documented (1), (2).
Thus, creating a new table using whole columns from other tables does cause arrays to be copied, but these copies don't incur any actual copying of the array contents; the new table references the data in the original tables.
But if any value in a column is changed, the whole column needs to be copied so that the other table does not see the same change. The reference is internal, temporary, and invisible to the user. For all intents and purposes, it looks as if the new table contains a copy of the original data.
However, if the join operation causes rows to be swapped or removed, all of this is a moot point. The data will be copied.
Related
I want to update numerical columns of one table based on matching string columns from another table, i.e.:
I have a table (let's say table1) with 100 records containing 5 string (or text) columns and 10 numerical columns. I have another table (table2) with the same structure (columns) and 20 records. A few of these records contain updated data for table1 (i.e., the numerical column values are updated for those records) and the rest are new (both text and numerical columns are new).
I want to update the numerical columns of records in table1 that have the same text columns, and insert the data from table2 into table1 where the text columns are also new.
I thought of taking an intersect of these two tables and then updating, but I couldn't figure out the logic for how to update the numerical columns.
Note: I don't have any primary or unique key columns.
Please help here.
Thanks in advance.
The simplest solution would be to use two separate queries, such as:
UPDATE b
SET    b.[NumericColumn] = a.[NumericColumn],
       etc...
FROM   [dbo].[SourceTable] a
JOIN   [dbo].[DestinationTable] b
  ON   a.[StringColumn1] = b.[StringColumn1]
 AND   a.[StringColumn2] = b.[StringColumn2] etc...

INSERT INTO [dbo].[DestinationTable] (
       [NumericColumn],
       [StringColumn1],
       [StringColumn2],
       etc...
)
SELECT a.[NumericColumn],
       a.[StringColumn1],
       a.[StringColumn2],
       etc...
FROM   [dbo].[SourceTable] a
LEFT JOIN [dbo].[DestinationTable] b
  ON   a.[StringColumn1] = b.[StringColumn1]
 AND   a.[StringColumn2] = b.[StringColumn2] etc...
WHERE  b.[NumericColumn] IS NULL
--assumes that [NumericColumn] is non-nullable.
--If there are no non-nullable columns then you
--will have to structure your query differently
This will be effective if you are working with a small dataset that does not change very frequently and you are not worried about high contention.
There are still a number of issues with this approach - most notably what happens if either the source or destination table is accessed and/or modified while the update statement is running. Some of these issues can be worked around in other ways, but so much depends on the context of how the tables are used that it is difficult to provide a more effective, generically applicable solution.
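If those contention concerns matter, the same matched-update / unmatched-insert logic could also be written as a single MERGE statement (SQL Server syntax). This is only a sketch reusing the placeholder table and column names from above; extend the column lists to match your schema:
MERGE [dbo].[DestinationTable] AS b
USING [dbo].[SourceTable] AS a
    ON a.[StringColumn1] = b.[StringColumn1]
   AND a.[StringColumn2] = b.[StringColumn2]
WHEN MATCHED THEN
    UPDATE SET [NumericColumn] = a.[NumericColumn]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([NumericColumn], [StringColumn1], [StringColumn2])
    VALUES (a.[NumericColumn], a.[StringColumn1], a.[StringColumn2]);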
I have a table with 32 million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.
For example, if the initial table is:
id initial_color
-- -------------
1  blue
2  red
3  yellow
I am modifying the table so that the result is:
id initial_color modified_color
-- ------------- --------------
1  blue          blue_green
2  red           red_orange
3  yellow        yellow_brown
I have code that will read the initial_color column and update the value.
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:
Copy the column and update the rows in the new column
Create an empty column and insert new values
I could do either option with one column at a time or with all five at once. The column types are either character varying or character.
The column types are either character varying or character.
Don't use character, that's a misunderstanding. varchar is ok, but I would suggest just text for arbitrary character data.
Any downsides of using data type "text" for storing strings?
Given that my table has 32 million rows and that I have to apply this
procedure on five of the 31 columns, what is the most efficient way to do this?
If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):
BEGIN;
LOCK TABLE tbl_org IN SHARE MODE; -- to prevent concurrent writes
CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);
ALTER TABLE tbl_new ADD COLUMN modified_color text
, ADD COLUMN modified_something text;
-- , etc
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
, myfunction(initial_color) AS modified_color -- etc
FROM tbl_org;
-- ORDER BY tbl_id; -- optionally order rows while being at it.
-- Add constraints and indexes like in the original table here
DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;
If you have depending objects, you need to do more.
Either way, be sure to add all five at once. If you update each in a separate query, you write another row version of every row each time, due to the MVCC model of Postgres.
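For the in-place UPDATE route, that means one statement setting all five columns. A sketch reusing myfunction from above (modified_something and initial_something are stand-ins for your other columns, and the new columns must already have been added with ALTER TABLE):
-- one UPDATE: every row is rewritten exactly once
UPDATE tbl_org
SET    modified_color     = myfunction(initial_color)
     , modified_something = myfunction(initial_something);
       -- , etc. for all five columns

-- five separate UPDATEs would rewrite every row five times:
-- UPDATE tbl_org SET modified_color     = myfunction(initial_color);
-- UPDATE tbl_org SET modified_something = myfunction(initial_something);
-- ...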
Related cases with more details, links and explanation:
Updating database rows without locking the table in PostgreSQL 9.2
Best way to populate a new column in a large table?
Optimizing bulk update performance in PostgreSQL
While creating a new table you might also order columns in an optimized fashion:
Calculating and saving space in PostgreSQL
Maybe I'm misreading the question, but as far as I know, you have 2 possibilities for creating a table with the extra columns:
CREATE TABLE
This would create a new table and filling could be done using
CREATE TABLE .. AS SELECT.. for filling with creation or
using a separate INSERT...SELECT... later on
Neither variant is what you seem to want to do, as you were after a solution without listing all the fields.
Also this would require all data (plus the new fields) to be copied.
ALTER TABLE...ADD ...
This creates the new columns. As I'm not aware of any way to reference existing column values in the ALTER statement, you will need an additional UPDATE ... SET ... to fill in the values.
So, I'm not seeing any way to realize a procedure that follows your choice 1.
Nevertheless, copying the (column) data just to overwrite it in a second step would be suboptimal in any case. Altering a table to add new columns does minimal I/O. So even if there were a way to execute your choice 1, choice 2 promises better performance by a large factor.
Thus, two statements, one ALTER TABLE adding all your new columns in one go and then one UPDATE providing the new values for those columns, will achieve what you want.
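A minimal sketch of those two statements, using the color example from the question (my_table is a made-up table name; the same pattern applies to the other four columns):
ALTER TABLE my_table
    ADD COLUMN modified_color text;      -- add all five new columns in this one statement

UPDATE my_table
SET    modified_color = CASE initial_color
                            WHEN 'blue'   THEN 'blue_green'
                            WHEN 'red'    THEN 'red_orange'
                            WHEN 'yellow' THEN 'yellow_brown'
                        END;             -- and fill all five columns in this one UPDATE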
Create a new column (modified_color); it will have a value of NULL or blank on all records.
Then run an update statement, assuming your table name is 'table':
update table
set modified_color = 'blue_green'
where initial_color = 'blue'
If I am correct, this can also work like this:
update table set modified_color = 'blue_green' where initial_color = 'blue';
update table set modified_color = 'red_orange' where initial_color = 'red';
update table set modified_color = 'yellow_brown' where initial_color = 'yellow';
Once you have done this you can do another update (assuming you have another column that I will call modified_color1):
update table set modified_color1 = modified_color;
I need to create a package to migrate a large amount of data from one database table into a different database table. The source table will continuously receive new data, roughly every 4-5 days, so I will run my package again and again.
I need to migrate all data from this table to another table, but I don't want to migrate the data that I have already migrated. What kind of transformation do I need to use, or what SQL command do I need to write, to do this?
The usual way this is done is by having "audit" timestamps on the source table and migrating only records updated or inserted after the last migration.
For example:
Table Sales
sale_id
sale_date
sale_amount
...............
dw_create_date
dw_update_date
Your source extraction could be something along the lines of..
select sales.sale_id,
sales.sale_date,
....
from sales
where dw_update_date > {last_migration_date}
last_migration_date is usually read from a config file or table.
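For example, a small control table could hold the high-water mark per job; the names here (etl_control, sales_load) are made up:
create table etl_control (
    job_name            varchar(50) primary key,
    last_migration_date date
);

-- read the cutoff before extracting
select last_migration_date
from   etl_control
where  job_name = 'sales_load';

-- after a successful load, advance the high-water mark
update etl_control
set    last_migration_date = {max_dw_update_date_of_this_run}
where  job_name = 'sales_load';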
Other approaches
There are a few other approaches that you could use, but all of these have bigger performance problems as your data size grows.
1) Do a (source - target) set difference, to get new or changed rows in the source:
select *
from source
minus
select * from target
You could do the same using a join between source and target.
select src.*
from source src
left join target tgt on (src.id = tgt.id)
where (src.column1 <> tgt.column1 or
       src.column2 <> tgt.column2
       ............
      )
Note that neither of these approaches takes care of deletes in the source. If you want the tables to be fully in sync, the only way to do that is to do a (source - target) to get insert/update changes and a (target - source) to get deleted rows, and apply both to the target.
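A sketch of the delete side, keyed on a hypothetical id column:
-- (target - source): ids that no longer exist in the source
delete from target
where id in (select id from target
             minus
             select id from source);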
2) Insert and ignore the primary key constraint error:
This has serious issues if the data can change in the source and you want the updates propagated to the target. You'd also be querying the entire source each time. It is usually better to use Merge/Upsert along with filtered source data, instead.
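A rough Merge/Upsert sketch built on the Sales example above; the sale_id key and the column list are assumptions, and the source is pre-filtered by the audit timestamp:
merge into target tgt
using (select sale_id, sale_date, sale_amount
       from   sales
       where  dw_update_date > {last_migration_date}) src
on    (tgt.sale_id = src.sale_id)
when matched then
    update set tgt.sale_date   = src.sale_date,
               tgt.sale_amount = src.sale_amount
when not matched then
    insert (sale_id, sale_date, sale_amount)
    values (src.sale_id, src.sale_date, src.sale_amount);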
I would assume both tables have some unique identifier, no?
Table A has:
1
2
3
4
You're moving that to Table B, but keeping the data in Table A at the same time, yes?
So you've run your job once. Now Table B has:
1
2
3
4
Table A gets updated. It now has:
1
2
3
4
5
6
7
You run your job again, but you only want to send over 5,6,7.
SELECT *
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID IS NULL
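To actually copy those missing rows over, the same anti-join can feed an INSERT (this sketch assumes both tables have identical column lists):
INSERT INTO TableB
SELECT TableA.*
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID IS NULL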
If you have some sample data it would help. Does this give you a good idea?
See joins: http://i.stack.imgur.com/1UKp7.png
In my Talend job I am copying data from Excel to a SQL table.
For this job to maintain the foreign key constraint, I had to do a lookup before copying the data.
The task goes like this.
I have to copy data into Table2 (id, keys, value).
My Excel sheet has data for the id and keys columns. Table1 has two columns: id and value.
For the value column's data, I want to look at Table1's corresponding entry with the id of the current record in Table2, and copy the data from Table1's value column into Table2's value column.
Excel (id 1 2 3, keys a b c)
Table_1 (id 1 2 3, value 123 456 789)
desired output: Table_2 (id 1 2 3, keys a b c, value 123 456 789)
current output: Table_2 (id 1 2 3, keys a b c, value null null null)
How do I properly map this?
You've set the job up exactly as it needs to be done, so it's not your job layout that's the problem here.
If you're expecting every record in your Excel document to have a matching record in your database table then you should use an inner join condition in your tMap join like so:
And this then allows you to have an output that grabs everything that isn't joining (which is your issue here):
This should show you everything in your source (not lookup) file that isn't matching. I suspect that everything is failing to match on your join condition (essentially WHERE ExcelDoc.id = Table.Id) even if it looks like it should. This could be down to a mismatch of datatypes as they are read into Talend (so the Java/Talend datatypes such as int/Integer or String rather than the DB types) or because one of your id columns has extraneous whitespace padding.
If you find that your id column in one of your sources does in fact have any padding then you should be able to convert it in a previous step before your join or even just make the transformation in the join:
The only other thing I'd recommend is that you name your flows between components (just click on them to change the name), especially when you are joining anything in a tMap.
I have a very large database table I would like to split up into multiple tables. I would like to make it so that when I run a DISTINCT, a table is created for every distinct name. The name of each table would be the data in one of the fields.
EX:
A --------- Data 1
A --------- Data 2
B --------- Data 3
B --------- Data 4
would result in 2 tables, one named A and another named B. The entire row of data would then be copied into the corresponding table.
select distinct [name] from [maintable]
-- make a table for each name
-- select the [name] rows from [maintable]
-- copy them into the table named after [name]
-- drop those rows from [maintable]
Any help would be great!
I would advise you against this.
One solution is to create indexes, so you can access the data quickly. If you have only a handful of names, though, this might not be particularly effective, because each index value would select a large share of the records.
Another solution is something called partitioning. The exact mechanism differs from database to database, but the underlying idea is the same. Different portions of the table (as defined by name in your case) would be stored in different places. When a query is looking only for values for a particular name, only that data gets read.
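For example, with made-up table and column names and PostgreSQL-style syntax (other databases have equivalent features):
-- Option 1: an index, so queries for one name only touch the matching rows
CREATE INDEX idx_maintable_name ON maintable (name);

-- Option 2: declarative list partitioning (PostgreSQL 10+), one partition per name
CREATE TABLE maintable_partitioned (
    name text NOT NULL,
    data text
) PARTITION BY LIST (name);

CREATE TABLE maintable_a PARTITION OF maintable_partitioned FOR VALUES IN ('A');
CREATE TABLE maintable_b PARTITION OF maintable_partitioned FOR VALUES IN ('B');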
Generally, it is bad design to have multiple tables with exactly the same data columns. Here are some reasons:
Adding a column, changing a type, or adding an index has to be done once per table instead of just once.
It is very hard to enforce a primary key constraint on a column across the tables -- you lose the primary key.
Queries that touch more than one name become much more complicated.
Insertions and updates are more complex, because you have to first identify the right table. This often results in overuse of dynamic SQL for otherwise basic operations.
Although there may be some simplifications (security comes to mind), most databases have other mechanisms that are superior to splitting the data into separate tables.
What you want is:
CREATE TABLE new_table
AS (SELECT ....);  -- the data that you want in this table
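Applied to the example in the question (assuming the grouping field is called name and the main table maintable), that would be one statement per distinct name:
CREATE TABLE a AS
SELECT *
FROM maintable
WHERE name = 'A';

CREATE TABLE b AS
SELECT *
FROM maintable
WHERE name = 'B';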