Use SSIS to split a single field value into multiple rows in a second table - sql

The situation: I am writing an SSIS package to migrate data from an old database to a new one. The old database has a Text column called comments that sometimes holds as much as 30MB of text. Most of these values are comment threads with timestamps. I would like to use the timestamps (with a regex or something similar) to split the data up and move it into a second, child table called comments. Each child row also needs to reference the PK of the original record. Thanks!
So
Table1 [Profile]
PK | Comments
1 | '<timestamp> blah <timestamp> blah blah'
will turn into
Table1 [Profile]
PK | Comments
1 | ''
Table2 [Comments]
PK | FK | Comment
1 | 1 | '<timestamp> blah'
2 | 1 | '<timestamp> blah blah'

As wp78de suggested, I resolved this by creating a script task and modifying the output as the data is copied.
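For reference, a sketch of the target schema implied by the example above (column names are taken from the question; the identity column and foreign key details are assumptions):

-- Parent table keeps the profile row; the bulk comment text is moved out
CREATE TABLE Profile (
    PK       int           NOT NULL PRIMARY KEY,
    Comments nvarchar(max) NULL
);

-- Child table: one row per timestamped comment, referencing the parent PK
CREATE TABLE Comments (
    PK      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    FK      int NOT NULL REFERENCES Profile (PK),
    Comment nvarchar(max) NULL
);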

Related

Sequential update statements

When using multiple SETs on a single update query like
update table set col1=value1,col2=col1
is there an order of execution that decides the outcome when the same column appears on both the left and the right of an equals sign? From what I have tested so far, it seems that when a column is used on the right of an equals sign as a data source, its value from BEFORE the UPDATE is used, even if the same column is assigned a new value elsewhere in the same statement.
I believe that SQL Server always uses the old values when performing an UPDATE. This is best explained with some sample data for your table:
col1 | col2
1 | 3
2 | 8
3 | 10
update table set col1=value1,col2=col1
At the end of this UPDATE, the table should look like this:
col1 | col2
value1 | 1
value1 | 2
value1 | 3
This behavior for UPDATE is part of the ANSI-92 SQL standard, as this SO question discusses:
SQL UPDATE read column values before setting
Here is another link which discusses this problem with an example:
http://dba.fyicenter.com/faq/sql_server/Using_Old_Values_to_Define_New_Values_in_UPDATE_Statements.html
You can assume that in general SQL Server puts some sort of lock on the table during an UPDATE, and uses a snapshot of the old values throughout the entire UPDATE statement.
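A quick way to see this behaviour is the classic column swap, which works in SQL Server precisely because the right-hand side reads the pre-update values. A minimal sketch (the temp table is just for illustration):

-- Both assignments read the values as they were before the UPDATE started,
-- so this swaps col1 and col2 cleanly.
CREATE TABLE #swap_demo (col1 int, col2 int);
INSERT INTO #swap_demo (col1, col2) VALUES (1, 3), (2, 8), (3, 10);

UPDATE #swap_demo
SET col1 = col2,
    col2 = col1;   -- col1 here is still the old value, not the one just assigned

SELECT * FROM #swap_demo;  -- rows become (3, 1), (8, 2), (10, 3)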

What is the best way to change the type of a column in a SQL Server database, if there is data in said column?

If I have the following table:
| name | value |
------------------
| A | 1 |
| B | NULL |
At the moment, name is of type varchar(10) and value is of type bit.
However, I want to change this table so that value is an nvarchar(3), and I don't want to lose any of the information during the change. So in the end I want a table that looks like this:
| name | value |
------------------
| A | Yes |
| B | No |
What is the best way to convert this column from one type to another, and also convert all of the data in it according to a pre-determined translation?
NOTE: I am aware that if I were converting, say, a varchar(50) to a varchar(200), or an int to a bigint, then I could just alter the table. But I need a similar procedure for converting a bit to an nvarchar, which will not work in this manner.
The best option is to ALTER the column from bit to varchar and then run an UPDATE to change 1 to 'Yes' and 0 or NULL to 'No'.
This way you don't have to create a new column and then rename it later.
Alex K's comment on my question was the best. Simplest and safest: add a new column, update it with the transform, drop the existing column, and rename the new column.
Transforming each item with a simple:
UPDATE [Table]
SET temp_col = CASE
    WHEN value = 1
    THEN 'Yes'
    ELSE 'No'
END
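For completeness, a sketch of the full add/update/drop/rename sequence might look like this (the table name Example is an assumption; run each batch separately):

ALTER TABLE Example ADD temp_col nvarchar(3) NULL;
GO

UPDATE Example
SET temp_col = CASE WHEN value = 1 THEN 'Yes' ELSE 'No' END;
GO

ALTER TABLE Example DROP COLUMN value;
GO

EXEC sp_rename 'Example.temp_col', 'value', 'COLUMN';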
You should be able to change the data type from a bit to an nvarchar(3) without issue. The values will just turn from a bit 1 to a string "1". After that you can run some SQL to update the "1" to "Yes" and "0" to "No".
I don't have SQL Server 2008 locally, but I did try this on 2012. Create a small table and test it first, and take a backup of your data to be safe.

Add dynamic columns to table after appending word to column name and checking for existing column?

I am using SQL Server, and have been working on this for a while at work, but I am running into a lot of issues.
I started with a table “Logs”
| ChangeID | UserID | LogDate | Status | Fields
| 123 | 001 | 7-12-12 | Open | (raw data)
| 456 | 001 | 7-9-14 | Complete | (raw data)
| 789 | 002 | 5-8-15 | Open | (raw data)
The column “Fields” contains data from a form in JSON format. Basically, it contains a field name, a before value and an after value.
For every row in Fields, I am able to parse the JSON in order to get a temporary table #fieldTable. So for example, one row of raw data in the Fields column would produce the following table:
|Field |Before |After
|User |ZZZ |YYY
|requestDue |7-2-13 |7-5-14
|Assigned |No |Yes
There can be any number of values for Field, and the names of the fields are not known beforehand.
What I need is for there to be a final table which combines all of the temporary tables generated with the field values as new columns, like this:
| ChangeID | UserID | LogDate | Status | Fields | UserBefore | UserAfter | requestDueBefore | requestDueAfter | … |
where, if the same field name appears in two different rows of the JSON (and consequently in the tables generated from it), a new column is not added; instead the existing columns are just populated. So, for example, if the row with ChangeID 123 had the raw data
[{"field":"reqId","before":"000","after":"111"},{"field":"affected","before":"no","after":"yes"},{"field":"application","before":"xxx","after":"yyy"}]
and the row with ChangeID 789 had the raw data in its Fields column as
[{"field":"attachments","before":"null","after":"zzzzzzz"},{"field":"affected","before":"no","after":"yes"}]
then, because the field "affected" from ChangeID 123 already added the columns affectedBefore and affectedAfter to the final table, no new columns are added when this field is seen again for ChangeID 789.
If there is no data for some column of a particular row, it should just be null.
The way I thought to do this was to first try to dynamically pivot the temporary tables when they are generated so that I get the following result for the before results
|User |requestDue |Assigned
|ZZZ |7-2-13 |No
and another for the after results
|User |requestDue |Assigned
|YYY |7-5-14 |Yes
by using the following code:
declare @cols as nvarchar(max),
@query as nvarchar(max)
select @cols = stuff((select ',' + QUOTENAME(field)
from #fieldTable
group by field--, id
--order by id
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')
set @query = N'SELECT ' + @cols + N' from
(
select before, field
from #fieldTable
) x
pivot
(
max(before)
for field in (' + @cols + N')
) p '
exec sp_executesql @query;
(and then the same thing again with different variables for the result with the after values)
I think that there may be some way to dynamically concatenate the dynamic column name with a “before” or “after”, and then to somehow add these columns to a final table outside of the scope of the procedure. However, I’m not sure how, and I’m also not sure if this is even the best approach to the problem. I tried to use aliases, but I think you need to know the column name to make that work, and the same goes for altering a table to add more columns.
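One possible way to realise that idea is to fold the Before/After suffix into the column list itself before pivoting. A hedged sketch, reusing #fieldTable from above (the suffix, field_col, and val names and the CROSS APPLY trick are assumptions, not anything from the original code):

declare @cols nvarchar(max), @query nvarchar(max);

select @cols = stuff((select ',' + QUOTENAME(field + suffix)   -- e.g. [UserBefore], [UserAfter]
from #fieldTable
cross apply (values ('Before'), ('After')) s(suffix)
group by field, suffix
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'');

set @query = N'SELECT ' + @cols + N' from
(
select field + s.suffix as field_col, s.val
from #fieldTable
cross apply (values (''Before'', [before]), (''After'', [after])) s(suffix, val)
) x
pivot
(
max(val)
for field_col in (' + @cols + N')
) p ';

exec sp_executesql @query;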
Also, I have seen that for many similar issues people were advised to use OPENROWSET, which I am unable to use.
This answer belongs better in a comment as it doesn't directly address your question - however, I don't have reputation yet to directly comment.
For what you are describing, you might consider a SQLXML data field. SQLXML allows you to store documents of arbitrary schema, and SQL Server natively supports queries against it.
It would allow you to avoid what I think you are already finding to be considerable complexity in trying to dynamically create schemas based on models that are not known in advance.
Hope that helps.
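For illustration, a minimal sketch of what querying such an xml column natively could look like (the table, column, and element names here are assumptions, not from the question):

-- Assumed table with an xml column holding the field changes
CREATE TABLE LogsXml (
    ChangeID int PRIMARY KEY,
    Fields   xml
);

-- Shred each <change> element into rows without knowing the field names up front
SELECT l.ChangeID,
       c.value('(field/text())[1]',  'nvarchar(100)') AS Field,
       c.value('(before/text())[1]', 'nvarchar(max)') AS [Before],
       c.value('(after/text())[1]',  'nvarchar(max)') AS [After]
FROM LogsXml AS l
CROSS APPLY l.Fields.nodes('/changes/change') AS t(c);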

SQLite table with some rows missing a column

I have a table in a SQLite database that looks something like this, but with more columns and rows:
| Field1 | Field2 |
|---------|---------|
| A | 1 |
| B | 2 |
| C | |
What I need to do is run a SQL query like this:
SELECT * FROM <tablename> WHERE <conditions> ORDER BY Field2
The problem is, I'm getting the error: no such column: Field2
So now I've been asked to set all the missing values to 99. But when I run
UPDATE <tablename> SET Field2='99' WHERE Field2 IS NULL;
I get the same error. How do I fix this and update all those missing cells?
EDIT: I should also add that the missing values don't seem to be NULL: if I add a new column in my database GUI browser, all of that column's cells show as [NULL], but the cells in this column don't.
This turned out to be caused by a very subtle problem in the table:
Several of the column names (the ones that were causing me problems) ended in a newline (\n). Removing the newline solved all my problems!
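A minimal sketch of how one might detect and fix such a column name in SQLite (assuming version 3.25+ for RENAME COLUMN; the table name mytable is just an example):

-- Inspect the real column names; a trailing newline shows up in the output
PRAGMA table_info(mytable);

-- Rename the broken column, quoting the old name with the embedded newline
ALTER TABLE mytable RENAME COLUMN "Field2
" TO "Field2";

-- Now the original statements work; handle empty strings as well as NULLs
UPDATE mytable SET Field2 = '99' WHERE Field2 IS NULL OR Field2 = '';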

Making PostgreSQL a little more error tolerant?

This is sort of a general question that has come up in several contexts; the example below is representative but not exhaustive. I am interested in any ways of learning to work with Postgres on imperfect (but close enough) data sources.
The specific case -- I am using Postgres with PostGIS for working with government data published in shapefiles and xml. Using the shp2pgsql module distributed with PostGIS (for example on this dataset) I often get schema like this:
Column | Type |
------------+-----------------------+-
gid | integer |
st_fips | character varying(7) |
sfips | character varying(5) |
county_fip | character varying(12) |
cfips | character varying(6) |
pl_fips | character varying(7) |
id | character varying(7) |
elevation | character varying(11) |
pop_1990 | integer |
population | character varying(12) |
name | character varying(32) |
st | character varying(12) |
state | character varying(16) |
warngenlev | character varying(13) |
warngentyp | character varying(13) |
watch_warn | character varying(14) |
zwatch_war | bigint |
prog_disc | bigint |
zprog_disc | bigint |
comboflag | bigint |
land_water | character varying(13) |
recnum | integer |
lon | numeric |
lat | numeric |
the_geom | geometry |
I know that at least 10 of those varchars -- the fips, elevation, population, etc., should be ints; but when trying to cast them as such I get errors. In general I think I could solve most of my problems by allowing Postgres to accept an empty string as a default value for a column -- say 0 or -1 for an int type -- when altering a column and changing the type. Is this possible?
If I create the table before importing with the type declarations generated from the original data source, I get better types than with shp2pgsql, and can iterate over the source entries feeding them to the database, discarding any failed inserts. The fundamental problem is that if I have 1% bad fields, evenly distributed over 25 columns, I will lose 25% of my data since a given insert will fail if any field is bad. I would love to be able to make a best-effort insert and fix any problems later, rather than lose that many rows.
Any input from people having dealt with similar problems is welcome -- I am not a MySQL guy trying to batter PostgreSQL into making all the same mistakes I am used to -- just dealing with data I don't have full control over.
Could you produce a SQL file from shp2pgsql and do some massaging of the data before executing it? If the data is in COPY format, it should be easy to parse and change "" to "\N" (insert as null) for columns.
Another possibility would be to use shp2pgsql to load the data into a staging table where all the fields are defined as just 'text' type, and then use an INSERT...SELECT statement to copy the data to your final location, with the possibility of massaging the data in the SELECT to convert blank strings to null etc.
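A minimal sketch of that staging-table approach, using NULLIF to turn blank strings into NULLs while converting (the table and column names are illustrative, not from the dataset):

-- Staging table: everything as text so the raw load never fails on types
CREATE TABLE places_staging (
    st_fips    text,
    elevation  text,
    population text
    -- ... remaining columns, all text
);

-- After loading the staging table (e.g. via shp2pgsql/COPY), convert while copying,
-- treating empty strings as NULL instead of failing the cast
INSERT INTO places (st_fips, elevation, population)
SELECT NULLIF(trim(st_fips), '')::int,
       NULLIF(trim(elevation), '')::int,
       NULLIF(trim(population), '')::int
FROM places_staging;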
I don't think there's a way to override the behaviour of how strings are converted to ints and so on: possibly you could create your own type or domain, and define an implicit cast that was more lenient... but this sounds pretty nasty, since the types are really just artifacts of how your data arrives in the system and not something you want to keep around after that.
You asked about fixing it up when changing the column type: you can do that too, for example:
steve@steve[local] =# create table test_table(id serial primary key, testvalue text not null);
NOTICE: CREATE TABLE will create implicit sequence "test_table_id_seq" for serial column "test_table.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_table_pkey" for table "test_table"
CREATE TABLE
steve@steve[local] =# insert into test_table(testvalue) values('1'),('0'),('');
INSERT 0 3
steve@steve[local] =# alter table test_table alter column testvalue type int using case testvalue when '' then 0 else testvalue::int end;
ALTER TABLE
steve@steve[local] =# select * from test_table;
id | testvalue
----+-----------
1 | 1
2 | 0
3 | 0
(3 rows)
This is almost equivalent to the "staging table" idea I suggested above, except that now the staging table is your final table. Altering a column type like this rewrites the entire table anyway, so using a staging table and reformatting multiple columns at once is likely to be more efficient.