SSIS: Excel source - is it possible to track changes made in Excel?


Let's say I have the following columns in Excel: Level, Group, Code, Name, Date, Additional info, with the following values:
1, A, 1234, John, 2019-09-01, info 1
1, A, 1234, John, 2019-09-01, info 2
My current import logic is as follows: if there is no record in the database with a given code and level, a new record is inserted; if the code already exists in the database, the record is updated. But as there is no unique identifier in Excel, it is quite hard to update the correct record. What are the common approaches in such cases?
Let's say that, in the example above, Group or Date is changed for one record. How do I implement logic that updates the correct record in the database?

You aren't going to be able to have a distinct dataset if there isn't a unique primary key. Without such a key you cannot update exactly one row; an update will hit every row that matches, which may be one or several similar rows. In the current state, it is not possible to accurately track changes.
If you did have a unique primary key, the simplest solution would be to append datetime columns to track changes and add a new row whenever any value changes. Your dataset would look like:
1, A, 1234, John, 2019-09-01, Info1, DateCreated, DateChanged
1, A, 1234, John, 2019-09-01, Info2, DateCreated, DateChanged2
1, A, 1234, John, 2019-09-01, Info3, DateCreated, DateChanged3
It is important to remember that this only works with a static primary key; fields that are commonly used for composite keys may not work. Users could change their name or correct an incorrectly entered birth date, which would change a composite key built from those fields.
In SSIS this would be implemented using two Lookup transformations:
In the first Lookup, compare the primary key. If the primary key does not exist, use a Derived Column transformation to set both DateCreated and DateChanged to GETDATE().
If the primary key does exist, run a second Lookup that compares all columns in the record. If they are all identical, there were no changes to the record and no update needs to be sent to the database.
If there is a difference, use a Derived Column transformation to set only the DateChanged column to GETDATE() and add the record as a new row.
These three branches should account for every potential state: new record, existing record with no changes, existing record with changes.
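The three branches can be sketched outside SSIS as well. The following is only an illustration, assuming a stable key is available: it uses Python's sqlite3 module in place of the SSIS data flow, and the table name history, its column names, and the load_row helper are all hypothetical.

```python
import sqlite3

def make_db():
    """Create an in-memory history table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history (record_id INTEGER, level INTEGER, grp TEXT,"
        " code TEXT, name TEXT, date TEXT, info TEXT,"
        " date_created TEXT, date_changed TEXT)"
    )
    return conn

def load_row(conn, record_id, values, now):
    """Apply one source row and report which branch was taken."""
    # First lookup: find the latest stored version for this key, if any.
    latest = conn.execute(
        "SELECT level, grp, code, name, date, info, date_created"
        " FROM history WHERE record_id = ? ORDER BY rowid DESC LIMIT 1",
        (record_id,),
    ).fetchone()
    insert = "INSERT INTO history VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)"
    if latest is None:
        # New record: both timestamps are set to the load time.
        conn.execute(insert, (record_id, *values, now, now))
        return "new"
    # Second lookup: compare every data column with the stored version.
    if latest[:6] == tuple(values):
        return "unchanged"
    # Changed record: keep the original creation time, stamp the change.
    conn.execute(insert, (record_id, *values, latest[6], now))
    return "changed"
```

Calling load_row three times for the same record_id, first with new values, then with identical values, then with a different Group, returns "new", "unchanged" and "changed" in turn, and leaves two rows in the history table.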

Since you don't have a unique identifier, you should try to create your own. In a similar case I would use a Derived Column to concatenate all values that I am sure will not change. For example:
Level + "|" + Code + "|" + Name + "|" + Additional info
Then I would compare the source data and the existing data based on that derived column. You can either store the derived column in the destination database or use a staging table to hold these values.
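As a rough illustration of the same idea outside SSIS, the sketch below builds the key in Python. The derived_key helper and the existing dictionary (standing in for a staging table) are hypothetical, and the choice of stable columns is an assumption you would have to verify against your own data.

```python
def derived_key(row, stable_columns=("Level", "Code", "Name", "Additional info")):
    """Join the columns assumed never to change into one comparison key."""
    return "|".join(str(row[column]).strip() for column in stable_columns)

source_rows = [
    {"Level": 1, "Group": "A", "Code": 1234, "Name": "John",
     "date": "2019-09-01", "Additional info": "info 1"},
    {"Level": 1, "Group": "B", "Code": 1234, "Name": "John",
     "date": "2019-09-01", "Additional info": "info 2"},
]

# Existing rows keyed by the same derived value (stand-in for a staging table).
existing = {"1|1234|John|info 2": {"Group": "A"}}

# For each source row, record whether a row with the same key already exists.
matches = {derived_key(row): derived_key(row) in existing for row in source_rows}
```

Rows whose key is found in existing would be updated; the others would be inserted. The key only works as long as none of the concatenated columns is ever edited in Excel.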

Related

Delete an item (row) from table but keep the unused ID number for a future insertion

I have a database with an "inventory" table which contains products from a shop. Each product is identified using an ID number set in the "ID" column of the table.
I want to be able to delete a product from the table, but keeping the deleted product's id number for future product insertions into the database.
As a demonstration I inserted 4 items and named all of them "test"
And just as an example I named the "deleted" product "vacio" ("empty" in Spanish) to show the one that I deleted.
Now, if I want to add another product in the future, the id number 2 is unused and I want to add the product with that id number instead of 4 (following the given example).
The DELETE query is no good since it erases the id number as well, so it's a no-go.
I thought about finding the first row of the table that contains the value "vacio" and using an UPDATE query on all fields except id, but this doesn't feel "classy" and is not very efficient, as it would have to update values many times.
Is there some nice way of doing this?

I would not actually recommend reusing old identifiers. For one, this prevents you from using the auto_increment feature, which means that you need to handle the column manually for each and every insertion: this adds complexity and is not efficient. Also, it might cause integrity issues in your database if you have other tables referencing the product id.
But if you really want to go that way, I would go for the deletion option. If there are foreign keys referencing the column, make sure that they have the on delete cascade option enabled so data is properly purged from dependent tables when a product is deleted.
Then, you can fill the first available gap the next time you create a new product with the following query:
insert into products(id, categoria, producto)
select min(id) + 1, 'my new category', 'my new product'
from products p
where not exists (select 1 from products p1 where p1.id = p.id + 1)
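To see how the gap-filling query behaves, here is a small sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for the real database, and the add_product helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, categoria TEXT, producto TEXT)"
)
# Ids 1, 3 and 4 exist; id 2 was deleted earlier and is the gap to reuse.
conn.executemany("INSERT INTO products VALUES (?, 'test', 'test')", [(1,), (3,), (4,)])

FILL_GAP = """
INSERT INTO products (id, categoria, producto)
SELECT MIN(id) + 1, ?, ?
FROM products p
WHERE NOT EXISTS (SELECT 1 FROM products p1 WHERE p1.id = p.id + 1)
"""

def add_product(categoria, producto):
    """Insert a product at the first id that follows an existing id and is free."""
    conn.execute(FILL_GAP, (categoria, producto))

add_product("my new category", "first new product")   # takes id 2
add_product("my new category", "second new product")  # no gap left, takes id 5
ids = [row[0] for row in conn.execute("SELECT id FROM products ORDER BY id")]
```

Note that the query looks for the smallest existing id whose successor is missing, so it never finds a gap below the smallest id in the table.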

You could have a new column ESTADO where you handle if a record is active (1) or inactive (0). Then, to obtain only "undeleted" records you just have to filter by the new column. That way, you also prevent changing the product name to "vacio", which might be useful in the future.
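A minimal sketch of this soft-delete approach, again using Python's sqlite3 module as a stand-in for the real database. The inventory layout shown here is an assumption based on the question; only the ESTADO column comes from the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, producto TEXT,"
    " estado INTEGER NOT NULL DEFAULT 1)"  # 1 = active, 0 = inactive
)
conn.executemany("INSERT INTO inventory (producto) VALUES (?)", [("test",)] * 4)

# "Delete" product 2 by flagging it instead of removing the row.
conn.execute("UPDATE inventory SET estado = 0 WHERE id = ?", (2,))

# Only active products are returned; the flagged row keeps its id and name.
active_ids = [
    row[0] for row in conn.execute("SELECT id FROM inventory WHERE estado = 1 ORDER BY id")
]
```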

Is there a better way to update database by index instead of by its column name?

Above is a screenshot of my database,
Each row represents a task
ReviewedTimes represents how many times a task has been reviewed; it can be set to up to 9.
FirstReview, SecondReview ... TenthReview hold the DateTime at which the task was reviewed for the first, second ... tenth time.
My current approach is:
For a given task, read its ReviewedTimes column value, N
With a few blocks of if-else (or switch) statements, update a (N+1)thReview column.
For example, for a given task, its ReviewedTimes column reads 3, I will update FourthReview column with a new DateTime value.
Is there a more elegant way to achieve the same goal? I am thinking:
If I can somehow Enumerate FirstReview, SecondReview ... TenthReview as 1, 2 ... 10
Based on ReviewedTimes value, N, I can update column N + 1 respectively.

As I suggested in my comment above, Tasks has a one-to-many relationship with Reviews. To do what I suggested in SQL Server, you'll first need to drop all the review columns from the Tasks table:
ALTER TABLE [dbo].[Tasks]
DROP COLUMN [FirstReview], [SecondReview], etc...
Then create a new table called Reviews with a foreign key to the primary key of the Tasks table. This assumes you have an identity column in Tasks called TaskId:
CREATE TABLE [dbo].[Reviews] ([ReviewId] INT NOT NULL Identity(1, 1) PRIMARY KEY, [TaskId] INT NOT NULL References [dbo].[Tasks]([TaskId]), [TimeStamp] DateTime NOT NULL)
Now when someone goes to review a task, you would first check whether the review limit has been reached by using COUNT() on the Reviews table for the given TaskId. If it has not been reached, insert a new record into Reviews with the task's id as the TaskId value and CURRENT_TIMESTAMP as the TimeStamp value. If you are tracking the person who reviewed the task, you would add another column to Reviews with a foreign key to the user id, but that is just another suggestion.
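The count-then-insert step could look roughly like the sketch below. It uses Python's sqlite3 module rather than SQL Server, so the column types differ from the CREATE TABLE above; the Title column, the add_review helper, and the limit of 10 (taken from FirstReview ... TenthReview) are assumptions.

```python
import sqlite3

MAX_REVIEWS = 10  # assumed limit, matching FirstReview ... TenthReview

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE Tasks (TaskId INTEGER PRIMARY KEY, Title TEXT)")
conn.execute(
    "CREATE TABLE Reviews (ReviewId INTEGER PRIMARY KEY,"
    " TaskId INTEGER NOT NULL REFERENCES Tasks (TaskId),"
    " TimeStamp TEXT NOT NULL)"
)
conn.execute("INSERT INTO Tasks (TaskId, Title) VALUES (1, 'example task')")

def add_review(task_id):
    """Record a review unless the task already has MAX_REVIEWS reviews."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM Reviews WHERE TaskId = ?", (task_id,)
    ).fetchone()
    if count >= MAX_REVIEWS:
        return False
    conn.execute(
        "INSERT INTO Reviews (TaskId, TimeStamp) VALUES (?, CURRENT_TIMESTAMP)",
        (task_id,),
    )
    return True

# The first ten reviews are accepted, the eleventh is rejected.
results = [add_review(1) for _ in range(MAX_REVIEWS + 1)]
```

With this layout there is no need to map a counter to a column name: the Nth review is simply the Nth row for that TaskId when ordered by TimeStamp.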

What happens when you insert rows in a SERIAL column with existing values?

I had to import a large CSV file into a database, and one column must be a unique ID for a purchase. I set the type of the column to SERIAL (yes, I know it's not actually a type) but since I already had some data in there with their own "random" purchase IDs I'm not sure about what will happen when I insert new rows.
Will the purchase ID take the values that are not already in use? Will it start after the biggest existing ID? Will it start at 1 and not care about if a value is already in use?

The underlying SEQUENCE does not care about values you inserted explicitly (providing values for the serial column overrules the default). You have to adjust the sequence manually to avoid duplicate key errors:
SELECT setval(pg_get_serial_sequence('tbl', 'id'), max(id)) FROM tbl;
'tbl' and 'id' being the names of table and column respectively.
Related:
How to reset postgres' primary key sequence when it falls out of sync?
How to copy both structure and contents of PostgreSQL table, but duplicate sequences?

I need help counting char occurrences in a row with SQL (using Firebird server)

I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields, one for each day of the month, in each row.
These fields hold 1 or 2 characters representing the car's status on that date. I need a query that returns the number of occurrences of each possible value. The field for any day can have the values D, I, R, TA, RZ, BV and LR.
For each row, I need to count how many times each value appears in that row.
For example, how many I, how many D, and so on, for every row in the table.
What would be the best approach here? Also, maybe there is a better way than having a field in the table for each day, because that obviously makes over 30 fields.

There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be (with :month_start and :month_end as parameters for the month's boundaries):
select status, count(*)
from CarStatuses
where date >= :month_start and date < :month_end
group by status;
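A small runnable sketch of that normalized layout, using Python's sqlite3 module in place of Firebird. The sample rows and the status_counts helper are made up for illustration; dates are stored as ISO strings, so string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarStatuses (CarId INTEGER, Date TEXT, Status TEXT)")
conn.executemany(
    "INSERT INTO CarStatuses VALUES (?, ?, ?)",
    [
        (1, "2019-09-01", "D"),
        (1, "2019-09-02", "D"),
        (1, "2019-09-03", "I"),
        (2, "2019-09-01", "TA"),
        (2, "2019-10-01", "D"),  # outside September, must not be counted
    ],
)

def status_counts(month_start, month_end):
    """Count each status between month_start (inclusive) and month_end (exclusive)."""
    rows = conn.execute(
        "SELECT Status, COUNT(*) FROM CarStatuses"
        " WHERE Date >= ? AND Date < ? GROUP BY Status",
        (month_start, month_end),
    )
    return dict(rows)

september = status_counts("2019-09-01", "2019-10-01")
```

Adding CarId to both the SELECT list and the GROUP BY would give the counts per car rather than for the whole fleet.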
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
       from t
      ) union all
      (select status_02
       from t
      ) union all
      . . .
      (select status_31
       from t
      )
     ) s
group by status;
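Writing 31 union branches by hand is tedious, so the query text can be generated. The sketch below does this with Python's sqlite3 module on a three-column stand-in table; the column names status_01 to status_03 and the sample data are assumptions. It also carries the id column through, which is a variation on the query above that gives counts per row, as the question asks, and it skips empty days.

```python
import sqlite3

# The real table would list all 31 day columns here.
DAY_COLUMNS = ["status_01", "status_02", "status_03"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY,"
    " status_01 TEXT, status_02 TEXT, status_03 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "D", "D", "I"), (2, "R", None, "TA")],
)

# One SELECT per day column, stacked with UNION ALL (column names come from
# the fixed list above, never from user input).
branches = " UNION ALL ".join(
    f"SELECT id, {column} AS status FROM t" for column in DAY_COLUMNS
)
query = (
    f"SELECT id, status, COUNT(*) FROM ({branches}) s"
    " WHERE status IS NOT NULL GROUP BY id, status ORDER BY id, status"
)
per_car_counts = conn.execute(query).fetchall()
```

For the two sample rows, per_car_counts holds one tuple per car and status: two D and one I for car 1, one R and one TA for car 2.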

You seem to need to start with the most basic tutorials on relational databases and SQL design. Classic works such as Martin Gruber's "Understanding SQL" may help, as may others. At the moment you are missing the basics.
A few hints.
Documents that you print for the user or receive from the user do not represent your internal data structures. They are created and parsed for one purpose: the machine-to-human interface. Inside, your program should structure the data for ease of storing and processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from status "R" you can transition to either status "D" or status "BV", but not to any other. In other words, you had better draft the "directed graph" of possible status transitions. You would keep it in extra columns of that dictionary table or in one more specialized helper table: a dictionary of transitions for the dictionary of possible statuses.
Your paper form combines both totals and per-day details in the same row. That is easy for a human to look at, but for a computer it in a sense violates the single responsibility principle. A row should be responsible either for a primary record or for a derived total. You had better have two tables: one for the primary day-by-day records and another for the per-month totals.
A bonus is that when you change values in the primary data table, you can have the server automatically recalculate the corresponding month totals. Read about SQL triggers.
Your triggers may also check whether the new state is a valid transition from the previous day's state, as described in the "business rules". They might also have to check that there are no gaps between days: if there is a record for "March 03" and a new record for "March 05" is inserted, then a record for "March 04" should exist, or the server would refuse to add the row. Well, maybe not; that depends on your business processes. The general idea is that the server should reject any data that it can know to be invalid.
Your per-date and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting duplicate rows. It also means the former should have a DATE-type column and the latter should have either month and year INTEGER-type columns or a DATE-type column whose day part is always "1"; you would want a CHECK CONSTRAINT for that.
If your company has a registry of cars (and it probably does; these do not look like cars driven in by random one-time customers), you have to introduce a dictionary table of cars: integer ID (PK), registration plate, engine factory number, body factory number, colour and whatever else.
The per-month totals table would not have a column for every status. It would instead have a separate row for every status! The structure would probably be: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be of integer type (some may be SmallInt or BigInt, but that is a minor nuance). All the columns together (except the count column) should constitute a UNIQUE CONSTRAINT or, even better, a "compound" primary key. Adding a dedicated PK column to the totals table seems redundant to me.
Consequently, your per-day and per-month tables would not hold literal (textual, immediate) data for status and car id. Instead they would hold integer IDs referencing the proper records in the cars dictionary and status dictionary tables. You would code that as a FOREIGN KEY.
Remember the rule of thumb: it is easy to add or delete a row in any table, but quite hard to add or delete a column.
With a column-oriented design like yours, what would happen if next year the boss introduced some more statuses? You would have to redesign the table, change the program in many places, and so on.
With the row-oriented design you would just have to add one row to the statuses dictionary and maybe a few rows to the transition rules dictionary, and the rest works without any change.
That way you would not

How to structure database, multiple foreign keys?

I feel like this may be a bit of a unique problem, but hopefully someone out there has come across a similar situation.
My application uses this database table:
DT table
The issue is with Field1 - 9.
Depending on how the user decides to set up their instance of the app there can be any number of fields used (from 0 - 9). The information for these are held in this Table:
Field Table
So for this example there are only two fields. When a record is created in the DT table, fields 1 and 2 will have data entered and all other field columns will be NULL. Obviously this isn't good practice: for one, if a field name were changed in the future, all previous data would stop making sense.
I've been trying to think of a way to structure it differently. All I can think of is that when a DT record is created, it would somehow hold foreign keys to the fields that were used, but it seems that it's not possible to have multiple foreign keys in one column.
Any help or suggestions would be greatly appreciated.

One way to normalize this would be to factor out the repeating fields to a separate table, where you would have one entry per field with DT_id as a foreign key to the DT table.
DT Table:
ID
Start
End
...
DT_field table:
ID
DT_id (foreign key)
Value
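A sketch of how this layout might look in practice, using Python's sqlite3 module. The Field_id column is an assumption added here so that each value can be tied back to its definition in the Field table; the field names and sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Field (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE DT (ID INTEGER PRIMARY KEY, "Start" TEXT, "End" TEXT);
CREATE TABLE DT_field (
    ID INTEGER PRIMARY KEY,
    DT_id INTEGER NOT NULL REFERENCES DT (ID),
    Field_id INTEGER NOT NULL REFERENCES Field (ID),  -- assumed extra column
    Value TEXT
);
INSERT INTO Field (ID, Name) VALUES (1, 'Reason'), (2, 'Operator');
INSERT INTO DT (ID, "Start", "End") VALUES (1, '2019-09-01 08:00', '2019-09-01 08:30');
INSERT INTO DT_field (DT_id, Field_id, Value) VALUES (1, 1, 'jam'), (1, 2, 'John');
""")

# All field values for one DT record, labelled with the current field names.
rows = conn.execute(
    "SELECT f.Name, v.Value FROM DT_field v"
    " JOIN Field f ON f.ID = v.Field_id"
    " WHERE v.DT_id = ? ORDER BY f.ID",
    (1,),
).fetchall()
```

A DT record that uses two fields gets two DT_field rows and no NULL columns, and renaming a field only touches one row in the Field table.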
