Find or Strip Invalid characters from Database

We are using a database where the front-end software has allowed the input of invalid characters. (I have no control over the software and cannot rewrite it.)
The characters include carriage returns, line breaks, �, and ¶; basically, anything that is not 0-9, a-z, or standard punctuation causes problems with the database and how we use the data.
I'm looking for a way to scan the entire database to identify these invalid characters and either display them as results or strip them out.
I had been looking at this site, wondering whether there was a way of searching for a certain range of characters, but I might be barking up the wrong tree.
I'm fairly new to SQL, so be gentle with me. Thanks.

The only way I could think to do this would be to write a stored procedure that uses the system tables to get a list of all fields in the database/schema in question. Have it exclude system tables (or include only user-defined ones), then dynamically write out SQL UPDATE statements based on the tables and columns found in the system table queries, using regular expressions or character removal like in this article.
The system tables in question are:
SELECT
table_name,column_name
FROM
information_schema.columns
Pseudocode would be:
Get list of tables we want to do this for
For each table in list
    get list of columns for table that have string data
    For each column in table
        generate update statement to strip unwanted characters
        --Consider writing out table, column, key, and before/after values to a history table, in case this has to be undone.
        --Consider a counter so I have an idea of what was updated
        execute update statement
    next column
next table
write out counter
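
The loop above can be sketched in runnable form. This is only an illustration under stated assumptions: it uses Python with SQLite (which has no information_schema, so sqlite_master and PRAGMA table_info stand in for it), it treats "valid" as printable ASCII, and the function name clean_database is made up for the example. On SQL Server the same idea would be written as dynamic SQL against information_schema.columns.

```python
import re
import sqlite3

# Assumption: "valid" means printable ASCII (space through tilde); adjust to taste.
INVALID = re.compile(r"[^\x20-\x7E]")


def clean_database(conn):
    """Strip invalid characters from all text columns; return a history list."""
    # (table, column, rowid, before, after): doubles as counter and undo log.
    history = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # Keep only columns whose declared type looks like string data.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')
                   if "CHAR" in row[2].upper() or "TEXT" in row[2].upper()]
        for column in columns:
            rows = conn.execute(
                f'SELECT rowid, "{column}" FROM "{table}"').fetchall()
            for rowid, before in rows:
                if not isinstance(before, str):
                    continue  # skip NULLs and non-text values
                after = INVALID.sub("", before)
                if after != before:
                    conn.execute(
                        f'UPDATE "{table}" SET "{column}" = ? WHERE rowid = ?',
                        (after, rowid))
                    history.append((table, column, rowid, before, after))
    conn.commit()
    return history
```

The length of the returned list plays the role of the counter, and the before values make it possible to undo a change. To only display the offending rows instead of stripping them, the UPDATE can simply be left out.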

Since you say
the data then moves to a second program that cannot handle these
characters and this causes the process to fail.
I'm wondering if you can leave the unreadable data where it is and create a new column for changed data that's only populated if/when the 2nd process fails. You'll still have to test every character of the data in the failed cell, but you wouldn't have to test every character of every row. After you determine the updated text to process, you can call the 2nd process again with the updated value.
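
A minimal sketch of that retry idea, in Python for illustration. The names process_with_fallback and second_process are hypothetical, and it assumes the second program signals a bad value by raising ValueError and that printable ASCII counts as valid.

```python
import re

INVALID = re.compile(r"[^\x20-\x7E]")  # assumption: printable ASCII is "valid"


def process_with_fallback(raw, second_process):
    """Try the raw value first; clean it only if the second process fails.

    Returns (result, cleaned), where cleaned is None when no cleanup was needed.
    """
    try:
        return second_process(raw), None
    except ValueError:
        # Only failed values pay the cost of a character-by-character check.
        cleaned = INVALID.sub("", raw)
        return second_process(cleaned), cleaned
```

The cleaned value is what would be written to the new column, so the original data stays untouched.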

Related

PDI /Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?

The best practice is a Stream lookup step: for each record in the main flow (VendorRating), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, its name, or first name + last name).
First "hurdle": once the path of the CSV file is defined, press the Get field button.
It takes the first line as the header to learn the field names and examines the first 100 records (customizable) to determine the field types.
If the names are not on the first line, uncheck Header row present, press the Get field button, and then change the names in the panel.
If there is more than one header row, or there are other complexities, use the Text file input step.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.

Given that:
there is at most one vendor rating per vendor, and
you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up the vendor in the table (i.e. the lookup table is the SQL table rather than the CSV file), and set a default value when there is no match. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, the filter redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged into the downstream flow.
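
The logic of that flow (not the PDI configuration itself) can be sketched in Python with SQLite. The function name sync_vendors, the single-column INSERT, and the in-memory CSV text are assumptions made for the example; the point is to show how the value from each CSV row is bound to the "?" placeholder and how the default value drives the no-match branch.

```python
import csv
import io
import sqlite3

NO_MATCH = "--- NO MATCH ---"


def sync_vendors(conn, csv_text):
    """Look up each CSV vendor in VendorRatings; insert the ones that are missing."""
    inserted = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        vendor = row["vendor"]  # keep only the one field we need
        # The "?" placeholder is bound to the value from the current CSV row.
        match = conn.execute(
            "SELECT vendor FROM VendorRatings WHERE vendor = ?", (vendor,)
        ).fetchone()
        found = match[0] if match else NO_MATCH  # default on no match
        if found == NO_MATCH:
            # Alternative branch: the vendor is unknown, so insert it.
            conn.execute(
                "INSERT INTO VendorRatings (vendor) VALUES (?)", (vendor,))
            inserted.append(vendor)
    conn.commit()
    return inserted
```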

MS Access Error updating memo field with long text

Searching this problem returns quite a few search hits, but many off-track answers, so I'm posting a concise description here, and answer below.
The problem afflicts Microsoft Access 2010, and some versions before. Access 2013 renames Memo type to Long Text. I don't know if it has the same problem.
The root problem is associated with running an UPDATE query on a table with a memo field, in certain particular circumstances. This might be an UPDATE query composed in the visual query window, or some VBA running SQL via DAO or ADO or similar. Or it could arise while updating via a form.
(The current post is concerned with this occurrence just within an Access database, though elsewhere you will find discussion of similar-sounding issues when Access is connected to an external database server.)
Instead of generating an immediate and obvious error alert, Access (or perhaps Jet) places the value #Error (which is not just the string "#Error"!) into the Memo field. This might easily go unnoticed until some later time, resulting in visible errors such as:
-- You use Compact and Repair. That seems to complete, but Access quietly adds a MSysCompactError table with a couple of rows. One error -1611 complains that Access was stopped and couldn't complete the operation. A second, more-specific-seeming error complains that it can't find field "Description". That appears to be an internal error that has no relevance.
-- You try to copy the table to another database. Access gives an error complaining that another user is using the table or has updated the table, and won't complete the operation.
-- Other operations on the rows that, unnoticed by you, happen to contain the #Error values fail.
Regardless, the root problem is whatever causes the #Error values to get placed into the Memo fields in the first place.
Many posters have noted that it occurs if the UPDATE attempts to put strings longer than about 2000 characters into the Memo field. That's a surprise, as Memo fields should be able to hold 1 gig of characters or more depending on version, even if the UI only allows 65k.
So why does the error occur when updating with more than 2000 characters?

The key factor that provokes this error is the Memo field having an index. Apparently, although the Memo type field can hold a bazillion characters, the index can't deal with more than about 2000.
Knowing that this is the precipitating factor, probably a number of workarounds come to mind. First, you can obviously just disable the index. This solution is easy to verify in a dummy database: Create two tables containing Memo fields, one with an index and the other without. Run update queries that put >2000 characters into each Memo and note the results.
But perhaps you think you need the index? Your use case might be satisfied if you create a second field that will contain an initial substring of the main Memo (shorter than 2000 characters), and index that instead. This could be used for sorting purposes for example. In most cases, where a memo contains narrative information, it's unlikely that the memo data values differ only after 2000 characters. Or perhaps you can devise a hash function and make a separate column of that.
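
As a rough illustration of the substring and hash ideas, here is a small Python sketch. The function name index_keys, the 255-character prefix length, and the choice of SHA-256 are all assumptions for the example; in Access itself the prefix could be produced with Left() in an update query.

```python
import hashlib

PREFIX_LENGTH = 255  # assumption: comfortably below the ~2000-character limit


def index_keys(memo):
    """Return (prefix, digest) values that fit in short, indexable columns."""
    prefix = memo[:PREFIX_LENGTH]  # good enough for sorting
    # The digest changes if any character of the full memo changes.
    digest = hashlib.sha256(memo.encode("utf-8")).hexdigest()
    return prefix, digest
```

The prefix is suitable for sorting, while the digest distinguishes memos that differ only after the prefix.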
What if you have a database that already contains these #Error values? Some advice floating around on the web, especially in relation to downstream problems like failure of Compact and Repair, suggests that your database is corrupt and should be abandoned. I'm not so sure. If you can delete the #Error-afflicted rows, then delete the index, and then recreate the deleted rows, you may be back in business. Compact and Repair should run properly at that point, giving some confidence that you fixed the offending part. (Make backups along the way, obviously.)

Workaround solution
Create two macros (Macro1 and Macro2).
Macro1
Get all the necessary information from the open form that includes this long text, then close the form.
Macro2
Insert all the needed actions (starting with the update query that raises the error).
Create a form (Form_on_error) containing only a button that runs Macro2.
Finally, add the following at the end of Macro1:
On Error
Go to: Macro Name
Macro Name: On_Error_2590
RunMacro Macro2
Submacro On_Error_2590
OpenForm (Form_on_error)
End Submacro
...and it works!
So the user has to click the button on Form_on_error only when the update query raises the error.

Full text search with special characters

I have a table with values such as "F-10" or "Jim-beam". Is there a way for me to get these results if a user had searched say "F10" or "Jimbeam"? Basically, the user may not know there is a dash in the entries but I want the search to be forgiving enough to find it.
Right now I'm trying to use:
SELECT *
FROM table
WHERE
CONTAINS(table.*, '"F10*" OR "Jimbeam*"')

You can collect the values entered by the user into an array,
then add a LIKE '%value1%' OR LIKE '%value2%' condition for each value in the array.
That could be a solution.

You could try to replace the values in the database with values that do not contain any special characters. I used the function described in this answer before: https://stackoverflow.com/a/1008566/894974
So with that function installed, your where-clause would become:
where dbo.RemoveNonAlphaCharacters([columnToSearch]) like '%' + @searchString + '%'
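
To show the same idea in something runnable, here is a sketch in Python with SQLite, where a Python function is registered as a SQL function. The names remove_non_alphanumeric, search, and the products table are hypothetical stand-ins, and unlike the linked function this one keeps digits so that "F10" still matches "F-10".

```python
import re
import sqlite3


def remove_non_alphanumeric(value):
    """Drop everything except letters and digits (hypothetical stand-in)."""
    return re.sub(r"[^A-Za-z0-9]", "", value or "")


def search(conn, term):
    """Find rows whose stripped name contains the stripped search term."""
    # Make the Python function callable from SQL under the same name.
    conn.create_function("remove_non_alphanumeric", 1, remove_non_alphanumeric)
    pattern = "%" + remove_non_alphanumeric(term) + "%"
    rows = conn.execute(
        "SELECT name FROM products "
        "WHERE remove_non_alphanumeric(name) LIKE ? ORDER BY name",
        (pattern,))
    return [row[0] for row in rows]
```

Because the function runs on every row, the database cannot use an index for this search, so it is best suited to small tables.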

You can create a separate column where dashes are removed from these words, then perform your full text searches against that column.
For example, your table could look like this (or the new column could be part of a new table):
Id   Text        TextForFullTextSearch
-----------------------------------------
1    Jim-beam    Jimbeam
2    F-10        F10
3    blah blah   blah blah
If you need to support searches on both "Jimbeam*" and "Jim-beam*" then you could perform the full text search against both the old and new columns.
This does require you to store the text twice so there will be more gears in your process. The benefits will be in the search accuracy and performance (LIKE will be much slower), so you'll have to weigh those benefits against the increased complexity.
Some ideas for populating the new column as data is inserted and updated:
Handle this in your data layer, i.e. all insert and update statements should include both Text and TextForFullTextSearch.
Add an insert/update trigger to the table that, whenever Text is inserted or updated, simultaneously updates TextForFullTextSearch.
Create an automated job that continually polls the table for inserts/updates, then updates TextForFullTextSearch accordingly.
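
The trigger option could look roughly like the sketch below. It is written for SQLite via Python so that it can be run as-is; the table name items and the columns label and label_search are made up and stand in for Text and TextForFullTextSearch. SQL Server trigger syntax differs, and a real setup would point the full-text index at the second column.

```python
import sqlite3

SCHEMA = """
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    label TEXT,
    label_search TEXT
);
-- Keep the search column in step with label on insert.
CREATE TRIGGER items_insert AFTER INSERT ON items
BEGIN
    UPDATE items SET label_search = replace(NEW.label, '-', '')
    WHERE id = NEW.id;
END;
-- Fires only when label itself is updated.
CREATE TRIGGER items_update AFTER UPDATE OF label ON items
BEGIN
    UPDATE items SET label_search = replace(NEW.label, '-', '')
    WHERE id = NEW.id;
END;
"""


def make_db():
    """Create an in-memory database with the table and both triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

With this in place, application code only ever writes label, and searches for "F10" run against label_search.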

Changing the length of Text fields in an Access linked table

I am exporting a file from a system as .csv. My aim is to link to this file as a table (which matches the output field for field) and then run the queries and export.
The problem I am having is that, upon import, all the fields are 255 bytes wide rather than what they need to be.
Here's what I've tried so far:
I've looked at ALTER TABLE but I cannot run multiple ALTER TABLE statements in one macro.
I've also tried appending the table into another table with the correct structure but it seems to overwrite the structure.
I've also tried using the Left function with the appropriate field length, but when I try to export, I pretty much just see 5 bytes per column.
What I would like is a suggestion as to what is the best path to take given my situation. I am not able to amend the initial .csv export, and I would like to avoid VBA if possible, as I am not at all familiar with it.

You don't really need to worry about the size of Text fields in an Access linked table that is connected to a CSV file. Access simply assigns each Text field the largest possible maximum size: 255. It does not mean that every value is actually 255 characters long, it just means that any values in those fields can be at most 255 characters long.
Even if you could change the structure of the linked table (which you can't), it wouldn't make any real difference except to possibly truncate longer Text values, and you could easily do that with a String function. For example, if a particular field had to be restricted to 15 characters then you could simply use Left([fieldName], 15) as a query column or as the control source in a report.

In the end, as the data set is not that large, I have set this up to append from my source data into a table with the correct structure. I can now run my processes against this table as per normal.

Access 2010 Database Cleanup

I have problems with the records in my database. I have a template with about 260,000 records, and each record has 3 identification columns that determine the time period and location the record is from: one for year, one for month, and one for region. The information identifying the specific item is TagName and Description. The problem is that when people entered data into this database, they entered different descriptions for the same device; I know this because the tag name is the same. Can I write code that goes through the database, finds the items with the same tag name, and uses one of the descriptions to replace the ones that differ, so that the database is more uniform? Also, some devices do not have tag names, so we would want to avoid the "" case.
Moving forward, I have also added more columns to the database to allow more information to be retrieved. Is there a way to backfill that data into older records once the database is cleaned up and I know they have the same tag name and description? Thanks in advance for the information; it is much appreciated.
I assume this will have to be done with VBA of some sort: modify records by looking for the first record with a given tag name and using a variable to assign its description to all the other items with the same tag name. I am just not sure of the correct VBA syntax for this. I assume a similar method would be used for the backfilling process?

Your question is rather broad and multifaceted, so I'll answer key parts in steps:
The Problem I am having is when someone entered data into this
database they entered different description for the same device, I
know this because the tag name is the same.
While you could fix up those inconsistencies easily enough with a bit of SQL code, it would be better to avoid those inconsistencies being possible in the first place:
Create a new table, let's call it 'Tags', with TagName and TagDescription fields, and with TagName set as the primary key. Ensure both fields have their Required setting to True and Allow Zero Length to False.
Populate this new table with all possible tags - you can do this with a one-off 'append query' in Access jargon (INSERT INTO statement in SQL).
Delete the tag description column from the main table.
Go into the Relationships view and add a one-to-many relation between the two tables, linking the TagName field in the main table to the TagName field in the Tags table.
As required, create a query that aggregates data from the two tables.
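
The steps above can be sketched as follows, using Python and SQLite purely so the example is runnable; in Access the same statements would be an append query and a select query. The main table name Readings, its Id column, and the use of MIN() to choose which description wins are assumptions for the illustration.

```python
import sqlite3


def build_tags(conn):
    """Create a Tags table holding one description per tag name."""
    conn.execute(
        "CREATE TABLE Tags (TagName TEXT PRIMARY KEY NOT NULL, "
        "TagDescription TEXT NOT NULL)")
    # MIN() is an arbitrary but repeatable way to pick one description per tag.
    conn.execute(
        "INSERT INTO Tags (TagName, TagDescription) "
        "SELECT TagName, MIN(Description) FROM Readings "
        "WHERE TagName IS NOT NULL AND TagName <> '' "
        "AND Description IS NOT NULL "
        "GROUP BY TagName")
    conn.commit()


def readings_with_descriptions(conn):
    """Join the main table to Tags so every row shows the uniform description."""
    return conn.execute(
        "SELECT r.Id, r.TagName, t.TagDescription "
        "FROM Readings AS r LEFT JOIN Tags AS t ON t.TagName = r.TagName "
        "ORDER BY r.Id").fetchall()
```

Rows without a tag name are left out of Tags and simply show no description in the joined result.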
Also some devices do not have tag names so we would want to avoid the
"" Case.
In Access, the concept of an empty string ("") is different from the concept of a true blank or 'null'. As such, it would be a good idea to replace all empty strings (if there are any) with nulls -
UPDATE MyTable SET TagName = Null WHERE TagName = '';
You can then set the TagName field's Allow Zero Length property to False in the table designer.
Also moving forward into the future I have added more columns to the
database to allow for more information to be retrieved
Think less in terms of more columns than more tables.
I assume that this will have to be done with VBA of some sort to modify records
Either VBA, SQL, or the Access query designers (which create SQL code behind the scenes). In terms of being able to crunch through data the quickest, SQL is best, though pure VBA (and in particular, using the DAO object library) can be easier to understand and follow.
