how do I retrieve data from a sql table with huge number of inputs for a single column - sql

I have a Company table in SQL Server and I would like to retrieve list of data related to particular companies and list of companies is very huge of around 200 company names and I am trying to use IN clause of T-SQL which is complicating the retrieval as few the companies have special characters in their name like O'Brien and so its throwing up an error as it is obvious.
SELECT *
FROM COMPANY
WHERE COMPANYNAME IN
('Archer Daniels Midland'
'Shell Trading (US) Company - Financial'
'Redwood Fund, LLC'
'Bunge Global Agribusiness - Matt Thibodeaux'
'PTG, LLC'
'Morgan Stanley Capital Group'
'Vitol Inc.'..
.....
....
.....)
Above is the script that is not working for obvious reasons, is there any way I can input those company names from an excel file and retrieve the data?

The easiest way would be to make a table and join it:
CREATE TABLE dbo.IncludedCompanies (CompanyName varchar(1000)
INSERT INTO dbo.IncludedCompanies
VALUES
('Archer Daniels Midland'),
('PTG, LLC')
...
SELECT *
FROM Company C
JOIN IncludedCompanies IC
ON C.CompanyName = IC.CompanyName

I do not think that mysql knows how to handle excel format, but you can fix your query.
Check how complicated names are stored in database (check if they have escape characters in them or anything else".
Replace all ' with \' in your query and it will take care of the ' characters
mysql> select now() as 'O\'Brian'; returns
O'Brian
2014-03-17 15:06:39

So i'm guessing you have a excel sheet with a column containing these names, and you want to use this in your where clause. In addition, some of the values have special characters in them, which needs to be escaped.
First thing you do is to escape the '-characters. You do this in excel, with a search replace for all occurences of ' with '' (the escaped version in sqlserver (\' in MySQL.)) Then, create a new column on each side side of your companies column, and in the first row input a ' on the left hand side, and ', on the right. Then use the copy cell functionality (the little square in the bottom right of the cell when you select it) to copy the cells to the left and right to all the rows, as far as the company list goes (just grab the square and pull it downwards..)
Then, take your list, now containing three columns and x rows and paste it into your favorite text editor. It should look something like this:
' Company#1 ',
' Company with special '' char ',
[...]
' Last company ',
Now, you will have some whitespace to get rid of. Use search replace and replace two space characters with nothing, and repeat (or take the space from the first ' to the start of the text and replace this with nothing.
Now, you should have a list of:
'Company#1',
'Company with special '' char',
[...]
'Last company',
Remove the last comma, and you'll have a valid list of parameters to your in-clause (or a (temporary) table if you want to keep your query a bit cleaner.)

Related

SQL code to remove extra spaces and line breaks in free text?

I'm currently working with a table that deals with patients who have visited a clinic. One of the fields in this table shows the reason for the visit, and it's free text so whoever's booking the appointment can leave a custom note for the doctor depending on what the issue is. Yes, I'm well aware free text is the actual worst, but I did not design this database or the front-end medical record system (which is also the worst) and I'm simply stuck dealing with it. Bear with me.
Because of the special characters, extra spaces, and carriage returns that often find their way into that free text field on the front end, all its contents would show up on a single line in SSMS but would cause all sorts of formatting issues with extra line breaks when the SQL results were pasted into Excel. I did a little research and found a snippet of code that would replace carriage returns, etc. in a given field, thus forcing all the contents of that field to remain in a single cell:
REPLACE(REPLACE(FieldName,char(10),''),char(13),'') as FieldName
This has worked splendidly for this VisitReason field and any other free text fields I've been forced to work with. However, does it account for every possible issue one might find in free text? Yesterday I was working with this table and pasted the results from SSMS into Excel, and there were two people whose VisitReason fields were cut off prematurely and then had all the results (as in multiple fields) from a bunch of other people's visits crammed into that same field (thus making for one really long cell in Excel).
For example, the VisitReason for one of these people showed up in SSMS as complaining of rash, see note. But then when it was pasted into Excel, the results looked like...
PatientID PatientName VisitDate ... VisitReason
----------------------------------------------------------------------------------------------
1001 Smith, John 01/08/2023 ... complaining of rash, see
PatientID1002PatientNameJaneDoeVisitDate01/08/2023VisitRe
asondiabetesfollowupPatientID1003PatientNameBobBrownVisitDa
(and so on)
I can't tell if this has something to do with the free text field, and there's some hidden character in there that's causing the weird line breaks and field merging that my REPLACE function isn't catching, or whether it's an error with Excel (in which case this obviously isn't the right place to be asking). But I wanted to check and see if there was anything that potentially needed to be added to the REPLACE line that would fix the problem.
My full query is really simple:
SELECT
d.PatientID,
d.PatientName,
v.VisitDate,
[some other visit-related fields, none of which are free text],
REPLACE(REPLACE(v.VisitReason,char(10),''),char(13),'') as VisitReason,
[some other demographic fields, none of which are free text]
FROM Demographics d
JOIN Visit v ON d.PatientID = v.PatientID
The REPLACE function works perfectly fine for literally every other patient in the list except for the two with results like what's shown above, which then go on to affect a number of other rows following them. Anyone have any thoughts?
Please try the following solution.
The xs:token data type is stripping out the white space characters.
SQL
USE tempdb;
GO
DROP FUNCTION IF EXISTS dbo.udf_tokenize;
GO
/*
1. All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
2. Then leading and trailing spaces are removed from the value.
3. Further, contiguous occurrences of more than one space will be replaced with a single space.
*/
CREATE FUNCTION dbo.udf_tokenize(#input VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN (SELECT CAST('<r><![CDATA[' + #input + ' ' + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
END
GO
-- DDL and sample data population, start
DECLARE #mockTbl TABLE (ID INT IDENTITY(1,1), col_1 VARCHAR(100), col_2 VARCHAR(100));
INSERT INTO #mockTbl (col_1, col_2) VALUES
(CHAR(13) + ' FL ' + CHAR(9), CHAR(10) + ' Miami'),
(' FL ', ' Fort Lauderdale '),
(' NY ', ' New York '),
(' NY ', ''),
(' NY ', NULL);
-- DDL and sample data population, end
-- before
SELECT *, LEN(col_2) AS [col_2_len]
FROM #mockTbl;
-- remove invisible white space chars
UPDATE #mockTbl
SET col_1 = dbo.udf_tokenize(col_1)
, col_2 = dbo.udf_tokenize(col_2);
-- after
SELECT *, LEN(col_2) AS [col_2_len]
FROM #mockTbl;

Query to retrieve only columns which have last name starting with 'K'

]
The name column has both first and last name in one column.
Look at the 'rep' column. How do I filter only those rep names where the last name is starting with 'K'?
The way that table is defined won't allow you to do that query reliably--particularly if you have international names in the table. Not all names come with the given name first and the family name second. In Asia, for example, it is quite common to write names in the opposite order.
That said, assuming you have all western names, it is possible to get the information you need--but your indexes won't be able to help you. It will be slower than if your data had been broken out properly.
SELECT rep,
RTRIM(LEFT(LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep))), CHARINDEX(' ', LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep)))) - 1)) as family_name
WHERE family_name LIKE 'K%'
So what's going on in that query is some string manipulation. The dialect up there is SQL Server, so you'll have to refer to your vendor's string manipulation function. This picks the second word, and assumes the family name is the second word.
LEFT(str, num) takes the number of characters calculated from the left of the string
RIGHT(str, num) takes the number of characters calculated from the right of the string
CHARINDEX(char, str) finds the first index of a character
So you are getting the RIGHT side of the string where the count is the length of the string minus the first instance of a space character. Then we are getting the LEFT side of the remaining string the same way. Essentially if you had a name with 3 parts, this will always pick the second one.
You could probably do this with SUBSTRING(str, start, end), but you do need to calculate where that is precisely, using only the string itself.
Hopefully you can see where there are all kinds of edge cases where this could fail:
There are a couple records with a middle name
The family name is recorded first
Some records have a title (Mr., Lord, Dr.)
It would be better if you could separate the name into different columns and then the query would be trivial--and you have the benefit of your indexes as well.
Your other option is to create a stored procedure, and do the calculations a bit more precisely and in a way that is easier to read.
Assuming that the name is <firstname> <lastname> you can use:
where rep like '% K%'

Oracle RegEx in a Cast Procedure

I have a Cast Procedure for a table with "raw" data. Any time a record comes from any of our locations into the raw table, my procedure "cleans" the data and loads it into a new table. The original raw table is all varchars and my procedure converts date and number fields to the proper data types. From the clean table, a Java program selects any new records on a daily basis and FTPs them off in a file to another dept. Have just learned that a few of the fields accept input from users and on a rare occasion, someone uses a pipe in what they input. A pipe symbol happens to be the delimiter that the other dept is using and whenever a pipe shows up in the middle of a field, it throws a wrench on their end.
I've never used REGEX or REGEXP_REPLACE in Oracle before. There are only three fields where the users can input data - MISTINTCOMMENT, PALETTE, COLORID. How do I use REGEX or REGEXP_REPLACE to replace any pipes with a space? Do I want to do it on each field? Or is this something I should "wrap around" the entire statement (in case there's a field I missed where someone might be able to input a pipe)?
Here is the portion of the procedure where the Values are cleaned and inserted into new table. How to best use RegEx with this?
VALUES (CASE
WHEN THECOSTCENTER IS NOT NULL
THEN THECOSTCENTER
ELSE (SUBSTR(TRIM(THESENDING_QMGR), -6))
END,
CASE
WHEN THESTORENBR = '0' AND (SUBSTR(THESENDING_QMGR, 1, 5) = 'PDPOS')
THEN TO_NUMBER(SUBSTR(THESENDING_QMGR, 8, 4))
WHEN THESTORENBR = '0' AND (SUBSTR(THESENDING_QMGR, 1, 8) = 'PROD_POS')
THEN TO_NUMBER(SUBSTR(THESENDING_QMGR, 9, 4))
ELSE TO_NUMBER(NVL(THESTORENBR,'0'))
END,
TO_NUMBER(NVL(THECONTROLNBR,'0')), TO_NUMBER(NVL(THELINENBR,'0')), THESALESNBR, TO_NUMBER(NVL(THEQTYMISTINT,'0')), THEREASONCODE, THEMISTINTCOMMENT,
THESIZECODE, THETINTERMODEL, THETINTERSERIALNBR, TO_NUMBER(NVL(THEEMPNBR,'0')), TO_DATE(THETRANDATE,'YYYY-MM-DD'), THETRANTIME, THECDSADLFLD,
THEPRODNBR, THEPALETTE, THECOLORID, TO_DATE(THEINITTRANDATE,'YYYY-MM-DD'), TO_NUMBER(NVL(THEGALLONSMISTINTED,'0'),'999999999.99'), THEUPDATEEMPNBR,
TO_DATE(THEUPDATETRANDATE,'YYYY-MM-DD'), TO_NUMBER(NVL(THEGALLONS,'0'),'999999999.99'), THEFORMSOURCE, THEUPDATETRANTIME, THESOURCEIND,
TO_DATE(THECANCELDATE,'YYYY-MM-DD'), THECOLORTYPE, TO_NUMBER(NVL(THECANCELEMPNBR,'0')), TO_BOOLEAN(THENEEDEXTRACTED), TO_BOOLEAN(THEMISTINTMQXTR),
THEDATASOURCE, THETRANGUID, TO_NUMBER(NVL(THETERMNBR,'0')), TO_NUMBER(NVL(THETRANNBR,'0')), TO_NUMBER(NVL(THETRANID,'0')), THEID, THETINTABLESALESNBR,
TO_NUMBER(NVL(THERETURNQTY,'0')), THECREATED_TS, THEXMIT_GUID, THESENDING_QMGR, THEMSG_ID, THEPUT_TS,
THEBROKER_NAME, THECHECKSUM);
If you have to use a REGEXP_REPLACE to replace pipes, escape them:
REGEXP_REPLACE(x, '\|', ' ')
This is useful to know when your more complex expressions include a pipe.
In this case, REPLACE that performs literal text search and replace will suffice:
REPLACE(x, '|', ' ')

SELECT middle part of a String if it exists. Postgresql

i've got a problem with transferring "real-World" data into my schema.
It's actually a "project" for my Database course and they gave us ab table with dog race results. This Table has a column which contains the name of the Dog (which itself consists of the actuall name and the name of the breeder) and informations about the Birthcountry, actual living Country and the birth year.
Example filed are "Lillycette [AU 2012]" or "Black Bear Lee [AU/AU 2013]" or "Lemon Ralph [IE/UK 1998]".
I've managed it to get out the first word and save it in the right column with split_part like this:
INSERT INTO tblHund (rufname)
SELECT
split_part(name, ' ', 1) AS rufname,
FROM tblimport;
tblimport is a table where I dumped the data from the csv file.
That works just as it should.
Accessing the second part of the Name with this fails because sometimes there isn't a second part and sometimes times there second part consists of two words.
And this is the where iam stuck right now.
I tried it with substring and regular expressions:
INSERT INTO tblZwinger (Name)
SELECT
substring(vatertier from E'[^ ]*\\ ( +)$')AS Name
FROM tblimport
WHERE substring(vatertier from E'[^ ]*\\ ( +)$') != '';
The above code is executed without errors but actually does nothing because the SELECT statement just give empty strings back.
It took me more then 3h to understand a bit of this regular Expressions but I still feel pretty stupid when I look at them.
Is there any other way of doing this. If so just give me a hint.
If not what is wrong with my expression above?
Thanks for your help.
You need to use atom ., which matches any single character inside capturing group:
E'[^ ]*\\ (.+)$'
SELECT
tblimport.*,
ti.parts[1] as f1,
ti.parts[2] as f2, -- It should be the "middle part"
ti.parts[3] as f3
FROM
tblimport,
regexp_matches(tblimport.vatertier, '([^\s]+)\s*(.*)\s+\[(.*)\]') as ti(parts)
WHERE
nullif(ti.parts[2], '') is not null
Something like above.

Remove unnecessary Characters by using SQL query

Do you know how to remove below kind of Characters at once on a query ?
Note : .I'm retrieving this data from the Access app and put only the valid data into the SQL.
select DISTINCT ltrim(rtrim(a.Company)) from [Legacy].[dbo].[Attorney] as a
This column is company name column.I need to keep string characters only.But I need to remove numbers only rows,numbers and characters rows,NULL,Empty and all other +,-.
Based on your extremely vague "rules" I am going to make a guess.
Maybe something like this will be somewhere close.
select DISTINCT ltrim(rtrim(a.Company))
from [Legacy].[dbo].[Attorney] as a
where LEN(ltrim(rtrim(a.Company))) > 1
and IsNumeric(a.Company) = 0
This will exclude entries that are not at least 2 characters and can't be converted to a number.
This should select the rows you want to delete:
where company not like '%[a-zA-Z]%' and -- has at least one vowel
company like '%[^ a-zA-Z0-9.&]%' -- has a not-allowed character
The list of allowed characters in the second expression may not be complete.
If this works, then you can easily adapt it for a delete statement.