Parsing inconsistent data with SQL

I need to write SQL to extract the repeated location codes and separate out the sub-location detail. However, the data I am working with does not follow a set pattern.
Here's a sample of what the location codes look like (the real table has over 5,000 locations):
JR-DY-TIN
DY-RHOLD
DY-PREQ-TIN
GLVCSH
GLFLR
GLBOX1
GLBOX2
GLBOX3
GLBOXA
GLBOXB
GLBOXC
GLBOXD
GL
GL0001
GL0002
GL0003
GL0014
…
I was able to create a new column for the sub-location detail when it is numeric but that's all I have so far.
select
LocationCode,
REVERSE(LEFT(REVERSE(LocationCode),PATINDEX('%[A-Za-z]%',
REVERSE(LocationCode))-1)) AS PaddedNumbers
from LocationTable
Results...
LocationCode PaddedNumbers
------------ -------------
JR-DY-TIN
DY-RHOLD
DY-PREQ-TIN
GLVCSH
GLFLR
GLBOX1       1
GLBOX2       2
GLBOX3       3
GLBOXA
GLBOXB
GLBOXC
GLBOXD
GL
GL0001       0001
GL0002       0002
GL0003       0003
GL0014       0014
I still can't figure out how to display the following in two separate columns:
Location codes without the sub-location detail, e.g. GLBOX, or just
the original location code if there is no sub-location, e.g. GLFLR.
Numeric and non-numeric sub-location detail at the same time, e.g. for
GLBOX have a column that displays 1, 2, 3, A, B, C, D, E, F.
Edit: If I am able to accomplish this the data should look like this:
LocationCode MainLoc     SubLoc
------------ ----------- ------
JR-DY-TIN    JR-DY-TIN
DY-RHOLD     DY-RHOLD
DY-PREQ-TIN  DY-PREQ-TIN
GLVCSH       GLVCSH
GLFLR        GLFLR
GLBOX1       GLBOX       1
GLBOX2       GLBOX       2
GLBOX3       GLBOX       3
GLBOXA       GLBOX       A
GLBOXB       GLBOX       B
GLBOXC       GLBOX       C
GLBOXD       GLBOX       D
GL           GL
GL0001       GL          0001
GL0002       GL          0002
GL0003       GL          0003
GL0014       GL          0014
Any help is appreciated.
Environment: SQL Server 2008 R2.

It seems like you want to use something like a parseInt feature, which is not available in SQL Server 2008. You can attempt to use cast, but that won't work with your datatype - varchar.
I'd suggest using a case statement to parse the complex logic you need, e.g.:
select
LocationCode,
case when left(LocationCode,5) like 'GLBOX%' then substring(LocationCode,6,2)
when left(LocationCode,3) like 'GL0%' then substring(LocationCode,3,4)
else 'null' end as ParsedLocationCode end
from LocationTable

John's answer seems basically correct. I would write it as:
select LocationCode,
(case when LocationCode like 'GLBOX%' then right(LocationCode, 1)
when LocationCode like 'GL0%' then right(LocationCode, 4)
end) as ParsedLocationCode
from LocationTable
The changes:
Removes the unnecessary left() before like.
Fixes a syntax error (probably a typo with an extra end).
Uses right(), just because it seems simpler.

DECLARE @LocationRef TABLE (Location NVARCHAR(20), Ref INT)
-- Ref is the 1-based position where the sub-location starts (0 = no sub-location)
INSERT INTO @LocationRef VALUES
('JR-DY-TIN',0)
,('DY-RHOLD',0)
,('DY-PREQ-TIN',0)
,('GLVCSH',0)
,('GLFLR',0)
,('GLBOX1',6)
,('GLBOX2',6)
,('GLBOX3',6)
,('GLBOXA',6)
,('GLBOXB',6)
,('GLBOXC',6)
,('GLBOXD',6)
,('GL',0)
,('GL0001',3)
,('GL0002',3)
,('GL0003',3)
,('GL0014',3)
SELECT Location AS LocationCode
,LEFT(Location,CASE Ref WHEN 0 THEN LEN(Location) ELSE Ref - 1 END) AS MainLoc
,RIGHT(Location,CASE Ref WHEN 0 THEN 0 ELSE LEN(Location) - Ref + 1 END) AS SubLoc
FROM @LocationRef
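
For anyone who wants to experiment without a SQL Server instance, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in. It is not the T-SQL from the answers above: it relies on SQLite's two-argument rtrim() to strip trailing digits, and on a hypothetical lookup table, LetterPrefixes, that lists main locations whose final character is a sub-location (only GLBOX in the sample data). Some lookup of this kind seems unavoidable, because a trailing letter alone cannot distinguish GLBOXA from GLFLR.

```python
import sqlite3

# Sample codes copied from the question.
CODES = ["JR-DY-TIN", "DY-RHOLD", "DY-PREQ-TIN", "GLVCSH", "GLFLR",
         "GLBOX1", "GLBOX2", "GLBOX3", "GLBOXA", "GLBOXB", "GLBOXC",
         "GLBOXD", "GL", "GL0001", "GL0002", "GL0003", "GL0014"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LocationTable (LocationCode TEXT)")
conn.executemany("INSERT INTO LocationTable VALUES (?)", [(c,) for c in CODES])

# Hypothetical lookup table (not in the question): main locations whose
# single trailing character, letter or digit, is a sub-location.
conn.execute("CREATE TABLE LetterPrefixes (Prefix TEXT)")
conn.execute("INSERT INTO LetterPrefixes VALUES ('GLBOX')")

SPLIT_SQL = """
SELECT LocationCode,
       MainLoc,
       substr(LocationCode, length(MainLoc) + 1) AS SubLoc
FROM (
    SELECT l.LocationCode,
           -- Use a known prefix if one matches; otherwise strip trailing digits.
           COALESCE(p.Prefix, rtrim(l.LocationCode, '0123456789')) AS MainLoc
    FROM LocationTable AS l
    LEFT JOIN LetterPrefixes AS p
           ON l.LocationCode LIKE (p.Prefix || '_')
)
"""

# Map each code to its (MainLoc, SubLoc) pair.
split = {code: (main, sub) for code, main, sub in conn.execute(SPLIT_SQL)}

for code in CODES:
    print(code, *split[code])
```

Codes with no sub-location come back with an empty string in SubLoc. On SQL Server 2008 R2 the digit stripping would have to be done with PATINDEX and REVERSE as in the question, since the two-argument rtrim() used here is a SQLite feature.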

Related

Case statement logic and substring

Say I have the following data:
Passes
ID | Pass_code
-----------------
100 | 2xBronze
101 | 1xGold
102 | 1xSilver
103 | 2xSteel
Passengers
ID | Passengers
-----------------
100 | 2
101 | 5
102 | 1
103 | 3
I want to count the passes and then produce a ticket count in the output, like this:
ID 100 | 2 pass (bronze)
ID 101 | 5 pass (because it is gold, we count all passengers)
ID 102 | 1 pass (silver)
ID 103 | 2 pass (steel)
I was thinking of something like the code below; however, I am unsure how to finish my case statement. I want to substring pass_code to get the number of passes, e.g. '2xBronze' should give me 2. Then for ID 103, we have 2 passes and 3 customers, so we should output 2.
Also, is there a way to first find '2xbronze' if the pass_code contained lots of other things, such as '101001, 1xbronze, FirstClass'? This may change, so I don't want to rely on a fixed substring position. Could we search for '2xbronze' and then pull out the 2?
SELECT
CASE
WHEN Passes.pass_code like '%gold%' THEN Passengers.passengers
WHEN Passes.pass_code like '%steel%' THEN SUBSTRING(passes.pass_code, 1,1)
WHEN Passes.pass_code like '%bronze%' THEN SUBSTRING(passes.pass_code, 1,1)
WHEN Passes.pass_code like '%silver%' THEN SUBSTRING(passes.pass_code, 1,1)
else 0 end as no,
Passes.ID,
Passes.Pass_code,
Passengers.Passengers
FROM Passes
JOIN Passengers ON Passes.ID = Passengers.ID
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=db698e8562546ae7658270e0ec26ca54

I'm assuming you are indeed using Oracle (as your DB fiddle implies).
You can do some string magic by finding the position of a splitter character (in your case the x), then substringing based on that. Obviously this has its problems, and x is a bad separator character as well, but it works for your current data set.
WITH PASSCODESPLIT AS
(
SELECT PASSES.ID,
TO_Number(SUBSTR(PASSES.PASS_CODE, 0, (INSTR(PASSES.PASS_CODE, 'x')) - 1)) AS NrOfPasses,
SUBSTR(PASSES.PASS_CODE, (INSTR(PASSES.PASS_CODE, 'x')) + 1) AS PassType
FROM Passes
)
SELECT
PASSCODESPLIT.ID,
CASE
WHEN LOWER(PASSCODESPLIT.PassType) = 'gold' THEN Passengers.Passengers
ELSE PASSCODESPLIT.NrOfPasses
END AS NrOfPasses,
PASSCODESPLIT.PassType,
Passengers.Passengers
FROM PASSCODESPLIT
INNER JOIN Passengers ON PASSCODESPLIT.ID = Passengers.ID
ORDER BY PASSCODESPLIT.ID ASC
Gives the result of:
ID NROFPASSES PASSTYPE PASSENGERS
100 2 bronze 2
101 5 gold 5
102 1 silver 1
103 2 steel 3
As can also be seen in this fiddle
But I would strongly advise you to fix your table design. Having multiple attributes in the same column leads to troubles like these. And the more variables/variations you start storing, the more 'magic' you need to keep doing.
In this particular example I see no reason why you don't simply have the 3 columns in Passes, which would also give you the opportunity to add new columns going forward, e.g. to keep track of first class.
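
To make the table-design advice concrete, here is a minimal sketch in Python with the standard-library sqlite3 module (SQLite rather than Oracle, purely so it runs self-contained). The column names Quantity and PassType are assumptions chosen for illustration; they are not from the question. Once the count and the pass type live in their own columns, no string parsing is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized layout: one attribute per column.
CREATE TABLE Passes (ID INTEGER PRIMARY KEY, Quantity INTEGER, PassType TEXT);
CREATE TABLE Passengers (ID INTEGER PRIMARY KEY, Passengers INTEGER);
INSERT INTO Passes VALUES (100, 2, 'bronze'), (101, 1, 'gold'),
                          (102, 1, 'silver'), (103, 2, 'steel');
INSERT INTO Passengers VALUES (100, 2), (101, 5), (102, 1), (103, 3);
""")

TICKET_SQL = """
SELECT p.ID,
       -- Gold covers every passenger; other passes count as purchased.
       CASE WHEN p.PassType = 'gold' THEN pp.Passengers
            ELSE p.Quantity
       END AS NrOfPasses
FROM Passes AS p
JOIN Passengers AS pp ON pp.ID = p.ID
ORDER BY p.ID
"""

tickets = conn.execute(TICKET_SQL).fetchall()
for ticket_id, nr_of_passes in tickets:
    print(ticket_id, nr_of_passes)
```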

You can extract the numbers using regexp_substr(). So I think this does what you want:
SELECT (CASE WHEN LOWER(p.pass_code) LIKE '%gold%'
THEN pp.passengers
ELSE TO_NUMBER(REGEXP_SUBSTR(p.pass_code, '^[0-9]+'))
END) as num,
p.ID, p.Pass_code, pp.Passengers
FROM Passes p JOIN
Passengers pp
ON p.ID = pp.ID;
Here is a db<>fiddle.
This converts the leading digits in the code to a number. Also note the use of table aliases to simplify the query.
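
The follow-up part of the question, where the pass token is buried in a longer string such as '101001, 1xbronze, FirstClass', can be handled with a pattern that is not anchored to the start of the string. Below is a minimal sketch of that idea in Python's re module rather than Oracle SQL; the function name pass_count and the fixed list of pass types are assumptions based on the sample data.

```python
import re

# Digits, a literal 'x', then one of the pass types seen in the sample data.
PASS_RE = re.compile(r"(\d+)x(bronze|silver|steel|gold)", re.IGNORECASE)


def pass_count(pass_code):
    """Return (count, pass type) for the first 'Nx<type>' token, or None."""
    match = PASS_RE.search(pass_code)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()


print(pass_count("101001, 1xbronze, FirstClass"))
print(pass_count("2xBronze"))
```

The leading 101001 is skipped because it is not followed by an x and a known pass type. A comparable pattern should carry over to a database regex function, though that translation is not shown or tested here.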

Selecting rows using multiple LIKE conditions from a table field

I created a table out of a CSV file which is produced by an external software.
Amongst the other fields, this table contains one field called "CustomID".
Each row on this table must be linked to a customer using the content of that field.
Every customer may have one or more set of customIDs at their own discretion, as long as each sequence starts with the same prefix.
So for example:
Customer 1 may use "cust1_n" and "cstm01_n" (where n is a number)
Customer 2 may use "customer2_n"
ImportedRows
PKID CustomID Description
---- --------------- --------------------------
1 cust1_001 Something
2 cust1_002 ...
3 cstm01_000001 ...
4 customer2_00001 ...
5 cstm01_000232 ...
..
Now I have created 2 support tables as follows:
Customers
PKID Name
---- --------------------
1 Customer 1
2 Customer 2
and
CustomIDs
PKID FKCustomerID SearchPattern
---- ------------ -------------
1 1 cust1_*
2 1 cstm01_*
3 2 customer2_*
What I need to achieve is the retrieval of all rows for a given customer using all the LIKE conditions found on the CustomIDs tables for that customer.
I have failed miserably so far.
Any clues, please?
Thanks in advance.
Silver.

To use LIKE you must replace the * with % in the pattern. Different DBMSs use different functions for string manipulation. Let's assume there is a REPLACE function available:
SELECT ir.*
FROM ImportedRows ir
JOIN CustomIDs c ON ir.CustomID LIKE REPLACE(c.SearchPattern, '*', '%')
WHERE c.FKCustomerID = 1;
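
One thing to watch with this approach: in LIKE, the underscore is itself a wildcard that matches any single character, so a stored pattern such as cust1_* would also match an ID like cust1x999. Here is a minimal sketch of the join with the underscore escaped, using Python's standard-library sqlite3 module (SQLite is an assumption, since the question does not name a DBMS). Row 6 and the choice of '!' as the escape character are illustrative additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ImportedRows (PKID INTEGER PRIMARY KEY, CustomID TEXT);
CREATE TABLE CustomIDs (PKID INTEGER PRIMARY KEY, FKCustomerID INTEGER,
                        SearchPattern TEXT);
INSERT INTO ImportedRows VALUES
    (1, 'cust1_001'), (2, 'cust1_002'), (3, 'cstm01_000001'),
    (4, 'customer2_00001'), (5, 'cstm01_000232'),
    (6, 'cust1x999');  -- hypothetical row, not in the question
INSERT INTO CustomIDs VALUES
    (1, 1, 'cust1_*'), (2, 1, 'cstm01_*'), (3, 2, 'customer2_*');
""")

MATCH_SQL = """
SELECT ir.PKID
FROM ImportedRows AS ir
JOIN CustomIDs AS c
  -- Escape literal underscores first, then turn * into the LIKE wildcard %.
  ON ir.CustomID LIKE REPLACE(REPLACE(c.SearchPattern, '_', '!_'), '*', '%')
     ESCAPE '!'
WHERE c.FKCustomerID = ?
ORDER BY ir.PKID
"""


def rows_for_customer(customer_id):
    """Return the PKIDs of imported rows matching any pattern of the customer."""
    return [pkid for (pkid,) in conn.execute(MATCH_SQL, (customer_id,))]


print(rows_for_customer(1))
print(rows_for_customer(2))
```

If two patterns of the same customer can match the same row, SELECT DISTINCT would be needed to avoid duplicates.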

How to make SQL query that will combine rows of result from one table with rows of another table in specific conditions in SQLite

I have a SQLite3 database with three tables. Sample data looks like this:
Original
id aName code
------------------
1 dog DG
2 cat CT
3 bat BT
4 badger BDGR
... ... ...
Translated
id orgID isTranslated langID aName
----------------------------------------------
1 2 1 3 katze
2 1 1 3 hund
3 3 0 3 (NULL)
4 4 1 3 dachs
... ... ... ... ...
Lang
id Langcode
-----------
1 FR
2 CZ
3 DE
4 RU
... ...
I want to select all data from Original and Translated in such a way that the result consists of all the rows in the Original table, but with aName replaced by aName from the Translated table for rows that have a translation, so that I can then apply an ORDER BY clause and sort the data in the desired way.
All data and table designs are examples just to show the problem. The schema does contain some elements like an isTranslated column or translation and original names in separate tables. These elements are required by application destination/design.
To be more specific this is an example rowset I would like to produce. It's all the data from table Original modified by data from Translated if translation is available for that certain id from Original.
Desired Result
id aName code isTranslated
---------------------------------
1 hund DG 1
2 katze CT 1
3 bat BT 0
4 dachs BDGR 1
... ... ... ...

This is a typical application for the CASE expression:
SELECT Original.id,
CASE isTranslated
WHEN 1 THEN Translated.aName
ELSE Original.aName
END AS aName,
code,
isTranslated
FROM Original
JOIN Translated ON Original.id = Translated.orgID
WHERE Translated.langID = (SELECT id FROM Lang WHERE Langcode = 'DE')
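If not all records in Original have a corresponding record in Translated, use LEFT JOIN instead, and move the langID condition from the WHERE clause into the ON clause so that the unmatched rows are not filtered out again.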
If untranslated names are guaranteed to be NULL, you can just use IFNULL(Translated.aName, Original.aName) instead.
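
The LEFT JOIN and IFNULL variants can be combined. Below is a minimal runnable sketch using Python's standard-library sqlite3 module with the sample data from the question; the fifth animal (owl) is a hypothetical row added only to show what happens when Translated has no matching record at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Original (id INTEGER PRIMARY KEY, aName TEXT, code TEXT);
CREATE TABLE Translated (id INTEGER PRIMARY KEY, orgID INTEGER,
                         isTranslated INTEGER, langID INTEGER, aName TEXT);
CREATE TABLE Lang (id INTEGER PRIMARY KEY, Langcode TEXT);
INSERT INTO Original VALUES (1, 'dog', 'DG'), (2, 'cat', 'CT'),
                            (3, 'bat', 'BT'), (4, 'badger', 'BDGR'),
                            (5, 'owl', 'OWL');  -- hypothetical, no translation row
INSERT INTO Translated VALUES (1, 2, 1, 3, 'katze'), (2, 1, 1, 3, 'hund'),
                              (3, 3, 0, 3, NULL), (4, 4, 1, 3, 'dachs');
INSERT INTO Lang VALUES (1, 'FR'), (2, 'CZ'), (3, 'DE'), (4, 'RU');
""")

QUERY = """
SELECT o.id,
       IFNULL(t.aName, o.aName)  AS aName,        -- fall back to the original
       o.code,
       IFNULL(t.isTranslated, 0) AS isTranslated  -- missing row = untranslated
FROM Original AS o
LEFT JOIN Translated AS t
       ON t.orgID = o.id
      -- The language filter lives in ON so unmatched rows survive the join.
      AND t.langID = (SELECT id FROM Lang WHERE Langcode = 'DE')
ORDER BY o.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```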

You should probably list the actual results you want, which would help people help you in the future.
In the current case, I'm guessing you want something along these lines:
SELECT Original.id, Original.code, Translated.aName
FROM Original
JOIN Lang
ON Lang.langCode = 'DE'
JOIN Translated
ON Translated.orgId = Original.id
AND Translated.langId = Lang.id
AND Translated.aName IS NOT NULL;
(Check out my example to see if these are the results you want).
In any case, the table set you've got is heading towards a fairly standard 'translation table' setup. However, there are some basic changes I'd make.
Original
Name the table to something specific, like Animal
Don't include a 'default' translation in the table (you can use a view, if necessary).
'code' is fine, although in the case of animals, genus/species probably ought to be used
Lang
'Language' is often a reserved word in RDBMSs, so the name is fine.
Specifically name which 'language code' you're using (and don't abbreviate column names). There are actually (up to) three different ISO codes possible - just grab them all.
(Also, remember that languages have language-specific names, so language also needs its own 'translation' table)
Translated
Name the table entity-specific, like AnimalNameTranslated, or somesuch.
isTranslated is unnecessary - you can derive it from the existence of the row - don't add a row if the term isn't translated yet.
Put all 'translations' into the table, including the 'default' one. This means all your terms are in one place, so you don't have to go looking elsewhere.

Efficient SQL to merge results or leave to client browser javascript?

I was wondering, what is the most efficient way of combining results into a single result.
I want to turn
Num Ani Country
---- ----- -------
22 cows Canada
20 pigs Canada
40 cows USA
34 pigs USA
into:
cows pigs Country
----- ----- -------
22 20 Canada
40 34 USA
I want to know if it would be better to use SQL only or if I should feed the whole query result set to the user. Once given to the user, I could use JavaScript to parse it into the desired format.
Also, I do not know exactly how I would change this into the right format for a SQL query. The only way I can think of approaching this SQL statement is very roundabout with dynamically creating a temporary table.

The operation you're after is called "pivoting" - the PIVOT info page has a little more detail:
SELECT MAX(CASE WHEN t.ani = 'cows' THEN t.num ELSE NULL END) AS cows,
MAX(CASE WHEN t.ani = 'pigs' THEN t.num ELSE NULL END) AS pigs,
t.country
FROM YOUR_TABLE t
GROUP BY t.country
ORDER BY t.country

There should be an efficient way to do the pivoting on the client side (PHP) using a 2-D array. To address Ken Downs' concerns about byte pushing: ragged raw pivot data consumes fewer bytes than a fully materialized 2-D pivot table. The simple case is
cows | pigs | sheep | goats | country
1 null null null Canada
null 2 null null USA
null null 3 null Egypt
null null null 4 England
which is only 4 rows of raw data (each being 3 columns).
Doing it in the front end also solves the issue of dynamic-pivoting. If your number of pivot columns is unknown, then you would require a MySQL procedure to build up a dynamic sql statement of the pattern "MAX(CASE....)" for each column.
There are advantages to doing this on the client side:
can be done (at least considered as an alternative)
can be rendered earlier, if the savings in network traffic is significant despite requiring either (1) php pivottable construction or (2) client side javascript
does not require a MySQL procedure for dynamic pivoting
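
As a rough illustration of the client-side approach, here is a minimal sketch of a dynamic pivot over the raw (Num, Ani, Country) rows. It is written in Python rather than the PHP or JavaScript mentioned above, and the function name pivot is an assumption. The pivot columns are discovered from the data, so no dynamic SQL is needed.

```python
def pivot(rows):
    """Pivot (num, ani, country) rows into one row per country.

    Returns (header, body); animals with no row for a country become None.
    """
    columns = []  # animal names, in order of first appearance
    table = {}    # country -> {animal: num}
    for num, ani, country in rows:
        if ani not in columns:
            columns.append(ani)
        table.setdefault(country, {})[ani] = num

    header = columns + ["Country"]
    body = [[table[country].get(ani) for ani in columns] + [country]
            for country in sorted(table)]
    return header, body


# Raw rows from the question.
raw = [(22, "cows", "Canada"), (20, "pigs", "Canada"),
       (40, "cows", "USA"), (34, "pigs", "USA")]

header, body = pivot(raw)
print(header)
for line in body:
    print(line)
```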

SQL: Select distinct based on regular expression

Basically, I'm dealing with a horribly set up table that I'd love to rebuild, but am not sure I can at this point.
So, the table is of addresses, and it has a ton of similar entries for the same address. But there are sometimes slight variations in the address (i.e., a room # is tacked on IN THE SAME COLUMN, ugh).
Like this:
id | place_name | place_street
1 | Place Name One | 1001 Mercury Blvd
2 | Place Name Two | 2388 Jupiter Street
3 | Place Name One | 1001 Mercury Blvd, Suite A
4 | Place Name, One | 1001 Mercury Boulevard
5 | Place Nam Two | 2388 Jupiter Street, Rm 101
What I would like to do in SQL (this is mssql), if possible, is run a query like:
SELECT DISTINCT place_name, place_street where [the first 4 letters of the place_name are the same] && [the first 4 characters of the place_street are the same].
to, I guess at this point, get:
Plac | 1001
Plac | 2388
Basically, then I can figure out what are the main addresses I have to break out into another table to normalize this, because the rest are just slight derivations.
I hope that makes sense.
I've done some research and I see people using regular expressions in SQL, but a lot of them seem to be using C scripts or something. Do I have to write regex functions and save them into the SQL Server before executing any regular expressions?
Any direction on whether I can just write them in SQL or if I have another step to go through would be great.
Or on how to approach this problem.
Thanks in advance!

Use the SQL function LEFT:
SELECT DISTINCT LEFT(place_name, 4), LEFT(place_street, 4)
FROM AddressTable -- table name assumed; the question does not give it

I don't think you need regular expressions to get the results you describe. You just want to trim the columns and group by the results, which will effectively give you distinct values.
SELECT left(place_name, 4), left(place_street, 4), count(*)
FROM AddressTable
GROUP BY left(place_name, 4), left(place_street, 4)
The count(*) column isn't necessary, but it gives you some idea of which values might have the most (possibly) duplicate address rows in common.
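
To try the grouping on the sample rows without a SQL Server instance, here is a minimal sketch using Python's standard-library sqlite3 module. SQLite is only a stand-in: substr(x, 1, 4) plays the role of LEFT(x, 4), and the table name AddressTable is taken from the query above rather than from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AddressTable (id INTEGER PRIMARY KEY, place_name TEXT,
                           place_street TEXT);
INSERT INTO AddressTable VALUES
    (1, 'Place Name One',  '1001 Mercury Blvd'),
    (2, 'Place Name Two',  '2388 Jupiter Street'),
    (3, 'Place Name One',  '1001 Mercury Blvd, Suite A'),
    (4, 'Place Name, One', '1001 Mercury Boulevard'),
    (5, 'Place Nam Two',   '2388 Jupiter Street, Rm 101');
""")

GROUP_SQL = """
SELECT substr(place_name, 1, 4)   AS name_prefix,
       substr(place_street, 1, 4) AS street_prefix,
       COUNT(*)                   AS n
FROM AddressTable
GROUP BY name_prefix, street_prefix
ORDER BY street_prefix
"""

groups = conn.execute(GROUP_SQL).fetchall()
for name_prefix, street_prefix, n in groups:
    print(name_prefix, street_prefix, n)
```

Four characters is a crude key: every name in the sample collapses to 'Plac', so it is really the street prefix that separates the two addresses here.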

I would recommend you look into Fuzzy Search Operations in SQL Server. You can match the results much better than what you are trying to do. Just google sql server fuzzy search.

Assuming at least SQL Server 2005 for the CTE:
;with cteCommonAddresses as (
select left(place_name, 4) as LeftName, left(place_street,4) as LeftStreet
from Address
group by left(place_name, 4), left(place_street,4)
having count(*) > 1
)
select a.id, a.place_name, a.place_street
from cteCommonAddresses c
inner join Address a
on c.LeftName = left(a.place_name,4)
and c.LeftStreet = left(a.place_street,4)
order by a.place_name, a.place_street, a.id
