I'm working with a free dataset from Kaggle to learn SQL further and I've hit a spot where I am stuck. I'm working with an NFL Draft dataset that has player names listed like this:
FirstName LastName\UserName
However, some of the rows are simply this:
FirstName LastName
I wrote this code and have had some success:
SELECT position, substr(Player,0,instr(Player,'\')) AS Player_Name
This specfic code works great on any rows that are formatted like FirstName LastName\UserName but for any rows that are formatted like this FirstName LastName it returns a blank for the Player_Name.
Any tips on how I can fix this to show the FirstName LastName ONLY on my query for both ways?
INSTR returns 0 if the char is not in the string
Supplying a length of 0 causes an output from SUBSTR of zero length string
This is why your name disappears
What we can do is convert the 0 to something else:
SUBSTR(player, 0, COALESCE(NULLIF(INSTR(player, '/'), 0), 9999)
Now if INSTR returns 0, NULLIF converts it to NULL, and COALESCE converts the NULL to 9999
SQLite doesn't care that 9999 is beyond the end of the string (make it bigger if it isn't) and returns the whole rest of string, so effectively the SUBSTR is a non-op
There are other ways to skin the cat:
CASE WHEN player LIKE '%/%' THEN SUBSTR(...) ELSE player END
but they might search the string twice, for lower overall performance.. It might not matter, if it's a handful of values queried infrequently, you might prefer a more readable form over a (theoretically) more performant one
Related
The data in my table represents physical locations with the following data: A municipality name, a state(province/region), and a unique ID which is comprised of a prefix and postfix separated by a dash (all NVARCHAR).
Name
State
UniqueID
Atlanta
Georgia
A12-1383
The dash in the UniqueID is not always in the same position (can be A1-XYZ, A1111-XYZ, etc.). The postfix will always be numerical.
My approach is using a combination of RIGHT and CHARINDEX to first find the index of the dash and then isolate the postfix to the right of the dash, and then applying a MAX to the result. My issue so far has been that this combination is sometimes returning things like -1234 or 12-1234, i.e, including the dash and occasionally some of the prefix. Because of this, the max is obviously applied incorrectly.
Here is my query thus far:
select name, max(right(uniqueid,(Charindex('-',uniqueid)))) as 'Max'
from locations
where state = 'GA' and uniqueid is not NULL
group by name
order by name ASC
This is what the results look like for the badly formatted rows:
Name
Max
Atlanta
11-2442
Savannah
-22
This is returning incorrectly formatted data for 'Max', so I isolated the functions.
CHARINDEX is correctly returning the position of the dash, including in cases where the function is returning badly formatted data. Since the dash is never in the same place, I cannot isolate the RIGHT function to see if that is the problem.
Am I using these functions incorrectly? Or is my approach altogether incorrect?
Thanks!
CHARINDEX is counting how many chars before the - i.e. how many chars on the left hand side, so using RIGHT with the number of chars on the left isn't going to work. You need to find out how many chars are on the right which can be done with LEN() to get the total length less the number of chars on the left.
SELECT RIGHT(MyColumn,LEN(MyColumn)-CHARINDEX('-',MyColumn))
FROM (
VALUES
('A12ssss-1383'),
('A12-13834'),
('A12ss-138')
) X (MyColumn);
]
The name column has both first and last name in one column.
Look at the 'rep' column. How do I filter only those rep names where the last name is starting with 'K'?
The way that table is defined won't allow you to do that query reliably--particularly if you have international names in the table. Not all names come with the given name first and the family name second. In Asia, for example, it is quite common to write names in the opposite order.
That said, assuming you have all western names, it is possible to get the information you need--but your indexes won't be able to help you. It will be slower than if your data had been broken out properly.
SELECT rep,
RTRIM(LEFT(LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep))), CHARINDEX(' ', LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep)))) - 1)) as family_name
WHERE family_name LIKE 'K%'
So what's going on in that query is some string manipulation. The dialect up there is SQL Server, so you'll have to refer to your vendor's string manipulation function. This picks the second word, and assumes the family name is the second word.
LEFT(str, num) takes the number of characters calculated from the left of the string
RIGHT(str, num) takes the number of characters calculated from the right of the string
CHARINDEX(char, str) finds the first index of a character
So you are getting the RIGHT side of the string where the count is the length of the string minus the first instance of a space character. Then we are getting the LEFT side of the remaining string the same way. Essentially if you had a name with 3 parts, this will always pick the second one.
You could probably do this with SUBSTRING(str, start, end), but you do need to calculate where that is precisely, using only the string itself.
Hopefully you can see where there are all kinds of edge cases where this could fail:
There are a couple records with a middle name
The family name is recorded first
Some records have a title (Mr., Lord, Dr.)
It would be better if you could separate the name into different columns and then the query would be trivial--and you have the benefit of your indexes as well.
Your other option is to create a stored procedure, and do the calculations a bit more precisely and in a way that is easier to read.
Assuming that the name is <firstname> <lastname> you can use:
where rep like '% K%'
I trying to draw a statement like this
SELECT CONCAT(street_name, ' ', street_number) as 'street_detail'
FROM geo_map
WHERE CONCAT(street_name, ' ', street_number) LIKE '%'
My table is something like this
postal_code int
building_name nchar(200)
street_number nchar(60)
street_name nchar(120)
The result I get was just the street name, less the street number, although my street number have value, any idea what's went wrong in my concat.
I am using SQL Server
It is best to use NVARCHAR(...) instead of NCHAR(...) types for storing information like what you have. The reason is that for NCHAR(...) types, strings are padded with trailing spaces to fill the whole length of the field.
A string in an NCHAR(200) field is always 200 characters wide. The concatenation of street_name, a space and the street_number will be 261 characters wide. The building number will appear on the 202nd character in the concatenation.
Perhaps you are not seeing a street number in your concatenation because your display field (in your program, SSMS, webpage, ...) just isn't wide enough.
Now with storing your street name in an NVARCHAR(200) and pretty much all other related information in NVARCHAR(...) fields, you would not have that problem. Strings stored in those fields are not padded with trailing spaces, and you would see your street number at the place you expected in your concatenation.
I'm attempting to isolate eight digits from a cell that contains other numbers as well as text and no rhyme or reason to where it is placed. An example return would look something like this:
will deliver 11/07 in USA at 12:30 with conf# 12345678
I need the conf# only, but it could be at the end, beginning, middle of the string and I don't know how to isolate it. I'm working in DB2 so I can't use functions such as PATINDEX or CHARINDEX, so what are my other option for pulling out only "12345678" regardless of where it is located?
While DB2 doesn't have PATINDEX or CHARINDEX, it does have LOCATE.
If your DB2 version supportx pureXML, you can use the regular expression support in XQuery, something like:
select xmlcast(
xmlquery(
' if (fn:matches( $YOURCOLUMN, "(^|.*[^\d])(\d{8})([^\d].*$|$)")) then fn:replace( $YOURCOLUMN,"(^|.*[^\d])(\d{8})([^\d].*$|$)","$2") else "" '
)
as varchar(20)
)
from YOURTABLE
This assumes that 8-digit sequence appears only once in the column. You may need to tweak the regex to support some border cases.
I want to be able to differentiate between a string that is alphnumeric and a string that is in hex format.
My current query is:
<columnName> LIKE '?_____=' + REPLICATE('[0-9A-Fa-f]',16)
I found this method of searching for hex ID's online and I thought it was working. However after getting a significantly larger sample size I can see a high false positive rate in my results. The problem is that this gives me all the results I do want but it also gives me a bunch of results I dont care about. For example:
I want to see:
<url>.php?mains=d7ad916d1c0396ff
but i dont want to see:
<url>.php?mblID=2007012422060265
The difference between the 2 strings is that the 16 characters at the end that i want to collect are all numeric and not a hex ID. What are some ways you guys use to limit the results to hex ID only? Thanks in advnace.
UPDATE:
Juergen brought up a good point, the second number could be a hex value to. Not all hex numbers contain [a-F]. I would like to rephrase the question to state that I am looking for an ID with both letters and numbers in it, not just numbers.
The simplest way is just to add a separate clause for that restriction:
<columnName> LIKE '?_____=' + REPLICATE('[0-9A-Fa-f]',16)
AND <columnName> NOT LIKE '?_____=' + REPLICATE('[0-9]',16)
It should be fairly simple to determine if a string contains only numbers...
Setting up a test table:
CREATE TABLE #Temp (Data char(32) not null)
INSERT #Temp
values ('<url>.php?mains=d7ad916d1c0396ff')
,('<url>.php?mblID=2007012422060265 ')
Write a query:
SELECT
right(Data, 16) StringToCheck
,isnumeric(right(Data, 16)) IsNumeric
from #Temp
Get results:
StringToCheck IsNumeric
d7ad916d1c0396ff 0
2007012422060265 1
So, if the IsNumeric function returns 0, it could be a hex string.
This makes several assumptions:
The rightmost 16 characters are what you want to check
You only ever hit 16 characters. I don't know when the string would get too long to check.
A non-numeric character means hex. Any chance of "Q" or "~" being embedded in the string?