Django: use StrIndex to get the last instance of a character for an annotation - SQL

I have a model with a field whose value is a file path, i.e. /src/code/foo. I need the file name, so I want my annotation value to be the substring foo. The logical way I was aiming to go about it was to get the index of the last / character. I was trying to do this via StrIndex & Substr, but StrIndex only gives me the first index. Is there a way I can get the last index instead?
from django.db.models import TextField, Value as V
from django.db.models.functions import Cast, StrIndex, Substr

MyModel.objects.all().annotate(
    # this would give me `src/code/foo` but I want just `foo`
    name=Cast(Substr("file_path", StrIndex("file_path", V("/"))), TextField())
)

You have to reverse the string in order to get the last occurrence, but then you can use that with Right to get the tail. Subtract 1 to drop the / itself from the string.
from django.db.models import CharField, F, Value
from django.db.models.functions import Reverse, Right, StrIndex

MyModel.objects.annotate(
    last_occur=StrIndex(Reverse('file_path'), Value('/')),
    name=Right('file_path', F('last_occur') - 1, output_field=CharField())
)
You could do it all as a single annotation, with StrIndex within Right, but I think it's easier to follow with the steps separated out like this.
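For reference, a rough sketch of what that single-annotation version could look like (same logic, just nested; untested):

from django.db.models import CharField, Value
from django.db.models.functions import Reverse, Right, StrIndex

MyModel.objects.annotate(
    name=Right(
        'file_path',
        # position of the last '/' counted from the end, minus the '/' itself
        StrIndex(Reverse('file_path'), Value('/')) - 1,
        output_field=CharField(),
    )
)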

Regex match first number if it does not appear at the end

I am currently facing a regex problem to which I apparently cannot find an answer.
My regex is embedded in a Teradata SQL statement of the form:
REGEXP_SUBSTR(column, 'regex_pattern')
I want to find the first appearance of any number except if it appears at the end of the string.
For Example:
"YEL2X30" -> "2"
"YEL19XYZ05" -> "19"
"YELLOW05" -> ""
I tried it with '[0-9]+(?!$)/' but this always returns a blank string.
Thanks in Advance!
Shot in the dark here since I'm unfamiliar with Teradata and its supported SQL functionality. However, reading the docs on the REGEXP_SUBSTR() function, it seems like you may want to use the third and fourth arguments along with a slightly different regular expression:
[0-9]+(?![0-9]|$)
Meaning: 1+ Digits that are not followed by either the end of the string or another digit.
I believe the following syntax should work to retrieve the first appearance of any number from the matching results:
REGEXP_SUBSTR(column, '[0-9]+(?![0-9]|$)', 1, 1)
The third parameter states from which position in the source string we need to start searching, whereas the fourth will return the first match from any possible multiple matches (that is how I read the docs). For example: abc123def456ghi789 would return 123.
Fiddling around in online IDEs gave me this:
CREATE TABLE TBL (TST varchar(100));
INSERT INTO TBL values ('YEL2X30'), ('YEL19XYZ05'), ('YELLOW05'), ('abc123def456ghi789');
SELECT REGEXP_SUBSTR(TST, '[0-9]+(?![0-9]|$)', 1, 1) as 'RESULTS' FROM TBL;
Resulted in:
RESULTS
2
19
NULL
123
NOTE: I also noticed that leaving out the third and fourth parameters made no difference, since they default to 1 when not explicitly given. I tested this over here.
Possibly the simplest way is to look for digits followed by a non-digit. Then keep all the digits:
regexp_substr(regexp_substr(column, '[0-9]+[^0-9]'), '[0-9]+')
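If you want to sanity-check the lookahead pattern outside the database, here is a quick sketch with Python's re module (assuming Teradata's regex flavor treats the lookahead the same way):

import re

samples = ["YEL2X30", "YEL19XYZ05", "YELLOW05", "abc123def456ghi789"]
pattern = re.compile(r"[0-9]+(?![0-9]|$)")

for s in samples:
    match = pattern.search(s)  # first run of digits not sitting at the end of the string
    print(s, "->", match.group() if match else None)
# YEL2X30 -> 2
# YEL19XYZ05 -> 19
# YELLOW05 -> None
# abc123def456ghi789 -> 123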

Filtering rows in Pentaho

I have a dataset with columns containing numbers. However, some of the rows in that column have missing data. Instead of numbers, a dash (-) is placed in the cell.
What I want to happen is to separate the rows with a dash and output them to a separate Excel file. Those without the dash should be output to a CSV file.
I tried the "filter rows" but it gives me an error:
Unexpected conversion error while converting value [constant String] to a Number
constant String : couldn't convert String to number
constant String : couldn't convert String to number : non-numeric character found at position 1 for value [-]
My condition is if
Column1 CONTAINS - (String)
You can try to convert to a number in the Select step and handle the error: if the value cannot be converted to a number, that means it is the dash (-).
You can convert missing value indicators (like a dash or any other string) to null in Text-File-Input - see field option "Null if". That way you still can use the metadata detection feature and will not trip over a dash arriving in a Number field.
With CSV-File-Input you should stick to the String datatype until a Null-If step has cleansed the values, so you can change the datatype to Number in a Select-Values step.
If you must preserve the dash character, don't use metadata detection (as it suggests datatype Number) or use more rows to sample (so a field with a dash is encountered) or just revert the datatype to String again before saving and running the transformation.
My solution starts with a 'Replace in String' step. I replaced the dash with something numeric that can easily be distinguished from the rest of the numbers (I used 9999) and carried on with the rest of my process.
In Filter Rows I no longer had problems with the data type, because both my values and the condition contained numbers, so nothing had to be converted anymore.
After Filter Rows, I added a 'Null if' step to remove the 9999 that I used just to have something to replace the dash with.
After that, the separation was made just as I hoped it would be.
Thanks to @marabu for the Null-if idea.
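Outside Pentaho, the same split can be sketched in a few lines of Python just to make the logic concrete (the file names and the assumption that Column1 is the first field are made up for the example; both outputs are written as CSV here for simplicity):

import csv

with open("input.csv", newline="") as src, \
     open("rows_with_dash.csv", "w", newline="") as dashed, \
     open("rows_numeric.csv", "w", newline="") as numeric:
    dash_writer = csv.writer(dashed)
    num_writer = csv.writer(numeric)
    for row in csv.reader(src):
        # rows whose Column1 is just a dash go to one file, the rest to the other
        target = dash_writer if row[0].strip() == "-" else num_writer
        target.writerow(row)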

pyspark.sql DataFrame: understanding functions

I am taking a MOOC.
It has an assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and I couldn't figure out the problem with my syntax. Would it be possible to explain how I could have figured out that my syntax was wrong by searching the online documentation and function definitions?
I am especially confused as this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation you are NOT working with a single string, you are working with a Column. And a Column is a different object than a string, which is why you should use a method that works with a Column.
Specifically, you are working with this pyspark.sql method. The next time you are in doubt about which method you need, double-check the data type of your objects. Also, if you check the list of imports, you will see it is calling the lower method from pyspark.sql.functions.
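A minimal sketch of the difference, assuming a running Spark session (the DataFrame below is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Hi, You!",)], ["sentence"])

# lower() from pyspark.sql.functions builds a Column expression:
df.select(lower(col("sentence"))).show()

# A plain Python string has a .lower() method, but a Column does not, so something
# like df.select(col("sentence").lower()) fails at runtime instead of lowercasing.
"Hi, You!".lower()  # 'hi, you!'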
This is how I managed to do it, inside the body of removePunctuation:
    lowered = lower(column)
    np_lowered = regexp_replace(lowered, r'[^\w\s]', '')
    trimmed_np_lowered = trim(np_lowered)
    return trimmed_np_lowered
Another option, as a single expression:
    return trim(lower(regexp_replace(column, r"\p{Punct}", ""))).alias('sentence')

How can I get the position of a row dynamically? QlikView

I want to delete the rest of a loaded CSV file based on the occurrence of a string.
Remove(Row, RowCnd(Interval, Pos(Top, findMeThePositionOfaGivenString('TeddyBear')),
Pos(Bottom, 1), Select(1, 0))
Or just any approach to dynamically delete a range of rows!
If you're doing this during the data import stage then I would recommend
load yourstuff
from yourfile
where index(givenstring,'Teddybear')=0;
Index will return the position of the string in the larger string.
e.g. index('ABC','BC')=2, so index()=0 means the string does not exist in the searched text. Be careful of the capitalisation, as it will honour that, so use upper() or lower() to remove that kind of confusion.
I hope I understood your request.

How to do a replace for the given example in SQL?

I want to always replace the last node in the string:
root/node1/node2
If I pass node3 as the parameter it should do a replace like this -
root/node1/node3
Can anyone help me do this? Say the column name is lineage and I have the id. So the query would be:
UPDATE tree
SET lineage = -- replace(lineage, node3) -- this is what I need?
WHERE id = 2
You could do some string manipulation to find the last occurrence of a /, and then strip everything after that point... and then append your new node parameter to that value
UPDATE tree
SET lineage = LEFT(lineage, LEN(lineage) - CHARINDEX('/', REVERSE(lineage)) + 1) + @NewNode
WHERE id = 2
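The same idea, sketched in Python to show what the string manipulation is doing (the values are hypothetical):

lineage = "root/node1/node2"
new_node = "node3"

# keep everything up to and including the last '/', then append the new node
updated = lineage[:lineage.rfind("/") + 1] + new_node
print(updated)  # root/node1/node3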
User Defined Function. You have the option of T-SQL or .NET languages. I, myself, would choose .NET, as it gives the ability to use Regex to more efficiently hone in on the last part of the string, but you can split and reassemble a string using T-SQL. This shows splitting, for example:
http://www.logiclabz.com/sql-server/split-function-in-sql-server-to-break-comma-separated-strings-into-table.aspx