Regex for extracting certain information from a string - pandas

Below is the string that I have -
vdp_plus_forecast_aucc_VDP_20221024_variance_analysis_20221107_backcasting_actuals_asp_True_vlt_True.csv
I need RegEx to take out following items from the string -
20221107
vlt_True
I need help writing the right regex for these two extractions. I'm performing the operation on a PySpark DataFrame.

I'm assuming the number you want is identified by the label in front of it, so this pattern captures the value that follows variance_analysis:
(?<=_variance_analysis_)[0-9]+|vlt_(True|False)
This should capture both items you wanted. If you only need the value of vlt, you can replace vlt_ with (?<=_vlt_), which matches just the value without the label in front of it.
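To check the pattern outside Spark, here is a minimal sketch using Python's standard re module on the string from the question. The helper name extract_fields is made up for illustration, and the pattern is split into two so that each item can be read off separately. In PySpark the same patterns could probably be passed to regexp_extract, but that is not tested here.

```python
import re

filename = (
    "vdp_plus_forecast_aucc_VDP_20221024_variance_analysis_20221107"
    "_backcasting_actuals_asp_True_vlt_True.csv"
)

# Digits that directly follow the "_variance_analysis_" label.
date_pattern = re.compile(r"(?<=_variance_analysis_)[0-9]+")
# The vlt flag together with its label.
vlt_pattern = re.compile(r"vlt_(?:True|False)")


def extract_fields(name):
    """Return (date, vlt flag) from name; either item is None if missing."""
    date_match = date_pattern.search(name)
    vlt_match = vlt_pattern.search(name)
    return (
        date_match.group(0) if date_match else None,
        vlt_match.group(0) if vlt_match else None,
    )


print(extract_fields(filename))  # ('20221107', 'vlt_True')

# Value only, using the lookbehind variant mentioned above.
print(re.search(r"(?<=_vlt_)(?:True|False)", filename).group(0))  # True
```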

Related

Need to extract specific text from a column on excel using either Alteryx or Pandas

I have a column that contains a specific set of text that I need to be retained and the rest removed or moved to another column. Unfortunately, I am not able to use normal text-to-column due to the variation of the text arrangement.
For example, I need the word Issue and the id associated with it to be separated. I am struggling to figure out a way to do this with the variation of the arrangement of the text I need.
If someone can help me find a solution using Alteryx, it would be much appreciated; if not, Pandas would also work.
Thanks all.
Use str.extract with Pattern to extract specific text from the data frame [Pandas]
df['After'] = df['Before'].str.extract(pat=r'(ISSUE \d+|issue \d+)', expand=False)
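As a rough illustration of what that pattern picks out, here is a plain-Python sketch with the standard re module. The sample strings are invented for the example, extract_issue is a hypothetical helper, and re.IGNORECASE is an assumption that stands in for spelling out both the upper-case and lower-case alternatives.

```python
import re

# Hypothetical sample values standing in for the 'Before' column.
before = [
    "Fix login, ISSUE 1234, reported by QA",
    "issue 77 - printer offline",
    "no ticket mentioned here",
]

# One case-insensitive pattern instead of listing ISSUE and issue separately.
issue_pattern = re.compile(r"(issue \d+)", re.IGNORECASE)


def extract_issue(text):
    """Return the first 'issue <digits>' found in text, or None."""
    match = issue_pattern.search(text)
    return match.group(1) if match else None


after = [extract_issue(text) for text in before]
print(after)  # ['ISSUE 1234', 'issue 77', None]
```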
For an Alteryx-only solution, the easiest way would be an Alteryx Formula using REGEX_Replace:
REGEX_Replace([Before],".*(issue \d+).*","$1",1)
If you don't like RegEx, basic string manipulations can do it also: basically it's a Substring...
Substring([Before], *starting index*, *length*)
The starting index is easy: it's just FindString([Before],"ISSUE")
The length isn't too hard either: it's the index (using FindString again) of the first comma in the substring that starts with "ISSUE": SubString([Before],FindString([Before],"ISSUE"))
Combining all that and spreading it out a bit:
Substring(
    [Before],
    FindString([Before],"ISSUE"),
    FindString(
        SubString(
            [Before],
            FindString([Before],"ISSUE")
        ),
        ","
    )
)
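The same find-then-slice idea can be sketched in Python to make the index arithmetic easier to follow. The function name issue_before_comma and the sample strings are hypothetical, and returning the rest of the string when no comma follows is an assumption of this sketch, not something the Alteryx formula above does.

```python
def issue_before_comma(text, keyword="ISSUE"):
    """Return the text from keyword up to (not including) the next comma."""
    start = text.find(keyword)       # plays the role of FindString([Before], "ISSUE")
    if start == -1:
        return None                  # keyword not present
    tail = text[start:]              # substring that starts with the keyword
    length = tail.find(",")          # index of the first comma inside that substring
    if length == -1:
        return tail                  # assumption: no comma, so keep the rest
    return tail[:length]


print(issue_before_comma("Fix login, ISSUE 1234, reported by QA"))  # ISSUE 1234
```

Because the comma is searched for inside the substring that already starts at the keyword, its index is exactly the length to keep.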

Convert String to array and validate size on Vertica

I need to execute a SQL query that converts a string column to an array and then validates the size of that array.
I was able to do it easily with PostgreSQL, e.g.:
select
cardinality(string_to_array('a$b','$')),
cardinality(string_to_array('a$b$','$')),
cardinality(string_to_array('a$b$$$$$','$'));
But for some reason, converting a string to an array on Vertica is not that simple. I saw these links:
https://www.vertica.com/blog/vertica-quick-tip-dynamically-split-string/
https://forum.vertica.com/discussion/239031/how-to-create-an-array-in-vertica
And many more, but none of them helped.
I also tried using:
select REGEXP_COUNT('a$b$$$$$','$')
But I get an incorrect value: 1.
How can I convert a string to an array on Vertica and get its length?
$ has a special meaning in a regular expression. It represents the end of the string.
Try escaping it:
select REGEXP_COUNT('a$b$$$$$', '[$]')
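The effect is easy to reproduce with Python's re module, which treats an unescaped $ the same way. This is only an analogy, since Vertica's REGEXP_COUNT is a different engine, but it shows why the count was 1 and what escaping changes.

```python
import re

text = "a$b$$$$$"

# Unescaped: $ is an anchor, so the only match is the empty match at the end.
unescaped = len(re.findall(r"$", text))

# Inside a character class or after a backslash, $ is a literal dollar sign.
bracketed = len(re.findall(r"[$]", text))
escaped = len(re.findall(r"\$", text))

print(unescaped, bracketed, escaped)  # 1 6 6
```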
You could create a UDx scalar function (UDSF) in Java, C++, R or Python. The input would be a string and the output would be an integer. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/UDx/ScalarFunctions/ScalarFunctions.htm
This will allow you to use language-specific array logic on the strings passed in. For example, in Python you could include this logic:
input_list = input.split("$")
filtered_input_list = list(filter(None, input_list))
list_count = len(filtered_input_list)
These examples are a good starting point for writing UDx's for Vertica. https://github.com/vertica/UDx-Examples
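The splitting logic above can be tried on its own, outside any UDx wrapper. This is a minimal sketch: count_tokens and its keep_empty flag are hypothetical names, added because the filter in the snippet drops empty tokens, which may or may not be the count you want for inputs with trailing delimiters.

```python
def count_tokens(value, delimiter="$", keep_empty=False):
    """Count the pieces of value separated by delimiter."""
    parts = value.split(delimiter)
    if not keep_empty:
        # Same effect as list(filter(None, parts)): drop empty strings.
        parts = [part for part in parts if part]
    return len(parts)


for sample in ("a$b", "a$b$", "a$b$$$$$"):
    print(sample, count_tokens(sample), count_tokens(sample, keep_empty=True))
# a$b 2 2
# a$b$ 2 3
# a$b$$$$$ 2 7
```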
I wasn't able to convert to an array, but I am able to get the number of values.
What I do is convert the values to rows and use count. It's not the best performance-wise,
but this way I can also manipulate each value between delimiters (filtering, for example), and I don't need to use [] for characters like $.
select (select count(1)
from (select StringTokenizerDelim('a$b$c','$') over ()) t)
This returns 3.

How to append to a list in Automation Anywhere 10.5?

The list starts empty. Then I want to append a value to it on each iteration of a loop if a certain condition is met. I don't see an append option in Variable Operation.
You can use string split for this, assuming you know of a delimiter that won't ever be in your list of values. I've used a semi-colon, and $local_joinedList$ starts off empty.
If (certain condition is met)
Variable Operation: $local_joinedList$;$local_newValue$ To $local_joinedList$
End If
String Operation: Split "$local_joinedList$" with delimiter ";" and assign output to $my-list-variable$
This overwrites $my-list-variable$.
If you need to append to an existing list, you can do it the same way by using String Join first, append your values to the string, then split it again afterward.
String Operation: Join elements of "$my-list-variable$" by delimiter ";" and assign output to $local_joinedList$
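For readers who find the idea easier to follow in code, here is a Python sketch of the same join, append, split round trip. It is not Automation Anywhere syntax: append_value and to_list are hypothetical helpers, the sample values and the length check stand in for whatever condition applies, and skipping the delimiter for the first value is a choice made in this sketch to avoid a leading empty element.

```python
DELIMITER = ";"  # must never occur inside a value


def append_value(joined, new_value):
    """Append new_value to the delimiter-joined string."""
    return new_value if joined == "" else joined + DELIMITER + new_value


def to_list(joined):
    """Split the joined string back into a list (empty string means empty list)."""
    return joined.split(DELIMITER) if joined else []


# Build the list inside a loop, appending only when the condition holds.
joined = ""
for value in ["alpha", "beta", "gamma", "delta"]:
    if len(value) == 5:  # stand-in for "certain condition is met"
        joined = append_value(joined, value)
my_list = to_list(joined)
print(my_list)  # ['alpha', 'gamma', 'delta']

# Appending to an existing list: join first, append, then split again.
joined = DELIMITER.join(my_list)
joined = append_value(joined, "omega")
my_list = to_list(joined)
print(my_list)  # ['alpha', 'gamma', 'delta', 'omega']
```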
Lists are buggy in Automation Anywhere and have been for several versions. I suggest using XML instead.
It is a much more versatile approach and lets you do much more than lists do: you can search, filter, insert, delete, etc.
For the example you mention, you would use the "Insert Node" command.
Throwing in my 2 cents as well: my-list-variable appears to be the only list whose size is mutable. From my experience with 10.7, though, it only grows.
So if you made a list with 60 values and then wanted to reuse my-list-variable for 55, you would need to clear out the remaining 5 values and add an If condition when looping over the list to skip whatever placeholder you set those 5 values to.
I used lime's answer as a reference (thanks lime!) to populate a list variable from some data in an Excel spreadsheet.
Here's my automation for it:

What is the meaning of having a variable="+" ? SAS (sql)

I'm new to SAS and I'm trying to understand some code:
if MAP_ID="+" then output WORK.0201_template;
else
  do;
    SHEET_ID=MAP_ID;
    output WORK.0201_template_f;
  end;
What does MAP_ID="+" mean? Does it search the table for the rows where MAP_ID is +, or does it have another meaning?
Thanks
MAP_ID="+" is a boolean expression that compares the value of the variable MAP_ID to the character string literal "+". It is true when they are the same and false otherwise.
I suspect that the main purpose of this code is to split the data into two different output datasets based on the value of MAP_ID.
It also changes the value of SHEET_ID. That type of code looks like something designed to carry forward the value of MAP_ID in a retained field SHEET_ID. If I am right, then a value of + means "keep the same sheet_id". But we would need to see more of the code and the data to really tell.
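The routing that is visible in the snippet can be paraphrased in Python as a rough sketch. The sample rows are invented, split_rows is a hypothetical name, and any RETAIN behaviour is not modelled because that part of the SAS program is not shown.

```python
def split_rows(rows):
    """Route rows into two lists, mirroring the if/else in the data step."""
    template, template_f = [], []
    for row in rows:
        row = dict(row)                        # work on a copy of the row
        if row["MAP_ID"] == "+":               # plain string comparison
            template.append(row)
        else:
            row["SHEET_ID"] = row["MAP_ID"]    # SHEET_ID takes the new MAP_ID
            template_f.append(row)
    return template, template_f


# Hypothetical input rows.
rows = [
    {"MAP_ID": "S1", "SHEET_ID": ""},
    {"MAP_ID": "+", "SHEET_ID": "S1"},
    {"MAP_ID": "S2", "SHEET_ID": "S1"},
]
template, template_f = split_rows(rows)
print(len(template), len(template_f))  # 1 2
```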

Filtering rows in Pentaho

I have a dataset with columns containing numbers. However, some of the rows in that column have missing data. Instead of numbers, a dash (-) is placed in the cell.
What I want to happen is to separate those rows with a dash and output them to a separate excel file. Those without the dash, should output to a csv file.
I tried the "filter rows" but it gives me an error:
Unexpected conversion error while converting value [constant String] to a Number
constant String : couldn't convert String to number
constant String : couldn't convert String to number : non-numeric character found at position 1 for value [-]
My condition is if
Column1 CONTAINS - (String)
You can try to convert the field to a number in the select step and handle the error: if the value cannot be converted to a number, that means it is a dash (-).
You can convert missing value indicators (like a dash or any other string) to null in Text-File-Input - see field option "Null if". That way you still can use the metadata detection feature and will not trip over a dash arriving in a Number field.
With CSV-File-Input you should stick to the String datatype until a Null-If step has cleansed the values, so you can change the datatype to Number in a Select-Values step.
If you must preserve the dash character, don't use metadata detection (as it suggests datatype Number) or use more rows to sample (so a field with a dash is encountered) or just revert the datatype to String again before saving and running the transformation.
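The null-if idea can be sketched in Python to show why it sidesteps the conversion error: the dash is turned into a missing value before any numeric conversion happens, and the rows are then split on whether the value is missing. The names null_if and split_rows and the sample rows are hypothetical, and this is not Pentaho code.

```python
def null_if(value, marker="-"):
    """Return None for the missing-value marker, otherwise the value as a float."""
    text = value.strip()
    return None if text == marker else float(text)


def split_rows(rows, column):
    """Separate rows whose column holds a number from rows that hold the marker."""
    numeric_rows, dash_rows = [], []
    for row in rows:
        if null_if(row[column]) is None:
            dash_rows.append(row)      # would go to the Excel output
        else:
            numeric_rows.append(row)   # would go to the CSV output
    return numeric_rows, dash_rows


# Hypothetical rows as read from a file, everything still a string.
rows = [
    {"id": "1", "Column1": "12.5"},
    {"id": "2", "Column1": "-"},
    {"id": "3", "Column1": "-7"},
]
numeric_rows, dash_rows = split_rows(rows, "Column1")
print([r["id"] for r in numeric_rows], [r["id"] for r in dash_rows])  # ['1', '3'] ['2']
```

Comparing against the whole value, rather than checking whether it contains a dash, keeps negative numbers such as -7 on the numeric side.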
My solution starts with a 'Replace in String' step. I replaced the dash with a numeric value that can easily be distinguished from the rest of the numbers (I used 9999) and carried on with the rest of my process.
In Filter Rows, I no longer had problems with the data type, because both my field and my condition contained numbers, so nothing had to be converted.
After Filter Rows, I added a 'Null-if' step to remove the placeholder 9999 that I had used just to have something to replace the dash.
After that, the separation worked just as I had hoped.
Thanks to #marabu for the Null-if idea.