Split column preserving values - openrefine

I have a column of words followed by numbers, like this:
I want to split it into two columns, putting the text to the left of the digits in the first column, and the digits and any text that follow into the second column.
I suspect I'll have to add a column based on this column, containing the digits and everything after. Then I'll have to delete the digits and everything after from the previous column.
I'm not great at GREL, and the examples I've found don't work. Help?

There are several ways. If you don't like GREL but you know some regular expressions, you can use "Edit column" -> "Split into several columns" and use this regex as the separator:
\s(?=\d)
It means "any space that is before a number".
(Don't forget to check the box "regular expression".)
If any of your values contain multiple numbers (eg, "text 123 newtext 345 sometext"), specify "split into 2 columns at most".
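For readers who want to check the regex outside OpenRefine, here is a minimal Python sketch of the same split. The function name `split_at_first_number` is made up for illustration, and `maxsplit=1` stands in for the "split into 2 columns at most" option; OpenRefine itself does not run this code.

```python
import re

def split_at_first_number(value):
    # Split on a whitespace character that is immediately followed by a digit.
    # maxsplit=1 keeps everything after the first split in the second piece.
    return re.split(r"\s(?=\d)", value, maxsplit=1)

print(split_at_first_number("text 123 newtext 345 sometext"))
# ['text', '123 newtext 345 sometext']
```

Without the limit, the same value would be cut again before "345".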

Related

Add a value into the 4th character of every row

I'm not good with English but hope I can make my question clear.
I have a table with tons of rows, all written in the format 00000 (ex.: 000001, 00002 etc). I need to change all of them by adding a letter at a fixed point in the number (00A000, 00A001 etc), always the same letter, always in the same position. So either I manage to change the format of the rows to a custom format number-number-letter-number-number-number (which I'm able to do in Excel, but I can't find a way in Access), or I create an update query to add the letter A in that spot. Can anyone help?
I tried to change the format as I said, but can't find a way to add a custom format in Access. I've tried to use an update query to add the letter after the 3rd character from the RIGHT, but I've written the query wrong.
You show one value with 6 digits and other with 5 digits. If that is correct, consider (x represents your field or string): Left(x,2) & "A" & Mid(x,3)
If that is a posting error and all values are same length of 5 digits, consider: Format(x, "00A000").
You could run an UPDATE action to change the data in the field, but it is not necessary, as this calculation can be done when needed in a query or textbox.
You could also change the field to a number type instead of storing repetitious characters. The Format property could then use: 00\A000.
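As an illustration of the Left/Mid concatenation only (Python rather than Access, with a hypothetical helper name `insert_letter`), the same string surgery looks like this:

```python
def insert_letter(value, letter="A", position=2):
    # Mirrors Left(x, 2) & "A" & Mid(x, 3): keep the first `position`
    # characters, add the letter, then append the rest of the string.
    return value[:position] + letter + value[position:]

print(insert_letter("00001"))   # 00A001
print(insert_letter("000001"))  # 00A0001
```

The second call shows why the length of the stored values matters: a 6-digit value ends up with four digits after the letter.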

Extract text using GREL in OpenRefine

I'm trying to add a column based on a column in OpenRefine using GREL.
I need to extract every text after the second space in scientific name.
Here are two examples of the original cell data ---> what I want to extract:
Amandinea punctata (Hoffm.) Coppins & Scheid. ---> (Hoffm.) Coppins & Scheid.
Agonimia tristicula (Nyl.) Zahlbr. ---> (Nyl.) Zahlbr.
Here are three ways to achieve the desired result on the given data, ordered from easy to understand to more advanced.
Use column splitting
You can split the column into three columns by choosing a whitespace as separator and limit the number of new columns to 3 in the corresponding dialog. Then you can delete the first two columns and have your desired result.
Use Array functions
You can use the same technique via GREL and arrays... split on whitespace, discard the first two entries and join the rest on whitespace.
value.split(" ").slice(2).join(" ")
Use regular expressions
You can also use the match function with a regular expression.
value.match(/\S+\s\S+\s(.+)/)[0]
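To sanity-check the array and regex approaches outside OpenRefine, here is a rough Python equivalent of both. It is a sketch, not GREL; the function names are invented for the example.

```python
import re

def after_second_space_split(name):
    # Same idea as value.split(" ").slice(2).join(" ")
    return " ".join(name.split(" ")[2:])

def after_second_space_regex(name):
    # Same pattern as the GREL match; group(1) is the captured remainder.
    m = re.match(r"\S+\s\S+\s(.+)", name)
    return m.group(1) if m else None

name = "Amandinea punctata (Hoffm.) Coppins & Scheid."
print(after_second_space_split(name))  # (Hoffm.) Coppins & Scheid.
print(after_second_space_regex(name))  # (Hoffm.) Coppins & Scheid.
```

The two differ on names with only two words: the split version returns an empty string, while the regex version finds no match.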
Another solution:
Partition on what appears to be a good separator, " (", take the right-hand part, and add the missing "(" back at the beginning.
"("+value.partition(" (")[2]

How do I remove a character from strings of different lengths with sql? Intersystems cache sql

I have a column of strings that have an '&' at the beginning and end of each one, which I need to remove for a Crystal report I'm creating. I'm writing the SQL code outside of Crystal, using Intersystems Cache SQL. Below is an example:
&This&    This
&is&      is
&What&    what
&it&      I
&looks&   need
&like&    it
&now&     to
          look
          like
Any suggestions would be greatly appreciated!!!
Assuming the ampersands are always the leading and trailing characters, here is at least a start. Use a combination of SUBSTR (or SUBSTRING, if using stream data) and LENGTH, like so:
SELECT SUBSTR(column, 2, LENGTH(column) - 2) FROM table
This should return a substring that starts counting at the 2nd character [of the original string, given by the first sub-expression/argument to SUBSTR], counting up for the total number of characters [of the original string] less 2 (i.e. less the two ampersands).
If you need to include trailing blanks and/or the string termination character, you may need to use a different variation of the LENGTH function. See these resources for details on the functions and their variants:
https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_substr
https://cedocs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_length
Here's a Crystal formula that does the same:
ExtractString({YourData},"&","&")
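The SUBSTR/LENGTH arithmetic can be checked with a small Python sketch. This is only an illustration of the index math, under the same assumption that every value starts and ends with exactly one '&'; `strip_outer` is a made-up name.

```python
def strip_outer(value):
    # Equivalent of SUBSTR(value, 2, LENGTH(value) - 2):
    # skip the first character and stop before the last one.
    return value[1:-1]

for raw in ["&This&", "&is&", "&What&"]:
    print(raw, "->", strip_outer(raw))
```

Note that this removes the first and last character whatever they are, so it is only safe when the assumption about the ampersands holds.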

Difference between _%_% and __% in sql server

I am learning basics of SQL through W3School and during understanding basics of wildcards I went through the following query:
--Finds any values that start with "a" and are at least 3 characters in length
WHERE CustomerName LIKE 'a_%_%'
As per the example, this query searches the table for rows where the CustomerName column starts with 'a' and is at least 3 characters long.
However, I also tried the following query:
WHERE CustomerName LIKE 'a__%'
The above query also gives me the exact same result.
I want to know whether there is any difference in both queries? Does the second query produce a different output in some specific scenario? If yes what will be that scenario?
Both start with A, and end with %. In the middle part, the first says "one char, then between zero and many chars, then one char", while the second one says "one char, then one char".
Considering that the part that comes after them (the final part) is %, which means "between zero and many chars", I can only see both clauses as identical, as they both essentially just want a string starting with A then at least two following characters. Perhaps if there were at least some limitations on what characters were allowed by the _, then maybe they could have been different.
If I had to choose, I'd go with the second one for being more intuitive. After all, many other masks (e.g. a%%%%%%_%%_%%%%%) will yield the same effect, but why the weird complexity?
For the LIKE operator, a single underscore "_" means any single character, so if you use one underscore, like
ColumnName LIKE 'a_%'
you are basically saying you need a string where the first letter is 'a', followed by another single character, and then followed by anything or nothing.
ColumnName LIKE 'a__%' OR ColumnName LIKE 'a_%_%'
Both expressions mean the first letter is 'a', followed by two characters, and then followed by anything or nothing. In plain English: any string of 3 or more characters starting with 'a'.
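One way to convince yourself that the two patterns are equivalent is to translate them into regular expressions and compare results. The sketch below does that in Python; it assumes only the `%` and `_` wildcards (no bracket classes or ESCAPE clause) and is case-sensitive, whereas SQL Server's behaviour depends on the collation. `like_to_regex` and `like` are hypothetical helpers.

```python
import re

def like_to_regex(pattern):
    # % -> any run of characters (possibly empty), _ -> exactly one character.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    # LIKE must match the whole value, hence fullmatch.
    return like_to_regex(pattern).fullmatch(value) is not None

for s in ["a", "ab", "abc", "abcd", "bcd"]:
    print(s, like(s, "a_%_%"), like(s, "a__%"))
```

Both columns agree for every sample: only "abc" and "abcd" match.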

SSIS Transform -- Split one column into multiple columns

I'm trying to find out how to split a column I have in a table and split it into three columns after the result is exported to a CSV file.
For example, I have a field called fullpatientname. It is listed in the following text format:
Smith, John C
The expectation is to have it in three separate columns:
Smith
John
C
I'm quite sure I have to split this in a derived column, but I'm not sure how to proceed with that
You are going to need to use a derived column for this process.
The SUBSTRING and FINDSTRING functions will be key to pull this off.
To get the first segment you would use something like this:
(DT_STR,25,1252) SUBSTRING([fullpatientname], 1, FINDSTRING([fullpatientname], ",", 1) - 1)
The above should display a substring starting with the beginning of the [fullpatientname] to the position prior to the comma (,).
The next segment would be from the position after the comma to the final space separator, and the final would be everything from the position following the final space separator to the end.
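The three segments described above can be sketched in Python to make the position arithmetic concrete. This is not SSIS expression syntax, and it assumes every value follows the "Last, First M" layout; `split_name_by_position` is an illustrative name.

```python
def split_name_by_position(full_name):
    comma = full_name.find(",")        # position of the comma (0-based)
    last_space = full_name.rfind(" ")  # position of the final space

    last = full_name[:comma]                          # up to the comma
    first = full_name[comma + 1:last_space].strip()   # between comma and final space
    middle = full_name[last_space + 1:]               # after the final space
    return last, first, middle

print(split_name_by_position("Smith, John C"))  # ('Smith', 'John', 'C')
```

Python positions are 0-based while FINDSTRING returns 1-based positions, so the offsets differ by one from the SSIS expression.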
It sounds like your business rule is
The "last name" is all of the characters up to the first comma
The "first name" will be all of the characters after the first comma and a space
The "middle name" will be what (and is it always present)? Either the last character in the string (you will only ever have an initial letter), or all of the characters after the second space.
This logic will fail in lots of fun ways, so be prepared for it. And also remember that once you combine information together, you cannot, with 100% accuracy, restore it to the component parts. Capture first, middle, last/surname and store them separately.
Approach A
A derived column component. Actually, a few of them added to your data flow will cover this. The first Derived Column will be tasked with finding the positions of the name breaks. This could all be done in a single component, but debugging becomes a challenge, and because you would need to reference the same expression multiple times in a single row (times three), it quickly becomes a maintenance nightmare.
The second Derived Column will then use the positions defined in the first to call the LEFT and SUBSTRING functions to access points in the column
Approach B
I never reach for a script component first and the same should hold true for you. However, this is a mighty fine case for a script. The base .NET string library has a Split function that will break a string into pieces based on whatever delimiter you supply. The default is whitespace. The first call to Split will use the ',' as the argument. The zeroth ordinal string will be the last name. The first ordinal string will contain the first and middle name pieces. Call the string.Split method again, this time using the default value, and the last element is the middle name and the remaining elements are called the first name. Or vice versa, the zeroth element is the first name and everything else is last.
I've had to deal with cleaning names before and so I've seen different rules based on how they want to standardize the name.
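The two-step split from Approach B can be sketched as follows. This uses Python instead of a .NET script component, purely to show the logic; `split_name` is a hypothetical helper, and it takes the "last element is the middle name" rule, with an empty middle name when only one piece follows the comma.

```python
def split_name(full_name):
    # First split: everything before the first comma is the last name.
    last, _, rest = full_name.partition(",")

    # Second split: default whitespace splitting of the remainder.
    pieces = rest.split()
    if len(pieces) > 1:
        first, middle = " ".join(pieces[:-1]), pieces[-1]
    else:
        first, middle = " ".join(pieces), ""
    return last.strip(), first, middle

print(split_name("Smith, John C"))      # ('Smith', 'John', 'C')
print(split_name("Smith, John"))        # ('Smith', 'John', '')
```

A two-word first name such as "Mary Ann" with no middle initial would still be misread under this rule, which is one of the ways the logic can fail.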
Try something like this, if your names are always in the same format (LastName-comma-space-FirstName-space-MI):
declare @FullName varchar(25) = 'Smith, John C'
select
substring(@FullName, 1, charindex(',', @FullName)-1 ) as LastName,
substring(@FullName, charindex(',',@FullName) + 2, charindex(' ',@FullName,charindex(',',@FullName)+2) - (charindex(',',@FullName) + 2) ) as FirstName,
substring(@FullName, len(@FullName), 1) as MiddleInitial
I am using SQL SERVER 2016 with SSIS in Visual Studio 2015. If you are using findstring you need to make sure the order is correct. I tried this first -
FINDSTRING(",",[fullpatientname],1), but it wouldn't work. I had to look up the documentation and found the order to be incorrect. FINDSTRING([fullpatientname],",",1) fixed the problem for me. I am not sure if this is due to differences in versions.