Extract names substring using OpenRefine

Using OpenRefine, I need to achieve the following:
in a given column with people's names,
I must extract all content before the last dot (.) into a new column named firstname.
By extraction, I mean moving the data, so that the initial column no longer contains what has been moved to the firstname column.
On a second pass, I need to remove from the names column, and move to the firstname column, everything that follows the comma (,). How would you approach this?
Example
names column | lastname | firstname
M., E. Handman | Handman | M., E.
Surrey, Sam Bill | Surrey | Sam Bill
My attempt so far is something like:
value.find(/(\b\w{1}\W{1,})/).join('').trim()
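No answer to this question appears on this page, so what follows is only a rough sketch of the splitting logic, written in Python rather than GREL and checked against the two example rows. The function name split_name is hypothetical, and the rule it encodes (text up to and including the last dot moves to firstname, then anything after a remaining comma moves as well) is an assumption inferred from the example table. It would still need translating into an OpenRefine expression.

```python
def split_name(value):
    """Return (lastname, firstname) for one cell of the names column.

    Assumed rule, inferred from the two example rows only:
    pass 1: text up to and including the last dot moves to firstname;
    pass 2: if a comma remains, the text after it moves to firstname too.
    """
    firstname_parts = []
    rest = value.strip()

    # Pass 1: move everything up to (and including) the last dot.
    dot = rest.rfind(".")
    if dot != -1:
        firstname_parts.append(rest[: dot + 1].strip())
        rest = rest[dot + 1 :].strip()

    # Pass 2: move everything after the first remaining comma.
    if "," in rest:
        rest, after = rest.split(",", 1)
        firstname_parts.append(after.strip())
        rest = rest.strip()

    return rest, " ".join(firstname_parts)


for name in ["M., E. Handman", "Surrey, Sam Bill"]:
    print(name, "->", split_name(name))
```

Names with neither a dot nor a comma are returned unchanged as the lastname, which may or may not be what the real data needs.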

Related

SQL WHERE column values into capital letters

Let's say I have the following entries in my database:
Id | Name
12 | John Doe
13 | Mary anne
13 | little joe
14 | John doe
In my program I have a string variable that is always capitalized, for example:
myCapString = "JOHN DOE"
Is there a way to retrieve the rows in the table by using a WHERE on the name column with the values capitalized and then matching myCapString?
In this case the query would return two entries, one with id=12 and one with id=14.
Changing the actual values in the table is NOT an acceptable solution.

A general solution in Postgres would be to uppercase the Name column and then compare it against an all-caps string literal, e.g.
SELECT *
FROM yourTable
WHERE UPPER(Name) = 'JOHN DOE';
If you need to implement this in Knex, you will need to figure out how to uppercase a column. This might require using a raw query.
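To see the comparison run end to end, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained). The table and sample rows come from the question, and the stored values are never modified.

```python
import sqlite3

# In-memory SQLite database standing in for the real Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Id INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO yourTable (Id, Name) VALUES (?, ?)",
    [(12, "John Doe"), (13, "Mary anne"), (13, "little joe"), (14, "John doe")],
)

my_cap_string = "JOHN DOE"

# Uppercase the column only inside the comparison; the data itself is untouched.
rows = conn.execute(
    "SELECT Id, Name FROM yourTable WHERE UPPER(Name) = ? ORDER BY Id",
    (my_cap_string,),
).fetchall()
print(rows)
```

Passing the string as a bound parameter rather than concatenating it into the query keeps the example safe if myCapString ever comes from user input.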

OpenRefine split column with repetitive values

I have a single column in OpenRefine like this:
Title
A Star is born
Author
George Cukor
Date
1954
Other tags...
The data for each item begins with the name of the tag (Title, Author, Date etc.), followed by a value. Every tag and every value is on its own successive row, around ten thousand rows in total.
I would like to have as many columns as there are tags and as many rows as there are items, containing title, date, author etc., something like this:
Title | Author | Date | etc.
A Star is born | George Cukor | 1954 | etc.
Any ideas?
Thanks

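Starting from your original dataset (a single column alternating tag names and values), use "Transpose --> Transpose cells in rows into columns" (leaving option 2 as default). This should give you two columns, with the tag names in the first and the values in the second.
Then, on the first column, apply "Transpose --> Columnize by key/value columns" and don't change the default options there either. The final result should have one column per tag and one row per item.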
This will obviously work with more tags/columns, but only if each of them is followed by a single value.
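For readers who want to see the same reshaping outside OpenRefine, the two transpose steps can be sketched roughly in Python. This is not what OpenRefine runs internally. The function name columnize and the second sample item are made up for illustration, and treating a repeated tag as the start of a new item is an assumption.

```python
def columnize(cells):
    """Group a flat tag/value list into one dict per item."""
    records, current = [], {}
    # Step 1: pair every two cells as (tag, value), like the first transpose.
    for key, value in zip(cells[0::2], cells[1::2]):
        # Step 2: a tag that is already filled in starts a new item.
        if key in current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


cells = [
    "Title", "A Star is born", "Author", "George Cukor", "Date", "1954",
    # Second item is invented sample data, only to show the row break.
    "Title", "Vertigo", "Author", "Alfred Hitchcock", "Date", "1958",
]
for record in columnize(cells):
    print(record)
```

As with the OpenRefine steps, this only works when each tag is followed by exactly one value.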

CASE ReGex with substring

I'm writing a SQL query where I take a substring of two names (first name and last name) to create an initials column. The data is unstructured to a certain extent (I can't show it for GDPR reasons), but where there is a company name, it is stored only in the surname column.
I'm trying to use a regex to detect when the existing initials column is a single letter (i.e. not a real set of initials) and, if it is not an initial, run a command that I wrote, which works successfully.
CAST(CASE
WHEN [DATA_TABLE].[INITIALS] = '\d' THEN (CONCAT(substring([DATA_TABLE].[FIRSTNAMES],1,1),substring([DATA_TABLE].[SURNAME],1,1)) AS char) AS INITIALS
ELSE [DATA_TABLE].[INITIALS]
end as char) as INITIALS,
An example of the data format:
First name | Last name | Initials
John | smith | JS
 | Electrical company | E
Sam | Craig | SC
I want the rows where the name is only in the surname column (company names) to remain unchanged (i.e. the \d regex case). The others should get initials built from substring(first name, 1, 1) and substring(last name, 1, 1).

How do you query only part of the data in the row of a column - Microsoft SQL Server

I have a column called NAME with 2000 rows that are filled with people's full names, e.g. ANN SMITH. How do I write a query that lists all the people whose first name is ANN? There are about 20 different people whose first name is ANN but whose surnames differ.
I tried
and (NAME = 'ANN')
but it returned zero results.
I have to enter the full name, and (NAME = 'ANN SMITH'), to get any result at all.
I just want to list all the people with their first name as ANN.

Try in your where clause:
Where Name like 'ANN %'
Should work mate.
'ANN%' will find all results where ANN comes first, followed by anything (so it would also match ANNA or ANNE); the space in 'ANN %' restricts the match to the first name ANN.
'%ANN%' will find the 3 letters ANN in any part of that row's field.
Hope it helps.
Also, a name is usually split into separate first-name and surname columns.
This saves having to use wildcards in your SQL and gives you somewhat more normalized data.

SELECT NAME
FROM NAMES
WHERE NAME LIKE 'ANN %'
This should select any name that begins with 'ANN' followed by a space.
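The difference between the three patterns is easiest to see on a few rows. The sketch below uses Python's built-in sqlite3 module in place of SQL Server (an assumption made only so it runs anywhere), and the four sample names and the helper names_like are invented for illustration.

```python
import sqlite3

# In-memory SQLite table standing in for the real NAMES table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NAMES (NAME TEXT)")
conn.executemany(
    "INSERT INTO NAMES (NAME) VALUES (?)",
    [("ANN SMITH",), ("ANN JONES",), ("ANNA BELL",), ("JOANN REED",)],
)


def names_like(pattern):
    """Return all names matching a LIKE pattern, sorted alphabetically."""
    query = "SELECT NAME FROM NAMES WHERE NAME LIKE ? ORDER BY NAME"
    return [row[0] for row in conn.execute(query, (pattern,))]


print(names_like("ANN %"))   # first name is exactly ANN
print(names_like("ANN%"))    # also picks up ANNA BELL
print(names_like("%ANN%"))   # also picks up JOANN REED
```

Only the pattern with the trailing space limits the result to people whose first name is exactly ANN.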

SQL: Find highest number if its in nvarchar format containing special characters

I need to pull the record containing the highest value, specifically I only need the value from that field. The problem is that the column is nvarchar format that contains a mix of numbers and special characters. The following is just an example:
PK | Column 2 (nvarchar)
-------------------
1 | .1.1.
2 | .10.1.1
3 | .5.1.7
4 | .4.1.
9 | .10.1.2
15 | .5.1.4
Basically, the items in column 2 are sorted as strings. So instead of returning the PK for the row containing ".10.1.2" as the highest value, I get the PK for the row that contains ".5.1.7".
I attempted to write some functions to do this, but what I've written looks way more complicated than it should be. Does anyone have something simple, or are complicated functions the only way?
I want to make clear that I'm trying to grab the PK of the record that contains the highest Column 2 value.

This query might return what you desire
SELECT MAX(CAST(REPLACE(Column2, '.', '') as INT)) FROM table
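Note that this query returns the converted value rather than the PK. Removing the dots also turns ".10.1.2" into 1012 and ".5.1.7" into 517, which orders the sample rows correctly but could misorder values with different numbers of components (for example ".10.1" would become 101). As a hedged alternative, here is a sketch of a component-wise comparison in Python rather than SQL. The helper name version_key is hypothetical, and it assumes the column contains only digits and dots.

```python
# Sample rows from the question: (PK, Column 2).
rows = [
    (1, ".1.1."), (2, ".10.1.1"), (3, ".5.1.7"),
    (4, ".4.1."), (9, ".10.1.2"), (15, ".5.1.4"),
]


def version_key(text):
    """Turn ".10.1.2" into (10, 1, 2) so that parts compare as numbers."""
    # Empty pieces produced by leading or trailing dots are skipped.
    return tuple(int(part) for part in text.split(".") if part)


# Pick the row whose Column 2 is highest when compared part by part.
best_pk, best_value = max(rows, key=lambda row: version_key(row[1]))
print(best_pk, best_value)
```

Comparing tuples of integers keeps 10 above 5 no matter how many components each value has.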
