Google Refine - pull out identical values in cell - data-manipulation

I have data in a column that looks like this
["Lymore Cottages", "Lymore Cottages", "Lymore Cottages", "Lymore Cottages", "Lymore Cottages", "Lymor Cottages"]
It's essentially the same thing multiple times, but as these are entered by users they could differ. Notice that the last one is missing the e.
What I would like to do is create a new column with just the unique names in it. So the new column would just contain "Lymore Cottages, Lymor Cottages".
I believe this is possible with Google/open Refine. I tried clustering but this also clustered all the other rows with the same details rather than per cell. (I need this for each row regardless of if there are 20 other rows with the same data)

This isn't a programming question; however, a combination of splitting the values in the cell, removing the duplicates, and then reassembling the contents might work.
There's probably an easier way to do this. Roughly, you could
Split
1. Split multi-valued cells... on the column
2. Remove the brackets and quotes with
value.replace('[', '').replace(']','').replace('"', '')
Remove duplicates
3. Next, Sort... A-Z and Reorder rows permanently
4. Blank down on the column
5. Invoke the Facet by blank and select True
6. Remove all matching rows from All > Edit rows
Reassemble
7. On the column, Transpose cells in rows into columns...
8. Rebuild the field with brackets and quotes using
'['+ ' ' + value + ',' + ' ' + cells['Step 7 Field Name'].value + ' ' + ']'
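The same split, de-duplicate, and reassemble idea can be sketched outside Refine. This is only an illustration in Python, assuming each cell holds a JSON-style list of quoted strings as in the question; the function name unique_in_cell is made up for the example and is not part of Refine.

```python
import json


def unique_in_cell(cell):
    """Return the distinct names found in one cell, in first-seen order."""
    # Split: parse the bracketed, quoted list into separate values.
    values = json.loads(cell)
    # Remove duplicates: dict keys keep insertion order, so the first
    # occurrence of each distinct spelling survives.
    unique = list(dict.fromkeys(v.strip() for v in values))
    # Reassemble: join the distinct values back into a single string.
    return ", ".join(unique)


cell = '["Lymore Cottages", "Lymore Cottages", "Lymor Cottages"]'
print(unique_in_cell(cell))  # Lymore Cottages, Lymor Cottages
```

Because the work is done per cell, other rows with the same contents have no effect on the result, which is the behaviour the question asks for.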

Related

BigQuery: Call a UDF on each column of each row and aggregate the output in new column dynamically

I have written a JS UDF in BigQuery which needs to be called on each cell of each row; the output for each row needs to be aggregated into another column dynamically, and this should work for all tables. I have referred to the answer provided by Mikhail in this question: BigQuery - Concatenate multiple columns into a single column for large numbers of columns
This answer partially works for me. But since some of my tables have columns whose text contains a comma, it ends up splitting those columns again. For example, the last column should have 5 values, one for each column. I have tried a few approaches, such as using %T in format, since I need to keep it generic, but they all run into limitations because of the comma.
Following is the query I am using :
SELECT *, (SELECT string_agg(myFunc(col), ', ' ORDER BY offset) FROM UNNEST(split(trim(format('%t', (SELECT AS struct t.* )), '()'), ', ')) col WITH offset WHERE NOT upper(col) = 'NULL') AS funcOutPut FROM `my-project`.db.customer t;
Is there any way this can be achieved generically for all the tables I have? Any help would be appreciated. :)
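To make the problem concrete, here is a small Python sketch of why splitting on ', ' miscounts the columns. It assumes the row is rendered as one parenthesised, comma-separated string with unquoted text values, which is how the query above treats the output of format('%t', ...); render_row and naive_split are hypothetical helpers for illustration, not BigQuery functions.

```python
def render_row(values):
    """Render a row the way the query assumes: (v1, v2, v3)."""
    return "(" + ", ".join(values) + ")"


def naive_split(rendered):
    """Mirror split(trim(..., '()'), ', ') from the query."""
    return rendered.strip("()").split(", ")


row = ["1", "Smith, John", "NY"]  # three columns, one containing a comma
parts = naive_split(render_row(row))
print(len(row), len(parts))  # 3 4
print(parts)                 # ['1', 'Smith', 'John', 'NY']
```

Once the row has been flattened to a single string, the comma inside a value is indistinguishable from the separator, so the UDF ends up being called once too often.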

How to search for separated values in columns from a merged values column

I have a database where the data I need to work with is stored into two different columns. I also need to import an excel file and the data in this excel file is all together only separated by a dash. So either I need to figure out how to create a query, maybe an alias, or how to split the column by the dash and then make the query with the data split up.
The code I was trying was the following:
SELECT
CAST (dbo_predios.codigo_manzana_predio as nvarchar(55))+'-'+CAST(dbo_predios.codigo_lote_predio as nvarchar(55)) as ROL_AVALUO
FROM dbo_predios
WHERE ROL_AVALUO like '%9132-2%'
That is one way I tried, but I don't really know how to split on a given symbol. The data in the Excel file comes in exactly the form I wrote in the "like" portion of the code.
I believe this is what you are after from the sounds of it:
SELECT
[locateDashInString] = CHARINDEX('-', e.FieldHere, 0) --just showing you where it finds the dash
,[SubstringBeforeItemLocated] =
SUBSTRING(
e.FieldHere --string to search from
,0 --starting index
,CHARINDEX('-', e.FieldHere, 0) --index of found item
)
,[SubstringAfterItemLocated] =
SUBSTRING(
e.FieldHere --string to search from
,CHARINDEX('-', e.FieldHere, 0) + 1 --starting index for substring
,LEN(e.FieldHere) --finish substring at this point
)
FROM ExcelImportedDataTable e
The locateDashInString column is just to show you where it finds the '-' symbol; you don't actually need it. The other two columns are the split value, so '9132-2' is split into two values in two columns.
Just note that this will only work if the data always has the format val1-val2. As long as the format is the same, it should be fine.
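As a rough cross-check of what the two SUBSTRING columns return, the same split can be sketched in Python. This is only an illustration of the logic, assuming values of the form val1-val2; split_on_first_dash is a made-up name.

```python
def split_on_first_dash(value):
    """Split 'val1-val2' at the first dash into (val1, val2)."""
    before, sep, after = value.partition("-")
    if not sep:
        # No dash found: the val1-val2 format assumption does not hold.
        raise ValueError("expected a value of the form val1-val2: %r" % value)
    return before, after


print(split_on_first_dash("9132-2"))  # ('9132', '2')
```

Like the SQL version, this splits only at the first dash, so a value such as '9132-2-1' would leave '2-1' in the second part.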

Blank Space in every row of table SQL

Hello, I have a table with rows
and I was doing a simple
select * from table where column = 'string'
and it gives me back no result, but when I use:
select * from table where column like '%string%'
it gives me the row that exists in my table.
Then I did a select * from table and noticed that there is a blank space before my rows:
Image of my SQL result
If you look closely, there's a space at the beginning of the second row, and only the first row has no blank space.
So I thought it was a simple white space at the beginning, but when I tried using this:
SELECT LTRIM(RTRIM(MATERIAL)) FROM table
nothing happened.
Then I tried to copy the result of my
select * from table
to Excel and noticed this:
Excel paste from SQL
My 2nd row got split into 2 rows right at the start of the column 'material', so the thing I thought was a blank space is actually something like a line break.
I have never had or seen this problem before.
Larnu has commented how to remove all the linebreaks from the data. Here are some other things that could also work, and slightly differently depending on the effect you want:
--trim everything that is not a number or letter off the left hand side only
UPDATE table SET material = SUBSTRING(material, PATINDEX('%[0-9a-z]%', material), 99999)
--convert all linebreaks to spaces and trim off the left and right spaces
UPDATE table SET material = RTRIM(LTRIM(REPLACE(material, CHAR(10), ' ')))
Larnu's SQL isn't wrong, it'll just remove every line break anywhere, which may cause more formatting disruption than is wanted. I'd be tempted to replace all the linebreaks with spaces, so that two words separated by a line break remain separated by a space rather than becoming one word, as they would if the line break were simply removed:
some
word
-> some word (if you replace linebreak with space)
-> someword (if you replace linebreak with nothing)
If all you want is to remove linebreaks from the left side of the field, the patindex method will search the field for the first occurrence of a number or a letter and return its index; substring will then cut everything from that index for a length of 99999 (use a bigger number if your field is longer). This has the effect of removing only linebreaks at the start of the field.
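The difference between these clean-up options can be sketched in Python. This is an illustration of the logic only, not the T-SQL itself; the function names are made up, and only line feeds (character 10) and plain spaces are handled, as in the statements above.

```python
import re


def linebreaks_to_spaces(text):
    """Like RTRIM(LTRIM(REPLACE(material, CHAR(10), ' ')))."""
    return text.replace("\n", " ").strip(" ")


def strip_leading_non_alnum(text):
    """Like the PATINDEX/SUBSTRING approach: cut from the first letter or digit."""
    match = re.search(r"[0-9A-Za-z]", text)
    # If there is no letter or digit at all, leave the text unchanged.
    return text[match.start():] if match else text


print(linebreaks_to_spaces("some\nword"))               # some word
print("some\nword".replace("\n", ""))                   # someword
print(repr(strip_leading_non_alnum("\nsome\nword")))    # 'some\nword'
```

The last call shows that the left-side trim keeps line breaks inside the value and only removes what comes before the first letter or digit.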
As to how it happened: whoever inserted the data, or the data import program, made a mistake when cutting up the data. Perhaps it was a Windows-style text file, whose line endings are CR LF (ASCII 13 followed by 10), and the program that did the import decided to cut the file up based on the 13 only, leaving behind the 10 to become "part of" the material field:
this,is,my,data1<13><10>this,is,my,data2<13><10>
//now lets cut it up into 2 records, based on using <13> only to denote the end of line:
record 1= this,is,my,data1
record 2= <10>this,is,my,data2
The program just sees a stream of bytes; it is we humans who interpret "lines". If the program treats 13 as the separator, then all the 10s get left behind as part of the data that gets inserted. The very first record in the file won't have 13/10 (CRLF) before it because it's the first line, so one of your rows (the one with ASCII(49)) won't suffer this problem.
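The effect is easy to reproduce. The following Python sketch uses a hypothetical importer, split_on_cr_only, that treats only character 13 (CR) as the end of a record; it is not the actual import tool, just a demonstration of the mechanism described above.

```python
def split_on_cr_only(stream):
    """Cut a CRLF-terminated stream into records using CR (13) alone."""
    records = stream.split("\r")
    # Drop the trailing fragment that holds nothing but the final LF.
    return [r for r in records if r.strip()]


stream = "this,is,my,data1\r\nthis,is,my,data2\r\n"
records = split_on_cr_only(stream)
print([repr(r) for r in records])
# ["'this,is,my,data1'", "'\\nthis,is,my,data2'"]

# Only the first record is clean; every later one starts with LF (10).
print([ord(r[0]) for r in records])  # [116, 10]
```

Every record after the first begins with character 10, which matches the symptom in the question: an invisible leading character that LTRIM does not remove, because LTRIM only trims spaces.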
You could "cure" the bad data with a trigger upon insert:
CREATE TRIGGER prevent_bad_data
ON yourtable
INSTEAD OF INSERT
AS
BEGIN
INSERT INTO yourtable(somecolumn,othercolumn,material)
SELECT foo,
bar,
LTRIM(REPLACE(material, CHAR(10), ' '))
FROM Inserted
END
Or you could program the db to reject bad rows and fix the tool that is inserting the bad data:
ALTER TABLE yourtable
ADD CONSTRAINT prevent_bad_material
CHECK (material LIKE '[0-9a-z]%'); --check it starts with a number or letter
Edit: though having seen your updated question with screenshots, the material column really should be a number, not a varchar type, then this wouldn't happen

SQL finding rows that only contain chars from a certain Unicode range

I recently asked a question to obtain rows that contain characters in a certain Unicode range.
SELECT *
FROM #kanjinames
WHERE UNICODE(LEFT(ForeNames, 1)) BETWEEN 0x4e00 AND 0x9fff
A very helpful user shared the above with me. To my understanding, it checks the first character on the left and, if it is within the Unicode range, returns the row. Through testing I believe this works.
My current problem is how do I go about checking the entire column is within the range? For example:
石山コンタクトレンズ
The above contains characters outside of the range (only the first two characters are within range), yet it is returned by the query above, and I am not sure how to go about checking the entire field. I am aware of using stuff like
is not like N'%[^a-z]%'
for the English alphabet. Just not sure how to apply it for this situation.
Any help would be great on this.
I think this will work:
SELECT *
FROM #kanjinames
WHERE ForeNames NOT LIKE '%[^' + NCHAR(0x4e00) + '-' + NCHAR(0x9fff) + ']%';
That is, the string contains no characters outside that sequence.
Edit: I had to alter this slightly to get it to work. I had to use the decimal values instead of the hex.
SELECT *
FROM #kanjinames
WHERE ForeNames NOT LIKE '%[^' + NCHAR(19968) + '-' + NCHAR(40802) + ']%';
This still returns blank values but I removed those separately.
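The "no characters outside the range" test can be sketched in Python to show both the logic and why blank values slip through. This is an illustration only, using the 0x4e00 to 0x9fff range from the question; only_in_range is a made-up name.

```python
RANGE_START = 0x4E00  # 19968
RANGE_END = 0x9FFF    # 40959


def only_in_range(text):
    """True when no character of text falls outside the range."""
    return all(RANGE_START <= ord(ch) <= RANGE_END for ch in text)


print(only_in_range("石山"))                  # True
print(only_in_range("石山コンタクトレンズ"))  # False: the katakana are out of range
print(only_in_range(""))                      # True: nothing is out of range
```

An empty string contains no out-of-range character, so it passes the check, just as the NOT LIKE version returns blank values; those need a separate condition if they are unwanted.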

Excel Macro to Convert Cell Data to Multiple Columns

I have a big bunch of cells in Excel that look like the following:
FName LName, Loc JB
Abbreviations are bad, so: First Name, Last Name, Location, Job.
I need to move that so it looks like this:
FName LName || LOC || JB
Caveats:
Must remove the , after the name.
Must capitalize the location (it's 2 or 3 letters, inconsistently capitalized. I want to make them all caps).
JB is anywhere from 1 to 4 characters on the end. I just need to take that last bit and dump it in.
They're all separated by at least a space (the first has a comma and a space).
I'd like a macro to do this, because I have to do it with relative frequency, and doing 200 rows of this by hand is a pain. Any help?
Sounds like all you need is a formula in the next column. If your values are in column A starting with cell A1, try:
=LEFT(A1,FIND(",",A1)-1)&" || "&UPPER(MID(A1,FIND(",",A1)+2,FIND(" ",A1,FIND(",",A1)+2)-FIND(",",A1)-2))&" || "&RIGHT(A1,LEN(A1)-FIND(" ",A1,FIND(",",A1)+2))
This formula takes everything to the left of the comma and adds " || ". Then it finds the next space starting its search two characters after the comma. Using that index it then can extract the location and make it upper case. Then again we add " || ". Then knowing the index of that space we can grab everything to the right to grab the job. This same logic can be applied in VBA but this is probably a quicker solution and easier to pass between computers.
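For anyone who finds the nested FIND calls hard to follow, the same steps can be written out in Python. This is a sketch of the formula's logic rather than VBA, and it assumes the layout from the question: name, comma, space, location, space, job. The function name reformat_cell is made up.

```python
def reformat_cell(cell):
    """Turn 'FName LName, Loc JB' into 'FName LName || LOC || JB'."""
    comma = cell.index(",")              # like FIND(",", A1)
    name = cell[:comma]                  # like LEFT(A1, FIND(",", A1) - 1)
    loc_start = comma + 2                # skip the comma and the space after it
    space = cell.index(" ", loc_start)   # next space, searched from the location
    loc = cell[loc_start:space].upper()  # like UPPER(MID(...))
    job = cell[space + 1:]               # like RIGHT(...): everything after that space
    return name + " || " + loc + " || " + job


print(reformat_cell("John Smith, ny Mgr"))  # John Smith || NY || Mgr
```

Like the formula, this raises an error (the counterpart of #VALUE! in Excel) when a cell has no comma or no space after the location.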
You don't necessarily need a macro. You can do it with a series of right()'s left()'s mid()'s and find()'s. Also need to use Upper() for the loc.
For instance, if your data is in column A, to get a column with first and last name in column B, in B1 you could use:
=LEFT(A1,FIND(",",A1)-1)
That'll return everything up to but not including the comma. For Loc, assuming there's a comma between Loc and JB in C1 you'd use:
=UPPER(MID(A1,FIND(",",A1)+2,FIND(",",A1,FIND(" ",A1)+1)-FIND(" ",A1)-3))
That'll return the uppercase version of the middle of the string, starting 2 chars after the first comma (so you don't get the space), and ending 2 less than the difference between the first and second commas. If there's no comma, you could do a similar set of searches to find where that last space is.
The last one, in D1, is:
=RIGHT(A1,LEN(A1)-FIND(" ",A1,FIND(",",A1)+2))
edited after the clarification of commas and spaces.