OpenRefine: Remove duplicate, comma-separated values in cells

How can I clean up (and later export to JSON) cells which contain comma-separated, possibly duplicate values?
Example cells:
+-------------+
| foo,bar,foo |
+-------------+
| bar,qux     |
+-------------+
| bar,bar     |
+-------------+
What I'm after is either the data split up into new columns and deduplicated like so:
+-----+-----+
| foo | bar |
+-----+-----+
| bar | qux |
+-----+-----+
| bar |     |
+-----+-----+
or a possibility to export the deduplicated data as a JSON array
+---------+
| foo,bar |
+---------+
| bar,qux |
+---------+
| bar     |
+---------+
to
"cellname": ["foo", "bar"]
"cellname": ["bar", "qux"]
"cellname": ["bar"]
Thanks for your help!

You must first import your dataset in line-based mode so that the values are contained in a single column.
Then you can use this hacky Python/Jython script to transform your column:
from collections import OrderedDict
# strip spaces, split on commas, and keep only the first occurrence of each value
dedup = list(OrderedDict.fromkeys(value.replace(' ', '').split(',')))
return '["' + '","'.join(dedup) + '"]'
Result:
+---------------+
| ["foo","bar"] |
+---------------+
| ["bar","qux"] |
+---------------+
| ["bar"]       |
+---------------+
Finally, by clicking on "Export -> Templating", you can use a value like this in the "Row template" field:
"cellnames" : {{cells["Column 1"].value}}

Related

Is there a way to alter all columns in a SQL table to have 'utf-8' format

I'm using Spark and I found that my data is not being correctly interpreted. I've tried using decode and encode built-in functions but they can be applied only to one column at a time.
Update:
An example of the behaviour I am seeing:
+-----------+
| Pa�s      |
+-----------+
| Espa�a    |
+-----------+
And the one I'm expecting:
+-----------+
| País      |
+-----------+
| España    |
+-----------+
The query is just a simple
SELECT * FROM table
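Since the question boils down to applying the single-column decode/encode functions the asker mentions across every column, here is a minimal sketch of that pattern; the dataframe name df, the charsets, and the assumption that every column is a string are mine, not from the question:
import org.apache.spark.sql.functions.{col, decode, encode}

// Map the one-column fix over all columns at once: re-encode each string
// with the charset it was wrongly decoded as (assumed ISO-8859-1 here),
// then decode those bytes as UTF-8.
val fixed = df.select(
  df.columns.map(c => decode(encode(col(c), "ISO-8859-1"), "UTF-8").as(c)): _*
)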

Remove/delete values in a column SQL

I am very new to using SQL and require help.
I have a table containing commas in its values:
+-------------------+
| Sample            |
+-------------------+
| sdferewr,yyuyuy   |
| q45345,ty67rt     |
| wererert,rtyrtytr |
| werr,ytuytu       |
+-------------------+
I want to remove the values after the comma (,) and keep only the values before it.
Required output:
+----------+
| Sample   |
+----------+
| sdferewr |
| q45345   |
| wererert |
| werr     |
+----------+
How can I do this in SQL? Please help.
Assuming the table name is "TABLE_NAME" and the field name is "sample", then:
UPDATE TABLE_NAME SET sample = SUBSTRING_INDEX(`sample`, ',', 1);
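Note that SUBSTRING_INDEX is MySQL-specific. Before running the UPDATE, it can be worth previewing the result with a plain SELECT; a minimal sketch, assuming the same table and column names:
-- Show the original value next to the part before the first comma
SELECT `sample`, SUBSTRING_INDEX(`sample`, ',', 1) AS trimmed
FROM TABLE_NAME;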
The simplest way to do that is:
UPDATE table_name
SET column = substring(column for position(',' in column) - 1)
WHERE condition;
position(',' in column) returns the position of the comma, and substring(column for n) returns the first n characters, so subtracting 1 drops the comma itself.
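As a quick worked example of that arithmetic (standard SQL / PostgreSQL syntax, with a literal from the sample data standing in for the column):
-- ',' is at position 9 in 'sdferewr,yyuyuy', so we keep the first 8 characters
SELECT substring('sdferewr,yyuyuy' for position(',' in 'sdferewr,yyuyuy') - 1);
-- -> 'sdferewr'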

How to get part of the String before last delimiter in AWS Athena

Suppose I have the following table in AWS Athena
+----------------+
| Thread         |
+----------------+
| poll-23        |
| poll-34        |
| pool-thread-24 |
| spartan.error  |
+----------------+
I need to extract the part of the string before the last delimiter (here '-' is the delimiter).
Basically, I need a query which can give me this output:
+----------------+
| Thread         |
+----------------+
| poll           |
| poll           |
| pool-thread    |
| spartan.error  |
+----------------+
I also need a GROUP BY query which can generate this:
+---------------+-------+
| Thread        | Count |
+---------------+-------+
| poll          | 2     |
| pool-thread   | 1     |
| spartan.error | 1     |
+---------------+-------+
I tried various forms of MySQL queries using the LEFT(), RIGHT(), LOCATE(), and SUBSTRING_INDEX() functions, but it seems that Athena does not support all of them.
You could use regexp_replace() to remove the part of the string that follows the last '-':
select regexp_replace(thread, '-[^-]*$', ''), count(*)
from mytable
group by regexp_replace(thread, '-[^-]*$', '')
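Since repeating the expression in both the SELECT and the GROUP BY is easy to get out of sync, note that Athena (Presto) also accepts the ordinal position of a select-list item in GROUP BY; a minor variation on the same query:
select regexp_replace(thread, '-[^-]*$', '') as thread, count(*) as count
from mytable
group by 1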

How to get distinct value, count of a column in dataframe and store in another dataframe as (k,v) pair using Spark2 and Scala

I want to get the distinct values and their respective counts of every column of a dataframe and store them as (k,v) in another dataframe.
Note: my columns are not static; they keep changing, so I cannot hardcode the column names and instead have to loop through them.
For example, below is my dataframe:
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
|Blaze           |IND        |19950312    |
|Scarlet         |USA        |19950313    |
|Jonas           |CAD        |19950312    |
|Blaze           |USA        |19950312    |
|Jonas           |CAD        |19950312    |
|mark            |USA        |19950313    |
|mark            |CAD        |19950313    |
|Smith           |USA        |19950313    |
|mark            |UK         |19950313    |
|scarlet         |CAD        |19950313    |
+----------------+-----------+------------+
My final result should be a new dataframe of (k,v) pairs, where k is the distinct value and v is its count.
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
|(Blaze,2)       |(IND,1)    |(19950312,3)|
|(Scarlet,2)     |(USA,4)    |(19950313,6)|
|(Jonas,3)       |(CAD,4)    |            |
|(mark,3)        |(UK,1)     |            |
|(smith,1)       |           |            |
+----------------+-----------+------------+
Can anyone please help me with this? I'm using Spark 2.4.0 and Scala 2.11.12.
Note: my columns are dynamic, so I can't hardcode the columns and do a groupBy on them.
I don't have an exact solution to your query, but I can provide some help that can get you started on your issue.
Create the dataframe:
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next, calculate the count of each distinct element of every column:
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
Create a range to iterate over distCount
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create the final dataframe:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
I hope this helps you in some way.
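If actual (value, count) pairs are more useful than those concatenated strings, a small follow-on sketch (reusing the same df from above, and settling for a plain Scala Map rather than the requested dataframe) would be:
// For each column, collect its distinct values with their counts as tuples
val counts: Map[String, Array[(String, Long)]] =
  df.columns.map { c =>
    c -> df.groupBy(c).count()
           .collect()
           .map(r => (r.get(0).toString, r.getLong(1)))
  }.toMap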

hive regexp_extract after second occurrence of delimiter

We have a Hive table column which holds strings separated by ';', and we need to extract the part after the second occurrence of ';'.
+-----------------+
| col1            |
+-----------------+
| a;b;c;d         |
| e;f; ;h         |
| i;j;k;l         |
+-----------------+
Required output:
+-----------+
| col1      |
+-----------+
| c         |
| <null>    |
| k         |
+-----------+
How can I do this with regexp_extract?
Split the string on ';', which returns an array of values, and take the element at index 2:
select split(str,';')[2]
from tbl
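Note that for the second sample row this yields the literal ' ' element rather than a NULL; you can see that with a standalone expression (the literal here is just the sample row, not read from any table):
-- index 2 of ['e','f',' ','h'] is the single space, not NULL
select split('e;f; ;h', ';')[2];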
If you want to convert empty and space-only strings to NULLs like in your example, then this macro can be useful:
create temporary macro empty_to_null(s string) case when trim(s)!='' then s end;
select empty_to_null(split(col1,'\\;')[2]) from tbl;
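Putting the two together on the sample data (assuming the table is named tbl, as in the first snippet):
select col1, empty_to_null(split(col1, '\\;')[2]) as extracted
from tbl;
-- a;b;c;d  ->  c
-- e;f; ;h  ->  NULL   (' ' is trimmed to empty, then mapped to NULL)
-- i;j;k;l  ->  k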