Bulk replace text in all columns - OpenRefine

I am using OpenRefine to do some data preparation. I have dozens of columns that need to be cleaned with the same GREL expression:
value.replace("text to be replaced","new text")
How do I bulk-apply this GREL expression to all columns at once?

Update: multi-column editing is supported via the "Transform All" feature, available since the OpenRefine 2.7 release.
Initial answer:
At the time this was written, OpenRefine didn't support editing multiple columns at once. However, you can make the edit on the first column and then edit the generated JSON code to apply it to the other columns. You can read more about this process here: http://googlerefine.blogspot.ca/2012/06/google-refine-json-and-my-notepad-or.html
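For illustration, an extracted operation history (Undo/Redo → Extract...) looks roughly like this - one entry per column, where only "columnName" changes (the column names below are placeholders). Duplicate the block for each column and paste the result back via Undo/Redo → Apply:
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Column 1",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "Column 1",
    "expression": "grel:value.replace(\"text to be replaced\",\"new text\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Column 2",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "Column 2",
    "expression": "grel:value.replace(\"text to be replaced\",\"new text\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]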

Related

Underscore and dash in column names after JSON import

I've been using OpenRefine very successfully for a couple of years, working solely with CSV (and TSV) source files. Recently I had some tables from an SQL database that I wanted to bring into OpenRefine, so I exported them (from SQL) as JSON and then used OpenRefine's JSON import feature. It works beautifully, except that the column names all begin with "_ - ". For example, my JSON records start with
{"ID":"97247",
and OpenRefine made the first column name "_ - ID" instead of just "ID" (which I'd prefer - I know I can edit them later, but I have hundreds of fields). I can't see any settings on the parsing page that might help with this. Does anyone know if there is a way to import without the extra characters (or if there's an explanation for the underscore-dash)? I'm considering submitting a feature request, but I thought I'd check to see what other users may know.
This is a known issue.
There has also been a proposal to switch to a standard representation for JSON paths.
Feel free to comment on either ticket to indicate which solution you would prefer.

OpenRefine: cluster and edit two datasets

I have two datasets. In dataset one, column A has the IDs and column B has the data I need to cluster and edit using the various available algorithms. Dataset two again has the IDs in its first column and the data in the next column. I need to reconcile the data from dataset one against the data from the second dataset. What I have done so far is merge the two into one project, but then OpenRefine gives me mixed results, i.e. messy data that exists only in dataset two, which is not what I want at this stage.
I have also investigated Reconcile-csv, but without success in achieving the desired result. Any ideas?
An alternative to the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys between the two data sets, and then use these to do lookups between the data sets with the 'cross' function.
As an example, for Column B in each data set you could 'Add column based on this column' using the GREL:
value.fingerprint()
This creates the same key as is used by the "Fingerprint" clustering method. Let's call the new column 'Column C'.
You can then look up between the two projects using the following GREL in Dataset 2:
cells["Column C"].cross("Dataset 1","Column C")
If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint clustering method, then the lookup between the projects will work.
You can also use the phonetic keying algorithms to create match keys in Column C if that works better. What you can't do using this method (as far as I know) is the equivalent of nearest-neighbour matching - for that you'd need a reconciliation service with some kind of fuzzy matching, or you'd have to merge the two data sets.
Owen
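To make the lookup concrete: cross returns an array of matching rows, so to actually copy the matched data across you index into the result. A sketch (the [0] and the project/column names are illustrative, and a real project should guard against empty matches):
cells["Column C"].cross("Dataset 1", "Column C")[0].cells["Column B"].value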
Reconcile-CSV is a very good tool, but not very user-friendly. As an alternative you can use the free Excel plugin Fuzzy Lookup Add-In for Excel. It's very easy to use, as evidenced by this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select them and press CTRL + L).
And here is the same procedure with Reconcile-CSV (the GREL formula used is cell.recon.best.name and comes from here).

How to insert an array into a Firebird table field using SQL?

I have this table:
create table t_place(
  f_plc_timefrom time,
  f_plc_timeto time,
  f_plc_minute_cost Decimal(18,4)[24]
);
So I can create the array field, but I don't know how to fill this array field in SQL. I tried to find a way in many sources, but could find nothing. I need your help.
AFAIK one can work with arrays only via the API; there is no SQL syntax for that.
There is virtually no array support in Firebird's query and procedural language. As ain says, there is only some support via the API. Removal of the array functionality is on the table as well. See also ticket CORE-710.
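If restructuring the schema is an option, a common workaround (a sketch with hypothetical table and column names) is to normalise the 24 array slots into a child table, which plain SQL can insert into:
create table t_place(
  f_plc_id integer not null primary key,
  f_plc_timefrom time,
  f_plc_timeto time
);
create table t_place_cost(
  f_plc_id integer not null references t_place(f_plc_id),
  f_plc_hour smallint not null check (f_plc_hour between 0 and 23),
  f_plc_cost decimal(18,4),
  primary key (f_plc_id, f_plc_hour)
);
-- one row per array slot instead of one array-typed column:
insert into t_place_cost (f_plc_id, f_plc_hour, f_plc_cost) values (1, 0, 2.5000);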

Fulltext Solr statistical search

Suppose I have a number of documents indexed with Solr 4.0. Each has 2 fields - a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise me what kind of analyzers/parsers I should use, and how to build a statistical query to get a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the Terms component and the Stats component.
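For instance, assuming the default /terms request handler is enabled in solrconfig.xml (host, port and parameter values here are placeholders), a request along these lines returns the top terms of the DATA field sorted by frequency:
http://localhost:8983/solr/terms?terms.fl=DATA&terms.sort=count&terms.limit=10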
Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer - it will treat the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that there's a possibility to auto-detect the language of the text (patch).
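As a sketch of what that could look like in schema.xml (the field type names are made up, the factories are standard Solr ones, and the filter chain for DATA is just one plausible English setup):
<fieldType name="id_key" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_data" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- language-specific parts: tokenizing, lowercasing, stopwords, stemming -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<field name="ID" type="id_key" indexed="true" stored="true"/>
<field name="DATA" type="text_data" indexed="true" stored="true"/>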
I'm not sure if it's possible to find the most frequent words with Solr itself, but if you can use Lucene directly, pay attention to this question. My own suggestion for it is to use the HighFreqTerms class from the Luke project.

PostGIS: register a "geometry" column without AddGeometryColumn

The usual way to create a geometry column is AddGeometryColumn; however, I have to work with pre-existing columns, so I can't use that function (as far as I know).
Thanks to the PostGIS docs, I can already register the column in the "geometry_columns" table; however, AddGeometryColumn seems to do more than create a column and add a row in geometry_columns - for example, it adds checks on the column.
So my question is: what do I need to do to register the column manually, besides adding a row in geometry_columns?
(For example, is there a modified version of AddGeometryColumn that works with an existing column?)
The easiest way of doing it on existing columns is to use the function Populate_Geometry_Columns:
https://postgis.net/docs/Populate_Geometry_Columns.html
In other words: The function you are asking for is already there :-)
HTH
Nicklas
As you said, AddGeometryColumn is essentially a handy shortcut: it creates the column, adds the type checks, and registers the row in geometry_columns. You can of course add these by hand to an existing column - you simply need to do the same things that AddGeometryColumn does for you in a single command.
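Concretely, that amounts to something like the following sketch (table name, column name, SRID and geometry type are placeholders; this mirrors the constraints AddGeometryColumn creates in PostGIS 1.x, where geometry_columns is still a plain table):
ALTER TABLE my_table
  ADD CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2);
ALTER TABLE my_table
  ADD CONSTRAINT enforce_srid_geom CHECK (ST_SRID(geom) = 4326);
ALTER TABLE my_table
  ADD CONSTRAINT enforce_geotype_geom CHECK (GeometryType(geom) = 'POINT' OR geom IS NULL);
INSERT INTO geometry_columns
  (f_table_catalog, f_table_schema, f_table_name, f_geometry_column,
   coord_dimension, srid, "type")
VALUES ('', 'public', 'my_table', 'geom', 2, 4326, 'POINT');
-- AddGeometryColumn does not create a spatial index, but one is usually wanted:
CREATE INDEX my_table_geom_idx ON my_table USING GIST (geom);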
If you need to transfer a "regular" column to a "GIS" column, why not use SELECT INTO to transfer the data?