Batch edit in OpenRefine - openrefine

So, I have a bunch of .csv files which need cleaning. They all need to go through the same steps, so I've extracted OpenRefine's operation history in order to apply it to other ones.
I could open each file one by one in OpenRefine and apply the extracted JSON history. But there are a lot of files...
Also, I don't have enough memory to open them all at once in OpenRefine (multiple selecting when opening the files).
Is there any way I could edit them all or automatically using that JSON I extracted from OpenRefine?

That's what we created BatchRefine for, the README should be pretty much self-explanatory. If not, let me know.
I just recently converted 4 million CSV records to RDF using BatchRefine, took me less than 10 minutes on my MacBook Pro.
I execute BatchRefine with this simple shell script:
#!/bin/bash
for file in ./input/*.tsv
do
filename=$(basename "$file")
if [ ! -f "target/"$filename"-transformed" ]
then
echo Processing $filename...
curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' --data-binary "#"$file -o "target/"$filename"-transformed" 'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'
else
echo Found "target/"$filename"-transformed", skipping $file
fi
done;
Note that you need to adjust the Acceptheader in the script, I guess you want CSV as output again, not RDF.

You can automate some OpenRefine operations using one of the existing libraries:
python
An other python library
ruby
javascript - nodejs

Related

IntelliJ: Dynamically updated file header

By default, IntelliJ Idea will insert (something like) the following as the header of a new source file:
/**
* Created by JohnDoe on 2016-04-27.
*/
The corresponding template is:
/**
* Created by ${USER} on ${DATE}.
*/
Is it possible to update this template so that it inserts the last date of modification when the file is changed? For example:
/**
* Created by JohnDoe on 2016-03-27.
* Last modified by JaneDoe on 2016-04-27
*/
It is not supported out of the box. I suggest you do not include information about author and last edit/create time in file at all.
The reason is that your version control system (Git, SVN) contains the same information automatically. So the manual labelling is just duplicate of already existing info, but is only more error prone and needs to be manually updated.
Here's a working solution similar to what I'm using. Tested on mac os.
Create a bash script which will replace first occurrence of Last modified by JaneDoe on $DATE only if the exact value is not contained in the file:
#!/bin/bash
FILE=src/java/test/Test.java
DATE=`date '+%Y-%m-%d'`
PREFIX="Last modified by JaneDoe on "
STRING="$PREFIX.*$"
SUBSTITUTE="$PREFIX$DATE"
if ! grep -q "$SUBSTITUTE" "$FILE"; then
sed -i '' "1,/$(echo "$STRING")/ s/$(echo "$STRING")/$(echo "$SUBSTITUTE")/" $FILE
fi
Install File Watchers plugin.
Create a file watcher with appropriate scope (it may be this single file or any other scope, so that any change in project's source code will update modified date or version etc.) and put a path to your bash script into Program field.
Now every time the file changes the date will update. If you want to update date for each file separately, an argument $FilePath$ should be passed to the script.
This might have been just a comment to #oleg-mikhailov excellent idea, but the code snippet won't fit. Basically, I just tweaked his solution.
I needed a slightly different syntax but that's not the issue. The issue was that when the script ran automatically upon file save using the File Watchers plugin, if ran on a file which doesn't include PREFIX it would run over and over for ever.
I presume the that the issue is with the plugin itself, as it didn't happen when run from the shell, but I'm not sure why it happened.
Anyway, I ended up running the following script (as I said only a slight change with respect to the original). The new script also raises an error if the the prefix doesn't exist. For me this is a feature as Pycharm prompts me with the error, and I can fix the file.
Tested with PyCharm 2021.2.3 on macOS 11.6.
#!/bin/bash
FILE=$1
DATE=`date '+%Y-%m-%d'`
PREFIX="last_modified_date: "
STRING="$PREFIX.*$"
SUBSTITUTE="$PREFIX$DATE"
if ! grep -q "$SUBSTITUTE" "$FILE"; then
if grep -q "$PREFIX" "$FILE"; then
sed -i '' "s/$(echo "$STRING")/$(echo "$SUBSTITUTE")/" $FILE
else
echo "Error!"
echo "'$PREFIX' doesn't appear in $FILE"
exit 1
fi
fi
PHPStorm has not a "hook" for launching task after detect a change in file (just for uploading in server yes). Code templating is based on the creation of file not change.
The behaviour you want (automatic change file after manual change file) can be useful for lot of things but it's circular headhache for editor. Because if you change a file it must change file (and if a file is change ? it change file ?).
However, You can, perhaps, "enable Live Templates" when you launch a "reformat code" which able to rewrite your begin template code that way rewrite date modification.
Other solution is that use a tools with as grunt but I don't know if manage php file.

How to create sequence files from tsv file for text classification

I have a tsv file which is seperated in class, id and text, e.g.
positive 2342 This is very good.
negative 4343 I hate it.
and I'm trying to feed Mahout's nbayes to classify the text part either pos or neg.
My first attempt was using mahout seqdirectory command on every line as a seperate file in its class directory. This works well with a small amount of data but eventually fails at around 30 Gigabytes of data with OutOfMemoryException. Increasing the heap size fails with "GC overhead limit exceeded" probably because of the large amount of seperate files.
My second attempt was loading the data into a hive table and convert it to a sequence file, as it is described here [0], which seems to work fine at first but after creating the vector file and splitting up the data set the trainnb step fails with an ArrayIndexOutOfBounds Exception.
[0] http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
Right now I'm out of ideas what to look for. Any ideas how I can convert the tsv file or hive table to a sequencefile as it's generated by seqdirectory command on a directory?
Going to answer by myself in case some else needs a solution to the same or similar problem:
I found this code snippet at github and modified it to my needs. Additionally I had to trim the value string to get proper results.
This may be a simpler implementation for those searching for this answer in the future. This can be done completely from the command line (I tested it in EMR):
hadoop jar \
/home/hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-inputformat TextInputFormat \
-input {input_directory}/* \
-mapper '/bin/cat' \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-output {output_directory}
/home/hadoop/contrib/streaming/hadoop-streaming.jar is the location of the hadoop-streaming.jar on Amazon EMR (AMI 3.4.0). It may be a in a different location depending on your configuration.

How to get a list of files modified since date/revision in Accurev

I have created a workspace backed by some collaboration stream. The stream is updated regularly by team members. My goal is to take modified files in a given path and put them to another repository (do it regularly).
The question is how to create a list of files which were modified since a revision or date or ..? (I don't know which approach is the best.) The command line is preferable.
Once I get the file list I create an automating script to take the files from one place and put them to another.
accurev hist -s Your_Stream -t "2013/05/16 01:00:00"-now -a -fl
You can run accurev stat -m -fx and then parse resulting XML. element elements will have modTime attribute, which is the UNIX timestamp when the file was modified.

convert pdf to svg

I want to convert PDF to SVG please suggest some libraries/executable that will be able to do this efficiently. I have written my own java program using the apache PDFBox and Batik libraries -
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great but the size of the resulting svg files in huge.(many times greater than the pdf). I have figured out where the problem is by looking at the svg in a text editor. it encloses every character in the original document in its own block even if the font properties of the characters is the same. For example the word hello will appear as 6 different text blocks. Is there a way to fix the above code? or please suggest another solution that will work more efficiently.
Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
I think exec() is synchronous and only returns after the process completes (although I'm not 100% sure on that), so you shoudl be able to just read "out.svg" after that. In any case, Googling "java system call" will yield more info on how to do that part correctly.
Take a look at pdf2svg (also on on github):
To use
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/
pdftocairo can be used to convert pdf to svg. pdfcairo is part of poppler-utils.
For example to convert 2nd page of a pdf, following command can be run.
pdftocairo -svg -f 1 -l 1 input.pdf
pdftk 82page.pdf burst
sh to-svg.sh
contents of to-svg.sh
#!/bin/bash
FILES=burst/*
for f in $FILES
do
inkscape -l "$f.svg" "$f"
done
I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.
inkscape (#jbeard4) for me produced svgs with no text in them at all, but I was able to make it work by going to postscript as an intermediary using ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (#Koen.) since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (#SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).
Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form that option is in the man page as --export-plain-svg and works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg

Does grep work properly on pdf files?

Is it possible to search multiple pdf files using the 'grep' command. It doesn't seem to work, how do people search content on multiple pdf files?
Well, PDF is a binary format, and grep can search binary files as if they were text
grep -a
or you can just use pdftotext (which comes with xpdf) like this:
pdftotext whee.pdf | grep pattern
You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line:
mdfind -onlyin search/directory/path "kind:pdf search text"
use something like Solr or clucene I think they can do what you want.
Pdf is a binary format, that's why searching it with grep is not that helpful. You can search the strings is a pdf with grep like this:
ls dir_with_pdfs/*.pdf|xargs strings|grep "keyword"
Or you can use the pdf2text command on pdf's and then search result with grep.
This tool pdfgrep will do the work. It has a syntax similar to grep. To search in several files just a simple shell script. For example:
$> ls Documents/*.pdf | xargs pdfgrep -n -H "system"
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: designed episodic memory system
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: how ISAC's episodic memory system is
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: cognitive system employs a combination
....
PDF is a binary dump of objects used to display the pages. There may be some meta data you can grep but the actual page text is in a Postscript stream and may be encoded in a variety of ways. Its also not guaranteed to be in any order. You need to think of PDF as more like a Vector image file than a text file.
There is a short article explaining text in PDFs in more detail at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams
If you have pdftotext installed via the popplar package, then try this perl script :
#!/usr/bin/perl
my $p = shift;
foreach my $fn (#ARGV) {
open(F,"pdftotext $fn - |");
while (<F>) { print "$fn:$_" if /$p/; }
close(F);
}