Adding new language to spaCy: lemma lookup problem - spacy

I'm writing a new spacy language class for ancient Greek (grc) and it is working at the tokenizer level but fails to do the lemma lookup in a dictionary I built from a corpus. When I run the test for the lookup table with python -m pytest spacy_lookups_data in the corresponding directory of spacy_lookups_data and with a test file I wrote, I get the following error:
grc_nlp = <spacy.lang.grc.AncientGreek object at 0x7fe8a8696730>, string = 'μοι', lemma = ' ἐγώ'
#pytest.mark.parametrize(
"string,lemma",
[
("ἄνδρα", " ἀνήρ"),
("μοι", " ἐγώ"),
("πολύτροπον", "πολύτροπονος"),
("ὃς", "ὃς"),
("πολλὰ", "πολύς"),
("Τροίης", "Τροία"),
],
)
def test_grc_lemmatizer_lookup_assigns(grc_nlp, string, lemma):
tokens = grc_nlp(string)
assert tokens[0].lemma_ == lemma
E AssertionError: assert 'μοι' == ' ἐγώ'
E - ἐγώ
E + μοι
test_grc.py:26: AssertionError
_ test_grc_lemmatizer_lookup_assigns[\u03c0\u03bf\u03bb\u03cd\u03c4\u03c1\u03bf\u03c0\u03bf\u03bd-\u03c0\u03bf\u03bb\u03cd\u03c4\u03c1\u03bf\u03c0\u03bf\u03bd\u03bf\u03c2] _
I get similar errors for all 5 lemmas I wrote into the test file. Could this be an error related to unicode? Maybe the Greek polytonic characters are missing from spacy character definition? I suspect this because ancient use diacritics that may not be supported by spacy. Or is this error rather pointing to a problem with my lemma table?
Thanks.
Jacobo

It's probably not a unicode problem. Did you add the entry points in setup.cfg and spacy_lookups_data/__init__.py and reinstall the spacy-lookups-data package?

Related

Why syntax error in this very simple print command

I am trying to run following very simple code:
open Str
print (Str.first_chars "testing" 0)
However, it is giving following error:
$ ocaml testing2.ml
File "testing2.ml", line 2, characters 0-5:
Error: Syntax error
There are no further details in the error message.
Same error with print_endline also; or even if no print command is there. Hence, the error is in part: Str.first_chars "testing" 0
Documentation about above function from here is as follows:
val first_chars : string -> int -> string
first_chars s n returns the first n characters of s. This is the same
function as Str.string_before.
Adding ; or ;; at end of second statement does not make any difference.
What is the correct syntax for above code.
Edit:
With following code as suggested by #EvgeniiLepikhin:
open Str
let () =
print_endline (Str.first_chars "testing" 0)
Error is:
File "testing2.ml", line 1:
Error: Reference to undefined global `Str'
And with this code:
open Str;;
print_endline (Str.first_chars "testing" 0)
Error is:
File "testing2.ml", line 1:
Error: Reference to undefined global `Str'
With just print command (instead of print_endline) in above code, the error is:
File "testing2.ml", line 2, characters 0-5:
Error: Unbound value print
Note, my Ocaml version is:
$ ocaml -version
The OCaml toplevel, version 4.02.3
I think Str should be built-in, since opam is not finding it:
$ opam install Str
[ERROR] No package named Str found.
I also tried following code as suggested in comments by #glennsl:
#use "topfind"
#require "str"
print (Str.first_chars "testing" 0)
But this also give same simple syntax error.
An OCaml program is a list of definitions, which are evaluated in order. You can define values, modules, classes, exceptions, as well as types, module types, class types. But let's focus on values so far.
In OCaml, there are no statements, commands, or instructions. It is a functional programming language, where everything is an expression, and when an expression is evaluated it produces a value. The value could be bound to a variable so that it could be referenced later.
The print_endline function takes a value of type string, outputs it to the standard output channel and returns a value of type unit. Type unit has only one value called unit, which could be constructed using the () expression. For example, print_endline "hello, world" is an expression that produces this value. We can't just throw an expression in a file and hope that it will be compiled, as an expression is not a definition. The definition syntax is simple,
let <pattern> = <expr>
where is either a variable or a data constructor, which will match with the structure of the value that is produced by <expr> and possibly bind variable, that are occurring in the pattern, e.g., the following are definitions
let x = 7 * 8
let 4 = 2 * 2
let [x; y; z] = [1; 2; 3]
let (hello, world) = "hello", "world"
let () = print_endline "hello, world"
You may notice, that the result of the print_endline "hello, world" expression is not bound to any variable, but instead is matched with the unit value (), which could be seen (and indeed looks like) an empty tuple. You can write also
let x = print_endline "hello, world"
or even
let _ = print_endline "hello, world"
But it is always better to be explicit on the left-hand side of a definition in what you're expecting.
So, now the well-formed program of ours should look like this
open Str
let () =
print_endline (Str.first_chars "testing" 0)
We will use ocamlbuild to compile and run our program. The str module is not a part of the standard library so we have to tell ocamlbuild that we're going to use it. We need to create a new folder and put our program into a file named example.ml, then we can compile it using the following command
ocamlbuild -pkg str example.native --
The ocamlbuild tool will infer from the suffix native what is your goal (in this case it is to build a native code application). The -- means run the built application as soon as it is compiled. The above program will print nothing, of course, here is an example of a program that will print some greeting message, before printing the first zero characters of the testing string,
open Str
let () =
print_endline "The first 0 chars of 'testing' are:";
print_endline (Str.first_chars "testing" 0)
and here is how it works
$ ocamlbuild -package str example.native --
Finished, 4 targets (4 cached) in 00:00:00.
The first 0 chars of 'testing' are:
Also, instead of compiling your program and running the resulting application, you can interpret your the example.ml file directly, using the ocaml toplevel tool, which provides an interactive interpreter. You still need to load the str library into the toplevel, as it is not a part of the standard library which is pre-linked in it, here is the correct invocation
ocaml str.cma example.ml
You should add ;; after "open Str":
open Str;;
print (Str.first_chars "testing" 0)
Another option is to declare code block:
open Str
let () =
print (Str.first_chars "testing" 0)

snakemake STAR module issue and extra question

I discovered that the snakemake STAR module outputs as 'BAM Unsorted'.
Q1:Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
When I add the option in the 'extra' options I get an error message about duplicate definition:
EXITING: FATAL INPUT ERROR: duplicate parameter "outSAMtype" in input "Command-Line"
SOLUTION: keep only one definition of input parameters in each input source
Nov 15 09:46:07 ...... FATAL ERROR, exiting
logs/star/se/UY2_S7.log (END)
Should I consider adding a sorting module behind STAR instead?
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
the code:
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester#jimmy.harvard.edu"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 = snakemake.input.get("fq2")
if fq2:
fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])
if fq1[0].endswith(".gz"):
readcmd = "--readFilesCommand zcat"
else:
readcmd = ""
outprefix = os.path.dirname(snakemake.output[0]) + "/"
shell(
"STAR "
"{extra} "
"--runThreadN {snakemake.threads} "
"--genomeDir {snakemake.params.index} "
"--readFilesIn {input_str} "
"{readcmd} "
"--outSAMtype BAM Unsorted "
"--outFileNamePrefix {outprefix} "
"--outStd Log "
"{log}")
Q1:Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
I would add another sorting rule after the wrapper as it is the most 'standardized` way of doing it. You can also use another wrapper for sorting.
There is an explanation from the author of snakemake for the reason why the default is unsorted and why there is no option for sorted output in the wrapper:
https://bitbucket.org/snakemake/snakemake/issues/440/pre-post-wrapper
Regarding the SAM/BAM issue, I would say any wrapper should always output the optimal file format. Hence, whenever I write a wrapper for a read mapper, I ensure that output is not SAM. Indexing and sorting should not be part of the same wrapper I think, because such a task has a completely different behavior regarding parallelization. Also, you would loose the mapping output if something goes wrong during the sorting or indexing.
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
If you wanted to do this, one way would be to download the local copy of the wrapper. Change in the shell portion of the downloaded wrapper Unsorted to {snakemake.params.outsamtype}. In your Snakefile change (wrapper to script, path/to/downloaded/wrapper and add the outsamtype parameter):
rule star_se:
input:
fq1 = "reads/{sample}_R1.1.fastq"
output:
# see STAR manual for additional output files
"star/{sample}/Aligned.out.bam"
log:
"logs/star/{sample}.log"
params:
# path to STAR reference genome index
index="index",
# optional parameters
extra="",
outsamtype = "SortedByCoordinate"
threads: 8
script:
"path/to/downloaded/wrapper"
I think a separate rule w/o a wrapper for sorting or even making your own star rule rather is better. Modifying the wrapper defeats the whole purpose of it.

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile including rules from the rules folder and calling rscripts from those included rules.
Here are a few lines and their specific files:
Snakefile:
samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)
rule extract:
input:
'summary/umi_expression_matrix.tsv'
include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
input:
expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
output:
'summary/umi_expression_matrix.tsv'
script:
"../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
temp_matrix = read.table(snakemake#input[[i]][1], header=T, stringsAsFactors = F)
cell_barcodes = colnames(temp_matrix)[-1]
colnames(temp_matrix) = c("GENE",paste(samples[i], cell_barcodes, sep = "_"))
read_list=c(read_list, list(temp_matrix))
}
# Little function that allows to merge unequal matrices
merge.all <- function(x, y) {
merge(x, y, all=TRUE, by="GENE")
}
read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake#output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on getting back to the samples.csv file and getting the sample names there. I will also have to double check matching indexes maybe using greps, don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
input:
expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
params:
sample_name = lambda wildcards: samples.index
output:
'summary/umi_expression_matrix.tsv'
script:
"../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to call a rule for each value in the wildcards. If you use the expand function in the input of a rule, then your rule will take all of the wildcard values and create a list of strings. Which means, your rule will be invoked just for once (not for each wildcard value). Per default, expand uses the python itertools function product that yields all combinations of the provided wildcard values.
By doing so, you cannot use that wildcard inside your rule any longer. Because when that rule is invoked, it gets all of the wildcard values and convert them into a list that will be given to your R script just for once (not for each wildcard value).
In your case, using wildcards is not suitable, since your merge_count rule will be run only for once (not for each wildcard value).

pdfminer3k - pdf2txt.py error

I want to convert my pdf files to txt files and used pdfminer3k module & pdf2txt.py, however, I got an error.
pdf2txt.py -o file.txt -t tag file.pdf
This is my code at cmd screen.
Traceback (most recent call last):
File "C:\Python36\lib\site.py", line 67, in
import os
File "C:\Python36\lib\os.py", line 409
yield from walk(new_path, topdown, onerror, followlinks)
^
SyntaxError: invalid syntax
This is an error message that I got.
Could you help me to fix this problem??
Added for reference: Great resourse:
http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
The -t flag is the type of output. The options are text, tag, xml, and html.
Tag refers to generating a tag for xml. Replace tag with text in your command and try it.
The order of optional input also matters.
You also must invoke python, your command line does'nt know what import means, yet some of your environment seems to be setup. My example is for windows cmd from Anaconda3\Scripts directory. If your in juptyer notebook or a console, you should be able to run import pdf2txt with the .py
To setup your environment you need to append the os.path.append(yourpdfdirectory) otherwise file.pdf will not be found.
Try python pdf2txt.py -t text -o file.txt file.pdf
Or if you are brave...this is how to do programmatically. The trouble with xml is if you want to get the text, each character from xml tree is returned in an arbitrary order. You can get it to work but you need to build the string character by character which is not that hard, its just logically time consuming.
fp = open(filesin,'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager(caching=False)
laparams = LAParams(all_texts=True)
laparams.boxes_flow = -0.2
laparams.paragraph_indent = 0.2
laparams.detect_vertical = False
#laparams.heuristic_word_margin = 0.03
laparams.word_margin = 0.2
laparams.line_margin = 0.3
outfp = open(filesin+".out.tag" ,'wb')
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
#process_pdf(rsrcmgr, device, pdfparse, pagenos,caching=c, check_extractable=True)
for p,page in enumerate(doc.get_pages()):
if p == 0: #temporary for page 1
interpreter.process_page(page)
layout = device.get_result()
alltextinbox = ''
#This is a rich environment so categorization of this object hierarchy is needed
for c,lt_obj in enumerate(layout):
#print(type(lt_obj),"This is type ",c,"th object on the ",p,"th page")
if isinstance(lt_obj,LTTextBoxHorizontal) or isinstance(lt_obj,LTTextBox) or isinstance(lt_obj,LTTextLine):
print("Type ,",type(lt_obj)," and text ..",lt_obj.get_text())
obj_textbox_line.update({lt_obj:lt_obj.get_text()})
elif p != 0:
pass
fp.close()
#print(obj_textbox_line)
#call the column finder here
#check_matching("example", "example1")
#text_doc_df = pd.DataFrame(obj_textbox_line,columns=['text'])
#print (text_doc_df)
pass
I'm working on a generic row/column matcher. If you don't want to bother, you can buy this software already for like 150 bucks for a pro converter.

Liquibase: changeset auto generate ID

How to auto generate ID of changeset with liquibase?
I don't want to set the ID of every changeset manually, is there a way to do it automatically?
I dont think that generated ids are a good idea. The reason is that liquibase uses the changeSet id to calculate the checksum (in addition to the author and fileName). So if you ever insert a changeSet between others the checksums of all subsequent changeSets will change and you will get tons of warnings/errors.
Anyway i can think of those solutions if you still want to generate Ids:
create your own ChangeLogParser
If you parse the ChangeLog on your own you are free to generate the ids as you want.
The downside is that you will have to provide a custom Xml Schema for the changeLog. The schema from Liquibase has a constraint on changeSet ids (required). With a new schema you'll probably have to do a significant amount of tweaking on the parser.
Alternatively you may choose another changeLog Format (YAML, JSON, Groovy). Their parsers may be easier to customize as they do not need that schema definition.
Do some preprocessing
You may write a simple xslt (Xml transformation) that generates a changeLog with changeSet ids, from a file that has none.
use timestamps as Ids
This would be my advice. It does not solve the question the way you asked, but it is simple, consistent, provides additional information and is a good practice for other database migration tools as well http://www.jeremyjarrell.com/using-flyway-db-with-distributed-version-control/
I wrote a Python script to generate unique IDs into Liquibase changelogs.
Be careful!
DO generate IDs
when the changelog is in development or ready for release
or when you control the checksums of the target database
DON'T generate IDs
- when the changelog(s) are deployed already
"""
###############################################################################
Purpose: Generate unique subsequent IDs into Liquibase changelogs
###############################################################################
Args:
param1: Full Windows path changelog directory (optional)
OR
--inplace: directly process changelogs (optional)
By default, XML files in the current directory are processed.
Returns:
In case of success, the output path is returned to stdout.
Otherwise, we crash and drag the system into mordor.
If you feel like wasting time you can:
a) port path handling to *nix
b) handle any obscure exceptions
c) add Unicode support (for better entertainment)
Dependencies:
Besides Python 3, in order to preserve XML comments, I had to use lxml
instead of the stock ElementTree parser.
Install lxml:
$ pip install lxml
Proxy clusterfuck? Don't panic! Simply download a .whl package from:
https://pypi.org/project/lxml/#files and install with pip.
Bugs:
Changesets having id="0" are ignored. Usually, these do not occur.
Author:
Tobias Bräutigam
Versions:
0.0.1 - re based, deprecated
0.0.2 - parse XML with lxml, CURRENT
"""
import datetime
import sys
import os
from pathlib import Path, PureWindowsPath
try:
import lxml.etree as ET
except ImportError as error:
print ('''
Error: module lxml is missing.
Please install it:
pip install lxml
''')
exit()
# Process arguments
prefix = '' # hold separator, if needed
outdir = 'out'
try: sys.argv[1]
except: pass
else:
if sys.argv[1] == '--inplace':
outdir = ''
else:
prefix = outdir + '//'
# accept Windows path syntax
inpath = PureWindowsPath(sys.argv[1])
# convert path format
inpath = Path(inpath)
os.chdir(inpath)
try: os.mkdir(outdir)
except: pass
filelist = [ f for f in os.listdir(outdir) ]
for f in filelist: os.remove(os.path.join(outdir, f))
# Parse XML, generate IDs, write file
def parseX(filename,prefix):
cnt = 0
print (filename)
tree = ET.parse(filename)
for node in tree.getiterator():
if int(node.attrib.get('id', 0)):
now = datetime.datetime.now()
node.attrib['id'] = str(int(now.strftime("%H%M%S%f"))+cnt*37)
cnt = cnt + 1
root = tree.getroot()
# NS URL element name is '' for Etree, lxml requires at least one character
ET.register_namespace('x', u'http://www.liquibase.org/xml/ns/dbchangelog')
tree = ET.ElementTree(root)
tree.write(prefix + filename, encoding='utf-8', xml_declaration=True)
print(str(cnt) +' ID(s) generated.')
# Process files
print('\n')
items = 0
for infile in os.listdir('.'):
if (infile.lower().endswith('.xml')) == True:
parseX(infile,prefix)
items=items+1
# Message
print('\n' + str(items) + ' file(s) processed.\n\n')
if items > 0:
print('Output was written to: \n\n')
print(str(os.getcwd()) + '\\' + outdir + '\n')