carrot2 api not support japanese language - api

I am trying to use carrot2 API to cluster documents in japanese language. It throws out this WARN:
org.carrot2.text.linguistic.DefaultTokenizerFactory: Tokenizer for Japanese (ja) is not available. This may degrade clustering quality of Japanese content.
Hence, the clustering process failed and all docs belong to "other topic" cluster.
Is there any help to solve this problem?
Thanks in advance.

The open source algorithms available in Carrot2 unfortunately do not support Japanese. The constant was added to cover the possible future support for Japanese.
Alternatively, you can try running Carrot2 with a customized linguistic pipeline, the UsingCustomLanguageModel example class in Carrot2 Java API distribution shows how to do it.

Related

Does DocToolChain support Python?

I came across docToolChain (http://doctoolchain.org/) for generating Docs. Wanted to know if this can support other programming languages such as Python, Go etc? Or will it only support Java?
yes, it does - depending on your needs :-)
When we incremented the version of docToolchain from 1.x to 2.0, the biggest feature was that the technology is now hidden behind a wrapper.
You still need java installed (v8-v14), but you don't have any code any more in your repository.
But regarding Python - it is quite likely that you want to use restructuredText as your markup language.
Until now, docToolchain focussed on AsciiDoc (based on ruby) as the markup language for your projects documentation.
There is now a feature coming up: jBake, the static site generator used by docToolchain is already able to render markdown.
But there is now also a PoC which shows that it can also render restructuredText with a little help of a small python script:
http://doctoolchain.org/multi-markup-demo/demo/asciidoc.html
PS: I am the maintainer of docToolchain, so my answer might be biased

How to use Phonetic or Phoneme pronunciation in google text to speech?

I have been trying for a while to get Phonetic or Phoneme pronunciation working with google's text to speech but have not managed to get it performing consistently.
I have managed to get limited results from using https://tophonetics.com/
It translated "The cow went mad." to "ðə kaʊ wɛnt mæd." but the 'the' 'ðə'
was not audible. And when I tried "ðɪs ɪz səm fəˈnɛtɪk tɛkst ˈɪnˌpʊt".
Are there any SSML codes to define phonetic blocks of text,
that can be this format "D,Is Iz sVm f#n'EtIk t'Ekst 'InpUt"
can be used instead of "ðɪs ɪz səm fəˈnɛtɪk tɛkst ˈɪnˌpʊt"
"
Google Text-to-Speech supports the <phoneme> tag since at least spring 2021.
However, there are a lot of potential gotchas to overcome:
The demo page filters out <phoneme> tags on the client side before they even reach the API. (It does the same with the <voice> tag as pointed out here)
As with Microsoft Azure Text-to-speech (see the other answer for details), each language only supports a limited set of phonemes ("letters") that can be used.
If you use an unsupported one, the phoneme tag is completely ignored without any warning. So the official example <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme> does not work with any English variant but en-US, since all others lack the "o" or "oʊ" phoneme.
It's unclear if you need to use the v1beta1 API (which I can confirm is working) or if version v1 is also ok.
There is the SSML tag <phoneme> that serves your purpose.
Unfortunately, it's currently not supported in Google Cloud Text-to-speech. The available subset of SSML tags for Google Cloud is listed in the documentation. The <phoneme> tag is not in this list. An experiment using Google Cloud's text-to-speech-demo confirms that the phonemes are ignored. The content of the tag is being read as ordinary text, as has already been remarked by #Trevor in the comments.
The <phoneme> tag is, however, being supported by Microsoft Azure Text-to-Speech and Amazon Polly. In both cases, the available phonemes are limited to those available in the language being used (see here for Azure and here for Polly). The Azure documentation isn't 100% clear about the exclusion of out-of-language phonemes, but practical experiments with the Azure Text-to-Speech demo confirm that they're not working properly. In some cases, they at least seem to be replaced by the nearest available equivalent in the language used.
Being restricted to the phonemes of one language severely limits the usefulness of the phonemes tag. E.g., you can't used the feature to embed correctly pronounced content in a second language, as the second language will usually have some phonemes that are not available in the first language. Concrete language pairs in which each language has some phonemes that are not available in the other one are English/German, Spanish/German, English/Spanish.

Convert internal wiki page into PDF

My team uses internal wiki pages for all kinds of stuff. The pages are created with MediaWiki. I wonder if there is any way to convert the wiki pages into PDF format. I have to use it to convert the user documentation to PDF format, so that it can be shipped with the next release. I have seen the 'Download as PDF' option on wikipedia but our internal wiki does not have it. Is there any plugin available which would allow me to convert it?
The Collection Extension has been decommissioned. The last long term support version supporting it will end of life in June 2019 according to
https://www.mediawiki.org/wiki/Version_lifecycle.
The Wikimedia foundation has stopped any development of a replacement after some failed attempts according to https://www.mediawiki.org/w/index.php?title=Topic:Uxkv0ib36m3i8vol&topic_showPostId=uxsjbpkqfmgq1jyx#flow-post-uxsjbpkqfmgq1jyx.
Closed source alternatives are under development by a least two companies. A working open source solution is available as a Debian package, Windows binary, and online converter http://mediawiki2latex.wmflabs.org/, which has been developed by a German worker in a cowshed using the Haskell programming language.
Wikipedia uses the Collection extension with OfflineContentGenerator (OCG) for this purpose.

A technology for reading pdfs online with annotations?

is there an open source solution that displays PDFs for online reading? It has to be searchable much like google books and if possible has the ability to display annotations?
By "online reading" I'll assume you mean without a PDF reader plugin on the client. In that case you'll need to convert to HTML
http://pdftohtml.sourceforge.net/
If you don't mind losing the ability to copy text then converting to PNG may give you a more accurate rendering
http://www.imagemagick.org/
Regardless of the output format you can manage your searching using the original PDF data. One technology for this is mnogosearch
http://www.mnogosearch.org/
Monogosearch uses pdftotext internally, you may find this useful if you want to write your own search routines. pdftotext is part of the Xpdf suite of utilities
http://www.foolabs.com/xpdf/about.html
All of the tools listed above are available on Windows or Linux
You may also be interested in the Vuzit DocuPub Platform: http://vuzit.com/products/docupub_platform
The display technology itself is not open source, but they provide an API to access their service, so perhaps it is worth investigating.
Don't know if you are looking a software to install or some service to pay for...
I've read a lot about www.getbackboard.com (this is not advertising, only reporting something I've read about, that maybe fits your needs.. ;)
Not sure if they do annotations, but both of these will show PDFs quite well:
http://pdfmenot.com
http://docs.google.com
ICEPdf recently released their code as open source. It is Java based.
PyPdf is really nice. It supports reading the text as well as encryption which I know that itextsharp does not.
Of course you'd have to program in python as IronPython's class libraries aren't quite to the point where you can ref them from another language and use them. (But I imagine they will be someday soon)
PyPdf
This is not open source, but check it out anyways. You can download a free trial of their SDK to try it out. Reading PDF's and their annotations is not simple and I wouldn't trust a production app to open source decoders.
Here is an online demo.
http://www.atalasoft.com/ajaxannotations/default.aspx
Another good pdf reader is FoxitReader.

Software/Platform to Share Specs

What are the software/ Wiki you use to write and share your specs about the developers, testers and management?
Do you use Wiki system, and if so, what Wiki software you use?
Or do you use Sharepoint to manage and version the specs? One problem with SharePoint 2003 as specs platform is that it's very hard to collaborate among different people.
For backward compatibility sake, I would also like to have the platform able to import Microsoft Word seamlessly. And it would certainly help if the interface is similar to Microsoft Word.
Any idea?
I've used Confluence at a number of places, it's a pretty powerful wiki and very good for creating specifications that can be shared amongst various parties. See:
http://www.atlassian.com/software/confluence/
There's some more information here on the advantages of using Confluence:
https://stackoverflow.com/questions/170352/confluence-experiences
EDIT: I've updated this to deal with the Microsoft Word import feature you mentioned. Confluence supports this through the Office Connector here:
http://www.atlassian.com/software/confluence/plugins/office-connector.jsp
There's also a Sharepoint connector:
http://www.atlassian.com/software/confluence/plugins/sharepoint-connector.jsp
plus a whole bunch of plugins:
http://www.atlassian.com/software/confluence/plugins/sharepoint-connector.jsp
Some of these are user contributed also. I can't recommend Confluence enough as a commercial wiki.
I've also used JSPWiki, which is open source. it's ok but not as good as confluence, see:
http://www.jspwiki.org/
You could try Google docs - I have successfully used this in the past. It supports import / export to MS Word, and it has great support for multiple user - see http://www.brighthub.com/internet/google/articles/8236.aspx.
It supports versioning, allows you to chat with other people who are currently working on the document, and shows you a list of all the changes others have made to the document (without needing to close / reopen the document).
If you want corporate support, Google also provides that - see Google Apps for business.
We use SharePoint -- it's not ideal, but it does a decent job. If I were you, I would seriously look at getting off SharePoint 2003 and on to MOSS (SharePoint 2007). It's not perfect, but it's substantially better. Here's a little bit on using MOSS as a wiki. I think in general wiki's are a good tool for getting people up to speed on your system. We used to pass around "getting started documents" and now we have all that type of stuff in our developer portal.
Per John's comment, I looked up this feature comparison. I have to go back and look at what features I'm using that are not in WSS -- I might be paying for licenses I don't need! :)
We use email. I know it isn't elaborate, but it is easy to use. Everyone has it installed and there are no licensing issues. All spec changes are sent to an super set email distro indicating the updates and the location on the network share where the spec can be found.
We use Alfresco, in its Community version, from both its Share and Explorer web interfaces.
Quite useful, with a document library, wiki, forum and calendar.
We curently host about 1.8 Go consisting mainly in docs, versionned and sometimes automatically converted to PDF (by creating an automatic content rule).
FTP, WebDav and network share are also used to access to the same repository.
You could take a look at Microsoft Groove - the collaboration software that Microsoft bought a few years back.
It's bundled free with premium versions of Microsoft Office.
You can customize the workspace with discussion boards and can fairly seamlessly store collaboratively-edited Office documents.
We use MediaWiki for dos & specs. Wiki definitely wins anything like Microsoft Word or SharePoint - it allows you to develop a documentation in "first refer, then describe" = "divide and rule" way. Perfect for developers - they used to think the same way. The process of developing a documentation is almost ideal: you start from TOC and drill down until you write the document for every link you put earlier.
MediaWiki is quite customizable - there are lots of extensions there. The most necessary ones are:
Source code highlighter - CSO_Source
Our own templates integrating wiki with class reference.
Others are InterWiki, FileProtocolLinks, YouTube (we use customized version of it to display HD video), ReCaptcha, SpecialDeleteOldRevisions, Maintenance.
Some integration examples are here.
And we use Google issue tracker to track the issues. Its main advantages:
Imput usability: the process of adding\changing the issue is really convenient there. Earlier we tried Track Studio - the same actions require 2-3 times more time there, so it died fast simply because most of us hated to use it.
Customizable grids. See the examples. Really helpful.
Atom\RSS support. So everyone knows what's going on.
There is a Gurtle tool integrating it with TortoiseSVN. Really helpful.
Its main disadvantage is that it can't be closed from the public access. This makes it simply unusable in many cases.
If you want a UI similar to Word, why not use Word with SharePoint 2007? You're on 2003 so the experience is there. Upgrade to SharePoint 2007 and you can have the collaboration, Word features, document sharing, and so on.
This is the kind of thing Microsoft wants people to use Office for, so there's a ton of doco out there about how to configure your SharePoint and Office environment to support collaboration.
There is something that Google do in this direction and it looks really cool: wave.google.com. It would be a great step in collaboration and worth to wait it.
Here we use Google Docs it makes the documents available to everyone write or read only, public or private among people that have or not Google accounts, it also can import Word docs, not to mention that it runs directly into the browser so it has high availability with zero cost and zero setup, also its computer/OS agnostic, we have a nice experience with it.
Also perhaps you should take a look at Basecamp or Backpack at 37Signals, any of then might also fit your bill.
We use DocBook for all of our specifications (and other customer-facing documentation). DocBook is an XML format that lets you easily generate documents in just about any format, including PDF, which is how we distribute things to clients to get them signed off. We can divide a document into files (by section) and commit everything to our source control system (Subversion). Because it is all XML (i.e. text-based), Subversion's automatic merging and conflict resolution works great if two people work on the same file. We have a set of stylesheets that all of our documents use, so all documents share the exact same style/format, with no extra work on our part.
And if you don't like editing XML files directly, there are GUI front-ends that provide a reasonably WYSIWYG-like experience. I believe that most people in my office use XMLMind. Still, we happen to all be technical people so if we had to write XML directly it wouldn't be an issue.
As a sidenote, we also put out release notes. We have some XSLT that lets us write documents like this:
<bugs>
<bug id="1234" component="web">JavaScript error when clicking the Kick Me button</bug>
</bugs>
We then have a script that runs through our Subversion repository doing an svn log from the previous release tag to the current release tag, and some Bugzilla integration to automatically generate release notes on-the-fly.
(also, for most internal-only documentation, we use MediaWiki, which is also a great way to collaborate.)
We use OnTime. It was originally only used for defect tracking, but we've started using it to track features as well. These can be used to document the feature as it evolves during development. Features can be grouped together into sprints or releases, and time can be tracked against each feature. If you are using SCRUM, you can also plot burn-down charts for each sprint. It also has wiki functionality.