I'm writing a quick front end to display guitar tablature. The front end is in Flash but I want to store the tab in some human-readable format. Anyone know of something that already exists? Any suggestions on how to go about it? One idea I got from reading some stackoverflow posts was to use a strict ASCII tab format like so:
e||-1------3--------------0--|----2-------0---
B||--1-----3------------1----|----3-------0---
G||---2----0----------0------|----2-------1---
D||----3---0--------2--------|----0-------2---
A||----3---2------3----------|------------2---
E||----1---3----3------------|------------0---
It has advantages. I can gain a lot of info from the structure (how many strings, their tunings, the relative placement of notes) but it is a bit verbose. I'm guessing the '-'s will compress away pretty well when sent over the wire.
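As a rough illustration of how much can be recovered from that structure, here is a minimal Python sketch. Python is only a convenient choice for the example, and `parse_ascii_tab` is a made-up name. It assumes every line starts with the string name followed by `||`, reads the tuning from the left margin and the fret numbers with their column offsets, and checks the guess that the dashes compress well.

```python
import re
import zlib

TAB = """\
e||-1------3--------------0--|----2-------0---
B||--1-----3------------1----|----3-------0---
G||---2----0----------0------|----2-------1---
D||----3---0--------2--------|----0-------2---
A||----3---2------3----------|------------2---
E||----1---3----3------------|------------0---
"""

def parse_ascii_tab(text):
    """Return (tuning, notes); notes are (column, string_name, fret) tuples."""
    tuning = []
    notes = []
    for line in text.splitlines():
        # Everything before the first '||' is taken as the string name.
        name, _, body = line.partition("||")
        tuning.append(name)
        for match in re.finditer(r"\d+", body):
            notes.append((match.start(), name, int(match.group())))
    # Order notes by horizontal position, i.e. relative placement in time.
    notes.sort(key=lambda note: note[0])
    return tuning, notes

tuning, notes = parse_ascii_tab(TAB)
raw = TAB.encode("ascii")
print(tuning, len(notes), len(raw), len(zlib.compress(raw)))
```

Note durations are not recoverable this way; only the relative order of notes is.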
If anyone knows of an existing data-format for describing guitar tab I'll take a look as well.
edit:
I should note that this format is 90% for me and may never be seen by anyone other than myself. I want an easy way to write tab files that will eventually be displayed as graphics in a Flash front end, and I don't want to have to write an editor front end.
Check out the ASCII tab format. There is also a great description of the format here:
http://www.howtoreadguitartabs.net/
ASCII export would be a great feature, but using ASCII as the internal data format is not a good idea. For example, note durations would be extremely hard to express (how would you store 32nds or even 16ths, not to mention triplets?), so parsing those files would be extremely difficult. Moreover, users would be tempted to load ASCII files created outside your app, which would be likely to fail.
To sum up, I'd recommend either reusing an existing format or, if that's not feasible, inventing your own. You could try XML for that.
EDIT: Besides DGuitar, I know of TuxGuitar and KGuitar, which support Guitar Pro files. You can look into their sources or ask their authors about file formats. I think there is also an open source PowerTab-to-ASCII converter.
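To make the duration concern concrete, here is a small sketch of what a structured (non-ASCII) representation could look like. Everything in it is an assumption for illustration: the field names `string`, `fret` and `duration` are invented, durations are written as fractions of a whole note, and JSON stands in for whatever container you choose (XML would work the same way).

```python
import json
from fractions import Fraction

# Hypothetical note records; durations are fractions of a whole note.
notes = [
    {"string": 1, "fret": 5, "duration": "1/16"},
    {"string": 3, "fret": 7, "duration": "1/32"},
    # One note of an eighth-note triplet: three notes in the time of a quarter.
    {"string": 3, "fret": 5, "duration": "1/12"},
]

def total_duration(notes):
    """Sum the durations exactly, without floating point rounding."""
    return sum(Fraction(note["duration"]) for note in notes)

serialized = json.dumps(notes, indent=2)   # still human-readable
restored = json.loads(serialized)
print(total_duration(restored))
```

Storing the duration as a fraction string keeps triplets exact and the file easy to edit by hand.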
See Supported file formats in TuxGuitar.
TuxGuitar is open-source multiplatform software for reading, writing and playing the guitar tabs.
It supports the mentioned Guitar Pro and PowerTab format, and it also has its own TuxGuitar (.tg) format.
If you need the backend data structure to remain in human readable form I would probably stick it in a CDATA inside of XML. That could be inserted into a relational database with song/artist/title information and become searchable. Another option is to save it as zipped text files and insert links to those files in a database with the main artist info still searchable by sql.
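A minimal sketch of the CDATA idea, using Python's standard `xml.dom.minidom`. The element and attribute names (`song`, `tab`, `artist`, `title`) are invented for the example, not part of any existing tab format.

```python
from xml.dom import minidom

def wrap_tab(artist, title, ascii_tab):
    """Wrap a plain-text tab in a CDATA section inside a small XML document."""
    doc = minidom.Document()
    song = doc.createElement("song")
    song.setAttribute("artist", artist)
    song.setAttribute("title", title)
    tab = doc.createElement("tab")
    # CDATA keeps the tab readable as-is; it must not contain ']]>'.
    tab.appendChild(doc.createCDATASection(ascii_tab))
    song.appendChild(tab)
    doc.appendChild(song)
    return doc.toxml()

def read_tab(xml_text):
    """Return the tab text stored in the first <tab> element."""
    tab = minidom.parseString(xml_text).getElementsByTagName("tab")[0]
    return "".join(node.data for node in tab.childNodes)

xml_text = wrap_tab("Some Artist", "Some Song", "e||-1--\nB||--1-")
print(xml_text)
print(read_tab(xml_text))
```

The artist and title attributes are what you would copy into searchable database columns.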
These are not human readable:
Most common formats are Guitar Pro (proprietary) and PowerTab (freeware). DGuitar and TuxGuitar are open source viewers for Guitar Pro format. I'm sure that they have documentation for the format somewhere (at least in the code).
The advantage of using a common format would be the ease of making tabs with those programs.
The Guitar Pro 4 format is described here http://dguitar.sourceforge.net/GP4format.html
I wrote a quick utility for displaying tab. For personal use. You can happily take the internal format I used.
I use a very simple string based format. There are three important structures.
Column, a vertical column in the output tab - all notes played simultaneously.
Bar, a collection of Columns
Motif, a collection of Bars
A Column looks like '*:#|*:#|*:#' where each * is a string number and each # is a fret number. If you are playing a chord you separate each string:fret pair with a '|'.
A Bar looks like '[*,*,-,*]' where each * is a Column. A - indicates an empty column where no notes are played.
A Motif is just many Bars running back to back. For instance:
"[1:5,-,3:7,-,3:5,-,3:7,-,-,3:5,3:7,-,1:8,-,1:5]"
e||---------------|---------------||
B||---------------|---------------||
G||---------------|---------------||
D||--7-5-7--57----|--7-5-7--57----||
A||---------------|---------------||
E||5-----------8-5|5-----------8-5||
"[-,-,1:3|2:2|3:0|4:0|5:3|6:3,-,-][-,-,3:0|4:2|5:3|6:2,-,-]"
e||--3--|--2--||
B||--3--|--3--||
G||--0--|--2--||
D||--0--|--0--||
A||--2--|-----||
E||--3--|-----||
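For anyone who wants to play with this format, here is a rough Python sketch of a parser and renderer for it. This is not the original utility; the function names are made up, and it assumes string 1 is the low E string, which is inferred from the two examples above.

```python
import re

# Assumption inferred from the examples: string 1 is low E, string 6 is high e.
STRING_NAMES = ["E", "A", "D", "G", "B", "e"]

def parse_motif(motif):
    """Parse '[col,col,...][...]' into bars -> columns -> {string: fret}."""
    bars = []
    for bar_text in re.findall(r"\[([^\]]*)\]", motif):
        bar = []
        for col_text in bar_text.split(","):
            col_text = col_text.strip()
            column = {}
            if col_text and col_text != "-":
                for note in col_text.split("|"):
                    string_no, fret = note.split(":")
                    column[int(string_no)] = int(fret)
            bar.append(column)
        bars.append(bar)
    return bars

def render(bars):
    """Render parsed bars as ASCII tab, highest string on top."""
    lines = []
    for string_no in range(len(STRING_NAMES), 0, -1):
        rendered_bars = []
        for bar in bars:
            cells = []
            for column in bar:
                # Pad every column to its widest fret number (e.g. '12').
                width = max((len(str(f)) for f in column.values()), default=1)
                fret = column.get(string_no)
                cells.append("-" * width if fret is None
                             else str(fret).ljust(width, "-"))
            rendered_bars.append("".join(cells))
        lines.append(STRING_NAMES[string_no - 1] + "||"
                     + "|".join(rendered_bars) + "||")
    return "\n".join(lines)

print(render(parse_motif(
    "[-,-,1:3|2:2|3:0|4:0|5:3|6:3,-,-][-,-,3:0|4:2|5:3|6:2,-,-]")))
```

Run on the chord motif, this prints the same six lines as the second example above.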
I am trying to make a dynamic PDF generator as a .NET Core API. I want to take an existing PDF or .docx file and edit it so that the current name (John Doe) is replaced with a placeholder such as #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters used in the PDF can be used to replace. So if I only use A, B, C as characters, it will make D into Times New Roman (or default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for clearly specifying what you've already found. It helps a lot in providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft to do the docx->odt conversion. It's not good as you know. Use LibreOffice itself to do this step. The rest of your process remains the same.
For some documents, LibreOffice does doc->odt much better. So you can work with the DOC format instead and get a better result without any other changes.
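For reference, the unzip / search-and-replace / re-zip step of option #2 can be sketched in a few lines of Python (the same idea ports to .NET). The function names are made up for the example. It assumes each placeholder appears as one unbroken string in content.xml; if the editor split it across formatting spans, a plain replace will miss it. The `soffice` flags are the ones already mentioned in the question plus `--headless` and `--outdir`.

```python
import subprocess
import zipfile
from xml.sax.saxutils import escape

def fill_odt(template_path, output_path, values):
    """Copy an .odt, replacing placeholders in content.xml with escaped values."""
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                for placeholder, value in values.items():
                    # Escape &, < and > so the document stays valid XML.
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            # The 'mimetype' entry of an ODF package is stored uncompressed.
            compress = (zipfile.ZIP_STORED if item.filename == "mimetype"
                        else zipfile.ZIP_DEFLATED)
            dst.writestr(item, data, compress_type=compress)

def convert_to_pdf(odt_path, out_dir):
    """Run LibreOffice headless; requires soffice on the PATH."""
    subprocess.run(["soffice", "--headless", "--convert-to", "pdf",
                    "--outdir", out_dir, odt_path], check=True)
```

Usage would be along the lines of `fill_odt("template.odt", "filled.odt", {"#NAME_PLACEHOLDER": "John Doe"})` followed by `convert_to_pdf("filled.odt", "out")`.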
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean: don't expect perfect conversion of arbitrarily complex documents. Starting from a working baseline, even a complex one, is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.
I have to extract text from PDF files of invoices and bills.
The file layouts can get complex, though they're mostly filled with tables.
I've already read a few dozen articles about the PDF format, how easy it is for our brains to grasp it and how hard it is for a machine to understand its structure.
I also downloaded a few tools, like Python's pdfminer and some Java tools. Some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called exportPdf, but it can't be customized.
Bottom line: I understand that in order to extract text from structured PDF files and convert it to XML, for example, there has to be some level of manual work.
I also found Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try to convert those files to images and run tesseract-ocr on them, but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience could give me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
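To illustrate the ordering problem, here is a small Python sketch that rebuilds reading order from positioned text fragments. The input shape is an assumption: a list of hypothetical `(x, y, text)` tuples with y increasing down the page, which is roughly what you can derive from the positioned output of most extraction libraries. It is not the API of any particular tool.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right.

    fragments: iterable of (x, y, text); y is assumed to grow downward.
    """
    lines = []  # list of (y_of_line, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda frag: frag[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            # Close enough vertically: same visual line.
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]

# Fragments in file order: 'world' is drawn before 'hello'.
print(reading_order([(20, 100, "world"), (10, 100, "hello"), (10, 120, "total")]))
```

Table extraction builds on the same idea, with an extra step that clusters the x coordinates into columns.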
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear if you are looking for the development tool to automate the data extraction from bills and invoices or just for the one time tool (utility) that can be used by the non-developer?
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on the Tesseract OCR engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.
I'm turning to your expert advice because I am sort of "new" to Objective-C. I have read a couple of books and docs (namely Aaron Hillegass's and Stephen G. Kochan's books), but some things are still unclear to me for lack of practice.
To put you in context, I have a NSDocument project that uses Core Data for storage.
I struggle with 2 things right now: reading/writing to files, and table views ^^
So my first question is about Core Data: is it only able to save in SQL, XML or binary format?
Or can I use Core Data to read/write in any format, according to what I declared in the plist file?
I am trying to work with .po files, and I want to display the translations in a table view containing 2 columns (1 for the msgid and the other for the msgstr).
To read and write files in the po format and display lines in my table view, I most likely need to parse the files using line endings and characters such as "#" as delimiters.
I haven't gotten around to doing that yet (I have no idea how to do that yet!), but I would like to know if it is possible or if I need to restart my project without Core Data...
Please DO NOT just throw links to the apple documentation at me, it's the most confusing thing ever, and feels like it's made for experts only! I need me some human-readable explanations :)
Thanks a bunch for any help and advice you can give me!
It is possible to write a different storage format for Core Data, but it is not easy and it sounds like you are not at a level where that is a possibility (no shame there, I'm not either).
If you are only displaying data from the .po files then there is no need to use Core Data. Core Data is meant to provide a file storage solution: you create and edit data and save it using Core Data. If you have no intention of creating and editing data, get rid of Core Data; it will only get in the way.
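Without Core Data, the parsing itself is not much work. Here is a deliberately simplified sketch in Python, used only because it is compact; the same line-by-line logic ports to Objective-C. `parse_po` is a made-up name, and it only handles `msgid`, `msgstr`, continuation lines and `#` comments. It ignores plural forms and `msgctxt`, and leaves escape sequences such as `\n` as they are.

```python
def parse_po(text):
    """Return a list of (msgid, msgstr) pairs from .po file contents."""
    entries = []
    fields = {}
    current = None  # which field a continuation line belongs to
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no translation text
        if line.startswith("msgid "):
            if "msgid" in fields:
                entries.append((fields["msgid"], fields.get("msgstr", "")))
            fields = {"msgid": line[6:].strip()[1:-1]}  # drop the quotes
            current = "msgid"
        elif line.startswith("msgstr "):
            fields["msgstr"] = line[7:].strip()[1:-1]
            current = "msgstr"
        elif line.startswith('"') and current:
            fields[current] += line[1:-1]  # continuation of a long string
        else:
            current = None  # unsupported keyword (plural forms, msgctxt, ...)
    if "msgid" in fields:
        entries.append((fields["msgid"], fields.get("msgstr", "")))
    return entries

sample = 'msgid "Hello"\nmsgstr "Hola"\n'
print(parse_po(sample))
```

Each pair maps directly onto one row of a two-column table view; the entry with an empty msgid is the file header.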
I have a PDF file which has the mark list of a certain exam.
I am particularly interested in the first list, but which unfortunately has 2112 entries. And they aren't properly formatted. I need to sort all these entries (based on marks in last 2 columns- sum of marks in Aptitude and Computer), to know what my rank is.
I tried to copy it into MS Word and Excel, but if you try it, you can see it won't help. After pasting it into a plain text file, I tried to format it using regular expressions (in Notepad++) and wrote some code in C to separate each field with '\t' (so that later I could copy them into an Excel sheet), but the inconsistency made me fail (some entries span multiple lines, and the "names" do not have a fixed number of fields).
Can someone come up with any idea that will make it possible to copy the first list in PDF to a spreadsheet in tabular form exactly as the original file?
1. For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
   Why Updating Dollars for Docs Was So Difficult
2. For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting point '1.' above! -- see these links:
   Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
   Tabula-Extractor: A Command Line Interface to Tabula
   Tabula source code repository
   Tabula API (upcoming, not ready yet)
Well I sort of managed it. I first copied it to a plain text file, deleted all letters from it leaving only the serial number and corresponding marks, separated by spaces or tabs. Then using "import" in an OpenOffice Spreadsheet, told it the delimiters are spaces and tabs (combine them if necessary) and bingo! I got my rank.
But I would still like to know if it's possible to copy the whole table as it is. So keeping this question open.
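For what it's worth, the same trick can be scripted. The sketch below assumes a layout I am guessing at from the description: each usable line of the pasted text starts with the serial number and ends with the two marks, with any number of name fields in between. Lines that don't fit (for example entries that wrapped onto a second line) are simply skipped, so check the count of parsed rows against the expected 2112.

```python
def rank_entries(lines):
    """Return (serial, total) pairs sorted by total marks, highest first."""
    rows = []
    for line in lines:
        tokens = line.split()  # splits on any run of spaces or tabs
        if len(tokens) < 3 or not tokens[0].isdigit():
            continue
        try:
            total = float(tokens[-2]) + float(tokens[-1])
        except ValueError:
            continue  # last two fields are not numbers: not a full entry
        rows.append((int(tokens[0]), total))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Made-up sample lines, only to show the expected shape.
sample = ["1 JOHN DOE 40 35", "2 A B C 50 45", "continued name", "3 X 10 20"]
ranked = rank_entries(sample)
my_rank = [serial for serial, _ in ranked].index(1) + 1
print(ranked, my_rank)
```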
I was once tasked with building a parser that would extract data from a PDF with tabular and non-tabular data in a number of different encodings and with a mix of RTL and LTR text. That project took quite some effort, but with a simple English table you should be able to dissect the PDF in no time. Look for the PDF specs on adobe.com and, if you're that desperate, start digging in.
Also you'll first need to use pdftk.exe to uncompress the file.
A shortcut that may be of aid:
http://www.adobe.com/devnet/pdf/pdf_reference.html
This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx
So I have some Spanish content saved in Excel that I am exporting to .csv so I can import it with the Firefox SQL manager add-on into a .sql db. The problem is that when I import it, whenever there is an accent mark (or whatever the technical name for those things is), Firefox doesn't recognize it and produces a big black diamond with a white ?. Is there a better way to do this? Is there something I can do to have my Spanish content readable in a SQL db? Maybe a better program than the Firefox extension? Please let me know if you have any thoughts or ideas. Thanks!
You need to follow the chain and make sure nothing gets lost "in translation".
Specifically:
assert which encoding is used in the CSV file; ensure that the special characters are effectively in there, and see how they are encoded (UTF-8, a particular code page, ...)
ensure that the SQL server can
a) read these characters and
b) store them in an encoding which will preserve their integrity. (BTW, the encoding used in the CSV can of course be remapped to some other encoding of your choosing, i.e. one that you know will be suitable for consumption by your target application)
ensure that the database effectively stored these characters ok.
see if Firefox (or whichever "consumer" of this text) properly handles characters in this particular encoding.
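The first step of that chain can be checked with a few lines of Python. This is only a sketch under an assumption: that the file is either already UTF-8 or in Windows-1252, which is a common default for CSV exported from Excel on Western-language Windows. The file names in the usage comment are hypothetical.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return the bytes re-encoded as UTF-8.

    Tries UTF-8 first; if the bytes are not valid UTF-8, assumes `fallback`.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return text.encode("utf-8")

# Usage with hypothetical file names:
# with open("export.csv", "rb") as src:
#     data = to_utf8(src.read())
# with open("export-utf8.csv", "wb") as dst:
#     dst.write(data)

print(to_utf8("año".encode("cp1252")).decode("utf-8"))
```

The black diamond with a question mark is the Unicode replacement character, which usually means bytes in some other encoding were read as UTF-8.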
It is commonplace but useful, for this type of inquiry, to recommend the following reading assignment:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky