How? Question is simple, hope the answer will be as well.
If it was imported, as @Dementic says, you'll need to trace it using Modify > Bitmap > Trace Bitmap.
If you created it in Flash using the Text tool, then it is already in a vector format. In versions of Flash newer or older than CS5, the default (or only) option is Classic Text, whereas in CS5 the default is the TLF format introduced in that version. If you're not sure about the difference between TLF and Classic Text, Andy Anderson can explain.
Reading between the lines of your concise question, you're probably trying to break apart Classic Text, so here's how:
Use the Text tool (T) and type your text.
Select the text with the Selection tool (V).
From the menu, select Modify, then Break Apart (or press Ctrl+B).
Finally, use the Subselection tool (A) and click the text edge (not the inside fill); you'll see the cursor change to an arrow with a black dot.
I have a document that was created from a scanned document using Acrobat XI Pro's text recognition tool, with these parameters: language: Spanish; PDF output: ClearScan; downsample to 600 dpi.
It worked rather well, with only small problems that can easily be overlooked. However, I use Foxit PDF Reader to actually read PDFs (I have a slow PC), and there is an "a" glyph that looks normal in Adobe but looks filled in Foxit, without the empty space at its center. The problem exists only for the italic lowercase "a".
(example of problem). There are lots of lowercase italic a's, on almost every other page. I use this book to study for a core course in my degree; it's the best one in Spanish at our school's library, so I read it almost every day, and the problem is quite annoying (example 2).
Some instances of that italic lowercase "a" show up fine in Foxit: the a's in "plantación" are normal.
Sample pages, the first page has normal a's, the second has filled a's
Could I copy the normal-looking "a" glyph and use it to replace the one that causes the problem? If so, what software would I need?
Thanks for reading this.
Yes, it is possible to change the ClearScan type (Fd1428390-Identity-H) to a conventional font; here it was changed to 11pt Times Roman Italic. I also changed the colour, size and bold setting to demonstrate the effects, but you only need to use one combination.
This change is allowed in the free version of Tracker PDF-XChange Editor, but beware: if not done cautiously, text edits could trigger demo watermarks.
Select the "edit text only" option from the buttons, then select the text and, with the properties pane active (on the right), make your changes. If you see the demo banner appear, press Ctrl-Z and try a different approach.
There are many GUI automation tools that allow clicking on a specified image (well-known Sikuli, for example). Is there any way to click on the specified text, not image? This way the tool will:
make screenshot
recognize text on it
find text position (somehow)
send click event to this position
It would be much easier to write tests using this approach (many interfaces have text button, inputs etc.) rather than make screenshots for every single element.
I've seen an OCR feature in Sikuli, but it didn't work for me (I tried invoking click('some-text-here')).
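Steps 3 and 4 of the list above come down to mapping a recognised string to a screen coordinate. A minimal sketch in Python, assuming some OCR engine has already returned one bounding box per word; the `WordBox` type and the `find_click_point` function are hypothetical names used for illustration and are not part of Sikuli or any OCR library:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordBox:
    """One OCR result: recognised text plus its bounding box in screen pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


def find_click_point(boxes: List[WordBox], target: str) -> Optional[Tuple[int, int]]:
    """Return the centre of the first box whose text equals target, else None.

    The comparison ignores case and surrounding whitespace.
    """
    wanted = target.strip().lower()
    for box in boxes:
        if box.text.strip().lower() == wanted:
            return (box.left + box.width // 2, box.top + box.height // 2)
    return None


# Hypothetical OCR output for a dialog with two buttons.
boxes = [
    WordBox("Cancel", left=100, top=200, width=60, height=20),
    WordBox("OK", left=180, top=200, width=40, height=20),
]
point = find_click_point(boxes, "ok")
```

The returned point would then be handed to whatever mechanism sends the click event. Matching whole words exactly is the simplest choice; multi-word labels or OCR misreads would need fuzzier matching.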
Sikuli's built-in OCR features are pretty buggy and unstable. All (or at least most) of the related issues are listed in this BUG. However, there are a few possible workarounds, although they are not always applicable.
If the text is known, you can take a screenshot of the text and then search for that image. For example, if you know the exact font of this text, you can automatically generate such text on the screen and use it as a pattern to locate it elsewhere.
The built-in Tesseract-based OCR usually performs significantly better when the font is bigger, "fatter" and in grayscale. Hence you might do some background image processing before attempting the actual recognition. I used ImageMagick to resize and filter the images for better recognition. It can run in the background as a command-line tool. For example:
convert -filter spline -resize 100x -unsharp 10x20 -type Grayscale
I am aware that this does not answer your question directly but these are steps you might consider taking towards the final solution.
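To run that preprocessing step from a test script, the same options can be passed to ImageMagick through a subprocess. This is only a sketch: the function names and the file names are placeholders that are not in the command above, and it assumes the `convert` binary is on the PATH and takes the input file first and the output file last.

```python
import subprocess
from typing import List


def build_convert_command(src: str, dst: str) -> List[str]:
    """Build the ImageMagick call with the options quoted above.

    src and dst are placeholder paths supplied by the caller.
    """
    return [
        "convert", src,
        "-filter", "spline",
        "-resize", "100x",
        "-unsharp", "10x20",
        "-type", "Grayscale",
        dst,
    ]


def preprocess(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(src, dst), check=True)


# Example (needs ImageMagick installed, so it is left commented out):
# preprocess("screenshot.png", "screenshot-ocr.png")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.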
I'm a developer at Deskover, and we are currently developing an application, UiPath Studio, that meets your needs.
We provide text recognition on various technologies with 100% accuracy, the ability to find specific text in an area of the screen, a control or an entire window, and the ability to click text or controls.
You can execute different actions sequentially by creating workflows.
We at Deskover are big fans of Sikuli project. We actually use the same image recognition engine in UiPath Studio.
UiPath Studio is a visual tool that helps you create workflows easily, but you can also use the underlying API and implement an application that extracts text and clicks on it.
You can find more details about the UiPath library here.
When I draw a small circle in LibreOffice Draw and export it to PDF, I get some extra dots around the circle, especially at the upper-left and lower-right outer corners of the circle.
See example PDF here: https://dl.dropbox.com/u/233922/example-dots-circle.pdf
or as a Screenshot here:
Do you have any idea how I can get rid of this?
It is an old bug that has not been fixed yet. I can reproduce it under Linux and Windows. My version: LibreOffice 4.1.0.
Create a new file in LO Impress or LO Draw.
Draw an ellipse (or a rounded rectangle, a smiley, etc.).
Set the line width to e.g. 5 mm (for a better view).
Export as PDF.
I propose two workarounds:
Export to MS PowerPoint format and export the PDF from PowerPoint :/
Print to PDF (using e.g. cups-pdf).
ad 1) You must have MS PowerPoint, and your graphics may look bad.
ad 2) I use cups-pdf and the PDF looks very good, but:
Text is stored as bitmap graphics (small rectangles)! You cannot extract the text without using OCR.
You must use a paper format from the list (A4, A0, Letter, etc.). If you use a non-standard paper format, you must pick a bigger format and you will get white bars in the PDF. However, you can use pdfcrop to remove the white bars.
The PDF is always oriented horizontally. If you print vertically, you can rotate the PDF using the pdf270 command-line tool.
In Adobe Reader (version 11 at least), go to "Preferences" > "Page Display" and uncheck "Enhance thin lines".
LibreOffice seems to add dots of zero size that are practically invisible. When "Enhance thin lines" is checked, Adobe Reader makes these dots visible.
Best wishes,
Patrick
Similar to dzwiedziu-nkg's answer (https://stackoverflow.com/users/1797782/dzwiedziu-nkg), I need a multi-step process to fix this issue.
Steps:
Open the file in a pdf viewer (Document Viewer for me in Ubuntu.)
Print the pdf to a file (also a pdf) from the viewer. I assume this also uses cups-pdf, as it modifies the image size. (I don't mind, because I use the next step to eliminate all margins anyways.)
Use pdfcrop to remove all the extra space around the actual content's bounding box. If you just give pdfcrop one argument, it doesn't overwrite the old file, so use the same argument twice:
$ pdfcrop monkey.pdf monkey.pdf
Another "workaround" that worked for me:
Go without outline. You can set the line style in Draw to "none" and just work with flat solid objects.
PS: I see these dots also in Draw, not just in the exported pdf.
A simple workaround is to "patch" the dot in LibreOffice Draw using a white object, say a square with a white fill and a white outline. Note that you cannot see the dot in Draw. So you first generate the PDF from the original drawing, see where the dot appears in the PDF, go back to Draw, and add a white patch where it is required.
Searching for a workaround myself, I've found this awk script called odg2epsfix that will fix the exported EPS to not contain those ghost dots anymore.
I stumbled upon it in this launchpad bug entry.
Fixed in LibreOffice pre-export.
Steps:
Right click on the circle in LibreOffice and select "Line"
On the "Line" page, set "Corner Style" to "-none-"
Save document and Export as PDF.
The dot is gone without disabling "Enhance thin lines". Mine still shows in the preview but doesn't print.
The bug is still present in LO 6.0. But if you set "Cap style" to "flat" in the "Line" tab of the "Graphic Styles", the dots disappear from the screen and from the exported pdf.
I'd like to extract all of the information (formatted text, images, etc) from powerpoint slides into a flowing, readable (MS Word-style) format.
I'm not interested in keeping the slide concept at all--think of taking class slides from a college course and batch converting them all into one collective study guide.
I can't find a way to do this within PowerPoint (though if you know of one, please share!), and
I don't have experience scripting Office apps. Is this kind of thing easily done? Does this kind of script already exist somewhere?
Clarification:
In an earlier version of this post, I used the word "flowing" to refer to a slide-free (MS Word-like) format. This does not, however, refer to the actual formatting of slide content. So keeping bullet lists, etc. is fine and even desirable.
I don't see this being a simple task. College professors use a format of either "TITLE: BULLET POINTS OR IMAGE" or "EVERY WORD I'M ABOUT TO SAY" for their slides in my experience, and you're just not going to get flowing, readable text from the former no matter what you do. For the latter, you've already got your text, you just have to copy it to another document.
I think you might as well just open the PowerPoint, select all the text, and copy+paste into Word/Publisher/InDesign/your favorite page layout program. You'll have the same effect and the same amount of editing after the fact except without all the hassle of writing a program to do it for you.
Doing a Print operation to a PDF with the N-up options might be a good solution for handouts if that's all you need. You could expand the idea and condense ALL the slide decks into one, get it printed (with N slides per page and the note space next to it) and bound, and voila, instant study guide. I've seen that, and then you get options for note taking.
More power to you if you're doing this just because you can - don't let me stop you. There is much good learning to be had that way. You might want to look into writing a program using the Microsoft.Office.Interop namespace in .NET (starting at http://msdn.microsoft.com/en-us/library/bb772069.aspx ), or perhaps look on CPAN ( http://search.cpan.org/search?mode=all&query=powerpoint ) and do it with Perl! There are lots of ways to do it, but you've got to be up for the challenge.
Text is fairly simple to extract, but what text do you want? The text from the title and body text placeholders only? File, Save As, and choose to save the outline.
The other text on the slide? That can be pulled out to a text file programmatically, but in what order? Suppose you have a complex diagram with text callouts. Extracting the text is going to give you gibberish. There's no obvious/meaningful order to the text other than what the human viewer supplies by noting that "Ah. The arrow next to this bit of text points to the fribulator sub-assembly, so must relate to it in some way." Try doing that in code. ;-)
You could give the author a way to sort the text into reading order so that the code knows what order to extract it in, but that would require a fair amount of work on the part of the author.
If you can be certain that all of the content is in title+bullet form, no worries. Otherwise, you'd have to be able to articulate exactly what you want extracted, in what form and in what order before you could get anywhere with this.
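To make the ordering problem concrete, here is a sketch of the usual fallback when no author-supplied order exists: sort text boxes top to bottom, then left to right. The `TextBox` type, the `reading_order` function and the `row_height` parameter are hypothetical names for illustration, and positions are assumed to be already extracted from the slide. It is a heuristic only, and it will produce exactly the kind of gibberish described above for diagrams with callouts.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """A piece of slide text with the position of its top-left corner."""
    text: str
    left: float
    top: float


def reading_order(boxes: List[TextBox], row_height: float = 10.0) -> List[str]:
    """Sort boxes top-to-bottom, then left-to-right within a row.

    Boxes whose top coordinates fall into the same row_height band are
    treated as one row. This banding is crude: two boxes just either side
    of a band boundary end up in different rows.
    """
    ordered = sorted(boxes, key=lambda b: (int(b.top // row_height), b.left))
    return [b.text for b in ordered]


# Hypothetical slide: a title and two callouts at roughly the same height.
slide = [
    TextBox("Right callout", left=300, top=100),
    TextBox("Title", left=50, top=5),
    TextBox("Left callout", left=20, top=104),
]
ordered_text = reading_order(slide)
```

For title+bullet slides this gives a sensible order; for anything else, only an order supplied by the author can do better.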
MS Word-style is not only readable, but writeable as well (which was not specified in your requirements). If you want a read-only guide, PDF is your natural choice (either through Acrobat Distiller or LibreOffice). Combine individual Acrobatted presentations with PDFtk, or Acrobat or Foxit and you're good to go without any programming at all.
"Is this kind of thing easily done?" - Yes, your humble servant did a couple of similar scripts ages ago (extracting enhanced metafiles from Powerpoint slides).
"Does this kind of script already exist somewhere?" - Yes. Probably in hundreds of places, but I'm not sure if any of them got posted to the 'Net. All things considered, I think you'd be better off learning some scripting and macro programming on your own, since a ready-made script may not quite fit your needs, and understanding and rewriting it would take you more time than coding and debugging from scratch.
Since you mention that title+bullet form is ok, open the file, choose to save as and pick Outline as the save-as type.
I think you could parse through the PowerPoint file for formatting, text and pictures. There are Visual Studio namespaces available for such a task. You open the file, parse through it and build a Word file from the results. It is complicated work, as you would have to consider the type of each element and its position, and you would have to use a temporary structure for each slide.
Have a look at this sample code:
http://msdn.microsoft.com/en-us/library/office/gg278331.aspx
How to: Get All the Text in All Slides in a Presentation
Basically, using C# and the Open XML SDK 2.0, it loops through all the slides in the presentation and appends the text from every slide to a string builder. You can write the result out to a text file if you like (modification required).
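The same idea can be sketched without the SDK. This is not the linked C# sample; it is a Python standard-library approximation that relies on a .pptx file being a zip archive whose slides live at `ppt/slides/slideN.xml`, with text runs stored in DrawingML `a:t` elements. The function name `slide_texts` is made up for illustration, and slides are keyed by file number, which is an assumption: the actual presentation order is defined elsewhere in the package and may differ.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, List

# DrawingML namespace used for paragraphs (a:p) and text runs (a:t).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_file) -> Dict[int, List[str]]:
    """Map slide file number -> non-empty paragraph strings in XML order.

    pptx_file may be a path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if not match:
                continue  # skip relationships, layouts, media, etc.
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                # Join the text runs that make up one paragraph.
                text = "".join(run.text or "" for run in para.iter(A_NS + "t"))
                if text:
                    paragraphs.append(text)
            result[int(match.group(1))] = paragraphs
    return dict(sorted(result.items()))
```

Calling `slide_texts("lecture.pptx")` (a placeholder file name) would return a dictionary that can be written to a text file, one slide after another.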
Recommendation: <25 oct 2012>
For your study guide, maybe you could extract all the text in each slide and dump that text programmatically (by adding that function to the sample code above while it's iterating over the slides) into the "Notes" section of each slide. With that, you can print it in Notes Page view. You'll get the entire slide image in the top half of the page and the actual slide text below it. It sure beats trying to copy and paste all the text from the slide into the notes section. You can even print it 2 slides per page, as small text inside the slide's image would not be an issue, and diagrams would still be more or less visible.
Unfortunately, this method works only for simple, standard slide formats. It's OK if your slides just have a title and a center text box with all the bullet points, but any complex slide layout (maybe text boxes scattered everywhere) will come out out of order and will be confusing. But at least you can still look at the slide image above to make sense of it :)
I want to read an existing PDF and extract the text and graphics information. Within graphics, currently I just need the drawn lines. There are many vendor components for reading PDF text, but are there ones that can give graphics info too? Though free/open-source is preferred, I'm open to commercial ones too.
The requirement is:
For every page in PDF:
Reading text blocks
Getting to know the canvas co-ordinate of the text block (rectangle containing the block). Note, for text with higher font size, the rect size will change.
Lines - need collection of (x1,y1,x2,y2) for every line in a page in pdf
Thanks,
- Seeker
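For the line requirement, the data being asked for looks roughly like this. In a page's content stream, `x y m` starts a path at a point and `x y l` draws a straight segment to a new point. The sketch below is a simplification under several assumptions: the stream is already decompressed into text, coordinates are reported as written (no `cm` transformation applied), and rectangles (`re`), curves, and numbers inside text strings are not handled. The function name `line_segments` is made up for illustration.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def line_segments(content: str) -> List[Segment]:
    """Collect one (x1, y1, x2, y2) tuple per `l` operator in the stream."""
    segments: List[Segment] = []
    operands: List[float] = []
    current: Optional[Tuple[float, float]] = None  # point set by last m or l
    for token in content.split():
        try:
            operands.append(float(token))  # numbers are operands
            continue
        except ValueError:
            pass  # anything else is treated as an operator
        if token == "m" and len(operands) >= 2:
            current = (operands[-2], operands[-1])
        elif token == "l" and len(operands) >= 2 and current is not None:
            end = (operands[-2], operands[-1])
            segments.append(current + end)
            current = end
        operands = []  # every operator consumes its operands
    return segments


# A tiny hand-written stream: move to (10, 20), then draw two segments.
stream = "1 0 0 1 0 0 cm\n10 20 m\n110 20 l\n110 70 l\nS"
lines = line_segments(stream)
```

A real component would also have to track the graphics state and the text operators to return text block rectangles, which is why an existing library is preferable to hand parsing.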
This is my field, though the question is a bit old. Hopefully this still helps.
You leave some room for assumptions, so here are mine:
you seek a script, rather than stand-alone software
your object is archival
If you are running command-line scripts:
Use the command-line script detailed at: http://stefaanlippens.net/extract-images-from-pdf-documents
If you are running server-side code using ImageMagick or GraphicsMagick functions:
Something like "convert -background white -flatten test1.pdf test1.jpg" (ImageMagick) will render the whole PDF page into a JPEG. If you then want to crop it to the image(s), the best script(s) for that depend on the context of the project.
A rather complex question. If you wish to provide more details about the project, then I can provide some more guidance. Best of luck.