How to convert a PDF with images (which I don't care about) to text?

I'm trying to convert PDFs to text files. The problem is that these PDFs contain images, which I don't care about. This is the type of file I want to extract text from: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf. Note that if I copy and paste with my mouse, it works quite well (except for the line breaks), so I'd guess that it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of junk like this:
ERY
CTR
3
CH
A
which appears because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it on a very similar thread on Stack Overflow, but there is no answer):
import sys
import pytesseract as pt
from pdf2image import convert_from_path  # pip install pdf2image
from PIL import Image
def convert(name):
    # Render each PDF page to an image, then run OCR on it
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        with open('page' + str(idx) + '.text', "w") as file_ex:
            file_ex.write(text)
if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried removing the images first and then using one of the solutions above, but it didn't work any better:
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
for i in range(src.getNumPages()):
    output.addPage(src.getPage(i))
output.removeImages()
output.write(outputStream)
outputStream.close()
# Read from the PDF without images (the file written above)
raw = parser.from_file('test3.pdf')
print(raw['content'])
Do you know how to solve this? It can be in any language.
Thanks

One approach you could try is to use a toolkit capable of parsing the text characters in the PDF, then use the object properties to remove the unwanted map labels while keeping the text you need.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
    PDFParsePagesOptions options = PDFParsePagesOptions.All;
    document.ParsePages(options, 1, -1);
    using (StreamWriter writer = File.CreateText(txtFileName))
    {
        IList<PDFObject> objects = document.Pages[0].Objects;
        writer.WriteLine("Objects: {0}", objects.Count);
        foreach (PDFObject obj in objects)
        {
            if (obj.TextProperties.IsEndOfLine)
                writer.WriteLine(obj.Code);
            else
                writer.Write(obj.Code);
        }
        writer.WriteLine("---------------------");
    }
}
This obtains all the text on the first page of the PDF, including the unwanted results you mentioned. Here is an excerpt:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text might be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code sample might be altered to skip any text that is not larger than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
    if (obj.TextProperties.FontHeight > 7.25)
    {
        if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
        else
            writer.Write(obj.Code);
    }
}
The extracted output is better; an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, with this approach you will have to come up with good criteria that filter out the unwanted text without removing the text you need to keep.
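The same size-based filtering idea can be sketched in Python with PyMuPDF, which is a different library from the toolkit used above and is swapped in here purely for illustration. The function names are hypothetical, and the 7.25 threshold is simply carried over from the C# example; PyMuPDF reports a font size per span, which may not match the FontHeight values shown above, so the threshold would need tuning against the actual file.

```python
from typing import Any, Dict, List


def lines_above_size(page_dict: Dict[str, Any], min_size: float = 7.25) -> List[str]:
    """Return one string per text line, keeping only spans larger than min_size.

    page_dict is expected to have the layout returned by
    PyMuPDF's page.get_text("dict"): blocks -> lines -> spans.
    """
    kept_lines = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            text = "".join(
                span["text"] for span in line["spans"] if span["size"] > min_size
            )
            if text.strip():
                kept_lines.append(text)
    return kept_lines


def extract_large_text(pdf_path: str, min_size: float = 7.25) -> str:
    """Extract text from every page, dropping small (map label) text."""
    import fitz  # PyMuPDF; imported lazily so the filter above works without it

    with fitz.open(pdf_path) as doc:
        pages = [
            "\n".join(lines_above_size(page.get_text("dict"), min_size))
            for page in doc
        ]
    return "\n".join(pages)
```

Calling extract_large_text("lf_sup_2020_213_fr.pdf") would be the starting point; other span properties in the same dictionary (font name, bounding box) could serve as additional filter criteria.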

Related

Inline math font size and equations spacing in markdown to pdf conversion using pandoc

I'm using vim and markdown as an alternative to Obsidian. I convert markdown to PDF with pandoc, and I would like the result to resemble Obsidian's PDF output as closely as possible, since I like how it looks.
In general I could make both PDFs look almost the same, but I have two problems: first, the inline math font is too big; second, the spacing before and after an equation is different.
Here are two screenshots, the first one being the pandoc output, the second the Obsidian output.
To style the PDF I'm using a custom LaTeX snippet, which I include with pandoc -H style.tex ... during PDF compilation. With this I was able to change the spacing between the text and the section titles, as well as other things like page margins. But I didn't find anything for the template related to the math or the equations.
I've also tried writing the equation as $\small \vec{E}$, but that didn't work.
I think there has to be a way of changing the spacing from the LaTeX template. I know that pandoc uses the unicode-math package to typeset the LaTeX equations, but I didn't find anything on how to change the spacing for the equations or the font size.
EDIT: the style.tex file
% page setup
\usepackage[a4paper,
top=2cm,
bottom=1.75cm,
left=1.75cm,
right=1.75cm]{geometry}
\usepackage{titlesec}
\usepackage{fontspec}
% inline code (backticks in md)
% taken from https://jdhao.github.io/2019/05/30/markdown2pdf_pandoc/
\linespread{1.15}
\definecolor{bgcolor}{HTML}{e0e0e0}
\let\oldtexttt\texttt
\renewcommand{\texttt}[1]{
\colorbox{bgcolor}{\oldtexttt{#1}}
}
% change boldfont bold to extrabold
% \setmainfont[
% BoldFont={Inter-ExtraBold}
% ]{Inter}
% change regular font to light font
% \setmainfont{Inter light}
\newfontfamily\titlefont{Inter}[
UprightFont = *-Regular,
BoldFont = *-ExtraBold,
]
\newfontfamily\sectionsfont{Inter}[
UprightFont = *-Regular,
BoldFont = *-SemiBold,
]
\titleformat{\section}
{\titlefont\huge\bfseries}
{}
{0em}
{}
\titleformat{\subsection}
{\sectionsfont\LARGE\bfseries}
{}
{0em}
{}
\titleformat{\subsubsection}
{\sectionsfont\Large\bfseries}
{}
{0em}
{}
\titleformat{\paragraph}
{\sectionsfont\large\bfseries}{\theparagraph}{1em}{}
\titlespacing*{\paragraph}
{0pt}{3.25ex plus 1ex minus .2ex}{1.5ex plus .2ex}
\titlespacing*{\subsubsection}
{0pt}{2.5ex plus 1ex minus .2ex}{1.5ex plus .2ex}
\titleformat{\subparagraph}
{\normalfont\large\bfseries}{\theparagraph}{1em}{}
\titlespacing*{\subparagraph}
{0pt}{3.25ex plus 1ex minus .2ex}{1.5ex plus .2ex}
EDIT2: this is the part of the .tex output shown in the screenshot, generated with:
pandoc --pdf-engine=xelatex file.md -o file.tex
eléctrica que efectúa el campo sobre la partícula. se puede calcular entonces como:
\[\frac{w_{a \rightarrow b}}{q_{0}} = - \int_{a}^{b} \vec{e} \cdot d\vec{l} = v_{b} - v_{a} = v_{ab}\]
donde \({q_{0}}\) es una pequeña carga puntual, \(v_{a}\) y \(v_{b}\) el potencial por unidad de carga de los puntos \(a\) y \(b\) respectivamente, \(\vec{e}\) el valor del campo eléctrico

Translation with the Google Translate API

I'm trying to write a program that takes the text of a file, for example PDF, and translates the text extracted with the Google API, except that the API doesn't work with my code. I don't have a clue why it isn't working.
I've already tried to modify my code but nothing I've done works.
from tika import parser
# from googletrans import Translator
import os
from textblob import TextBlob
#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")
raw = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = raw['content']
text = text.replace('https://mp4directs.com','')
text = text.replace('\t','')
text = text.replace('\r','')
fichier = open("arifureta.txt", "a")
fichier.write(text)
fichier.close()
fic = open("arifureta-formater.txt", "a")
cpt=0
with open("arifureta.txt") as f :
    for line in f :
        if len(line)==1 :
            cpt+=1
        else :
            cpt=0
        if cpt<2:
            fic.write(line)
fic.close()
nbLigneTraité = 0
fic2 = open("arifureta-traduit.txt", "a")
compteur=0
textPasTraduit=''
with open("arifureta-formater.txt") as f :
    for line in f :
        fic2.write(str(blob.translate(from_lang='en',to='fr')))
        if len(line)>1:
            textPasTraduit += line
            compteur+=1
            if compteur%1000==0:
                blob = TextBlob(textPasTraduit)
                try:
                    fic2.write(str(blob.translate(from_lang='en',to='fr')))
                    print(blob.translate(from_lang='en',to='fr'))
                except Exception as e:
                    pass
                nbLigneTraité+=1
                print(nbLigneTraité)
        if len(line)==1:
            fic2.write('\n')
fic2.close()
I expect to have the entire translation of the PDF's text in the result file, but what I actually get is 'broken link'. I think it is due to the quantity of text, but I haven't found a way to try any other method.
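If the amount of text per request really is the problem, one thing to try is sending the text in smaller chunks. The sketch below only shows the chunking; it assumes nothing about the translation service itself. The translate argument is a hypothetical callable (for example, a small wrapper around whatever translation client is used), and the 4000-character limit is an arbitrary placeholder, not a documented limit of any API.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], max_chars: int = 4000) -> Iterator[str]:
    """Group consecutive lines into chunks of at most max_chars characters.

    A single line longer than max_chars is still emitted, as its own chunk.
    """
    buffer: List[str] = []
    size = 0
    for line in lines:
        if buffer and size + len(line) > max_chars:
            yield "".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += len(line)
    if buffer:
        yield "".join(buffer)


def translate_in_chunks(
    lines: Iterable[str],
    translate: Callable[[str], str],
    max_chars: int = 4000,
) -> str:
    """Translate chunk by chunk and join the results.

    Exceptions raised by translate are deliberately not caught, so the
    real error message stays visible instead of being silently dropped.
    """
    return "".join(translate(chunk) for chunk in chunk_lines(lines, max_chars))
```

Passing the open file object as lines works, since iterating over a file yields its lines. Letting exceptions propagate (rather than using except ... pass) also makes it easier to see what the service is actually complaining about.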

Extracting strings of right-to-left languages with iTextSharp from a PDF file

I searched for a way to extract strings in right-to-left languages with iTextSharp, but I could not find one. Is it possible to extract strings in right-to-left languages from a PDF file with iTextSharp?
With thanks
EDIT:
This code gives a very good result:
private void writePdf2()
{
    using (var document = new Document(PageSize.A4))
    {
        var writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Users\USER\Desktop\Test2.pdf", FileMode.Create));
        document.Open();
        FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
        var tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
        var reader = new PdfReader(@"C:\Users\USER\Desktop\Test.pdf");
        int intPageNum = reader.NumberOfPages;
        string text = null;
        for (int i = 1; i <= intPageNum; i++)
        {
            text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
            text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
            text = new UnicodeCharacterPlacement
            {
                Font = new System.Drawing.Font("Tahoma", 12)
            }.Apply(text);
            File.WriteAllText("page-" + i + "-text.txt", text.ToString());
        }
        reader.Close();
        ColumnText.ShowTextAligned(
            canvas: writer.DirectContent,
            alignment: Element.ALIGN_RIGHT,
            phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
            //phrase: new Phrase(new Chunk(text, tahoma)),
            x: 300,
            y: 300,
            rotation: 0,
            runDirection: PdfWriter.RUN_DIRECTION_RTL,
            arabicOptions: 0);
    }
    System.Diagnostics.Process.Start(@"C:\Users\USER\Desktop\Test2.pdf");
}
But "phrase: new Phrase(new Chunk(text, tahoma))" does not have correct output for all strings in the PDF. Therefore I used "PdfStamper" to make a PDF which is suitable for "PdfReader" in "iTextSharp".
Reproducing the issue
As initially the OP couldn't provide a sample file, I first tried to reproduce the issue with a file generated by iTextSharp itself.
My test method first creates a PDF using ColumnText.ShowTextAligned with the string constant which, according to the OP, returns a good result. Then it extracts the text content of that file. Finally it creates a second PDF containing one line created using the good ColumnText.ShowTextAligned call with the string constant, followed by several lines created using ColumnText.ShowTextAligned with the extracted string, with or without the post-processing instructions from the OP's code (UTF-8 encoding and decoding; applying UnicodeCharacterPlacement).
I could not immediately find the UnicodeCharacterPlacement class the OP uses. So I googled a bit and found one such class here. I hope this is essentially the class used by the OP.
public void ExtractTextLikeUser2509093()
{
string rtlGood = @"C:\Temp\test-results\extract\rtlGood.pdf";
string rtlGoodExtract = @"C:\Temp\test-results\extract\rtlGood.txt";
string rtlFinal = @"C:\Temp\test-results\extract\rtlFinal.pdf";
Directory.CreateDirectory(@"C:\Temp\test-results\extract\");
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
Font tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
// A - Create a PDF with a good RTL representation
using (FileStream fs = new FileStream(rtlGood, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
// B - Extract the text for that good representation and add it to a new PDF
String textA, textB, textC, textD;
using (PdfReader pdfReader = new PdfReader(rtlGood))
{
textA = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new LocationTextExtractionStrategy());
textB = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(textA.ToString()));
textC = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textA);
textD = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textB);
File.WriteAllText(rtlGoodExtract, textA + "\n\n" + textB + "\n\n" + textC + "\n\n" + textD + "\n\n");
}
using (FileStream fs = new FileStream(rtlFinal, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 600,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textA, tahoma)),
x: 500,
y: 550,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textB, tahoma)),
x: 500,
y: 500,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textC, tahoma)),
x: 500,
y: 450,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textD, tahoma)),
x: 500,
y: 400,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
}
The final result:
Thus,
I cannot reproduce the issue. To me, the final two variants both look identical to the original line in their Arabic content. In particular I could not observe the switch from "سلام" to "سالم". Most likely the content of the PDF C:\Users\USER\Desktop\Test.pdf (from which the OP extracted the text in his test) is somehow peculiar, and so text extracted from it is drawn with that switch.
Applying that UnicodeCharacterPlacement class to the extracted text is necessary to get it into the right order.
The other post-processing line,
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
does not make any difference and should not be used.
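Why that line is a no-op can be illustrated with the equivalent round trip in Python. This is an analogy rather than the .NET call itself, and the function name is made up for the illustration: for well-formed text, encoding to UTF-8 and decoding straight back yields the string you started with.

```python
def utf8_round_trip(text: str) -> str:
    """Encode a string to UTF-8 bytes and decode it straight back."""
    return text.encode("utf-8").decode("utf-8")


sample = "Test. Hello world. سلام. کلمه سلام."
# True: the round trip neither reorders nor replaces any character.
unchanged = utf8_round_trip(sample) == sample
```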
For further analysis we would need that PDF C:\Users\USER\Desktop\Test.pdf.
Inspecting salamword.pdf
Eventually the OP could provide a sample PDF, salamword.pdf:
I used "PrimoPDF" to create a PDF file with this content: "Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم".
Next I read this PDF file. Then I received this output: "Test. Hello world. Hello people. م . م . م دم ".
Indeed I could reproduce this behavior. So I analyzed the way the Arabic writing was encoded inside...
Some background information to start with:
Fonts in PDFs can have (and in the case at hand do have) a completely custom encoding. In particular embedded subsets often are generated by choosing codes as the characters come, e.g. the first character from a given font used on a page is encoded as 1, the second different as 2, the third different as 3 etc.
Thus, simply extracting the codes of the drawn text does not help very much (see below for an example from the file at hand). But a font inside a PDF can bring along some extra information allowing an extractor to map the codes to Unicode values. This information might be
a ToUnicode map providing an immediate map code -> Unicode code point;
an Encoding providing a base encoding (e.g. WinAnsiEncoding) and differences from it in the form of glyph names; these names may be standard names or names only meaningful in the context of the font at hand;
ActualText entries for a structure element or marked-content sequence.
The PDF specification describes a method using the ToUnicode and the Encoding information with standard names to extract text from a PDF and presents ActualText as an alternative way where applicable. The iTextSharp text extraction code implements the ToUnicode / Encoding method with standard names.
Standard names in this context in the PDF specification are character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font.
In the file at hand:
Let's look at the Arabic text in the line written in Arial. The codes used for the glyphs here are:
01 02 03 04 05 01 02 06 07 01 08 02 06 07 01 09 05 0A 0B 01 08 02 06 07
This looks very much like the kind of ad-hoc encoding described above. Thus, this information alone does not help at all.
Thus, let's look at the ToUnicode mapping of the embedded Arial subset:
<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
<0c><0c><0028>
<0d><0d><0077>
<0e><0e><0069>
<0f><0f><0074>
<10><10><0068>
<11><11><0041>
<12><12><0072>
<13><13><0061>
<14><14><006c>
<15><15><0066>
<16><16><006f>
<17><17><006e>
<18><18><0029>
This maps 01 to 0020, 02 to 0645, 03 to 062f, 04 to 0631, 08 to 002e, etc. It does not map 05, 06, 07, etc to anything, though.
So the ToUnicode map only helps for some codes.
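To make the gap concrete, here is a minimal Python sketch that applies the mapping entries quoted above to the code sequence of the Arabic line. It is a simplification under stated assumptions, not how iTextSharp is implemented: it only handles the three-part range entries shown here, treats codes as single bytes, and treats each destination as a single code point. The names TO_UNICODE, CODES, parse_ranges and decode are made up for the illustration.

```python
import re
from typing import Dict, List, Tuple

# Range entries copied from the ToUnicode excerpt above (first five only).
TO_UNICODE = """
<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
"""

# Glyph codes of the Arabic line, as listed above.
CODES = [
    int(code, 16)
    for code in "01 02 03 04 05 01 02 06 07 01 08 02 06 07 "
                "01 09 05 0A 0B 01 08 02 06 07".split()
]

ENTRY = re.compile(r"<([0-9A-Fa-f]+)><([0-9A-Fa-f]+)><([0-9A-Fa-f]+)>")


def parse_ranges(text: str) -> Dict[int, str]:
    """Build a code -> character map from <lo><hi><dst> entries."""
    mapping: Dict[int, str] = {}
    for lo, hi, dst in ENTRY.findall(text):
        start, end, target = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(target + offset)
    return mapping


def decode(codes: List[int], mapping: Dict[int, str]) -> Tuple[str, List[int]]:
    """Return the decodable characters and the list of codes with no mapping."""
    chars, unmapped = [], []
    for code in codes:
        if code in mapping:
            chars.append(mapping[code])
        else:
            unmapped.append(code)
    return "".join(chars), unmapped


text, unmapped = decode(CODES, parse_ranges(TO_UNICODE))
```

With these inputs, unmapped contains only the codes 05, 06, 07, 09, 0A and 0B, which are exactly the glyphs that go missing from the extracted text.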
Now let's look at the associated encoding
29 0 obj
<</Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[ 1
/space/uni0645/uni062F/uni0631
/uni0645.init/uni06440627.fina/uni0633.init/period
/uni0647.fina/uni0644.medi/uni06A9.init/parenleft
/w/i/t/h
/A/r/a/l
/f/o/n/parenright ]
>>
endobj
The encoding is based on WinAnsiEncoding but all codes of interest are remapped in the Differences. There we find numerous standard glyph names (i.e. character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font) like space, period, w, i, t, etc.; but we also find several non-standard names like uni0645, uni06440627.fina etc.
There appears to be a scheme used for these names: uni0645 represents the character at Unicode code point 0645, and uni06440627.fina most likely represents the characters at Unicode code points 0644 and 0627 in some order in some final form. But still these names are non-standard for the purpose of text extraction according to the method presented by the PDF specification.
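If one wanted to exploit that apparent naming scheme as a fallback, a rough Python sketch could look like the following. This is a heuristic, not part of the extraction method described by the PDF specification, and nothing here is claimed to exist in iTextSharp. The function name is hypothetical, and the code assumes the characters appear in the order given in the glyph name, which the analysis above leaves open.

```python
import re
from typing import Dict, Optional

# "uni" followed by one or more groups of four uppercase hex digits.
UNI_NAME = re.compile(r"^uni((?:[0-9A-F]{4})+)$")


def glyph_name_to_text(name: str) -> Optional[str]:
    """Guess the Unicode text for names like 'uni0645' or 'uni06440627.fina'.

    Returns None for names that do not follow the uniXXXX pattern.
    """
    base = name.split(".", 1)[0]  # drop a suffix such as ".init" or ".fina"
    match = UNI_NAME.match(base)
    if match is None:
        return None
    digits = match.group(1)
    return "".join(
        chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
    )


# First entries of the Differences array above, which starts at code 1.
DIFFERENCES = [
    "space", "uni0645", "uni062F", "uni0631",
    "uni0645.init", "uni06440627.fina", "uni0633.init", "period",
    "uni0647.fina", "uni0644.medi", "uni06A9.init",
]

recovered: Dict[int, Optional[str]] = {
    code: glyph_name_to_text(name)
    for code, name in enumerate(DIFFERENCES, start=1)
}
```

Names such as space and period come back as None here because they need the standard glyph name lookup instead; only the uni-prefixed names are handled.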
Furthermore, there are no ActualText entries in the file at all.
So the reason why only " م . م . م دم " is extracted is that only for these glyphs does the PDF contain proper information for the standard PDF text extraction method.
By the way, if you copy&paste from your file in Adobe Reader you'll get a similar result, and Adobe Reader has a fairly good implementation of the standard text extraction method.
TL;DR
The sample file simply does not contain the information required for text extraction with the method described by the PDF specification, which is the method implemented by iTextSharp.

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
To extract the highlighted parts, you can use PyMuPDF. Here is an example that works with the sample file PDF-export-example-with-notes.pdf:
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple
import fitz  # install with 'pip install pymupdf'
def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = len(points) // 4
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence
def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:  # annotation type 8 is a highlight
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights
def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights
if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
OK, after looking around I found a solution for exporting highlighted text from a PDF to a text file. It is not very hard:
First, highlight your text with the tool you like to use (in my case, I highlight while reading on an iPad using the GoodReader app).
Transfer your PDF to a computer and open it using Skim (a PDF reader, free and easy to find on the web).
Under FILE, choose CONVERT NOTES and convert all the notes of your document to SKIM NOTES.
That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a txt file.
Not much work to do, and the result is fantastic.

Content templates rendering in TYPO3

I've got a strange problem connected with content rendering.
I use the following code to grab the content:
lib.otherContent = CONTENT
lib.otherContent {
  table = tt_content
  select {
    pidInList = this
    orderBy = sorting
    where = colPos=0
    languageField = sys_language_uid
  }
  renderObj = COA
  renderObj {
    10 = TEXT
    10.field = header
    10.wrap = <h2>|</h2>
    20 = TEXT
    20.field = bodytext
    20.wrap = <div class="article">|</div>
  }
}
and everything works fine, except that I'd also like to use predefined column-content templates other than simple text (Text with image, Images only, Bullet list etc.).
The question is: what do I have to replace renderObj = COA and the rest between the braces with, so that TYPO3 displays it properly?
Thanks,
I.
The available cObjects are more or less listed in TSRef, chapter 8.
TypoScript for rendering Text w/image can be found in typo3/sysext/css_styled_content/static/v4.3/setup.txt at line 724, and in the neighborhood you'll find e.g. bullets (below) and image (above), which is referenced in textpic at line 731. Variants of these are what you'll write in your renderObj.
You will find more details in the file typo3/sysext/cms/tslib/class.tslib_content.php, where e.g. text w/image is found at or around line 897 and is called IMGTEXT (do a case-sensitive search). See also around line 403 in typo3/sysext/css_styled_content/pi1/class.cssstyledcontent_pi1.php, where the newer css-based rendering takes place.