Efficient way to store HTML in database - sql

I have a text editor on a webpage. It provides functions like bold, italics, and highlight, so a text may contain any of these. It may even contain numbered or unnumbered lists.
The text editor generates HTML for the formatted text.
Because of this, the formatted text data (HTML) is at least 60% larger than the unformatted text would have been.
This consumes a lot of space (in terms of characters), which leads to a space-hungry database.
Is there a way to compress this, or some other way to store it efficiently?

There is no built-in compression function in Db2, but you may write your own external functions (using Java or C/C++) to implement such functionality. I can provide a Java example (using the java.util.zip package) of such an implementation, if you are interested.
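For reference, registering such an external Java function might look roughly like this (a sketch only; the jar id, class, and method names are hypothetical, and the exact options depend on your Db2 version):

    -- Register a Java UDF that deflates a CLOB into a BLOB
    CREATE FUNCTION COMPRESS_TEXT(INTXT CLOB(1M))
      RETURNS BLOB(1M)
      LANGUAGE JAVA
      PARAMETER STYLE JAVA
      EXTERNAL NAME 'COMPRESSJAR:com.example.CompressUdf.compress'
      NO SQL
      DETERMINISTIC
      NO EXTERNAL ACTION
      FENCED;

You would then call it as SELECT COMPRESS_TEXT(html_column) FROM your_table, with a matching decompression function for reading the data back.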
Another way is to use Db2 row compression. Db2 can compress any non-LOB columns as well as so-called "inlined" LOBs:
Storing LOBs inline in table rows
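For example (a minimal sketch with invented names; the maximum INLINE LENGTH depends on your page size):

    -- Enable row compression; LOB values up to 32000 bytes are kept
    -- inside the data pages, where they can be compressed too
    CREATE TABLE articles (
      id   INTEGER NOT NULL PRIMARY KEY,
      body CLOB(1M) INLINE LENGTH 32000
    ) COMPRESS YES;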

If you store your data in a Db2 XML data-type column, it will be stored in a more efficient form than raw text. Note that the content must be well-formed XML (i.e. XHTML rather than arbitrary HTML) for this to work:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.xml.doc/doc/c0022770.html
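A sketch of such a table (names invented):

    -- Db2 stores XML columns in a parsed hierarchical format,
    -- not as raw character data
    CREATE TABLE articles_xml (
      id   INTEGER NOT NULL PRIMARY KEY,
      body XML
    );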

Display 500+ character field from SAP transparent table

As is commonly known, SAP does not recommend using fields longer than 255 characters in transparent tables. One should use several 255-character fields instead, wrap the text in LCHR, LRAW or STRING, or use SO10 texts, etc.
However, while maintaining legacy (and ugly) developments, such a problem often arises: how do you view what is stored in a char500 or char1000 field in the database?
The real-life scenario:
we have a development where some structure is written to and read from a char1000 field in a transparent table
we know the field structure, and parsing the field through CL_ABAP_CONTAINER_UTILITIES=>FILL_CONTAINER_C or SO_STRUCT_TO_CHAR works fine; all fields are filled wonderfully
displaying the field via SE11/SE16/SE16n gives nothing, as the field is truncated to 255 characters, and to 132 in the debugger, AFAIR.
Is there any standard tool, transaction or FM we can use to display such a long field?
In the DBA cockpit (transaction ST04), there is an SQL command line where you can enter "native" SQL commands directly and display the result as an ALV view. With a substring function, you can split a field into several sections, for example:

    select substr(sql_text,1,100)   s1,
           substr(sql_text,101,100) s2,
           substr(sql_text,201,100) s3,
           substr(sql_text,301,100) s4
    from dba_hist_sqltext
    where sql_id = '0cuyjatkcmjf0'

PS: every ALV cell is 128 characters maximum.
Not sure whether this tool is available for all supported database platforms.
There is also an equivalent program named RSDU_EXEC_SQL (in all ABAP-based systems?).
Unfortunately, these won't work for SAP's table substitutes (clustered tables and so on), as those can be queried only with ABAP Open SQL.
If you have an ERP system at hand, check out transaction PP01 with infotype 1002. Basically, the texts are stored in tables HRP1002 and HRT1002, and a special view with a text editor is provided. It looks like this: http://www.sapfunctional.com/HCM/Positions/Page1.13.jpg
In the debugger you can switch the view to e.g. HTML and you should see the whole string, but editing is limited, as far as I know, to a certain number of characters.

Where can I find a mapping of Identity-H encoded characters to ASCII or Unicode characters?

I have a PDF generated by a third party. I am trying to get the text out of it, but neither pdf2text nor copying and pasting results in readable text. After a little digging in the output (of either of the two), I found that each character on the screen is made up of three bytes. For example, "A" is the bytes ef, 81, and 81. Looking at the metadata on the PDF, it claims to be encoded in Identity-H, so I assume what I am seeing is a set of characters encoded in Identity-H. I have a partial mapping based on the documents I already have, but I want to build a more complete mapping. To do that I need something like an ASCII table for Identity-H.
It is not always possible to extract text from a PDF, especially when the /ToUnicode map is missing, as pointed out by mkl.
If it is not possible to cut and paste the correct text from Acrobat, then you will have very little chance of extracting the text yourself; if Acrobat cannot extract it, it is very unlikely that any other tool can extract the text correctly.
If you manually create an encoding table, you could use it to remap the extracted characters to their correct values, but this will most likely only work for this one document.
Often this is done on purpose. I have seen documents that randomly remap characters differently for each font in the document. It is used as a form of obfuscation, and the only real way to extract text from these PDFs is to resort to OCR. Many financial reports use this type of trick to stop people from extracting their data.
Also, Identity-H is just a 1:1 character mapping for all characters from 0x0000 to 0xFFFF, i.e. Identity is literally an identity mapping.
Your real problem is the missing /ToUnicode entry in this PDF. I suspect there is also an embedded CMap in your PDF, which would explain why there could be 3 bytes per character.

Writing to HDFS messed up the data

I was trying to save the output of a Hive query on HDFS, but the data got changed. Any idea?
See below the original data and the changed one:
[Correct]: i.stack.imgur.com/DLNTT.png
[Messed up]: i.stack.imgur.com/7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement, as in the sketch below. Be careful though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
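For illustration, a table with custom delimiters might be declared like this (table and column names are invented):

    -- HiveQL: override the default \001 / \002 / \003 delimiters
    CREATE TABLE example_table (
      id    INT,
      items ARRAY<STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY ','
    STORED AS TEXTFILE;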
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files, you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents off of HDFS, but it looks like you are seeing it from the Hive command line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that you have specified the schema wrong, and the column of the table results is a string where you meant to make it an array. This would explain why it went ahead and printed the ASCII 002s instead of using them as collection terminators.
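In other words, compare these two declarations (a sketch with invented names):

    -- Declared as a string: the \002 bytes are printed literally
    CREATE TABLE results_wrong (items STRING);

    -- Declared as an array: the CLI splits on \002 and prints ["a","b"]
    CREATE TABLE results_right (items ARRAY<STRING>);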

What data type to use for variable length data (for performance)?

What data type should I use for data that can be very short, e.g. an HTML link (think Twitter), or very long, e.g. an HTML blog post (think WordPress)?
I am thinking that if I use varchar(4000), it may be too short for an HTML-formatted blog entry, but if I use text, will it take up more space and be less efficient?
[update]
I am still considering whether to use MySQL (with PHP 5.3/Zend Framework) or MSSQL (with ASP.NET MVC 2).
MySQL also has a TEXT data type for storing an arbitrarily large amount of text. You can find more here: The BLOB and TEXT Types.
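A minimal sketch (names invented; TEXT holds up to 64 KB, MEDIUMTEXT up to 16 MB):

    CREATE TABLE posts (
      id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      body TEXT NOT NULL  -- use MEDIUMTEXT for larger documents
    );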
If you are using Microsoft SQL Server 2008, you can use varchar(max).
Edit:
text is also available, but it isn't searchable without full-text indexing.
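For example (a sketch; the table name is invented):

    -- T-SQL: VARCHAR(MAX) can hold up to 2 GB of data
    CREATE TABLE posts (
      id   INT IDENTITY(1,1) PRIMARY KEY,
      body VARCHAR(MAX) NOT NULL
    );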

What MySQL datatype & attributes should be used to store large amounts of html formatted data?

I'm setting up a database using phpMyAdmin, and many fields will be large chunks of HTML.
What MySQL datatype & attributes should be used for fields that store large amounts of HTML data?
TEXT, MEDIUMTEXT, or LONGTEXT
I would recommend against storing large chunks of HTML (or other text data) in a database. I find that it's often far more useful to store files as files, and put a filename in the database instead.
On the other hand, if you're doing everything through phpMyAdmin, you may not have that option available to you.
You really should start with the documentation; then, if you have questions based on the data types you find there, ask for clarification. It really helps to understand what the data types are before asking the question. Documentation here:
http://dev.mysql.com/doc/refman/5.4/en/data-types.html
That said, take a closer look at TEXT and BLOB. TEXT will store a large body of textual information (probably a good choice), whereas BLOB is designed for binary data. This makes a difference in which query functions apply and what data types they operate on.
I think you can store HTML in a simple TEXT field. If your HTML is larger than 64 KB, you can use MEDIUMTEXT instead.
See also Storage Requirements for String Types for more details about the maximum length of stored values.
Also remember that Unicode characters can require more than 1 byte each to store.
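A quick way to see the character-vs-byte difference in MySQL (assuming a UTF-8 connection character set):

    -- CHAR_LENGTH counts characters, LENGTH counts bytes;
    -- with utf8mb4, one character can take up to 4 bytes
    SELECT CHAR_LENGTH('héllo') AS chars,  -- 5
           LENGTH('héllo')      AS bytes;  -- 6 ('é' is 2 bytes in UTF-8)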