How do I read a file and turn it into a raw bit string? For example, I open an image that is 512 KB; the program reads the file byte by byte and outputs the long bit string that is the file. I would like to apply some functions to the strings, but I can't figure out a way to unpack files consistently.
I imagine what I need is something that reads a file byte by byte with no regard for the original file format. As it reads byte by byte, a giant integer-like bit string of the file is created.
I used a Python bit generator and NumPy, which seemed to work well, but the program didn't behave well with actual files. What is the best way to unpack files into 1s and 0s?
How do I read any file and store the contents as an easy-to-read HEX file? Or a BIN file? And how do I stop the "open" function from truncating leading 0s?
UGH!
Using Python or GOLANG, how do I open any file and create an uninterrupted bit string of the contents where every leading zero in a BYTE read is significant?
After looking around and asking everyone I know, I found my answer. The best way to turn any file into a raw HEX string is:
# Read the file as raw bytes and convert them to a hex string.
with open("file_name", "rb") as f:
    content = f.read().hex()
# Write the hex string to a text file.
with open("File HEX bitstream.txt", "w") as text_file:
    print("HEX Bitstream Import", content, file=text_file)
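For the bit-string form the question asks about, here is a minimal sketch, assuming the file fits in memory. The helper names `bytes_to_bits` and `file_to_bits` are made up for illustration. Formatting each byte with `08b` pads it to eight digits, so leading zeros in every byte are kept.

```python
def bytes_to_bits(data: bytes) -> str:
    """Return the bits of data as a string of '0' and '1' characters."""
    # Format every byte as exactly eight binary digits so leading zeros survive.
    return "".join(f"{byte:08b}" for byte in data)


def file_to_bits(path: str) -> str:
    """Read a file in binary mode and return its contents as a bit string."""
    with open(path, "rb") as f:
        return bytes_to_bits(f.read())


if __name__ == "__main__":
    print(bytes_to_bits(b"\x00\x0f"))  # prints 0000000000001111
```

Because the string is built from the bytes object rather than from one large integer, there is no step at which a leading zero byte could be dropped.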
import lxml.html.clean as clean
cleaner = clean.Cleaner(style=True, remove_tags=['div','span',], safe_attrs_only=['href',])
text = cleaner.clean_html('link')
print(text)
prints
link
how to get:
link
i.e. with the href in its normal encoding?
clean does the right thing -- the string in parentheses should be properly encoded, and the seemingly garbled thing is the proper encoding.
You might not know, but Cyrillic domain names don't exist -- there's a complex system to map these to "allowed" characters.
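Assuming the mapping meant here is IDNA with Punycode, the round trip can be sketched with Python's built-in "idna" codec. The domain below is a hypothetical example, not the one from the question.

```python
# "пример.рф" is a hypothetical example domain, not taken from the question.
domain = "пример.рф"

# Python's built-in "idna" codec converts each label to its ASCII form.
ascii_form = domain.encode("idna")

# Decoding with the same codec recovers the readable name.
readable = ascii_form.decode("idna")

print(ascii_form)
print(readable)
```

The ASCII form is what travels over the wire; turning it back into readable text is a display step done after cleaning.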
I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string containing non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) ); // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" ); // default enc: utf-8
w.setProperty("showXmlDeclaration","true"); // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add an explicit utf-8 encoding to the declaration, nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu généralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: if each of the troublesome characters like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
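The length check can be illustrated with a small Python sketch (the question itself is about Java, so this only shows the effect, not Jena's behaviour). It assumes the broken reader decodes UTF-8 bytes as ISO-8859-1.

```python
# The text as it should be: each accented letter is one character.
original = "g\u00e9n\u00e9ralement"  # "généralement"

# Simulate a broken reader: UTF-8 bytes decoded as ISO-8859-1.
misread = original.encode("utf-8").decode("iso-8859-1")

print(len(original))  # 12
print(len(misread))   # 14, each é became two characters
```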
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected utf-8 hex sequence 0xc3a9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
The output file. Again, use a hex editor to confirm the byte sequence is 0xc3a9.
This will tell exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a9.
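When no hex editor is at hand, the same inspection can be sketched in Python. The helper name `hex_dump` is made up for illustration. The last line shows what the bytes look like if the text has been encoded twice, assuming the intermediate misreading was ISO-8859-1.

```python
def hex_dump(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of text as a lowercase hex string."""
    return text.encode(encoding).hex()


e_acute = "\u00e9"  # the character é

print(hex_dump(e_acute))                # c3a9, correct UTF-8
print(hex_dump(e_acute, "iso-8859-1"))  # e9, single-byte Latin-1

# UTF-8 bytes misread as ISO-8859-1 and then written out as UTF-8 again.
double_encoded = e_acute.encode("utf-8").decode("iso-8859-1")
print(hex_dump(double_encoded))         # c383c2a9
```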
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.
In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tools like pdftotext, you get corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.
The PDF is using subsetted fonts in which the characters are remapped to other characters, the same way a simple World War II substitution cipher works.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way, so to get the correct characters displayed in the PDF you need to send "G1#W" for it to print ABCD. Normally PDFs have a ToUnicode table to help with text extraction, but I suspect this table has been left out on purpose.
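Undoing such a remapping, once the table is known, can be sketched as follows. The four-entry mapping is only the illustrative one from this answer, and the names `glyph_to_char` and `decode` are made up for the example; a real document needs its own table, possibly one per font.

```python
# Hypothetical mapping taken from the example above: the PDF stores "G1#W"
# for text that displays as "ABCD". A real document needs its own table.
glyph_to_char = {"G": "A", "1": "B", "#": "C", "W": "D"}

# str.maketrans builds a translation table from a single-character mapping.
table = str.maketrans(glyph_to_char)


def decode(extracted: str) -> str:
    """Map extracted characters back to the ones shown on the page."""
    # Characters without an entry in the table pass through unchanged.
    return extracted.translate(table)


print(decode("G1#W"))  # ABCD
```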
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy / paste the text into a text editor. If Acrobat cannot decode the text back to English then there is no way to extract the text without remapping it manually if you know the translation mappings.
The only way to extract text easily from these types of documents is to OCR the full document and remove the original text. The OCR process would convert each page to a TIFF image and then recognize the text from that image, so the original garbled text shouldn't affect the result.
Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a Python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's missing a fair bit of punctuation too, at least for now (all of these are missing, for example: <>?{}\|!~`@#$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys
letter_map = {
u'¿':'a',
u'regex':'b',
u'regex':'c',
u'regex':'d',
u'»':'e',
u'o':'f',
u'1':'g',
u'regex':'h',
u'·':'i',
u'¶':'j',
u'μ':'k',
u'regex':'l',
u'3':'m',
u'2':'n',
u'±':'o',
u'°':'p',
u'regex':'q',
u'®':'r',
u'-':'s',
u'¬':'t',
u'«':'u',
u'a':'v',
u'©':'w',
u'regex':'x',
u'§':'y',
u'¦':'z',
u'ß':'A',
u'Þ':'B',
u'Ý':'C',
u'Ü':'D',
u'Û':'E',
u'Ú':'F',
u'Ù':'G',
u'Ø':'H',
u'×':'I',
u'Ö':'J',
u'Õ':'K',
u'Ô':'L',
u'Ó':'M',
u'Ò':'N',
u'Ñ':'O',
u'Ð':'P',
u'':'Q', # Missing
u'Î':'R',
u'Í':'S',
u'Ì':'T',
u'Ë':'U',
u'Ê':'V',
u'É':'W',
u'':'X', # Missing
u'Ç':'Y',
u'Æ':'Z',
u'ð':'0',
u'ï':'1',
u'î':'2',
u'í':'3',
u'ì':'4',
u'ë':'5',
u'ê':'6',
u'é':'7',
u'è':'8',
u'ç':'9',
u'ò':'.',
u'ô':',',
u'æ':':',
u'å':';',
u'Ž':"'",
u'•':"'",
u'•':"'", # s/b double quote, but identical to single.
u'Œ':"'", # s/b double quote, but identical to single.
u'ó':'-', # dash
u'Š':'-', # n-dash
u'‰':'--', # em-dash
u'ú':'&',
u'ö':'*',
u'ñ':'/',
u'÷':')',
u'ø':'(',
u'Å':'[',
u'Ã':']',
u'‹':'•',
}
ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''
for letter in ciphertext:
    try:
        plaintext += letter_map[letter]
    except KeyError:
        plaintext += letter
# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)
for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue
I have a problem with data uploaded via the POST method in a web application.
If the file is plain text there's no problem, but the trouble comes when it's a binary file, such as a picture. When the system writes the data into the file, it isn't encoded correctly.
Here is all the code, from the part that takes environ['wsgi.input'] to the part that saves the file:
# rawData holds the bytes read from environ['wsgi.input'].
# First I convert the bytes to a string, drop the leading "b" of the
# repr, and strip the surrounding single quotes.
tmpData = str(rawData)[1:].strip("' '")
dat = tmpData.split('\\r')  # Then I split the data on '\\r'.
s = open('/home/hidura/test.png', 'w')  # I open the test.png file.
for cont in range(5, 150):  # From the 5th position to the 150th...
    s.write(dat[cont])  # ...write each piece of the data to the file.
s.close()  # Then I close it.
Where is the mistake?
Thank you in advance.
Why do you convert the binary data to a string? A png file is binary data. Just write the binary data to the file. You need to open the file in binary mode as well.
# data holds the raw bytes of the upload; write them unchanged in binary mode.
with open('/home/hidura/test.png', 'wb') as s:
    s.write(data)
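Put together, a minimal sketch of reading the request body as bytes and saving it untouched might look like this. The function name `save_request_body` is made up for illustration, and the sketch assumes the body is the file itself. For a multipart/form-data upload the body also contains boundaries and part headers, so it would need to go through a multipart parser first.

```python
import io


def save_request_body(environ: dict, path: str) -> int:
    """Write the raw request body to path and return the number of bytes."""
    # CONTENT_LENGTH may be missing or empty; treat that as an empty body.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)  # bytes, never str
    with open(path, "wb") as out:  # binary mode, no text encoding applied
        out.write(body)
    return len(body)


if __name__ == "__main__":
    import os
    import tempfile

    payload = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG signature
    fake_environ = {
        "CONTENT_LENGTH": str(len(payload)),
        "wsgi.input": io.BytesIO(payload),
    }
    target = os.path.join(tempfile.mkdtemp(), "test.png")
    print(save_request_body(fake_environ, target))  # 8
```

The bytes are never passed through str(), so nothing is escaped or re-encoded on the way to disk.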