read wsgi post data with unicode encoding - mod-wsgi

How can i read wsgi POST with unicode encoding,
this is part of my code :
....
request_body_size = int(environ.get('CONTENT_LENGTH', 0))
req = str(environ['wsgi.input'].read(request_body_size))
and from req i read my fileds,
this is what i posted :
کلمه
and this is what i read it from inside of py code:
b"%DA%A9%D9%84%D9%85%D9%87"
This is a byte string but i can't convert it or read it ,
I use encode and decode methods but none of these are not worked .
I use python3.4 and wsgi and mod_wsgi(apache2).

I use urllib module of python, with this code and worked :
fm = urllib.parse.parse_qs(request_body['family'].encode().decode(),True) # return a dictionary
familyvalue = str([k for k in fm.keys()][0]) # access to first item
is this a right way ?

Related

Translation with google trad api

I'm trying to write a program that takes the text of a file, for example PDF, and translates the text extracted with the Google API, except that the API doesn't work with my code. I don't have a clue why it isn't working.
I've already tried to modify my code but nothing I've done works.
from tika import parser
# from googletrans import Translator
import os
from textblob import TextBlob
#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")
raw = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = raw['content']
text = text.replace('https://mp4directs.com','')
text = text.replace('\t','')
text = text.replace('\r','')
fichier = open("arifureta.txt", "a")
fichier.write(text)
fichier.close()
fic = open("arifureta-formater.txt", "a")
cpt=0
with open("arifureta.txt") as f :
for line in f :
if len(line)==1 :
cpt+=1
else :
cpt=0
if cpt<2:
fic.write(line)
fic.close()
nbLigneTraité = 0
fic2 = open("arifureta-traduit.txt", "a")
compteur=0
textPasTraduit=''
with open("arifureta-formater.txt") as f :
for line in f :
fic2.write(str(blob.translate(from_lang='en',to='fr')))
if len(line)>1:
textPasTraduit += line
compteur+=1
if compteur%1000==0:
blob = TextBlob(textPasTraduit)
try:
fic2.write(str(blob.translate(from_lang='en',to='fr')))
print(blob.translate(from_lang='en',to='fr'))
except Exception as e:
pass
nbLigneTraité+=1
print(nbLigneTraité)
if len(line)==1:
fic2.write('\n')
fic2.close()
I expect to have the entire translation of the PDF's text in the result file, but actually the answer is 'broken link'. I think it is due to the quantity of text, but I haven't find a way to try any other method.

Problem Replacing <br> Tags with Newline Using bs4

Problem: I cannot replace <br> tags with a newline character using Beautiful Soup 4.
Code: My program (the relevant portion of it) currently looks like
for br in board.select('br'):
br.replace_with('\n')
but I have also tried board.find_all() in place of board.select().
Results: When I use board.replace_with('\n') all <br> tags are replaced with the string literal \n. For example, <p>Hello<br>world</p> would end up becoming Hello\nworld. Using board.replace_with(\n) causes the error
File "<ipython-input-27-cdfade950fdf>", line 10
br.replace_with(\n)
^
SyntaxError: unexpected character after line continuation character
Other Information: I am using a Jupyter Notebook, if that is of any relevance. Here is my full program, as there may be some issue elsewhere that I have overlooked.
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get("https://boards.4chan.org/g/")
soup = BeautifulSoup(page.content, 'html.parser')
board = soup.find('div', class_='board')
for br in board.select('br'):
br.replace_with('\n')
message = [obj.get_text() for obj in board.select('.opContainer .postMessage')]
image = [obj['href'] for obj in board.select('.opContainer .fileThumb')]
pid = [obj.get_text() for obj in board.select('.opContainer .postInfo .postNum a[title="Reply to this post"]')]
time = [obj.get_text() for obj in board.select('.opContainer .postInfo .dateTime')]
for x in range(len(image)):
image[x] = "https:" + image[x]
post = pd.DataFrame({
"ID": pid,
"Time": time,
"Image": image,
"Message": message,
})
post
pd.options.display.max_rows
pd.set_option('display.max_colwidth', -1)
display(post)
Any advice would be appreciated. Thank you for reading.
Just tried and it works for me, my bs4 version is 4.8.0, and I am using Python 3.5.3,
example:
from bs4 import BeautifulSoup
soup = BeautifulSoup('hello<br>world')
for br in soup('br'):
br.replace_with('\n')
# <br> was replaced with \n successfully
assert str(soup) == '<html><body><p>hello\nworld</p></body></html>'
# get_text() also works as expected
assert soup.get_text() == 'hello\nworld'
# it is a \n not a \\n
assert soup.get_text() != 'hello\\nworld'
I am not used to work with Jupyter Notebook, but seems that your problem is that whatever you are using to visualize the data is showing you the string representation instead of actually printing the string,
hope this helps,
Regards,
adb
Instead of replacing after converting to soup, try replacing the <br> tags before converting. Like,
soup = BeautifulSoup(str(page.content).replace('<br>', '\n'), 'html.parser')
Hope this helps! Cheers!
P.S.: I did not get any logical reason why this is not working after changing into soup.
After experimenting with variations of
page = requests.get("https://boards.4chan.org/g/")
str_page = page.content.decode()
str_split = '\n<'.join(str_page.split('<'))
str_split = '>\n'.join(str_split.split('>'))
str_split = str_split.replace('\n', '')
str_split = str_split.replace('<br>', ' ')
soup = BeautifulSoup(str_split.encode(), 'html.parser')
for the better part of two hours, I have determined that the Panda data-frame prints the newline character as a string literal. Everything else indicates that the program is working as intended, so I assume this has been the problem all along.
for some reason direct replace with newline does not work with bs4 you have to first replace with some other unique character (character sequence preferably) and then replace that sequence in text with newline.
try this.
for br in soup.find_all('br'): br.replace_with('+++')
text=soup.get_text().replace('+++','\n)

Can't change the language in microsoft cognitive services spellchecker

Here is the Microsoft Python example of using spellchecker API:
import http.client, urllib.parse, json
text = 'Hollo, wrld!'
data = {'text': text}
# NOTE: Replace this example key with a valid subscription key.
key = 'MY_API_KEY'
host = 'api.cognitive.microsoft.com'
path = '/bing/v7.0/spellcheck?'
params = 'mkt=en-us&mode=proof'
headers = {'Ocp-Apim-Subscription-Key': key,
'Content-Type': 'application/x-www-form-urlencoded'}
# The headers in the following example
# are optional but should be considered as required:
#
# X-MSEdge-ClientIP: 999.999.999.999
# X-Search-Location: lat: +90.0000000000000;long: 00.0000000000000;re:100.000000000000
# X-MSEdge-ClientID: <Client ID from Previous Response Goes Here>
conn = http.client.HTTPSConnection(host)
body = urllib.parse.urlencode(data)
conn.request ("POST", path + params, body, headers)
response = conn.getresponse()
output = json.dumps(json.loads(response.read()), indent=4)
print (output)
And it works well for mkt=en-us. But if I try to change it, for example to 'fr-FR'. It always answers me with a blank response to any input text.
{
"_type": "SpellCheck",
"flaggedTokens": []
}
Has anybody encountered the similar problem? May it be connected with my trial api key (though they do not mention that trial supports only English)?
Well, I've found out what the problem was. 'mode=proof' — advanced spellchecker currently available only if 'mkt=en-us' (for some Microsoft reasons it does not available even if 'mkt=en-uk'). For all other languages, you should use 'mode=spell'.
The main difference between 'proof' and 'spell' is described like this:
The Spell mode finds most spelling mistakes but doesn't find some of the grammar errors that Proof catches (for example, capitalization and repeated words).

NiFi: Remove fixed number of header lines from file

I'm processing a file and I'd like to remove (trim) the first X header lines to keep only data, possibly avoiding using regular expressions.
Thanks
You can remove the first X header lines by using ExecuteScript procesor in Nifi.
The following is a example Jython script which I wrote for myself:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class PyStreamCallback(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
text = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
for line in text[3:]:
outputStream.write(line + "\n")
flowFile = session.get()
if (flowFile != None):
flowFile = session.write(flowFile,PyStreamCallback())
flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0]+'_translated.json')
session.transfer(flowFile, REL_SUCCESS)
This obviously removes the first 3 lines but you can easily modify it to remove more or less lines.
Hope that helps.

Sending form data with an HTTP PUT request using Grinder API

I'm trying to replicate the following successful cURL operation with Grinder.
curl -X PUT -d "title=Here%27s+the+title&content=Here%27s+the+content&signature=myusername%3A3ad1117dab0ade17bdbd47cc8efd5b08" http://www.mysite.com/api
Here's my script:
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder
from net.grinder.plugin.http import HTTPRequest
from HTTPClient import NVPair
import hashlib
test1 = Test(1, "Request resource")
request1 = HTTPRequest(url="http://www.mysite.com/api")
test1.record(request1)
log = grinder.logger.info
test1.record(log)
m = hashlib.md5()
class TestRunner:
def __call__(self):
params = [NVPair("title","Here's the title"),NVPair("content", "Here's the content")]
params.sort(key=lambda param: param.getName())
ps = ""
for param in params:
ps = ps + param.getValue() + ":"
ps = ps + "myapikey"
m.update(ps)
params.append(NVPair("signature", ("myusername:" + m.hexdigest())))
request1.setFormData(tuple(params))
result = request1.PUT()
The test runs okay, but it seems that my script doesn't actually send any of the params data to the API, and I can't work out why. There are no errors generated, but I get a 401 Unauthorized response from the API, indicating that a successful PUT request reached it, but obviously without a signature the request was rejected.
This isn't exactly an answer, more of a workaround that I came up with, that I've decided to post since this question hasn't yet received any responses, and it may help anyone else trying to achieve the same thing.
The workaround is basically to use the httplib and urllib modules to build and make the PUT request instead of the HTTPClient module.
import hashlib
import httplib, urllib
....
params = [("title", "Here's the title"),("content", "Here's the content")]
params.sort(key=lambda param: param[0])
ps = ""
for param in params:
ps = ps + param[1] + ":"
ps = ps + "myapikey"
m = hashlib.md5()
m.update(ps)
params.append(("signature", "myusername:" + m.hexdigest()))
params = urllib.urlencode(params)
print params
headers = {"Content-type": "application/x-www-form-urlencoded"}
conn = httplib.HTTPConnection("www.mysite.com:80")
conn.request("PUT", "/api", params, headers)
response = conn.getresponse()
print response.status, response.reason
print response.read()
conn.close()
(Based on the example at the bottom of this documentation page.)
You have to refer to the multi-form posting example in Grinder script gallery, but changing the Post to Put. It works for me.
files = ( NVPair("self", "form.py"), )
parameters = ( NVPair("run number", str(grinder.runNumber)), )
# This is the Jython way of creating an NVPair[] Java array
# with one element.
headers = zeros(1, NVPair)
# Create a multi-part form encoded byte array.
data = Codecs.mpFormDataEncode(parameters, files, headers)
grinder.logger.output("Content type set to %s" % headers[0].value)
# Call the version of POST that takes a byte array.
result = request1.PUT("/upload", data, headers)