Preserve raw text file newlines with Scrapy

When I scrape a raw text file, e.g. https://raw.githubusercontent.com/microsoft/TypeScript/master/Gulpfile.js, Scrapy ignores the line breaks and saves the content as a single blob of text. Is there a way to preserve the \n characters?
I've tried these selectors with no luck:
response.css("body p::text").extract_first()
response.css("body p").xpath("text()").extract_first()
response.css("body p").xpath("string()").extract_first()
Thanks in advance.
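A sketch that may help (not from the original thread; the spider class and output filename are made up for illustration): a raw file response has no HTML DOM, so CSS/XPath selectors are the wrong tool, but Scrapy's response.text holds the decoded body with its \n characters intact.
```python
import scrapy

class RawTextSpider(scrapy.Spider):
    name = "rawtext"  # hypothetical spider name
    start_urls = [
        "https://raw.githubusercontent.com/microsoft/TypeScript/master/Gulpfile.js"
    ]

    def parse(self, response):
        # response.text is the decoded payload; line breaks are preserved.
        with open("Gulpfile.js", "w", encoding="utf-8") as f:
            f.write(response.text)
```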

Related

Custom Speech: "normalized text is empty"

I uploaded a WAV file and a text file to the Custom Speech portal and got the following error: “Error: normalized text is empty.”
The text file is UTF-8 with BOM, and is similar in format to a file that did work.
How can I troubleshoot this?
There can be several reasons for normalized text to end up empty, e.g. a sentence mixing Latin and non-Latin characters (depending on the locale), or a word repeated many times in a row. Can you share which locale you're using to import the data? If you can share the text, we can find the reason. Otherwise, you could reduce the input text (no need to cut the audio for this) to find out what causes normalization to discard the sentence.

How to handle emoji when scraping text in Python with bs4

I'm creating a scraper that collects all the comments on a URL page and saves each one to a txt file (1 comment = 1 txt).
Now I'm having a problem when there are emoji in the text of a comment: the program stops with "UnicodeEncodeError: 'charmap' codec can't encode the character". How can I get past this problem? (I'm using bs4.)
The structure of the code is like this:
import requests
from bs4 import BeautifulSoup

q = requests.get(url)
soup = BeautifulSoup(q.content, "html.parser")
x = soup.find("a", {"class": "comments"})
y = x.find_all("div", {"class": "blabla"})
i = 0
for item in y:
    name = str(i)  # per-comment file name: 0.txt, 1.txt, ...
    comment = item.find_all("p")
    out_file = open('%s.txt' % name, "w")
    out_file.write(str(comment))
    out_file.close()
    i = i + 1
Thanks to everyone.
My guess is that you are on Windows; your code works perfectly on Linux. So change the encoding of the file you open to utf-8, like this:
out_file = open('%s.txt' % name, "w", encoding='utf-8')
This should write to the file without error. The emoji may not display properly in Notepad, but you can always open the file in Firefox or another application if you want to see them. The rest of the comment text should be readable in Notepad either way.

Text file encoding detection issue

I am using a StreamReader to read a text file and then saving the file's content into another text file using a StreamWriter. The input file can be in either ANSI or UTF-8 format.
I am facing a problem with non-English characters. When the input file contains Chinese or Japanese, everything works fine.
But if the input file contains characters like ã, the output file shows question-mark symbols instead.
I tried to fix this by using the iso-8859-1 encoding for the StreamReader, but now Chinese and Japanese come out like this: ¢ãƒ­ãƒ¼ãƒ»ãƒ¦ãƒ¼ã
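The usual fix is to detect the encoding before decoding instead of hard-coding one: check for a BOM first, then try strict UTF-8, and only fall back to a single-byte ANSI codepage if that fails. A minimal sketch of that logic, written in Python for illustration (the cp1252 fallback is an assumption; substitute whichever ANSI codepage your files actually use):
```python
import codecs

def read_text(path, fallback="cp1252"):
    # Read raw bytes so we can inspect them before committing to a decoder.
    with open(path, "rb") as f:
        raw = f.read()
    # A UTF-8 BOM identifies the encoding unambiguously.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    try:
        # BOM-less UTF-8 still decodes cleanly under the strict codec.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything that is not valid UTF-8 is treated as legacy ANSI text.
        return raw.decode(fallback)
```
In .NET, the analogous approach is the StreamReader constructor overload that takes a fallback Encoding plus detectEncodingFromByteOrderMarks: true, though that detection relies on the BOM actually being present.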

Using rmarkdown to create headers from a large .txt input

I'm trying to create a PDF with headers using rmarkdown. I'm reading in a large text file, and I want to print it to the PDF with headers. I can easily read the file in and print it with the desired formatting, minus the headers, using
```{r comment='', echo=FALSE}
cat(readLines("blah.txt", encoding="UTF-8"), sep="/n")
```
However, I can't get rmarkdown to evaluate the '#' in my text as headers. I've inserted '#' at the points in the .txt file where I want a header, but it is never rendered as one.
Does anyone know how to get rmarkdown to evaluate the '#' as a header without messing up the formatting of the text file as I already have it?
R Markdown recognizes # headers only if they are preceded by an empty line. So what you need is simply one more line break, i.e. sep="\n\n" (and note the backslashes: \n, not /n).
```{r comment='', echo=FALSE, results='asis'}
cat(readLines("blah.txt", encoding="UTF-8"), sep="\n\n")
```
Note that you may want to add results='asis' to the R chunk, as shown above.

Using the Google text-to-speech service with Spanish text

I am trying to generate a voice message from text using the following URL:
"http://translate.google.com/translate_tts?tl=es&q=La+castaña+está+muy+buena"
I'm calling it from Python, but the urllib library fails with:
File "/usr/lib/python2.7/urllib.py", line 227, in retrieve
url = unwrap(toBytes(url))
File "/usr/lib/python2.7/urllib.py", line 1051, in toBytes
" contains non-ASCII characters")
UnicodeError: URL u'http://translate.google.com/translate_tts?tl=es&q=La+casta\xf1a+est\xe1+muy+buena' contains non-ASCII characters
How can I pass special characters (ñ, á, é, í, ó, ú), which have phonetical meaning, to the Google t2s service?
I was facing the same problem and found the solution in this question:
all I had to do was pass an additional parameter: &ie=UTF-8
That's it.
My resulting PHP looked something like:
$q = "Años más, años menos";
$data = file_get_contents('http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q='.$q;
Note the tl=es&ie=UTF-8 part of the GET parameters.
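For the Python side of the original question, the same fix is to percent-encode the query so the URL ends up pure ASCII. A Python 3 sketch (assuming the endpoint still answers plain GET requests, as it did at the time of the question):
```python
from urllib.parse import quote
from urllib.request import urlopen

q = "La castaña está muy buena"
# quote() percent-encodes the UTF-8 bytes of ñ and á, leaving an all-ASCII URL.
url = "http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q=" + quote(q)
audio = urlopen(url).read()  # MP3 bytes returned by the service
```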