Reading CSV File - invalid byte sequence in UTF-8

I have been using a rake file for a number of months to read in data from a CSV file. I recently tried to read in a new CSV file but keep getting the error "invalid byte sequence in UTF-8". I have tried to work out manually where the problem is, with little success. The CSV file is just text and URLs. There were a few unusual characters initially (where the original text had fancy bullet points), but I have removed those and cannot find any other anomalies.
Is there a way to get round this problem automatically, by identifying and removing the problem characters?

I've found a solution that discards all invalid UTF-8 bytes from a string:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
(taken from this blog post)
Hope this helps.

Whereabouts do you put these? I have something like this:
CSV.foreach("/Users/CarlBourne/Customers/Lloyds/small-test2.csv", options) do |row|
  name, workgroup, address, actual, output = row
  next if nbname == "NBName"
  #ssl_info[name] = workgroup, address, actual, output
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  clean = ic.iconv(output + ' ')[0..-2]
  puts clean
end
However, it doesn't seem to work.

Related

U-SQL extracting files complete contents (extracting full source code from html files)

I've got a bunch of HTML files in my Data Lake Store and would like to get their full source code into a table (just one column with the code from all the files, the output format is not relevant to me, but probably tsv). I can't find a way to use the standard Extractors or anything on the web that works for me. Do I have to write a custom Extractor for that?
I've tried the Extractors.Tsv() and Extractors.Text() with a whole bunch of delimiters. I first tried:
@data =
    EXTRACT source string
    FROM "<MY DIRECTORY IN ADL>"
    USING Extractors.Text(delimiter:'');
This didn't work, as it seems not to accept an empty delimiter, but it also failed when I tried delimiters that don't occur in the HTML files.
Has anyone got an idea how to get this done? It seems to me that I am just stupid, so I hope someone here is a little smarter.
Even better than just the source code would be the source code plus the filename in two columns, but I want to start small.
Thank you!

@files =
    EXTRACT FileName string,
            Text string
    FROM @"/somepath/{FileName}.html"
    USING Extractors.Text(silent: true, delimiter: '`');
OUTPUT @files
TO "/somepath/Test.txt"
USING Outputters.Tsv(outputHeader: false, quoting: false);

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my CSV file content:
100,4,XXX
100,4,XXX
If I put the STRING column first, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is OK too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David

I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file is corrupted somehow? If there are extra bytes at the beginning of the file, they would be ignored in a header row but might cause this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?
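One way to examine the first bytes is sketched below in Python. It assumes the extra bytes are a byte-order mark (BOM), which is only a guess about this particular file; the helper name leading_bom is hypothetical.

```python
import codecs


def leading_bom(data):
    """Return the name of a byte-order mark at the start of data, or None."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    boms = [
        ("UTF-8", codecs.BOM_UTF8),
        ("UTF-32 LE", codecs.BOM_UTF32_LE),
        ("UTF-32 BE", codecs.BOM_UTF32_BE),
        ("UTF-16 LE", codecs.BOM_UTF16_LE),
        ("UTF-16 BE", codecs.BOM_UTF16_BE),
    ]
    for name, bom in boms:
        if data.startswith(bom):
            return name
    return None


# The same CSV row with and without a UTF-8 BOM in front of it.
with_bom = leading_bom(b"\xef\xbb\xbf100,4,XXX\n")
without_bom = leading_bom(b"100,4,XXX\n")
```

For a real file, open it in binary mode ('rb'), read the first four bytes, and pass them to leading_bom. A result other than None would mean the first field is not the plain digits it appears to be in a text editor.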

Inserting characters in a file, Jython

I have written a simple program that reads the first 4 characters, converts them to an integer, reads that many characters, and writes xxxx after them. The program runs, but the only issue is that instead of inserting the characters, it replaces existing ones.
file = open('C:/40_60.txt','r+')
i=0
while 1:
    char = int(file.read(4))
    if not char: break
    print file.read(char)
    file.write('xxxx')
print 'done'
file.close()
I am having an issue with writing the data.
This is my sample data:
00146456135451354500107589030015001555854640020
The expected output is:
001464561354513545xxxx00107589030015001555854640020
But my program above actually gives me this output:
001464561354513545xxxx7589030015001555854640020
That is, xxxx overwrites 0010.
Please suggest a fix.

Files do not support an "insert"-operation. To get the effect you want, you need to rewrite the whole file. In your case, open a new file for writing; output everything you read and in addition, output your 'xxxx'.
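The rewrite described above could look roughly like this. It is a minimal sketch, assuming every record is a 4-digit length followed by that many characters and that the marker should follow every record; the function name copy_with_markers is hypothetical. The function body avoids print, so it is not tied to Python 2 or Python 3 syntax.

```python
import io


def copy_with_markers(src, dst, marker="xxxx", prefix_len=4):
    """Copy length-prefixed records from src to dst, adding marker after each."""
    while True:
        prefix = src.read(prefix_len)
        if not prefix:
            # An empty read means the end of the input was reached.
            break
        length = int(prefix)
        body = src.read(length)
        # Write everything that was read, then the extra marker.
        dst.write(prefix + body + marker)


# Made-up sample data: two records, "abc" (length 3) and "de" (length 2).
source = io.StringIO(u"0003abc0002de")
target = io.StringIO()
copy_with_markers(source, target)
result = target.getvalue()
```

With real files, open the original for reading and a second, new file for writing, pass both to the function, and replace the original with the new file once the copy has finished.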

Jython - Extract information

I'm trying to program in Jython but I have some problems.
I would like to read the information after the :
At the moment "Ext" only reads the first line of the web page and I don't know why.
This is the first problem.
Then I would like to use a while loop to read the whole file.
When I did that, the while loop never finished.
How can I extract the information after the :
Thanks for your help

You should read the next line inside the while loop; at the moment you read only one line. That is the main problem, and it is why your while loop never ends.
What library do you use to read the HTTP response? Your code has no imports.
Is this working code? I don't know of a .Contains() method. In Python/Jython you can write if ':' in Ext to check whether a character or string occurs in another string.
You can split a line like this: s1, s2 = Ext.split(':', 1), and then use the s2 variable. It contains the text after the first : and you can call strip() on it to remove spaces and other whitespace characters from both ends of the string.
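Put together, the points above might look like the sketch below. The function name values_after_colon and the sample lines are made up for illustration; in the real program the lines would come from whatever object reads the HTTP response.

```python
def values_after_colon(lines):
    """Return the stripped text after the first ':' of each line that has one."""
    values = []
    # A for loop fetches the next line on every pass, so it cannot spin
    # forever on the same line the way the original while loop did.
    for line in lines:
        if ':' in line:
            s1, s2 = line.split(':', 1)
            values.append(s2.strip())
    return values


# Hypothetical input; any iterable of strings works, including an open file.
sample_lines = ["Name: Bob", "no colon here", "URL: http://example.com"]
extracted = values_after_colon(sample_lines)
```

Passing 1 as the second argument to split keeps any later colons, such as the one in a URL, inside the extracted value.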

Problem saving uploaded files in Python3

I have a problem with data that is uploaded to the web through the POST method.
If the file is text there's no problem, but the trouble comes when it's an encoded file, such as a picture. When the system inserts the data into the file,
it isn't encoded in the right way. Here is all the code, from the part that takes environ['wsgi.input'] to the part that saves the file:
# Here is the data from environ['wsgi.input'].
# First I convert the bytes into a string, delete the first
# character (the b prefix), and then strip the single quotes.
tmpData = str(rawData)[1:].strip("' '")
dat = tmpData.split('\\r')  # Then I split all the data on '\\r'.
s = open('/home/hidura/test.png', 'w')  # I open the test.png file.
for cont in range(5,150):  # Now go from the 5th position to the 150th position.
    s.write(dat[cont])  # Insert this piece of the data into the file.
s.close()  # Then close it.
Where is the mistake?
Thank you in advance.

Why do you convert the binary data to a string? A PNG file is binary data. Just write the binary data to the file. You need to open the file in binary mode as well.
s = open('/home/hidura/test.png', 'wb')  # 'wb' opens the file for binary writing
s.write(data)  # data is the uploaded bytes, not converted to str
s.close()
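A small round trip shows the same idea end to end. This is only a sketch: the helper name save_bytes is hypothetical, the sample payload is just the 8-byte PNG signature, and the file goes to a temporary directory rather than the path from the question. It also does not parse the request body; if the upload is multipart/form-data, the part headers and boundaries still have to be removed before the file bytes are saved.

```python
import os
import tempfile


def save_bytes(path, payload):
    """Write payload (a bytes object) to path with no text conversion."""
    with open(path, "wb") as f:
        f.write(payload)


# The 8-byte PNG file signature, used here only as sample binary data.
sample = b"\x89PNG\r\n\x1a\n"
path = os.path.join(tempfile.mkdtemp(), "test.png")
save_bytes(path, sample)

# Read the file back in binary mode to confirm the bytes are unchanged.
with open(path, "rb") as f:
    round_trip = f.read()
```

Using a with block closes the file automatically, even if the write raises an error.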
