How to remove characters from tsv file during parsing? - ruby-on-rails-3

I am parsing a TSV file and loading it into MySQL. I got this to work, then found that backslashes in the TSV file are being interpreted as line breaks. I would like to remove the \ from all fields before the data is sent to the database. This is a shortened example; there are 300 columns in the file and many of them will be blank.
begin
CSV.foreach(file, :col_sep => "\t") do |row|
row.map!{ |e| e.gsub(/\\/, '')}
d = Datafeed.new
d.id = row[0]
d.description = row[1]
d.save!
end
end
When I run this example, I get an error: undefined method `gsub' for nil:NilClass. I think this error is being generated by blanks in the file. However, when I try adding
row.map!{ |e| unless e.blank e.gsub(/\\/, '') }
it will not execute and I get an error for an unexpected }.
Is this the right direction to eliminate the backslashes? What is the best approach?
Thanks

The unless statement should follow the other code. That's what is causing the second error. Try this:
row.map!{ |e| e.gsub(/\\/, '') unless e.blank? }
Note: That code will turn "" into nil which may or may not be what you expect.
Your approach seems reasonable.
Edit:
To retain the blanks, you can do the following:
row.map!{ |e| e.blank? ? '' : e.gsub(/\\/, '') }
or if that's a bit too much for one line for you, this:
row.map! do |e|
if e.blank?
''
else
e.gsub(/\\/, '')
end
end
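For comparison, here is a minimal Python sketch of the same nil-safe cleanup. It is not part of the original answer: the function name strip_backslashes and the sample row are made up for illustration, and None stands in for the nil that Ruby's CSV returns for empty fields.

```python
def strip_backslashes(row):
    """Return a copy of row with backslashes removed from every field.

    Missing fields (None) become empty strings, similar to the
    `e.blank? ? '' : e.gsub(...)` version above. Unlike Rails' blank?,
    whitespace-only fields are left untouched here.
    """
    return ['' if field is None else field.replace('\\', '') for field in row]


# Hypothetical row: two fields with backslashes, one missing, one empty.
sample = ['line one\\', None, '', 'C:\\temp']
print(strip_backslashes(sample))
```

The point is the same as in the Ruby version: handle the missing value first, then call the string method only on real strings.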

Related

Stripping strings in rows of a dataframe

I have a dataset which consists of around 6 million URLs (rows).
I'm trying to strip off the protocol part of every URL (https://, http://, ftp://) and also want to remove 'www.', applying that to each row.
I applied the following commands, which work fine:
df['url'] = df['url'].str.replace('http://', "")
df['url'] = df['url'].str.replace('https://', "")
df['url'] = df['url'].str.replace('ftp://', "")
df['url'] = df['url'].str.replace('www.', "")
but I guess it is a naive approach. I'm trying to replace those lines with one more efficient line of code, but my attempts haven't worked well so far.
Can you provide me with a better solution, maybe with .apply or a lambda?
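Use replace with a dictionary instead of str.replace, and assign the result back to the column:
df['url'] = df['url'].replace({
    r'http://': '',
    r'https://': '',
    r'ftp://': '',
    r'www\.': ''
}, regex=True)
Note: Because regex=True, every key is treated as a regular expression. Escape metacharacters such as the dot (hence www\.) and write the patterns as raw strings.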
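If a single pattern is preferred, the prefixes can be folded into one anchored regular expression. The sketch below uses only Python's re module so it runs without pandas; the names PREFIX and strip_prefix are illustrative. As an assumption to verify against your pandas version, the same pattern string should also work as df['url'].str.replace(PREFIX.pattern, '', regex=True).

```python
import re

# Optional scheme, then optional "www.", both only at the start of the string.
PREFIX = re.compile(r'^(?:(?:https?|ftp)://)?(?:www\.)?')


def strip_prefix(url):
    """Remove a leading http://, https:// or ftp:// and a leading www."""
    return PREFIX.sub('', url, count=1)


for url in ['http://www.example.com', 'https://example.org/a',
            'ftp://files.example.net', 'www.example.com',
            'example.com/www.html']:
    print(strip_prefix(url))
```

Anchoring with ^ is a design choice: unlike the unanchored replacements, it leaves a "www." that appears later in the path alone.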

Error, "[" unexpected In Maple

I have the following code, and the annoying error Error, "[" unexpected keeps coming up in Maple. Does anyone see what I am doing wrong? I have been staring at the screen for hours and still don't see it.
Relations:=proc(n::posint,fb::Array,{mindeps::posint:=5,verbose::truefalse:=false})
local s,np,f,j,g,f1,f2,i;
s:=isqrt(n);
np:=ArrayNumElems(fb);
f:=[];
j:=1;
g:=np+mindeps;
while nops(f) < g do
f1:=FBTrialDivision(n,s-j+1,fb);
f2:=FBTrialDivision(n,s+j,fb);
f:=[op(f),f1,f2];
j:=j+1
end do;
if verbose then
printf("smooth",g,2*j-2)
else
print("");
print(2*j-2)
end if
[Vector([seq(f[i][1], i = 1..nops(f))]),Vector([seq(f[i][2], i = 1..nops(f))]),
LinearAlgebra:-Transpose(Matrix([seq(f[i][3], i = 1..nops(f))]))]
end proc:
Second one:
FindFactors:=proc(n,rels,deps)
local fact, i, x, y;
fact:=1;
for i to nops(deps) while fact = 1 or fact = n do
x:=mul(j,j=rels[1]~deps[i]);
y:=isqrt(mul(j,j=rels[2]~deps[i]));
fact:=igcd(x+y,n)
end do;
if fact <> 1 and fact <> n then
``(fact)*``(iquo(n,fact))
else
print("no trivial")
end if;
end proc:
There is no terminator for the preceding line.
As plaintext 1D Maple Notation code, the previous line,
end if
is missing a statement terminator (either colon or semicolon). That's the cause of the error.
I notice that in several places your code makes use of the fact that terminators are not required on lines that precede an end if, end do, end proc, etc. You may be seeing one of the dangers of that habit: when you edit and add a new statement between such a line and the end that trails it, you have to remember to add a statement terminator to the line which is no longer "last". Some people find that it pays off to keep things simple and always use statement terminators, whether the current line needs one or not.

Using SQL like and % in Perl

When I use the following code, I only seem to print the last results from my array. I think it has something to do with my like clause and the % sign. Any ideas?
my @keywords = <IN>;
my @number = <IN2>;
foreach my $keywords (@keywords)
{
chomp $keywords;
my $query = "select *
from table1 a, table2 b
where a.offer = b.offer
and a.number not in (@number)
and a.title like ('%$keywords%')";
print $query."\n";
my $sth = $dbh->prepare($query)
or die ("Error: Could not prepare sql statement on $server : $sth\n" .
"Error: $DBI::errstr\n");
$sth->execute
or die ("Error: Could not execute sql statement on $server : $sth\n" .
"Error: $DBI::errstr\n");
while (my @results = $sth->fetchrow_array())
{
print OUT "$results[0]\t$results[1]\t$results[2]\t$results[3]\t",
"$results[4]\t$results[5]\t$results[6]\t$results[7]\t",
"$results[8]\n";
}
}
close (OUT);
I'm guessing that your IN file was created on a Windows system, so has CRLF sequences (roughly \r\n) between the lines, but that you're running this script on a *nix system (or in Cygwin or whatnot). So this line:
chomp $keywords;
will remove the trailing \n, but not the \r before it. So you have a stray carriage-return inside your LIKE expression, and no rows match it.
If my guess is right, then you would fix it by changing the above line to this:
$keywords =~ s/\r?\n?\z//;
to remove any carriage-return and/or newline from the end of the line.
(You should also make the changes that innaM suggests above, using bind variables instead of interpolating your values directly into the query. But that change is orthogonal to this one.)
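Both points, the stray carriage return and the bind variable, can be reproduced in a small self-contained Python sketch. It uses an in-memory SQLite database rather than the poster's DBI connection, and the table offers, its contents, and the helper names are invented for illustration.

```python
import sqlite3


def clean_keyword(line):
    # Drop every trailing carriage return and newline,
    # so CRLF and LF input files behave the same.
    return line.rstrip('\r\n')


def find_titles(conn, keyword):
    # The keyword is passed as a bound parameter; the % wildcards are
    # part of the bound value, not of the SQL text.
    cur = conn.execute(
        "SELECT title FROM offers WHERE title LIKE ?",
        (f"%{keyword}%",),
    )
    return [row[0] for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (title TEXT)")
conn.executemany(
    "INSERT INTO offers VALUES (?)",
    [("blue widget deluxe",), ("red gadget",)],
)

raw_line = "widget\r\n"  # a line as read from a CRLF file
print(find_titles(conn, raw_line.rstrip("\n")))    # keyword still ends in \r
print(find_titles(conn, clean_keyword(raw_line)))  # keyword fully cleaned
```

With only the newline removed, the pattern contains a carriage return that no title contains, so nothing matches; once both characters are stripped, the expected row comes back.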
Show the output of the print $query and maybe we can help you. Better yet, show the output of:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($query);
Until then, your comment about "replaces the an of and" makes me think your input has carriage returns, and interpolating @number is unlikely to work if it holds more than one value.

Reading CSV File - invalid byte sequence in UTF-8

I have been using a rake file for a number of months to read in data from a CSV file. I have recently tried to read in a new CSV file but keep getting the error "invalid byte sequence in UTF-8". I have tried to manually work out where the problem is, but with little success. The CSV file is just text and URLs. There were a few unusual characters initially (where the original text had fancy bullet points), but I have removed those and cannot find any additional anomalies.
Is there a way to get round this problem automatically and identify and remove the problem characters?
I've found a solution that discards all invalid UTF-8 bytes from a string:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
(taken from this blog post)
Hope this helps.
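The question also asks how to identify the problem characters. As a rough Python illustration of both steps (not the Ruby Iconv call itself), the sketch below reports the offset of the first invalid byte and then decodes while skipping invalid bytes. The function names and the sample bytes are made up.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        return err.start
    return None


def drop_invalid_utf8(data):
    """Decode bytes as UTF-8, silently skipping bytes that are not valid."""
    return data.decode('utf-8', errors='ignore')


# Hypothetical input: valid text with one stray 0xFF byte in the middle.
sample = b'caf\xc3\xa9 \xff ok'
print(first_invalid_utf8_offset(sample))
print(drop_invalid_utf8(sample))
```

Knowing the offset makes it possible to inspect the surrounding bytes in the file before deciding whether dropping them is acceptable.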
Whereabouts do you put these? I have something like this:
CSV.foreach("/Users/CarlBourne/Customers/Lloyds/small-test2.csv", options) do |row|
name, workgroup, address, actual, output = row
next if nbname == "NBName"
@ssl_info[name] = workgroup, address, actual, output
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
clean = ic.iconv(output + ' ')[0..-2]
puts clean
end
However, it doesn't seem to work.

Replace() on a field with line breaks in it?

So I have a field that's basically storing an entire XML file per row, complete with line breaks, and I need to remove some text from close to three hundred rows. The replace() function doesn't find the offending text no matter what I do, and all I can find by searching is a bunch of people trying to remove the line breaks themselves. I don't see any reason that replace() just wouldn't work, so I must be formatting it wrong somehow. Help?
Edit: Here's an example of what I mean in broad terms:
<script>...</script><dependencies>...</dependencies><bunch of other stuff></bunch of other stuff><labels><label description="Field2" languagecode="1033" /></labels><events><event name="onchange" application="false" active="true"><script><![field2.DataValue = (some equation);
</script><dependencies /></event></events><a bunch more stuff></a bunch more stuff>
I need to just remove everything between the events tags. So my sql code is this:
replace(fieldname, '<events><event name="onchange" application="false" active="true"><script><![field2.DataValue = (some equation);
</script><dependencies /></event></events>', '')
I've tried it like that, and I've tried it all on one line, and I've tried using char(10) where the line breaks are supposed to be, and nothing.
Nathan's answer was close. Since this question is the first thing that came up from a search I wanted to add a solution for my problem.
select replace(field,CHAR(13)+CHAR(10),' ')
I replaced the line break with a space in case there was no break. It may be that you always want to replace it with nothing, in which case '' should be used instead of ' '.
Hope this helps someone else and they don't have to click the second link in the results from the search engine.
This worked for me on SQL Server 2012:
UPDATE YourTable
SET YourCol = REPLACE(YourCol, CHAR(13) + CHAR(10), '')
If your column is an xml typed column, you can use the delete method on the column to remove the events nodes. See http://msdn.microsoft.com/en-us/library/ms190254(v=SQL.90).aspx for more info.
Try two simple tests.
Try the replace on an XML string that has no double quotes (or single quotes) but does have CRLFs. Does it work? If yes, you need to escape the quote marks.
Try the replace on an XML string that has no CRLFs. Does it work? If yes, use two nested replace() calls: an inner one for the CRLFs only, then an outer one for the string in question.
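The nested-replace idea can be illustrated outside SQL. In the Python sketch below, the stored text uses CRLF while the search string was typed with a bare LF, so a direct replace finds nothing; normalizing the line breaks first makes the match succeed. The strings and the helper name remove_block are invented for the example.

```python
def remove_block(stored, target):
    """Remove target from stored, ignoring CRLF versus LF differences.

    Note that this also converts any other CRLF in stored to LF.
    """
    normalized_stored = stored.replace('\r\n', '\n')
    normalized_target = target.replace('\r\n', '\n')
    return normalized_stored.replace(normalized_target, '')


# Stored value has a Windows line break; the search text has a Unix one.
stored = '<labels/><events><script>x = 1;\r\n</script></events><more/>'
target = '<events><script>x = 1;\n</script></events>'

print(target in stored)              # direct search fails on the hidden \r
print(remove_block(stored, target))
```

In SQL terms the inner step corresponds to replacing CHAR(13)+CHAR(10) and the outer step to replacing the text itself.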
A lot of people forget that a Windows line break is two characters:
CHAR(13) (\r) followed by CHAR(10) (\n).
Replace both, in that order, and you should be good.
SELECT
REPLACE(field, CHAR(13)+CHAR(10), '')
FROM Blah..