I would like to import a CSV file into my database. I am using Ruby 1.8.7 and Rails 3.2.13 with the gem 'csv_importer'.
I want to fetch productname and release_date from the CSV file.
In my controller:
def import
  csv_text = File.read('test.csv')
  csv = CSV.parse(csv_text, :headers => true)
  csv.each do |row|
    puts row                  # prints the entire row from the csv
    puts row['productname']   # raises the error
  end
end
If I print row or row[0], I get the entire contents of the CSV file as
productname,release_date
xxx,yyy
in my log.
If I print row['productname'], I get the error can't convert String into Integer.
How can I rectify this error?
It looks like you are actually expecting the FasterCSV API, which does support a signature like parse(csv_text, :headers => true).
In Ruby 1.9 the CSV library in the stdlib was replaced with the FasterCSV library, so your code might work as-is on Ruby 1.9+.
If you want to use the FasterCSV style API without upgrading to a newer Ruby, you can grab the gem and use it instead of CSV:
require 'fastercsv'
csv_text = File.read('test.csv')
csv = FasterCSV.parse(csv_text, :headers => true)
csv.each do |row|
  puts row['productname']
end
From http://ruby-doc.org/stdlib-1.8.7/libdoc/csv/rdoc/CSV.html#method-c-parse:
CSV.parse(str_or_readable, fs = nil, rs = nil, &block)
Parse lines from given string or stream. Return rows as an Array of Arrays.
... so row in your case is an Array. Array#[] takes an Integer as its argument, not a String, which is why you're getting the error.
In other words: row['anything'] cannot work, but row[0] and row[1] will give you the values from columns 1 and 2 of the row.
Now, in your case, you are actually calling CSV.parse like so:
CSV.parse(csv_text, :headers => true)
Looking at the docs, we see that the second argument to CSV.parse is the field separator. You're passing :headers => true as the field separator, which tells CSV to split each row wherever it encounters the string "headerstrue" (the stringified hash). It never does, so each row is left unsplit.
If you remove the second argument to CSV.parse you should be closer to what you expect.
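For example, here is a minimal sketch for Ruby 1.8.7's stdlib CSV using positional access (the header row comes through as an ordinary first row, so it is shifted off first):

require 'csv'

csv_text = File.read('test.csv')
rows = CSV.parse(csv_text)    # no second argument: default comma separator
header = rows.shift           # ["productname", "release_date"]
rows.each do |row|
  puts row[0]   # productname
  puts row[1]   # release_date
end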
I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, which is a list of data.frame variables stored in the environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(),"")
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = csvFiles[i], overwrite=T)
  i = i + 1
}
# EXAMPLE CODE THAT SUCCESSFULLY MANUAL IMPORTS INTO mydb
dbWriteTable(mydb,"DEPARTMENT",DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument, since you are passing a string where a data.frame object is expected. Notice that your manual version does not quote DEPARTMENT for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = get(csvFiles[i]), overwrite=T)
}
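As a quick sanity check (just a sketch, assuming the loop ran without errors), you can list the tables in the connection and read one back; DEPARTMENT is one of the names from csvFiles above:

dbListTables(mydb)                     # should list DEPARTMENT, EMPLOYEE_PHONE, ...
head(dbReadTable(mydb, "DEPARTMENT"))  # first rows of the imported data.frame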
Alternatively, consider building a named list of data.frames with mget and looping element-wise over the list's names and data.frame elements with Map:
dfs <- mget(csvFiles)
output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite=T), names(dfs), dfs)
So here's the issue, guys:
I have a very simple little program that reads in some setup details from a file (to make it reusable for other sets of data) and stores them in variables.
It then uses one of those variables to open another file that I need to write some results to, as well as various search parameters.
When I pass the variable to the open() function, it fails saying it can't find the file, but when I pass the exact same information as a written string instead of a variable, it works.
Is this a known problem, or am I just doing something wrong?
The code (problem bit bolded):
def urlTrawl(filename):
    import urllib
    read = open(getMediaPath(filename), "rt")
    baseurl = read.readline()
    orgurl = read.readline()
    lasturlfile = read.readline()
    linksfile = read.readline()
    read.close()

    webpage = ""
    links = ""
    counter = 0
    lasturl = ""
    nexturl = ""
    url = ""
    connection = ""

    try:
        read = open(lasturlfile, "rt")
        lasturl = read.readline()
    except IOError:
        print "IOError"

    # (the code that opens the connection and searches the page was removed
    #  to narrow down the problem area -- see the edit below)
    webpage = connection.read()
    connection.close()

    **file = open(linksfile, "wt")**
    file.close()

    file = open(lasturlfile, "wt")
    file.write(nexturl)
    return 1
The information being passed in
http://www.questionablecontent.net/
http://www.questionablecontent.net/view.php?comic=2480
C:\\Users\\James\\Desktop\\comics\\qclast.txt
C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt
strip\"
src=\"
\"
Pevious
Next
f=\"
\"
EDIT: removed working code to narrow down the problem area, and updated the code to use a direct reference rather than a relative one.
I found the problem in the end.
It was reading in the \n at the end of each line in my details file, and of course the \n isn't anywhere in the website data I'm reading. Removing the last character of each read did the trick:
baseurl = baseurl[:-1]
orgurl = orgurl[:-1]
lasturlfile = lasturlfile[:-1]
linksfile = linksfile[:-1]
search1 = search1[:-1]
search2 = search2[:-1]
search3 = search3[:-1]
search4 = search4[:-1]
search5 = search5[:-1]
search6 = search6[:-1]
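A slightly more robust variant (just a sketch, assuming you only ever want to drop a trailing newline rather than blindly chop the last character) is to rstrip each line as it is read:

# Strip only a trailing newline, so the last line is safe even if the
# file doesn't end with one. getMediaPath() is the helper from the code above.
read = open(getMediaPath(filename), "rt")
baseurl = read.readline().rstrip("\n")
orgurl = read.readline().rstrip("\n")
lasturlfile = read.readline().rstrip("\n")
linksfile = read.readline().rstrip("\n")
read.close()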
I might not be right, but I think this is what's happening.
You're saying this works fine:
file = open('C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt', "wt")
But this doesn't:
# After reading three lines
linksfile = read.readline()
file = open(linksfile, "wt")
There is a difference between these two. In the first piece of code, the doubled backslashes are escape sequences; they resolve to single backslashes when Python is done parsing the literal. Like so:
>>> print 'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt'
C:\Users\James\Desktop\comics\comiclinksqc.txt
But when you read that same text from a file, there is no such parsing. That means the string stored in your variable still contains doubled backslashes.
Try this command out. I bet it fails the same way as when you read the file path in:
file = open(r'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt', "wt")
The r stands for "raw"; it prevents Python from interpreting escape characters. If it does fail the same way, then the doubled backslashes are your problem. To fix it, remove the doubled backslashes in your file:
C:\Users\James\Desktop\comics\comiclinksqc.txt
This isn't a problem in CPython 2.7, and I'm betting it's not in 3.x either; there, doubled backslashes in a path are effectively treated as a single backslash (in most cases, at least). So this may be an issue specific to Jython.
If unclean paths cause errors, you might want to consider doing something to clean them up. os.path.abspath might be helpful, although I can't say if Jython's implementation works as well as CPython's:
>>> print os.path.abspath(r'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt')
C:\Users\James\Desktop\comics\comiclinksqc.txt
>>> print os.path.abspath(r'C:/Users/James/Desktop/comics/comiclinksqc.txt')
C:\Users\James\Desktop\comics\comiclinksqc.txt
I am trying to create a script which will list the datasource name and will show the connection pool utilization(pooled connection, Free Pool Size ext.)
But I am facing an issue when listing the connection pools: if a data source name has a space in it, like "Default Datasource",
then it is listed as "Default Datasource (note the stray leading quote) and the datasource name is not parsed correctly before being passed to the next function.
datasource = AdminConfig.list('DataSource', AdminConfig.getid('/Cell:' + cell + '/')).splitlines()
for datasourceID in datasource:
    datasourceName = datasourceID.split('(')[0]
    print datasourceName
Please help if possible; you can drop me a mail at bubuldey#gmail.com.
Regards,
Bubul
I'm working with a legacy system and need to find a way to insert files into a pre-existing Postgres 8.2 bytea column using Perl.
So far my searching has led me to believe the following:
there is no consensus best approach for this.
lo_import looks promising, but apparently I'm not Perl-savvy enough to get it to work.
I was hoping to do something like the following:
my $bind1 = "foo";
my $bind2 = "123";
my $file = "/path/to/file.ext";
my $q = q{
INSERT INTO generic_file_table
(column_1,
column_2,
bytea_column
)
VALUES
(?, ?, lo_import(?))
};
my $sth = $dbh->prepare($q);
$sth->execute($bind1, $bind2, $file);
$sth->finish();
My script works w/o the lo_import/bytea part. But with it I get this error:
DBD::Pg::st execute failed: ERROR: column "contents" is of type bytea but expression is of type oid at character 176
HINT: You will need to rewrite or cast the expression.
What I think I'm doing wrong is that I'm not passing the actual binary file to the DB properly. I think I'm passing the file path, but not the file itself. If that's true then what I need to figure out is how to open/read the file into a tmp buffer, and then use the buffer for the import.
Or am I way off base here? I'm open to any pointers, or alternative solutions as long as they work with Perl 5.8/DBI/PG 8.2.
Pg offers two ways to store binary files:
large objects, in the pg_largeobject table, which are referred to by an oid. Often used via the lo extension. May be loaded with lo_import.
bytea columns in regular tables. Represented as octal escapes like \000\001\002fred\004 in PostgreSQL 9.0 and below, or as hex escapes by default in Pg 9.1 and above eg \x0102. The bytea_output setting lets you select between escape (octal) and hex format in versions that have hex format.
You're trying to use lo_import to load data into a bytea column. That won't work.
What you need to do is send PostgreSQL correctly escaped bytea data. In a supported, current PostgreSQL version you'd just format it as hex, bang a \x in front, and you'd be done. In your version you'll have to escape it as octal backslash-sequences and (because you're on an old PostgreSQL that doesn't use standard_conforming_strings) probably have to double the backslashes too.
This mailing list post provides a nice example that will work on your version, and the follow-up message even explains how to fix it to work on less prehistoric PostgreSQL versions too. It shows how to use parameter binding to force bytea quoting.
Basically, you need to read the file data in. You can't just pass the file name as a parameter - how would the database server access the local file and read it? It'd be looking for a path on the server.
Once you've read the data in, you need to escape it as bytea and send that to the server as a parameter.
Update: Like this:
use strict;
use warnings;
use 5.16.3;
use DBI;
use DBD::Pg;
use DBD::Pg qw(:pg_types);
use File::Slurp;
die("Usage: $0 filename") unless defined($ARGV[0]);
die("File $ARGV[0] doesn't exist") unless (-e $ARGV[0]);
my $filename = $ARGV[0];
my $dbh = DBI->connect("dbi:Pg:dbname=regress","","", {AutoCommit=>0});
$dbh->do(q{
DROP TABLE IF EXISTS byteatest;
CREATE TABLE byteatest( blah bytea not null );
});
$dbh->commit();
my $filedata = read_file($filename);
my $sth = $dbh->prepare("INSERT INTO byteatest(blah) VALUES (?)");
# Note the need to specify bytea type. Otherwise the text won't be escaped,
# it'll be sent assuming it's text in client_encoding, so NULLs will cause the
# string to be truncated. If it isn't valid utf-8 you'll get an error. If it
# is, it might not be stored how you want.
#
# So specify {pg_type => DBD::Pg::PG_BYTEA} .
#
$sth->bind_param(1, $filedata, { pg_type => DBD::Pg::PG_BYTEA });
$sth->execute();
undef $filedata;
$dbh->commit();
Thank you to those who helped me out. It took a while to nail this one down. The solution was to open the file and store its contents, then specifically call out the bind variable that is of type bytea. Here is the detailed solution:
.....
##some variables
my $datum1 = "foo";
my $datum2 = "123";
my $file = "/path/to/file.dat";
my $contents;
##open the file and store it
open my $FH, $file or die "Could not open file: $!";
{
local $/ = undef;
$contents = <$FH>;
};
close $FH;
print "$contents\n";
## prepare SQL
my $q = q{
INSERT INTO generic_file_table
(column_1,
column_2,
bytea_column
)
VALUES
(?, ?, ?)
};
my $sth = $dbh->prepare($q);
##bind variables and specifically set #3 to bytea; then execute.
$sth->bind_param(1,$datum1);
$sth->bind_param(2,$datum2);
$sth->bind_param(3,$contents, { pg_type => DBD::Pg::PG_BYTEA });
$sth->execute();
$sth->finish();
I exported tables and queries from SQL.
The Ruby (1.9+) way to read CSV appears to be:
require 'csv'
CSV.foreach("exported_mysql_table.csv", {:headers=>true}) do |row|
  puts row
end
Which works great if your data is like this:
"name","email","potato"
"Bob","bob#bob.bob","omnomnom"
"Charlie","char#char.com","andcheese"
"Doug","diggyd#diglet.com","usemeltattack"
This works fine (the first line is a header giving the attribute names). However, if the data is like this:
"id","name","email","potato"
1,"Bob","bob#bob.bob","omnomnom"
2,"Charlie","char#char.com","andcheese"
4,"Doug","diggyd#diglet.com","usemeltattack"
Then we get the error:
.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/csv.rb:1894:in `block (2 levels) in shift': Missing or stray quote in line 2 (CSV::MalformedCSVError)
I think this is because the id is stored as a number, not a string, and thus has no quotes, while the CSV parser expects ALL the entries to have quotes. Ideally I'd like to read "Bob" as a string and 1 as a number (and stuff it all into a hash of hashes).
(I have tried 'FasterCSV'; that gem became the stdlib 'csv' as of Ruby 1.9.)
EDIT:
It was pointed out that the example works fine (derp); I was looking in the wrong place. The error came from multi-line fields, and the question has moved to Ruby CSV read multiline fields.
Using the input you provided, I am unable to reproduce this.
1.9.3p194 :001 > require 'csv'
=> true
1.9.3p194 :002 > CSV.foreach("test.txt", {:headers => true}) { |row| puts row }
1,Bob,bob#bob.bob,omnomnom
2,Charlie,char#char.com,andcheese
4,Doug,diggyd#diglet.com,usemeltattack
=> nil
The only difference I see between our environments is that you are using rbenv, and I am using RVM. I also verified this on another machine I have with ruby 1.9.3-p194. Does the input you provided exactly match what is in your csv?
I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?
I envision something like the following:
-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);
-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
REGEX_EXTRACT(field3,'*-(20\d{6})-*',1) AS datestamp,
field1, field2;
Note the special "filename" datatype in the LOAD statement. Seems like it would have to happen there as once the data has been loaded it's too late to get back to the source filename.
You can use PigStorage by specifying -tagsource as follows:
A = LOAD 'input' USING PigStorage(',', '-tagsource');
B = FOREACH A GENERATE $0 AS input_file_name;
The first field in each tuple ($0) will contain the input file name.
According to API doc http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
Dan
The Pig wiki has an example of PigStorageWithInputPath, which includes the filename in an additional chararray field:
Example
A = load '/directory/of/files/*' using PigStorageWithInputPath()
as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}
-tagSource is deprecated in Pig 0.12.0. Instead use:
-tagFile - appends the input source file name to the beginning of each tuple.
-tagPath - appends the input source file path to the beginning of each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath');
DUMP A ;
will give you the full file path as the first column:
( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P&C,P&CR,UC,40)
Reference: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/PigStorage.html
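Tying this back to the original question, a rough sketch (assuming filenames like something-20130115-something.csv, comma-separated fields, and the two-field schema from the question; adjust the regex and schema to your data) could look like:

-- -tagFile prepends the source file name to every tuple
A = LOAD '/directory/of/files/' USING PigStorage(',', '-tagFile')
    AS (filename:chararray, field1:chararray, field2:int);
-- pull the 8-digit date stamp out of the file name
B = FOREACH A GENERATE
    REGEX_EXTRACT(filename, '.*-(20\\d{6})-.*', 1) AS datestamp,
    field1, field2;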
A way to do this in Bash and PigLatin can be found at: How Can I Load Every File In a Folder Using PIG?.
What I've been doing lately, though, and find to be much cleaner, is embedding Pig in Python. That lets you pass all sorts of variables and such between the two. A simple example is:
#!/path/to/jython.jar
# explicitly import Pig class
from org.apache.pig.scripting import Pig

# COMPILE: compile method returns a Pig object that represents the pipeline
P = Pig.compile(
    "a = load '$in'; store a into '$out';")

input = '/path/to/some/file.txt'
output = '/path/to/some/output/on/hdfs'

# BIND and RUN
results = P.bind({'in': input, 'out': output}).runSingle()

if results.isSuccessful():
    print 'Pig job succeeded'
else:
    raise Exception('Pig job failed')
Have a look at Julien Le Dem's great slides as an introduction to this, if you're interested. There's also a ton of documentation at http://pig.apache.org/docs/r0.9.2/cont.pdf.