I am trying to use CSVLoader from Piggybank. Below are the first two lines of my code:
register 'piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
It throws the following error:
2013-10-24 14:26:51,427 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file
system at: file:///
2013-10-24 14:26:52,029 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.storage.CSVLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Can someone tell me what's going on? I am executing this script from the same folder where my piggybank.jar is located.
I ran into a similar problem when I was experimenting with pig, although it was the XMLLoader for me. The solution that worked for me was to register the entire path to the jar, instead of the relative path. so if the jar is located at /usr/lib/pig/piggybank.jar run the code as follows:
register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
I checked out the code from the url 'http://svn.apache.org/repos/asf/pig/trunk/' and re-built the jar file. IT works fine now. :)
The same is working fine
register 'piggybank.jar' ;
A = load '/xmlinput/demo.xml' using org.apache.pig.piggybank.storage.XMLLoader('property') as (x:chararray);
B = foreach A generate REPLACE(x,'[\n]','') as x;
C = foreach B generate REGEX_EXTRACT_ALL(x,'.(?:)([^<]).(?:)([^<]).*');
D =FOREACH C GENERATE FLATTEN (($0));
STORE D INTO 'xmlcsvpig' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
Related
I have a simple beam pipeline, as follows:
with beam.Pipeline() as pipeline:
output = (
pipeline
| 'Read CSV' >> beam.io.ReadFromText('raw_files/myfile.csv',
skip_header_lines=True)
| 'Split strings' >> beam.Map(lambda x: x.split(','))
| 'Convert records to dictionary' >> beam.Map(to_json)
| beam.io.WriteToBigQuery(project='gcp_project_id',
dataset='datasetID',
table='tableID',
create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)
)
However upon runnning I get a typeError, stating the following:
line 2147, in __init__
self.table_reference = bigquery_tools.parse_table_reference(if isinstance(table,
TableReference):
TypeError: isinstance() arg 2 must be a type or tuple of types
I have tried defining a TableReference object and passing it to the WriteToBigQuery class but still facing the same issue. Am I missing something here? I've been stuck at this step for what feels like forever and I don't know what to do. Any help is appreciated!
This probably occurred since you installed Apache Beam without the GCP modules. Please make sure to do following (in a virtual environment).
pip install apache-beam[gcp]
It's a weird error though so feel free to file a Github issue against the Apache Beam project.
I am trying to extract data which is pipe delimited in Pig. Following is my command
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
Iam getting following error
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My input sample file has exactly 5 lines as following
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options like using the backslash before delimiter(\||,\|\|) but everything failed. Also, I tried with schema but got the same error.I am using Horton works(HDP2.2.4) and pig (0.14.0).
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and by checking PigStorage code source, i think PigStorage argument should be parsed into only one character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
Its helpful if you know how many column you have, and it will not affect performance because it's map side.
When you load data using PigStorage, It only expects single character as delimiter.
However if still you want to achieve this you can use MyRegExLoader-
REGISTER '/path/to/piggybank.jar'
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('||')
as (movieid:int, title:chararray, genre:chararray);
I'm using Hue for PIG scripts on amazon EMR. I am using the declare and default statements as mentioned in the documentation.
I have some %default and %declare statements and it looks like they are
not preprocessed within Hue. Therefore, although the parameters are defined
in my script, the editor keeps popping in a parameter configuration window. If I leave the parameter blank, the job fails with an error.
Sample Script
%declare OUTPUT_FOLDER 'testingOutput01';
ts = LOAD 's3://testbucket1/input/testdata-00000.gz' USING PigStorage('\t');
STORE ts INTO 's3://testbucket1/$OUTPUT_FOLDER' USING PigStorage('\t');
Upon execution, it shows the pop-up window asking for values for OUTPUT_FOLDER. If I leave it blank it fails with the following error:
2015-06-23 20:15:54,908 [main] ERROR org.apache.pig.Main - ERROR 2997:
Encountered IOException. org.apache.pig.tools.parameters.ParseException:
Encountered "<EOF>" at line 1, column 12.
Was expecting one of:
<IDENTIFIER> ...
<OTHER> ...
<LITERAL> ...
<SHELLCMD> ...
Is that the expected behavior? Is this a known issue or am I missing something?
Configuration details:
AMI version:3.7.0
Hadoop distribution:Amazon 2.4.0
Applications:Hive 0.13.1, Pig 0.12.0, Impala 1.2.4, Hue
The same behavior is seen with default instead of declare.
If you need any clarifications then please do comment on this question. I will update it as needed.
Hue does not support %declare with a default statement. It will be fixed with: https://issues.cloudera.org/browse/HUE-2508
The current temporary workaround is to put any value in the popup.
Got a project with Apache FOP, have to make a server based application which will use Apache FOP and pick XML+XSLT files, convert it to XSL:FO and then output an PDF file.
Everything works fine until it comes to XSL:FO=>PDF, Im getting a error in my console which tells me:
"could not connect to java server at line 15"
I'm a newbie programmer, and this might be a simple task to complete but I just can't figure it out how to run this bloody java server ... so my code might be working. Any help would be great. (FYI Im working on Windows)
Here is the Perl Code:
use XML::LibXSLT;
use XML::LibXML;
use XML::ApacheFOP;
my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();
my $source = $parser->parse_file('books.xml');
my $style_doc = $parser->parse_file('books.xsl');
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results = $stylesheet->transform($source);
my $Fop = XML::ApacheFOP->new();
$Fop->fop( xml => "books.xml", xsl => "books.xsl", outfile => "temp.pdf" )
or die "cannot create pdf: " . $Fop->errstr;
Would be glad to get some help.
Cheers.
You need to run JavaServer by this command
/path/to/java -classpath \
/path/to/JavaServer.jar\
:/usr/local/xml-fop/build/fop-0.20.5-RFC3066-patched.jar\
:/usr/local/xml-fop/lib/avalon-framework-cvs-20020806.jar\
:/usr/local/xml-fop/lib/batik.jar\
:/usr/local/xml-fop/lib/xalan-2.4.1.jar\
:/usr/local/xml-fop/lib/xercesImpl-2.2.1.jar \
com.zzo.javaserver.JavaServer
This works for me but with fop 0.20 with fop-0.20.5-RFC3066-patched.jar
I need to be able to read in a path file from my simple_switch.py application.I have added the following code to my simple_switch.py in python.
LOG = logging.getLogger(__name__)
CONF = cfg.CONF
CONF.register_cli_opts([
cfg.StrOpt('path-file', default='test.txt',
help='path-file')
])
I attempt to start the application as follows.
bin/ryu-manager --observe-links --path-file test.txt ryu/app/simple_switch.py
However I get the following error.
usage: ryu-manager [-h] [--app-lists APP_LISTS] [--ca-certs CA_CERTS]
[--config-dir DIR] [--config-file PATH]
[--ctl-cert CTL_CERT] [--ctl-privkey CTL_PRIVKEY]
[--default-log-level DEFAULT_LOG_LEVEL] [--explicit-drop]
[--install-lldp-flow] [--log-config-file LOG_CONFIG_FILE]
[--log-dir LOG_DIR] [--log-file LOG_FILE]
[--log-file-mode LOG_FILE_MODE]
[--neutron-admin-auth-url NEUTRON_ADMIN_AUTH_URL]
[--neutron-admin-password NEUTRON_ADMIN_PASSWORD]
[--neutron-admin-tenant-name NEUTRON_ADMIN_TENANT_NAME]
[--neutron-admin-username NEUTRON_ADMIN_USERNAME]
[--neutron-auth-strategy NEUTRON_AUTH_STRATEGY]
[--neutron-controller-addr NEUTRON_CONTROLLER_ADDR]
[--neutron-url NEUTRON_URL]
[--neutron-url-timeout NEUTRON_URL_TIMEOUT]
[--noexplicit-drop] [--noinstall-lldp-flow]
[--noobserve-links] [--nouse-stderr] [--nouse-syslog]
[--noverbose] [--observe-links]
[--ofp-listen-host OFP_LISTEN_HOST]
[--ofp-ssl-listen-port OFP_SSL_LISTEN_PORT]
[--ofp-tcp-listen-port OFP_TCP_LISTEN_PORT] [--use-stderr]
[--use-syslog] [--verbose] [--version]
[--wsapi-host WSAPI_HOST] [--wsapi-port WSAPI_PORT]
[--test-switch-dir TEST-SWITCH_DIR]
[--test-switch-target TEST-SWITCH_TARGET]
[--test-switch-tester TEST-SWITCH_TESTER]
[app [app ...]]
ryu-manager: error: unrecognized arguments: --path-file
It does look like I need to register a new command line option somewhere before I can use it.Can some-one point out to me how to do that? Also can someone explain how to access the file(text.txt) inside the program?
You're on the right track, however the CONF entry that you are creating actually needs to be loaded before your app is loaded, otherwise ryu-manager has no way of knowing it exists!
The file you are looking for is flags.py, under the ryu directory of the source tree (or under the root installation directory).
This is how the ryu/tests/switch/tester.py Ryu app defines it's own arguments, so you might use that as your reference:
CONF.register_cli_opts([
# tests/switch/tester
cfg.StrOpt('target', default='0000000000000001', help='target sw dp-id'),
cfg.StrOpt('tester', default='0000000000000002', help='tester sw dp-id'),
cfg.StrOpt('dir', default='ryu/tests/switch/of13',
help='test files directory')
], group='test-switch')
Following this format, the CONF.register_cli_opts takes an array of config types exactly as you have done it (see ryu/cfg.py for the different types available).
You'll notice that when you run the ryu-manager help, i.e.
ryu-manager --help
the list that comes up is sorted by application (e.g. the group of arguments under 'test-switch options'). For that reason, you will want to specify a group name for your set of commands.
Now let us say that you used the group name 'my-app' and have an argument named 'path-file' in that group, the command line argument will be --my-app-path-file (this can get a little long), while you can access it in your application like this:
from ryu import cfg
CONF = cfg.CONF
path_file = CONF['my-app']['path_file']
Note the use of dash versus the use of underscores.
Cheers!