Hadoop: wrong classpath in map reduce job - apache

I'm running a Cloudera cluster on 3 virtual machines and am trying to execute an HBase bulk load via a MapReduce job. But I always get the error:
error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found
So it seems that the map process doesn't find the class. I tried the following:
1) adding the hbase.jar to the HADOOP_CLASSPATH on every node
2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code
Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available on the classpath.
If I take a look into the job.xml I see the following entry:
name=tmpjars value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar
This seems a little odd to me: these are my local jars on the Windows system. Maybe these should be the HDFS jars? If so, how can I change the values for "tmpjars"?
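To make the observation about "tmpjars" concrete, here is a minimal Python sketch (separate from the job itself) that splits such a value and lists the entries using the file: scheme, i.e. paths that exist only on the submitting machine. The helper name local_only_jars and the hdfs:// entry in the example are hypothetical. Whether local entries are actually a problem depends on whether the job client uploads them at submission time, which this sketch does not check.

```python
from urllib.parse import urlparse


def local_only_jars(tmpjars):
    """Return the tmpjars entries that point at the local file system."""
    entries = [e.strip() for e in tmpjars.split(",") if e.strip()]
    # Entries with a "file" scheme refer to the client machine, not to HDFS.
    return [e for e in entries if urlparse(e).scheme == "file"]


# Hypothetical mixed value: one local Maven jar and one jar already on HDFS.
example = (
    "file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,"
    "hdfs://192.168.2.41:8020/libs/hbase-0.94.6-cdh4.3.0.jar"
)
print(local_only_jars(example))
```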
Here is the Java code I am trying to execute:
configuration = new Configuration(false);
configuration.set("mapred.job.tracker", "192.168.2.41:8021");
configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
configuration.set("hbase.zookeeper.property.clientPort", "2181");
Job job = new Job(configuration, "HBase Bulk Import for "
+ tablename);
job.setJarByClass(HBaseKVMapper.class);
job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setOutputFormatClass(HFileOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setInputFormatClass(TextInputFormat.class);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileInputFormat.addInputPath(job, new Path("myfile1"));
FileOutputFormat.setOutputPath(job, new Path("myfile2"));
job.waitForCompletion(true);
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
configuration);
loader.doBulkLoad(new Path("myFile3"), hTable);
EDIT:
I tried a little more and it's totally strange. I added the following line to the Java code:
job.setJarByClass(HFileOutputFormat.class);
After I executed this, the error was gone, but another class-not-found exception appeared:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found
HBaseKVMapper is my custom Mapper class that I want to execute. So I tried to add it with "job.setJarByClass(HBaseKVMapper.class)", but that doesn't work since it's only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that, I executed it again and got the HFileOutputFormat class-not-found exception again.
After debugging a little, I found out that the setJarByClass() method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So setJarByClass() only works for one jar file, because calling it again with another jar overwrites job.jar.
While searching for the error I saw the following structure in the job staging directory:
and inside the libjars directory I saw the relevant jar files
So the HBase jar is inside the libjars directory, but the JobTracker doesn't use it when executing the job. Why?

I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:
Both your own classes and the HBase Class HFileOutputFormat need to be available on the classpath locally and remotely.
Submitting the job
This means getting the classpath right locally, for when your driver runs:
$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....
On the server
In your hadoop-env.sh
export HADOOP_CLASSPATH=$(hbase classpath)
or use
TableMapReduceUtil.addDependencyJars

I found a "hacked" solution which worked for me, but I'm not happy with it because it's not really practicable.
My "hacked" solution:
create one big Jar with all necessary class files, I called it "big.jar" and add it to the local (eclipse) classpath
add the line: job.setJarByClass(MyMapperClass.class) ... the MyMapperClass has to be in the big.jar
When I execute this, big.jar is copied to the file system for every job. No more errors. The problem is that the jar is 80 MB in size and has to be copied every time.
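For illustration, a "big.jar" like the one described above could be assembled with a few lines of Python, since a jar is a zip archive. This is only a sketch under simple assumptions: the helper name merge_jars is hypothetical, the first copy of a duplicated entry (such as META-INF/MANIFEST.MF) wins, and signed jars or conflicting resource files are not handled.

```python
import zipfile


def merge_jars(jar_paths, out_path):
    """Copy every entry of the given jars into one archive; the first copy of a name wins."""
    seen = set()
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        for jar_path in jar_paths:
            with zipfile.ZipFile(jar_path) as jar:
                for info in jar.infolist():
                    # Skip duplicates, e.g. META-INF/MANIFEST.MF exists in every jar.
                    if info.filename in seen:
                        continue
                    seen.add(info.filename)
                    out.writestr(info, jar.read(info.filename))
    return sorted(seen)
```

A call such as merge_jars(["mapper.jar", "hbase.jar"], "big.jar") (file names are placeholders) would produce the combined archive and return the list of entry names it contains.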
If anyone knows a better way, I would be thankful to hear it.
EDIT:
Now I am trying to execute jobs with Apache Pig and have exactly the same problem. My hacked solution doesn't work in this case because Pig creates the jobs automatically. Here is the Pig error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found


sqlite3.OperationalError: When trying to connect to S3 Airflow Hook

I'm currently exploring implementing hooks in some of my DAGs. For instance, in one DAG, I'm trying to connect to S3 to send a CSV file to a bucket, which then gets copied to a Redshift table.
I have a custom module which I import to run this process. I am now trying to set up an S3Hook to handle this process instead, but I'm a little confused about setting up the connection and how everything works.
First, I import the hook
from airflow.hooks.S3_hook import S3Hook
Then I try to make the hook instance
s3_hook = S3Hook(aws_conn_id='aws-s3')
Next I try to set up the client
s3_client = s3_hook.get_conn()
However, when I run the client line above, I receive this error
OperationalError: (sqlite3.OperationalError)
no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?
LIMIT ? OFFSET ?]
[parameters: ('aws-s3', 1, 0)]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
I'm trying to diagnose the error, but the traceback is long. I'm a little confused about why sqlite3 is involved when I'm trying to use S3. Can anyone unpack this? Why is this error being thrown when trying to set up the client?
Thanks
Airflow is not just a library - it's also an application.
To execute Airflow code you must have an Airflow instance running, which also means having a metadata database with the required schema.
To create the tables, run airflow db init (airflow initdb on Airflow 1.10).
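The reason sqlite3 shows up is that the hook looks up the connection ID in Airflow's metadata database, which is SQLite by default. The following sketch reproduces the situation with an in-memory database standing in for an uninitialised airflow.db; the helper has_connection_table and the one-column schema are simplifications for illustration, not Airflow's real schema.

```python
import sqlite3


def has_connection_table(db):
    """Return True if the database contains a table named `connection`."""
    row = db.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        ("connection",),
    ).fetchone()
    return row is not None


db = sqlite3.connect(":memory:")  # stand-in for an uninitialised metadata DB
print(has_connection_table(db))   # False: a SELECT would fail with "no such table"

db.execute("CREATE TABLE connection (conn_id TEXT)")  # simplified schema
print(has_connection_table(db))   # True
```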
Edit:
Following the discussion in the comments: your issue is that you have a working Airflow application inside Docker, but your DAGs are written on your local disk. A Docker container is an isolated environment; if you want Airflow to recognize your DAGs, you must move the files into the DAG folder inside the container.

Running manifests (classes) from a task or plan in Puppet Enterprise

TL;DR
In Puppet Enterprise, how do I run a manifest (testpp.pp) from a task or plan (not Bolt)?
plan base_windows::testplan (
  TargetSpec $targets,
  Optional[String] $contents = undef,
  String $filename,
){
  $apply_prep($targets)
  $apply_results = apply($targets, '_catch_errors' => true) {
    class { 'base_windows::testpp': }
  }
  $apply_results.each | $result | {
    notice($result.report)
  }
}
apply_prep seems to succeed, but apply is failing with the following error:
{
"msg" : "Evaluation Error: Unknown function: 'report'. (file: /opt/puppetlabs/server/data/orchestration-services/code/environments/development/modules/base_windows/plans/testplan.pp, line: 16, column: 19)",
"kind" : "bolt/plan-failure",
"details" : {
"class" : "Bolt::PAL::PALError"
}
}
If I change the code to:
plan base_windows::testplan (
  TargetSpec $targets,
  Optional[String] $contents = undef,
  String $filename,
){
  apply_prep($targets)
  $apply_results = apply($targets, '_catch_errors' => true) {
    # Is this how to call a class? I cannot find an example.
    class { 'base_windows::testpp': }
  }
  $apply_results.each |$result| {
    $target = $result.target.name
    if $result.ok {
      out::message("${target} returned a value: ${result.value}")
    } else {
      out::message("${target} errored with a message: ${result.error.message}")
    }
  }
}
The plan tells me it has failed, but there are no errors in the node's report. In fact, there is no entry for the time the plan was executed.
I cannot find any examples on how to call a class from a plan, so the above apply() is a guess, based on this documentation.
I have installed the puppetlabs_reboot module and successfully ran a plan using it, therefore, I conclude my system is set up correctly, it's just my code that is wrong.
Background
I may be going about this all wrong, so here is some background to the problem. Currently, I have a series of manifests that install various packages from the public Chocolatey repository depending on a node's classification. Package definitions are stored in Hiera data and each package's version is set to latest. At the end of the Package{} resource, some manifests include a reboot.
These manifests are used to provision new nodes and keep existing nodes up-to-date with the latest package version.
The Puppet agent is set to run once per hour and if the source package is updated in the Chocolatey repo, on the next Puppet run, the manifest will update the package, rebooting the node, if required.
Goal
New nodes are provisioned with the latest package version.
Prevent package updates at undetermined times on existing nodes.
Continue to allow Puppet agent runs every hour.
Make use of existing manifests.
Ideas
Split out the package{} code from the profile manifest and place them in tasks / plans, allowing packages to be updated out-of-hours.
Specify the actual package version in Hiera. Although this is more declarative and idempotent, it means keeping an eye on over 100 package versions. I guess it would be fairly simple to interrogate the Chocolatey repos with code to pull the latest version number, but even so I am no better off.
Create a task with a script that runs choco upgrade all, however, the next Puppet run would revert package versions according to the version defined in Hiera, meaning Hiera still needs to be kept up-to-date.
Problems
As per the main crux of this question, how do I run manifests (classes) from plans? If I understand correctly, tasks are for ad-hoc scripts, whereas plans can run tasks and manifests. As a lot of time has been invested in writing manifests, I would prefer not to rewrite all my manifests as scripts.
I am confused by the Puppet documentation, as it seems to switch between PE and Bolt syntax. I am using Puppet Enterprise, where Puppet says they don't recommend using Bolt, but their examples seem to cite Bolt commands.
No errors in the node's report. apply_prep() reports that it executed successfully, albeit taking far longer than the puppetlabs_reboot module, but apply() results in a failure and nothing is logged in the node's reports.
Using puppetlabs_reboot module as a reference, it appears their plan uses a bunch of tasks. It appears that they don't use apply() to run their reboot{} class. Is this not duplicating the work?
If anyone has any suggestions or ideas, I'd be grateful if you could share.
I've got it to work. The class I was trying to run required parameters that I hadn't provided!
plan base_windows::testplan (
  TargetSpec $targets,
  Optional[String] $contents = undef,
  String $filename,
){
  apply_prep($targets)
  $apply_results = apply($targets, '_catch_errors' => true) {
    class { 'base_windows::testpp':
      filename => $filename,
      contents => $contents,
    }
  }
  # Output the whole result_set in the PE console
  return $apply_results
}
I found this out using the logs.
Turn on debug level logging in /etc/puppetlabs/puppetserver/logback.xml (root level="debug")
Tail the following logs:
tail -f /var/log/puppetlabs/bolt-server/bolt-server.log
tail -f /var/log/puppetlabs/puppetserver/puppetserver.log | grep -B 5 -A 5 'testplan'
tail -f /var/log/puppetlabs/orchestration-services/orchestration-services.log
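If grep is not at hand, or a saved copy of a log needs filtering afterwards, the same "match plus surrounding lines" idea as grep -B 5 -A 5 can be sketched in Python. The function name grep_context is hypothetical and the sketch works on a list of lines already read into memory, not on a live stream.

```python
def grep_context(lines, needle, before=5, after=5):
    """Return lines containing `needle` plus surrounding context, in order, without duplicates."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]
```

For example, grep_context(open("puppetserver.log").read().splitlines(), "testplan") would return every line mentioning the plan together with five lines on either side (the file name is a placeholder).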

Create a repository on a remote server with RDF4J

I've been trying to create a new repository on a remote GraphDB server using RDF4J, but I'm having problems.
This runs, but is seemingly not correct
HTTPRepositoryConfig implConfig = new HTTPRepositoryConfig(address);
RepositoryConfig repoConfig = new RepositoryConfig("test", "test", implConfig);
Model m = new
However, based on the info I get from "edit repository" in the workbench, the result doesn't look right. All the values are empty, except for id and title.
This fails
I tried to copy the settings from an existing repository that I created on the workbench, but that failed with:
org.eclipse.rdf4j.repository.config.RepositoryConfigException:
Unsupported repository type: owlim:MonitorRepository
The code for that attempt is inspired by the one found here, except that the config file is based on an existing repo, as explained above. I also tried the config file provided in the example, but that failed as well:
org.eclipse.rdf4j.repository.config.RepositoryConfigException:
Unsupported Sail type: graphdb:FreeSail
Anyone got any tips?
UPDATE
As Henriette Harmse correctly pointed out, I should have provided my code, not simply linked to it. That way I might have discovered that I hadn't done a complete copy after all, but changed the important first bits that she points out in her answer. Full code below:
String address = "serveradr";
RemoteRepositoryManager repositoryManager = new RemoteRepositoryManager( address);
repositoryManager.initialize();
// Instantiate a repository graph model
TreeModel graph = new TreeModel();
InputStream config = Rdf4jHelper.class.getResourceAsStream("/repoconf2.ttl");
RDFParser rdfParser = Rio.createParser(RDFFormat.TURTLE);
rdfParser.setRDFHandler(new StatementCollector(graph));
rdfParser.parse(config, RepositoryConfigSchema.NAMESPACE);
config.close();
// Retrieve the repository node as a resource
Resource repositoryNode = graph.filter(null, RDF.TYPE, RepositoryConfigSchema.REPOSITORY).subjects().iterator().next();
// Create a repository configuration object and add it to the repositoryManager
RepositoryConfig repositoryConfig = RepositoryConfig.create(graph, repositoryNode);
It fails on the last line.
ANSWERED: @HenrietteHarmse gives the correct method in her answer below. The error is caused by missing dependencies. Instead of using RDF4J directly, I should have used the graphdb-free-runtime.
There are a number of issues here:
(1) RepositoryManager repositoryManager = new LocalRepositoryManager(new File(".")); will create a repository wherever your Java application is running from.
(2) Changing to new LocalRepositoryManager(new File("$GraphDBInstall/data/repositories")) will cause the repository to be created under the control of GraphDB (assuming you have a local GraphDB instance) only if GraphDB is not running. If you start GraphDB after running your program, you will be able to see the repository in GraphDB workbench.
(3) What you need to do is get the repository manager of the remote GraphDB, which can be done with RepositoryManager repositoryManager = RepositoryProvider.getRepositoryManager("http://IPAddressOfGraphDB:7200");.
(4) In the way you have specified the config, you cause the RDF graph config to be lost. The correct way to specify it is:
RepositoryConfig repositoryConfig = RepositoryConfig.create(graph, repositoryNode);
repositoryManager.addRepositoryConfig(repositoryConfig);
(5) A minor issue is that GraphUtil.getUniqueSubject(...) has been deprecated, for which you can use something like the following:
Model model = graph.filter(null, RDF.TYPE, RepositoryConfigSchema.REPOSITORY);
Iterator<Statement> iterator = model.iterator();
if (!iterator.hasNext())
    throw new RuntimeException("Oops, no <http://www.openrdf.org/config/repository#> subject found!");
Statement statement = iterator.next();
Resource repositoryNode = statement.getSubject();
EDIT on 20180408:
(5) Or you can use the compact option as @JeenBroekstra suggested in the comments:
Models.subject(
graph.filter(null, RDF.TYPE, RepositoryConfigSchema.REPOSITORY))
.orElseThrow(() -> new RuntimeException("Oops, no <http://www.openrdf.org/config/repository#> subject found!"));
EDIT on 20180409:
For convenience I have added the complete code example here.
EDIT on 20180410:
So the actual culprit turned out to be an incorrect pom.xml. The correct version is as below:
<dependency>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb-free-runtime</artifactId>
<version>8.4.1</version>
</dependency>
I believe I just had the same issue. I used the example code from GraphDB Free for running with RDF4J as a remote service and ran into the same exception as you (Unsupported Sail type: graphdb:FreeSail). Henriette Harmse's answer does not directly address this issue but one should follow the suggestions given there to avoid running into issues later. In addition, based on a look into the RDF4J code you need the following dependency in your pom.xml file (assuming GraphDB 8.5):
<dependency>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb-free-runtime</artifactId>
<version>8.5.0</version>
</dependency>
This seems to be because there is some kind of service loading going on with META-INF, which I frankly am not familiar with. Maybe someone can provide more details in the comments. The requirement to add this dependency also seems to be absent from the instructions, so if this works for you, please let me know. Others who followed the same steps we did should then be able to resolve this issue as well.
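As background on the META-INF remark: Java's ServiceLoader mechanism discovers implementations through text files under META-INF/services/ inside a jar, one file per interface with one provider class per line. Assuming that this is how the graphdb:FreeSail type gets registered (an assumption, not confirmed here), a jar can be inspected for such registrations with a short Python sketch. The function name list_service_providers is hypothetical.

```python
import zipfile


def list_service_providers(jar_path):
    """Map each META-INF/services/<interface> file in a jar to its provider class names."""
    prefix = "META-INF/services/"
    providers = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.startswith(prefix) and not name.endswith("/"):
                text = jar.read(name).decode("utf-8")
                # One provider class per line; '#' starts a comment.
                classes = [ln.split("#", 1)[0].strip() for ln in text.splitlines()]
                providers[name[len(prefix):]] = [c for c in classes if c]
    return providers
```

Running this against the runtime jar and the plain RDF4J jars would show which of them, if any, declares a provider for the missing Sail type.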

MsTest, DataSourceAttribute - how to get it working with a runtime generated file?

For some tests I need to run a data-driven test with a configuration that is generated (via reflection) in the ClassInitialize method. I have tried everything, but I just cannot get the data source set up properly.
The test takes a list of classes in a CSV file (one line per class) and then tests that the mappings to the database work out well (i.e. it tries to get one item from the database for every entity, which will throw an exception when the table structure does not match).
The test method is:
[DataSource(
"Microsoft.VisualStudio.TestTools.DataSource.CSV",
"|DataDirectory|\\EntityMappingsTests.Types.csv",
"EntityMappingsTests.Types#csv",
DataAccessMethod.Sequential)
]
[TestMethod()]
public void TestMappings () {
Obviously the file is EntityMappingsTests.Types.csv. It should be in the DataDirectory.
Now, in the Initialize method (marked with ClassInitialize) I put that together and then try to write it.
WHERE should I write it to? WHERE IS THE DataDirectory?
I tried:
File.WriteAllText(context.TestDeploymentDir + "\\EntityMappingsTests.Types.csv", types.ToString());
File.WriteAllText("EntityMappingsTests.Types.csv", types.ToString());
Both result in "the unit test adapter failed to connect to the data source or read the data". More exact:
Error details: The Microsoft Jet database engine could not find the
object 'EntityMappingsTests.Types.csv'. Make sure the object exists
and that you spell its name and the path name correctly.
So where should I put that file?
I also tried just writing it to the current directory and taking out the DataDirectory part - same result. Sadly, there is limited debugging support here.
Please use the ProcessMonitor tool from technet.microsoft.com/en-us/sysinternals/bb896645. Put a filter on MSTest.exe or the associated qtagent32.exe and find out what locations it is trying to load from, and at what point in the test loading process. Then please provide an update on those details here.
After you add the CSV file to your VS project, you need to open the properties for it. Set the Property "Copy To Output Directory" to "Copy Always". The DataDirectory defaults to the location of the compiled executable, which runs from the output directory so it will find it there.

extra-paths not added to python path with zc.recipe.testrunner

I am trying to run tests by adding a version of tornado downloaded from github.com in the sys.path.
[tests]
recipe = zc.recipe.testrunner
extra-paths = ${buildout:directory}/parts/tornado/
defaults = ['--auto-color', '--auto-progress', '-v']
But when I run bin/tests I get the following error:
ImportError: No module named tornado
Am I not understanding how to use extra-paths?
Martin
Have you tried looking into the generated bin/tests script to see whether it contains your path? That will tell you definitively whether your buildout.cfg is correct. Maybe the problem is elsewhere, because your configuration seems OK.
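That check can also be scripted. The sketch below assumes the generated script lists its path entries one per line as quoted strings (which is how buildout-generated scripts usually look, but verify against your own file); the function name script_mentions_path is hypothetical.

```python
def script_mentions_path(script_text, path):
    """Return True if `path` (ignoring a trailing slash) appears as a quoted entry in the script."""
    wanted = path.rstrip("/")
    for line in script_text.splitlines():
        # Strip indentation, a trailing comma and the surrounding quotes.
        entry = line.strip().rstrip(",").strip("'\"")
        if entry.rstrip("/") == wanted:
            return True
    return False
```

For example, script_mentions_path(open("bin/tests").read(), "/path/to/buildout/parts/tornado") returns True only if that directory made it into the generated script (the path is a placeholder for your buildout directory).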
If you regularly include various branches from git/mercurial or elsewhere in your buildout, you might be interested in mr.developer. mr.developer can download a package and add it to develop =. You won't need to set extra-paths in every section.