yarn-site.xml vs. yarn-default.xml in YARN

What is the difference between yarn-site.xml and yarn-default.xml in YARN? It looks like yarn-default.xml is deprecated in Hadoop 2.2?

In all of Hadoop, the *-default.xml files mostly serve as documentation of the default values; the defaults themselves are hard-coded into the source anyway. The *-site.xml files contain any site-specific changes and override the defaults specified in the *-default.xml files.
To the specific question: yarn-default.xml and yarn-site.xml are both read by the Hadoop YARN daemons, with yarn-default.xml providing the defaults and yarn-site.xml holding the custom configuration values that override them.
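For example, overriding a single default in yarn-site.xml might look like this (the property name is taken from yarn-default.xml; the value is purely illustrative):
<configuration>
    <!-- overrides the NodeManager memory default documented in yarn-default.xml -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value>
    </property>
</configuration>
Only the properties you list in yarn-site.xml are overridden; everything else keeps its yarn-default.xml value.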

Related

Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Connector dependencies should be in default scope
This is what flink-quickstart-scala suggests:
<!-- Add connector dependencies here. They must be in the default scope (compile). -->
<!-- Example:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
-->
It also aligns with Flink project configuration:
We recommend packaging the application code and all its required dependencies into one jar-with-dependencies which we refer to as the application jar. The application jar can be submitted to an already running Flink cluster, or added to a Flink application container image.
Important: For Maven (and other build tools) to correctly package the dependencies into the application jar, these application dependencies must be specified in scope compile (unlike the core dependencies, which must be specified in scope provided).
Hive connector dependencies should be in provided scope
However, the Flink Hive integration docs suggest the opposite:
If you are building your own program, you need the following dependencies in your mvn file. It’s recommended not to include these dependencies in the resulting jar file. You’re supposed to add dependencies as stated above at runtime.
Why?
The reason for this difference is that for Hive it is recommended to start the cluster with the respective Hive dependencies already in place. The documentation states that it's best to put the dependencies into the lib directory before you start the cluster. That way the cluster is able to run jobs which use Hive, and at the same time you don't have to bundle these dependencies into the user jar, which reduces its size. However, there shouldn't be anything preventing you from bundling the Hive dependency with your user code if you want to.
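In practice that means declaring the Hive connector with provided scope, roughly like this (a sketch; check the Hive integration docs for the exact artifact list for your Flink and Hive versions):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <!-- provided: supplied at runtime by the jars placed in the cluster's lib directory -->
    <scope>provided</scope>
</dependency>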

Sesame And RDF4J Custom Server Data Location

I have a Tomcat instance running an openrdf-sesame environment. By default, my openrdf-sesame database configuration and data live at %APPDATA%\aduna. I am trying to change where this data is saved to something custom like C:\aduna. I have looked at documentation online, but it does not specify whether this is defined in a configuration file somewhere or is a hard-coded location. I also saw that RDF4J is the new replacement for openrdf-sesame? I wouldn't mind upgrading if that let me specify where to save my data. Any ideas?
OpenRDF Sesame is no longer maintained, it has been succeeded by the Eclipse RDF4J project. There is a migration guide available to help you figure out what to do when updating your project.
Although the Sesame project is no longer maintained, a documentation archive is available, and of course a lot of the RDF4J documentation also applies to Sesame, albeit with slightly different package names.
As for your specific question: the directory Sesame Server uses is determined by the system property info.aduna.platform.appdata.basedir. Set this property (at JVM startup, using the -D flag) to the desired location. See the archived documentation about application directory configuration for more details. As an aside: note that in RDF4J this property has been renamed (to org.eclipse.rdf4j.appdata.basedir), so if you upgrade to RDF4J, be sure to change this.
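For a Tomcat-hosted Sesame Server on Windows, one common place to set the flag is a setenv.bat next to catalina.bat (a sketch; adjust the path to your setup):
rem %CATALINA_HOME%\bin\setenv.bat -- create this file if it does not exist
set CATALINA_OPTS=%CATALINA_OPTS% -Dinfo.aduna.platform.appdata.basedir=C:\aduna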

Where are IDEA run/debug configuration *defaults* stored?

It's easy to share run configuration instances in IDEA - simply create a configuration and check "Share":
I'm already version controlling the resulting files in .idea/runConfigurations (in the relevant project) and part of ~/.IntelliJIdea* (for puppetising desktops). However, I can't find where IDEA stores the configuration defaults - it doesn't seem to be in either of these places. They must obviously be persisting it somewhere, because it works across restarts. The official documentation is unusually unhelpful in this case:
This check box is not available when editing the run/debug configuration defaults.
The particular use case is that I'd like all future "Behave" configurations to have the environment variable DISPLAY set to :1 to run browser tests in VNC rather than in the foreground.
Defaults (the ones you configure under the Defaults node in your screenshot) are per-project, and are therefore stored together with other non-shared configs in .idea/workspace.xml (which is not supposed to be stored under VCS, as it contains developer/computer-specific settings).
You can find such entries in the aforementioned file under the <component name="RunManager"> node. Default entries will have the default="true" attribute.
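For the DISPLAY use case above, such an entry looks roughly like this (a sketch; the type and factoryName attributes depend on the plugin, so copy them from a real entry rather than from here):
<component name="RunManager">
    <!-- type/factoryName below are placeholders -->
    <configuration default="true" type="BehaveRunConfigurationType" factoryName="Behave">
        <envs>
            <env name="DISPLAY" value=":1" />
        </envs>
    </configuration>
</component>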
There are no "defaults of defaults" for run/debug configs that you can edit or provision (i.e. defaults that would be applied to any new project). They are not stored in separate config files at the IDE level but are initialized directly in plugin code.

Ivy: <ivy:settings> vs. <ivy:configure>

I have a master Ivy project that others include in their projects via an svn:externals property. The project contains the Ivy jar, the default ivysettings.xml file that connects to our repository, and a few Ant macros that allow me to standardize the way we build jars, etc. (For example, users use <jar.macro> instead of <jar>. The <jar.macro> takes the same parameters, but also automatically embeds the pom.xml in the jar and adds Jenkins build information to the Manifest.)
We also use Jenkins as our continuous integration system. One of the things I want to do is to clean the Ivy cache for each build, so we don't have any jar issues due to cache problems. To do this, I've setup my ivysettings.xml file to define a separate cache for each Jenkins Executor:
<ivysettings>
    <property name="env.EXECUTOR_NUMBER" value="0" override="false"/>
    <caches
        defaultCacheDir="${ivy.default.ivy.user.dir}/cache-${env.EXECUTOR_NUMBER}"
        resolutionCacheDir="${ivy.dir}/../target/ivy.cache"/>
    <settings defaultResolver="default"/>
    <include file="${ivy.dir}/ivysettings-public.xml"/>
    <include url="${ivy.default.settings.dir}/ivysettings-shared.xml"/>
    <include url="${ivy.default.settings.dir}/ivysettings-local.xml"/>
    <include url="${ivy.default.settings.dir}/ivysettings-main-chain.xml"/>
    <include url="${ivy.default.settings.dir}/ivysettings-default-chain.xml"/>
</ivysettings>
I originally used the <ivy:settings> task to configure our projects with Ivy. However, all of the Jenkins executors were using the same Ivy cache which caused problems. I switched from <ivy:settings> to <ivy:configure> and the problem went away. Apparently, <ivy:configure> sets up Ivy immediately (and thus sets up the caches correctly) while <ivy:settings> doesn't set Ivy up until <ivy:resolve> is called.
I've seen some emails on Nabble about <ivy:configure> being deprecated (or maybe not). I see nothing in the Ivy online documentation stating <ivy:configure> is being deprecated.
So, when would you use <ivy:settings> vs. <ivy:configure>? In my case, since I needed separate caches for each Jenkins executor, I needed to use <ivy:configure>, but is there a reason I might use <ivy:settings> over <ivy:configure>? And is <ivy:configure> deprecated?
Here's what I found:
<ivy:settings> is newer and the preferred way.
<ivy:configure> may or may not be deprecated.
<ivy:settings> doesn't apply my Ivy settings until <ivy:resolve> is called, while <ivy:configure> applies all Ivy settings as soon as the task is executed.
The last one is my issue. Since I have parallel Jenkins builds going on, and I want to start out each build with a completely clean cache, I use customized cache settings depending upon the Jenkins executor number. The caches are labeled cache-0 through cache-5.
However, since <ivy:settings> isn't applied until I call <ivy:resolve>, my customized cache settings aren't picked up. I call <ivy:cleancache> before I call <ivy:resolve>, which causes the builds to clean out a common cache. Hilarity ensues. Using <ivy:configure> fixes this problem.
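In build.xml terms the fix boils down to configuring Ivy before touching the cache; a minimal sketch using the standard Ivy Ant tasks:
<project xmlns:ivy="antlib:org.apache.ivy.ant" name="example" default="resolve">
    <!-- <ivy:configure> applies ivysettings.xml immediately, so the
         per-executor cache is already in effect when cleancache runs -->
    <ivy:configure file="${ivy.dir}/ivysettings.xml"/>

    <target name="resolve">
        <ivy:cleancache/>
        <ivy:resolve/>
    </target>
</project>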

Apache Ivy Terms & Ambiguities

I'm learning how to augment my build with Ivy using a "brute force" method of just trying to get a few sample projects up and running. I've pored over the official docs and read several online tutorials, but am choking on a few terms that seem to be used vaguely, ambiguously, and/or in conflicting ways. I'm just looking for an experienced Ivy connoisseur to help bring some clarity to these terms for me.
"Resolution" Cache vs. "Repository" Cache vs. "Ivy" Cache
The "Ivy Repository", as opposed to my normal SCM which is a server running SVN
What's the difference between these 3 types of cache? What's the difference between the "Ivy Repository" and my SVN?
Thanks to anyone who can help!
"Resolution" Cache vs. "Repository" Cache vs. "Ivy" Cache
The Ivy cache is basically a folder where Ivy stores artifacts and configuration data. Unless configured differently, it can be found in UserHome/.ivy2.
The Ivy cache consists of the resolution cache and a repository cache.
The repository cache contains the artifacts that Ivy has downloaded from a repository. It is caching the repository, so that Ivy won't need to query the repository every time it tries to resolve/download an artifact: if it finds a suitable artifact in the repository cache, it will not query the repository, thus saving the cost of that query. If and how the cache is used is a bit more complicated and depends on the dependencies/configuration.
The resolution cache is a collection of Ivy-specific files that tell Ivy how an artifact was resolved (downloaded).
The "Ivy Repository", as opposed to my normal SCM which is a server running SVN
A repository in the Ivy world is a location which contains artifact (jar) files. This can be the local filesystem or a web server. It has no versioning system; each version of an artifact is contained in a separate folder. You can't commit artifacts, you just add them to the file system. See the terminology:
org\artifact\version1\artifact.jar
org\artifact\version2\artifact.jar
A repository is accessed via a resolver, which has to know the layout of the repository.
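For example, a filesystem resolver matching the layout above might be declared in ivysettings.xml like this (repo.dir is a hypothetical property pointing at the repository root):
<ivysettings>
    <resolvers>
        <!-- the patterns must mirror the repository's folder layout -->
        <filesystem name="local-repo">
            <ivy pattern="${repo.dir}/[organisation]/[module]/[revision]/ivy.xml"/>
            <artifact pattern="${repo.dir}/[organisation]/[module]/[revision]/[artifact].[ext]"/>
        </filesystem>
    </resolvers>
</ivysettings>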
From the doc on caches:
Cache types
An Ivy cache is composed of two different parts:
the repository cache
The repository cache is where Ivy stores data downloaded from module repositories, along with some meta information concerning these artifacts, like their original location.
This part of the cache can be shared if you use a well suited lock strategy.
the resolution cache
This part of the cache is used to store resolution data, which is used by Ivy to reuse the results of a resolve process.
This part of the cache is overwritten each time a new resolve is performed, and should never be used by multiple processes at the same time.
While there is always only one resolution cache, you can define multiple repository caches, each resolver being able to use a separate cache.
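As a sketch, defining named repository caches and pointing a resolver at one of them looks like this (names and paths are illustrative):
<ivysettings>
    <caches default="default-cache">
        <cache name="default-cache" basedir="${ivy.default.ivy.user.dir}/cache-default"/>
        <cache name="shared-cache" basedir="/var/ivy/shared-cache"/>
    </caches>
    <resolvers>
        <!-- this resolver keeps its downloads in its own repository cache -->
        <ibiblio name="central" m2compatible="true" cache="shared-cache"/>
    </resolvers>
</ivysettings>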