Pig XmlLoader code - apache-pig

Could someone please help me write a Pig XMLLoader script for this kind of data?
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="9" CreationDate="2012-01-17T21:03:59.200" Score="30" ViewCount="698" Body="<p>From the front end, <code>\[InvisibleApplication]</code> can be entered as <kbd>Esc</kbd> <kbd>@</kbd> <kbd>Esc</kbd>, and is an invisible operator for <code>@</code>!. By an unfortunate combination of key-presses (there may have been a cat involved), this crept up in my code and I spent a great deal of time trying to figure out why in the world <code>f x</code> was being interpreted as <code>f[x]</code>. Example:</p>
<p><img src="http://i.stack.imgur.com/2Hxll.png" alt="enter image description here"></p>
<p>Now there is no way I could've spotted this visually. The <code>*Form</code>s weren't of much help either. If you're careful enough, you can see an invisible character between <code>f</code> and <code>x</code> if you move your cursor across the expression. Eventually, I found this out only by looking at the contents of the cell. </p>
<p>There's also <code>\[InvisibleSpace]</code>, <code>\[InvisibleComma]</code> and <code>\[ImplicitPlus]</code>, which are analogous to the above. There must be some use for these (perhaps internally), which is why it has been implemented in the first place. I can see the use for invisible space (lets you place superscripts/subscripts without needing anything visible to latch on to), and invisible comma (lets you use indexing like in math). It's the invisible apply that has me wondering...</p>
<p>The only advantage I can see is to sort of visually obfuscate the code. Where (or how) is this used (perhaps internally?), and can I disable it? If it's possible to disable, will there be any side effects?</p>
" OwnerUserId="5" LastEditorUserId="5" LastEditDate="2012-04-29T04:50:20.303" LastActivityDate="2013-10-22T10:48:32.560" Title="Usage of \[InvisibleApplication] and other related invisible characters" Tags="<front-end><syntax>" AnswerCount="4" CommentCount="1" FavoriteCount="4" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="42" CreationDate="2012-01-17T21:10:34.680" Score="49" ViewCount="1347" Body="<p><code>Cases</code>, <code>Select</code>,<code>Pick</code> and <code>Position</code> each have different syntaxes and purposes, but there are times when you can express the same calculation equivalently using either of them. So with this input:</p>
<pre><code>test = RandomInteger[{-25, 25}, {20, 2}]
{{-15, 13}, {-8, 16}, {-8, -19}, {7, 6}, {-21, 9}, {-3, -25}, {21, -18}, {4, 4}, {2, -2}, {-24, 8}, {-17, -8}, {4, -18}, {22, -24}, {-4, -3}, {21, 0}, {19, 18}, {-23, -8}, {23, -25}, {14, -2}, {-1, -13}}
</code></pre>
<p>You can get the following equivalent results:</p>
<pre><code>Cases[test, {_, _?Positive}]
{{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}
Select[test, #[[2]] &gt; 0 &amp;]
{{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}
Pick[test, Sign[test[[All, 2]] ], 1]
{{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}
test[[Flatten@Position[test[[All, 2]], _?Positive] ]]
{{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}
</code></pre>
<p>Are there performance or other considerations that should guide which you should use? For example, is the pattern-matching used in <code>Cases</code> likely to be slower than the functional tests used in <code>Select</code>? Are there any generic rules of thumb, or is testing the particular case you are using the only solution?</p>
" OwnerUserId="8" LastEditorUserId="8" LastEditDate="2012-01-20T04:45:34.940" LastActivityDate="2012-01-20T04:45:34.940" Title="What best practices or performance considerations are there for choosing between Cases, Position, Pick and Select?" Tags="<performance-tuning><pattern-matching>" AnswerCount="4" CommentCount="0" FavoriteCount="28" />
</posts>

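To load the XML data, you can use the following code:
-- XMLLoader ships with piggybank; adjust the jar path for your installation.
REGISTER piggybank.jar;
-- Each <row .../> element is loaded as a single chararray record.
A = LOAD '$input' using
org.apache.pig.piggybank.storage.XMLLoader('row')
as (x:chararray);
B = FOREACH A GENERATE x;
dump B;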
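Each record that XMLLoader emits is the raw text of one <row .../> element, so the individual attributes still have to be pulled out of that string afterwards. As a rough illustration of that second step (written in Python rather than Pig, and using a shortened, hypothetical row whose Body is entity-escaped as it would be in a well-formed dump), a minimal sketch might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened record of the kind XMLLoader('row') hands back.
# The attribute values are made up for illustration.
row = ('<row Id="1" PostTypeId="1" Score="30" '
       'Body="&lt;p&gt;Hi&lt;/p&gt;" Tags="&lt;front-end&gt;&lt;syntax&gt;" />')


def parse_row(row_text):
    """Return the attributes of one <row .../> element as a plain dict."""
    return dict(ET.fromstring(row_text).attrib)


fields = parse_row(row)
# Entity-escaped values come back unescaped.
print(fields["Id"], fields["Score"], fields["Body"])
```

Inside Pig itself, one common way to do the same thing is to apply the built-in REGEX_EXTRACT to x once per attribute; the exact patterns depend on your data.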

Filtering numbers or strings in a comma delimited object

I am using multi-select to filter out data.
<Multiselect
v-model="roles"
class="input1"
placeholder="Select Roles"
mode="tags"
:searchable="true"
:options="roleOptions"
/>
<Multiselect
v-model="sub_organization"
class="input1"
placeholder="Select sub-organization"
mode="tags"
:searchable="true"
:options="suborgOptions"
/>
data:() => ({
mode: "tags",
closeOnSelect: false,
roleOptions: [],
suborgOptions: [],
searchable: true,
sub_organization: [],
roles: [],
filteredData: [],
fetchedData: [],
}),
searchResult() {
this.filteredData = this.fetchedData.filter((data) => {
// var intRoles = parseInt(data.roles.split(", "))
// var intSuborgs = parseInt(data.suborgs.split(", "))
return (
// intSuborgs == this.sub_organization &&
// intRoles == this.roles
data.suborgs.includes(this.sub_organization) &&
data.roles.includes(this.roles)
);
});
},
data.roles = {1, 3},
{1, 4, 5, 7},
{10, 14},
{1, 9},
{2, 4, 6, 8},
{4, 5},
{4, 10},
{9, 1, 4}
For example:
When I use includes() and search for 1, it returns every data.roles entry that contains the digit 1, including entries such as {10, 14} and {4, 10}.
Using includes():
searching 1 returns {1, 3}, {1, 4, 5, 7}, {10, 14}, {1, 9}, {4, 10}, {9, 1, 4}
searching 4 returns {1, 4, 5, 7}, {10, 14}, {4, 10}, {9, 1, 4}, {4, 5}, {2, 4, 6, 8}
As you can see, I commented out intRoles and intSuborgs. I also tried splitting the string and passing the result to parseInt; with that, searching 1 returns only the objects that have 1 at index 0.
Using parseInt and split:
searching 1 returns {1, 3}, {1, 4, 5, 7}, {1, 9}
searching 4 returns {4, 5}, {4, 10}
What I want is this: searching 1 should return only the objects that contain 1 itself, not two-digit numbers that contain a 1. Since I am using a multi-select, searching 1 and 4 should return only the objects that contain both 1 and 4, again ignoring two-digit numbers that merely contain those digits.
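The difference between the two behaviours comes down to matching substrings versus matching whole numbers. A minimal sketch of the whole-number approach, written in Python purely to show the logic, and assuming each roles value is a comma-delimited string such as "1, 4, 5, 7" (the helper names parse_ids and matches are hypothetical):

```python
def parse_ids(text):
    """Split a comma-delimited string such as "1, 4, 5, 7" into a set of ints."""
    return {int(part) for part in text.split(",") if part.strip()}


def matches(text, selected):
    """True when every selected id appears as a whole number in text."""
    return set(selected) <= parse_ids(text)


# Sample values taken from the question, stored as comma-delimited strings.
roles = ["1, 3", "1, 4, 5, 7", "10, 14", "1, 9",
         "2, 4, 6, 8", "4, 5", "4, 10", "9, 1, 4"]

only_1 = [r for r in roles if matches(r, [1])]
both_1_and_4 = [r for r in roles if matches(r, [1, 4])]
print(only_1)
print(both_1_and_4)
```

In the component's JavaScript, the same idea would be to split each string on ",", convert the parts with Number, and check that every selected value is included in the resulting array. Note that with this logic an empty selection matches every row.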

The attribute 'children_' of Agglomerative clustering

I am writing a very basic program with no more than 20 observations (X1 is the original dataset).
X1_test=X1_df.iloc[0:20,]
from sklearn.cluster import AgglomerativeClustering
ag= AgglomerativeClustering(n_clusters=6, affinity= 'euclidean', linkage='ward', compute_full_tree= True, compute_distances=True)
ag.fit(X1_test)
When I inspect the attribute ag.children_, the values come out as
array([[10, 13],
[ 1, 6],
[16, 18],
[ 2, 19],
[ 4, 20],
[ 8, 15],
[12, 23],
[14, 21],
[ 0, 17],
[ 9, 26],
[22, 27],
[11, 24],
[ 5, 29],
[ 7, 25],
[ 3, 28],
[30, 31],
[32, 35],
[33, 36],
[34, 37]], dtype=int64)
Why does this output contain values of 20 and above when I have only 20 observations?
Please help
From what I understand of scikit-learn's documentation, each row of children_ represents a node with two children. If a value is smaller than your sample size, it is a leaf, and you can treat the value as the sample index. If the value is greater than or equal to your sample size, it is not a leaf but a node that was formed by an earlier merge. Which node? The one stored at index [value - n_samples] of the children_ attribute.
So, for example, if your sample size is 20 and a node merges 3 with 28, then 3 is the leaf for the sample with index 3 (the fourth sample, since indices start at 0) and 28 is the node stored in children_[8] (because 28 - 20 = 8). Counting rows from 0, that is the node [0, 17] in your case.
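The lookup rule can be checked directly against the array from the question. The sketch below is plain Python with the children_ values copied in by hand (so it does not need scikit-learn); the helper name leaves_under is hypothetical. With 0-based indexing, the lookup for node 28 lands on children_[8], which is [0, 17].

```python
# children_ array copied from the question; the data set has 20 samples.
children = [
    [10, 13], [1, 6], [16, 18], [2, 19], [4, 20], [8, 15], [12, 23],
    [14, 21], [0, 17], [9, 26], [22, 27], [11, 24], [5, 29], [7, 25],
    [3, 28], [30, 31], [32, 35], [33, 36], [34, 37],
]
n_samples = 20


def leaves_under(node):
    """Return the original sample indices that were merged into `node`."""
    if node < n_samples:
        # Values below n_samples are leaves, i.e. sample indices.
        return [node]
    # Values >= n_samples refer to the merge stored at row node - n_samples.
    left, right = children[node - n_samples]
    return leaves_under(left) + leaves_under(right)


print(children[28 - n_samples])   # the merge that created node 28
print(leaves_under(28))           # samples contained in node 28
print(sorted(leaves_under(38)))   # the last merge (row 18) covers every sample
```

Twenty samples need 19 merges, which is why the array has 19 rows and node ids run from 20 up to 38.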

How to input data into a line chart in Pentaho?

I want to make this chart in Pentaho CDE:
based on this chart (I think it is the most similar among the CCC components):
(The code is in this link.)
but I don't know how to adapt my input data to that chart.
For example, I want to consume data in this format:
[Year, customers_A, customers_B, cars_A, cars_B]
[2014, 8, 4, 23, 20]
[2015, 20, 6, 30, 38]
How can I feed my data into this chart?
Your data should come as an object such as this:
data = {
metadata: [
{ colName: "Year", colType:"Numeric", colIndex: 1},
{ colName: "customers_A", colType:"Numeric", colIndex: 2},
{ colName: "customers_B", colType:"Numeric", colIndex: 3},
{ colName: "cars_A", colType:"Numeric", colIndex: 4},
{ colName: "cars_B", colType:"Numeric", colIndex: 5}
],
resultset: [
[2014, 8, 4, 23, 20],
[2015, 20, 6, 30, 38]
],
queryInfo: {totalRows: 2}
}

Is it possible to use highcharts heat map chart using rally sdk?

I'm building a rally custom HTML app. I would like to create a heat map chart. How can I do that?
I've tried to create Rally Chart of the type 'heatmap' (code below). As a result, I see the 404 error message in the console:
GET https://rally1.rallydev.com/slm/panel/highcharts/heatmap.js?_dc=1558434971290 404
When I try to use the Highcharts library directly, I get a conflict with https://rally1.rallydev.com/apps/2.1/lib/analytics/analytics-all.js: Highcharts is loaded twice. As a result, I guess the analytics library does not load, and I get an error like Lumenize.Time undefined.
this.chart = this.add(
Ext.create('Rally.ui.chart.Chart', {
loadMask: false,
chartData: {
series: [{
type:'heatmap',
name: 'Sales per employee',
borderWidth: 1,
data: [[0, 0, 10], [0, 1, 19], [0, 2, 8], [0, 3, 24], [0, 4, 67], [1, 0, 92], [1, 1, 58], [1, 2, 78], [1, 3, 117], [1, 4, 48], [2, 0, 35], [2, 1, 15], [2, 2, 123], [2, 3, 64], [2, 4, 52], [3, 0, 72], [3, 1, 132], [3, 2, 114], [3, 3, 19], [3, 4, 16], [4, 0, 38], [4, 1, 5], [4, 2, 8], [4, 3, 117], [4, 4, 115], [5, 0, 88], [5, 1, 32], [5, 2, 12], [5, 3, 6], [5, 4, 120], [6, 0, 13], [6, 1, 44], [6, 2, 88], [6, 3, 98], [6, 4, 96], [7, 0, 31], [7, 1, 1], [7, 2, 82], [7, 3, 32], [7, 4, 30], [8, 0, 85], [8, 1, 97], [8, 2, 123], [8, 3, 64], [8, 4, 84], [9, 0, 47], [9, 1, 114], [9, 2, 31], [9, 3, 48], [9, 4, 91]],
dataLabels: {
enabled: true,
color: '#000000'
}
}]
},
chartConfig: {
chart: {
marginTop: 40,
marginBottom: 80,
plotBorderWidth: 1
},
title: {
text: 'Sales per employee per weekday'
},
xAxis: {
categories: ['Alexander', 'Marie', 'Maximilian', 'Sophia', 'Lukas', 'Maria', 'Leon', 'Anna', 'Tim', 'Laura']
},
yAxis: {
categories: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
title: null
},
colorAxis: {
min: 0,
minColor: '#FFFFFF',
//maxColor: Highcharts.getOptions().colors[0]
},
legend: {
align: 'right',
layout: 'vertical',
margin: 0,
verticalAlign: 'top',
y: 25,
symbolHeight: 280
},
tooltip: {
formatter: function () {
return '<b>' + this.series.xAxis.categories[this.point.x] + '</b> sold <br><b>' +
this.point.value + '</b> items on <br><b>' + this.series.yAxis.categories[this.point.y] + '</b>';
}
},
}
})
);
The problem you are going to hit is that Highcharts v3.0.10 is bundled into the analytics code. This was done so that a known Highcharts version is used with the added Lumenize code, i.e. they match. The analytics library is loaded dynamically if the SDK can't find a Highcharts library as it starts up (i.e. window.Highcharts is undefined).
As a result, changing the Highcharts arrangement to add heatmaps looks quite kludgy. Someone who is better at JavaScript library loading/overloading might have a different view. You would have to load the heatmap.js file into your app after the app has started (in 'launch'?) to get around the dynamic loading.
I know that this is not directly related to your Highcharts question, but if it is just a 'heatmap' you are after, not necessarily a Highcharts heatmap, I have started to use d3 to visualise stuff in Rally. There is a d3 heatmap example here if you are interested: http://bl.ocks.org/tjdecke/5558084
I have done a bit of work to get d3 working inside a Rally custom app. There are a few examples on my github, but have a look here: https://github.com/nikantonelli/Radial-Density

Mathematica: dynamic number of menus

I am trying to make a dynamic number of drop-down menus in a plot, so that I can plot a variable number of curves.
I have previously requested help to plot this data, and it worked well.
First thing
Needs["PlotLegends`"]
Here is an example of the data (not the actual numbers, as they are way too long).
data={{year, H, He, Li, C, O, Si, S},
{0, .5, .1, .01, 0.01, 0.01, 0.001, 0.001},
{100, .45, .1, .01, 0.01, 0.01, 0.001, 0.001},
{200, .40, .1, .01, 0.01, 0.01, 0.001, 0.001},
{300, .35, .1, .01, 0.01, 0.01, 0.001, 0.001}}
The compounds variable is the number of compounds+1
compounds=8
For now, my code is this one
Manipulate[
ListLogLogPlot[
{data[[All, {1, i}]],
data[[All, {1, j}]],
data[[All, {1, k}]]},
PlotLegend -> {data[[1, i]],
data[[1, j]],
data[[1, k]]}
],
{{i, 2, "Compound 1"},Thread[Range[2, compounds] -> Drop[data[[1]], 1]]},
{{j, 3, "Compound 2"},Thread[Range[2, compounds] -> Drop[data[[1]], 1]]},
{{k, 4, "Compound 3"},Thread[Range[2, compounds] -> Drop[data[[1]], 1]]},
ContinuousAction -> False
]
As you can see, I can easily add a compound by duplicating each of the 3 lines (data, legend and menu descriptor), but it's lame and inefficient. Plotting a set takes about 20 seconds, so it's about 1 minute here (and I use a pretty efficient cluster).
Is there a solution to add a little menu or field where I can add the number of compounds to plot, so the right number of menus will display? I don't need more than 7 plots, but efficiency...
The numbers 2, 4, 16 are the default values to plot. I can make a list with the default values (2, 14, 16, and some others I may pick), or they could all be set to 2.
Thanks
You could do something like this
Manipulate[
ListLogLogPlot[data[[All, {1, #}]] & /@ i],
{{n, 3, "# compounds"}, Range[7],
Dynamic[If[Length[i] != n, i = PadRight[{2, 4, 16}, n, 2]];
PopupMenu[#, Range[7]]] &},
{{i, {2, 4, 16}}, ControlType -> None},
Dynamic[Column[
Labeled[PopupMenu[Dynamic[i[[#]]],
Thread[Range[2, compounds] -> Drop[data[[1]], 1]]],
Row[{"Compound ", #}], Left] & /@ Range[n]]
]
]
Without PlotLegend, this runs quite fast for a random data set of about 1000x1000 elements. If I include the PlotLegend option in ListLogLogPlot, it slows down quite a lot so that might be the reason why your code was so slow.
I thought I'd add a DynamicModule version. If you're like me, you may find that easier than using Manipulate. It is essentially a DynamicModule version of Heike's answer.
DynamicModule[{data,compounds,n=1,c={2},labels},
data=yourData;
compounds=Length[data[[1]]];
labels=Rule@@@Transpose[{Range[7],data[[1,2;;]]}];
Column[{
Dynamic[
Grid[
Join[
{{"no. of compounds",PopupMenu[Dynamic[n],Range[7]]}},
Table[
With[{i=i},
c=PadRight[c,n,2];
{"compound"<>ToString[i], PopupMenu[Dynamic[c[[i]]],labels]}
],
{i,n}
]
],
Alignment->{{Right,Left},Center}
],
TrackedSymbols:>{n}
],
Dynamic@ListLogLogPlot[data[[All,{1,#}]]&/@c]
}]
]
I've used Grid because it allows you to easily keep all the controllers and their labels aligned. PadRight[c,n,2] allows you to keep current settings if you change the value of n. I'd avoid plot legends and always make your own.
How about something like:
Manipulate[
Manipulate[ ListLogLogPlot[Table[Subscript[x, n], {n, 1, numCompounds}]],
Evaluate@Apply[Sequence,Table[{{Subscript[x, n], n + 1, "Compound " <> ToString@n},
Thread[Range[2, compounds] -> Drop[data[[1]], 1]]}, {n, 1,
numCompounds}]], ContinuousAction -> False],
{{numCompounds, 3}, 1, compounds - 1, 1}]