U-SQL vertex graph does not show ROW_COUNT per vertex - azure-data-lake

U-SQL vertex graph does not show ROW_COUNT per vertex at least since Monday 17-Apr. See the pic below.

It looks like this is a regression in the EU North region (and may also happen in the other regions once the refresh rollout has completed). A file that is needed by the tool to get the row count is not being provided anymore. We are still investigating the root cause and will fix it as soon as possible for future jobs.

Related

Detect breakdown voltage in an AC waveform

I need to monitor an AC Voltage waveform and record the RMS value when the breakdown happens. I roughly know how to acquire data from videos I have watched, however, it is difficult for me to produce a solution that reads the Breakdown Voltage Value. Ideally, I would also take a screenshot along with the breakdown voltage value,
In case you are not familiar with this topic, When a breakdown happens the voltage will drop immediately to zero. So what I need is to measure the voltage just before it falls to zero, and if possible take a screenshot. This is an image of a normal waveform (black) with a breakdown one (red).
Naive solution*:
Take the data and get the Y values (this would depend on the datatype you have, which would depend on how you acquire the data).
Find the breakdown point by iterating over the values and maintaining a couple of flags (I would probably say track "got higher than X" and once that's true, track "got lower than Y").
From that, I would just say take the last N points (Get Array Subset) and get the array max. Or just track the maximum value as you run.
Assuming you have the graph in a control, you can just right click and select Create>>Invoke Node>>Export Image.
I would suggest trying playing with that with a VI with static data which you can repeatedly run to check how your code behaves.
*I don't know the problem domain and an not overly familiar with the various analysis VIs that ship with LV, so there are quite possibly more efficient ways of doing this.

How is duration being calculated in a Spark Structured Streaming UI?

We have a Spark SQL job that we would like to optimize. We are trying to
figure out which part of our pipeline is slower/faster.
In the attached SQL query graph, there are 3 WholeStageCodegen boxes,
all with the same duration: 2.9s, 2.9s, 2.9s. See the below picture:
But if we check the Stage graph, it shows 3 seconds for the total stage. See the below picture:
So the durations in the WholeStageCodegen boxes do not add up, it seems
that these durations refer to the sum of the whole stage. Do we miss
something here? Is there a way to figure out the duration for the
individual boxes?
Sometimes there is some difference in the duration, but not more than
0.1s, examples:
18.3s, 18.3s, 18.4s
968ms, 967ms, 1.0s
The Stage duration is always as much as one of the WholeStageCodegen's
duration, or at most 0.1-0.3sec larger.
How can one figure out the duration for each of the WholeStageCodegen parts, and is that actually measured? I suspect that Spark would have to trace individual operations as units of generated functions. Is that measurement actually performed there, or are these numbers more like a placeholder for a feature that does not exist?

What is requirement coverage in testing?

I was looking at graphwalker which is a model-based testing tool. It creates a model like an oriented graph and uses a generator and a stop condition to walk on that graph, something like:
random(edge_coverage(100)) //covers the graph random till all edges are selected (100%)
random(vertex_coverage(100)) //covers the graph random till all vertices are selected (100%)
There is another stop condition called requirement_coverage: usage random(requirement_coverage(100)).
From the description on the website it says:
requirement_coverage( an integer representing percentage of desired requirement coverage )
The stop criteria is a percentage number. When, during execution, the percentage of traversed requirements is reached, the test is stopped. If requirement is traversed more than one time, it still counts as 1, when calculating the percentage coverage.
What exactly are those traversed requirements?
This might be a little belated answer, but what I have found is:
https://github.com/GraphWalker/graphwalker-project/wiki/Requirements
Basically you can use REQTAG keyword on your vertices, mapping to some external requirement document reference (i.e. REQTAG: requirement1), and GraphWalker collects these requirements and applies the stop condition based on random(requirement_coverage(x)).
So in below example vertices are marked with the requirements tag, and using random(requirement_coverage(50)) would result in stopping after visiting two vertices, etc...

How to group nearby latitude and longitude locations stored in SQL

Im trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is the example of the data from another website. http://www.cycleinjury.co.uk/map
I am currently using SQLite to ~100k store lat / lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident have taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure finding nearby points is quick. Should I use SQLite's implementation of an R-Tree for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How to take cycle population density into account? There is a far greater population density of cyclist in london then say Bristol, therefore there appears to be a greater number of backstops in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program ran over this area might not find any blackspots if relative density was not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires them to be at least minPts objects.
As for the differences in densities across, that is what OPTICS (Wikipedia) is supposed do solve. You may need to use a different way of extracting clusters though.
Well, ok, maybe not 100% - you maybe want to have single hotspots, not areas that are "density connected". When thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably use the OPTICS plot an scan for local minima of "at least 10 accidents".
Update: Thanks for the pointer to the data set. It's really interesting. So I did not filter it down to cyclists, but right now I'm using all 1.2 million records with coordinates. I've fed them into ELKI for analysis, because it's really fast, and it actually can use the geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I've enabled the R*-tree index with STR bulk loading, because that is supposed to help to get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). Runtime was around 11 Minutes, not too bad. The OPTICS plot of course would be 1.2 million pixels wide, so it's not really good for full visualization anymore. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manour House, Picadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If it appears in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm where you look for a fixed number of clusters. k-means requires a few complete scans of the data, and voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate on slightly smaller data sets, is hierarchical clustering -- you put the two closest things together, and then build the clusters. An index might be helpful here.
I recommend though that you peruse a site such as kdnuggets in order to see what software -- free and otherwise -- is available.

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm made aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in Hbase sorted by insertion date, if you record somewhere timestamps of successful summary runs, and open the filter on inputs that are dated later than last successful summary -- you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability