Apache Giraph: processing a graph with a custom algorithm

I have a custom algorithm for processing a graph which accepts a txt file as input. Because it is a large-scale graph, I want to implement it in the Apache Giraph framework. I've done a lot of research but I am still not sure if I am on the right path.
I am reading a .csv file which contains the graph data and, using a parser, I am converting it to a txt file and uploading it to Hadoop's HDFS.
I have read the SimpleShortestPathsVertex example from the Apache quick start guide, and I can see that it processes the data from a file in HDFS using the jar-with-dependencies jar file.
My problem is that I haven't yet understood how I can add my algorithm to the Apache Giraph framework and start processing the graph. Can I add my algorithm to the framework using Eclipse and modify it from there, or is there another way?
Thank you!

Have a look here:
https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example
Were you able to run this example?
If yes:
Familiarize yourself with the different Writable formats of Hadoop! Otherwise it will be hard to apply them to your algorithm.
All computation concerning the graph is done in the compute() function.
(If you're more advanced, have a look into the WorkerContext, preSuperstep() and Aggregators!)
You can change the example, but as soon as you use other data types you have to change your VertexReader and VertexWriter.
If you have a specific algorithm in mind, work out what you need for the computation and specify the layout of your input file. Then adapt your VertexReader and -Writer, and finally start implementing your compute() function!
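For instance, if you keep the JSON-array input layout used by the quick-start shortest-paths example, a converter from your CSV edge list could look roughly like this (a sketch only; the CSV column order, the file names and the initial vertex value of 0 are assumptions):

import csv
import json
from collections import defaultdict

# assumed CSV layout: one edge per line, "source,target,weight"
edges = defaultdict(list)
with open('graph.csv', newline='') as f:
    for src, dst, weight in csv.reader(f):
        edges[int(src)].append([int(dst), float(weight)])

# quick-start layout: [vertex_id, vertex_value, [[target_id, edge_weight], ...]]
with open('graph.txt', 'w') as out:
    for vertex_id, out_edges in sorted(edges.items()):
        # vertices that only appear as targets would need their own lines as well
        out.write(json.dumps([vertex_id, 0, out_edges]) + '\n')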
Of course you can use Eclipse! Simply reference the Giraph jar (for me it is "giraph-0.1-jar-with-dependencies.jar") and start coding.
All you need is an instance of each of these files, specific to your algorithm:
YourGiraphJob (the file starting the Hadoop/Giraph job)
YourVertex (Specifies your compute() function executed on each Vertex)
YourInputFormat (Specifying the Writable formats of YourReader)
YourOutputFormat (Specifying the Writable formats of YourWriter)
YourReader (Specifies how your input file is transformed, e.g. that for each line a Vertex can be initialized using the given information)
YourWriter (Specifies how your output file is generated from the vertices)
(Optionally a WorkerContext if you want to use Aggregators.)
Simply check out http://giraph.apache.org/source-repository.html using Eclipse and you should have the code, including an example application you can toy around with!

Related

How to inject data in a .bin file in a post-compilation script?

Purpose
I want my build system to produce one binary file that includes:
The bootloader
The application binary
The application header (for the bootloader)
Here's a small overview of the memory layout (nothing out of the ordinary here)
The build system already concatenates the bootloader and the application in a post-compilation script.
In other words, only the header is missing.
Problem
What's the best way to generate and inject the application header in the memory?
Possible solutions
Create a .bin file just for the header and use cat to inject it into the final binary
Use the linker script to hard-code the header (is this possible?)
Use a script to read the final binary and patch the header into it
Other?
What is the best solution for injecting data into the binary in a post-compilation script?
SRecord is a great tool for doing all kinds of manipulation on binary and other file types used for embedded code images.
In this case, given a binary bootheader.bin to insert at offset 0x8000 in image.bin:
srec_cat bootheader.bin -binary -offset 0x8000 -o image.bin
The tool is somewhat arcane, but the documentation includes numerous examples covering various common tasks.
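If you would rather not pull in an extra tool, the "use a script" option from the question is also only a few lines of Python. A minimal sketch, assuming the header is simply written over reserved bytes at a fixed offset in the already-concatenated image (the offset and file names are placeholders):

HEADER_OFFSET = 0x8000  # placeholder: wherever the application header lives in the final image

with open('app_header.bin', 'rb') as f:
    header = f.read()

# image.bin = bootloader + application, already concatenated by the build script
with open('image.bin', 'r+b') as img:
    img.seek(HEADER_OFFSET)
    img.write(header)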

How can you specify the local path of InputPath or OutputPath in Kubeflow Pipelines?

I've started using Kubeflow Pipelines to run data processing, training and prediction for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.
I'd like to know how, if it's possible, I can set the path where OutputPath looks for a file in a component, and where InputPath loads a file from in a component.
Currently, the code stores them in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath that this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as per the minimal example below.
from kfp import dsl
from kfp.components import create_component_from_func, OutputPath

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os
    print('do some processing which saves file to data/pre_processed.csv')
    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)
Naturally I would prefer not to update the code to adhere to the Kubeflow Pipelines naming convention, as that seems like very bad practice to me.
Thanks!
Update: see ark-kun's comment; the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.
For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
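In other words, the function from your example can write straight to the path the system passes in, with no rename step. A sketch only (the CSV content here is a placeholder for your real pre-processing):

from kfp.components import create_component_from_func, OutputPath

def pre_process_data(pre_processed_path: OutputPath('csv')):
    # write the result directly to the system-provided location,
    # instead of saving to data/pre_processed.csv and renaming afterwards
    rows = ['col_a,col_b', '1,2']  # placeholder for the real pre-processing output
    with open(pre_processed_path, 'w') as f:
        f.write('\n'.join(rows))

pre_process_data_op = create_component_from_func(pre_process_data)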
For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.
This is not supported by the system.
Components should use the system-provided paths.
This is important, because on some execution engines the data is mounted to those paths. And sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.
Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).
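As a minimal illustration of the command-line pattern mentioned above (independent of any pipeline system), the processing script can receive its output location as an argument instead of hard-coding data/pre_processed.csv; the argument name here is arbitrary:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--output-path', required=True,
                    help='where to write the pre-processed CSV (chosen by the caller)')
args = parser.parse_args()

with open(args.output_path, 'w') as f:
    f.write('col_a,col_b\n1,2\n')  # placeholder for the real output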

Read Sentinel-2 L1C view angles from rasterio

I am trying to read the view angles from a Sentinel-2 image (L1C SAFE compact format) for executing an atmospheric correction algorithm. I can get those values by parsing the file MTD_TL.xml, but I am not able to get them through rasterio.
I have tried to access those data using the xml:SENTINEL2 and the xml:VRT metadata domains, but I can only access the values from the file MTD_MSIL1C.xml (the main metadata file).
The whole point of using rasterio is being able to use GDAL's virtual file system, as the images will be read from S3 buckets. Any alternatives for easily reading MTD_TL.xml through the virtual file system would also be valid (and really appreciated).
Thank you!!
Answering my own question:
I could not find how to get the values I require, but according to https://gdal.org/user/virtual_file_systems.html the function VSIFOpenL may be used to open the file. After that, manual parsing will do the trick :)
PS: I must read the documentation more carefully.
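For reference, a rough sketch of that approach using GDAL's Python bindings: open MTD_TL.xml through the virtual file system with VSIFOpenL and parse it manually. The /vsis3/ path and the element names (Mean_Viewing_Incidence_Angle and friends) are assumptions, so check them against your bucket layout and your product's metadata:

import xml.etree.ElementTree as ET
from osgeo import gdal

# hypothetical location; assumes AWS credentials are configured for /vsis3/
path = '/vsis3/my-bucket/S2A_MSIL1C_.../GRANULE/.../MTD_TL.xml'

stat = gdal.VSIStatL(path)
fp = gdal.VSIFOpenL(path, 'rb')
try:
    data = gdal.VSIFReadL(1, stat.size, fp)
finally:
    gdal.VSIFCloseL(fp)

root = ET.fromstring(data)
# mean viewing incidence angles per band (element names may vary per processing baseline)
for ang in root.iter('Mean_Viewing_Incidence_Angle'):
    print(ang.get('bandId'),
          ang.find('ZENITH_ANGLE').text,
          ang.find('AZIMUTH_ANGLE').text)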

Using Apparat dump with FDT and ant

I am totally new to Flash development; I don't even know ActionScript yet.
I have to improve an existing Flash application, so first I need to understand the code.
I want to use some tool for code analysis, something to visualize class dependencies and code structure. I googled and found out about the Apparat tool. Now I'm struggling with it because I cannot find documentation that describes how to use Apparat. I'm frustrated, but it seems to be the only such tool.
So I started with example.
I've set up apparat running on FDT following this guide:
http://www.webdevotion.be/blog/2010/06/02/how-to-get-up-and-running-with-apparat/
The example (http://blog.joa-ebert.com/2010/05/26/new-apparat-example/) builds well and creates two SWF files. (I'm using an Ant builder.)
Now I want to analyze existing swf and see a PNG with class dependencies.
How should I do that?
What do I have to add and where?
Or maybe someone can explain how to use dump from the Windows command line? Something like
dump example.swf exampleAnalysis.png
After resolving all dependencies (which was tricky), I managed to get dump running:
dump -i example.swf -uml
But it saves the UML diagram in .DOT format, which is really hard to read: Graphviz GVedit cannot zoom and exports to PNG only what you see (a messy, impossible-to-read zoomed-out graph), Smyrna doesn't work, and ZGRViewer fails to load some files.

Is there any working binary diff tool that implements GDIFF (Generic Diff Format, NOT graphical file difference)?

I've seen GDIFF (Generic Diff Format) on Wikipedia, and I wonder whether there is any command-line tool that implements this standard. Right now the best I have is LibXDiff, but it's a library, so I'll need some extra work to make it run.
I know that when it comes to binary diffing, VCDIFF (xdelta, etc.) and bsdiff have better compression rates, but in my case I really need a straightforward one. VCDIFF can copy from anything before the current window (if my poor English reading of the article was right), and bsdiff's patch file format would be more complex.
Update
Finally I found that VCDIFF with xdelta3 is actually good and works when "disable small string-matching" and "disable external decompression" are toggled. AND it has a pretty good "printdelta" command that prints information that is very useful for my app, so I don't really need to extract the VCDIFF format from the patch file.
The Javaxdelta library implements xdelta and GDIFF patches. It can be used as a command-line application like this:
# create patch
java -cp javaxdelta-2.0.1.jar:trove-1.0.2.jar com.nothome.delta.Delta source.file target.file patch.gdiff
# apply patch
java -cp javaxdelta-2.0.1.jar:trove-1.0.2.jar com.nothome.delta.GDiffPatcher unpatched.file patch.gdiff patched.file
I once wrote a wrapper around it to support directory patching (the GDIFF files for a directory are packed into one ZIP patch).