Ideas for a distributed processing project?

I am looking for a project idea in distributed processing on Unix based systems. I wish to use only the C programming language. I have to finish the project in 4 months and it's a part of my course work. Can someone help me with an idea?

Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games and you are bringing out new ones all the time. You don't want to install new clients on all your servers every time you write a new AI for a board game, so you have a program to which you can send new AIs; after that, you just send the game data and the pushed AI is used to solve the problem. This is best used for problems which can be broken into smaller chunks (a minimal master/worker sketch follows below).
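Since the question asks for C on Unix, here is a minimal sketch of the master/worker split that most of these ideas (the prime search, for example) boil down to. The fork()/pipe() plumbing and the numbers are assumptions for illustration; a real distributed version would replace them with sockets between machines.

```c
/* Toy master/worker prime counter: the master splits the range into
 * chunks, each worker counts primes in its chunk, and the master sums
 * the results read back over pipes. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define WORKERS 4

static int count_primes(long lo, long hi)   /* naive trial division */
{
    int count = 0;
    for (long n = lo; n < hi; n++) {
        if (n < 2)
            continue;
        int prime = 1;
        for (long d = 2; d * d <= n; d++)
            if (n % d == 0) { prime = 0; break; }
        count += prime;
    }
    return count;
}

int main(void)
{
    long limit = 1000000;
    long chunk = limit / WORKERS;
    int pipes[WORKERS][2];

    for (int w = 0; w < WORKERS; w++) {
        pipe(pipes[w]);
        if (fork() == 0) {                  /* worker process */
            long lo = w * chunk;
            long hi = (w == WORKERS - 1) ? limit : lo + chunk;
            int found = count_primes(lo, hi);
            write(pipes[w][1], &found, sizeof found);
            _exit(0);
        }
    }

    int total = 0;
    for (int w = 0; w < WORKERS; w++) {     /* master gathers results */
        int found = 0;
        read(pipes[w][0], &found, sizeof found);
        total += found;
        wait(NULL);
    }
    printf("primes below %ld: %d\n", limit, total);
    return 0;
}
```

The same skeleton carries over to the other list items: swap count_primes for a ray-tracing bucket, a game-tree search, or a crawl of a URL partition.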

It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out MapReduce for ideas! I was really inspired by that work, personally.
We use distributed processing here at work, but it's such a broad field...

Yeah.
Why not write a distributed compiler? You could then present an interface for people to compile things on the fly, and the jobs would be passed to your distributed compile network. Java is probably well suited, and you'll get to do fun things, like being very mindful of security and so on.
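To make that concrete in C (the language the question is limited to), here is a hypothetical sketch of the dispatch side: compile jobs are farmed out to local worker processes, and the distributed part of the project would consist of shipping preprocessed source to remote builders over sockets instead. The file names and the cc command are placeholders.

```c
/* Toy compile-job dispatcher: forks one worker per source file and lets
 * each worker exec the system compiler. This is only local parallelism;
 * a distributed version would send the source to remote builders. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    const char *sources[] = { "a.c", "b.c", "c.c" };   /* placeholder files */
    int njobs = 3;

    for (int i = 0; i < njobs; i++) {
        if (fork() == 0) {
            /* Worker: compile one translation unit. */
            execlp("cc", "cc", "-c", sources[i], (char *)NULL);
            perror("execlp");                          /* reached only on failure */
            _exit(1);
        }
    }
    for (int i = 0; i < njobs; i++)
        wait(NULL);                                    /* collect the workers */
    printf("all compile jobs finished\n");
    return 0;
}
```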

The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/

If you want to leave your mark and change the way we search the web, look into B-Trees.
B-Trees and their variants are the workhorses of the internet. Google uses them extensively to index the web, database indexes are B-Tree variants, every LAMP system uses a database with indexes, and they are used extensively in distributed VLDBs (Very Large DataBases). Perhaps you can improve on existing distributed databases such as Cassandra and HBase. These are lofty goals, but for me this would leave a lasting mark on the way Web data is processed, indexed, and stored.
Write a distributed, fault-tolerant, redundant network B+Tree or B*Tree (a minimal starting point is sketched after the links below). Read Drozdek's book Data Structures and Algorithms in C++; it's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree
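As a concrete starting point in the OP's language, here is a minimal single-machine sketch of a B-Tree node and its lookup routine; the node layout and ORDER are assumptions for illustration. Partitioning such nodes across machines, replicating them, and routing lookups to the right replica is exactly where the distributed, fault-tolerant part of the project begins.

```c
/* Skeleton of an in-memory B-Tree node and lookup, as a starting point.
 * ORDER is the maximum number of children; keys are plain longs here. */
#include <stdio.h>

#define ORDER 64                        /* max children per node */

struct btree_node {
    int nkeys;                          /* keys currently stored         */
    int leaf;                           /* 1 if the node has no children */
    long keys[ORDER - 1];
    struct btree_node *children[ORDER];
};

/* Returns 1 if key is present in the subtree rooted at node. */
static int btree_search(const struct btree_node *node, long key)
{
    if (!node)
        return 0;
    int i = 0;
    while (i < node->nkeys && key > node->keys[i])
        i++;                            /* linear scan; use binary search for large ORDER */
    if (i < node->nkeys && node->keys[i] == key)
        return 1;
    if (node->leaf)
        return 0;
    return btree_search(node->children[i], key);
}

int main(void)
{
    /* A tiny hand-built leaf node, just to exercise the lookup. */
    struct btree_node leaf = { .nkeys = 3, .leaf = 1, .keys = { 10, 20, 30 } };
    printf("20 %s\n", btree_search(&leaf, 20) ? "found" : "missing");
    printf("25 %s\n", btree_search(&leaf, 25) ? "found" : "missing");
    return 0;
}
```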


Curious on how to use some basic machine learning in a web application

A co-worker and I had an idea to create a little web game where a user enters a chunk of data about themselves and then the application writes text that sounds like them in certain structures. (Trying to leave the idea a little vague.) We are both new to ML and thought this could be a fun first dive.
We have a decent bit of background with PHP, JavaScript (front end and Node), Ruby, and a little bit of other languages, and have been interested in learning Python for ML. Curious whether you can run a cost-efficient ML library for text well within a web app, given that most servers lack GPUs?
Perhaps you have to pay for one of the cloud-based systems, but I wanted to find the best entry point for this idea without racking up too much cost. (So far I have been reading about running PyTorch or TensorFlow, but it sounds like you lose a lot of efficiency running on CPUs.)
Thank you!
(My other thought is doing it via an iOS app and trying Apple's ML setup.)
It sounds like you are looking for something like TensorFlow.js.
Yes. Before jumping into training something with deep learning (which might even be unnecessary for your purpose), try to build a nice and simple baseline for this.
Before deep learning (just a few years ago), people did similar tasks using n-gram-based language models. https://web.stanford.edu/~jurafsky/slp3/3.pdf
Essentially, you try to predict the next few words probabilistically given a small context of n words; typically n is small, like 5 or 6.
This should be a lot of fun to work out and will do quite well with a small amount of data. Such a model also runs blazingly fast, so you don't have to worry about GPUs and compute.
To improve on these results with deep learning, you'll need to collect a ton of data first, and it will take work to make it fast on a web-based platform.
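For concreteness, here is a toy sketch of the bigram-counting idea (written in C only because the rest of this page leans that way; in practice you would likely do this in Python or Node). The corpus string, table size, and word lengths are placeholders.

```c
/* Toy bigram "language model": count word pairs in a sample text, then
 * predict the most frequent follower of a given word. A real system
 * would train on far more data and smooth the counts. */
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 256
#define MAX_WORD  32

struct pair { char prev[MAX_WORD], next[MAX_WORD]; int count; };
static struct pair table[MAX_PAIRS];
static int npairs;

static void add_pair(const char *prev, const char *next)
{
    for (int i = 0; i < npairs; i++)
        if (!strcmp(table[i].prev, prev) && !strcmp(table[i].next, next)) {
            table[i].count++;
            return;
        }
    if (npairs < MAX_PAIRS) {
        snprintf(table[npairs].prev, MAX_WORD, "%s", prev);
        snprintf(table[npairs].next, MAX_WORD, "%s", next);
        table[npairs].count = 1;
        npairs++;
    }
}

static const char *predict(const char *prev)
{
    const char *best = NULL;
    int best_count = 0;
    for (int i = 0; i < npairs; i++)
        if (!strcmp(table[i].prev, prev) && table[i].count > best_count) {
            best = table[i].next;
            best_count = table[i].count;
        }
    return best;
}

int main(void)
{
    /* In the web game this corpus would be the user's own writing sample. */
    char corpus[] = "i like coffee and i like code and i drink coffee";
    char *prev = NULL, *word = strtok(corpus, " ");
    while (word) {
        if (prev)
            add_pair(prev, word);
        prev = word;
        word = strtok(NULL, " ");
    }
    const char *next = predict("i");
    printf("after \"i\" the model predicts \"%s\"\n", next ? next : "(nothing)");
    return 0;
}
```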

Optimization algorithms for optimizing an existing system's connections

I am currently working on an existing infrastructure where I have about 1000 customer sites connected to about 5 different hubs. A customer site may connect to one or two hubs to ensure reliability, but each customer site is connected to at least one hub. I want to check whether the current system is already the best or whether it can be optimized to give better connections from customer sites to hubs, to help improve connectivity and reliability. Can you suggest good optimization algorithms to look into? Thank you.
Sounds like you're doing some variation of the Facility Location Problem.
This is a well-known problem, and while there are exact algorithms that can find the global optimum (e.g., dynamic programming or integer-programming formulations), they do not scale well (i.e. you run into the curse of dimensionality). You could try this, but 1000 already sounds pretty big (it depends on your exact problem formulation, though).
I'd recommend taking a look at the Coursera MOOC Discrete Optimization. You don't have to take the whole course, but in the "Assignments" section of the video lectures he also explains a variant of the facility location problem and some possible approaches to think about; once you've decided which one you want to use, you can look deeper into that particular approach.
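As an illustration of the kind of heuristic that tends to work at this scale, here is a toy greedy-plus-local-search sketch in C; the cost matrix, per-hub capacity, and counts are placeholder assumptions, not a definitive formulation of your network.

```c
/* Toy assignment of ~1000 customer sites to 5 hubs with a per-hub
 * capacity limit: greedy start, then a simple improving local search.
 * cost[s][h] stands in for whatever metric matters (latency, link
 * cost, reliability penalty, ...). */
#include <stdio.h>
#include <stdlib.h>

#define SITES 1000
#define HUBS  5
#define CAP   250                 /* hypothetical per-hub capacity */

static double cost[SITES][HUBS];
static int assign[SITES];
static int load[HUBS];

static double total_cost(void)
{
    double t = 0.0;
    for (int s = 0; s < SITES; s++)
        t += cost[s][assign[s]];
    return t;
}

int main(void)
{
    srand(42);
    for (int s = 0; s < SITES; s++)
        for (int h = 0; h < HUBS; h++)
            cost[s][h] = (double)rand() / RAND_MAX;   /* placeholder data */

    /* Greedy start: each site takes its cheapest hub that still has room. */
    for (int s = 0; s < SITES; s++) {
        int best = -1;
        for (int h = 0; h < HUBS; h++)
            if (load[h] < CAP && (best < 0 || cost[s][h] < cost[s][best]))
                best = h;
        assign[s] = best;
        load[best]++;
    }
    printf("greedy cost: %.2f\n", total_cost());

    /* Local search: move a site to a cheaper hub that has spare capacity.
     * Real constraints (dual-homing, reliability targets) would be checked
     * before accepting a move. */
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int s = 0; s < SITES; s++)
            for (int h = 0; h < HUBS; h++)
                if (h != assign[s] && load[h] < CAP &&
                    cost[s][h] < cost[s][assign[s]]) {
                    load[assign[s]]--;
                    load[h]++;
                    assign[s] = h;
                    improved = 1;
                }
    }
    printf("after local search: %.2f\n", total_cost());
    return 0;
}
```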

Is there any ETL tool for any Smalltalk dialect?

...like Talend for Java, for instance, but one that allows processes to be implemented programmatically.
Multiple data sources, orchestration, calculated fields, and pivot tables are some of the features I would like to have.
We've built on top of Moose for an ERP data conversion project. It works well with smaller amounts of data (data that fits in a 32-bit image). For ETL with multiple sources, just use an image for each input stream/step and connect them together through files or sockets. The visualization was important for us; it allowed the domain experts to steer the process. A short feedback loop was essential.
Nearly 5 years later, it is time to revisit this answer. Pharo and Moose now support 64 bits. The garbage collector is not yet up to handling very large heaps; an incremental collector to avoid large pauses is now in active development. If the work is partitionable, use a solution like ImageWorker to use multiple cores with all data in one image, or TelePharo to remote-control multiple images. Perhaps use MQTT to integrate. For visualization there are Roassal2 and 3, or the whole GToolkit.

Does anybody actually use the PSP (Personal Software Process)?

I've been reading a bit about this recently but it looks to be a bit heavy. Does anybody have real world experience using it?
Are there any light weight alternatives?
The Personal Software Process is a personal improvement process. The full-blown PSP is quite heavy and there are several forms, templates, and documents associated with it. However, a key point is that you are supposed to tailor the PSP to your specific needs.
Typically, when you are learning about the PSP (especially if you are learning it in a course), you will use the full PSP with all of its forms. However, as Watts S. Humphrey says in "PSP: A Self-Improvement Process for Software Engineers", it's important to "use a process that both works for you and produces the desired results". Even for an individual, multiple projects will probably require variations on the process in order to achieve the results you want.
In the book I mentioned above, "PSP: A Self Improvement Process for Software Engineers", the steps that you should follow when defining your own process are:
Determine needs and priorities
Define objectives, goals, and quality criteria
Characterize the current process
Characterize the target process
Establish a strategy to develop the process
Validate the process
Enhance the process
If you are familiar with several process models, it should be fairly easy to take pieces from all of them and create a process or workflow that works on your particular project. If you want more advice, I would suggest picking up the book. There's an entire chapter dedicated to extending and modifying the PSP as well as creating your own process.
The Personal Software Process itself is a subset of the Capability Maturity Model (CMM) processes. There are no lightweight alternatives available as of now.

Commercial uses for grid computing?

I keep hearing from associates about grid computing which, from what I can gather, is highly distributed stuff along the lines of SETI@home.
Is anyone working on these sort of systems for business use? My interest is in figuring out if there's a commercial reason for starting software development in this field.
Rendering Farms such as Pixar
Model Evaluation e.g. weather, financials, military
Architectural Engineering e.g. earthquakes.
To list a few.
Grid computing is really only needed if you have a lot of WORK that needs to be done, like folding proteins; otherwise a simple server farm will likely be plenty.
Obviously Google is a major user of grid computing; its entire search service relies on it, among many others.
Engines such as BigTable are based on using lots of nodes for storage and computation. These are commercially very useful because they're a good alternative to a small number of big servers, providing better redundancy and cost-effective scaling.
The downside is that the software is fiendishly difficult to write, but Google seems to manage that one ok :)
So anything which requires big storage and/or lots of computation.
I used to work for these guys. Grid computing is used all over. Anyone who makes computer chips uses them to test designs before getting physical silicon cut. Financial websites use grids to calculate if you qualify for that loan. These days they are starting to replace big iron in a lot of places, as they tend to be cheaper to maintain over the long term.