I'm planning to develop a small J2ME utility for viewing local public transport schedules using a mobile phone. The data part for those is mostly a big bunch of numbers representing the times when the buses arrive or leave.
What I'm trying to figure out is what is the best way to store that data. The representation needs to
be considerably small (because of persistent storage limitations of a mobile phone)
fit into a single file (for the ease of updating the schedule database afterwards over HTTP) fit into a constant number of files, i.e. (routes.dat, times.dat, ..., agencies.dat), and not (schedule_111.dat, schedule_112.dat, ...)
have a random access ability (unserializing the whole data object into memory would be just too much for a mobile phone :))
if there's some library for accessing that data format, a Java implementation should be present
In other words, if you had to squeeze a big part of GTFS-like data into a mobile device, how would you do that?
Google Protocol Buffers seemed like a good candidate for defining data but it didn't have random access.
What would you suggest?
Persistent storage on J2ME is a tricky business; see this related question for more general background: Best practice for storing large amounts of data with J2ME
In my experience, J2ME persistent storage tends to work best/most reliably with many small records rather than a few monolithic ones. Think about how the program is going to want to access the data, then try to store it in those increments in the J2ME persistent store.
I'd generally recommend decoupling your client-server protocol for downloading updates from the on-device storage format. You can change the latter with every code update, but you're pretty much stuck supporting a client-server protocol forever, unless you want to break older clients out in the field.
Finally, I know there are some people on the Transit Developers group who have built offline transit apps in J2ME, so it's worth asking for tips there.
I made app like this and I used xml-s generated with php. This enabled us to have a single provider for 3 presentation layers which were:
j2me app
website for mobile phones
usual website
We used xslt to convert xml to html on websites and kXML - very light pull parser to do it on j2me app. This worked well even on very old phones with b/w screens and small amounts of memory.
Besides on j2me there is no concept of file. You have the db in which you can store information.
This is a link to "mobile" website.
http://mobi.krakow.pl/rozklady/
and here to the app:
http://www.mobi.krakow.pl/rozklady/j2me/rjk.jar
This is in polish, but I think it's not hard to figure out what's this and that.
If you want, I can provide you with more help and advice or if this is a commercial product then I think we can figure out something too ;)
I think your issue is requirement 2.
Updating 10MB of data just because 4 digits changed somewhere in the middle of the file seems highly inefficient.
Spliting the data into several files allows for a better update granularity that will be well worth the added code complexity.
Real-time public transport schedules are usually modified one bus/train/tram line at a time.
Related
I am currently working on a website where, roughly 40 million documents and images should be served to it's users. I need suggestions on which method is the most suitable for storing content with subject to these requirements.
System should be highly available, scale-able and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found out about the following solutions;
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for last two years, the work is still in progress. However, requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes fall in range from several bytes to several megabytes, and, the most important one, the deduplication, which should be implemented both on the document and block levels. If two different users upload the same file to the storage, the only copy of the file should be kept. Also if two different files partially intersect with each other, it's necessary to store the only copy of the common part of these files.
But let's focus on your requirements, so deduplication is not the case. First of all, high availability implies replication. You'll have to store your file in several replicas (typically 2 or 3, but there are techniques to decrease data parity) on independent machines in order to stay alive in case if one of the storage servers in your backend dies. Also, taking into account the estimation of the data amount, it's clear that all your data just won't fit into a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take into account concurrency control to avoid race conditions when two different clients are trying to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you're are actually looking for distributed database designed to store BLOBs.
On of the biggest problems in distributed systems is difficulties with global state of the system. In brief, there are two approaches:
Choose leader that will communicate with other peers and maintain global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that in this case leader becomes the single point of failure. If leader dies, either some observer must assign leader role to one of the replicas (common case for master-slave replication in RDBMS world), or remaining peers need to elect new one (algorithms like Paxos and Raft are designed to target this issue). Anyway, almost whole incoming system traffic goes through the leader. This leads to the "hot spots" in backend: the situation when CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check etcd and consul limitations if you are interested).
Avoid global state at all. Weaken the guarantees to eventual consistency. Disable the update of files. If someone wants to edit the file, you need to save it as new file. Use the system which is organized as a peer-to-peer network. There is no peer in the cluster that keeps the full track of the system, so there is no single point of failure. This results in high write throughput and nice horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in traditional RDBMS because they provide optimizations for structured data and strong consistency, and you don't need neither of this. Also you'll have difficulties with backups and scaling. People usually don't use RDBMS in this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on the top of MongoDB. Again, this is document-oriented database designed to store JSONs, not BLOBs. Also MongoDB had problems with a cluster for many years. MongoDB passed Jepsen tests only in 2017. This may mean that MongoDB cluster is not mature yet. Make performance and stress tests, if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I used neither of these solutions, but HDFS looks like overkill, because you get dependent on Hadoop stack. Have no idea about GlusterFS performance. Always consider the design of distributed file systems. If they have some kind of dedicated "metadata" serves, treat it as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well-known outside of the russian part of the Internet, but it's mature and stable, and performance is perfect. It was developed at Yandex (russian search engine) and a lot of Yandex services (like Disk, Mail, Music, Picture hosting and so on) are built on the top of it. I used it in previous project, this may take some time for your ops to get into it, but it's worth it, if you're OK with GPL license.
Ceph. This is real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it. So get ready to a vendor lock. Also I heard that it have too complicated settings. Never used in production, so don't know about performance.
Minio. This is S3-compatible object storage, under active development at the moment. Never used it in production, but it seems to be well-designed.
You may also check wiki page with the full list of available solutions.
And the last point: I strongly recommend not to use OpenStack Swift (there are lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .
Does needing just a single word voice recognition reduce the complexity of the task enough to be able to fully perform voice recognition processing offline, on an iOS or Android smartphone? (E.g., could a reasonably accurate counter for the number of times that a single, pre-programmed word was spoken while the microphone is active be developed to work offline on a standard iOS or Android smartphone?).
I've found plenty of tools and examples capturing voice and sending it to an online service (e.g., the Google cloud voice-to-text), but does the single-word focus reduce the complexity enough for the recognition to be doable offline today? If so, do you have any libraries to suggest or where would you start?
Cloud services are good for various reasons relating to your question:
It makes deployment of new versions of the algorithm (which happen much more frequently than most people realize) a lot easier
It allows the developer to collect your data and use it in future algorithm development (or whatever they please)
From a practical standpoint, most deployed models (at least the effective ones) can be quite large and take up quite a bit of space on a mobile device.
In addition to the above, I don't think that the singular word focus changes much, if anything. The model has to not just account for words, but also for the different ways those words can be said (volume, tone, accents, inflection, etc, etc).
So what you are asking can be done but there's also good reasons why it's on the cloud.
...like Talend for Java, for instance, but that allows to implement processes programatically.
Multiple data sources, orchestration, calculated fields, pivot tables are some of the features I would like to have.
We've build on top of Moose for a ERP data conversion project. Works well with smaller amounts of data (that fit in a 32-bit image). In ETL with multiple sources, just use an image for each input stream/step, connect them together through files or sockets. The visualization was important for us. It allowed the domain experts to steer the process. Short feedback loop was essential.
Nearly 5 years later it is time to revisit this answer. Pharo and Moose support 64 bits. The garbage collector is not yet up to handling very large heaps, the incremental collector to avoid large pauses is in active development now. If the work is partitionable, use a solution like ImageWorker to use multiple cores with all data in one image, or TelePharo to remote control multiple images. Perhaps use MQTT to integrate. For visualization there are Roassal2 and 3 or the whole GToolkit
I am looking for a project idea in distributed processing on Unix based systems. I wish to use only the C programming language. I have to finish the project in 4 months and it's a part of my course work. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you were coming out with all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program which you can send new AIs to and then after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems which can be broken into smaller chunks.
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out mapreduce for ideas! I was really motivated by this work, personally.
We used distributed processing here at work, but it's such a broad field..
Yeah.
Why not write a distributed compiler. You may then present an interface for people to compile things on the fly, and it will be passed to your distribute compilenet. Java is probably well-suited, and you'll get to do fun things, like be very mindful of security and so on.
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web,
look into B-Trees.
B-Trees and offspring/variants are the working horse of the internet.
Google uses them extensively to index the web.
Database indexes/indices are B-Tree offspring/variants.
Every LAMP system uses a database and indexes/indices.
Also, they are used extensively in distributed VLDB (Very Large DataBases)
Perhaps you can improve existing distributed databases (Cassandra and HBase)
These are lofty goals, but for me, this would leave a lasting mark
in the way Web data is processed, indexed and stored.
Write a distributed, fault tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++.
It's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree
I am charged with designing a web application that displays very large geographical data. And one of the requirements is that it should be optimized so the PC still on dial-ups common in the suburbs of my country could use it as well.
Now I am permitted to use Flash and/or Silverlight if that will help with the limited development time and user experience.
The heavy part of the geographical data are chunked into tiles and loaded like map tiles in Google Maps but that means I need a lot of HTTP requests.
Should I go with just javascript + HTML? Would I end up with a faster application regarding Flash/Silverlight? Since I can do some complex algorithm on those 2 tech (like DeepZoom). Deploying desktop app though, is out of the question since we don't have that much maintenance funds.
It just needs to be fast... really fast..
p.s. faster is in the sense of "download faster"
I would suggest you look into Silverlight and DeepZoom
Is something like Gears acceptable? This will let you store data locally to limit re-requests.
I would also stay away from flash and Silverlight and go straight to javascript/AJAX. jQuery is a ton-O-fun.
I don't think you'll find Flash or Silverlight is going to help too much for this application. Either way you're going to be utilizing tiled images and the images are going to be the same size in both scenarios. Using Flash or Silverlight may allow you to add some neat animations to the application but anything you gain here will be additional overhead for your clients on dialup connections. I'd stick with plain Javascript/HTML.
You may also want to look at asynchronously downloading your tiles via one of the Ajax libraries available. Let's say your user can view 9 tiles at a time and scroll/zoom. Download those 9 tiles they can see plus whatever is needed to handle the zoom for those tiles on the first load; then you'll need to play around with caching strategies for prefetching other information asynchronously.
At one place I worked a rules engine was taking a bit too long to return a result so they opted to present the user with a "confirm this" screen. The few seconds it took the users to review and click next was more than enough time to return the results. It made the app look lightening fast to the user when in reality it took a bit longer. You have to remember, user perception of performance is just as important in some cases as the actual performance.
I believe Microsoft's Seadragon is your answer. However, I am not sure if that is available to developers.
It looks like some of it has found its way into Silverlight