Dataset for testing KeyDB - redis

I'm new to the world of in-memory databases, and I have only known about KeyDB for a couple of days. I wonder whether the dataset that KeyDB/Redis was tested on is publicly available. I would like to do some testing myself. Thanks in advance.

You can use benchmarking software to generate different loads for KeyDB or Redis. KeyDB has a built-in benchmarking tool, 'keydb-benchmark', and Redis has 'redis-benchmark'. However, the best benchmarking tools are Memtier by RedisLabs, which is great for testing throughput, and YCSB by Yahoo, which is great for testing latency under different loads. YCSB has several predefined workloads that you can run against many different databases for comparison. Each tool has pretty good documentation on using it.
You can take a look here at some Redis benchmark examples. Also see this section of the KeyDB blog, which outlines considerations for avoiding bottlenecks while benchmarking KeyDB with Memtier. Both documents include a number of example commands.
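If you would rather load a dataset of your own than rely on a benchmark tool's generated traffic, a minimal sketch along these lines may help. It writes random key/value pairs as raw Redis protocol (RESP) commands, which `redis-cli --pipe` can ingest in bulk. The key pattern `key:<n>`, the value size, and the function names are assumptions made for illustration, not anything KeyDB or Redis prescribes.

```python
import random
import string


def resp_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        # Each argument is sent as: $<byte length>\r\n<bytes>\r\n
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def synthetic_dataset(n_keys: int, value_size: int, seed: int = 0) -> bytes:
    """Build SET commands for n_keys random values of value_size letters."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    commands = []
    for i in range(n_keys):
        value = "".join(rng.choices(string.ascii_letters, k=value_size))
        # "key:<i>" is a hypothetical naming scheme chosen for this sketch
        commands.append(resp_command("SET", f"key:{i}", value))
    return b"".join(commands)
```

To use it, write the bytes returned by `synthetic_dataset(100_000, 64)` to a file opened in binary mode and pipe that file into `redis-cli --pipe`. I would expect KeyDB's own CLI to accept the same input, since KeyDB speaks the Redis protocol, but I have not verified that.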


Performance benchmark between boost graph and tigerGraph, Amazon Neptune, etc

This might be a controversial topic, but I am concerned about the performance of the Boost Graph Library compared with commercial software such as TigerGraph, since we need to choose one.
I am inclined to choose Boost, but I am not sure whether Boost is good enough performance-wise.
Setting aside anything around persistence and management, my concern is the core performance of Boost Graph's algorithms.
If it is good enough, we can build our application logic on top of it without worry.
Also, I found the results below from the LDBC Social Network Benchmark.
LDBC benchmark
It seems that TuGraph is the fastest...
Is LDBC's benchmark authoritative in the realm of graph analysis software?
Thank you
I would say that any benchmark request is a controversial topic as they tend to represent a singular workload, which may or may not be representative of your workload. Additionally, performance is only one of the aspects you should look at as each option is built to target different workloads and offers different features:
Boost is a library, not a database, so anything around persistence and management would fall on the application to manage.
TigerGraph is an analytics platform that is focused on running real-time graph analytics, such as deep link analysis.
Amazon Neptune is a fully managed service focused on highly concurrent transactional graph workloads.
All three have strong capabilities and will perform well when used in the manner intended. I'd suggest you figure out which option best matches the type of workload you are looking to run, the type of support you need, and the amount of operational work you are willing to onboard to make the choice more straightforward.

How to run a single MediaWiki on many servers?

I couldn't find a tutorial on this; all I found was information on how to run multiple wikis from one server. The issues are things like high-speed shared storage for images between servers and good performance with some sort of centralized caching.
Does anybody know of any guides?
It's really hard to give advice for such vague requirements, but I'll do my best.
Image storage: what are the storage and load requirements? For a Wikipedia-sized cluster the only solution known to work is OpenStack Swift; however, it's a huge PITA to use, so NFS is probably the sane choice.
Shared caching: memcached.

Ideas for a distributed processing project?

I am looking for a project idea in distributed processing on Unix-based systems. I wish to use only the C programming language. I have to finish the project in 4 months and it's part of my coursework. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you were coming out with all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program which you can send new AIs to and then after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems which can be broken into smaller chunks.
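To make the point about breaking a problem into smaller chunks concrete, here is a minimal sketch using the prime search idea from the list. It is written in Python for brevity, although the question asks for C, and the same structure carries over. All function names are hypothetical. The `worker` argument stands in for the "pushed" problem definition and the chunks stand in for the problem data that follows it.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[int, int]  # half-open range [lo, hi)


def is_prime(n: int) -> bool:
    """Trial division; adequate for a small demonstration."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def split_range(lo: int, hi: int, n_chunks: int) -> List[Chunk]:
    """Split [lo, hi) into at most n_chunks contiguous half-open chunks."""
    size = max(1, -(-(hi - lo) // n_chunks))  # ceiling division
    return [(start, min(start + size, hi)) for start in range(lo, hi, size)]


def count_primes(chunk: Chunk) -> int:
    """The 'problem definition': count the primes inside one chunk."""
    lo, hi = chunk
    return sum(1 for n in range(lo, hi) if is_prime(n))


def run_chunks(worker: Callable[[Chunk], int], chunks: List[Chunk],
               mapper=map) -> int:
    """Apply worker to every chunk via mapper and combine the partial results."""
    return sum(mapper(worker, chunks))
```

With the default `mapper=map` everything runs locally, for example `run_chunks(count_primes, split_range(0, 100, 4))`. Passing the `map` method of a `concurrent.futures` executor, or a function that ships chunks to remote machines, changes where the work happens without touching the worker.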
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out mapreduce for ideas! I was really motivated by this work, personally.
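For anyone who has not seen the mapreduce model, a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting might look like the following. The function names are made up for illustration. In a real distributed version the map and reduce calls would run on different machines and the shuffle would move data over the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The appeal for a course project is that the map and reduce functions stay this small, and the interesting engineering goes into distributing, scheduling and recovering the work around them.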
We used distributed processing here at work, but it's such a broad field..
Yeah.
Why not write a distributed compiler? You could then present an interface for people to compile things on the fly, with the work passed to your distributed compile network. Java is probably well-suited, and you'll get to do fun things, like being very mindful of security and so on.
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web,
look into B-Trees.
B-Trees and offspring/variants are the workhorses of the internet.
Google uses them extensively to index the web.
Database indexes/indices are B-Tree offspring/variants.
Every LAMP system uses a database and indexes/indices.
Also, they are used extensively in distributed VLDB (Very Large DataBases)
Perhaps you can improve existing distributed databases (Cassandra and HBase)
These are lofty goals, but for me, this would leave a lasting mark
in the way Web data is processed, indexed and stored.
Write a distributed, fault tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++.
It's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree

Katta in production environment

According to the website Katta is a scalable, failure tolerant, distributed, indexed, data storage.
I would like to know if it is ready to be deployed into a production environment. Is anyone already using it, and do you have any advice? Any pitfalls? Recommendations? Testimonials? Please share.
Any answer would be greatly appreciated.
We have tried using Katta and, for what it's worth, found it very stable and relatively easy to manage (compared to managing plain vanilla Lucene).
The only pitfall I can think of is the lack of real-time updates. When we tested it (about 9-10 months back), an update meant rebuilding the index in a separate process (a Hadoop job or what have you...) and replacing the live index. This was a deal-breaker for us.
If you are looking into distributed Lucene, you should really try out ElasticSearch or Solandra.

Web Application Infrastructure

I have custom coded several enterprise applications for mid to large organizations to use internally (some with a minimal external footprint). I now have plans for a web project that may (hopefully) see a large userbase with more daily traffic than my previous projects have ever attained. Obviously I want my design to be scalable and maintainable. The problem is that from a physical layout perspective (servers/VMs) I do not know what to expect.
The question: What are some good resources for this? Books? Websites? I have found plenty on scalable application design, but nothing on scalable physical design.
It's hard to give an exact answer without knowing something about what technologies you plan to use. The approach to the application can't be completely unaware of the planned physical infrastructure if scaling is a major driver.
Caching would have to be a big concern, as would ways to expand the hardware where your data lives.
A very interesting and instructive read is the real-world story of LiveJournal: a history of scaling, and of how they grew their physical presence alongside massive growth in their website. One major offshoot of their work was a new caching technology, memcached, which is now used by Facebook among others. It is surprisingly honest.
The High Scalability blog is good. You can look at some of their examples that go over the physical parts of large sites. I would say the most common first-level physical scaling technique is a load balancer. That is pretty easy, but in the simplest setup you still have a single database that is a potential bottleneck. Most of the physical parts of scaling just require you to add more of something, and the real issues come in where you are forced to use just one of something.
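As a rough illustration of what the load balancer does in that first step, here is a minimal round-robin sketch. The class name and backend addresses are hypothetical, and a real balancer would also handle health checks, session affinity and so on.

```python
import itertools
from typing import Iterable


class RoundRobinBalancer:
    """Hand out backends in a fixed rotation, one per incoming request."""

    def __init__(self, backends: Iterable[str]) -> None:
        pool = list(backends)
        if not pool:
            raise ValueError("at least one backend is required")
        self._rotation = itertools.cycle(pool)

    def pick(self) -> str:
        """Return the backend that should serve the next request."""
        return next(self._rotation)


# Hypothetical addresses of identical application servers
balancer = RoundRobinBalancer(["app1:8080", "app2:8080", "app3:8080"])
first_four = [balancer.pick() for _ in range(4)]
```

Adding capacity here means adding another entry to the list. The database behind the application servers has no such easy knob, which is the "just one of something" problem described above.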