Data Sunrise has been aggressively working to win our business. We are using MongoDB with three shards in 3 AZs across 2 regions. Does anyone have any experience with, or opinions about, Data Sunrise?
Is there currently a way to incorporate traffic patterns into OptaPlanner with the pickup-and-delivery VRP problem?
E.g. let's say I need to optimize 500 pickups and deliveries today and tomorrow among 30 vehicles, where each pickup has a 1-4 hour time window. I want to avoid busy areas of the city during rush hour when possible.
New pickups can also be added (or cancelled) in the meantime.
I'm sure this is a common problem. Does a decent solution exist for this in OptaPlanner?
Thanks!
Users often do this, but there is no out-of-the-box example of it.
There are several ways to do it, but one way is to add a third dimension to the distance matrix, indicating the departure time. Typically that uses a granularity of 15 minutes, 30 minutes or 1 hour.
There are 2 scaling concerns here:
Memory: a 15-minute granularity means 24 * 4 = 96 departure-time buckets per day. Given that a 2-dimensional distance matrix for 10k locations already uses almost 2 GB of RAM, memory can clearly become a concern.
Pre-calculation time: calculating the distance matrix can be time consuming. "Bulk algorithms" can help here. For example, GraphHopper's community edition doesn't support bulk distance calculations, but their enterprise version does, as does OSRM (which is free). Getting a 3-dimensional matrix from the remote Google Maps API, or the remote enterprise GraphHopper API, can also raise bandwidth concerns (see above: the distance matrix can become several GB in size, especially in non-binary formats such as JSON or CSV).
In any case, once that 3-dimensional matrix is there, it's just a matter of adjusting the OptaPlanner example's ArrivalTimeUpdateListener to use getDistance(from, to, departureTime).
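As a rough illustration of what such a departure-time-aware lookup could look like, here is a minimal sketch. The class name, array layout and bucket size are assumptions for illustration; only the getDistance(from, to, departureTime) idea comes from the answer above, and none of this is OptaPlanner API.

```java
// Minimal sketch of a departure-time-aware distance matrix.
// TimedDistanceMatrix, BUCKET_MINUTES and the array layout are hypothetical,
// not part of OptaPlanner; only the getDistance(from, to, departureTime)
// idea comes from the answer above.
public class TimedDistanceMatrix {

    private static final int BUCKET_MINUTES = 15;                     // 24 * 4 = 96 buckets per day
    private static final int BUCKETS_PER_DAY = 24 * 60 / BUCKET_MINUTES;

    // distances[bucket][fromIndex][toIndex], in meters or seconds of travel time
    private final long[][][] distances;

    public TimedDistanceMatrix(int locationCount) {
        this.distances = new long[BUCKETS_PER_DAY][locationCount][locationCount];
    }

    public void setDistance(int fromIndex, int toIndex, int bucket, long distance) {
        distances[bucket][fromIndex][toIndex] = distance;
    }

    /** departureTimeInMinutes is minutes since midnight, as a shadow variable would provide it. */
    public long getDistance(int fromIndex, int toIndex, long departureTimeInMinutes) {
        int bucket = (int) ((departureTimeInMinutes / BUCKET_MINUTES) % BUCKETS_PER_DAY);
        return distances[bucket][fromIndex][toIndex];
    }
}
```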
We are streaming around a million records per day into BQ and a particular string column has categorical values of "High", "Medium" and "Low".
I am trying to understand whether BigQuery does storage optimisations other than compression on its end, and what the scale of those optimisations is. I looked for documentation on this but was unable to find an explanation.
For example, if I have:
**Col1**
High
High
Medium
Low
High
Low
**... 100 Million Rows**
Would BQ store it internally as follows?
**Col1**
1
1
2
3
1
3
**... 100 Million Rows**
Summary of noteworthy (and correct!) answers:
As Elliott pointed out in the comments, you can read details on BigQuery's data compression here.
As Felipe notes, there is no need to consider these details as a user of BigQuery. All such optimizations are done behind the scenes, and are being improved continuously as BigQuery evolves without any action on your part.
As Mikhail notes in the comments, you are billed by the logical data size, regardless of any optimizations applied at the storage layer.
BigQuery constantly improves the underlying storage - and this all happens without any user interaction.
To see the original ideas behind BigQuery's columnar storage, read the Dremel paper:
https://ai.google/research/pubs/pub36632
To see the most recent published improvements in storage, see Capacitor:
https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
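As a purely conceptual illustration of the dictionary-encoding idea behind columnar formats (a simplified sketch, not BigQuery's actual on-disk representation):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of dictionary encoding for a low-cardinality column.
// This is NOT BigQuery's actual storage format; it only sketches the general idea
// that repeated strings can be stored once and referenced by small integer codes.
public class DictionaryEncodingSketch {

    public static void main(String[] args) {
        List<String> column = List.of("High", "High", "Medium", "Low", "High", "Low");

        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> codes = new ArrayList<>();

        for (String value : column) {
            Integer code = dictionary.get(value);
            if (code == null) {
                code = dictionary.size() + 1;   // assign the next small integer code
                dictionary.put(value, code);
            }
            codes.add(code);
        }

        System.out.println("dictionary = " + dictionary); // {High=1, Medium=2, Low=3}
        System.out.println("codes      = " + codes);      // [1, 1, 2, 3, 1, 3]
    }
}
```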
BigQuery relies on Colossus, Google’s latest generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time.
You may gather more detail from the "BigQuery under the hood" page.
Recently I had a discussion with our network and systems team about putting SQL Server files on different SAN LUNs. They believe that nowadays, because of the way the SAN (EMC) is managed, it is a waste of time and energy to put SQL Server files (data/log/LOB/index/backup files, and especially transaction logs) on separate drives with different spindles. Could you help by sharing your opinion on this question?
I would tend to agree with your SAN administrators on this one. Most SANs today are running RAID-10 or similar technologies, spanning many drives and handling very high IOPS. Physically separating spindles for SQL Server data and logs goes back to the days of local storage with low numbers of drives and low IOPS capabilities.
So - it's definitely worth placing your SQL data on separate LUNs, if only for scaling purposes. Definitely don't partition a LUN into multiple filesystems; I have seen that, and it's a road to destruction.
Putting different volumes on different physical spindles - this depends on a lot of factors.
What's the workload: OLTP or OLAP (transactional or analytical)?
What's the storage array? Is it traditional RAID (LUNs sit on RAID groups) or virtualized provisioning (LUNs on "extent" pools, extents on RAID groups, e.g. VNX, VMAX, Unity)?
Are you using thin provisioning?
How are you going to scale?
Measure what workload you are dishing out to your current storage devices. The main thing is to measure IOPS, but also IO block size: IOPS alone is a meaningless number; you want to know the size of IO operations to determine volume placement. Determine what technology you are using, traditional or virtualized. Use latency as the ultimate performance measure.
This should get the conversation with your storage guys started.
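As a starting point for the measurement step above, here is a minimal sketch that derives average IO size and average latency per database file from SQL Server's sys.dm_io_virtual_file_stats DMV. The JDBC connection string is a placeholder; in practice you might just run the embedded query directly in SSMS.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: derive average IO size and average latency per database file
// from SQL Server's sys.dm_io_virtual_file_stats DMV. The JDBC URL below is a
// placeholder; adjust server, database and authentication for your environment.
public class IoStatsSketch {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://myserver;databaseName=master;integratedSecurity=true";
        String sql =
            "SELECT DB_NAME(vfs.database_id) AS db, mf.physical_name, " +
            "       vfs.num_of_reads, vfs.num_of_writes, " +
            "       vfs.num_of_bytes_read, vfs.num_of_bytes_written, " +
            "       vfs.io_stall_read_ms, vfs.io_stall_write_ms " +
            "FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs " +
            "JOIN sys.master_files mf " +
            "  ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id";

        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                long reads = rs.getLong("num_of_reads");
                long writes = rs.getLong("num_of_writes");
                double avgReadKb = reads == 0 ? 0 : rs.getLong("num_of_bytes_read") / 1024.0 / reads;
                double avgWriteKb = writes == 0 ? 0 : rs.getLong("num_of_bytes_written") / 1024.0 / writes;
                double avgReadMs = reads == 0 ? 0 : (double) rs.getLong("io_stall_read_ms") / reads;
                double avgWriteMs = writes == 0 ? 0 : (double) rs.getLong("io_stall_write_ms") / writes;

                System.out.printf("%s %s | avg read %.1f KB / %.1f ms | avg write %.1f KB / %.1f ms%n",
                        rs.getString("db"), rs.getString("physical_name"),
                        avgReadKb, avgReadMs, avgWriteKb, avgWriteMs);
            }
        }
    }
}
```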
OK, let me try to explain this in more detail.
I am developing a diagnostic system for airplanes. Imagine that each airplane has 6 to 8 on-board computers, and each computer reports more than 200 different parameters. The diagnostic system receives all these parameters in a binary-formatted package; I then convert the data according to formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. New data must be handled every 10-20 seconds and persisted again.
We store the data for further analytic processing.
Requirements of storage:
support for sharding and replication
fast reads: support for B-tree indexing
NoSQL
fast writes
So, I calculated the average disk or RAM usage per plane per day: it is about 10-20 MB of data. The estimated load is therefore 100 airplanes per day, or about 2 GB of data per day.
It seems that storing all the data in RAM (memcached-like storage: Redis, Membase) is not suitable (too expensive). So now I am looking at MongoDB: since it can use both RAM and disk, it supports all the requirements listed above.
Please share your experience and advice.
There is a helpful article on NoSQL DBMS comparison.
You may also find information about their ranking and popularity, by category.
Regarding your requirements, it seems Apache Cassandra would be a candidate due to its linear scalability, column indexes, map/reduce support, materialized views and powerful built-in caching.
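If you do try the MongoDB route mentioned in the question, a minimal sketch of writing one converted reading per plane per 10-20 second interval could look like this. The database, collection and field names are hypothetical, and the index/sharding/replication configuration is omitted.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Date;

// Minimal sketch: one document per plane per reporting interval (every 10-20 s),
// holding the already-converted parameter values. Collection and field names are
// hypothetical; indexing, sharding and replication configuration are omitted.
public class TelemetryWriterSketch {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> readings =
                    client.getDatabase("diagnostics").getCollection("readings");

            Document reading = new Document("planeId", "PLANE-042")
                    .append("computerId", 3)
                    .append("timestamp", new Date())
                    .append("parameters", new Document("speedKmh", 815.4)
                            .append("altitudeKm", 10.2)
                            .append("engine1Rpm", 3400));

            readings.insertOne(reading);
        }
    }
}
```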
I have been given the task of designing and developing a web application for an NGO (non-governmental organization) which runs primary schools in many towns and villages. The application will keep a record of all the schools, students, volunteers and teachers of every school. Currently there are about 30 schools under the NGO's umbrella, but they have very ambitious plans to increase that number rapidly.
We will host the app on Windows Azure using SQL Azure as the database. Now I am facing a tough task: how to design its database with minimum expenditure (as the NGO is funded completely by charities and donations). As you might know, SQL Azure databases are offered in specific sizes like 5, 10, 20 and up to 50 GB, which puts a restriction on the maximum size of each database. I have come up with the following approaches:
1) For every school, create a separate 5 or 10 GB database. Each database will have tables like 'student', 'subject', 'attendance', etc. The problem with this approach is that a lot of databases will have to be created, one for every school, which would drastically shoot up the cost. Also, initially a large portion of the 10 GB will be under-utilized, while in the future 10 GB may turn out to be too little for storing a school's data.
2) Keep a single database with tables like 'school', 'student', 'attendance', etc. This would keep the cost low initially, but over time the database would fill up and may reach the 50 GB limit as the NGO opens more schools. Also, a single table for 'student' and especially 'attendance' would have a huge number of records and would make queries slow. And even if we add another database in the future, how easy would it be to split the tables across several databases?
Keeping these limitations in mind, we are unable to proceed further.
Any approach or suggestion by you will be very helpful for us.
Thanks in advance.
EDIT: Thanks a lot to the people who answered my question. I got the point: 50 GB is a huge amount of space and it would not get filled any time soon. But that raises a question: consider a situation where the number of schools grows to 200, 300 or 1000! How should my database be designed then? I suppose 50 GB would not be big in that situation.
I used to work for a company that makes school systems; although 50 GB would be considered large for most of them, a few had databases that were much larger. Historical records are typically the issue here, especially if you add additional features over time, such as lead import.
You described two scenarios: a linear shard and a scale-up architecture. The linear shard implements a database per school; the scale-up puts them all in the same database. There are additional options to consider with SQL Azure. See one of my blog posts about a white paper I published regarding various scalability models: http://geekswithblogs.net/hroggero/archive/2010/12/23/multitenant-design-for-sql-azure-white-paper-available.aspx
SQL Azure has also announced an upcoming feature called Data Federation, which is most likely what you are looking for. Here are two blog posts you may find relevant:
http://geekswithblogs.net/hroggero/archive/2011/07/23/preparing-for-data-federation-in-sql-azure.aspx
http://geekswithblogs.net/hroggero/archive/2011/09/07/sharding-library-for-sql-azure-data-federation.aspx
The last link discusses an open-source library, called the Enzo Shard, that I am building to assist developers in taking advantage of the future capabilities of SQL Azure Data Federation. The version that supports data federation is in Beta and allows parallel queries to be performed across federation members (i.e. databases).
Finally, don't miss the posts by Cihan (from Microsoft) that discuss this feature in greater detail: http://blogs.msdn.com/b/cbiyikoglu/
In summary, the field of scalability in SQL Azure is evolving. However, many capabilities are coming that will provide significant opportunities for data growth and performance.
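To make the "database per school" (linear shard) option concrete, here is a minimal sketch of routing a tenant to its own database by school id. The shard map and connection strings are hypothetical placeholders; this is not the Enzo Shard library or the Data Federation API.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;

// Minimal sketch of "linear shard" routing: each school (tenant) lives in its own
// database, and the application picks the connection string by school id.
// The shard map and connection strings are hypothetical placeholders.
public class SchoolShardRouter {

    // In a real system this map would live in a small catalog database.
    private static final Map<Integer, String> SHARD_MAP = Map.of(
            1, "jdbc:sqlserver://myserver.database.windows.net;databaseName=school_001",
            2, "jdbc:sqlserver://myserver.database.windows.net;databaseName=school_002");

    public Connection connectionForSchool(int schoolId, String user, String password) throws SQLException {
        String url = SHARD_MAP.get(schoolId);
        if (url == null) {
            throw new IllegalArgumentException("No shard registered for school " + schoolId);
        }
        return DriverManager.getConnection(url, user, password);
    }
}
```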
50 Gigabytes is an awful lot of data. School personnel and attendance is a pretty small problem. A properly designed database is unlikely to approach 50 gigabytes for decades, at least.
Even 60 schools should not generate that much data, even if you're tracking standardized testing data of some kind. If there is a secondary school covering grades 6 to 12 (I'm using the U.S. as a reference) on the quarter system, with an average of 6 classes per student and 1,000 students in the school, that is only 6 * 4 * 1,000 = 24,000 class records per year. Not all 30 schools are going to be secondary schools, so 50 GB should be plenty. I worked with a database containing enrollment, testing, student and teacher information for one of the biggest school districts in the United States; after 7+ years their database barely approached 30 GB.
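As a back-of-envelope check on that estimate (the 200-byte average row size is purely an assumption for illustration):

```java
// Back-of-envelope sizing check. The 200-byte average row size is an assumption
// for illustration only; real row sizes depend on the actual schema.
public class SizingSketch {

    public static void main(String[] args) {
        int students = 1000;
        int classesPerStudent = 6;
        int quartersPerYear = 4;

        long classRecordsPerYear = (long) students * classesPerStudent * quartersPerYear; // 24,000
        long assumedBytesPerRow = 200;
        long bytesPerYear = classRecordsPerYear * assumedBytesPerRow;

        System.out.printf("%,d class records/year ~= %.1f MB/year per school%n",
                classRecordsPerYear, bytesPerYear / (1024.0 * 1024.0));
    }
}
```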
Also, check out the new Elastic Scale feature in Azure SQL DB, which can help you scale out instead of scaling up.
I would suggest you take a look at Azure Table Storage as well, to keep your costs down without worrying about growing size. Obviously the challenge would be designing your application for Table Storage, which is non-relational in nature.
You'll never hit 50 GB with just names and a couple of other string/text columns. Even with all the schools in the same DB you'll be fine with 5 GB. I've administered millions of rows of more complex data and never hit 50 GB (unless there was a problem!) :)