Can I take advantage of YugabyteDB's compatibility? - sql

Yugabyte seems to support Redis, Cassandra, and SQL queries. Do they work with each other? For example, can I write data with the Cassandra API and later perform SQL queries against it?

These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types are not always present in the other APIs, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
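To make the intent concrete, here is a minimal sketch of the generic PostgreSQL FDW pattern such an integration would follow, run against the YSQL endpoint. This is not an existing YugabyteDB feature; the wrapper name, options, and table layout below are illustrative placeholders only.

```python
# Sketch only: the generic PostgreSQL foreign-data-wrapper workflow.
# "cassandra_fdw" and its OPTIONS are hypothetical placeholders here,
# not a shipped YugabyteDB integration.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS cassandra_fdw;
CREATE SERVER ycql_server
    FOREIGN DATA WRAPPER cassandra_fdw
    OPTIONS (host '127.0.0.1', port '9042');
CREATE FOREIGN TABLE users_from_ycql (id uuid, name text)
    SERVER ycql_server
    OPTIONS (keyspace 'app', tablename 'users');
"""

with psycopg2.connect("host=127.0.0.1 port=5433 dbname=yugabyte user=yugabyte") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # Once "imported", the YCQL table is queryable with ordinary SQL.
        cur.execute("SELECT id, name FROM users_from_ycql LIMIT 10;")
        print(cur.fetchall())
```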
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.

Related

Databricks: Best practice for creating queries for data transformation?

My apologies in advance for sounding like a newbie. This is really just a curiosity question I have as an outsider observing my team clash with our client. Please ask any questions you have, and I will try my best to answer them.
Currently, we are storing our transformation queries in a DynamoDB table. When needed, we pull them into Databricks and execute the query. Simple as that. Our client has called this out as “hard coding” (more on that soon).
Our client has come up with an alternative that involves creating JSON config files containing the transformation rules (all tables/attributes required, target table names, Alias names, join keys, etc. etc.). From here, the SQL query is dynamically created. This approach is still “hard coding” since these config files would need to be manually edited anytime there is a change in the rules.
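For illustration only, here is a minimal sketch of the config-driven generation the client is describing (the field names and tables are invented, not taken from the actual config):

```python
import json

# Hypothetical rule file: nothing below comes from the client's real config.
CONFIG = json.loads("""
{
  "source_table": "raw.customers",
  "target_table": "dim_customer",
  "columns": [
    {"source": "cust_id",   "alias": "customer_id"},
    {"source": "full_name", "alias": "customer_name"}
  ],
  "join": {"table": "raw.regions", "on": "raw.customers.region_id = raw.regions.id"}
}
""")

def build_query(cfg: dict) -> str:
    """Turn a transformation-rule config into a SELECT statement."""
    cols = ", ".join(f"{c['source']} AS {c['alias']}" for c in cfg["columns"])
    sql = f"SELECT {cols} FROM {cfg['source_table']}"
    join = cfg.get("join")
    if join:
        sql += f" JOIN {join['table']} ON {join['on']}"
    return sql

print(build_query(CONFIG))
# SELECT cust_id AS customer_id, full_name AS customer_name
#   FROM raw.customers JOIN raw.regions ON raw.customers.region_id = raw.regions.id
```

Either way, when a rule changes a human still edits something by hand: SQL text in DynamoDB, or a JSON file like this.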
The way I see this: storing the transform rules in JSON is more business-user friendly, but that’s about where the pros end. It brings much more complexity to the code and will likely need continuous development to support new queries. Also, I don’t see any way it prevents “hard coding”. The client's business leads seem to think there is some magical tool to convert plain English text into complex SQL queries.
I just wanted to get some experts' thoughts on this. Which solution is better, or is there another approach that should be taken?

Best way to migrate data from Access to SQL Server

The problem
Ok, sorry that my question is somewhat abstract and subjective, but I will try to make it as specific as possible. The situation I am in is simple - I am remaking a very old MS Access application as a new website using ASP.NET MVC. As the MVC site uses SQL Server 2008 (for many well-known reasons), I need to find a way to migrate the tables AND the data, because the information in the old database will be used in the new application.
Alright, so far so good, however there are a few problems. The old application is written in a different language, meaning that I want to translate the table names, field names, and all other names into English. Furthermore, I will be making some changes to the models themselves (changing the type of some fields, adding additional fields to some tables, removing old unnecessary ones, and more). So technically I'll be 'having my way' with everything.
Researched solutions
With those things in mind, I researched ways to migrate data from an Access database to SQL Server. Of course, there is a lot of information on the matter; on Stack Overflow alone there are more than a few questions and solutions. So why am I struggling to find the answer? Well, I found a few solutions that are sufficient to some extent (actually, they will definitely solve my problems), but I am writing to ask if someone experienced has a better perspective on it than I do. Alright, the solutions and why I am still looking for advice (I'll list just a couple of the most common and popular ones that I found; many of the others share the same capabilities and/or results):
Upsize Wizard (Access) - this is a tool devised specifically for migrating tables and data from Access. It is my favourite for the moment, as I find it fairly straightforward to work with and it provides good overall results. I was able to migrate the tables to SQL Server (along with the data, of course), which more or less is what I am intending to do. It is fast, and it seems to allow you to migrate indexes, primary keys and, to my knowledge, even foreign keys (table relationships). The downsides of this tool, however, are that it ignores your queries (which I don't really need, honestly) and that it doesn't provide a way to change the model, names or types of the properties of the tables you migrate - which is something I would really like, because I will have to make more than a few changes (adding, renaming, deleting, etc.) and then continue with the development process of the application, which will lead to a few additional minor changes. Finally, I would need to apply everything (the migration plus all the changes) on the production server, which is prone to mistakes as I will be doing it by hand (and there are more than a few tables).
SQL Server Migration Assistant (SSMA) - ok, this is a separate tool (not included in Access) with the same idea: to migrate data from Access to ... possibly everywhere; I haven't researched that. Overall it offers more functionality and customisation than the Upsize Wizard, but of course it does so in a more complicated way. I haven't put enough effort into making a migration with this tool yet, as it involves a lot of installation and additional work, but according to my research it provides almost all (if not all) of the functionality I require. The downside, however, comes with the naming. As I mentioned, it allows you to apply changes to the tables, schema, fields, indexes, keys and probably everything, but the articles advise changing the names in Access first, as it will be easier and the migration process will run more smoothly. I am not allowed to make changes to the original Access database, as it will remain functional until the 'renewed' project is published and the data inside it is in use, so a mere copy of the file is a solution I am not particularly fond of, because I might lose new records. Also, I can't predict the changes I will want to make during development (as I said, I believe I will need to apply additional changes later on when I find 'weaknesses' in my data design), so I find it a slightly half-baked solution.
Conclusion
The options, as I see them, are two:
Use the Upsize Wizard to migrate the Access tables, then write a script that applies the changes I want to make. Then, during development, add any additional changes to the script. When ready to publish on the production server, reapply the migration with the wizard, run the changes script and pray everything is fine.
Get more involved with the SSMA tool and try producing an updated version of the tables with the migration process. (See how efficient the renaming is and decide whether to use copied file to rename and then find a way to migrate only new records or do it all in the SSMA). Then again write a script for the changes that occur in the development process and re-do and apply it all on the production server when ready and then pray everything is fine.
An option I have not yet seen: apply it and then pray everything is fine.
I have researched the matter for a couple of days now and found a few more solutions that I do not believe are better than the ones mentioned. However, I allow for the possibility that I am missing the 'big red X on the map': a practical and easy solution that seems like it was designed specifically for me (though I doubt that a little). Anyway, reducing all the madness I have written so far to a few simple questions:
Does anyone know whether my conclusions are correct? I am leaning towards option one, as it is easier to accomplish.
Has anyone experienced/found a better way to do this, or spotted some 'logic leaps' in my writing? I may be overthinking the entire thing a little and making some obvious miscalculation.
Very sorry for asking a trivial question, and one whose answer may require deeper understanding of my project and situation, yet I am working with rather sensitive data and would appreciate feedback, even if only to improve my confidence in the chosen approach.
There is one other tool/method you might want to consider that seems to cater to your specific needs better. That would be to use the data import/export tool that ships with SQL Server to do a complete copy of all data into a temporary location within SQL Server, and then write custom queries to reorganise the names and make the other changes you want. It is a bit more work, but you could use the end product as a seed method for your migrations ;) (if you are doing code first anyway)
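As a rough sketch of that second step, assuming the Access data has already been copied into a staging schema (every table and column name below is invented for illustration):

```python
import pyodbc

# Hypothetical layout: the import/export tool dumps Access tables into a
# "staging" schema as-is; custom queries then rename and retype on the way
# into the final, English-named tables.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=NewApp;Trusted_Connection=yes;"
)
cur = conn.cursor()

cur.execute("""
    INSERT INTO dbo.Customers (CustomerId, FullName, CreatedOn)
    SELECT Klient_Id,                     -- old, non-English column names
           Ime,
           CONVERT(datetime2, Data)       -- type change made during the move
    FROM   staging.Klienti;
""")
conn.commit()
```

Because the reshaping is plain SQL, the whole script can live in source control and be re-run against the production copy at release time.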

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types: the existing MySQL database, which is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also rely pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building; once it's finished, I would convert back to hard-coded queries). It would entail an encapsulation object containing:
what data is being selected (including functionally derived data)
source (including joined data; I know that joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it could be rewritten as two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning the results as an array, object or whatever is desired.
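Roughly, I am picturing something like this (names and structure invented purely to illustrate the shape):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Condition:
    """A single WHERE/HAVING predicate, kept structured so each backend
    adapter can translate it later (value could itself be another query)."""
    column: str
    operator: str
    value: Any

@dataclass
class SelectQuery:
    """Database-agnostic description of a SELECT; an adapter turns it into
    MySQL SQL or a MongoDB find/aggregate call."""
    columns: List[str]
    source: str
    joins: List[str] = field(default_factory=list)
    where: List[Condition] = field(default_factory=list)
    having: List[Condition] = field(default_factory=list)
    group_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    limit: Optional[int] = None

class MySQLAdapter:
    """Builds a SQL string from the query description (deliberately naive)."""
    def build(self, q: SelectQuery) -> str:
        sql = f"SELECT {', '.join(q.columns)} FROM {q.source}"
        if q.joins:
            sql += " " + " ".join(q.joins)
        if q.where:
            sql += " WHERE " + " AND ".join(
                f"{c.column} {c.operator} %s" for c in q.where)
        if q.group_by:
            sql += " GROUP BY " + ", ".join(q.group_by)
        if q.having:
            sql += " HAVING " + " AND ".join(
                f"{c.column} {c.operator} %s" for c in q.having)
        if q.order_by:
            sql += " ORDER BY " + ", ".join(q.order_by)
        if q.limit is not None:
            sql += f" LIMIT {q.limit}"
        return sql

# The same description could be handed to a MongoAdapter instead.
q = SelectQuery(columns=["id", "name"], source="users",
                where=[Condition("age", ">", 18)], limit=10)
print(MySQLAdapter().build(q))  # SELECT id, name FROM users WHERE age > %s LIMIT 10
```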
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me? If not, are there any resources on similar projects that might allow me to continue the work and build a basic version?
I know this would be a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing mistakes to occur.
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
It gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
As you already know, this is a hard problem. It will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that holds a copy of everything, and then running queries against that? For example, replicating all writes to something like Redis or Elasticsearch and then writing your queries there?
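A tiny sketch of that replication idea, assuming Redis as the third store (the function and table names are made up):

```python
import json
import redis  # assumption: Redis is the chosen read-side store

cache = redis.Redis(host="localhost", port=6379)

def save_user(mysql_conn, mongo_db, user: dict) -> None:
    """Write to the two primary stores, then mirror a denormalised copy
    to Redis so cross-store reads can hit a single place."""
    cur = mysql_conn.cursor()
    cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)",
                (user["id"], user["name"]))
    mysql_conn.commit()
    mongo_db.user_profiles.insert_one(dict(user))
    cache.set(f"user:{user['id']}", json.dumps(user))
```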
Either way, good luck!

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and from the Google Code wiki page with similar info. There is a lot of data, and it is partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80-100 MB of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help finding the best possible one for my situation. I know I will possibly have to work around the limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are some other free and open-source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend you share the load and think about the model in those terms. Surely you can do something there.
I've used Mongo's geospatial features and can offer some guidance if you need help with a C# or JavaScript implementation - I would recommend it to start with because it's super easy to use. I'm learning all about Neo4j right now and am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross-reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyse which movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), then by keeping that object id reference I can simply add some code instructing each node to go out and grab a list of movies from the related profile.
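A minimal sketch of that cross-referencing idea (shown here in Python rather than C#/JavaScript, with made-up labels and property names):

```python
from pymongo import MongoClient
from neo4j import GraphDatabase

# Assumption: heavy documents (profiles) live in Mongo; relationships live in
# Neo4j; the Mongo ObjectId is stored as a property on the Neo4j node.
mongo = MongoClient("mongodb://localhost:27017")["app"]
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

profile_id = mongo.profiles.insert_one(
    {"name": "Alice", "favourite_movies": ["Alien", "Arrival"]}
).inserted_id

with driver.session() as session:
    session.run(
        "MERGE (p:Person {mongo_id: $mid}) SET p.name = $name",
        mid=str(profile_id), name="Alice",
    )
    # Traversals (friend, friend-of-friend, ...) run in Neo4j; the mongo_id on
    # each node is then used to pull the full profile back out of Mongo.
```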
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that can be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but those are not relevant to the point)... on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database (in other words, in the context of a relational database, querying for the data required to generate an entity, or a list of entities, that fulfils the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases share MySQL's performance problems when the database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
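The "stations/stops around a GPS position" requirement in the question maps onto a single PostGIS query. A hedged sketch, assuming the GTFS stops were loaded into a table with a geography column (the table and column names are invented):

```python
import psycopg2

# Assumption: stops.txt was imported into a "stops" table with a
# geography(Point, 4326) column named geom.
conn = psycopg2.connect("dbname=gtfs user=postgres")
cur = conn.cursor()

lon, lat = -122.4194, 37.7749  # the pinpointed GPS position

# All stops within 500 metres of the position, nearest first.
cur.execute("""
    SELECT stop_id, stop_name,
           ST_Distance(geom, ST_MakePoint(%s, %s)::geography) AS metres
    FROM   stops
    WHERE  ST_DWithin(geom, ST_MakePoint(%s, %s)::geography, 500)
    ORDER  BY metres;
""", (lon, lat, lon, lat))

for stop_id, stop_name, metres in cur.fetchall():
    print(stop_id, stop_name, round(metres, 1))
```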

Wiki Database, is there one?

I was searching the net for something like a wiki database - just like Wikipedia, but one that stores structured content, editable by users. What I was looking for was an online database accessible by everyone, where people can design the schema and data, with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB, where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in/out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value pairs, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to be implied rather than explicitly defined.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in CouchDB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.