Databricks: Best practice for creating queries for data transformation? - sql

My apologies in advance for sounding like a newbie. This is really just a curiosity question I have as an outsider observing my team clash with our client. Please ask any questions you have, and I will try my best to answer it.
Currently, we are storing our transformation queries in a DynamoDB table. When needed, we pull into Databricks and execute the query. Simple as that. Our client has called this out as “hard coding” (more on that soon)
Our client has come up with an alternative that involves creating JSON config files containing the transformation rules (all tables/attributes required, target table names, Alias names, join keys, etc. etc.). From here, the SQL query is dynamically created. This approach is still “hard coding” since these config files would need to be manually edited anytime there is a change in the rules.
The way I see this: I think storing the transform rules in JSON is more business user friendly, but that’s about where I see the pros end. It brings in much more complexity to the code and likely will need to be continuously developed to support new queries. Also, I don’t see anyway to prevent “hard coding”. The client business leads seem to think there is some magical tool to convert plain English text to complex SQL queries
I just wanted to get some experts thoughts on this. Which solution is better, or is there another approach that should be taken?

Related

Can I take advantage of Yugabytes compatability?

Yugabyte seems to support Redis, Cassandra and SQL queries. Do they work with each other? For example, can I write data with Cassandra API and later perform SQL queries against them?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types are all not always present in the other APIs, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.

MongoDB or SQL for text file?

I have a 25GB's text file with that structure(headers):
Sample Name Allele1 Allele2 Code metaInfo...
So it just one table with a few millions of records. I need to put it to database coz sometimes I need to search that file looking, for example, specific sample. Then I need to get all row and equals to file. This would be a basic application. What is important? File is constant. It no needed put function coz all samples are finished.
My question is:
Which DB will be better in this case and why? Should I put a file in SQL base or maybe MongoDB would be a better idea. I need to learn one of them and I want to pick the best way. Could someone give advice, coz I didn't find in the internet anything particular.
Your question is a bit broad, but assuming your 25GB text file in fact has a regular structure, with each line having the same number (and data type) of columns, then you might want to host this data in a SQL relational database. The reason for choosing SQL over a NoSQL solution is that the former tool is well suited for working with data having a well defined structure. In addition, if you ever need to relate your 25GB table to other tables, SQL has a bunch of tools at its disposal to make that fast, such as indices.
Both MySQL and MongoDB are equally good for your use-case, as you only want read-only operations on a single collection/table.
For comparison refer to MySQL vs MongoDB 1000 reads
But I will suggest going for MongoDB because of its aggeration pipeline. Though your current use case is very much straight forward, in future you may need to go for complex operations. In that case, MongoDB's aggregation pipeline will come very handy.

Best way to migrate data from Access to SQL Server

The problem
Ok, sorry that my question is somewhat abstract and subjective, but will try to make it as specific as possible. So, the situation I am in is simple - I am remaking a very old MS Access application on a new website using ASP.NET MVC. As currently the MVC site is using SQL Server 2008 (for many well known reasons) I need to find a way to migrate the tables AND the data, because the information in the old database will be used in the new application.
Alright, so far so good, however there are a few problems. The old application is written in a different language, meaning that I want to translate table names, field names, and all other names that are there to English. Furthermore, I will be making some changes on the models themselves (change the type of some fields, add additional fields to some tables, remove old unnecessary ones and more). So technically I'll be 'having my way' with everything.
Researched solutions
With those things in mind I researched for the ways to migrate data from Access database to a SQL Server. Of course, there is a lot of information on the matter, in Stack Overflow alone there are more than a few questions and solutions. So why am I struggling to find the answer ? Well I found a few solutions that will be sufficient to some extend (actually will definitely solve my problems) but I am writing to ask if someone experienced has a better perspective on it than I do. Alright, the solutions and why I am still looking for advice: /I'll be listing just a couple of the most common and popular ones that I found, many of the others share the same capabilities and/or results /
Upsize Wizzard (Access) - this is a tool devised specifically for migrating tables and data from Access. It is my most favourite one for the moment as I find it kind of straightforward to work with and it provides good overall results. I was able to migrate the tables to SQL Server (along with the data of course) which more or less is what I am intending to do. It is fast, it seems like it allows you to migrate indexes, primary keys and even to my knowledge foreign keys (table relationships). The downsides of this tool, however, include that it ignores your queries (which I don't really need honestly) and it doesn't provide a way to change the model, names or types of the properties of the table you migrate - which is the thing I kind of prefer, because I will have to make more than a few changes, adding, renaming, deleting, etc. And then continue with the development process (of the application) which will lead to a few additional minor changes. And finally I would need to apply all changes (migration + all changes) on the production server, which overall is prone to mistakes as I will be doing it by hand (and there are more than a few tables).
SQL Server Migration Assistant (SSMA) - ok, this is a separate tool (not included in Access) with again the same idea - to migrate data from Access to ... possibly everywhere, haven't researched that. Overall it offers more functionality and customizing from the Upsize Wizard, but of course it does it in a more complicated way. I haven't put enough effort to make a migration with this tool yet, as it involves a lot of installations and additional work, but according to my research it provides almost all (if not all) of the functionality I require. The downside however comes with the naming. As I mentioned it allows you to apply changes on the tables, schema, fields, indexes, keys and probably everything, but the articles advice that I change the names in Access first, as it will be easier and the migration process will run more smoothly. I am not allowed to make changes on the original Access database, as it will remain functional until the publish of the 'renewed' project, and the data inside it is being used, so a mere copy of the file is a solution I am not particularly fond of, because I might loose new records. Also I cant predict the changes I would want to make in the development process (as I said I believe I would want/need to apply some additional changes later on when I find 'weaknesses' in my data design in the development process) so I find it to be a little half baked solution.
Conclusion
The options presented, the way I see them, are two:
Use the Upsize Wizard to migrate the access tables, then write a script that applies the changes I want to make. Then in the development process add any additional changes to the script. When ready to publish on the production server, reapply the migration with the wizard, run the changes script and pray everything is fine.
Get more involved with the SSMA tool and try producing an updated version of the tables with the migration process. (See how efficient the renaming is and decide whether to use copied file to rename and then find a way to migrate only new records or do it all in the SSMA). Then again write a script for the changes that occur in the development process and re-do and apply it all on the production server when ready and then pray everything is fine.
Option I have not yet seen, apply it and then pray everything is fine.
I have researched the matter for a couple of days now, and found a few more solutions that I do not believe are better by the mentioned. However I include the possibility of missing the 'big red X on the map', a practical and easy solution which seems like it was designed specifically for me (though I doubt that a little). Anyway, reducing all the madness that I have written so far to a few simple questions will look like:
Is anyone aware if my conclusions are correct? I am leaning towards option one as it is easier to accomplish.
Has anyone experienced/found a better way to do that, or just found some 'logic-leaps' in my writings as I am overthinking the entire thing a little and may be doing some obvious miscalculation.
Very sorry for asking a trivial question and one that includes decision making that may involve deeper understanding of my project and situation, yet I am working with rather sensitive data and would appreciate feedback, even if only to improve my confidence into the chosen approach.
There is one other tool/method you might want to consider that seems to cater to your specific needs more. This would be to use the data import/export tool that ships with sqlserver to do a complete copy of all data into a temporary location within sql server and then write custom queries to reorganize the names and other changes you want to make. Is a bit more work but you could use the end product as a seed method for your migrations ;) (if you are doing code first anyway)

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types. The existing database is a MySQL database, and is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also be relying pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building, once it's finished, converting back to hard coded queries) that would entail an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data, I know that join's are not a good idea for non-relational db's, but it would be nice to have the facility just in case, which can be re-written into two queries later for performance times)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me, if not, are there any resources on similar projects undertaken that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing for mistakes to occur
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
Gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
Like you already know, this is a hard problem. It will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that has a copy of all the data and then running queries against that? Replicating all writes to something like Redis or Elastic Search and then write your queries there?
Either way, good luck!

How do you think while formulating Sql Queries. Is it an experience or a concept?

I have been working on sql server and front end coding and have usually faced problem formulating queries.
I do understand most of the concepts of sql that are needed in formulating queries but whenever some new functionality comes into the picture that can be dont using sql query, i do usually fails resolving them.
I am very comfortable with select queries using joins and all such things but when it comes to DML operation i usually fails
For every query that i never done before I usually finds uncomfortable with that while creating them. Whenever I goes for an interview I usually faces this problem.
Is it their some concept behind approaching on formulating sql queries.
Eg.
I need to create an sql query such that
A table contain single column having duplicate record. I need to remove duplicate records.
I know i can find the solution to this query very easily on Googling, but I want to know how everyone comes to the desired result.
Is it something like Practice Makes Man Perfect i.e. once you did it, next time you will be able to formulate or their is some logic or concept behind.
I could have get my answer of solving above problem simply by posting it on stackoverflow and i would have been with an answer within 5 to 10 minutes but I want to know the reason. How do you work on any new kind of query. Is it a major contribution of experience or some an implementation of concepts.
Whenever I learns some new thing in coding section I tries to utilize it wherever I can use it. But here scenario seems to be changed because might be i am lagging in some concepts.
EDIT
How could I test my knowledge and
concepts in Sql and related sql
queries ?
Typically, the first time you need to open a child proof bottle of pills, you have a hard time, but after that you are prepared for what it might/will entail.
So it is with programming (me thinks).
You find problems, research best practices, and beat your head against a couple of rocks, but in the process you will come to have a handy set of tools.
Also, reading what others tried/did, is a good way to avoid major obsticles.
All in all, with a lot of practice/coding, you will see patterns quicker, and learn to notice where to make use of what tool.
I have a somewhat methodical method of constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.
The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.
A table contain single column having
duplicate record. I need to remove
duplicate
I have a table (I'll call it table1)
I have a
column on table table1 (I'll call it col1)
I have
duplicates in col1 on table table1
I need to remove
duplicates.
The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.
The next step is isolation.
Asnwer the question "the action determined above should only be valid for ______..?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!
The next step is execution, or essentially the "how do I get this done" part of a request.
A lot of times you'll figure the how out during the isolation step, but in some instances (yours included) how you isolate something, and how you fix it is not the same thing.
Showing duplicated values is different than removing a specific duplicate.
The last step is implementation. This is just where I take everything and make the query...
Summing it all up... for me to construct a query I'll pick out all information that I have in the request. Using the information I'll figure out what I need to do (the action), and what I need to do it on (isolation). Once I know what I need to do with what I figure out the execution.
Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.
Kris
I think in the same way I cook dinner. I have some ingredients (tables, columns etc.), some cooking methods (SELECT, UPDATE, INSERT, GROUP BY etc.) then I put them together in the way I know how.
Sometimes I will do something weird and find it tastes horrible, or that it is amazing.
Occasionally I will pick up new recipes from the internet or friends, then use parts of these in my own.
I also save my recipes in handy repositories, broken down into reusable chunks.
On the "Delete a duplicate" example, I'd come to the result by googling it. This scenario is so rare if the DB is designed properly that I wouldn't bother keeping this information in my head. Why bother, when there is a good resource is available for me to look it up when I need it?
For other queries, it really is practice makes perfect.
Over time, you get to remember frequently used patterns just because they ARE frequently used. Rare cases should be kept in a reference material. I've simply got too much other stuff to remember.
Find a good documentation to your software. I am using Mysql a lot and Mysql has excellent documentation site with decent search function so you get many answers just by reading docs. If you do NOT get your answer at least you are learning something.
Than I set up an example database (or use the one I am working on) and gradually build my SQL. I tend to separate the problem into small pieces and solve it step by step - this is very successful if you are building queries including many JOINS - it is best to start with some particular case and "polute" your SQL with many conditions like WHEN id = "123" which you are taking out as you are working towards your solution.
The best and fastest way to learn good SQL is to work with someone else, preferably someone who knows more than you, but it is not necessarry condition. It can be replaced by studying mature code written by others.
Your example is a test of how well you understand the DISTINCT keyword and the GROUP BY clause, which are SQL's ways of dealing with duplicate data.
Examples and experience. You look at other peoples examples and you create your own code and once it groks, you don't need to think about it again.
I would have a look at the Mere Mortals book - I think it's the one by Hernandez. I remember that when I first started seriously with SQL Server 6.5, moving from manual ISAM databases and Access database systems using VB4, that it was difficult to understand the syntax, the joins and the declarative style. And the SQL queries, while powerful, were very intimidating to understand - because typically, I was looking at generated code in Microsoft Access.
However, once I had developed a relatively systematic approach to building queries in a consistent and straightforward fashion, my skills and confidence quickly moved forward.
From seeing your responses you have two options.
Have a copy of the specification for whatever your working on (SQL spec and the documentation for the SQL implementation (SQLite, SQL Server etc..)
Use Google, SO, Books, etc.. as a resource to find answers.
You can't formulate an answer to a problem without doing one of the above. The first option is to become well versed into the capabilities of whatever you are working on.
The second option allows you to find answers that you may not even fully know how to ask. You example is fairly simplistic, so if you read the spec/implementation documentaion you would know the answer right away. But there are times, where even if you read the spec/documentation you don't know the answer. You only know that it IS possible, just not how to do it.
Remember that as far as jobs and supervisors go, being able to resolve a problem is important, but the faster you can do it the better which can often be done with option 2.