COPY CSV files, then analyze? step by step?

COPY CSV files, then analyze? step by step? - sql

I just started to upload CSV files and creating tables in the database of the company I work for. Would anyone be so kind to explain the proper steps to ensure the copy was complete and there are no mistakes?
My boss told me a few steps of how they do things:
CREATE TABLE -> COPY
or
INSERT DATA -> CREATE INDEX/CONSTRAINTS (if necessary) -> TABLE ANALYSIS
The table analysis part is the confusing part for me. They told me to analyze the table, then check for errors, then get the estimate rows. What do I do with the estimate rows? I used ANALYZE table_name but nothing really shows on the data output.
Please help!

My answer is going to take a slightly different tack.
Clearly your boss has given you instructions and you don't understand them. In my opinion it is important that you go back to your boss and keep asking questions until you understand.
There are a number of important reasons for this:
1. You understand what you are being asked to do (rather than us guessing).
2. If it goes wrong you have done what you were asked, and
3. You might learn something.
The attitude that asking questions ("asking noob questions again") makes you look stupid or ignorant is very dangerous and will, in fact, make you stupid and ignorant.
After 30 years developing some very complex software systems, I still ask questions whenever I don't understand something. The result? In the end, I understand.
This is the only way to actually get better. None of us was born knowing how to do everything.

This sounds like a big misunderstanding. Your boss probably just wants you to run
ANALYZE table_name;
for every table to update the statistics (including row estimates). The query planner uses those statistics to chose how to best execute queries. Read the fine manual about ANALYZE.
Better ask your boss next time if you don't understand instructions.

Typically you'll want to check for a few things
The number of rows inserted is consistent
Data Types are sufficient (Ensure you didn't allocate too much or too little space to a variable)
Data Types are consistent (I.e. You're using integer data types for integers if it's in the design)
There weren't errors caused special characters (You shouldn't have this problem if you used a proper delimiter)
My assumption is "get estimate rows" is he simply wants the number of rows returned. I'll leave that up to you to figure out how to determine that.
If the CSV file was created correctly, I wouldn't sweat it too much. Don't be afraid to ask for help or advice from your colleagues, that's how you learn!
Best of luck!

Related

Fixing a slow running SQL query

I have been asked this in many interviews:
What is the first step to do if somebody complains that a query is running slowly?
I say that I run sp_who2 <active> and check the queries running to see which one is taking the most resources and if there is any locking, blocking or deadlocks going on.
Can somebody please provide me their feedback on this? Is this the best answer or is there a better approach?
Thanks!

This is one of my interview questions that I've given for years. Keep in mind that I do not use it as a yes/no, I use it to gauge how deep their SQL Server knowledge goes and whether they're server or code focused.
Your answer went towards how to find which query is running slow, and possibly examine server resource reasons as to why it's suddenly running slow. Based on your answer, I would start to label you as an operational DBA type. These are exactly the steps that an operational DBA performs when they get the call that the server is suddenly running slow. That's fine if that's what I'm interviewing for and that's what you're looking for. I might dig further into what your steps would be to resolve the issue once you find deadlocks for example, but I wouldn't expect people to be able to go very deep. If it's not a deadlock or blocking, better answers here would be to capture the execution plan and see if there are stale stats. It's also possible that parameter sniffing is going on, so a stored proc may need to be "recompiled". Those are the typical problems I see the DBA's running into. I don't interview for DBA's often so maybe other people have deeper questions here.
If the interview is for a developer job however, then I would expect the answer more to make an assumption that we've already located which query is running slowly, and that it's reproducible. I'll even go ahead and state as much if needed. The things that a developer has control over are different than what the operational DBA has control over, so I would expect the developer to start looking at the code.
People will often recommend looking at the execution plan at this point, and therefore recommend it as a good answer. I'll explain a little later why I don't necessarily agree that this is the best first step. If the interviewee does happen to mention the execution plan at this point however, my followup questions would be to ask what they're looking for on the execution plan. The most common answer would be to look for table scans instead of seeks, possibly showing signs of a missing index. The answers that show me more experience working with execution plans have to do with looking for steps with the highest percentage of the whole and/or looking for thick lines.
I find a lot of query tuning efforts go astray when starting with the execution plans and solutions get hacky because the people tuning the queries don't know what they want the execution plan to look like, just that they don't like the one they have. They'll then try to focus on the seemingly worst performing step, adding indexes, query hints, etc, when it may turn out that because of some other step, the entire execution plan is flipped upside down, and they're tuning the wrong piece. If, for example, you have three tables joined together on foreign keys, and the third table is missing an index, SQL Server may decide that the next best plan is to walk the tables in the opposite direction because primary key indexes exist there. The side effect may be that it looks like the first table is the one with the problem when really it's the third table.
The way I go about tuning a query, and therefore what I prefer to hear as an answer, is to look at the code and get a feel for what the code is trying to do and how I would expect the joins to flow. I start breaking up the query into pieces starting with the first table. Keep in mind that I'm using the term "first" here loosely, to represent the table that I want SQL Server to start in. That is not necessarily the first table listed. It is however typically the smallest table, especially with the "where" applied. I will then slowly add in the additional tables one by one to see if I can find where the query turns south. It's typically a missing index, no sargability, too low of cardinality, or stale statistics. If you as the interviewee use those exact terms in context, you're going to ace this question no matter who is interviewing you.
Also, once you have an expectation of how you want the joins to flow, now is a good time to compare your expectations with the actual execution plan. This is how you can tell if a plan has flipped on you.
If I was answering the question, or tuning an actual query, I would also add that I like to get row counts on the tables and to look at the selectivity of all columns in the joins and "where" clauses. I also like to actually look at the data. Sometimes problems just aren't obvious from the code but become obvious when you see some of the data.

I can't really say which is the best answer, but I'd answer: analyze the Actual Execution Plan. That should be a basis to check for performance issues.
There is plenty of information to be found on the internet about analyzing Execution Plans. I suggest you check it out.

Use SQL Profiler. Do needed settings and run your Stored procedure and check which statement is taking more duration. execute those statements separate, get execution plan. check for missed indices, joining order (join smaller tables first.). Try to use temp tables joininig tables.

I guess I'm a coder based on Bruce's interview model, but I'm currently working with a slow query problem that led me here. We're using nHibernate as our ORM, and some poor technology I've never seen before that doesn't take Lazy Loading into account when it talks to nHibernate. As such, the slow query is slow because it's in fact a horrible query, joining every table it can (the generated query fills two pages of the screen). The same query when we re-wrote it using Linq removed all the joins.
No matter what role you're in, I think asking the question: is this the right query, needs to be the number one question, no matter the role. Even as a DBA, looking at the query you might recommend changing the query if it's a bad one. Focusing on the query plan and indexes and other optimization fine tuning should be secondary to making sure you're optimizing what you actually want. I like Bruce's answer for this focus.

For really complex reports, do people sometimes code in their language rather than in sql?

I have some pretty complex reports to write. Some of them... I'm not sure how I could write an sql query for just one of the values, let alone stuff them in a single query.
Is it common to just pull a crap load of data and figure it all via code instead? Or should I try and find a way to make all the reports rely on sql?
I have a very rich domain model. In fact, parts of code can be expanded on to calculate exactly what they want. The actual logic is not all that difficult to write - and it's nicer to work my domain model than with SQL. With SQL, writing the business logic, refactoring it, testing it and putting it version control is a royal pain because it's separate from your actual code.
For example, one the statistics they want is the % of how much they improved, especially in relation to other people in the same class, the same school, and compared to other schools. This requires some pretty detailed analysis of how they performed in the past to their latest information, as well as doing a calculation for the groups you are comparing against as a whole. I can't even imagine what the sql query would even look like.
The thing is, this % improvement is not a column in the database - it involves a big calculation in of itself by analyzing all the live data in real-time. There is no way to cache this data in a column as doing this calculation for every row it's needed every time the student does something is CRAZY.
I'm a little afraid about pulling out hundreds upon hundreds of records to get these numbers though. I may have to pull out that many just to figure out 1 value for 1 user... and if they want a report for all the users on a single screen, it's going to basically take analyzing the entire database. And that's just 1 column of values of many columns that they want on the report!
Basically, the report they want is a massive performance hog no matter what method I choose to write it.
Anyway, I'd like to ask you what kind of solutions you've used to these kind of a problems.

Sometimes a report can be generated by a single query. Sometimes some procedural code has to be written. And sometimes, even though a single query CAN be used, it's much better/faster/clearer to write a bit of procedural code.
Case in point - another developer at work wrote a report that used a single query. That query was amazing - turned a table sideways, did some amazing summation stuff - and may well have piped the output through hyperspace - truly a work of art. I couldn't have even conceived of doing something like that and learned a lot just from readying through it. It's only problem was that it took 45 minutes to run and brought the system to its knees in the process. I loved that query...but in the end...I admit it - I killed it. ((sob!)) I dismembered it with a chainsaw while humming "Highway To Hell"! I...I wrote a little procedural code to cover my tracks and...nobody noticed. I'd like to say I was sorry, but...in the end the job ran in 30 seconds. Oh, sure, it's easy enough to say "But performance matters, y'know"...but...I loved that query... ((sniffle...)) Anybody seen my chainsaw..? >;->
The point of the above is "Make Things As Simple As You Can, But No Simpler". If you find yourself with a query that covers three pages (I loved that query, but...) maybe it's trying to tell you something. A much simpler query and some procedural code may take up about the same space, page-wise, but could possibly be much easier to understand and maintain.
Share and enjoy.

Sounds like a challenging task you have ahead of you. I don't know all the details, but I think I would go at it from several directions:
Prioritize: You should try to negotiate with the "customer" and prioritize functionality. Chances are not everything is equally useful for them.
Manage expectations: If they have unrealistic expectations then tell them so in a nice way.
IMHO SQL is good in many respects, but it's not a brilliant programming language. So I'd rather just do calculations in the application rather than in the database.
I think I'd go for some delay in the system .. perhaps by caching calculated results for some minutes before recalculating. This is with a mind towards performance.

The short answer: for analysing large quantities of data, a SQL database is probably the best tool around.
However, that does not mean you should analyse this straight off your production database. I suggest you look into Datawarehousing.

For a one-off report, I'll write the code to produce it in whatever I can best reason about it in.
For a report that'll be generated more than once, I'll check on who is going to be producing it the next time. I'll still write the code in whatever I can best reason about it in, but I might add something to make it more attractive to use to that other person.

People usually use a third party report writing system rather than writing SQL. As an application developer, if you're spending a lot of time writing complex reports, I would severely question your manager's actions in NOT buying an off-the-shelf solution and letting less-skilled people build their own reports using some GUI.

How do you think while formulating Sql Queries. Is it an experience or a concept?

I have been working on sql server and front end coding and have usually faced problem formulating queries.
I do understand most of the concepts of sql that are needed in formulating queries but whenever some new functionality comes into the picture that can be dont using sql query, i do usually fails resolving them.
I am very comfortable with select queries using joins and all such things but when it comes to DML operation i usually fails
For every query that i never done before I usually finds uncomfortable with that while creating them. Whenever I goes for an interview I usually faces this problem.
Is it their some concept behind approaching on formulating sql queries.
Eg.
I need to create an sql query such that
A table contain single column having duplicate record. I need to remove duplicate records.
I know i can find the solution to this query very easily on Googling, but I want to know how everyone comes to the desired result.
Is it something like Practice Makes Man Perfect i.e. once you did it, next time you will be able to formulate or their is some logic or concept behind.
I could have get my answer of solving above problem simply by posting it on stackoverflow and i would have been with an answer within 5 to 10 minutes but I want to know the reason. How do you work on any new kind of query. Is it a major contribution of experience or some an implementation of concepts.
Whenever I learns some new thing in coding section I tries to utilize it wherever I can use it. But here scenario seems to be changed because might be i am lagging in some concepts.
EDIT
How could I test my knowledge and
concepts in Sql and related sql
queries ?

Typically, the first time you need to open a child proof bottle of pills, you have a hard time, but after that you are prepared for what it might/will entail.
So it is with programming (me thinks).
You find problems, research best practices, and beat your head against a couple of rocks, but in the process you will come to have a handy set of tools.
Also, reading what others tried/did, is a good way to avoid major obsticles.
All in all, with a lot of practice/coding, you will see patterns quicker, and learn to notice where to make use of what tool.

I have a somewhat methodical method of constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.
The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.
A table contain single column having
duplicate record. I need to remove
duplicate
I have a table (I'll call it table1)
I have a
column on table table1 (I'll call it col1)
I have
duplicates in col1 on table table1
I need to remove
duplicates.
The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.
The next step is isolation.
Asnwer the question "the action determined above should only be valid for ______..?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!
The next step is execution, or essentially the "how do I get this done" part of a request.
A lot of times you'll figure the how out during the isolation step, but in some instances (yours included) how you isolate something, and how you fix it is not the same thing.
Showing duplicated values is different than removing a specific duplicate.
The last step is implementation. This is just where I take everything and make the query...
Summing it all up... for me to construct a query I'll pick out all information that I have in the request. Using the information I'll figure out what I need to do (the action), and what I need to do it on (isolation). Once I know what I need to do with what I figure out the execution.
Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.
Kris

I think in the same way I cook dinner. I have some ingredients (tables, columns etc.), some cooking methods (SELECT, UPDATE, INSERT, GROUP BY etc.) then I put them together in the way I know how.
Sometimes I will do something weird and find it tastes horrible, or that it is amazing.
Occasionally I will pick up new recipes from the internet or friends, then use parts of these in my own.
I also save my recipes in handy repositories, broken down into reusable chunks.

On the "Delete a duplicate" example, I'd come to the result by googling it. This scenario is so rare if the DB is designed properly that I wouldn't bother keeping this information in my head. Why bother, when there is a good resource is available for me to look it up when I need it?
For other queries, it really is practice makes perfect.
Over time, you get to remember frequently used patterns just because they ARE frequently used. Rare cases should be kept in a reference material. I've simply got too much other stuff to remember.

Find a good documentation to your software. I am using Mysql a lot and Mysql has excellent documentation site with decent search function so you get many answers just by reading docs. If you do NOT get your answer at least you are learning something.
Than I set up an example database (or use the one I am working on) and gradually build my SQL. I tend to separate the problem into small pieces and solve it step by step - this is very successful if you are building queries including many JOINS - it is best to start with some particular case and "polute" your SQL with many conditions like WHEN id = "123" which you are taking out as you are working towards your solution.
The best and fastest way to learn good SQL is to work with someone else, preferably someone who knows more than you, but it is not necessarry condition. It can be replaced by studying mature code written by others.

Your example is a test of how well you understand the DISTINCT keyword and the GROUP BY clause, which are SQL's ways of dealing with duplicate data.

Examples and experience. You look at other peoples examples and you create your own code and once it groks, you don't need to think about it again.

I would have a look at the Mere Mortals book - I think it's the one by Hernandez. I remember that when I first started seriously with SQL Server 6.5, moving from manual ISAM databases and Access database systems using VB4, that it was difficult to understand the syntax, the joins and the declarative style. And the SQL queries, while powerful, were very intimidating to understand - because typically, I was looking at generated code in Microsoft Access.
However, once I had developed a relatively systematic approach to building queries in a consistent and straightforward fashion, my skills and confidence quickly moved forward.

From seeing your responses you have two options.
Have a copy of the specification for whatever your working on (SQL spec and the documentation for the SQL implementation (SQLite, SQL Server etc..)
Use Google, SO, Books, etc.. as a resource to find answers.
You can't formulate an answer to a problem without doing one of the above. The first option is to become well versed into the capabilities of whatever you are working on.
The second option allows you to find answers that you may not even fully know how to ask. You example is fairly simplistic, so if you read the spec/implementation documentaion you would know the answer right away. But there are times, where even if you read the spec/documentation you don't know the answer. You only know that it IS possible, just not how to do it.
Remember that as far as jobs and supervisors go, being able to resolve a problem is important, but the faster you can do it the better which can often be done with option 2.

Fastest way to become a MySQL expert?

I have been using MySQL for years, mainly on smaller projects until the last year or so. I'm not sure if it's the nature of the language or my lack of real tutorials that gives me the feeling of being unsure if what I'm writing is the proper way for optimization purposes and scaling purposes.
While self-taught in PHP I'm very sure of myself and the code I write, easily can compare it to others and so on.
With MySQL, I'm not sure whether (and in what cases) an INNER JOIN or LEFT JOIN should be used, nor am I aware of the large amount of functionality that it has. While I've written code for databases that handled tens of millions of records, I don't know if it's optimum. I often find that a small tweak will make a query take less than 1/10 of the original time... but how do I know that my current query isn't also slow?
I would like to become completely confident in this field in the ability to optimize databases and be scalable. Use is not a problem -- I use it on a daily basis in a number of different ways.
So, the question is, what's the path? Reading a book? Website/tutorials? Recommendations?

EXPLAIN is your friend for one. If you learn to use this tool, you should be able to optimize your queries very effectively.
Scan the the MySQL manual and read Paul DuBois' MySQL book.
Use EXPLAIN SELECT, SHOW VARIABLES, SHOW STATUS and SHOW PROCESSLIST.
Learn how the query optimizer works.
Optimize your table formats.
Maintain your tables (myisamchk, CHECK TABLE, OPTIMIZE TABLE).
Use MySQL extensions to get things done faster.
Write a MySQL UDF function if you notice that you would need some
function in many places.
Don't use GRANT on table level or column level if you don't really need
it.
http://dev.mysql.com/tech-resources/presentations/presentation-oscon2000-20000719/index.html

The only way to become an expert in something is experience and that usually takes time. And a good mentor(s) that are better than you to teach you what you are missing. The problem is you don't know what you don't know.

Research and experience - if you don't have the projects to warrant the research, make them. Make three tables with related data and make up scenarios.
E.g.
Make a table of movies their data
make a table of user
make a table of ratings for users
spend time learning how joins work, how to get movies of a particular rating range in one query, how to search the movies table ( like, regex) - as mentioned, use explain to see how different things affect speed. Make a day of it; I guarantee your
handle on it will be greatly increased.
If you're still struggling for case-scenarios, start looking here on SO for questions and try out those scenarios yourself.

I don't know if MIT open courseware has anything about databases... Well whaddya know? They do: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-830Fall-2005/CourseHome/
I would recommend that as one source based only on MITs reputation. If you can take a formal course from a university you may find that helpful. Also a good understanding of the fundamental discrete mathematics/logic certainly would do no harm.

As others have said, time and practice is the only real approach.
More practically, I found that EXPLAIN worked wonders for me personally. Learning to read the output of that was probably the biggest single leap I made in being able to write efficient queries.
The second thing I found really helpful was SQL Tuning by Dan Tow, which describes a fairly formal methodology for extracting performance. It's a bit involved, but works well in lots of situations. And if nothing else, it will give you a much better understanding of the way joins are processed.

Start with a class like this one: https://www.udemy.com/sql-mysql-databases/
Then use what you've learned to create and manage a number of SQL databases and run queries. Getting to the expert level is really about practice. But of course you need to learn the pieces before you can practice.

Storing multiple choice values in database

Say I offer user to check off languages she speaks and store it in a db. Important side note, I will not search db for any of those values, as I will have some separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey

Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...

You might not be missing anything now, but when you're requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.

This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.

Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for it's profile page?

I generally stay away at the solution you described, you asking for troubles when you store relational data in such fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.

Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the necessary performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).

If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.

You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).

Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view to this debate, look at wordpress. Tables are chocked full of delimited data, and it's a great, simple platform.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas