Quasi-automatic import from CSV to SQL

My problem is that I have to build an easily refreshable database system. To start with, I have two tables imported from CSV files, and I need to restructure them and create certain new tables. I also have to add IDs to certain properties to make searching faster (the database is going to contain loads of data and will be refreshed almost every day). It is going to contain millions of records. My first question is: how should I do the import? I need only selected rows from the original file, but the CSV layout will always stay the same.
My second question is: how do I create indexes on certain properties in the new tables (the ones generated by the queries mentioned above) to make searching faster?
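For illustration, here is a minimal sketch of one common approach to both questions, assuming SQL Server (the question doesn't name a DBMS) and placeholder file, table and column names: load the whole CSV into a staging table, copy only the rows you need into the restructured table, then index the columns you search on.

    -- Hypothetical names throughout; adjust terminators/encoding to your CSV.
    BULK INSERT stg_products
    FROM 'C:\import\products.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- Keep only the selected rows in the permanent, restructured table.
    INSERT INTO products (product_code, product_name, price)
    SELECT product_code, product_name, price
    FROM stg_products
    WHERE price IS NOT NULL;   -- whatever "selective" means for your data

    -- Index the columns you search on most often.
    CREATE INDEX ix_products_code ON products (product_code);

The same pattern (staging table, filtered INSERT ... SELECT, then CREATE INDEX) exists in most other database systems; only the bulk-load command differs.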
If you think my questions are not precise enough, please comment, and I will answer to the best of my knowledge.
Thanks in advance.

Related

Optimizing View with 54 SELECTs -- Updated with Process used

I've started a new job and am looking at some of their views that they wanted me to get familiar with.
One of the views that I'm going through has 54 SELECTs. I've NEVER seen such a massive View before, and after browsing through it I'm fairly certain that I can optimize it.
What I'm looking for is an easier way to compare the JOINs between each of the selects to find commonality WITHOUT just having to sift through them all by hand.
I'm fairly certain that THAT is what I'll have to do, but I was hoping that someone would have an easier, more efficient, and less time-consuming way to do this...
Anyone? :)
Update 12/4/2019:
TLDR; I separated each main SELECT into its own file, then, using Notepad++ on the full view code file, found common values to sort and organize each individual SELECT into a folder named after the common value. From there I used WinMerge to compare each group of SELECTs in their respective folders against each other to find differences and commonalities. I then started refactoring and minimizing the excessive code (and table calls >.< ) using CTEs, because this is Oracle and a View (see the sketch after this update).
I ended up splitting out the view's SELECTs by looking for one of the field names that is a column in the view. This meant that I'd get the beginning of each SELECT.
I then copied and pasted each one into a different .sql file. I also saved the full view into its own .sql file.
Using Notepad++, I searched for each instance of the view column name and took down the column value that is the main information needed for the view. Using that information, I grouped each SELECT into a folder based on that value.
Once I'd gotten all the SELECTs into their respective folders, I used WinMerge to compare the SELECTs to the others in the same folder. This helped me locate commonalities and differences, including some differences that made it harder to lump the data source table into a CTE with the others.
I've since been updating, refactoring, and fixing the view one folder of grouped SELECTs at a time.
The project I'm in has had me touching each one of the main code values that I had used to organize the SELECTs into folders, so I WILL end up recoding the ENTIRE view. :)
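To make the CTE idea concrete, here is a hypothetical sketch (all table and column names invented) of the kind of refactoring described: a data source that several of the view's SELECTs repeated is pulled into a single WITH clause that the unioned SELECTs share.

    CREATE OR REPLACE VIEW order_report AS
    WITH active_customers AS (        -- the piece each SELECT used to repeat
        SELECT customer_id, customer_name
        FROM customers
        WHERE status = 'ACTIVE'
    )
    SELECT ac.customer_name, o.order_date, o.total AS amount
    FROM active_customers ac
    JOIN orders o ON o.customer_id = ac.customer_id
    UNION ALL
    SELECT ac.customer_name, r.return_date, -r.amount AS amount
    FROM active_customers ac
    JOIN returns r ON r.customer_id = ac.customer_id;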
Identifying similar lines of code is a difficult problem, and as far as I know Clone Doctor is the only program that can do it for PL/SQL.
I only used it once, about 10 years ago, but I remember being surprised at how well it found different lines of code that were really duplicates at a deep level. It wasn't perfect, had some bugs, and took a while to get working. But there's a free trial version.

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output the data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1,500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient; it's mostly the width of the query. Managing 200 columns is a challenge in itself.
I deal with queries this length daily, and here is some of what helps me maintain them:
First, alias every single one of those columns. When you are building the query you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, GROUP BY and WHERE conditions as well as the SELECT columns.
Organize in easily understandable and testable chunks. I use temp tables to pull together things that belong together, so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query the individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. I'm not sure how Oracle works, but in SQL Server I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
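A minimal sketch of that pattern in SQL Server (the procedure, parameter and table names are made up):

    CREATE PROCEDURE dbo.BigReport
        @StartDate DATE,
        @EndDate   DATE,
        @TestMode  BIT = 0   -- last parameter, defaulted, so the app can ignore it
    AS
    BEGIN
        -- Chunk 1: pull the period's orders into a temp table.
        SELECT o.order_id, o.customer_id, o.order_total
        INTO #period_orders
        FROM dbo.orders AS o
        WHERE o.order_date >= @StartDate
          AND o.order_date <  @EndDate;

        -- In test mode, expose the intermediate chunk so bad data is easy to spot.
        IF @TestMode = 1
            SELECT * FROM #period_orders;

        -- Final query built on top of the tested chunk.
        SELECT c.customer_name,
               SUM(po.order_total) AS period_total
        FROM #period_orders AS po
        JOIN dbo.customers  AS c ON c.customer_id = po.customer_id
        GROUP BY c.customer_name;
    END;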
Consider logging the execution details and the values of the passed-in parameters, and certainly log any error messages. This will help tremendously when you have to troubleshoot why a report that has functioned perfectly for six years doesn't work for this one user.
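For example, a hypothetical logging table (SQL Server flavour, names invented) that the proc from the sketch above writes to at the start of each run:

    CREATE TABLE dbo.report_execution_log (
        log_id     INT IDENTITY(1,1) PRIMARY KEY,
        run_at     DATETIME2      NOT NULL DEFAULT SYSDATETIME(),
        run_by     SYSNAME        NOT NULL DEFAULT SUSER_SNAME(),
        start_date DATE           NULL,    -- the report parameters as passed in
        end_date   DATE           NULL,
        error_msg  NVARCHAR(4000) NULL     -- filled in by error handling, if any
    );

    -- Inside the proc, using its parameters:
    INSERT INTO dbo.report_execution_log (start_date, end_date)
    VALUES (@StartDate, @EndDate);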
Put each column on a separate line, and do the same for WHERE clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can also easily comment out the associated fields.
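A small illustration of the aliasing and layout advice above (Oracle-flavoured, names invented): one item per line, every column traced back to its source, and a parked join commented out together with its columns.

    SELECT
        c.customer_id    AS cust_customer_id,
        c.customer_name  AS cust_customer_name,
        o.order_date     AS ord_order_date,
        o.order_total    AS ord_order_total
        --, r.refund_total AS ref_refund_total   -- parked with its join below
    FROM customers c
    JOIN orders o
        ON o.customer_id = c.customer_id
    --LEFT JOIN refunds r
    --    ON r.order_id = o.order_id
    WHERE o.order_date >= DATE '2019-01-01';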
If you don't have a technical design document, then at least use comments to explain your thought process. You want the comments to capture the whys, not the hows. This stuff is hard to come back to and understand later, even though you wrote it. Give your future self some help.
When developing from scratch, I put the select list in and then comment out all but the first item. Then I build the query only until I get that value, testing until I am sure what I got is correct. Then I add the next item and whatever joins or where conditions I might need to get it. Test again, making sure it is right. (Oops, why did that go from 1,000 records to 20,000 when I added that? Hmm, maybe there is something I need to handle there, or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know whether you need an inner join or a left join. Know which where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require pushing back on the requirements): should you have 3 lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what are the criteria for choosing the record to keep?
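As a sketch of those one-to-many choices (Oracle syntax, invented names): either collapse the child rows into a delimited list, or keep exactly one child per parent.

    -- Option 1: one line per parent, children as a comma-delimited list.
    SELECT c.customer_id,
           LISTAGG(p.phone_number, ', ') WITHIN GROUP (ORDER BY p.phone_number) AS phones
    FROM customers c
    JOIN phones p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id;

    -- Option 2: keep only one child per parent, here the most recently updated.
    SELECT customer_id, phone_number
    FROM (
        SELECT p.customer_id, p.phone_number,
               ROW_NUMBER() OVER (PARTITION BY p.customer_id
                                  ORDER BY p.updated_date DESC) AS rn
        FROM phones p
    )
    WHERE rn = 1;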
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for manageability, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose result set is physically stored) to break up the heavier parts of your process.
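A hypothetical sketch (Oracle, invented names) of both ideas: wrap a heavy sub-query in its own view for manageability, or materialize it when the cost lies in recomputing it on every run.

    -- Plain view: no storage, just a named, reusable sub-query.
    CREATE OR REPLACE VIEW v_customer_totals AS
    SELECT customer_id, SUM(order_total) AS lifetime_total
    FROM orders
    GROUP BY customer_id;

    -- Materialized view: the result set is stored and refreshed when you choose,
    -- trading some freshness for speed.
    CREATE MATERIALIZED VIEW mv_customer_totals
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT customer_id, SUM(order_total) AS lifetime_total
    FROM orders
    GROUP BY customer_id;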
If your queries require an enormous amount of subquerying in order to produce usable data, you might need to rethink your database design and possibly create a number of data marts for easy access to reporting data. Think of these as mini-warehouses, minus the multi-year trended data.
Finally, I know you said you don't use any BI tools, but this certainly seems like a problem that might be better served by organizing your data into "cubes" or Business Objects "universes". It might be worthwhile to at least weigh the cost of bringing in a BI tool against the programming hours needed to support the current setup.

Does a column with a lot of data cause issues?

I have a table with, say, 10 columns, and 3 of them are text fields that contain some HTML. Because of these 3 fields the row size increases, pushing the size of the table to more than 4 GB.
My question is whether these fields, where we are storing large data, affect the performance of the application. These columns aren't used in joins, but they have their place in the table.
Will normalizing them improve the performance of the application?
I have to take this to senior colleagues, but before I go to them with the suggestion I just wanted to know whether someone has tried doing this, and whether or not it worked.
A properly implemented database (PostgreSQL, for instance) will only store a limited amount of data directly in the table, where it could affect performance. The remainder is stored separately, keeping only a reference, or maybe a starting fragment, directly in the table. Hence the impact on search performance may not be very big. Of course, when you retrieve the data, reading a really large column will certainly not be faster.
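If the database is PostgreSQL, a quick check (hypothetical, the table name is a placeholder) of how much of the table actually lives in the main heap versus the out-of-line storage and indexes:

    SELECT pg_size_pretty(pg_relation_size('my_table'))        AS main_heap,
           pg_size_pretty(pg_total_relation_size('my_table')
                          - pg_relation_size('my_table'))      AS toast_and_indexes;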
I was facing the same problem with one of my tables, and I solved it by creating indexes and separating out the table. Please read up on indexes and normalization; there are many ways to handle this.
Thanks.
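A hypothetical sketch of the "separate out the table" idea (names invented, and it assumes a products main table): move the three big HTML columns into a side table keyed by the same id, so scans and joins on the main table stay narrow and the HTML is read only when it is actually needed.

    CREATE TABLE product_details (
        product_id INT PRIMARY KEY REFERENCES products (product_id),
        html_body1 TEXT,
        html_body2 TEXT,
        html_body3 TEXT
    );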
You can only know if you try it out. In principle, if you have proper indexes on the tables, it should all be fine. But it depends on your particular RDBMS implementation.

Permanent table, temp tables or PHP session?

My web app offers personalized recommendations. When a user starts using it, about 1000+ rows are inserted into one big recommendation table, correlated with other tables in the database. Every item the user votes for affects all of those 1000+ rows.
Since the recommendation info is only useful during the session, and since the recommendation table is getting huge, we'd like to switch to a more appropriate method. There's the possibility of deleting the relevant rows as soon as the user's session is over. I guess a PHP session array or temp tables would be better for this case?
One temp table per session will lead to catalog pollution, so not really recommended.
Have you considered actually keeping the data, so you can periodically mine it to improve the suggestions?
First: consider redesigning your data structure, I think it is not optimal.
Store a user's recommendations in a user / recommended-item / score table: I don't see any need for a temp table or anything else.
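A minimal sketch of that table (names and types are placeholders): the composite primary key keeps per-user lookups, updates and end-of-session cleanup cheap.

    CREATE TABLE user_recommendation (
        user_id INT           NOT NULL,
        item_id INT           NOT NULL,
        score   DECIMAL(6, 3) NOT NULL,
        PRIMARY KEY (user_id, item_id)
    );

    -- Clearing a user's recommendations when the session ends is a cheap,
    -- index-driven delete.
    DELETE FROM user_recommendation WHERE user_id = 123;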
Otherwise, you could start using sessions, but you should encapsulate the code carefully, making it easy to change if/when this solution is no longer maintainable.
I suspect that the method is flawed - 1000+ recommendations per user? How many of them do they ever look at? If you don't know the answer to that question - then you need to spend some time thinking about why you don't know the answer.
Every item the user votes for affects all of those 1000+ rows
Are you sure your data is properly normalised?
But leaving that aside for the moment: the right place to generate and store that data is the database. A relational database is explicitly designed for, and a lot more efficient at, generating and maintaining tabular sets of data than a conventional programming language.

Lazy loading of fields that are part of the main record

I'm a relative newbie at NHibernate, so I'll beg forgiveness in advance if this is a stupid question. I've googled it and searched the documentation, and am getting all wrapped around the axle.
I'm maintaining/enhancing an existing application that uses NHibernate for a relatively straightforward table. The table has about 10-12 fields, and no foreign key relations. The table contains somewhere around a dozen or so rows, give or take.
Two of the fields are huge blobs (multi-megabytes). As a result, the table is taking an excessive amount of time (4 minutes) to load when working with a remote DB.
The thing is that those two fields are not needed until a user selects one of the rows and begins to work on it, and then they are only needed for the one row that he selects.
This seems like exactly what lazy loading was meant for. I just can't quite figure out how to apply it, unless I break up the existing DB schema and put those columns in their own table with a one-to-one mapping, which I don't want to do.
If it matters, the program is using NHibernate.Mapping.Attributes rather than hbm files, so I need to be able to make alterations in the attributes of the domain objects that will propagate to the hbm.
Thanks for any help.
You need lazy properties in NHibernate 3 to accomplish this. I assume, but don't know, that you can set that using attributes.