I'm working on a large project, for which several thousand branch types are defined, and would like to quickly retrieve a list of "my" branch types. This can be achieved either by listing branch types created by me, or by listing branch types whose names start with my username.
As the full list is huge and lstype runs for approximately an hour normally, is there a way to formulate a query that can be answered quickly?
I have never found a native command that returns an answer quickly.
When looking at the cleartool lstype command, the technote "LSTYPE performance improvements" does mention that:
The -short, -nostatus and -unsorted options can be used to improve performance of the cleartool lstype command
But as with everything in ClearCase, this doesn't stand the test of the real world, where the number of types (here, branch types) can be really big...
So what I usually do for this kind of request, considering I don't create a brtype every 5 minutes, is to have a batch job running every 2 hours that updates a list of brtypes with the information I need (owner, date, ...).
I can then at any time filter that file (at least the most updated version of that file) in order to extract the list of brtype I need.
There is the risk this list isn't up-to-date, but in practice this works relatively well.
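The filtering step can be sketched as follows. This is only a minimal illustration, assuming the batch job writes one branch type per line as name|owner|date; that layout, the function name and the sample data are all hypothetical, and the command that produces the file is not shown.

```python
import csv
import io


def my_brtypes(lines, username):
    """Yield branch type names created by `username` or prefixed with it.

    Assumes each cached line is "name|owner|date" (hypothetical layout).
    """
    for name, owner, _date in csv.reader(lines, delimiter="|"):
        if owner == username or name.startswith(username):
            yield name


# Stand-in for the cached file written by the batch job.
cache = io.StringIO(
    "jdoe_fix_login|jdoe|2010-03-01\n"
    "release_2.0|admin|2010-02-11\n"
    "hotfix_42|jdoe|2010-03-05\n"
)
print(list(my_brtypes(cache, "jdoe")))  # ['jdoe_fix_login', 'hotfix_42']
```

Since the filter only reads a local text file, it answers in well under a second, whatever the size of the VOB.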
In the sample installation and configuration instructions, it is seemingly suggested that OpenGrok requires two staging areas, with the rationale being, that one area is an index-regeneration-work-area, and the other is a production area, and they are rotated with every index regen.
Is that really necessary? Can I only have one area instead of two?
I'm looking for an answer that is specific to opengrok, and not a general list of race conditions one might encounter.
Strictly speaking, this is not necessary. In fact, I am pretty sure the overwhelming majority of deployments run without a staging area.
That said, you need to decide whether you are comfortable with a window of inconsistency that could result in some failed or imprecise searches. Assume that the source was updated (e.g. via git pull in the case of Git) and the indexer has not yet finished processing the new changes. The index then still reflects the old state of the source. Say the changes removed a file. If someone now runs a search that matches the contents of the removed file, the search will probably end with an error. That is probably the better outcome: consider a more subtle change to a file, such as the removal or addition of a couple of lines of code. In that case the symbol definitions will be off, so the search results will take you to the wrong line of code. Or, for a less subtle change, such as a function definition being removed from a file, the search results for references to this function will point to invalid places.
The length of the inconsistency window stems from the indexing time that is largely dependent on 2 things, at least currently:
size of the changes applied to the source
size of the source directory tree
The first is relevant because of history processing. The more incoming history changes (e.g. changesets in Git), the more work the indexer will have to do to generate history cache and/or history fields for the index (assuming history handling is on).
The second is relevant because the indexer traverses the whole source directory tree to find out which files have changed, which might incur lots of syscalls and potentially lots of I/O. That will remain the case at least until https://github.com/oracle/opengrok/issues/3077 is implemented, and even that will help only Source Code Management systems based on changesets.
I'm having a hard time with this project. I'm building a boat configurator which is divided into categories / packages / extras.
Each category is independent of the others, so that's not a big problem.
The problem comes with packages and extras. Extras are options that can be chosen within a package (increasing the total price). I'll explain all the dependencies that can exist between these 2 objects:
There are times when you can also purchase a package all together, which could have an extra (or 2) that are upgrades to that package
There are times when you can buy a single package out of 4-5 and additionally you can buy some extras in another, 6th package
Sometimes an extra can be bought only if you have at least one item in a given package
Sometimes an extra can be bought only if you have a specific extra
At the moment I don't have any other dependency in my mind (but I'm sure there are others possible).
I don't know which approach I should take to store all these dependencies. I have 3 basic SQL tables (category, package, extra, which are not connected because a Package => Extra dependency could be different for other categories) plus CategoryRelationship, PackageRelationship and ExtraRelationship, but I'm having a hard time expressing some dependencies, especially the 2nd, which is not limited to a single id field.
How are all these interdependencies normally handled?
I have never faced this problem before; thanks for any suggestions.
Edit 1:
I'm thinking about changing the approach to 1 table for each "type" of dependency. Can that be considered a good way to handle this, instead of a single table with all types of dependencies?
Because no one answered this, I'm posting the approach I finally used for this project and which I find interesting.
I created 1 table for each type of dependency (as I said in Edit 1). Each table can hold multiple references to items of any id, and this makes it possible to organize all the columns that represent dependencies quite well.
The approach is inspired by CakePHP Validation model, where each validation is a class that has a validate method which will be run.
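To illustrate the idea of one rule type per dependency, each with its own validate method, here is a minimal sketch. The class names, the fields and the shape of the selection dictionary are hypothetical and not taken from the actual project; each class would correspond to one dependency table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiresExtra:
    """An extra that can be bought only together with another specific extra."""
    extra_id: int
    required_extra_id: int

    def validate(self, selection):
        if self.extra_id not in selection["extras"]:
            return True  # the rule does not apply to this selection
        return self.required_extra_id in selection["extras"]


@dataclass(frozen=True)
class RequiresItemInPackage:
    """An extra that needs at least one item chosen from a given package."""
    extra_id: int
    package_id: int

    def validate(self, selection):
        if self.extra_id not in selection["extras"]:
            return True
        return self.package_id in selection["packages"]


def violated_rules(rules, selection):
    """Return the rules that the current selection breaks."""
    return [rule for rule in rules if not rule.validate(selection)]


rules = [
    RequiresExtra(extra_id=10, required_extra_id=11),
    RequiresItemInPackage(extra_id=12, package_id=3),
]
# "packages" holds the ids of packages from which something was chosen.
selection = {"packages": {3}, "extras": {10, 12}}
print(violated_rules(rules, selection))
```

Adding a new kind of dependency then means adding one table and one class, without touching the existing ones.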
Hopefully this will help someone else; I'll mark this as the answer if no one provides a better suggestion.
I'm working on an ASP.NET web application that uses SQL as a database back-end. One issue I have is that it sometimes takes a while to get my DBA to create or modify tables in the database, which under no circumstances am I allowed to modify on my own.
Here is what I do when I expect users to upload files with their data.
Suppose the user uploads a new record for a table called Student_Records. The user uploads a record with fname Bob and lname Smith. The record is assigned primary key 123. The user also uploads two files: attendance_record.pdf and homework_record.pdf. Let's suppose that I have a network share, \\foo\bar, where the files are saved.
One way of handling this situation would be to have a table Student_Records_Files that associates the key 123 (Bob Smith) with each uploaded file. However, since I have trouble getting tables created, I've done something different: when I save the files on the server, I call them 123_attendance_record.pdf and 123_homework_record.pdf. That way, I can easily identify which table record each file is associated with without having to create a new SQL table. I am, in essence, using the file system itself as a join table (obviously, the file system is a type of database).
In my code for retrieving the files, I scan the directory \\foo\bar and look for files that begin with each primary key number from Student_Records.
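The lookup described above can be sketched like this. It is written in Python rather than ASP.NET purely for illustration; the function name is hypothetical and a temporary directory stands in for the network share.

```python
import tempfile
from pathlib import Path


def files_for_record(directory, record_id):
    """Return names of files stored as "<record_id>_<original name>"."""
    prefix = f"{record_id}_"  # the underscore keeps 123 from matching 1234_...
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )


with tempfile.TemporaryDirectory() as share:  # stand-in for the \\foo\bar share
    for name in (
        "123_attendance_record.pdf",
        "123_homework_record.pdf",
        "1234_attendance_record.pdf",
    ):
        Path(share, name).touch()
    print(files_for_record(share, 123))
```

Matching on the key plus the separator, rather than on the bare key, matters: otherwise the files of record 1234 would also be returned for record 123.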
It seems to work very well, but is it good practice?
There is nothing wrong with using the file system to store files. It's what it is used for.
There are a few things to keep in mind though.
I would consider a better method of storing the files - perhaps a directory for each user, rather than simply appending the user id to the filename.
Ensure that the file store is resilient and backed up with the same regularity as your database. If your database is configured to give you a backup every 10 minutes, but your file store is only backed up every day (or worse, every week), then you might be in for a world of pain.
Also consider what would happen if the user uploads two documents that have the same name.
First of all, I think it's a bad practice, in general, to design your architecture based on how responsive your DBA is. Any given compromise based on this approach may or may not be a big deal, but over time it will result in a poorly designed system.
Second, making the file name this critical seems dangerous to me; there's no protection against a person or application modifying the filename without realizing its importance.
Third, one of the advantages of having a table to maintain the join between the person and the file is that you can add additional data, such as: when was the file uploaded, what is the MIME type, has the file been read by anyone through the system, is this file a newer version of a previous file, etc. etc. Metadata can be very powerful, and the filesystem offers only limited ways to store it.
There are really two questions here. One is, given that for administrative reasons you cannot get changes made to the database schema, is it acceptable to devise some workaround. To that I'd have to say yes. What else can you do? In theory, if it takes two weeks to get the DBA to make a schema change for you, then this two weeks should be added to any deadline that you are given. In practice, this almost never happens. I've often worked places where some paperwork or whatever required two weeks before I could even begin work, and then I'd be given two weeks and one day to do the project. Sometimes you just have to put it together with rubber bands and bandaids.
Two is, is it a good idea to build a naming convention into file names and use this to identify files and their relationship to other data. I've done this at times and it's generally worked for me, though I have a perhaps irrational emotional feeling that it's not a good idea.
On the plus side, (a) By building information into a file name, you make it easy for both the computer and a human being to identify file associations. (Human readable as long as the naming convention is straightforward enough, anyway.) (b) By eliminating the separate storage of a link, you eliminate the possibility of a bad link. A file with the appropriate name may not exist, of course, but a database record with appropriate keys may not exist, or the file reference in such a record may be null or invalid. So it seems to solve one problem there without creating any new problems.
Potential minuses are: (a) You may have characters in the key that are not legal in file names. You may be able to just strip such characters out, or this may cause duplicates. The only safe thing to do is to escape them in some way, which is a pain. (b) You may exceed the legal length of a file name. Not as much of an issue as it was in the bad old 8.3 days. (c) You can't share files. If a database record points to a file, then two db records could point to the same file. If you must make two copies of a file, not only does this waste disk space, but it also means that if the file is updated, you must be sure to update all copies. If in your application it would make no sense to share files, then this isn't an issue.
You have to manage the files in some way, but you had to do that anyway.
I really can't think of any overriding minuses. As I say, I've done this on occasion and didn't run into any particular problems. I'm interested in seeing others' responses.
I think it is not good practice because you are making your working application very dependent on specific implementation details and it would make it pretty hard to work with in the future to maintain, or if other people later needed access to your code/api.
Now whether you should do this or not is a whole different question. If you are really taking that much of a performance hit and it is significantly easier to work with how you have it, then I would say go ahead and break the rules. Ideally it's good to follow best practice methods, but sometimes you have to bend the rules a little to make things work.
First, why is this a table change as opposed to a data change? Once you have the tables set up you should only need to update rows in that table every time that a user adds new files. If you have to put up with this one-time, two-week delay then bite the bullet and just get it done right.
Second, instead of trying to work around the problem why don't you try to fix the problem? Why is the process of implementing table changes so slow? Are you at least able to work on a development database (in which you have control to test and try out these changes)? Even if it's your own laptop you can at least continue on with development. Work with your manager, the DBA, and whoever else you need to, in order to improve the process. Would it help to speed things up if your scripts went through a formal testing process before you handed them off to the DBA so that he doesn't need to test the scripts, etc. himself?
Third, if this is a production database then you should probably be building in this two-week delay into your development cycle. You know that it takes two weeks for the DBA to review and implement changes in production, so make sure that if you have a deadline for releasing functionality that you have enough lead time for it.
Building this kind of "data" into a filename has inherent problems as others have pointed out. You have no relational integrity guarantees and the "data" can be changed without knowledge of the rest of the application/database.
It's best to keep everything in the database.
Network file I/O is spotty at best. In addition, it's slower than the DB I/O.
If it is difficult to get small changes into the database through the DBA, you
may be dealing with:
A political control issue. Maybe he just knows DB stuff and is threatened
when he perceives others moving in on his turf. Whatever his reasons, you need
to GET WORK DONE. Period. Document all the extra time / communication / work
you need to do for each small change and take that up with the management.
If the first level of management is unwilling to see things your way,
(it does not matter what their reasons are), escalate the issue
to the next level of management. In the past, I've gotten results this way.
It was more of a political territory problem than a technical problem.
The DBA eventually gave up and gave me full access to the TEST system BUT
he also stipulated that I would need to learn his testing process,
naming convention, his DB standards and practices, his way of testing, etc.
I was game.
I would also need to fix any database problems arising from changes I introduced.
This was fair and I got to wear the DBA hat in addition to the developer hat.
I got the freedom I needed and he got one less thing to worry about.
A process issue. Maybe the DBA needs to put every small DB change you submit
through a gauntlet of testing and performance analysis. Maybe he has a highly
normalized DB schema and because he has the big picture, he needs to normalize or
denormalize your requested DB changes to fit into the existing schema.
Ask to work with him. Ask him for a full DB design diagram.
Get a good sense of his DB design philosophy. Implement your DB changes with
his DB design philosophy in mind. Show that you understand that he's trying
to keep the DB in good order (understand normalization, relational constraints,
check constraints). Give him less to worry about. He needs to trust that you
will not muck up his database.
Accumulate all the small changes into a lengthy script and submit them to the DBA.
This way, you won't have to wait for each small change to go through all of his
process / testing. In addition, you're giving him a bigger picture view of your
development planning (that is in step with his DB design philosophy) instead of
just the play by play.
I manage a research database with Ruby on Rails. The data that is entered is primarily used by scientists who prefer to have all the relevant information for a study in one single massive table for use in their statistics software of choice. I'm currently presenting it as CSV, as it's very straightforward to do and compatible with the tools people want to use.
I've written many views (the SQL kind, not the Rails HTML/ERB kind) to make the output they expect a reality. Some of these views are quite large and have a fair amount of complexity behind them. I wrote them in SQL because there are many calculations and comparisons that are more easily done with SQL. They're currently loaded into the database straight from a file named views.sql. To get the requested data, I do a select * from my_view;.
The views.sql file is getting quite large. Part of the problem is that we're still figuring out what the data we collect means, so there are a lot of changes being made to the views all the time -- and a ton of them are being created. Many of them need to be repeatable.
I've recently run into issues organizing and testing these views. Rails works great for user interface stuff and business logic, but I'm not aware of much existing structure for handling the reporting we require.
Some options I've thought of:
Should I move them into the most relevant models somehow? Several of the views interact with each other, which makes this situation more complex than just doing a single find_by_sql, so I don't know if they should only be part of the model.
Perhaps they should be treated as a "view" in the MVC sense? (That is, they could be moved into app/views/ and live alongside the HTML, perhaps as files named something like my_view.csv.sql which return CSV.)
How would you deal with a complex reporting problem like this?
UPDATE for Mladen Jablanović
It started with a couple of views for reporting purposes. My boss(es) decided they wanted more, so I started writing more. Some give a couple hundred columns of data, based on the requirements I've been given.
I have a couple thousand lines of views all shoved in a single file now. I don't like that situation, so I want to reorganize/refactor the code. I'd also like an easy way of providing CSVs -- I'm currently running queries and emailing them by hand, which could easily be automated. Finally, I would like to be able to write some tests on the output of the views, since a couple of regressions have already popped up.
I haven't worked much with SQL and views directly, so I can't help you there, but you can certainly build an ActiveRecord model on top of a view, very easily in fact. The book Enterprise Rails has a whole chapter on it (here it is at Google Books).
We are using views in our DB extensively and some of them are exposed as Rails models. You work with them as you would with tables, except that you can't update them, of course.
Also, some of the columns may be calculated from other columns (different ratios, for example), so we don't do that in the view, but in the model instead (OK, not entirely true: we construct an SQL snippet and pass it to the :select => '' option of the find call).
Presentation logic (such as date and number formatting) goes to Rails views.
I'm afraid I can't help you with more concrete advice, as the scope of the question is pretty wide.
EDIT:
Hundreds of columns doesn't sound reasonable. It sounds like an immense amount of data in one place. How do they use it at all? We have a web application where they can drill down and filter the results, narrow the timespan and time step, etc., so they never have more than 10-20 columns in the reports.
We store our views one view per SQL file. Also, you can combine it with a numerical prefix in order to ensure proper creation order (in case some of them depend on others). No migrations there, whole DB layer is app-agnostic.
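As an illustration of the numerical-prefix convention, here is a minimal sketch of how a loader script could decide the creation order. The file names and the function are hypothetical; the point is that the prefix has to be compared as a number, since a plain string sort would put 100_... before 20_....

```python
import re
from pathlib import Path


def ordered_view_files(names):
    """Sort file names like "20_visits.sql" by their leading number."""

    def key(name):
        match = re.match(r"(\d+)_", Path(name).name)
        if match is None:
            raise ValueError(f"missing numeric prefix: {name}")
        return int(match.group(1)), name

    return sorted(names, key=key)


print(ordered_view_files(["100_summary.sql", "20_visits.sql", "3_subjects.sql"]))
# ['3_subjects.sql', '20_visits.sql', '100_summary.sql']
```

Zero-padding the prefixes (003_, 020_, 100_) would make the plain alphabetical order correct as well.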
For CSV, you can either create a set of scripts that you invoke manually or via cron, or you can use FasterCSV from your Rails app and generate CSVs on HTTP request.
I'm project managing a development that pulls data from all kinds of data sources (SQL, MySQL, FileMaker, Excel) before loading it into a new database structure, with records going back 10 years. Obviously I need to clean all this before exporting, and am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and, like Tom, do it in SQL Server using DTS or SSIS, depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them, especially if you are getting this data from outside sources; you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is completed or, even better, which contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of number of records and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files consider creating functions to do specific types of cleaning (phone number formatting for instance) then you can use the same function in multiple imports.
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file, of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column; many times that means the tab-delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records because the last name changed or the address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
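Some of the checks listed above (column headers, record counts, a stray tab shifting the columns) can be sketched as a pre-import step. This is only an illustration under assumptions: the function name is hypothetical, the expected row count is taken to come from the flag file, and the file is assumed to be tab-delimited with a header line.

```python
import csv
import io


def check_import_file(text, expected_header, expected_rows, delimiter="\t"):
    """Return a list of problems found; an empty list means the file looks sane."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header != expected_header:
        problems.append(f"unexpected header: {header}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=2):  # line 1 is the header
        if len(row) != len(expected_header):
            problems.append(f"line {number}: {len(row)} columns")
    return problems


sample = "id\tfname\tlname\n1\tBob\tSmith\n2\tAnn\tLee\textra\n"
for problem in check_import_file(sample, ["id", "fname", "lname"], expected_rows=2):
    print(problem)
```

A job would fail the import (and log the problems) whenever the returned list is not empty, with a manual override for the cases where the change turns out to be intended.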
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but you're going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats)
synchronizing record structures.
orphan records
If the data is running/being updated while you're developing this process or moving data you're also going to need to capture the updates. When I've had to do this sort of thing in the past the best, not so great answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful but, I guess, any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then
Eyeball either the whole data set or a representative true random sample of the data.
Based on that, make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precedes 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind of method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
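The "write scripts to check the conjectures" step might look like the following minimal sketch. The column names, the ISO date format and the conjectures themselves are assumptions made for the example, not part of any real data set.

```python
from datetime import date

# Each conjecture is a label plus a predicate over one row (a dict of strings).
CONJECTURES = {
    "name is never empty": lambda r: r["name"] != "",
    "active is 'Y' or 'N'": lambda r: r["active"] in ("Y", "N"),
    "start date precedes end date": lambda r: (
        date.fromisoformat(r["start"]) < date.fromisoformat(r["end"])
    ),
}


def failed_conjectures(rows):
    """Map each conjecture to the indexes of the rows that contradict it."""
    failures = {}
    for index, row in enumerate(rows):
        for label, holds in CONJECTURES.items():
            if not holds(row):
                failures.setdefault(label, []).append(index)
    return failures


rows = [
    {"name": "Bob", "active": "Y", "start": "2009-01-01", "end": "2009-06-30"},
    {"name": "", "active": "y", "start": "2009-05-01", "end": "2009-02-01"},
]
print(failed_conjectures(rows))
```

Rows that contradict a conjecture are either dirty data to be cleaned or evidence that the conjecture was wrong; both are worth knowing before the upload.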
One trick I find useful here is to find a way for each type of data source to output a single column plus unique identifier at a time (in tab-delimited form, say), so that you can clean it up using text tools (sed, awk, or TextMate's grep search), and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names such as McKay, O'Leary, O'Neil, Da Silva, Von Braun; fixing date formats; trimming whitespace) and to some extent automate the process (depending on the source).
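A reusable cleaner for one exported column could be sketched as below, here for trimming whitespace and fixing date formats. It assumes lines of the form id<TAB>value and a known set of input date formats; both the function names and the format list are hypothetical.

```python
from datetime import datetime

# Input formats assumed for this example; adjust to what the sources contain.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_column(lines):
    """Yield (id, cleaned value) pairs from "id<TAB>value" lines."""
    for line in lines:
        record_id, _, raw = line.rstrip("\n").partition("\t")
        yield record_id, normalise_date(raw)


export = ["1\t 2009-03-04 \n", "2\t04/03/2009\n", "3\tnot a date\n"]
for record_id, cleaned in clean_column(export):
    print(record_id, cleaned)
```

Values that come back as None are the ones to eyeball by hand; the identifier column is what lets the cleaned values be written back to the copy of the original source.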