I have problem with perforce.
I like perforce's time-lapse view function very much.
It helps me to find who did mistake.
The problem is when some file is pretty big and frequently changed,
opening time-lapse view takes very long time.
So, I need some function like SQL (select * from time-lapse data top 100), that means I just need lasted 100(or 50? 20?) changed history to find what changed recently.
Do perforce have that function? or is there any kinds of plug-in or perforce's commmand?
Or I want to hear your know-how how to find changed history faster.
Thanks in advance.
I love time lapse view, but I often start with the "File History" view. Since, as you point out, the most interesting changes are the recent ones, I generally look through the recent changes and their descriptions first. Often, I see a change that looks particularly interesting and I study that changelist by itself and see what I'm interested in.
Regarding the speed of time lapse view, I wonder whether the problem is on your server or on your client. A couple things to try:
Is time lapse view also slow when you try it on a colleague's workstation?
If you run 'p4 annotate >tmp', is that also slow?
If 'p4 annotate' is fast, you might find it worth using for those particularly large files with very long histories. Time lapse view is very powerful and easy to read, but it collects a vast amount of information from the server, and then must format that information for display.
In my case, when I bring up time lapse view, I'm generally planning to study the results for some time, so I'm willing to wait a few seconds while it loads.
If the problem is that your server is overloaded, you should contact your Perforce administrator and see what he can do. Perhaps he can add more resources (typically memory) to your server, or perhaps you should consider deploying a read-only replica, which can service operations like time lapse view entirely from the replica without requiring any cycles from the main server. Perforce technical support is always happy to help with problems like these.
Related
I'm working on an ASP.net web application that uses SQL as a database back-end. One issue that I have is that it sometimes takes a while to get my DBA to create or modify tables in the database which under no circumstance am I allowed to modify on my own.
Here is something that I do is when I expect users to upload files with their data.
Suppose the user uploads a new record for a table called Student_Records. The user uploads a record with fname Bob and lname Smith. The record is assigned primary key 123 The user also uploads two files: attendance_record.pdf and homework_record.pdf. Let's suppose that I have a network share: \\foo\bar where the files are saved.
One way of handling this situtation would be to have a table Student_Records_Files that associates the key 123 with Bob Smith. However, since I have trouble getting tables created, I've gone and done something different: When I save the files on the server, I call them 123_attendance_record.pdf and 123_homework_record.pdf. That way, I can easily identify what table record each file is associated with without having to create a new SQL table. I am, in essence, using the file system itself as a join table (Obviously, the file system is a type of database).
In my code for retrieving the files, I scan the directory \\foo\bar and look for files that begin with each primary key number from Student_Records.
It seems to work very well, but is it good practice?
There is nothing wrong with using the file system to store files. It's what it is used for.
There are a few things to keep in mind though.
I would consider a better method of storing the files - perhaps a directory for each user, rather than simply appending the user id to the filename.
Ensure that the file store is resilient and backed up with the same regularity as your database. If your database is configured to give you a backup every 10 minutes, but your file store only does a backup every day (or worse week) then you might be in for a world of pain.
Also consider what would happen if the user uploads two documents that are the same name.
First of all, I think it's a bad practice, in general, to design your architecture based on how responsive your DBA is. Any given compromise based on this approach may or may not be a big deal, but over time it will result in a poorly designed system.
Second, making the file name this critical seems dangerous to me; there's no protection against a person or application modifying the filename without realizing its importance.
Third, one of the advantages of having a table to maintain the join between the person and the file is that you can add additional data, such as: when was the file uploaded, what is the MIME type, has the file been read by anyone through the system, is this file a newer version of a previous file, etc. etc. Metadata can be very powerful, and the filesystem offers only limited ways to store it.
There are really two questions here. One is, given that for administrative reasons you cannot get changes made to the database schema, is it acceptable to devise some workaround. To that I'd have to say yes. What else can you do? In theory, if it takes two weeks to get the DBA to make a schema change for you, then this two weeks should be added to any deadline that you are given. In practice, this almost never happens. I've often worked places where some paperwork or whatever required two weeks before I could even begin work, and then I'd be given two weeks and one day to do the project. Sometimes you just have to put it together with rubber bands and bandaids.
Two is, is it a good idea to build a naming convention into file names and use this to identify files and their relationship to other data. I've done this at times and it's generally worked for me, though I have a perhaps irrational emotional feeling that it's not a good idea.
On the plus side, (a) By building information into a file name, you make it easy for both the computer and a human being to identify file associations. (Human readable as long as the naming convention is straightforward enough, anyway.) (b) By eliminating the separate storage of a link, you eliminate the possibility of a bad link. A file with the appropriate name may not exist, of course, but a database record with appropriate keys may not exist, or the file reference in such a record may be null or invalid. So it seems to solve one problem there without creating any new problems.
Potential minuses are: (a) You may have characters in the key that are not legal in file names. You may be able to just strip such characters out, or this may cause duplicates. The only safe thing to do is to escape them in some way, which is a pain. (b) You may exceed the legal length of a file name. Not as much of an issue as it was in the bad old 8.3 days. (c) You can't share files. If a database record points to a file, then two db records could point to the same file. If you must make two copies of a file, not only does this waste disk space, but it also means that if the file is updated, you must be sure to update all copies. If in your application it would make no sense to share files, than this isn't an issue.
You have to manage the files in some way, but you had to do that anyway.
I really can't think of any over-riding minuses. As I say, I've done this on occassion and didn't run into any particular problems. I'm interested in seeing others' responses.
I think it is not good practice because you are making your working application very dependent on specific implementation details and it would make it pretty hard to work with in the future to maintain, or if other people later needed access to your code/api.
Now weather you should do this or not is a whole different question. If you are really taking that much of a performance hit and it is significantly easier to work with how you have it, then I would say go ahead and break the rules. Ideally its good to follow best practice methods, but sometimes you have to bend the rules a little to make things work.
First, why is this a table change as opposed to a data change? Once you have the tables set up you should only need to update rows in that table every time that a user adds new files. If you have to put up with this one-time, two-week delay then bite the bullet and just get it done right.
Second, instead of trying to work around the problem why don't you try to fix the problem? Why is the process of implementing table changes so slow? Are you at least able to work on a development database (in which you have control to test and try out these changes)? Even if it's your own laptop you can at least continue on with development. Work with your manager, the DBA, and whoever else you need to, in order to improve the process. Would it help to speed things up if your scripts went through a formal testing process before you handed them off to the DBA so that he doesn't need to test the scripts, etc. himself?
Third, if this is a production database then you should probably be building in this two-week delay into your development cycle. You know that it takes two weeks for the DBA to review and implement changes in production, so make sure that if you have a deadline for releasing functionality that you have enough lead time for it.
Building this kind of "data" into a filename has inherent problems as others have pointed out. You have no relational integrity guarantees and the "data" can be changed without knowledge of the rest of the application/database.
It's best to keep everything in the database.
Network file I/O is spotty at best. In addition, its slower than the DB I/O.
If the DBA is difficult in getting small changes into the database, you
may be dealing with:
A political control issue. Maybe he just knows DB stuff and is threatened
when he perceives others moving in on his turf. Whatever his reasons, you need
to GET WORK DONE. Period. Document all the extra time / communication / work
you need to do for each small change and take that up with the management.
If the first level of management is unwilling to see things your way,
(it does not matter what their reasons are), escalate the issue
to the next level of management. In the past, I've gotten results this way.
It was more of a political territory problem than a technical problem.
The DBA eventually gave up and gave me full access to the TEST system BUT
he also stipulated that I would need to learn his testing process,
naming convention, his DB standards and practices, his way of testing, etc.
I was game.
I would also need to fix any database problems arising from changes I introduced.
This was fair and I got to wear the DBA hat in addition to the developer hat.
I got the freedom I needed and he got one less thing to worry about.
A process issue. Maybe the DBA needs to put every small DB change you submit
through a gauntlet of testing and performance analysis. Maybe he has a highly
normalized DB schema and because he has the big picture, he needs to normalize or
denormalize your requested DB changes to fit into the existing schema.
Ask to work with him. Ask him for a full DB design diagram.
Get a good sense of his DB design philosophy. Implement your DB changes with
his DB design philosophy in mind. Show that you understand that he's trying
to keep the DB in good order (understand normalization, relational constraints,
check constraints) Give him less to worry about. He needs to trust that you
will not muck up his database.
Accumulate all the small changes into a lengthy script and submit them to the DBA.
This way, you won't have to wait for each small change to go through all of his
process / testing. In addition, you're giving him a bigger picture view of your
development planning (that is in step with his DB design philosophy) instead of
just the play by play.
In our report generation application, there's some pretty hefty queries that take a considerable amount of time to run. User feedback up until this point has been basically zip while the server chugs away at their request. I noticed that there's a tab on the ADA Management Utility that shows progress on the query both as percent complete and estimated seconds remaining. I tried digging through the tables to see if I could find any of this information exposed, as well as picking through the limited documentation available for ADBS and couldn't find anything useful.
Does anyone know if there's a way I can cull this information outside ADA to provide some needed user feedback?
ADA is getting that information from the sp_GetSQLStatements system procedure.
However, the traditional way of providing progress information for any operation is through a callback function.
This isn't an answer to the question but might be useful in helping reduce the time it takes to run the queries in the report. You may have already done this and made it as optimized as it gets. But if not, you might look at the query plan within Advantage Data Architect to check for optimization issues. In the query window where you run a query, you can choose Show Plan from the SQL menu (or click the button in the toolbar). This will display the execution plan with optimization information that might help identify missing indexes.
Another tool that might be helpful in identifying unoptimized queries is through query logging. It is also discussed here.
I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links (obviously there is a little bit more stored than just an URL, depth value, the domain the link belongs to, etc ...) when resuming the spider and so far everything works well.
Right now, I've just been using a mysql table to handle those storage action, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean, using a very simple and light system, while still being able to handle a great amount of data written in short times
For now, it should be able to handle the crawling for a few dozen of domains, which means storing a few thousand links a second ...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way to save your application state may be to use pickle to serialize your application state to disk.
I'd like to gather metrics on specific routines of my code to see where I can best optimize. Let's take a simple example and say that I have a "Class" database with multiple "Students." Let's say the current code calls the database for every student instead of grabbing them all at once in a batch. I'd like to see how long each trip to the database takes for every student row.
This is in C#, but I think it applies everywhere. Usually when I get curious as to a specific routine's performance, I'll create a DateTime object before it runs, run the routine, and then create another DateTime object after the call and take the milliseconds difference between the two to see how long it runs. Usually I just output this in the page's trace...so it's a bit lo-fi. Any best practices for this? I thought about being able to put the web app into some "diagnostic" mode and doing verbose logging/event log writing with whatever I'm after, but I wanted to see if the stackoverflow hive mind has a better idea.
For database queries, you have a two small problems. Cache: data cache and statement cache.
If you run the query once, the statement is parsed, prepared, bound and executed. Data is fetched from files into cache.
When you execute the query a second time, the cache is used, and performance is often much, much better.
Which is the "real" performance number? First one or second one? Some folks say "worst case" is the real number, and we have to optimize that. Others say "typical case" and run the query twice, ignoring the first one. Others says "average" and run in 30 times, averaging them all. Other say "typical average", run the 31 times and average the last 30.
I suggest that the "last 30 of 31" is the most meaningful DB performance number. Don't sweat the things you can't control (parse, prepare, bind) times. Sweat the stuff you can control -- data structures, I/O loading, indexes, etc.
I use this method on occasion and find it to be fairly accurate. The problem is that in large applications with a fairly hefty amount of debugging logs, it can be a pain to search through the logs for this information. So I use external tools (I program in Java primarily, and use JProbe) which allow me to see average and total times for my methods, how much time is spent exclusively by a particular method (as opposed to the cumulative time spent by the method and any method it calls), as well as memory and resource allocations.
These tools can assist you in measuring the performance of entire applications, and if you are doing a significant amount of development in an area where performance is important, you may want to research the tools available and learn how to use one.
Some times approach you take will give you a best look at you application performance.
One things I can recommend is to use System.Diagnostics.Stopwatch instead of DateTime ,
DateTime is accurate only up to 16 ms where Stopwatch is accurate up to the cpu tick.
But I recommend to complement it with custom performance counters for production and running the app under profiler during development.
There are some Profilers available but, frankly, I think your approach is better. The profiler approach is overkill. Maybe the use of profilers is worth the trouble if you absolutely have no clue where the bottleneck is. I would rather spend a little time analyzing the problem up front and putting a few strategic print statements than figure out how to instrument your app for profiling then pour over gargantuan reports where every executable line of code is timed.
If you're working with .NET, then I'd recommend checking out the Stopwatch class. The times you get back from that are going to be much more accurate than an equivalent sample using DateTime.
I'd also recommend checking out ANTS Profiler for scenarios in which performance is exceptionally important.
It is worth considering investing in a good commercial profiler, particularly if you ever expect to have to do this a second time.
The one I use, JProfiler, works in the Java world and can attach to an already-running application, so no special instrumentation is required (at least with the more recent JVMs).
It very rapidly builds a sorted list of hotspots in your code, showing which methods your code is spending most of its time inside. It filters pretty intelligently by default, and allows you to tune the filtering further if required, meaning that you can ignore the detail of third party libraries, while picking out those of your methods which are taking all the time.
In addition, you get lots of other useful reports on what your code is doing. It paid for the cost of the licence in the time I saved the first time I used it; I didn't have to add in lots of logging statements and construct a mechanism to anayse the output: the developers of the profiler had already done all of that for me.
I'm not associated with ej-technologies in any way other than being a very happy customer.
I use this method and I think it's very accurate.
I think you have a good approach. I recommend that you produce "machine friendly" records in the log file(s) so that you can parse them more easily. Something like CSV or other-delimited records that are consistently structured.
I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the amount of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external - more efficient - datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit-counters) and doesn't need ACID-semantics you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is more tricky - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model or changing the fundamental approach the application takes when it comes to working with the database. See for example the Pagination done the Postgres way talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine which has already been mentioned earlier in this discussion, it has since been replaced by pgbadger which is written in a more friendly language, is much faster and more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query-logging enabled, which can cause a noticeable performance hit on the database or bring you into disk space troubles on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insights on what is going in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database - pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality - tracking how often a given "normalized query" (Query string minus all expression literals) has been run and how long it took in total. Due to the fact that this is done while the query is actually run this is done in a very efficient manner, the measurable overhead was less than 5% in synthetic benchmarks.
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There's been work on a third extension trying to address this fact and offer nicer representation of the data called pg_statsinfo (along with pg_stats_reporter), but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focussed around pg_stat_statements and pg_stat_plans and augments the information collected by lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres Monitoring area i also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema more simple. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's only one of two programs I'm willing to use VMWare on my Mac to use.
Munin is quite simple yet effective to get trends of how the database is evolving and performing over time. In the standard kit of Munin you can among other thing monitor the size of the database, number of locks, number of connections, sequential scans, size of transaction log and long running queries.
Easy to setup and to get started with and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.