Are there any tricks or commands for sqlplus that allow for traversal of database records returned by a SELECT query as if they were being sent through the Linux command "less"?
I would like to select a huge number of records sorted by date and browse through them easily.
Specifically, I'm considering replacing my log files with a database. This has a lot of nice properties for searching, but I'm concerned I'll lose the ability to just look at the log for anything that looks strange.
The equivalent of "less" in SQL would be BOTTOM, as in:
select BOTTOM 100 *
from log
However, Oracle (and most other databases) do not support this feature. In SQL Server, for instance, you can instead use:
select TOP 100 *
from log
order by 1 desc
(I would recommend that you have an autoincrementing logid as the first column, so the above query always works. Otherwise, you need to sort by logid desc explicitly or some other field such as the logdatetime field.)
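Since the original question is about SQL*Plus, note that Oracle does not support TOP either; the Oracle equivalent would be FETCH FIRST (12c and later) or the classic ROWNUM subquery:
select *
from log
order by 1 desc
fetch first 100 rows only;

-- on older Oracle versions
select *
from (select * from log order by 1 desc)
where rownum <= 100;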
This will work, depending on your system, up to hundreds of thousands or millions of rows. For instance, I have a processing log that has been going since last September and now has about 90,000 rows, and SQL Server has no problem fetching the data I need from it.
So, if you are adding dozens or hundreds of rows to the log each day, you'll be fine with SQL. If you are adding tens of thousands of rows, then you might need a more sophisticated approach. In that case, I would suggest having a log history table and a current log table and periodically dumping the current table into the history.
I forgot to mention: there are incredible benefits to having the log in SQL. It gives you reporting flexibility, the ability to pretty easily see "what happened yesterday", and a good platform for summarization.
I have a table MyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
The table has more than 10 columns, and the database is on a remote hosting server.
From SSMS I execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?
First of all, retrieving all the data from a remote database is never a good idea. You are using a significant share of the bandwidth. Hopefully, the query you are using is only for debugging purposes and will never hit production.
You did not mention whether it took 5 minutes before you started receiving anything, or whether you received your data over the course of those 5 minutes at a constant rate.
In the first situation, not receiving any rows at all might indicate that a lock is being held on your table by another operation.
In the latter situation, you are constantly receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get a rough idea of the amount of data you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download the data.
Binary and XML fields (also called BLOB fields) usually take a lot of space, and you may not be able to control the amount of data stored by the user in those fields.
Try checking the size of your variable-length fields (varchar, xml and varbinary) by running DATALENGTH on your columns:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB fields is to retrieve them only when needed, and not when you are loading a list of data.
For example, assume an XML field stored in a PurchaseOrder table. If you wish to display the list of POs to your user, you usually don't need to retrieve that field, unless the user opens a specific PO.
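For instance, the two queries might look roughly like this (the column names are just for illustration):
-- list view: skip the heavy XML column
SELECT PurchaseOrderId, OrderDate, SupplierName FROM PurchaseOrder;

-- detail view: fetch the XML only for the PO the user opened
SELECT OrderXml FROM PurchaseOrder WHERE PurchaseOrderId = 42;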
Many recent ORMs, like NHibernate, offer lazy loading for columns, along with paging so you can retrieve a small number of rows.
Ayende posted a rant about loading unbounded result sets two weeks ago.
You're right - the select query shouldn't take that long. It's not the number of rows; more likely it's the type of data you've got in that table/view, and perhaps the storage configuration (slow disk, filegroup config, etc.).
Some ideas to consider to remedy this performance problem:
be specific about the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values of any columns of datatype text or varbinary will take proportionally longer, depending on the data within those fields.
consider the indexes on the table/view - do you have any?
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.
I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so try remoting into the box and see how long the query takes on that machine.
If it takes almost the same amount of time, then the problem lies either in the database design, the underlying hardware, or other factors (table datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc.).
If it takes significantly less time, then the problem is surely between your machine and the box - ideally this shouldn't be a big problem, because the web server will be closer to the db server, probably on the same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'SELECT *' in the actual app to pick up 29,000 rows, so it shouldn't cause much of a problem.
Assume a table named 'log' with a huge number of records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the sheer number of records makes it take a long time to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance. You could try changing your SELECT statement to only include the columns you are interested in; however, it won't change the number of logical reads you are doing, so I suspect it will have only a negligible effect on performance.
If you are only retrieving small numbers of records then an index on LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data you are returning is genuinely large then there is only a certain amount that you will be able to do, so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
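As an illustration, a paged query in SQL Server could look something like the following (the message column and the date range are placeholders):
-- rows 501-1000 of the filtered log, newest first
SELECT logLevel, creationData, message
FROM (
    SELECT logLevel, creationData, message,
           ROW_NUMBER() OVER (ORDER BY creationData DESC) AS rn
    FROM log
    WHERE logLevel = 2 AND creationData BETWEEN '2010-01-01' AND '2010-02-01'
) AS numbered
WHERE rn BETWEEN 501 AND 1000;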
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what, however - that isn't my area of expertise.)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive that to some backup and then remove the unused records; a sketch of such a job follows. That will keep the table size down, reducing the amount of time it takes to search the table.
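A minimal sketch of that archival job, assuming SQL Server syntax and an already-created log_archive table with the same structure as log:
-- move entries older than 7 days into the archive, then remove them from the live table
INSERT INTO log_archive
SELECT * FROM log WHERE creationData < DATEADD(DAY, -7, GETDATE());

DELETE FROM log WHERE creationData < DATEADD(DAY, -7, GETDATE());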
Depending on what kind of SQL database you're using, you might look into Horizontal Partitioning. Oftentimes, this can be done entirely on the database side of things, so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives in your application (populate a data set / read it sequentially / something else?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 that the table has.
Get more RAM; the more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.
For me there are two things that you can do:
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of the logs and logs_temp tables is identical. The application flow would be as follows: all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records is far smaller, since the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts will be fast too, because the amount of data in the logs table is small - it only holds the last hour's data - so index maintenance on inserts takes much less time than it would against a very large data set, hence making the inserts fast.
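A rough sketch of what that hourly cron job might execute, with hypothetical column names (created_at, log_level) and an hourly count as the aggregate:
-- a. copy the last hour's rows out of the hot table and clear it
-- (the Shadow Table trick would do this with an atomic RENAME TABLE instead)
INSERT INTO logs_temp SELECT * FROM logs;
DELETE FROM logs;

-- b + c. aggregate the hour and store the result in the summary table
INSERT INTO logs_summary (log_hour, log_level, hit_count)
SELECT DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS log_hour, log_level, COUNT(*)
FROM logs_temp
GROUP BY log_hour, log_level;

-- d. move the detail rows into the archive and empty the temp table
INSERT INTO logs_archive SELECT * FROM logs_temp;
TRUNCATE TABLE logs_temp;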
You can read more about the Shadow Table trick here
I employed the pre-aggregation method on a news website built on WordPress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items; the site gets around 100K hits per day, and this pre-aggregation approach has really helped us a lot. The query time came down from more than 2 seconds to under a second. I intend on making the plugin publicly available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index on both values; the order you put them in will affect performance, but assuming you have a small number of possible logLevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have separate indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will generally use only one of them.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just the date seems a more likely scenario than filtering on just the log level.)
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least a few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
  l_ownname          VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
  l_tabname          VARCHAR2(255) := 'log';   -- Table to analyze
  l_estimate_percent NUMBER(3)     := 5;       -- Percentage of rows to estimate (NULL means compute)
BEGIN
  dbms_stats.gather_table_stats (
    ownname          => l_ownname,
    tabname          => l_tabname,
    estimate_percent => l_estimate_percent,
    method_opt       => 'FOR ALL INDEXED COLUMNS',
    cascade          => TRUE
  );
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle
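As an illustration, a range-partitioned version of the log table might be declared something like this (the message column and the partition boundaries are just examples):
CREATE TABLE log (
    logLevel     NUMBER,
    creationDate DATE,
    message      VARCHAR2(4000)
)
PARTITION BY RANGE (creationDate) (
    PARTITION log_2010_h1 VALUES LESS THAN (TO_DATE('2010-07-01', 'YYYY-MM-DD')),
    PARTITION log_2010_h2 VALUES LESS THAN (TO_DATE('2011-01-01', 'YYYY-MM-DD')),
    PARTITION log_future  VALUES LESS THAN (MAXVALUE)
);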
Need to query a database for 12 million rows, process this data and then insert the filtered data into another database.
I can't just do a SELECT * from the database for obvious reasons - far too much data would be returned for my program to handle, and also this is a live database (customer order details) and I can't have the database crawl to a halt for 10 minutes while it runs my query.
I'm looking for inspiration on how to write this program. I have to process each row. I was thinking it might be best to get a count of the rows, then grab X at a time, wait for Y seconds, and repeat until the dataset is complete. This way I'm not overloading the database, and since X will be sufficiently small, it will run nicely in memory.
Other suggestions or feedback ?
I'd recommend you read the docs on SELECT ... INTO OUTFILE and LOAD DATA INFILE.
These are very fast ways of dumping data to a flat file and then importing it to another database.
You could dump into the flat file, and then run an offline script to process your rows, and then once that's done import the result to the new database.
See also:
http://dev.mysql.com/doc/refman/5.1/en/select.html (search for "INTO OUTFILE")
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
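A minimal sketch of that flow, assuming MySQL and hypothetical table and file names:
-- dump the source rows to a flat file on the database server
SELECT * INTO OUTFILE '/tmp/orders_dump.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
FROM customer_orders;

-- after offline processing, bulk-load the filtered file into the target table
LOAD DATA INFILE '/tmp/orders_filtered.csv'
INTO TABLE filtered_orders
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';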
Spreading the load over time seems the only practicable solution. Exactly how to do it depends to some extent on your schema, how records change over time in the "live database", and what consistency semantics your processing must have.
In the worst case -- any record can be changed at any time, there is nothing in the schema that lets you easily and speedily check for "recently modified, inserted, or deleted records", and you nevertheless need to be consistent in what you process -- the task is simply unfeasible, unless you can count on some special support from your relational engine and/or OS (such as volume or filesystem "snapshots", like in Linux's LVM, that let you cheaply and speedily "freeze in time" a copy of the volumes on which the DB resides, for later leisurely fetching with another, read-only, database configured to read from the snapshot volume).
But presumably you do have some constraints, something in the schema that helps with the issue, or else, one can hope, you can afford some inconsistency generated by changes in the DB happening at the same time as your processing -- some lines processed twice, some not processed, some processed in older versions and others in newer versions... unfortunately, you have told us next to nothing about any of these issues, making it essentially unfeasible to offer much more help. If you edit your question to provide a LOT more information on platform, schema, and DB usage patterns, maybe more help can be offered.
A flat file or a snapshot are both ideal.
If a flat file does not suit, or you do not have access to snapshots, then you could use a sequential id field, or create a sequential id in a temp table, and then iterate using that.
Something like
-- 'source_table' and 'seq_id' stand in for your table and its sequential id column
declare @max_id bigint = 0
declare @batch_size int = 1000
while exists (select 1 from source_table where seq_id > @max_id)
begin
    -- grab the next batch in seq_id order
    select top (@batch_size) * from source_table where seq_id > @max_id order by seq_id
    -- ... process ...
    -- advance past the highest seq_id in the batch just processed
    select @max_id = max(seq_id) from (select top (@batch_size) seq_id from source_table where seq_id > @max_id order by seq_id) as batch
end
If there is no sequential id then you can create a temp table that holds the ordering, like
insert into some_temp_table
select unique_id from source_table order by your_ordering_scheme
then process like this
... do something with: select top (n) * from source_table join some_temp_table on unique_id ...
delete top (n) from some_temp_table
This way some_temp_table holds the record identifiers that still need to be processed.
You don't mention which db you are using, but I doubt any db that can hold 12 million rows would actually try to return all the data to your program at once. Your program essentially streams the data in small blocks (say 1000 rows), something that is usually handled by the database driver.
RDBMSs have different transaction isolation levels, which can be used to reduce the effort the database spends maintaining consistency guarantees and to avoid locking up the table.
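For example, if occasional dirty reads are acceptable for this kind of extract, something like the following (SQL Server syntax, hypothetical table name) keeps the read from fighting with writers over locks:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM customer_orders;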
Databases can also create snapshots of tables to a file for later analysis.
In your position, I would try the simplest thing first, and see how that scales (on a development copy of the db with simulated user access.)
Is it possible with a MySQL script, full of just MySQL commands that get piped into the mysql binary, to do a count of the current records and insert it into a stats table, perhaps with the time and date automatically generated?
I would want to do this so calculations could be done, e.g. working out the total number of new records inserted in a given time period.
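For reference, the statement being described could be as simple as this (table and column names are made up):
-- record the current row count together with a timestamp
INSERT INTO stats (snapshot_time, row_count)
SELECT NOW(), COUNT(*) FROM my_table;
The number of new records in a given period is then just the difference between two snapshots.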
If you are interested in benchmarking your insert statements, you might be able to get what you want by looking at the general query log file. It should show you the date and time of each query executed upon the database. If that isn't sufficient, you might also try looking at the binary log file. That might contain information about how many rows were affected by each query.
Is there a way to find out the number of rows inserted/deleted in a table in MySQL? Is this kind of statistics kept somewhere in the database? If not, what would be the best way to implement something to keep track of these statistics?
When I say how many, I mean within a certain period (last 24 hours, or since server was up, or last week etc)
When I need to keep track of deleted things, I just don't delete.
I change a column value that excludes it from normal user results.
If space is an issue, you can set the contents you no longer care about to empty.
For inserted rows, you can use COUNT().
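A sketch of that approach, with made-up column names:
-- "delete" by flagging the row instead of removing it
UPDATE my_table SET is_deleted = 1, deleted_at = NOW() WHERE id = 42;

-- rows inserted / "deleted" in the last 24 hours
SELECT COUNT(*) FROM my_table WHERE created_at >= NOW() - INTERVAL 1 DAY;
SELECT COUNT(*) FROM my_table WHERE is_deleted = 1 AND deleted_at >= NOW() - INTERVAL 1 DAY;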
The Binary Log contains records of all queries that update or insert data. I don't know if it stores the number of affected rows, however.
There is also a General Query Log, which tracks all queries that were run.
(Information current for MySQL 5.0. If you're using an older version ymmv)
If I want to handle logging my SQL queries, I have 2 possibilities:
Turning the MySQL Log function on
Writing my own 'trace' class
I prefer doing number 2.
Why?
Because it is more controllable. You can easily distinguish between INSERT, DELETE, UPDATE and other queries.
But that is not the only advantage of your own trace class: creating trace files (so-called "logs") makes administrative tasks much easier.
You can structure the trace output, put it into a separate database, or store it in an XML or JSON file.
You can order things as you want them to be.
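If you go the separate-database route, the trace storage itself can be very simple; one possible layout (names are just an example):
CREATE TABLE query_trace (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    logged_at  DATETIME NOT NULL,
    query_type VARCHAR(10) NOT NULL, -- 'INSERT', 'UPDATE', 'DELETE', ...
    query_text TEXT NOT NULL
);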