Is it possible to generate a report from Azure MS SQL Server which shows when the records in a table were last read?
We have a table which we would like to begin cleaning records out of and it would be useful to know which data it contains that is no longer used by the client application. Unfortunately, it does not contain a datetime field which shows when the records were last accessed.
It is not a feature in SQL Server. The reason is that it would make the database a lot slower if we turned every read into a write: since we have to log everything, we'd generate tons of log write traffic. There is a feature called Temporal Tables which doesn't do quite what you ask, but it does keep start/end dates for rows. You could track when you no longer want to see a row, at which point it moves into the history table, and then remove rows from the history table after some period of non-use. The retention feature can be seen here, and you can read a conceptual overview of temporal tables here.
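For illustration only, here is a minimal sketch of a system-versioned temporal table with a history retention policy as supported in Azure SQL Database; the table and column names are assumptions, not your schema:

-- Sketch: system-versioned table whose history is trimmed automatically after 6 months
CREATE TABLE dbo.Customer
(
    CustomerId INT NOT NULL PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH
(
    SYSTEM_VERSIONING = ON
    (
        HISTORY_TABLE = dbo.CustomerHistory,
        HISTORY_RETENTION_PERIOD = 6 MONTHS
    )
);

The ValidFrom/ValidTo columns record when each row version was current, which is what lets you reason about rows that haven't changed (or been wanted) for a while.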
Related
I am just curious as to how Tableau talks to a large data source. For example, if I have a data source with 1.4 million records and I make a simple table with this data, maybe a graph, etc., how does Tableau get this data? Does it query the data source, ask the data source how much data it has, pull in the first 10,000 rows, then go back and retrieve the next 10k, and so on? Or does it do it in one go? Also, I want to know where Tableau stores the data it receives.
Hope my question makes sense - just trying to understand the underlying mechanisms.
Thank you!
Tableau can work with external data sources in more than one way. You can extract the entire DB content to a local file (called an extract) or you can have a live connection to the database.
If the connection is live, then Tableau sends the DB queries designed to return the data you want, not the entire content of the DB. So if you have 1.4m records containing, say, a full year's sales information and you want monthly totals, Tableau will send a query asking the DB to return the monthly totals. This will result in just 12 numbers being returned to Tableau: the DB itself will do the work, and Tableau doesn't need to pull 1.4m numbers and add them up.

This is how most data sources work: the user requests a result (using SQL queries) and the DB works out how to return that result. This means you don't need to copy the entire database every time you want to add some numbers up.
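As a purely illustrative sketch, the monthly-totals request above might be pushed down as an aggregate query along these lines; the table and column names are assumptions and the exact SQL dialect depends on the source:

-- Hypothetical query a live connection might issue for monthly totals
SELECT DATEPART(month, order_date) AS order_month,
       SUM(sale_amount)            AS monthly_total
FROM sales
GROUP BY DATEPART(month, order_date)
ORDER BY order_month;

Only the 12 aggregated rows cross the wire; the 1.4m detail rows stay in the database.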
Live queries won't sample the database: the answers you get will usually be the correct totals (though some sources like Google's BigQuery will use sampling for some statistical aggregates unless told otherwise).
Both Tableau and many databases will cache the results of recently executed queries, so repeated requests come back faster. Tableau's cached results are held locally.
I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work once we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query).

If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
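As a hedged illustration of the decorator idea (legacy SQL, and only if I recall the syntax correctly; the table name and offset are placeholders), a relative range decorator can restrict a query to rows added recently:

-- Sketch: count rows added in roughly the last 7 days (604,800,000 ms)
SELECT COUNT(*)
FROM [mydataset.mytable@-604800000-]
-- the trailing dash means "from 7 days ago until now"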
I want to get the number of users who have used a particular table (or all the tables) in any of the DML scripts in Teradata.
You will need to have enabled Query Logging with OBJECTS to capture this information in the data dictionary (DBC). Typically this data is moved from DBC to a set of historical tables elsewhere on the system for analysis and audit purposes. Check with your DBA team for how they are managing DBQL within your environment.
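Once object logging is enabled (for example with BEGIN QUERY LOGGING WITH OBJECTS ON ALL;), a query along the following lines could count the distinct users that have touched a given table. This is only a sketch: the database and table names are placeholders, and the exact DBQL table/view and column names can vary by release and by how your DBAs archive DBQL data.

-- Sketch: distinct users whose logged queries referenced MyDatabase.MyTable
SELECT COUNT(DISTINCT q.UserName) AS distinct_users
FROM DBC.DBQLogTbl q
JOIN DBC.DBQLObjTbl o
  ON o.ProcID  = q.ProcID
 AND o.QueryID = q.QueryID
WHERE o.ObjectDatabaseName = 'MyDatabase'
  AND o.ObjectTableName    = 'MyTable'
  AND o.ObjectType         = 'Tab';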
I am using SQL Server 2000. I need to get only the updated records from a remote server and insert them into my local server on a daily basis. But that table does not have a created-date or modified-date field.
Use Transactional Replication.
Update
If you cannot do administrative operations on the source then you're going to have to read all the data every day. Since you cannot detect changes (and keep in mind that even if you had a timestamp you still wouldn't be able to detect changes, because there is no way to detect deletes with a timestamp), you have to read every row every time you sync. And if you read every row, then the simplest solution is to just replace all the data you have with the new snapshot.
You need one of the following:
a column in the table which flags new or updated records in one fashion or another (a lastupdate_timestamp, an incremental update counter...)
some trigger on INSERT and UPDATE on the table which produces a side effect, such as adding the corresponding row id into a separate table (a minimal sketch follows below)
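A minimal sketch of that trigger idea, assuming SQL Server syntax and placeholder table/column names (a SourceTable with an Id key):

-- Side table that records which rows were touched, and when
CREATE TABLE dbo.SourceTable_Changes
(
    Id        INT      NOT NULL,
    ChangedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO
CREATE TRIGGER trg_SourceTable_Track
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- "inserted" holds both newly inserted rows and the new versions of updated rows
    INSERT INTO dbo.SourceTable_Changes (Id)
    SELECT Id FROM inserted;
END;

The daily sync job can then pull only rows whose ids appear in SourceTable_Changes and clear that table afterwards.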
You can also compare row-by-row the data from the remote server against that of the production server to get the list of new or updated rows... Such a differential update can also be produced by comparing some hash value, one per row, computed from the values of all columns for the row.
Barring one of the above, and barring some MS-SQL built-in replication setup, the only other possibility I can think of is [not pretty]:
parsing the SQL log to identify updates and additions to the table. This requires specialized software; I'm not even sure if the log file format is published/documented, though I have seen these types of tools. Frankly, this approach is more one for forensic-type situations...
If you can't change the remote server's database, your best option may be to come up with some sort of hash function on the values of a given row, compare the old and new tables, and pull only the ones where function(oldrow) != function(newrow).
You can also just do a direct comparison of the columns in question, and copy a record over whenever any of those columns differ between old and new.
This means that you cannot modify values in the new table, or they'll get overwritten daily from the old. If this is an issue, you'll need another table in which to cache the old table's values from the day before; then you'll be able to tell whether old, new, or both were modified in the interim.
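A rough sketch of the hash-comparison idea, assuming the remote table is reachable (for example via a linked server), that both tables share a key column Id, and that SQL Server 2000's BINARY_CHECKSUM is good enough; all names are placeholders, and BINARY_CHECKSUM can occasionally miss a change, so treat this as illustrative only:

-- Sketch: find ids of rows that are new or whose hashed column values differ
SELECT r.Id
FROM RemoteServer.RemoteDb.dbo.SourceTable AS r
LEFT JOIN dbo.LocalTable AS l
       ON l.Id = r.Id
WHERE l.Id IS NULL                                    -- new rows
   OR BINARY_CHECKSUM(r.Col1, r.Col2, r.Col3)
   <> BINARY_CHECKSUM(l.Col1, l.Col2, l.Col3);        -- changed rows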
I solved this by using the tablediff utility, which compares the data in two tables for non-convergence and is particularly useful for troubleshooting non-convergence in a replication topology.
See the link.
tablediff utility
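A hedged example invocation (the server, database, and table names are placeholders; the -f switch writes a T-SQL script that would bring the destination back in sync):

tablediff -sourceserver "SRCSERVER" -sourcedatabase SourceDb -sourcetable Orders ^
          -destinationserver "DESTSERVER" -destinationdatabase ReportDb -destinationtable Orders ^
          -f C:\diffs\fix_Orders.sql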
To sum up:
You have an older remote DB server in which you can't modify anything (tables, triggers, etc.).
You can't use replication.
The data itself has no indication of date/time it was last modified.
You don't want to pull the entire table down each time.
That leaves us with an impossible situation.
Your only option, if the first 3 items above are true, is to pull the entire table. Even if the data did have a modified date/time column, you still wouldn't detect deletes. Which leaves us back at square one.
Go talk to your boss and ask for better requirements. Maybe something that can be done this time.
We have a warehouse database that contains a year of data up to now. I want to create a report database that represents the last 3 months of data for reporting purposes, and I want to be able to keep the two databases in sync. Right now, every 10 minutes I execute a package that grabs the most recent rows from the warehouse and adds them to the report DB. The problem is that I only get new rows, not updates to existing rows.
I would like to know what are the various ways of solving this scenario.
Thanks
Look into replication, mirroring, or log shipping.
If you are using SQL 2000 or below, replication is your best bet. Since you are doing this every ten minutes, you should definitely look at transactional replication.
If you are using SQL 2005 or greater, you have more options available to you. Database snapshots, log shipping, and mirroring as SQLMenace suggested above. The suitability of these vary depending on your hardware. You will have to do some research to pick the optimal one for your needs.
You should probably read about replication, or ask your DB admin about it.
Is it possible to add columns to this database? You could add a Last_Activity column to the table and then write a trigger that updates the date/timestamp on that row to reflect the latest edit. For any new entries, the date/time would reflect the timestamp when the row was added.
This way, when you grab the last three months, you'd be grabbing the last three months' activity, not just the new stuff.
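A hedged sketch of that idea, with the table name, key column, and data types all assumed:

-- New rows pick up the default; updates refresh the stamp via the trigger
ALTER TABLE dbo.WarehouseFacts
    ADD Last_Activity DATETIME NOT NULL DEFAULT GETDATE();
GO
CREATE TRIGGER trg_WarehouseFacts_LastActivity
ON dbo.WarehouseFacts
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE w
    SET Last_Activity = GETDATE()
    FROM dbo.WarehouseFacts AS w
    JOIN inserted AS i ON i.Id = w.Id;   -- Id is an assumed key column
END;

The 10-minute package could then filter on Last_Activity rather than only picking up newly inserted rows.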