What is the difference between exclude and blacklist in Maxwell configuration?

I was looking at the Maxwell code:
https://github.com/zendesk/maxwell
https://github.com/zendesk/maxwell/blob/master/config.properties.example.
Could someone please clarify the difference between exclude and blacklist in Maxwell's filter configuration?

From the docs:
Note that once Maxwell has been running with a table or database marked as blacklisted, you must continue to run Maxwell with that table or database blacklisted or else Maxwell will halt. If you want to stop blacklisting a table or database, you will have to drop the maxwell schema first. Also note that this is the feature I most regret writing.
The practical difference between the two is that for blacklisted tables, maxwell ignores both the data changes and the schema changes. For excluded tables, maxwell will ignore data but still track the schema, so that you can un-exclude them later.
Wherever possible, just use exclude. The main reason to blacklist a table is if it has schema changes that maxwell is unable to understand, but that should be rare.
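For concreteness, and going by the filter syntax in the linked config.properties.example, a filter line combining both might look like this (the database/table names here are made up):

# exclude: data changes ignored, schema still tracked, reversible later
# blacklist: data AND schema changes ignored, hard to undo once Maxwell has run
filter=exclude: shop.audit_log, blacklist: shop.unparseable_table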

Related

How to optimally clear down multiple tables in Oracle SQL?

I have a script that clears down 10+ tables and will be run frequently.
However, 3 of these tables have tens of thousands of records, and this causes the script to take over two and a half hours to run. That isn't good enough, because it will be run during the workday to test a suite of scripts that are migrating data.
We have tried a few different approaches:
DELETE FROM x
Requires nulling some foreign keys but this is the most reliable way we have found. However, like I've mentioned, the runtime is far too long for our purposes.
A mixture of DELETE FROM x and TRUNCATE TABLE x
This is the fastest way we've found. Deleting tables as normal and then truncating the problematic tables. But this requires an explicit order to clearing down the tables. This is awkward in and of itself but our DB schema is also changing frequently which makes this cumbersome.
Disable constraints and then TRUNCATE TABLE x
We have also begun writing a script to disable all PK/FK constraints related to the tables we would like to empty, truncate those tables and their child/sibling/parent tables, and then re-enable those constraints. This seems like a good approach but is likely overengineering.
There is one further approach that I have considered, but I'm not sure whether it would make a difference: removing all FK constraints and adding ON DELETE CASCADE to them. This might hopefully optimise approach 1. But I said I would throw it to SO first and see if anybody else has had to deal with this before.
The TRUNCATE statement is way faster than the DELETE statement for emptying a table.
In Oracle (12c and later, as in PostgreSQL), there is a CASCADE option on the TRUNCATE statement (as on the DELETE statement), which might help with the "overengineering" part (when you want to disable FKs, then re-enable them). Note that Oracle's TRUNCATE ... CASCADE requires the dependent foreign keys to be defined with ON DELETE CASCADE, which ties in with the last idea in the question.
TRUNCATE TABLE x CASCADE;
Truncating is better than deleting, especially for large tables. As you already found out, it is faster, but it also resets the high water mark so queries you run afterwards will run better.
Whichever option you choose (2 or 3) is OK with me, as long as you pick the one you prefer. The order does matter, so, yes, that's kind of tedious.
I wouldn't remove foreign key constraints; they exist for a reason. If you do remove them, don't forget to recreate them afterwards.
You didn't say whether there are other tables involved in that schema. If not, perhaps you could even drop/create the user and then recreate the tables within the same script. Drawback: "normal" users can't do that; you'll need someone with DBA privileges.
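If approach 3 is chosen after all, a minimal PL/SQL sketch might look like this (the table names BIG_TABLE_1 and BIG_TABLE_2 are hypothetical; the constraints must be re-enabled in the same script):

BEGIN
  -- disable FK constraints that reference the tables to be emptied
  FOR c IN (SELECT table_name, constraint_name
              FROM user_constraints
             WHERE constraint_type = 'R'
               AND r_constraint_name IN (SELECT constraint_name
                                           FROM user_constraints
                                          WHERE table_name IN ('BIG_TABLE_1', 'BIG_TABLE_2')))
  LOOP
    EXECUTE IMMEDIATE 'ALTER TABLE ' || c.table_name ||
                      ' DISABLE CONSTRAINT ' || c.constraint_name;
  END LOOP;

  EXECUTE IMMEDIATE 'TRUNCATE TABLE big_table_1';
  EXECUTE IMMEDIATE 'TRUNCATE TABLE big_table_2';

  -- re-enable: run the same loop with ENABLE instead of DISABLE
END;
/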

Automatic column mapping in UPDATE step

For inserts, if source and target columns are the same, no mapping or "select values" step is required. But for updates, there seems to be a need to specify the list of update fields.
My concern is about manually updating the KTRs each time a source table's columns are altered. Is there a way to enable automatic mapping during the Update step? See the screenshot for the "update fields" section; automatic mapping would mean that section could be left blank.
There are good reasons NOT to do so.
Believe me, having a robot change your KTRs is not a good idea. And there are good reasons not to change column names often in an OLAP schema, unless you like to be in conflict with the report designers and, even worse, with the dashboard and front-end JavaScript guys.
So if pressing a button is not a solution for you, because maybe you have 1000 tables to update, what you can do is use a Metadata Injection step. You'll find nice examples on Diethard Steiner's blog or Jens Bleuel's blog. In short, you make the Update metadata dynamic, but you first have to examine each table to get the column names.
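For that last step (getting each table's column names to inject), the catalog can usually be queried directly; a generic sketch, with hypothetical schema and table names (the exact catalog views vary by database):

SELECT column_name
  FROM information_schema.columns
 WHERE table_schema = 'my_schema'
   AND table_name = 'my_table'
 ORDER BY ordinal_position;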

Add/remove columns of a table - code maintenance / optimisation

What is the best way to maintain the code of a big project?
Let's say you have 1000 stored procedures, and you have to add a new column to a table (or remove one).
There might be 1-2, or 30, stored procedures that are affected.
Just a single "search" for the table name might not be good enough; let's say you only need to know the places where the table has an insert/update/delete.
Searching for 'insert tablename' might be a good idea, but there might be a space between those 2 words, or 2 spaces, or a TAB ... maybe the table name is written as '[tablename]'.
The same goes for all 3 (insert/update/delete).
I am basically looking for some kind of 'restricted dependencies'.
What is the best way to handle this?
Keep a database table with this kind of information, and change that table every time you make changes to stored procedures?
Keep some specific code as a comment next to each insert/update/delete, so that you can search for what you need?
Example: 'insert_tablename', 'update_tablename', 'delete_tablename'
Does anyone have a better idea?
Ideally, changes are backward compatible. Not just so that you can change a table without breaking all of the objects that reference it, but also so that you can deploy all of the database changes before you deploy all of the application code (this is crucial in a distributed architecture: think of a downloadable desktop app or an iPhone app, where folks connect to your database remotely).
For example, if you add a new column to a table, it should be NULLable or have a default value so that INSERT statements don't need to be updated immediately to reference it. Stored procedures can be updated gradually to accept a new parameter to represent this column, and it should be nullable / optional so that the application(s) don't need to be aware of this column immediately. Etc.
This also demands that your original insert statements include an explicit column list. If you just say:
INSERT dbo.table VALUES(@p1, @p2, ...);
Then that makes it much tougher to make your changes backward compatible.
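To make the add-a-column case concrete, here is a minimal T-SQL sketch (the table, column, and procedure names are hypothetical):

-- the new column is NULLable, so existing INSERTs with explicit column lists keep working
ALTER TABLE dbo.Orders ADD Notes NVARCHAR(255) NULL;
GO
-- the procedure gains an optional parameter, so existing callers keep working
ALTER PROCEDURE dbo.Orders_Insert
    @CustomerID INT,
    @Notes NVARCHAR(255) = NULL
AS
BEGIN
    INSERT dbo.Orders (CustomerID, Notes)
    VALUES (@CustomerID, @Notes);
END
GO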
As for removing a column, well, that's a little tougher. Dependency tracking is not perfect in SQL Server, but you should be able to find a lot of information from these dynamic management objects:
sys.dm_sql_referenced_entities
sys.dm_sql_referencing_entities
sys.sql_expression_dependencies
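For example, to list every object that references a table before touching its columns (SQL Server 2008 and later; the table name is hypothetical):

SELECT referencing_schema_name, referencing_entity_name
  FROM sys.dm_sql_referencing_entities(N'dbo.MyTable', 'OBJECT');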
You might also find these articles interesting:
Keeping sysdepends up to date
Make your database changes backward compatible when adding a new column
Make your database changes backward compatible when dropping a column
Make your database changes backward compatible when renaming an entity
Make your database changes backward compatible when changing a relationship

Date ranges in views - is this normal?

I recently started working at a company with an enormous "enterprisey" application. At my last job, I designed the database, but here we have a whole Database Architecture department that I'm not part of.
One of the stranger things in their database is that they have a bunch of views which, instead of having the user provide the date ranges they want to see, join with a (global temporary) table "TMP_PARM_RANG" with a start and end date. Every time the main app starts processing a request, the first thing it does is "DELETE FROM TMP_PARM_RANG;" followed by an insert into it.
This seems like a bizarre way of doing things, and not very safe, but everybody else here seems ok with it. Is this normal, or is my uneasiness valid?
Update: I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems. Also, there are literally dozens, if not hundreds, of views that all depend on TMP_PARM_RANG.
Do I understand this correctly?
There is a view like this:
SELECT * FROM some_table, tmp_parm_rang
WHERE some_table.date_column BETWEEN tmp_parm_rang.start_date AND tmp_parm_rang.end_date;
Then in some frontend a user inputs a date range, and the application does the following:
1. Deletes all existing rows from TMP_PARM_RANG
2. Inserts a new row into TMP_PARM_RANG with the user's values
3. Selects all rows from the view
I wonder if the changes to TMP_PARM_RANG are committed or rolled back, and if so when? Is it a temporary table or a normal table? Basically, depending on the answers to these questions, the process may not be safe for multiple users to execute in parallel. One hopes that if this were the case they would have already discovered that and addressed it, but who knows?
Even if it is done in a thread-safe way, making changes to the database for simple query operations doesn't make a lot of sense. These DELETEs and INSERTs are generating redo/undo (or whatever the equivalent is in a non-Oracle database) which is completely unnecessary.
A simple and more normal way of accomplishing the same goal would be to execute this query, binding the user's inputs to the query parameters:
SELECT * FROM some_table WHERE some_table.date_column BETWEEN ? AND ?;
If the database is Oracle, it's possibly a global temporary table; every session sees its own version of the table, and inserts/deletes won't affect other users.
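If so, its definition would look something like this (a sketch; the column names are assumed from the view above):

CREATE GLOBAL TEMPORARY TABLE tmp_parm_rang (
    start_date DATE,
    end_date   DATE
) ON COMMIT PRESERVE ROWS;   -- rows are private to each session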
There must be some business reason for this table. I've seen views with dates hardcoded that were actually partitioned views, using dates as the partitioning field. I've also seen joins on a table like this when dealing with daylight saving time: imagine a view that returns all activity which occurred during DST. And none of these things would ever delete from and insert into the table... that's just odd.
So either there is a deeper reason for this that needs to be dug out, or it's just something that at the time seemed like a good idea but why it was done that way has been lost as tribal knowledge.
Personally, I'm guessing that it would be a pretty strange occurrence. And from what you are saying, two methods calling the process at the same time could be very interesting.
Typically date ranges are done as filters on a view, and not driven by outside values stored in other tables.
The only justification I could see for this is a multi-step process that is only executed one at a time, where the dates are needed for multiple operations across multiple stored procedures.
I suppose it would let them support multiple ranges. For example, they can return all dates between 1/1/2008 and 1/1/2009 AND between 1/1/2006 and 1/1/2007, to compare 2006 data to 2008 data. You couldn't do that with a single pair of bound parameters. Also, I don't know how Oracle does its query plan caching for views, but perhaps it has something to do with that? With the date columns being checked as part of the view, the server could cache a plan that always assumes the dates will be checked.
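For instance, loading that two-range example might look like this (column names assumed from the earlier view sketch):

INSERT INTO tmp_parm_rang (start_date, end_date) VALUES (DATE '2006-01-01', DATE '2007-01-01');
INSERT INTO tmp_parm_rang (start_date, end_date) VALUES (DATE '2008-01-01', DATE '2009-01-01');
-- the view now returns rows that fall in either range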
Just throwing out some guesses here :)
Also, you wrote:
I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems.
While that may guard against data consistency problems due to concurrency, it hurts when it comes to performance problems due to concurrency.
Do they also add one (in the application) to generate the next unique value for the primary key?
It seems that the concept of shared state eludes these folks, or the reason for the shared state eludes us.
That sounds like a pretty weird algorithm to me. I wonder how it handles concurrency - is it wrapped in a transaction?
Sounds to me like someone just wasn't sure how to write their WHERE clause.
The views are probably used as temp tables. In SQL Server we can use a table variable or a temp table (# / ##) for this purpose. Although creating views is not recommended by experts, I have created lots of them for my SSRS projects, because the tables I am working on do not reference one another (no FKs, seriously!). I have to work around deficiencies in the database design; that's why I am using views a lot.
With the global temporary table (GTT) approach that you say is being used here, the method is certainly safe with regard to a multiuser system, so no problem there. If this is Oracle, then I'd want to check that the system either uses an appropriate level of dynamic sampling so that the GTT is joined appropriately, or that a call to DBMS_STATS is made to supply statistics on the GTT.
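A minimal sketch of that DBMS_STATS call, run after the GTT has been loaded:

BEGIN
  -- supply statistics so the optimizer joins the freshly loaded GTT sensibly
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TMP_PARM_RANG');
END;
/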

Strategy for identifying unused tables in SQL Server 2000?

I'm working with a SQL Server 2000 database that likely has a few dozen tables that are no longer accessed. I'd like to clear out the data that we no longer need to be maintaining, but I'm not sure how to identify which tables to remove.
The database is shared by several different applications, so I can't be 100% confident that reviewing these will give me a complete list of the objects that are used.
What I'd like to do, if it's possible, is to get a list of tables that haven't been accessed at all for some period of time. No reads, no writes. How should I approach this?
MSSQL2000 won't give you that kind of information. But a way you can identify which tables ARE used (and then deduce which ones are not) is to use SQL Profiler to capture all the queries that go to a certain database. Configure the profiler to record the results to a new table, and then check the queries saved there to find all the tables (and views, SPs, etc.) that are used by your applications.
Another way to check whether there are any "writes" is to add a new timestamp column to every table, and a trigger that updates that column every time there's an update or an insert. But keep in mind that if your apps do queries of the type
select * from ...
then they will receive a new column, and that might cause you some problems.
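A sketch of that trigger idea in T-SQL (the table and key column names are hypothetical):

ALTER TABLE dbo.MyTable ADD LastWrite DATETIME NULL
GO
CREATE TRIGGER trg_MyTable_LastWrite
ON dbo.MyTable
AFTER INSERT, UPDATE
AS
    -- stamp affected rows with the time of the write
    UPDATE t
       SET LastWrite = GETDATE()
      FROM dbo.MyTable t
      JOIN inserted i ON i.Id = t.Id   -- hypothetical primary key
GO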
Another suggestion for tracking tables that have been written to is to use Red Gate SQL Log Rescue (free). This tool dives into the log of the database and will show you all inserts, updates and deletes. The list is fully searchable, too.
It doesn't meet your criterion of tracking reads from the database, but I think the SQL Profiler technique will give you a fair idea as far as that goes.
If you have lastupdate columns, you can check for the writes; there is really no easy way to check for reads. You could run Profiler, save the trace to a table, and check in there.
What I usually do is rename the table by prefixing it with an underscore; when people start to scream, I just rename it back.
If by "not used" you mean your application has no more references to the tables in question, and you are using dynamic SQL, you could do a search for the table names in your app; if they aren't found, blow the tables away.
I've also outputted all sprocs, functions, etc. to a text file and done a search for the table names. If a table is not found, or is found only in procedures that will need to be deleted too, blow it away.
It looks like using the Profiler is going to work. Once I've let it run for a while, I should have a good list of used tables. Anyone who doesn't use their tables every day can probably wait for them to be restored from backup. Thanks, folks.
Probably too late to help mogrify, but for anybody doing a search: I would search for all objects using this object in my code, then in SQL Server by running this:
select distinct '[' + object_name(id) + ']'
from syscomments
where text like '%MY_TABLE_NAME%'