Cache a few columns but write all of them - Ignite

I have recently started exploring Apache Ignite for one of our projects. I have a requirement: I want to cache a few columns of a table in memory, and update (or insert a new row with) all of the columns, while still caching only a few of them.
Let me take an example to explain:
TABLE_A:
COLUMN_A - varchar
COLUMN_B - integer
COLUMN_C - blob
In the above example table, I want to cache only COLUMN_A and COLUMN_B in memory and omit COLUMN_C from caching, but when I update or insert a row I should be able to populate all of the columns (including COLUMN_C; I get the data for the row from an external source), while still caching only columns A and B. Please note that I am doing this to save memory, because COLUMN_C is a huge object.
Additionally, I also want to be able to get COLUMN_C on demand from the DB.
I tried a hack with a custom JdbcTypesTransformer; it works great for preloading only columns A and B (not C), but as soon as I insert or update a row and set column C, C also gets populated in the cache.
Please suggest a way to do this in Apache Ignite.

I don't think there is a way to do this, and frankly I would not recommend it. If COLUMN_C doesn't need to be cached, just remove it from your objects and update it separately instead of updating it through Ignite. This approach is cleaner and is going to be more efficient as well. The persistence store is updated by an Ignite server node, so you would end up transferring the COLUMN_C value to the server node first, and if it's huge, that's a bad idea.
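To illustrate the split (a minimal sketch, assuming the backing store is the same TABLE_A and that COLUMN_A identifies the row): columns A and B keep going through the Ignite cache and its write-through store, while the large COLUMN_C is written and read directly against the database over JDBC, bypassing Ignite entirely.
-- write COLUMN_C straight to the store, outside of Ignite
UPDATE TABLE_A SET COLUMN_C = ? WHERE COLUMN_A = ?;
-- fetch COLUMN_C on demand, also outside of Ignite
SELECT COLUMN_C FROM TABLE_A WHERE COLUMN_A = ?;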

Related

Does it affect performance to frequently repopulate a highly read database table?

I have a database table with about 2,500 rows in it, which is frequently read by my web application. Will it affect the performance of reading from that table if all of the data in it is frequently (e.g. every 1-5 minutes) deleted and re-inserted?
By that I mean:
DELETE FROM MyTable
INSERT INTO MyTable SELECT ...
Probably not, at the given numbers ...
However, if you have one or more indexes on your table (to help with reads/selects, or created automatically for any PK/UK), you should consider that every delete/insert may result in recalculation of any such index (on top of the delete/insert itself); this does not directly affect table reads as such, but it adds to the overall load on the DB server.
There is no source code shown, but it appears you are using this table as an intermediate/interface to something else, so while 'updating' you'd probably want to make sure to bundle your delete(s)/insert(s) in transactions as best you can (see the sketch below), rather than e.g. executing them all individually, like in a loop. Or see if you can keep your PKs and rather just update?
This could also help reduce fragmentation in the underlying storage ...
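A minimal sketch of that bundling (SQL Server-style transaction syntax assumed; the SELECT is whatever source query you already use): readers see either the old or the new contents, never a half-empty table, and the work is committed once rather than statement by statement.
BEGIN TRANSACTION;
    DELETE FROM MyTable;
    INSERT INTO MyTable SELECT ...;  -- same source query as before
COMMIT TRANSACTION;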

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for PostgreSQL with the Django ORM, which uses surrogate keys. The presence of an ORM may rule out answers that rely on unions.
I can come up with only half-solutions.
I could have a modified_time column, but it cannot convey deletions.
I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there is a moderate number of deleted rows.
I could set a deleted flag on the row and null the rest of the columns, but it's rather bad schema design to make all columns nullable.
I could have another table that stores ID, modification sequence number and state (new, updated, deleted), but it's another table to maintain, and setting sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the columns here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option; you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion and have a deleted_on column in your table, making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data, create a view on top of your table that nulls out the data where the deleted_on column has data in it. Then your API only accesses the table through the view.
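A minimal sketch of such a view (the view name and some_column are placeholders for your real names):
create view my_table_api as
select id
     , case when deleted_on is null then some_column end as some_column  -- nulled once deleted
     , created_on
     , modified_on
     , deleted_on
from my_table;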
If you are really, really worried about space and the volume of records, and will perform regular database maintenance to ensure that both are controlled, then maybe go with option 4. Create a second table that holds the state of each ID in your main table and actually delete the data from your main table. You can then do a LEFT OUTER JOIN to the main table to get the data; when there is no data, that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
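A rough sketch of that second table and the join (names are hypothetical and follow the table above):
create table my_table_state (
  id ... primary key
, sequence bigint
, state varchar(10)   -- 'created', 'updated' or 'deleted'
);

select s.id, s.state, s.sequence, t.<other_columns>
from my_table_state s
left outer join my_table t on t.id = s.id;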
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.

ORACLE 11g SET COLUMN NULL for specific Partition of large table

I have a Composite-List-List partitioned table with 19 columns and about 400 million rows. Once a week new data is inserted into this table, and before the insert I need to set the values of 2 columns to null for specific partitions.
The obvious approach would be something like the following, where COLUMN_1 is the partition criterion:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition where I need to set those two columns to null and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn't work because it's a composite partition.
I could use the same approach with subpartitions, but then I would have to use CTAS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code review.
Maybe somebody has another idea how to solve this performantly?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (exchanging partitions), so that leaves us with only DML to consider.
I don't think that it's actually that bad an update for a table so heavily partitioned. You can easily split the update into 8k mini-updates (each on a single tiny subpartition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain about 15k rows to be updated on average, so each update would be relatively tiny.
While it still represents a very large amount of work, it should be easy to set up to run in parallel, hopefully during hours when database activity is very light. Also, the individual updates are easy to restart if one of them fails (rows locked?), whereas a 120M-row update would take a very long time to roll back in case of error.
If I were to update almost 90% of the rows in a table, I would check the feasibility/duration of just inserting into another table of the same structure (less redo, no row chaining/migration, bypassing the cache and so on via direct-path insert; drop indexes and triggers first; exclude the two columns so they are left null in the target table), renaming the tables to "swap" them, rebuilding indexes and triggers, then dropping the old table.
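A rough sketch of that sequence (Oracle syntax; the column list is abbreviated, and BLABLA_TABLE_NEW is assumed to be a pre-created empty table with the same structure and partitioning):
-- direct-path insert, leaving COLUMN_18/COLUMN_19 out of the list so they stay NULL
INSERT /*+ APPEND */ INTO BLABLA_TABLE_NEW (COLUMN_1, /* ... */ COLUMN_17)
SELECT COLUMN_1, /* ... */ COLUMN_17
FROM BLABLA_TABLE;
COMMIT;
-- rebuild indexes and triggers on the new table, then swap the names
ALTER TABLE BLABLA_TABLE RENAME TO BLABLA_TABLE_OLD;
ALTER TABLE BLABLA_TABLE_NEW RENAME TO BLABLA_TABLE;
DROP TABLE BLABLA_TABLE_OLD;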
From my experience in data warehousing, a plain direct-path insert is better than update/delete. More steps are needed, but it's done in less time overall. I agree, a partition swap is easier said than done when you have to process most of the table, and it just makes things more complex for the ETL developer (the logic/algorithm becomes bound to what's in the physical layer); we haven't encountered a need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between these two tablespaces (insert into the 2nd, drop the table from the 1st, vice versa in the next run, and resize the empty tablespace to reclaim space).

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated-date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger-based approach, nor change the application that writes to the database, since another group develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only get newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job; in subsequent data dumps, grab everything since the last value of the auto-incrementing field and then record the current value.
If you want updates, the minimum I can see is to have a last_update field and probably a trigger to populate it. If the last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, and NULL the last_update field. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app from seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables where each change gets written to them. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened to the same record? The history table is the only way that I can see of capturing this scenario.
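A minimal sketch of the history-table idea (SQL Server-style syntax assumed; table and column names are hypothetical): a trigger copies the key and the kind of change into an audit table, and the data dump is then selected from there.
-- audit table fed by the trigger below
CREATE TABLE source_table_history (
    row_id      int,
    change_type char(1),                    -- 'I', 'U' or 'D'
    changed_at  datetime DEFAULT GETDATE()
);

CREATE TRIGGER trg_source_table_audit
ON source_table
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- rows only in "inserted" are inserts, rows in both are updates, rows only in "deleted" are deletes
    INSERT INTO source_table_history (row_id, change_type)
    SELECT row_id, 'I' FROM inserted WHERE row_id NOT IN (SELECT row_id FROM deleted);
    INSERT INTO source_table_history (row_id, change_type)
    SELECT row_id, 'U' FROM inserted WHERE row_id IN (SELECT row_id FROM deleted);
    INSERT INTO source_table_history (row_id, change_type)
    SELECT row_id, 'D' FROM deleted WHERE row_id NOT IN (SELECT row_id FROM inserted);
END;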
This should isolate rows that have changed since your last backup, assuming DestinationTable is a copy of SourceTable including the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable

Check for changes to an SQL Server table?

How can I monitor an SQL Server database for changes to a table without using triggers or modifying the structure of the database in any way? My preferred programming environment is .NET and C#.
I'd like to be able to support any SQL Server 2000 SP4 or newer. My application is a bolt-on data visualization for another company's product. Our customer base is in the thousands, so I don't want to have to put in requirements that we modify the third-party vendor's table at every installation.
By "changes to a table" I mean changes to table data, not changes to table structure.
Ultimately, I would like the change to trigger an event in my application, instead of having to check for changes at an interval.
The best course of action given my requirements (no triggers or schema modification, SQL Server 2000 and 2005) seems to be to use the BINARY_CHECKSUM function in T-SQL. The way I plan to implement it is this:
Every X seconds run the following query:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM sample_table
WITH (NOLOCK);
And compare that against the stored value. If the value has changed, go through the table row by row using the query:
SELECT row_id, BINARY_CHECKSUM(*)
FROM sample_table
WITH (NOLOCK);
And compare the returned checksums against stored values.
Take a look at the CHECKSUM command:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM sample_table WITH (NOLOCK);
That will return the same number each time it's run as long as the table contents haven't changed. See my post on this for more information:
CHECKSUM
Here's how I used it to rebuild cache dependencies when tables changed:
ASP.NET 1.1 database cache dependency (without triggers)
Unfortunately CHECKSUM does not always work properly to detect changes.
It is only a primitive checksum, not a cyclic redundancy check (CRC) calculation.
Therefore you can't use it to detect all changes; e.g. symmetrical changes result in the same CHECKSUM!
E.g. the solution with CHECKSUM_AGG(BINARY_CHECKSUM(*)) will deliver 0 for all 3 of the following tables, even though they have different content:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 1 as numB
UNION ALL
SELECT 1 as numA, 1 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 2 as numB
UNION ALL
SELECT 1 as numA, 2 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 0 as numA, 0 as numB
UNION ALL
SELECT 0 as numA, 0 as numB
) q
-- delivers 0!
Why don't you want to use triggers? They are a good thing if you use them correctly. If you use them as a way to enforce referential integrity, that is when they go from good to bad. But if you use them for monitoring, they are not really considered taboo.
How often do you need to check for changes, and how large (in terms of row size) are the tables in the database? If you use the CHECKSUM_AGG(BINARY_CHECKSUM(*)) method suggested by John, it will scan every row of the specified table. The NOLOCK hint helps, but on a large database you are still hitting every row. You will also need to store the checksum for every row so that you can tell which one has changed.
Have you considered going at this from a different angle? If you do not want to modify the schema to add triggers (which makes sense, it's not your database), have you considered working with the application vendor that does make the database?
They could implement an API that provides a mechanism for notifying accessory apps that data has changed. It could be as simple as writing to a notification table that lists what table and which row were modified. That could be implemented through triggers or application code. From your side, it wouldn't matter; your only concern would be scanning the notification table on a periodic basis. The performance hit on the database would be far less than scanning every row for changes.
The hard part would be convincing the application vendor to implement this feature. Since this can be handled entirely through SQL via triggers, you could do the bulk of the work for them by writing and testing the triggers and then bringing the code to the application vendor. Having the vendor support the triggers prevents the situation where a trigger you add inadvertently replaces a trigger supplied by the vendor.
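For illustration, a hypothetical shape for such a notification table and the periodic poll the accessory app would run (names are illustrative, not from the vendor's product):
-- populated by vendor-supplied triggers or application code
CREATE TABLE change_notification (
    table_name  sysname,
    row_id      int,
    modified_at datetime DEFAULT GETDATE()
);

-- accessory app polls for anything newer than the last time it looked
SELECT table_name, row_id, modified_at
FROM change_notification
WHERE modified_at > @last_poll_time;   -- timestamp supplied by the polling app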
Unfortunately, I do not think that there is a clean way to do this in SQL2000. If you narrow your requirements to SQL Server 2005 (and later), then you are in business. You can use the SQLDependency class in System.Data.SqlClient. See Query Notifications in SQL Server (ADO.NET).
Have a DTS job (or a job started by a Windows service) that runs at a given interval. Each time it runs, it gets information about the given table by using the system INFORMATION_SCHEMA views and records this data in your data repository. Compare the data returned regarding the structure of the table with the data returned the previous time. If it is different, then you know that the structure has changed.
An example query to return information regarding all of the columns in table ABC (ideally listing out just the columns from the INFORMATION_SCHEMA view that you want, instead of using select * like I do here):
select * from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'ABC'
You would monitor different columns and INFORMATION_SCHEMA views depending on how exactly you define "changes to a table".
Wild guess here: if you don't want to modify the third party's tables, can you create a view and then put a trigger on that view?
Check the last commit date. Every database has a history of when each commit is made. I believe it's a standard of ACID compliance.