Why do I see duplicate object ids when using git_odb_foreach? - libgit2

I'm running a simple object id gatherer on a large git repo (in this case linux-2.6) in preparation for storing said ids in a sqlite database.
Pseudo-code:
// Table holding the SHA-1 of each object; the UNIQUE constraint rejects duplicate ids
CREATE TABLE objs(key INTEGER PRIMARY KEY, id BLOB UNIQUE);
// For each id, insert into the objs table; rc tells us if we violated the uniqueness constraint
int callback(const git_oid *oid, void *payload) {
    // note that 'oid' in the following string is really the id value in real code
    int rc = sqlite3_exec(db, "INSERT INTO objs(id) VALUES(oid);", NULL, NULL, NULL);
    if (rc == SQLITE_CONSTRAINT) {
        // code to print type and oid
    }
    return 0; /* returning non-zero would abort the iteration */
}
int main() {
    // sqlite and git initialization
    git_odb_foreach(odb, callback, NULL);
    // cleanup
    return 0;
}
Out of ~4 million objects, I encounter ~70,000 duplicate ids along the way. Interestingly, running 'git rev-list --objects --all | wc -l' gives a count that matches the number of unique objects from the foreach code.
Can someone explain why the git_odb_foreach function would produce these non-unique ids?

Git objects might be stored in more than one packfile, in addition to existing as a loose object. That's simply something that might happen and that git implementations have to deal with.
While git/libgit2/whatever will generally check whether an object exists before creating it, there is no way to make this determination when you are speaking to a remote.
If a remote has some of the same objects as you in the history you are downloading, but there is no way to detect this via the negotiation (which only exchanges a few commit ids), then the remote might send you objects you already have. These objects arrive in a packfile, and there is no logic to get rid of duplicates outside of garbage collection, which can take quite some time.
Having the same object in different packfiles can even increase performance, as you can use the copy that sits closer to related objects and share its delta chain instead of having to open a different packfile.
All git_odb_foreach does is go through each backend and call the callback with whatever it finds. It doesn't try to deduplicate, as it can't know that's what you want to do. So if an object exists both loose and packed, or in multiple packfiles, that is what it will return.
Note, however, that your command git rev-list --objects --all | wc -l is doing something rather different from the git_odb_foreach call. The command is asking for all objects reachable from the references whereas the function call is getting all existing objects and in many cases (possibly most) those numbers won't match. E.g. if you ever do git add, there will be a blob in the object db that won't be reachable from any reference.
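If unique ids are what you want, deduplication has to happen on the caller's side. A minimal sketch (C# for illustration; EnumerateObjectIds and InsertIntoDatabase are hypothetical stand-ins for the foreach callback and the sqlite INSERT):
var seen = new HashSet<string>();
foreach (string id in EnumerateObjectIds()) // hypothetical: yields one id per callback
{
    if (!seen.Add(id))
        continue; // duplicate: the object exists loose and/or in several packfiles
    InsertIntoDatabase(id); // hypothetical stand-in for the sqlite INSERT
}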

Related

How does persistence ignorance work with references to (non-root) aggregates?

We have several aggregate roots that have two primary means of identification:
an integer "key", which is used as a primary key in the database (and is used as a foreign key by referencing aggregates), and internally within the application, and is not accessible by the public web API.
a string-based "id", which also uniquely identifies the aggregate root and is accessible by the public web API.
There are several reasons for having an integer-based private identifiers and a string-based public identifier - for example, the database performs better (8-byte integers as opposed to variable-length strings) and the public identifiers are difficult to guess.
However, the classes internally reference each other using the integer-based identifiers and if an integer-based identifier is 0, this signifies that the object hasn't yet been stored to the database. This creates a problem, in that entities are not able to reference other aggregate roots until after they have been saved.
How does one get around this problem, or is there a flaw in my understanding of persistence ignorance?
EDIT regarding string-based identifiers
The string-based identifiers are generated by the repository, connected to a PostgreSQL database, which generates the identifier to ensure that it does not clash with anything currently in the database. For example:
class Customer {
    public function __construct($customerKey, $customerId, $name) {
        $this->customerKey = $customerKey;
        $this->customerId = $customerId;
        $this->name = $name;
    }
}
function test(Repository $repository, UnitOfWork $unitOfWork) {
    $customer = new Customer(0, $repository->generateCustomerId(), "John Doe");
    // $customer->customerKey == 0
    $unitOfWork->saveCustomer($customer);
    // $customer->customerKey != 0
}
I assume that the same concept could be used to create an entity with a non-zero integer-based key, and the Unit of Work could use the fact that the key doesn't yet exist in the database as a reason to INSERT rather than UPDATE. The test() function above would then become:
function test(Repository $repository, UnitOfWork $unitOfWork) {
    $customer = new Customer($repository->generateCustomerKey(), $repository->generateCustomerId(), "John Doe");
    // $customer->customerKey != 0
    $unitOfWork->saveCustomer($customer);
    // $customer->customerKey still != 0
}
However, given the above, errors may occur if the Unit of Work does not save the database objects in the correct order. Is the way to get around this to ensure that the Unit of Work saves entities in the correct order?
I hope the above edit clarifies my situation.
It's a good approach to look at Aggregates as consistency boundaries. In other words, two different aggregates have separate lifecycles and you should refrain from tying their fates together inside the same transaction. From that axiom you can safely state that no aggregate A will ever have an ID of 0 when looked at from another aggregate B's perspective, because either the transaction that creates A has not finished yet and it is not visible by B, or it has completed and A has an ID.
Regarding the double identity, I'd rather have the string ID generated by the language than the database, because I suppose coming up with a unique ID would imply a transaction, possibly across multiple tables. Languages can usually generate unique strings with good entropy.
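As an illustration, a minimal sketch in C# (the method name mirrors the generateCustomerId repository method from the question; this is not the questioner's actual code):
using System;
using System.Security.Cryptography;

static string GenerateCustomerId()
{
    // 128 bits of randomness make collisions and guessing impractical.
    var bytes = new byte[16];
    using (var rng = RandomNumberGenerator.Create())
    {
        rng.GetBytes(bytes);
    }
    // Base64url-encode so the id is safe to expose through a web API.
    return Convert.ToBase64String(bytes).TrimEnd('=').Replace('+', '-').Replace('/', '_');
}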

Partitioned key space for StackExchange Redis

When developing a component that uses Redis, I've found it a good pattern to prefix all keys used by that component so that it does not interfere with other components.
Examples:
A component managing users might use keys prefixed by user: and a component managing a log might use keys prefixed by log:.
In a multi-tenancy system I want each customer to use a separate key space in Redis, to ensure that one customer's data does not interfere with another's. The prefix would then be something like customer:<id>: for all keys related to a specific customer.
Using Redis is still new stuff for me. My first idea for this partitioning pattern was to use separate database identifiers for each partition. However, that seems to be a bad idea because the number of databases is limited and it seems to be a feature that is about to be deprecated.
An alternative to this would be to let each component get an IDatabase instance and a RedisKey that it shall use to prefix all keys. (I'm using StackExchange.Redis)
I've been looking for an IDatabase wrapper that automatically prefix all keys so that components can use the IDatabase interface as-is without having to worry about its keyspace. I didn't find anything though.
So my question is: What is a recommended way to work with partitioned key spaces on top of StackExchange Redis?
I'm now thinking about implementing my own IDatabase wrapper that would prefix all keys. I think most methods would just forward their calls to the inner IDatabase instance. However, some methods would require a bit more work: for example, SORT and RANDOMKEY.
I've created an IDatabase wrapper now that provides key space partitioning.
The wrapper is created by using an extension method on IDatabase:
ConnectionMultiplexer multiplexer = ConnectionMultiplexer.Connect("localhost");
IDatabase fullDatabase = multiplexer.GetDatabase();
IDatabase partitioned = fullDatabase.GetKeyspacePartition("my-partition");
Almost all of the methods in the partitioned wrapper have the same structure:
public bool SetAdd(RedisKey key, RedisValue value, CommandFlags flags = CommandFlags.None)
{
    return this.Inner.SetAdd(this.ToInner(key), value, flags);
}
They simply forward the invocation to the inner database and prepend the key space prefix to any RedisKey arguments before passing them on.
The CreateBatch and CreateTransaction methods simply create wrappers for those interfaces, but with the same base wrapper class (as most methods to wrap are defined by IDatabaseAsync).
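For example, a sketch of the batch case (the wrapper type name here is illustrative, not the actual implementation):
public IBatch CreateBatch(object asyncState = null)
{
    // The batch wrapper prefixes keys exactly like the database wrapper;
    // both derive from the same base class covering the IDatabaseAsync members.
    return new KeyspacePartitionBatch(this.Inner.CreateBatch(asyncState), this.Prefix);
}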
The KeyRandomAsync and KeyRandom methods are not supported. Invocations will throw a NotSupportedException. This is not a concern for me, and to quote @Marc Gravell:
I can't think of any sane way of achieving that, but I suspect NotSupportedException("RANDOMKEY is not supported when a key-prefix is specified") is entirely reasonable (this isn't a commonly used command anyway)
I have not yet implemented ScriptEvaluate and ScriptEvaluateAsync because it is unclear to me how I should handle the RedisResult return value. The input parameters to these methods accept RedisKey which should be prefixed, but the script itself could return keys and in that case I think it would make (most) sense to unprefix those keys. For the time being, those methods will throw a NotImplementedException...
The sort methods (Sort, SortAsync, SortAndStore and SortAndStoreAsync) have special handling for the by and get parameters. These are prefixed as normal unless they have one of the special values: nosort for by and # for get.
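A sketch of that special-casing (the helper names are mine, and I assume the wrapper keeps its prefix around as a string):
internal RedisValue ToInnerSortBy(RedisValue by)
{
    // "nosort" is an instruction to redis itself, not a key pattern: pass it through.
    return by == "nosort" ? by : (RedisValue)(this.prefixString + (string)by);
}

internal RedisValue ToInnerSortGet(RedisValue get)
{
    // "#" means "the element itself" rather than a key pattern: pass it through.
    return get == "#" ? get : (RedisValue)(this.prefixString + (string)get);
}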
Finally, to allow prefixing ITransaction.AddCondition I had to use a bit of reflection:
internal static class ConditionHelper
{
    public static Condition Rewrite(this Condition outer, Func<RedisKey, RedisKey> rewriteFunc)
    {
        ThrowIf.ArgNull(outer, "outer");
        ThrowIf.ArgNull(rewriteFunc, "rewriteFunc");

        Type conditionType = outer.GetType();
        object inner = FormatterServices.GetUninitializedObject(conditionType);
        foreach (FieldInfo field in conditionType.GetFields(BindingFlags.NonPublic | BindingFlags.Instance))
        {
            if (field.FieldType == typeof(RedisKey))
            {
                field.SetValue(inner, rewriteFunc((RedisKey)field.GetValue(outer)));
            }
            else
            {
                field.SetValue(inner, field.GetValue(outer));
            }
        }
        return (Condition)inner;
    }
}
This helper is used by the wrapper like this:
internal Condition ToInner(Condition outer)
{
    if (outer == null)
    {
        return outer;
    }
    else
    {
        return outer.Rewrite(this.ToInner);
    }
}
There are several other ToInner methods for different kinds of parameters that contain RedisKey, but they all more or less end up calling:
internal RedisKey ToInner(RedisKey outer)
{
    return this.Prefix + outer;
}
I have now created a pull request for this:
https://github.com/StackExchange/StackExchange.Redis/pull/92
The extension method is now called WithKeyPrefix, and the reflection hack for rewriting conditions is no longer needed as the new code has access to the internals of the Condition classes.
Intriguing suggestion. Note that redis already offers a simple isolation mechanism by way of database numbers, for example:
// note: default database is 0
var logdb = muxer.GetDatabase(1);
var userdb = muxer.GetDatabase(2);
StackExchange.Redis will handle all the work to issue commands to the correct databases - i.e. commands issued via logdb will be issued against database 1.
Advantages:
inbuilt
works with all clients
provides complete keyspace isolation
doesn't require additional per-key space for the prefixes
works with KEYS, SCAN, FLUSHDB, RANDOMKEY, SORT, etc
you get high-level per-db keyspace metrics via INFO
Disadvantages:
not supported on redis-cluster
not supported via intermediaries like twemproxy
Note:
the number of databases is a configuration option; IIRC it defaults to 16 (numbers 0-15), but can be tweaked in your configuration file via:
databases 400 # moar databases!!!
This is actually how we (Stack Overflow) use redis with multi-tenancy; database 0 is "global", 1 is "stackoverflow", etc. It should also be clear that if required, it is then a fairly simple thing to migrate an entire database to a different node using SCAN and MIGRATE (or more likely: SCAN, DUMP, PTTL and RESTORE - to avoid blocking).
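For illustration, a rough sketch of the non-blocking variant using StackExchange.Redis (host names, port and database number are made up):
var source = ConnectionMultiplexer.Connect("source-host");
var target = ConnectionMultiplexer.Connect("target-host");
IServer sourceServer = source.GetServer("source-host", 6379);
IDatabase sourceDb = source.GetDatabase(1);
IDatabase targetDb = target.GetDatabase(1);

// SCAN the keyspace, then DUMP/PTTL each key and RESTORE it on the target
foreach (RedisKey key in sourceServer.Keys(database: 1, pageSize: 500))
{
    byte[] dump = sourceDb.KeyDump(key);
    TimeSpan? ttl = sourceDb.KeyTimeToLive(key);
    targetDb.KeyRestore(key, dump, ttl);
}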
Since database partitioning is not supported in redis-cluster, there may be a valid scenario here, but it should also be noted that redis nodes are easy to spin up, so another valid option is simply: use different redis groups for each (different port numbers, etc) - which would also have the advantage of allowing genuine concurrency between nodes (CPU isolation).
However, what you propose is not unreasonable; there is actually "prior art" here... again, largely linked to how we (Stack Overflow) use redis: while databases work fine for isolating keys, no isolation is currently provided by redis for channels (pub/sub). Because of this, StackExchange.Redis actually includes a ChannelPrefix option on ConfigurationOptions, which allows you to specify a prefix that is automatically added during PUBLISH and removed when receiving notifications. So if your ChannelPrefix is foo: and you publish an event bar, the actual event is published to the channel foo:bar; likewise, any callback you have only sees bar. It could be that this is something that is viable for databases too, but to emphasize: at the moment this configuration option is at the multiplexer level - not the individual ISubscriber. To be comparable to the scenario you present, this would need to be at the IDatabase level.
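For reference, the existing channel prefix is configured like this (a short sketch):
var options = ConfigurationOptions.Parse("localhost");
options.ChannelPrefix = "foo:"; // PUBLISH bar actually goes to foo:bar
var muxer = ConnectionMultiplexer.Connect(options);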
Possible, but a decent amount of work. If possible, I would recommend investigating the option of simply using database numbers...

Using the DoctrineObjectConstructor, how are new entities created?

I am attempting to use JMSSerializerBundle to consume JSON into Doctrine entities. I need to both create new entities where they do not already exist in the database, and update existing entities when they do already exist. I am using the DoctrineObjectConstructor included in the JMSSerializer package to help with this. When I consume JSON which contains a property designated as an identifier, such as:
{
"id": 1,
"some_other_attribute": "stuff"
}
by attempting to deserialize it, JMSSerializer emits warnings and eventually dies with an exception, because it attempts to use reflection to set properties on a null value. The warnings all look like this:
PHP Warning: ReflectionProperty::setValue() expects parameter 1 to be object, null given in /Users/cdonadeo/Repos/Ubertester/vendor/jms/serializer/src/JMS/Serializer/GenericDeserializationVisitor.php on line 176
If I manually insert an entity with ID 1 in my database and make another attempt then I receive no errors and everything appears to be working correctly, but I'm now short half my functionality. I looked at the code for the DoctrineObjectConstructor class, and at the top is a comment:
/**
* Doctrine object constructor for new (or existing) objects during deserialization.
*/
But I don't see how it could possibly create a new entity, because after the construct() function has done all of its checks, it ends by calling:
$object = $objectManager->find($metadata->name, $identifierList);
And since the identifier does not exist in the database the result is null which is ultimately what gets returned from the function. This explains why inserting a row in the database with the appropriate ID makes things work: find() now returns a proper Entity object, which is what the rest of the library expects.
Am I using the library wrong or is it broken? I forked the Git repo and made an edit, and trying it out everything seems to work more or less the way I expected. That edit does have some drawbacks that make me wonder if I'm not just making this more difficult than it has to be. The biggest issue I see is that it will cause persisted and unpersisted entities to be mixed together with no way to tell which ones are which, but I don't know if that's even a big deal.
For Doctrine entities, use this configuration:
jms_serializer:
    object_constructors:
        doctrine:
            fallback_strategy: "fallback" # possible values ("null" | "exception" | "fallback")
See the configuration reference: https://jmsyst.com/bundles/JMSSerializerBundle/master/configuration

Ensuring inserts after a call to a custom NHibernate IIdentifierGenerator

The setup
Some of the "old old old" tables of our database use an exotic primary key generation scheme [1] and I'm trying to overlay this part of the database with NHibernate. This generation scheme is mostly hidden away in a stored procedure called, say, 'ShootMeInTheFace.GetNextSeededId'.
I have written an IIdentifierGenerator that calls this stored proc:
public class LegacyIdentityGenerator : IIdentifierGenerator, IConfigurable
{
    // ... snip ...

    public object Generate(ISessionImplementor session, object obj)
    {
        var connection = session.Connection;
        using (var command = connection.CreateCommand())
        {
            SqlParameter param;
            session.ConnectionManager.Transaction.Enlist(command);
            command.CommandText = "ShootMeInTheFace.GetNextSeededId";
            command.CommandType = CommandType.StoredProcedure;

            param = command.CreateParameter() as SqlParameter;
            param.Direction = ParameterDirection.Input;
            param.ParameterName = "@sTableName";
            param.SqlDbType = SqlDbType.VarChar;
            param.Value = this.table;
            command.Parameters.Add(param);

            // ... snip ...

            command.ExecuteNonQuery();

            // ... snip ...

            return ((IDataParameter)command
                .Parameters["@sTrimmedNewId"]).Value as string;
        }
    }
}
The problem
I can map this in the XML mapping files and it works great, BUT....
It doesn't work when NHibernate tries to batch inserts, such as in a cascade, or when the session is not Flush()ed after every call to Save() on a transient entity that depends on this generator.
That's because NHibernate seems to be doing something like
for (each thing that I need to save)
{
    [generate its id]
    [add it to the batch]
}
[execute the sql in one big batch]
This doesn't work because the generator asks the database every time, so NHibernate just ends up getting the same ID generated multiple times, since it hasn't actually saved anything yet.
The other NHibernate generators like IncrementGenerator seem to get around this by asking the database for the seed value once and then incrementing the value in memory during subsequent calls in the same session. I would rather not do this in my implementation unless I have to, since all of the code that I need is sitting in the database already, just waiting for me to call it correctly.
Is there a way to make NHibernate actually issue the INSERT after it generates an ID for entities of a certain type? Fiddling with the batch size settings doesn't seem to help.
Do you have any suggestions/other workarounds besides re-implementing the generation code in memory or bolting on some triggers to the legacy database? I guess I could always treat these as "assigned" generators and try to hide that fact somehow within the guts of the domain model....
Thanks for any advice.
The update: 2 months later
It was suggested in the answers below that I use an IPreInsertEventListener to implement this functionality. While this sounds reasonable, there were a few problems with this.
The first problem was that setting the id of an entity to the AssignedGenerator and then not actually assigning anything in code (since I was expecting my new IPreInsertEventListener implementation to do the work) resulted in an exception being thrown by the AssignedGenerator, since its Generate() method essentially does nothing but check to make sure that the id is not null, throwing an exception otherwise. This is worked around easily enough by creating my own IIdentifierGenerator that is like AssignedGenerator without the exception.
The second problem was that returning null from my new IIdentifierGenerator (the one I wrote to overcome the problems with the AssignedGenerator) resulted in the innards of NHibernate throwing an exception, complaining that a null id was generated. Okay, fine, I changed my IIdentifierGenerator to return a sentinel string value, say, "NOT-REALLY-THE-REAL-ID", knowing that my IPreInsertEventListener would replace it with the correct value.
The third problem, and the ultimate deal-breaker, was that IPreInsertEventListener runs so late in the process that you need to update both the actual entity object as well as an array of state values that NHibernate uses. Typically this is not a problem and you can just follow Ayende's example. But there are three issues with the id field relating to the IPreInsertEventListeners:
The property is not in the @event.State array but instead in its own Id property.
The Id property does not have a public set accessor.
Updating only the entity but not the Id property results in the "NOT-REALLY-THE-REAL-ID" sentinel value being passed through to the database since the IPreInsertEventListener was unable to insert in the right places.
So my choice at this point was to use reflection to get at that NHibernate property, or to really sit down and say "look, the tool just wasn't meant to be used this way."
So I went back to my original IIdentifierGenerator and made it work for lazy flushes: it got the high value from the database on the first call, and then I re-implemented the ID generation function in C# for subsequent calls, modeling it after the Increment generator:
private string lastGenerated;

public object Generate(ISessionImplementor session, object obj)
{
    string identity;
    if (this.lastGenerated == null)
    {
        identity = GetTheValueFromTheDatabase();
    }
    else
    {
        identity = GenerateTheNextValueInCode();
    }
    this.lastGenerated = identity;
    return identity;
}
This seems to work fine for a while, but like the increment generator, we might as well call it the TimeBombGenerator. If there are multiple worker processes executing this code in non-serializable transactions, or if there are multiple entities mapped to the same database table (it's an old database, it happened), then we will get multiple instances of this generator with the same lastGenerated seed value, resulting in duplicate identities.
##$##$#.
My solution at this point was to make the generator cache a dictionary of WeakReferences to ISessions and their lastGenerated values. This way, the lastGenerated is effectively local to the lifetime of a particular ISession, not the lifetime of the IIdentifierGenerator, and because I'm holding WeakReferences and culling them out at the beginning of each Generate() call, this won't explode in memory consumption. And since each ISession is going to hit the database table on its first call, we'll get the necessary row locks (assuming we're in a transaction) we need to prevent duplicate identities from happening (and if they do, such as from a phantom row, only the ISession needs to be thrown away, not the entire process).
It is ugly, but more feasible than changing the primary key scheme of a 10-year-old database. FWIW.
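A sketch of that session-scoped cache (names and structure are illustrative, not the exact code):
private readonly Dictionary<WeakReference, string> lastGeneratedPerSession =
    new Dictionary<WeakReference, string>();

public object Generate(ISessionImplementor session, object obj)
{
    lock (this.lastGeneratedPerSession)
    {
        // Cull entries whose sessions have been garbage collected.
        var dead = this.lastGeneratedPerSession.Keys.Where(wr => !wr.IsAlive).ToList();
        foreach (var wr in dead)
        {
            this.lastGeneratedPerSession.Remove(wr);
        }

        // Find this session's entry, if it has one.
        var key = this.lastGeneratedPerSession.Keys
            .FirstOrDefault(wr => ReferenceEquals(wr.Target, session));

        string identity;
        if (key == null)
        {
            // First call in this session: hit the database, taking the
            // row locks inside the session's transaction.
            identity = GetTheValueFromTheDatabase();
            key = new WeakReference(session);
        }
        else
        {
            identity = GenerateTheNextValueInCode(this.lastGeneratedPerSession[key]);
        }

        this.lastGeneratedPerSession[key] = identity;
        return identity;
    }
}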
[1] If you care to know about the ID generation, you take a substring(len - 2) of all of the values currently in the PK column, cast them to integers and find the max, add one to that number, add all of that number's digits, and append the sum of those digits as a checksum. (If the database has one row containing "1000001", then we would get max 10000, +1 equals 10001, checksum is 02, resulting new PK is "1000102".) Don't ask me why.
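For the curious, that scheme boils down to something like this (a sketch, not the actual stored procedure):
static string GetNextSeededId(IEnumerable<string> existingIds)
{
    // Strip the 2-digit checksum, find the largest numeric part, and add one.
    int next = existingIds
        .Select(id => int.Parse(id.Substring(0, id.Length - 2)))
        .Max() + 1;
    // The checksum is the sum of the new number's digits, padded to two places.
    int checksum = next.ToString().Sum(c => c - '0');
    return next + checksum.ToString("D2");
}

// GetNextSeededId(new[] { "1000001" }) == "1000102"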
A potential workaround is to generate and assign the ID in an event listener rather than using an IIdentifierGenerator implementation. The listener should implement IPreInsertEventListener and assign the ID in OnPreInsert.
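A minimal sketch of that suggestion (the type and helper names are mine; note the caveats about the id property in the update above):
public class LegacyIdPreInsertListener : IPreInsertEventListener
{
    public bool OnPreInsert(PreInsertEvent @event)
    {
        var entity = @event.Entity as ILegacyKeyed; // hypothetical marker interface
        if (entity == null)
            return false; // false means: do not veto the insert

        // Ask the stored procedure for the next id and write it to the entity;
        // for ordinary properties the @event.State array must be kept in sync too.
        entity.Id = GetNextSeededIdFromDatabase(); // hypothetical helper
        return false;
    }
}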
Why don't you just make private string lastGenerated; static?

How should I handle this Optimistic Concurrency error in this Entity Framework code, I have?

I have the following pseudo-code in a Repository Pattern project that uses EF4.
public void Delete(int someId)
{
    // 1. Load the entity for that Id. If there is none, then null.
    // 2. If entity != null, then DeleteObject(..);
}
Pretty simple but I'm getting a run-time error:-
ConcurrencyException: Store, Update, Insert or Delete statement affected an unexpected number of rows (0).
Now, this is what is happening:
Two instances of EF4 are running in the app at the same time.
Instance A calls delete.
Instance B calls delete a nanosecond later.
Instance A loads the entity.
Instance B also loads the entity.
Instance A now deletes that entity - cool bananas.
Instance B tries to delete the entity, but it's already gone. As such, the affected row count is 0 when it expected 1, or something like that. Basically, it figured out that the item it was supposed to delete didn't get deleted (because it was deleted a split second ago).
I'm not sure if this is like a race condition or something.
Anyway, are there any tricks I can do here so the 2nd call doesn't crash? I could make it into a stored procedure, but I'm hoping to avoid that right now.
Any ideas? I'm wondering if it's possible to lock that row (and that row only) when the select is called, forcing Instance B to wait until the row lock has been released. By that time, the row is deleted, so when Instance B does its select, the data is not there, so it will never delete.
Normally you would catch the OptimisticConcurrencyException and then handle the problem in a manner that makes sense for your business model - then call SaveChanges again.
try
{
    myContext.SaveChanges();
}
catch (OptimisticConcurrencyException e)
{
    // If the failed entry is the entity we were deleting, detach it so the
    // retry no longer attempts to delete the already-deleted row.
    var entry = e.StateEntries.FirstOrDefault();
    if (entry != null && entry.Entity is DeletingObject)
        myContext.Detach(entry.Entity);
    myContext.SaveChanges();
}
Give that a whirl and see how you get on - not tested/compiled or anything - I've used DeletingObject as the type of the entity you're trying to delete. Substitute for your entity type.