Implementing a "soft delete" system using sqlalchemy

Implementing a "soft delete" system using sqlalchemy - orm

We are creating a service for an app using tornado and sqlalchemy. The application is written in django and uses a "soft delete mechanism". What that means is that there was no deletion in the underlying mysql tables. To mark a row as deleted we simply set the attributed "delete" as True. However, in the service we are using sqlalchemy. Initially, we started to add check for delete in the queries made through sqlalchemy itself like:
customers = db.query(Customer).filter(not_(Customer.deleted)).all()
However this leads to a lot of potential bugs because developers tend to miss the check for deleted in there queries. Hence we decided to override the default querying with our query class that does a "pre-filter":
class SafeDeleteMixin(Query):
def __iter__(self):
return Query.__iter__(self.deleted_filter())
def from_self(self, *ent):
# override from_self() to automatically apply
# the criterion too. this works with count() and
# others.
return Query.from_self(self.deleted_filter(), *ent)
def deleted_filter(self):
mzero = self._mapper_zero()
if mzero is not None:
crit = mzero.class_.deleted == False
return self.enable_assertions(False).filter(crit)
else:
return self
This inspired from a solution on sqlalchemy docs here:
https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/PreFilteredQuery
However, we are still facing issues, like in cases where we are doing filter and update together and using this query class as defined above the update does not respect the criterion of delete=False when applying the filter for update.
db = CustomSession(with_deleted=False)()
result = db.query(Customer).filter(Customer.id == customer_id).update({Customer.last_active_time: last_active_time })
How can I implement the "soft-delete" feature in sqlalchemy

I've done something similar here. We did it a bit differently, we made a service layer that all database access goes through, kind of like a controller, but only for db access, we called it a ResourceManager, and it's heavily inspired by "Domain Driven Design" (great book, invaluable for using SQLAlchemy well). A derived ResourceManager exists for each aggregate root, ie. each resource class you want to get at things through. (Though sometimes for really simple ResourceManagers, the derived manager class itself is generated dynamically) It has a method that gives out your base query, and that base query gets filtered for your soft delete before it's handed out. From then on, you can add to that query generatively for filtering, and finally call it with query.one() or first() or all() or count(). Note, there is one gotcha I encountered for this kind of generative query handling, you can hang yourself if you join a table too many times. In some cases for filtering we had to keep track of which tables had already been joined. If your delete filter is off the primary table, just filter that first, and you can join willy nilly after that.
so something like this:
class ResourceManager(object):
# these will get filled in by the derived class
# you could use ABC tools if you want, we don't bother
model_class = None
serializer_class = None
# the resource manager gets instantiated once per request
# and passed the current requests SQAlchemy session
def __init__(self, dbsession):
self.dbs = dbsession
# hand out base query, assumes we have a boolean 'deleted' column
#property
def query(self):
return self.dbs(self.model_class).filter(
getattr(self.model_class, 'deleted')==False)
class UserManager(ResourceManager):
model_class = User
# some client code might look this
dbs = SomeSessionFactoryIHave()
user_manager = UserManager(dbs)
users = user_manager.query.filter_by(name_last="Duncan").first()
Now as long as I always start off by going through a ResourceManager, which has other benefits too (see aforementioned book), I know my query is pre-filtered. This has worked very well for us on a current project that has soft-delete and quite an extensive and thorny db schema.
hth!

I would create a function
def customer_query():
return db.session.query(Customer).filter(Customer.deleted == False)
I used query functions to not forget default flags, to set flags based on user permission, filter using joins etc, so that these things wont be copy-pasted and forgotten at various places.

Related

How do I construct complex django query statements?

I am not very familiar with SQL and so trying to make more complex calls via Django ORM is stumping me. I have a Printer model that spawns Jobs and the jobs receive statuses via a State model with a foreign key relationship to it. The jobs status is determined by the most recent state object associated with it. This is so I can track the history of states of jobs throughout its life cycle. I want to be able to determine which Printers have successful jobs associated with them.
from django.db import models
class Printer(models.Model):
label = models.CharField(max_length=120)
class Job(models.Model):
label = models.CharField(max_length=120)
printer = models.ForeignKey(
Printer,
related_name='jobs',
related_query_name='job'
)
def set_state(self, state):
State.objects.create(state=state, job=self)
#property
def current_state(self):
return self.states.latest('created_at').state
class State(models.Model):
created_at = models.DateTimeField(auto_now_add=True)
state = models.SmallIntegerField()
job = models.ForeignKey(
Job,
related_name='states',
related_query_name='state'
)
I need a QuerySet of Printer objects that have at least one related job with its most recent (latest) state object which has State.state == '200'. Is there a way to construct a compound call which will achieve this using the database and not having to pull in all Job objects to run python iterations on? Perhaps a custom manager? I've been reading posts about Subquery and Annotation and OuterRef, but these ideas are just not sinking in in a way that is showing me a path. I need them explained like I'm 5. They are very unpythonic statements..
The naive python way to describe what I want:
printers = []
for printer in Printer.objects.all():
for job in printer.jobs.objects.all():
if job.states.latest().state == '200':
printers.append(printer)
printers = list(set(printers))
But with the least number of DB round trips possible. Help!
edit: further question, what's the best way to filter Jobs based on the current state. Since Job.current_state is a calculated property it cannot be used in a QuerySet filter. But, again, I don't want to have to pull in all Job objects.

Took about two days to sink in, but I think I have an answer using annotation and Subqueries:
state_sq = State.objects.filter(job=OuterRef('pk')).order_by('-created_at')
successful_jobs = Job.objects.annotate(
latest_state=Subquery(state_sq.values('state')[:1])
).filter(printer=OuterRef('pk'), latest_state='200')
printers_with_successful_jobs = Printer.objects.annotate(
has_success_jobs=Exists(successful_jobs)
).filter(has_success_jobs=True)
And further, I constructed a custom manager to return latest_state by default.
class JobManager(models.Manager):
def get_queryset(self):
state_sq = State.objects.filter(
object_id=OuterRef('pk')
).order_by('-created_at')
return super().get_queryset().annotate(
latest_state=Subquery(state_sq.values('state')[:1])
)
class Job(models.Model):
objects = JobManager()
...

Managing relationships in Laravel, adhering to the repository pattern

While creating an app in Laravel 4 after reading T. Otwell's book on good design patterns in Laravel I found myself creating repositories for every table on the application.
I ended up with the following table structure:
Students: id, name
Courses: id, name, teacher_id
Teachers: id, name
Assignments: id, name, course_id
Scores (acts as a pivot between students and assignments): student_id, assignment_id, scores
I have repository classes with find, create, update and delete methods for all of these tables. Each repository has an Eloquent model which interacts with the database. Relationships are defined in the model per Laravel's documentation: http://laravel.com/docs/eloquent#relationships.
When creating a new course, all I do is calling the create method on the Course Repository. That course has assignments, so when creating one, I also want to create an entry in the score's table for each student in the course. I do this through the Assignment Repository. This implies the assignment repository communicates with two Eloquent models, with the Assignment and Student model.
My question is: as this app will probably grow in size and more relationships will be introduced, is it good practice to communicate with different Eloquent models in repositories or should this be done using other repositories instead (I mean calling other repositories from the Assignment repository) or should it be done in the Eloquent models all together?
Also, is it good practice to use the scores table as a pivot between assignments and students or should it be done somewhere else?

I am finishing up a large project using Laravel 4 and had to answer all of the questions you are asking right now. After reading all of the available Laravel books over at Leanpub, and tons of Googling, I came up with the following structure.
One Eloquent Model class per datable table
One Repository class per Eloquent Model
A Service class that may communicate between multiple Repository classes.
So let's say I'm building a movie database. I would have at least the following following Eloquent Model classes:
Movie
Studio
Director
Actor
Review
A repository class would encapsulate each Eloquent Model class and be responsible for CRUD operations on the database. The repository classes might look like this:
MovieRepository
StudioRepository
DirectorRepository
ActorRepository
ReviewRepository
Each repository class would extend a BaseRepository class which implements the following interface:
interface BaseRepositoryInterface
{
public function errors();
public function all(array $related = null);
public function get($id, array $related = null);
public function getWhere($column, $value, array $related = null);
public function getRecent($limit, array $related = null);
public function create(array $data);
public function update(array $data);
public function delete($id);
public function deleteWhere($column, $value);
}
A Service class is used to glue multiple repositories together and contains the real "business logic" of the application. Controllers only communicate with Service classes for Create, Update and Delete actions.
So when I want to create a new Movie record in the database, my MovieController class might have the following methods:
public function __construct(MovieRepositoryInterface $movieRepository, MovieServiceInterface $movieService)
{
$this->movieRepository = $movieRepository;
$this->movieService = $movieService;
}
public function postCreate()
{
if( ! $this->movieService->create(Input::all()))
{
return Redirect::back()->withErrors($this->movieService->errors())->withInput();
}
// New movie was saved successfully. Do whatever you need to do here.
}
It's up to you to determine how you POST data to your controllers, but let's say the data returned by Input::all() in the postCreate() method looks something like this:
$data = array(
'movie' => array(
'title' => 'Iron Eagle',
'year' => '1986',
'synopsis' => 'When Doug\'s father, an Air Force Pilot, is shot down by MiGs belonging to a radical Middle Eastern state, no one seems able to get him out. Doug finds Chappy, an Air Force Colonel who is intrigued by the idea of sending in two fighters piloted by himself and Doug to rescue Doug\'s father after bombing the MiG base.'
),
'actors' => array(
0 => 'Louis Gossett Jr.',
1 => 'Jason Gedrick',
2 => 'Larry B. Scott'
),
'director' => 'Sidney J. Furie',
'studio' => 'TriStar Pictures'
)
Since the MovieRepository shouldn't know how to create Actor, Director or Studio records in the database, we'll use our MovieService class, which might look something like this:
public function __construct(MovieRepositoryInterface $movieRepository, ActorRepositoryInterface $actorRepository, DirectorRepositoryInterface $directorRepository, StudioRepositoryInterface $studioRepository)
{
$this->movieRepository = $movieRepository;
$this->actorRepository = $actorRepository;
$this->directorRepository = $directorRepository;
$this->studioRepository = $studioRepository;
}
public function create(array $input)
{
$movieData = $input['movie'];
$actorsData = $input['actors'];
$directorData = $input['director'];
$studioData = $input['studio'];
// In a more complete example you would probably want to implement database transactions and perform input validation using the Laravel Validator class here.
// Create the new movie record
$movie = $this->movieRepository->create($movieData);
// Create the new actor records and associate them with the movie record
foreach($actors as $actor)
{
$actorModel = $this->actorRepository->create($actor);
$movie->actors()->save($actorModel);
}
// Create the director record and associate it with the movie record
$director = $this->directorRepository->create($directorData);
$director->movies()->associate($movie);
// Create the studio record and associate it with the movie record
$studio = $this->studioRepository->create($studioData);
$studio->movies()->associate($movie);
// Assume everything worked. In the real world you'll need to implement checks.
return true;
}
So what we're left with is a nice, sensible separation of concerns. Repositories are only aware of the Eloquent model they insert and retrieve from the database. Controllers don't care about repositories, they just hand off the data they collect from the user and pass it to the appropriate service. The service doesn't care how the data it receives is saved to the database, it just hands off the relevant data it was given by the controller to the appropriate repositories.

Keep in mind you're asking for opinions :D
Here's mine:
TL;DR: Yes, that's fine.
You're doing fine!
I do exactly what you are doing often and find it works great.
I often, however, organize repositories around business logic instead of having a repo-per-table. This is useful as it's a point of view centered around how your application should solve your "business problem".
A Course is a "entity", with attributes (title, id, etc) and even other entities (Assignments, which have their own attributes and possibly entities).
Your "Course" repository should be able to return a Course and the Courses' attributes/Assignments (including Assignment).
You can accomplish that with Eloquent, luckily.
(I often end up with a repository per table, but some repositories are used much more than others, and so have many more methods. Your "courses" repository may be much more full-featured than your Assignments repository, for instance, if your application centers more around Courses and less about a Courses' collection of Assignments).
The tricky part
I often use repositories inside of my repositories in order to do some database actions.
Any repository which implements Eloquent in order to handle data will likely return Eloquent models. In that light, it's fine if your Course model uses built-in relationships in order to retrieve or save Assignments (or any other use case). Our "implementation" is built around Eloquent.
From a practical point of view, this makes sense. We're unlikely to change data sources to something Eloquent can't handle (to a non-sql data source).
ORMS
The trickiest part of this setup, for me at least, is determing if Eloquent is actually helping or harming us. ORMs are a tricky subject, because while they help us greatly from a practical point of view, they also couple your "business logic entities" code with the code doing the data retrieval.
This sort of muddles up whether your repository's responsibility is actually for handling data or handling the retrieval / update of entities (business domain entities).
Furthermore, they act as the very objects you pass to your views. If you later have to get away from using Eloquent models in a repository, you'll need to make sure the variables passed to your views behave in the same way or have the same methods available, otherwise changing your data sources will roll into changing your views, and you've (partially) lost the purpose of abstracting your logic out to repositories in the first place - the maintainability of your project goes down as.
Anyway, these are somewhat incomplete thoughts. They are, as stated, merely my opinion, which happens to be the result of reading Domain Driven Design and watching videos like "uncle bob's" keynote at Ruby Midwest within the last year.

I like to think of it in terms of what my code is doing and what it is responsible for, rather than "right or wrong". This is how I break apart my responsibilities:
Controllers are the HTTP layer and route requests through to the underlying apis (aka, it controls the flow)
Models represent the database schema, and tell the application what the data looks like, what relationships it may have, as well as any global attributes that may be necessary (such as a name method for returning a concatenated first and last name)
Repositories represent the more complex queries and interactions with the models (I don't do any queries on model methods).
Search engines - classes that help me build complex search queries.
With this in mind, it makes sense every time to use a repository (whether you create interfaces.etc. is a whole other topic). I like this approach, because it means I know exactly where to go when I'm needing to do certain work.
I also tend to build a base repository, usually an abstract class which defines the main defaults - basically CRUD operations, and then each child can just extend and add methods as necessary, or overload the defaults. Injecting your model also helps this pattern to be quite robust.

Think of Repositories as a consistent filing cabinet of your data (not just your ORMs). The idea is that you want to grab data in a consistent simple to use API.
If you find yourself just doing Model::all(), Model::find(), Model::create() you probably won't benefit much from abstracting away a repository. On the other hand, if you want to do a bit more business logic to your queries or actions, you may want to create a repository to make an easier to use API for dealing with data.
I think you were asking if a repository would be the best way to deal with some of the more verbose syntax required to connect related models. Depending on the situation, there are a few things I may do:
Hanging a new child model off of a parent model (one-one or one-many), I would add a method to the child repository something like createWithParent($attributes, $parentModelInstance) and this would just add the $parentModelInstance->id into the parent_id field of the attributes and call create.
Attaching a many-many relationship, I actually create functions on the models so that I can run $instance->attachChild($childInstance). Note that this requires existing elements on both side.
Creating related models in one run, I create something that I call a Gateway (it may be a bit off from Fowler's definitions). Way I can call $gateway->createParentAndChild($parentAttributes, $childAttributes) instead of a bunch of logic that may change or that would complicate the logic that I have in a controller or command.

How can I speed my Entity Framework code?

My SQL and Entity Framework knowledge is a somewhat limited. In one Entity Framework (4) application, I notice it takes forever (about 2 minutes) to complete one of my method calls. The first queries do not take much time, but when I loop through the Entity Framework objects returned by the queries, even though I am only reading (not modifying) the data I supposedly got, it takes forever to complete the nested loops, even though there are only dozens of entries in each list and a few levels of looping.
I expect the example below could be re-written with a fancier query that could probably include all of the filtering I am doing in my loops with some SQL words I don't really know how to use, so if someone could show me what the equivalent SQL expression would be, that would be extremely educational to me and probably solve my current performance problem.
Moreover, since other parts of this and other applications I develop often want to do more complex computations on SQL data, I would also like to know a good way to retrieve data from Entity Framework to local memory objects that do not have huge delays in reading them. In my LINQ-to-SQL project there was a similar performance problem, and I solved it by refactoring the whole application to load all SQL data into parallel objects in RAM, which I had to write myself, and I wonder if there isn't a better way to either tell Entity Framework to not keep doing whatever high-latency communication it is doing, or to load into local RAM objects.
In the example below, the code gets a list of food menu items for a member (i.e. a person) on a certain date via a SQL query, and then I use other queries and loops to filter out the menu items on two criteria: 1) If the member has a rating of zero for any group id which the recipe is a member of (a many-to-many relationship) and 2) If the member has a rating of zero for the recipe itself.
Example:
List<PFW_Member_MenuItem> MemberMenuForCookDate =
(from item in _myPfwEntities.PFW_Member_MenuItem
where item.MemberID == forMemberId
where item.CookDate == onCookDate
select item).ToList();
// Now filter out recipes in recipe groups rated zero by the member:
List<PFW_Member_Rating_RecipeGroup> ExcludedGroups =
(from grpRating in _myPfwEntities.PFW_Member_Rating_RecipeGroup
where grpRating.MemberID == forMemberId
where grpRating.Rating == 0
select grpRating).ToList();
foreach (PFW_Member_Rating_RecipeGroup grpToExclude in ExcludedGroups)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
PFW_Recipe rcp = GetRecipeById(rcpOnMenu.RecipeID);
foreach (PFW_RecipeGroup group in rcp.PFW_RecipeGroup)
{
if (group.RecipeGroupID == grpToExclude.RecipeGroupID)
{
rcpsToRemove.Add(rcpOnMenu);
break;
}
}
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}
// Now filter out recipes rated zero by the member:
List<PFW_Member_Rating_Recipe> ExcludedRecipes =
(from rcpRating in _myPfwEntities.PFW_Member_Rating_Recipe
where rcpRating.MemberID == forMemberId
where rcpRating.Rating == 0
select rcpRating).ToList();
foreach (PFW_Member_Rating_Recipe rcpToExclude in ExcludedRecipes)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
if (rcpOnMenu.RecipeID == rcpToExclude.RecipeID)
rcpsToRemove.Add(rcpOnMenu);
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}

You can use EFProf http://www.hibernatingrhinos.com/products/EFProf to track see exactly what EF is sending to SQL. It can also show you how many queries you are sending and how many unique queries. It also provides you some analysis of each query (e.g. is it unbound etc). Entity Framework with its navigation properties, it is quite easy to not realize you are making a db request. When you are in a loop, and have a navigation property, you get in to the N + 1 problem.

You could use the Keyword Virtual on your List parts of your model if you are using code first to enable proxying, that way you will not have to get all the data back at once, only as you need it.
Also consider NoTracking for read only data
context.bigTable.MergeOption = MergeOption.NoTracking;

Django aggregate query

I have a model Page, which can have Posts on it. What I want to do is get every Page, plus the most recent Post on that page. If the Page has no Posts, I still want the page. (Sound familiar? This is a LEFT JOIN in SQL).
Here is what I currently have:
Page.objects.annotate(most_recent_post=Max('post__post_time'))
This only gets Pages, but it doesn't get Posts. How can I get the Posts as well?
Models:
class Page(models.Model):
name = models.CharField(max_length=50)
created = models.DateTimeField(auto_now_add = True)
enabled = models.BooleanField(default = True)
class Post(models.Model):
user = models.ForeignKey(User)
page = models.ForeignKey(Page)
post_time = models.DateTimeField(auto_now_add = True)

Depending on the relationship between the two, you should be able to follow the relationships quite easily, and increase performance by using select_related
Taking this:
class Page(models.Model):
...
class Post(models.Model):
page = ForeignKey(Page, ...)
You can follow the forward relationship (i.e. get all the posts and their associated pages) efficiently using select_related:
Post.objects.select_related('page').all()
This will result in only one (larger) query where all the page objects are prefetched.
In the reverse situation (like you have) where you want to get all pages and their associated posts, select_related won't work. See this,this and this question for more information about what you can do.

Probably your best bet is to use the techniques described in the django docs here: Following Links Backward.
After you do:
pages = Page.objects.annotate(most_recent_post=Max('post__post_time'))
posts = [page.post_set.filter(post_time=page.most_recent_post) for page in pages]
And then posts[0] should have the most recent post for pages[0] etc. I don't know if this is the most efficient solution, but this was the solution mentioned in another post about the lack of left joins in django.

You can create a database view that will contain all Page columns alongside with with necessary latest Post columns:
CREATE VIEW `testapp_pagewithrecentpost` AS
SELECT testapp_page.*, testapp_post.* -- I suggest as few post columns as possible here
FROM `testapp_page` LEFT JOIN `testapp_page`
ON test_page.id = test_post.page_id
AND test_post.post_time =
( SELECT MAX(test_post.post_time)
FROM test_post WHERE test_page.id = test_post.page_id );
Then you need to create a model with flag managed = False (so that manage.py sync won't break). You can also use inheritance from abstract Model to avoid column duplication:
class PageWithRecentPost(models.Model): # Or extend abstract BasePost ?
# Page columns goes here
# Post columns goes here
# We use LEFT JOIN, so all columns from the
# 'post' model will need blank=True, null=True
class Meta:
managed = False # Django will not handle creation/reset automatically
By doing that you can do what you initially wanted, so fetch from both tables in just one query:
pages_with_recent_post = PageWithRecentPost.objects.filter(...)
for page in pages_with_recent_post:
print page.name # Page column
print page.post_time # Post column
However this approach is not drawback free:
It's very DB engine-specific
You'll need to add VIEW creation SQL to your project
If your models are complex it's very likely that you'll need to resolve table column name clashes.
Model based on a database view will very likely be read-only (INSERT/UPDATE will fail).
It adds complexity to your project. Allowing for multiple queries is a definitely simpler solution.
Changes in Page/Post will require re-creating the view.

NHibernate Partial Update

Is there a way in NHibernate to start with an unproxied model
var m = new Model() { ID = 1 };
m.Name = "test";
//Model also has .LastName and .Age
Now save this model only updating Name without first selecting the model from the session?

If model has other properties then name, you need to initialize these with the original value in the database, unless they will be set to null.
You can use HQL update operations; I never tried it myself.
You could also use a native SQL statement. ("Update model set name ...").
Usually, this optimization is not needed. There are really rare cases where you need to avoid selecting the data, so writing this SQL statements are just a waste of time. You are using an ORM, this means: write your software object oriented! Unless you won't get much advantages from it.

What Stefan says looks like what you need. Please be aware that this is really an edge case and you should be happy with fully loading your entity unless you have some ultra-high-performance issues.
If you simply don't want to hit the database - try using caching - entity cache is very simple and efficient.
If your entity is a huge one - i.e. it contains a blob or something - think about splitting it in two (with many-to-one so that you can utilize lazy loading).

http://www.hibernate.org/hib_docs/nhibernate/html/mapping.html
dynamic-update (optional, defaults to
false): Specifies that UPDATE SQL
should be generated at runtime and
contain only those columns whose
values have changed.
Place dynamic-update on the class in the HBM.
var m = new Model() { ID = 1 };
m = session.Update(m); //attach m to the session.
m.Name = "test";
session.Save(m);

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas