Django - SQL bulk get_or_create possible?

I am using get_or_create to insert objects into the database, but the problem is that doing 1000 at once takes too long.
I tried bulk_create, but it doesn't provide the functionality I need (it creates duplicates, ignores unique values, and doesn't trigger the post_save signals I need).
Is it even possible to do get_or_create in bulk via a customized SQL query?
Here is my example code:
related_data = json.loads(urllib2.urlopen(final_url).read())
for item in related_data:
    kw = item['keyword']
    e, c = KW.objects.get_or_create(KWuser=kw, author=author)
    e.project.add(id)
    # Add m2m to parent project
related_data contains 1000 rows looking like this:
[{"cmp":0,"ams":3350000,"cpc":0.71,"keyword":"apple."},
{"cmp":0.01,"ams":3350000,"cpc":1.54,"keyword":"apple -10810"}......]
The KW model also sends a signal that I use to create another parent model:
@receiver(post_save, sender=KW)
def grepw(sender, **kwargs):
    if kwargs.get('created', False):
        id = kwargs['instance'].id
        kww = kwargs['instance'].KWuser
        # KeyO
        a, b = KeyO.objects.get_or_create(defaults={'keyword': kww}, keyword__iexact=kww)
        KW.objects.filter(id=id).update(KWF=a.id)
This works, but as you can imagine, doing thousands of rows at once takes a long time and even crashes my tiny server. What bulk options do I have?

As of Django 2.2, bulk_create has an ignore_conflicts flag. Per the docs:
On databases that support it (all but Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values.
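For example, assuming the KW model from the question and a unique constraint covering (KWuser, author) so that duplicates actually conflict, a minimal sketch could look like this:
KW.objects.bulk_create(
    [KW(KWuser=item['keyword'], author=author) for item in related_data],
    ignore_conflicts=True,  # skip rows that violate the unique constraint instead of raising
)
Note that bulk_create bypasses save(), so it still won't fire the post_save signal the question relies on; the KeyO rows would have to be created separately.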

This post may be of use to you:
stackoverflow.com/questions/3395236/aggregating-saves-in-django
Note that the answer recommends using the commit_on_success decorator, which is deprecated. It has been replaced by the transaction.atomic decorator. Documentation is here:
transactions
from django.db import transaction

@transaction.atomic
def lot_of_saves(queryset):
    for item in queryset:
        modify_item(item)
        item.save()
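Applied to the loop from the question, a rough sketch might look like this (import_keywords and project_id are hypothetical names; the model and fields come from the question):
from django.db import transaction

@transaction.atomic
def import_keywords(related_data, author, project_id):
    # One transaction for the whole batch instead of an implicit one per get_or_create
    for item in related_data:
        e, created = KW.objects.get_or_create(KWuser=item['keyword'], author=author)
        e.project.add(project_id)
This won't reduce the number of queries, but batching them into a single transaction usually cuts the per-row commit overhead considerably.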

If I understand correctly, "get_or_create" means SELECT or INSERT on the Postgres side.
You have a table with a UNIQUE constraint or index and a large number of rows to either INSERT (if not there yet), getting the newly created ID, or otherwise SELECT the ID of the existing row. Not as simple as it may seem from the outside. With concurrent write load, the matter is even more complicated.
And there are various details that need to be defined (exactly how to handle conflicts):
How to use RETURNING with ON CONFLICT in PostgreSQL?
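As a rough sketch of that idea from Django (assuming PostgreSQL 9.5+ and a unique constraint on the two columns; the table and column names below are placeholders, not the exact names Django would generate):
from django.db import connection

def bulk_get_or_create_ids(pairs):
    """pairs is a list of (keyword, author_id) tuples. Returns the ids of all
    matching rows, whether they were just inserted or already existed."""
    placeholders = ", ".join(["(%s, %s)"] * len(pairs))
    params = [value for pair in pairs for value in pair]
    sql = f"""
        INSERT INTO myapp_kw (kwuser, author_id)      -- placeholder table/column names
        VALUES {placeholders}
        ON CONFLICT (kwuser, author_id) DO UPDATE
            SET kwuser = EXCLUDED.kwuser              -- no-op update so existing rows are returned too
        RETURNING id
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, params)
        return [row[0] for row in cursor.fetchall()]
Like bulk_create, this bypasses the ORM entirely, so no post_save signals are sent.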

Related

How to update and select records in the same sql query

As a typical scenario in any prod environment, we have multiple nodes which fetch and process items from the database (Oracle).
We want to make sure that each node fetches a unique set of items from the database and acts on it. To make this possible, we are looking at whether it is possible to update the records' status (e.g., from Idle to In-Process) and have the same update query return the records it updated. This way, every node will act on its own set of records and not interfere with the others' sets.
We want to avoid PL/SQL for maintenance reasons. We tried "select for update", but in a few cases it led to database locks being held for long periods of time.
Any suggestions on how to achieve this through plain SQL or Hibernate (since we have the Hibernate option available as well)?
A couple of thoughts on this. First up in Oracle you can use the RETURNING clause as part of an update statement to return select columns (such as the primary key) from the table being updated into a collection. But this method does require PL/SQL to work since you need to work with collections, although BULK transactions will mitigate some of the drawbacks of using PL/SQL.
Another option would be to add a column to your table where you can indicate which node is processing the record(s), similar to your idea of a status column indicating Idle, or Processing statuses. This one would be NULL for not being handled, or a value uniquely identifying the node or process working on the record.
A little extra research led to this post here on Stack Overflow about using Oracle's RETURNING INTO statement with Java. It also leads right back to Oracle's own documentation on its DML Returning feature as supported by Java.
Finally, we were able to find a solution to our problem. Our problem statement: claim the top 100 items from the list, ordered by creation time. So the logic we wanted to apply was FIFO-based: each node picks up the top 100 items from the database and starts processing them. This way each node works on its own set of items, without overlapping with the others.
We achieved this by creating a TYPE in the Oracle database, and then used Hibernate to claim the items and store the claimed items temporarily in that TYPE. Here is the code:
create type TMP_TYPE as table of VARCHAR2(1000);
// Hibernate code
String query = "BEGIN UPDATE OUR_TABLE SET OUR_TABLE_STATUS = 'IP' WHERE OUR_TABLE_STATUS = 'ID' AND ID_OUR_TABLE IN (SELECT ID_OUR_TABLE FROM (SELECT ID_OUR_TABLE FROM OUR_TABLE ORDER BY AGEING_SINCE ASC ) ) AND ROWNUM < 101 RETURNING UUID BULK COLLECT INTO ?;END;";
Connection connection = getSession().connection();
CallableStatement cStmt = connection.prepareCall(query);
cStmt.registerOutParameter(1, Types.ARRAY, "TMP_TYPE");
cStmt.execute();
String[] updateBulkCollectArr = (String[]) (cStmt.getArray(1).getArray());
Got the idea from here: Oracle Type and Bulk Collect
Thanks @Sentinel

Reference an arbitrary row and field in another table

Is there any way (a data type, inheritance, ...) to implement something like the following in PostgreSQL:
CREATE TABLE log (
    datareferenced table_row_column_reference,
    logged boolean
);
The referenced data may be any row/field in the database. My objective is to implement something like this without using a procedural language or implementing it in a higher layer, using only a relational approach and without modifying the rest of the tables. Another desired feature is referential integrity, for example:
-- Table foo (id, field1, field2, fieldn)
-- ('bar', '2014-01-01', 4.33, Null)
-- Table log (datareferenced, logged)
-- ({table foo -> id:'bar' -> field2 } <=> 4.33, True)
DELETE FROM foo where id='bar';
-- as a result, on cascade, both rows are deleted.
I have an application built on an MVC pattern. The logic is written in Python. The application is a management tool and very data intensive. My goal is to implement a module that can store additional information about every piece of data present in the database. For example, a client has a series of attributes (name, address, phone, email, ...) across multiple tables, and I want the app to be able to store metadata for every record in the whole database. A piece of metadata could be the last modification time, a user flag, etc.
I have implemented the metadata model (in Postgres), its mapping to objects, and a partial API. But the part that is left is the most important: the glue. My plan B is to create that glue in the data mapping layer as a module. Something like this:
address = person.addresses[0]
address.saveMetadata('foo', 'bar')

# in the superclass of Address
def saveMetadata(self, code, value):
    self.mapper.metadata_adapter.save(self, code, value)

# in the metadata adapter class:
def save(self, entity, code, value):
    sql = """update metadata_values set value=%s
             where code=%s and idmetadata=
                 (select id from metadata_rels mr
                  where mr.schema=%s and mr.table=%s and
                        mr.field=%s and mr.recordpk=%s)""" % (
        value, code,
        self.class2data[entity.__class__]["schema"],
        self.class2data[entity.__class__]["table"],
        self.class2data[entity.__class__]["field"],
        entity.id)
    self.mapper.execute(sql)

def read(self, entity, code):
    sql = """select mv.value
             from metadata_values mv
             join metadata_rels mr on mv.idmetadata = mr.id
             where mv.code=%s and mr.schema=%s and mr.table=%s and
                   mr.field=%s and mr.recordpk=%s""" % (
        code,
        self.class2data[entity.__class__]["schema"],
        self.class2data[entity.__class__]["table"],
        self.class2data[entity.__class__]["field"],
        entity.id)
    return self.mapper.execute(sql)
But this would add overhead between Python and PostgreSQL, complicate the Python logic, and using PL and triggers may be very laborious and bug-prone. That is why I'm looking at doing the same at the database level.
No, there's nothing like that in PostgreSQL.
You could build triggers yourself to do it, probably using a composite type. But you've said (for some reason) you don't want to use PL/PgSQL, so you've ruled that out. Getting RI triggers right is quite hard, though, and you must apply a trigger to the referencing and referenced ends.
Frankly, this seems like a square peg, round hole kind of problem. Are you sure PostgreSQL is the right choice for this application?
Describe your needs and goal in context. Why do you want this? What problem are you trying to solve? Maybe there's a better way to approach the same problem one step back...

Rails/Active Record .save! efficiency question

I'm new to Rails/Ruby (using Rails 3 and Ruby 1.9.2), and am trying to get rid of some unnecessary queries being executed.
When I'm running an each do:
apples.to_a.each do |apple|
  new_apple = apple.clone
  new_apple.save!
end
and I check the SQL log, I see three select statements followed by one insert statement. The select statements seem completely unnecessary. For example, they're something like:
SELECT Fruit.* from Fruits where Fruit.ID = 5 LIMIT 1;
SELECT Color.* from Colors where Color.ID = 6 LIMIT 1;
SELECT TreeType.* from TreeTypes where TreeType.ID = 7 LIMIT 1;
INSERT into Apples (Fruit_id, color_id, treetype_id) values (6, 7, 8) RETURNING "id";
Seemingly, this wouldn't take much time, but when I've got 70k inserts to run, I'm betting those three selects per insert will add up to a decent amount of time.
So I'm wondering the following:
Is this typical of ActiveRecord/Rails .save! method, or did the previous developer add some sort of custom code?
Would those three select statements, being executed for each item, cause a noticeable amount of extra time?
If it is built into Rails/ActiveRecord, can it easily be bypassed if that would make things run more efficiently?
You must be validating your associations on save for such a thing to occur, something like this:
class Apple < ActiveRecord::Base
  validates :fruit,
    :presence => true
end
In order to validate that relationship, the associated record must be loaded, and this needs to happen for each validation individually, for each record in turn. That's the standard behavior of save!
You could save without validations if you feel like living dangerously:
apples.to_a.each do |apple|
  new_apple = apple.clone
  new_apple.save(:validate => false)
end
The better approach is to manipulate the records directly in SQL by doing a mass insert, if your RDBMS supports it. For instance, MySQL will let you insert thousands of rows with one INSERT call. You can usually do this by making use of the Apple.connection access layer, which allows you to make arbitrary SQL calls with things like execute.
I'm guessing that there is a before_save callback (EDIT: or a validation, as suggested above) being called that looks up the color and type of the fruit and stores them with the rest of the attributes when the fruit is saved - in which case these lookups are necessary ...
Normally I wouldn't expect activerecord to do unnecessary lookups - though that does not mean it is always efficient ...

Rails way to reset seed on id field

I have found the "pure SQL" answers to this question. Is there a way, in Rails, to reset the id field for a specific table?
Why do I want to do this? Because I have tables with constantly moving data - rarely more than 100 rows, but always different. It is up to 25k now, and there's just no point in that. I intend on using a scheduler internal to the Rails app (rufus-scheduler) to run the id field reset monthly or so.
You never mentioned what DBMS you're using. If it is PostgreSQL, the ActiveRecord postgres adapter has a reset_pk_sequence! method that you could use:
ActiveRecord::Base.connection.reset_pk_sequence!('table_name')
I came up with a solution based on hgimenez's answer and this other one.
Since I usually work with either SQLite or PostgreSQL, I've only developed for those; but extending it to, say, MySQL shouldn't be too troublesome.
Put this inside lib/ and require it on an initializer:
# lib/active_record/add_reset_pk_sequence_to_base.rb
module ActiveRecord
  class Base
    def self.reset_pk_sequence
      case ActiveRecord::Base.connection.adapter_name
      when 'SQLite'
        new_max = maximum(primary_key) || 0
        update_seq_sql = "update sqlite_sequence set seq = #{new_max} where name = '#{table_name}';"
        ActiveRecord::Base.connection.execute(update_seq_sql)
      when 'PostgreSQL'
        ActiveRecord::Base.connection.reset_pk_sequence!(table_name)
      else
        raise "Task not implemented for this DB adapter"
      end
    end
  end
end
Usage:
Client.count # 10
Client.destroy_all
Client.reset_pk_sequence
Client.create(:name => 'Peter') # this client will have id=1
EDIT: Since the most usual case in which you will want to do this is after clearing a database table, I recommend giving a look to database_cleaner. It handles the ID resetting automatically. You can tell it to delete just selected tables like this:
DatabaseCleaner.clean_with(:truncation, :only => %w[clients employees])
I assume you don't care about the data:
def self.truncate!
  connection.execute("truncate table #{quoted_table_name}")
end
Or if you do, but not too much (there is a slice of time where the data only exists in memory):
def self.truncate_preserving_data!
  data = all.map(&:clone).each { |r| raise "Record would not be able to be saved" unless r.valid? }
  connection.execute("truncate table #{quoted_table_name}")
  data.each(&:save)
end
This will give new records, with the same attributes, but id's starting at 1.
Anything with a belongs_to pointing at this table could get screwy.
Based on @hgmnz's answer, I made this method that will set the sequence to any value you like... (Only tested with the Postgres adapter.)
# change the database sequence to force the next record to have a given id
def set_next_id table_name, next_id
  connection = ActiveRecord::Base.connection
  def connection.set_next_id table, next_id
    pk, sequence = pk_and_sequence_for(table)
    quoted_sequence = quote_table_name(sequence)
    select_value <<-end_sql, 'SCHEMA'
      SELECT setval('#{quoted_sequence}', #{next_id}, false)
    end_sql
  end
  connection.set_next_id(table_name, next_id)
end
One problem is that these kinds of fields are implemented differently for different databases- sequences, auto-increments, etc.
You can always drop and re-add the table.
No, there is no such thing in Rails. If you need nice ids to show the users, then store them in a separate table and reuse them.
You could only do this in rails if the _ids are being set by rails. As long as the _ids are being set by your database, you won't be able to control them without using SQL.
Side note: I guess using rails to regularly call a SQL procedure that resets or drops and recreates a sequence wouldn't be a purely SQL solution, but I don't think that is what you're asking...
EDIT:
Disclaimer: I don't know much about rails.
From the SQL perspective, if you have a table with columns id, first_name, last_name and you usually run insert into table (first_name, last_name) values ('bob', 'smith'), you can just change your queries to insert into table (id, first_name, last_name) values ([variable set by rails], 'bob', 'smith'). This way, the id is set by a variable instead of being automatically set by SQL. At that point, Rails has entire control over what the ids are (although if it is a PK you need to make sure you don't use the same value while it's still in there).
If you are going to leave the assignment up to the database, you have to have rails run (on whatever time schedule) something like:
DROP SEQUENCE MY_SEQ;
CREATE SEQUENCE MY_SEQ START WITH 1 INCREMENT BY 1 MINVALUE 1;
to whatever sequence controls the ids for your table. This will get rid of the current sequence and create a new one. This is the simplest way I know of to 'reset' a sequence.
The Rails way for e.g. MySQL, but it loses all data in the users table:
ActiveRecord::Base.connection.execute('TRUNCATE TABLE users;')
Maybe it helps someone ;)
There are CounterCache methods:
https://www.rubydoc.info/docs/rails/4.1.7/ActiveRecord/CounterCache/ClassMethods
I used Article.reset_counters Article.all.length - 1 and it seemed to work.

Django query for large number of relationships

I have Django models set up in the following manner:
model A has a one-to-many relationship to model B
each record in A has between 3,000 and 15,000 records in B
What is the best way to construct a query that will retrieve the newest (greatest pk) record in B that corresponds to a record in A for each record in A? Is this something that I must use SQL for in lieu of the Django ORM?
Create a helper function for safely extracting the 'top' item from any queryset. I use this all over the place in my own Django apps.
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    # Extracts a single element collection w/ top item
    result = queryset[0:1]
    # Return that element or None if there weren't any matches
    return result[0] if result else None
This uses a bit of a trick w/ the slice operator to add a limit clause onto your SQL.
Now use this function anywhere you need to get the 'top' item of a query set. In this case, you want to get the top B item for a given A where the B's are sorted by descending pk, as such:
latest = top_or_none(B.objects.filter(a=my_a).order_by('-pk'))
There's also the recently added 'Max' function in Django Aggregation which could help you get the max pk, but I don't like that solution in this case since it adds complexity.
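For completeness, a minimal sketch of that aggregation approach might look like this (the a foreign-key field name is carried over from the example above):
from django.db.models import Max

# Two simple queries: find the greatest pk among the related B rows, then fetch it.
latest_pk = B.objects.filter(a=my_a).aggregate(Max('pk'))['pk__max']
latest = B.objects.get(pk=latest_pk) if latest_pk is not None else None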
P.S. I don't really like relying on the 'pk' field for this type of query, as some RDBMSs don't guarantee that sequential pks match logical creation order. If I have a table that I know I will need to query in this fashion, I usually have my own 'creation' datetime column that I can use to order by instead of pk.
Edit based on comment:
If you'd rather use queryset[0], you can modify the 'top_or_none' function thusly:
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    try:
        return queryset[0]
    except IndexError:
        return None
I didn't propose this initially because I was under the impression that queryset[0] would pull back the entire result set, then take the 0th item. Apparently Django adds a 'LIMIT 1' in this scenario too, so it's a safe alternative to my slicing version.
Edit 2
Of course you can also take advantage of Django's related manager construct here and build the queryset through your 'A' object, depending on your preference:
latest = top_or_none(my_a.b_set.order_by('-pk'))
I don't think the Django ORM can do this (but I've been pleasantly surprised before...). If there's a reasonable number of A records (or if you're paging), I'd just add a method to the A model that returns this 'newest' B record. If you want to get a lot of A records, each with its own newest B, I'd drop to SQL.
Remember that no matter which route you take, you'll need a suitable composite index on the B table, and perhaps a default ordering = ('a_fk', '-id') in the Meta subclass.
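A minimal sketch of that model-method idea (assuming B's foreign key back to A uses the default b_set reverse accessor, as in the answer above):
from django.db import models

class A(models.Model):
    # ... existing fields ...

    def newest_b(self):
        """Return the related B with the greatest pk, or None if there are none."""
        return self.b_set.order_by('-pk').first()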