What is the fastest way to send batch requests to a Postgres database using Golang? Each request contains 500-200000 rows.
Methods I know about are:
1. Using database/sql package's transaction Begin, Prepare, Commit.
2. Sending all data in one statement.
3. Sending a list of statements using sql.Exec() method.
Is there some other way to send batch requests without opening a connection for every statement? If not, which of these is the best?
This question is similar to the question at: Golang how do I batch sql statements with package database.sql
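For reference, method 1 as I currently use it looks roughly like this (a rough sketch; the items table and the lib/pq driver are just placeholders, and error handling is kept minimal):

```go
// Method 1: one connection, one transaction, one prepared statement reused per row.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // driver choice is just an example
)

func insertRows(db *sql.DB, rows [][2]int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := tx.Prepare("INSERT INTO items (a, b) VALUES ($1, $2)")
	if err != nil {
		tx.Rollback()
		return err
	}
	defer stmt.Close()

	for _, r := range rows {
		if _, err := stmt.Exec(r[0], r[1]); err != nil {
			tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := insertRows(db, [][2]int{{1, 2}, {3, 4}}); err != nil {
		log.Fatal(err)
	}
}
```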
There is a somewhat old depesz blog post on this. His programs are Perl scripts, but if you concentrate on the SQL they still apply. Anyway, from the DB perspective you can use COPY, or INSERT with many rows in the VALUES clause. It looks like around 20 rows per INSERT is a good choice, but it is worth testing in your case. If performance is the key factor, I would put around 2000-5000 rows per transaction. Also, from the DB perspective a transaction and a session are two separate things, so you can open one session and do many transactions in it.
For PostgreSQL, starting a new session per operation is a really bad idea - the DB spawns a new process for each session. One of the answers to the question you referenced covers this. So you open a connection, and then a transaction, as it should be done.
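To put that in Go for your case, here is a sketch of the COPY approach, assuming the lib/pq driver and its CopyIn helper (pgx's CopyFrom is the equivalent); the table and column names are placeholders, and it drops into the same setup as your sketch above:

```go
// COPY the rows over a single connection, inside a single transaction.
package batch

import (
	"database/sql"

	"github.com/lib/pq"
)

func copyRows(db *sql.DB, rows [][2]int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	// pq.CopyIn builds the "COPY items (a, b) FROM STDIN" statement.
	stmt, err := tx.Prepare(pq.CopyIn("items", "a", "b"))
	if err != nil {
		tx.Rollback()
		return err
	}
	for _, r := range rows {
		if _, err := stmt.Exec(r[0], r[1]); err != nil {
			tx.Rollback()
			return err
		}
	}
	// An Exec with no arguments flushes the buffered COPY data.
	if _, err := stmt.Exec(); err != nil {
		tx.Rollback()
		return err
	}
	if err := stmt.Close(); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```

For the multi-row VALUES variant the structure is the same: build one INSERT with a batch of value tuples per statement and commit every few thousand rows, as suggested above.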
I am currently implementing a server system which has both an SQL database and a Redis datastore. The data written to Redis heavily depends on the SQL data (cache, objects representing logic entities defined by a number of SQL models and their relationships).
While looking for an error handling methodology to wrap client requests, something similar to SQL's transaction commit & rollback (Redis doesn't support rollbacks), I thought of a mechanism which can serve this purpose and I'd appreciate input regarding it.
Basically, I intend to wrap (with before/after middleware) every client request with an SQL transaction and a Redis multi command (pipes commands until exec or discard command is invoked), and allow both transactions to occur only if the request was processed successfully.
The problem is that once you start a Redis multi command, you are not able to perform any reads/writes and actually use their values while processing a request. I reduced the problem to just reads, since depending on just-written values can be optimized out.
My (simple) solution: split the Redis connection into two - a writer and a reader. The writer connection will be the one initialized with the multi command and executed/discarded at the end. Of course, all writing will be performed through it, while reading is done using the reader (executed instantly).
The downside: as opposed to SQL, you can't rely on values written earlier in the same request (transaction). Then again, that is usually quite easy to work around.
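A minimal sketch of that split, assuming Go with database/sql and the go-redis client (the question does not name a stack, so the library and all names here are illustrative):

```go
// Sketch of the writer/reader split: an SQL transaction plus a buffered Redis
// pipeline that is only executed if the request handler succeeds.
// Library choice (go-redis) and all names are illustrative, not prescriptive.
package server

import (
	"context"
	"database/sql"

	"github.com/redis/go-redis/v9"
)

func handleRequest(ctx context.Context, db *sql.DB, rdb *redis.Client) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	writer := rdb.TxPipeline() // buffers commands until Exec/Discard

	// Reads go straight to rdb (the "reader"); writes go to the pipeline.
	if _, err := rdb.Get(ctx, "some:key").Result(); err != nil && err != redis.Nil {
		tx.Rollback()
		writer.Discard()
		return err
	}
	writer.Set(ctx, "some:key", "value", 0)

	if _, err := tx.Exec(`UPDATE accounts SET balance = balance - 1 WHERE id = $1`, 42); err != nil {
		tx.Rollback()
		writer.Discard()
		return err
	}

	// Apply both only if everything above succeeded. Note this is not truly
	// atomic across the two stores: the SQL commit can succeed and the Redis
	// Exec still fail, which has to be tolerated (e.g. by rebuilding the cache).
	if err := tx.Commit(); err != nil {
		writer.Discard()
		return err
	}
	_, err = writer.Exec(ctx)
	return err
}
```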
I have been reading the SQLite documentation and referencing code I have written previously, but I can't seem to find a definitive answer to what I imagine is a rather simple question.
I would like to execute many (separate) compiled statements within a transaction, but child threads may also be creating transactions or just executing statements at the same time and I would not want them included in this particular transaction. Currently, I have a single database handle that I share between all threads.
So, my question is:
1) Is it generally better to have some kind of semaphore around transactions, to ensure they will not clash with other statements being executed against the same database handle? I already marshal writes to avoid multithreading issues with SQLite (although with WAL it is now very hard to unsettle it at all).
2) Or are you expected to open multiple database connections and start/commit concurrent transactions one per connection?
Changes made in one database connection are invisible to all other database connections prior to commit.
So it seems a hybrid approach of having several connections open to the database provides adequate concurrency guarantees, trading off the expense of opening a new connection with the benefit of allowing multi-threaded write transactions.
A query sees all changes that are completed on the same database connection prior to the start of the query, regardless of whether or not those changes have been committed.
If changes occur on the same database connection after a query starts running but before the query completes, then it is undefined whether or not the query will see those changes.
If changes occur on the same database connection after a query starts running but before the query completes, then the query might return a changed row more than once, or it might return a row that was previously deleted.
For the purposes of the previous four items, two database connections that use the same shared cache and which enable PRAGMA read_uncommitted are considered to be the same database connection, not separate database connections.
Here is the SQLite information on isolation, which is exceptionally useful to read and understand for this problem.
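As an illustration of approach 2), here is a sketch using Go's database/sql with the mattn/go-sqlite3 driver (chosen only because a pooled connection per transaction makes the pattern explicit; the same idea applies to raw sqlite3 handles):

```go
// Sketch: each transaction gets its own connection, so concurrent work on
// other connections is not pulled into it. WAL mode lets readers proceed
// while a writer's transaction is open.
package sqlitedemo

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

func demo() error {
	// _journal_mode=WAL is a mattn/go-sqlite3 DSN option.
	db, err := sql.Open("sqlite3", "file:app.db?_journal_mode=WAL")
	if err != nil {
		return err
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS items(name TEXT)`); err != nil {
		return err
	}

	// db.Begin reserves a dedicated connection for this transaction.
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(`INSERT INTO items(name) VALUES (?)`, "widget"); err != nil {
		tx.Rollback()
		return err
	}

	// A read on another pooled connection does not see the uncommitted insert.
	var n int
	if err := db.QueryRow(`SELECT COUNT(*) FROM items`).Scan(&n); err != nil {
		tx.Rollback()
		return err
	}

	return tx.Commit()
}
```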
My multi-threaded Delphi application parses about 100k marketplace offers. Each worker thread writes the parsed data to a remote SQL Server. Currently each thread parses 3-4 offers per second, which means 10 threads fire about 35 update calls at SQL Server. Every second.
The idea is to implement optimized database writes - a sort of lazy bulk update. Each thread accumulates 20-30 parsed offers and then writes them to the database in a single pass. I assume that would be far more efficient than the current approach.
I would be happy to hear your general comments and suggestions, as well as anything that sheds some light on techniques for lazy/delayed/chunky writes from a Delphi app to a SQL Server database.
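The buffering pattern itself is language-agnostic; the sketch below is in Go rather than Delphi, purely to illustrate the accumulate-then-flush idea (the offers table, its columns, and the @p placeholder style are assumptions):

```go
// Each worker accumulates rows and flushes them as one multi-row INSERT
// once the buffer reaches a threshold.
package batcher

import (
	"database/sql"
	"fmt"
	"strings"
)

type offer struct {
	ID    int64
	Price float64
}

type batchWriter struct {
	db    *sql.DB
	buf   []offer
	limit int // e.g. 20-30, well under SQL Server's ~2100-parameter limit
}

func (w *batchWriter) Add(o offer) error {
	w.buf = append(w.buf, o)
	if len(w.buf) >= w.limit {
		return w.Flush()
	}
	return nil
}

func (w *batchWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	// Build "INSERT ... VALUES (@p1,@p2),(@p3,@p4),..." and send it in one round trip.
	placeholders := make([]string, 0, len(w.buf))
	args := make([]interface{}, 0, len(w.buf)*2)
	for i, o := range w.buf {
		placeholders = append(placeholders, fmt.Sprintf("(@p%d, @p%d)", i*2+1, i*2+2))
		args = append(args, o.ID, o.Price)
	}
	query := "INSERT INTO offers (id, price) VALUES " + strings.Join(placeholders, ", ")
	_, err := w.db.Exec(query, args...)
	w.buf = w.buf[:0]
	return err
}
```

A real implementation would also flush on a timer and retry or re-queue the rows if the INSERT fails, rather than dropping them as this sketch does.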
There's also the good old-fashioned BULK INSERT from a flat file into the database. With a large data transfer app I developed (years ago) this was by far the fastest solution. But that was before large INSERT statements existed, and it only works well if you can delay writes into batches of at least 1000 rows.
Since you have only two very simple numeric fields you won't have to worry about Unicode, delimiters, escaping characters etc. Just write your intermediate results to a simple ASCII file, then BULK INSERT this in one transaction.
You will have to make sure this works multithreaded (should not be too difficult with unique file names), and you will have to experiment with the amount of 'latency' you tolerate, whether you can use table locks etc. The larger the bulk inserted file, the more you gain.
Make sure you set SQL Server to minimal logging (the bulk-logged or simple recovery model) to prevent large transaction logs.
Delphi XE4 contains FireDAC, which gives you two approaches for a solution: CachedUpdates and Array DML.
I like @Uwe's suggestion. If you are rolling your own solution without FireDAC, however, you can use an in-memory dataset as a buffer and then blast the data to a stored procedure.
Of course, this would require changes outside of the code, and you would need appropriate permissions to create the stored procedure and so forth. But if this idea appeals to you, here are two links that may help with this technique:
In Memory Datasets
Bulk push to SQL via stored procedure
I am creating an application that allows users to construct complex SELECT statements. The SQL that is generated cannot be trusted, and is totally arbitrary.
I need a way to execute the untrusted SQL in relative safety. My plan is to create a database user who only has SELECT privileges on the relevant schemas and tables. The untrusted SQL would be executed as that user.
What could possibly go wrong with that? :)
If we assume Postgres itself does not have critical vulnerabilities, the user could still do a bunch of cross joins and overload the database. That could be mitigated with a session timeout.
I feel like there is a lot more that could go wrong, but I'm having trouble coming up with a list.
EDIT:
Based on the comments/answers so far, I should note that the number of people using this tool at any given time will be very near 0.
SELECT queries cannot change anything in the database. Lack of DBA privileges guarantees that global settings cannot be changed. So, overload is truly the only concern.
Overload can be the result of complex queries or of too many simple queries.
Overly complex queries can be ruled out by setting statement_timeout in postgresql.conf.
Receiving floods of simple queries can be avoided too. First, you can set a per-user limit on parallel connections (ALTER USER ... WITH CONNECTION LIMIT). And if you have some interface program between the user and PostgreSQL, you can additionally (1) add some extra wait after each query completes, and (2) introduce a CAPTCHA to prevent automated DoS attacks.
ADDITION: PostgreSQL's public system functions provide many possible attack vectors. They can be called like SELECT pg_advisory_lock(1), and every user has the privilege to call them, so you should restrict access to them. A good option is to create a whitelist of all "callable words" - more precisely, identifiers that may be followed by ( - and rule out every query that contains a call-like construct, identifier(, with an identifier not in the whitelist.
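A sketch of the first two mitigations from the client side (Go is used only as an example, since the question does not name a language; the role name, timeout, and setup statements are made up):

```go
// Run the untrusted SELECT as a limited role, with both a server-side
// statement_timeout and a client-side context timeout.
package untrusted

import (
	"context"
	"database/sql"
	"time"
)

// One-time setup, done as a superuser (shown as SQL for reference):
//   CREATE ROLE report_user LOGIN PASSWORD '...' CONNECTION LIMIT 5;
//   GRANT SELECT ON ALL TABLES IN SCHEMA public TO report_user;
//   ALTER ROLE report_user SET statement_timeout = '5s';

func runUntrusted(ctx context.Context, db *sql.DB, query string) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, query) // db is opened as report_user
	if err != nil {
		return 0, err
	}
	defer rows.Close()

	n := 0
	for rows.Next() {
		n++ // a real implementation would scan and stream the columns
	}
	return n, rows.Err()
}
```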
Things that come to mind, in addition to having the user SELECT-only and revoking privileges on functions:
Read-only transaction. When a transaction is started with BEGIN READ ONLY, or with SET TRANSACTION READ ONLY as its first instruction, it cannot write anything, independently of the user's permissions.
On the client side, if you want to restrict it to a single SELECT, it is better to use an SQL submission function that does not accept several queries bundled into one. For instance, the Swiss-army-knife PQexec function of the libpq API does accept such bundled queries, and so does every driver function built on top of it, like PHP's pg_query.
http://sqlfiddle.com/ is a service dedicated to running arbitrary SQL statements, which may be seen as something of a proof of concept that this is doable without being hacked or DDoS'ed all day long.
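For example, with Go's database/sql the read-only transaction is one option away (a sketch; it complements rather than replaces the single-query submission advice, since multi-statement handling varies by driver):

```go
// Run the untrusted query inside a READ ONLY transaction so that,
// independently of role permissions, it cannot write.
package untrusted

import (
	"context"
	"database/sql"
)

func queryReadOnly(ctx context.Context, db *sql.DB, query string) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{ReadOnly: true})
	if err != nil {
		return err
	}
	defer tx.Rollback() // a read-only transaction loses nothing on rollback

	rows, err := tx.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		// ... scan and forward the row to the client ...
	}
	return rows.Err()
}
```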
The problem with this is that I'm not sure whether the SQL itself will continue to run in the background after a session timeout (I can't really find much evidence either way via Google, and I haven't attempted it myself either). If you're limiting access to just SELECT, I think this is about the worst that could happen, though. The real issue is what happens if you get a hundred users trying to do complex cross joins: session timeout dropping the query or not, it will put a really heavy load on the database (which could very easily be enough to pull the database down entirely).
The only way (from my point of view) to protect yourself against a DoS on the main server from crafted queries is to set up a read-only replica of the Postgres DB and a special limited user on that replica. This way the main Postgres server won't be affected by queries on the replica.
Also, you get a hot standby / continuously replicated DB for the case when the main DB fails for some reason.
A friend asked me if there is a way to see past DML statements, and I wasn't really sure how to go about answering that question. What he wants to see is the last set of INSERT statements, so that could mean more than one record. At first I suggested checking the latest identity value, but then he asked what happens if more inserts were performed at the same time. Can you help me out? Is there a DMV I should use that I just don't know about? Thanks.
If you did not prepare for this question, then there is no built-in way to get that information. However, you could use third-party log reader tools to recover (all of) the last statements that were executed against the database. This requires the database to be in the Full recovery model. With this method you could potentially go back as far as you have log backups.
If you want to prepare for that question being asked in the future, you have several options.
The most obvious one is Change Data Capture. You could also write a trigger yourself that records data changes.
You could also run a trace capturing SQL:BatchStarting events.
Finally, you could use a third-party network sniffer/logger to capture all statements sent to the server (this, however, requires that connection encryption is not used).