I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.
Is there a better way to do this, some bulk insert statement I do not know about?
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
One way to speed things up is to explicitly perform multiple inserts or copy's within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of the wal_buffers to 16 MB may also help.
UNNEST function with arrays can be used along with multirow VALUES syntax. I'm think that this method is slower than using COPY but it is useful to me in work with psycopg and python (python list passed to cursor.execute becomes pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
without VALUES using subselect with additional existance check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
the same syntax to bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
((this is a WIKI you can edit and enhance the answer!))
The external file is the best and typical bulk-data
The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.
Bulk insert with some transformation
In ETL applications and ingestion processes, we need to change the data before inserting it. Temporary table consumes (a lot of) disk space, and it is not the faster way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.
CSV example. Suppose the tablename (x, y, z) on SQL and a CSV file like
fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...
You can use the classic SQL COPY to load (as is original data) into tmp_tablename, them insert filtered data into tablename... But, to avoid disk consumption, the best is to ingested directly by
INSERT INTO tablename (x, y, z)
SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms
FROM tmp_tablename_fdw
-- WHERE condictions
;
You need to prepare database for FDW, and instead static tmp_tablename_fdw you can use a function that generates it:
CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');
JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condiction_here
;
where the function jsonb_read_files() reads all files of a folder, defined by a mask:
CREATE or replace FUNCTION jsonb_read_files(
p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
WITH t AS (
SELECT (row_number() OVER ())::int id,
f AS fname,
p_fpath ||'/'|| f AS f
FROM pg_ls_dir(p_fpath) t(f)
WHERE f LIKE p_flike
) SELECT id, fname,
to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
pg_read_file(f)::jsonb
FROM t
$f$ LANGUAGE SQL IMMUTABLE;
Lack of gzip streaming
The most frequent method for "file ingestion" (mainlly in Big Data) is preserving original file on gzip format and transfering it with streaming algorithm, anything that can runs fast and without disc consumption in unix pipes:
gunzip remote_or_local_file.csv.gz | convert_to_sql | psql
So ideal (future) is a server option for format .csv.gz.
Note after #CharlieClark comment: currently (2022) nothing to do, the best alternative seems pgloader STDIN:
gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
You can use COPY table TO ... WITH BINARY which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the datamodel and the presence of constraints,triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do a do insert into target select * from tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
I implemented very fast Postgresq data loader with native libpq methods.
Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/
May be I'm late already. But, there is a Java library called pgbulkinsert by Bytefish. Me and my team were able to bulk insert 1 Million records in 15 seconds. Of course, there were some other operations that we performed like, reading 1M+ records from a file sitting on Minio, do couple of processing on the top of 1M+ records, filter down records if duplicates, and then finally insert 1M records into the Postgres Database. And all these processes were completed within 15 seconds. I don't remember exactly how much time it took to do the DB operation, but I think it was around less then 5 seconds. Find more details from https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.
For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL you need to be able to handle some MySQL oddities such as the use of \N for an empty value and along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast to the relevant data types and any additional work before it goes into your main database so that index updates, etc. are minimal.
The query below can create test table with generate_series column which has 10000 rows. *I usually create such test table to test query performance and you can check generate_series():
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And, run the query below to insert 10000 rows if you've already had test table:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
Related
I'm trying to copy a table from dev database to QA database in SQL Server using the Import/Export Wizard. There are about 6.6 million rows and it is taking around 7 hours to do it. Is there any faster way to accomplish the task?
Below is the code that I'm using:
SELECT *
FROM [Table_Name] WITH (NOLOCK)
WHERE [ColumnDate] > 2018
OR [Code] in ('A', 'B', 'C','D')
Thanks.
You can generate the script of the table with data and then use that to create table in your destination database. It is one of the fastest approach. You can also use powershell script to automate it. Since your data is too large and SSMS is slow you are facing the issue of time.
Let me explain when you try to use import/ export property it basically takes more time because it will try to create schema and then insert data one by one. It will be like creating a table and performing insert of each row sequentially.
To avoid this situation you can simply generate script of table with data. Once you get the script you can perform insert into. Since you performing bulk insert from local it will avoid sequential insert and will reduce insert time exponentially.
For reference you can go through the link for step by step guide.
https://www.sqlshack.com/six-different-methods-to-copy-tables-between-databases-in-sql-server/
I want to extract information from a text file (almost 1GB) and store it in PostgreSQL database.
Text file is in following format:
DEBUG, 2017-03-23T10:02:27+00:00, ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
DEBUG, 2017-03-24T12:06:23+00:00, ghtorrent-49 -- ghtorrent.rb:Repo Shikanime/print exists
...
and I want to extract 'DEBUG', timestamp, 'ghtorrent-40', 'ghtorrent' and "Repo EFForg/https-everywhere exists" from each line and store it in database.
I have done it in using other languages like python (psycopg2) and C++ (libpqxx) but is it possible to write a function in PostgreSQL itself to import the whole data itself.
I am currenly using pgAdmin4 tool for the PostgreSQL.
I thinking of using something like pg_read_file in function to read the file but one line at a time and insert it into the table.
An approach I use with my large XML files - 130GB or bigger - is to upload the whole file into a temporary unlogged table and from there I extract the content I want. Unlogged tables are not crash-safe, but are much faster than logged ones, which totally suits the purpose of a temporary table ;-)
Considering the following table ..
CREATE UNLOGGED TABLE tmp (raw TEXT);
.. you can import this 1GB file using a single psql line from your console (unix)..
$ cat 1gb_file.txt | psql -d db -c "COPY tmp FROM STDIN"
After that all you need is to apply your logic to query and extract the information you want. Depending on the size of your table, you can create a second table from a SELECT, e.g.:
CREATE TABLE t AS
SELECT
trim((string_to_array(raw,','))[1]) AS operation,
trim((string_to_array(raw,','))[2])::timestamp AS tmst,
trim((string_to_array(raw,','))[3]) AS txt
FROM tmp
WHERE raw LIKE '%DEBUG%' AND
raw LIKE '%ghtorrent-40%' AND
raw LIKE '%Repo EFForg/https-everywhere exists%'
Adjust the string_to_array function and the WHERE clause to your logic! Optionally you can replace these multiple LIKE operations to a single SIMILAR TO.
.. and your data would be ready to be played with:
SELECT * FROM t;
operation | tmst | txt
-----------+---------------------+------------------------------------------------------------------
DEBUG | 2017-03-23 10:02:27 | ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
(1 Zeile)
Once your data is extracted you can DROP TABLE tmp; to free some disk space ;)
Further reading: COPY, PostgreSQL array functions and pattern matching
I have recently taken data dumps from an Oracle database.
Many of them are large in size(~5GB). I am trying to insert the dumped data into another Oracle database by executing the following SQL in SQL Developer
#C:\path\to\table_dump1.sql;
#C:\path\to\table_dump2.sql;
#C:\path\to\table_dump3.sql;
:
but it is taking a long time like more than a day to complete even a single SQL file.
Is there any better way to get this done faster?
SQL*Loader is my favorite way to bulk load large data volumes into Oracle. Use the direct path insert option for max speed but understand impacts of direct-path loads (for example, all data is inserted past the high water mark, which is fine if you truncate your table). It even has a tolerance for bad rows, so if your data has "some" mistakes it can still work.
SQL*Loader can suspend indexes and build them all at the end, which makes bulk inserting very fast.
Example of a SQL*Loader call:
$SQLDIR/sqlldr /#MyDatabase direct=false silent=feedback \
control=mydata.ctl log=/apps/logs/mydata.log bad=/apps/logs/mydata.bad \
rows=200000
And the mydata.ctl would look something like this:
LOAD DATA
INFILE '/apps/load_files/mytable.dat'
INTO TABLE my_schema.my_able
FIELDS TERMINATED BY "|"
(ORDER_ID,
ORDER_DATE,
PART_NUMBER,
QUANTITY)
Alternatively... if you are just copying the entire contents of one table to another, across databases, you can do this if your DBA sets up a DBlink (a 30 second process), presupposing your DB is set up with the redo space to accomplish this.
truncate table my_schema.my_table;
insert into my_schema.my_table
select * from my_schema.my_table#my_remote_db;
The use of the /* +append */ hint can still make use of direct path insert.
Hi I often have to insert a lot of data into a table. For example, I would have data from excel or text file in the form of
1,a
3,bsdf
4,sdkfj
5,something
129,else
then I often construct 6 insert statements in this example and run the SQL script. I found this was slow when I have to send thousands of small packages to server, it also causes extra overhead to the network.
What's your best way of doing this?
Update: I'm using ORACLE 10g.
Use Oracle external tables.
See also e.g.
OraFaq about external tables
What Tom thinks about external tables
René Nyffenegger's notes about external tables
A simple example that should get you started
You need a file located in a server directory (get familiar with directory objects):
SQL> select directory_path from all_directories where directory_name = 'JTEST';
DIRECTORY_PATH
--------------------------------------------------------------------------------
c:\data\jtest
SQL> !cat ~/.gvfs/jtest\ on\ 192.168.xxx.xxx/exttable-1.csv
1,a
3,bsdf
4,sdkfj
5,something
129,else
Create an external table:
create table so13t (
id number(4),
data varchar2(20)
)
organization external (
type oracle_loader
default directory jtest /* jtest is an existing directory object */
access parameters (
records delimited by newline
fields terminated by ','
missing field values are null
)
location ('exttable-1.csv') /* the file located in jtest directory */
)
reject limit unlimited;
Now you can use all the powers of SQL to access the data:
SQL> select * from so13t order by data;
ID DATA
---------- ------------------------------------------------------------
1 a
3 bsdf
129 else
4 sdkfj
5 something
Im not sure if this works in Oracle but in SQL Server you can use BULK INSERT sql statement to upload data from a txt or a csv file.
BULK
INSERT [TableName]
FROM 'c:\FileName.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
Just make sure that the table columns correctly matches whats in the txt file. With a more complicated solution you may want to use a format file see the following:
http://msdn.microsoft.com/en-us/library/ms178129.aspx
There are alot of ways to speed this up.
1) Do it in a single transaction. This will speed things up by avoiding connection opening / closing.
2) Load directly as a CSV file. If you load data as a CSV file, the "SQL" statements aren't required at all. in MySQL the "LOAD DATA INFILE" operation accomplishes this very intuitively and simply.
3) You can also simply dump the whole file as text into a table called "raw". And then let the database parse the data on its own using triggers. This is a hack, but it will simplify your application code and reduce network usage.
I need a simple way to export data from an SQLite database of multiple tables, then import them into another database.
Here is my scenario. I have 5 tables: A, B, C, D, E.
Each table has a primary key as the first column called ID. I want a Unix command that will dump ONLY the data in the row from the primary key in a format that can be imported into another database.
I know I can do a
sqlite3 db .dump | grep INSERT
but that gives me ALL data in the table. I'm not a database expert, and I'm trying to do this with all unix commands in which I can write a shell script, rather than writing C++ code to do it (because that's what people are telling me that's the easiest way). I just refuse to write C++ code to accomplish a task that possible can be done in 4-5 statements from the command line.
Any suggestions?
This may look a little weird, however you may give it a try:
Create text file and place following statements there:
.mode insert
.output insert.sql
select * from TABLE where STATEMENT; -- place the needed select query here
.output stdout
Feed this file to sqlite3:
$ sqlite3 -init initf DATA.DB .schema > schema.sql
As the result you will get two files: one with simple "inserts" (insert.sql) and another with db schema (schema.sql).
Suggest finding a tool that can take your query and export to a CSV. It sounds like you wanted a script. Did you want to reuse and automate this?
For other scenarios, perhaps consider the sqlite-manager Firefox plugin. It supports running your adhoc queries, and exporting the results to a CSV.
Within that, give it this statement:
SELECT ID FROM TableA
Repeat for each table as you need.
You can use sqlite3 bash. For example if you want to get insert query for all records in one table, you can do the followings:
$ sqlite3 /path/to/db_name.db
>>.mode insert
>>.output insert.sql
>>select * from table_name;
It will creates a file name called insert.sql and puts insert query for every record in the given table.
A sample for what you get in insert.sql:
INSERT INTO "table" VALUES("data for record one");
INSERT INTO "table" VALUES("data for record two");
..
you also can use the quote function of SQLite.
echo "SELECT 'INSERT INTO my_new_table (my_new_key) VALUES (' || quote(my_old_key) || ');' FROM my_old_table;" | sqlite my_table > statements.sql