Insert data into a Snowflake table from SQLAlchemy - sql

So I am trying to insert data into a Snowflake transient table (from a parquet file), but my syntax doesn't get past the SAST check in our pipeline.
Do you see anything wrong with the following code snippet (especially the insert into step, since it is causing the error)?
with snowflake_engine.begin() as tx:
    (SOME WORKING CODE)...
    if table == "lending_adjudications":
        tx.execute(
            f"put file://{pq_filepath_2} @{destination_schema}.{stage_guid_part_2}"
        ).fetchall()
        stmts = [
            f"create or replace transient table {destination_schema}.{table} as "  # nosec
            f"select $1 as fields from @{destination_schema}.{stage_guid}",  # nosec
            f"insert into {destination_schema}.{table} "
            f"select $1 as fields from @{destination_schema}.{stage_guid_part_2}",
        ]
        [tx.execute(stmt).fetchall() for stmt in stmts]
    else:
        tx.execute(
            f"create or replace transient table {destination_schema}.{table} as "  # nosec
            f"select $1 as fields from @{destination_schema}.{stage_guid}"  # nosec
        ).fetchall()
    ...
Thank you so much for your help, any insight is highly appreciated.
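One detail that stands out in the snippet: the create or replace statements carry # nosec annotations, but the insert into ... select pair does not, which is often enough to trip a scanner such as Bandit (its B608 check flags SQL assembled from f-strings). Identifiers like schema, table and stage names cannot be bound as query parameters, so a common pattern is to validate them before interpolating and to annotate every flagged line. The sketch below only illustrates that idea with hypothetical helper names; it is not the original pipeline code.

import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def safe_ident(name: str) -> str:
    """Return the name unchanged if it looks like a plain Snowflake identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return name

def build_statements(destination_schema, table, stage_guid, stage_guid_part_2):
    # Validate every identifier up front so the f-strings below only ever see
    # vetted names; keep a # nosec marker on each line the scanner flags.
    schema = safe_ident(destination_schema)
    tbl = safe_ident(table)
    return [
        f"create or replace transient table {schema}.{tbl} as "  # nosec
        f"select $1 as fields from @{schema}.{safe_ident(stage_guid)}",  # nosec
        f"insert into {schema}.{tbl} "  # nosec
        f"select $1 as fields from @{schema}.{safe_ident(stage_guid_part_2)}",  # nosec
    ]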

Related

Create SQL table from parquet files

I am using R to handle large datasets (largest dataframe 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and load them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow=10))
# Save as parquet file
write_parquet(test, tempfile(fileext = ".parquet"))
# Load main table
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = "/tmp/RtmpeJBgyB/file2b5f4764e153.parquet", memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
name = DBI::Id(schema = "schema", table = "table"),
value = test)
Is it possible to write a SQL table without loading parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow=10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
#Upload table using bulk insert
dbExecute(connection,
paste("
BULK INSERT [database].[schema].[table]
FROM '", gsub('\\\\', '/', f), "' FORMAT = 'PARQUET';
")
)
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Now the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you would likely be looking more for OPENROWSET using CREATE TABLE AS [select statement].
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
                  AS
                  SELECT *
                  FROM OPENROWSET(
                      BULK '", f, "' FORMAT = 'PARQUET'
                  ) WITH (
                      ", column_definition, "
                  );
                 "))
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL types is available on the T-SQL documentation page (note two very necessary translations: Date and POSIXlt are not present). Once again, a disclaimer: my time in T-SQL did not get to BULK INSERT or similar.
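The underlying ask, writing a parquet file to a SQL table without materialising it in memory, can also be handled outside R. Below is a minimal sketch in Python (not R, and not taken from the answer above) that streams the file in record batches with pyarrow and inserts each batch through a DB-API connection; sqlite3 and the column names are placeholders for whatever database and schema you actually use.

import sqlite3
import pyarrow.parquet as pq

def stream_parquet_to_sql(parquet_path, conn, table, batch_size=50_000):
    # Read the parquet file batch by batch so only one batch is in memory at a time.
    pf = pq.ParquetFile(parquet_path)
    cols = pf.schema_arrow.names
    placeholders = ",".join("?" for _ in cols)
    insert = f"INSERT INTO {table} ({','.join(cols)}) VALUES ({placeholders})"
    for batch in pf.iter_batches(batch_size=batch_size):
        columns = batch.to_pydict()               # dict of column name -> list of values
        rows = zip(*(columns[c] for c in cols))   # re-assemble row tuples
        conn.executemany(insert, rows)
    conn.commit()

# Hypothetical usage with an in-memory SQLite table matching a two-column file:
# conn = sqlite3.connect(":memory:")
# conn.execute("CREATE TABLE test (X1 REAL, X2 REAL)")
# stream_parquet_to_sql("test.parquet", conn, "test")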

Omitting columns when importing CSV into Sqlite

Imagine that you have the following data in a CSV:
Name, Age, Gender
Jake, 40, M
Bill, 17, M
Suzie, 21, F
Is it possible to exclude the Age variable when importing the above CSV? My current approach is to simply use the cut shell command.
Update
iluvcapra has a great answer for small CSVs. However, for very large CSVs this approach is inefficient. For example, imagine that the above CSV was very large, 30 GB let's say. Loading all that Age data only to immediately discard it is a waste of time. With this in mind, is there a more efficient way to load subsets of columns into SQLite databases?
I suspect that the best option is to use the shell command cut to cull out unnecessary columns. Is that intuition correct? Is it common to use shell commands to pre-process CSV files into more SQLite-friendly versions?
Create a temporary table with the age column, and then use an INSERT... SELECT to move the data from the temporary table into your main one:
CREATE TEMP TABLE _csv_import (name text, age integer, gender text);
.separator ","
.import file.csv _csv_import
INSERT INTO names_genders (name, gender) SELECT name, gender
FROM _csv_import WHERE 1;
DROP TABLE _csv_import;
EDIT: Inserting through a view with a phantom age column:
CREATE VIEW names_ages_genders AS
    SELECT name, 0 AS age, gender FROM names_genders;
CREATE TRIGGER lose_age
    INSTEAD OF INSERT ON names_ages_genders
BEGIN
    INSERT INTO names_genders (name, gender)
    VALUES (NEW.name, NEW.gender);
END;
This will create a view called names_ages_genders that will say everybody is zero years old, and will silently drop the age field from any INSERT statement called on it. Not tested! (I'm actually not sure .import can import into views.)
If you wish to avoid reading more than necessary into SQLite, and if you wish to avoid the hazards of using standard text-processing tools (such as cut and awk) on CSV files, one possibility would be to use your favorite csv2tsv converter (*) along the following lines:
csv2tsv input.csv | cut -f 1,3- > tmp.tsv
cat << EOF | sqlite3 demo.db
drop table if exists demo;
.mode csv
.separator "\t"
.import tmp.tsv demo
EOF
/bin/rm tmp.tsv
Note, though, that if input.csv has literal tabs or newlines or escaped double-quotes, then whether the above will have the desired effect will depend on the csv2tsv that is used.
(*) csv2tsv
In case you don't have ready access to a suitable csv2tsv converter, here is a simple python3 script that does the job, handling embedded literal newlines, tabs, and the two-character sequences "\t" and "\n" in the CSV:
#!/usr/bin/env python3
# Take care of embedded tabs and newlines in the CSV
import csv, re, sys

if len(sys.argv) > 3 or (len(sys.argv) > 1 and sys.argv[1] == '--help'):
    sys.exit("Usage: " + sys.argv[0] + " [input.csv [output.tsv]]")

csv.field_size_limit(sys.maxsize)

if len(sys.argv) == 3:
    out = open(sys.argv[2], 'w+')
else:
    out = sys.stdout

if len(sys.argv) == 1:
    csvfile = sys.stdin
else:
    csvfile = open(sys.argv[1])

# Escape pre-existing "\t" and "\n" sequences, then encode literal tabs and newlines
def edit(s):
    s = re.sub(r'\\t', r'\\\\t', s)
    s = re.sub(r'\\n', r'\\\\n', s)
    s = re.sub('\t', r'\\t', s)
    return re.sub('\n', r'\\n', s)

reader = csv.reader(csvfile, dialect='excel')
for row in reader:
    line = ""
    for s in row:
        s = edit(s)
        if len(line) == 0:
            line = s
        else:
            line += '\t' + s
    print(line, file=out)

How to pass a hash to a sub?

In my Perl "script" I'm collecting data and building a hashmap. The hashmap keys represent field names, and the values hold the data I want to insert into the corresponding fields.
The hashmap is built, and then passed to the saveRecord() method which is supposed to build a SQL query and eventually it will execute it.
The idea here is to update the database once, rather than once per field (there are a lot of fields).
The problem: I'm having trouble passing the hashmap over to the sub and then pulling the fields and values out of the hashmap. At this point my keys and values are blank. I suspect the data is getting lost during the passing to a sub.
The output of the script indicates no keys and no values.
Need help passing the data to the sub in a way that lets me pull it back apart as shown - with join().
Thanks!
Code snippet:
for my $key (keys %oids) {
    $thisField = $key;
    $thisOID = $oids{$thisField};
    # print "loop: thisoid=$thisOID field=$thisField\n";

    # perform the SNMP query.
    $result = getOID ($thisOID);

    # extract the information from the result.
    $thisResult = $result->{$thisOID};

    # remove quotation marks from the data value, replace them with question marks.
    $thisResult =~ s/\"|\'|/\?/g;

    # TODO: instead of printing this information, pass it to a subroutine which writes it to the database (use an update statement).
    # printf "The $thisField for host '%s' is '%s'.\n", $session->hostname(), $result->{$thisOID};

    # add this key/value pair to the mydata hashmap.
    $mydata{$thisField} = $thisResult;
    # print "$thisField=$thisResult\n";
}

# write one record update for hashmap %mydata.
saveRecord (%mydata);

# write all fields to database at once...
sub saveRecord ($) {
    my $refToFields = shift;
    my @fieldlist = keys %$refToFields;
    my @valuelist = values %$refToFields;
    my $sql = sprintf ("INSERT INTO mytable (%s) VALUES (%s)", join(",", @fieldlist), join(",", @valuelist) );

    # Get ID of record with this MAC, if available, so we can perform SQL update
    my $recid = getidbymac ($MAC);
    print "sql=$sql\n";
    # TODO: use an insert or an update based on whether recid was available...
    # TODO: ID available, update the record
    # TODO: ID not available, insert record let ID be auto assigned.
}
I cleaned up your code a little. Your main problem was not using a reference when calling your sub. Also note the commented regex which is cleaned up:
Code:
use strict;
use warnings;

# $thisResult =~ s/["']+/?/g;

my %mydata = ( 'field1' => 12, 'field2' => 34, );

saveRecord(\%mydata);   # <-- Note the added backslash

sub saveRecord {
    my $ref = shift;
    my $sql = sprintf "INSERT INTO mytable (%s) VALUES (%s)",
        join(',', keys %$ref),
        join(',', values %$ref);
    print "sql=$sql\n";
}
Output:
sql=INSERT INTO mytable (field1,field2) VALUES (12,34)

Rails3: SQL execution with hash substitution like .where()

With a simple model like this
class Model < ActiveRecord::Base
  # ...
end
we can run queries like this:
Model.where(["name = :name and updated_at >= :D",
  { :D => (Date.today - 1.day).to_datetime, :name => "O'Connor" }])
The values in the hash are substituted into the final SQL statement with proper escaping, depending on the underlying database engine.
I would like a similar feature for raw SQL execution, something like:
ActiveRecord::Base.connection.execute(
  ["update models set name = :name, hired_at = :D where id = :id;",
   { :id => 73465, :D => DateTime.now, :name => "O'My God" }]
) # THIS CODE IS A FANTASY. NOT WORKING.
(Please do not solve the example by loading a Model object, modifying it and then saving! The example is only an illustration of the feature I would like to have. Concentrate on the subject!)
The original problem is that I want to insert a large amount (many thousands of lines) of data into the database. I want to use some features of the SQL abstraction of the ActiveRecord framework, but I don't want to use model objects based on ActiveRecord::Base because they are damn slow! (8 queries per second for my current problem.)
query = ActiveRecord::Base.connection.raw_connection.prepare("INSERT INTO users (name) VALUES(:name)")
query.execute(:name => 'test_name')
query.close
Extending the @peufeu solution with a concrete code example for bulk insert:
users_places = []
users_values = []
timestamp = Time.now.strftime('%Y-%m-%d %H:%M:%S')
params[:users].each do |user|
  users_places << "(?,?,?,?)"
  users_values << user[:name] << user[:punch_line] << timestamp << timestamp
end
bulk_insert_users_sql_arr = ["INSERT INTO users (name, punch_line, created_at, updated_at) VALUES #{users_places.join(", ")}"] + users_values
begin
  sql = ActiveRecord::Base.send(:sanitize_sql_array, bulk_insert_users_sql_arr)
  ActiveRecord::Base.connection.execute(sql)
rescue
  "something went wrong with the bulk insert sql query"
end
Here is the reference for the sanitize_sql_array method in ActiveRecord::Base; it generates the proper query string by escaping the single quotes in the strings. For example, the punch_line "Don't let them get you down" will become "Don\'t let them get you down".
Yes, you could do raw SQL, but check out the ar-extensions gem, which helps with batch inserts:
https://github.com/zdennis/ar-extensions
Here's a post on it, and various other techniques:
http://www.coffeepowered.net/2009/01/23/mass-inserting-data-in-rails-without-killing-your-performance/
For INSERTs, batching them using a long VALUES clause (as shown by Simon's link) is the fastest way (unless you want to generate a text file and load it in your database with MySQL's LOAD DATA INFILE). But you have to be very careful about escaping your text values (which is not done in the example).
I was asking "what database are you using" because it does matter for mass UPDATEs.
For instance, you can do this on Postgres (and, I believe, on SQL Server, changing "columnX" to "colX"):
UPDATE foo
SET bar = v.column2
FROM (VALUES (1,2), (3,4), ... /* long list */) AS v
WHERE foo.id = v.column1;
And you can update a load of rows using a single statement, very fast.
If you don't need Ruby to perform some Ruby-specific magic on your data, the fastest way to transfer data from one DB to a different one is to export as a text file (CSV or tab separated), load it on the other DB (LOAD DATA INFILE on MySQL), perhaps in a temporary table, and bulk process using SQL.
EDIT: Here's how I do this in Python:
sql = ["INSERT INTO foo (column list) VALUES "]
values = []
for tup in tuple_list:
    sql.append("(?,?,?,?)")   # one placeholder group per row to insert
    values.extend(tup)        # flatten this row's values into the flat list
query = sql[0] + ",".join(sql[1:])
Then join sql into a string, you get "INSERT INTO foo (column list) VALUES (?,?,?,?),(?,?,?,?),(?,?,?,?)" with the "(?,?,?,?)" repeated as many times as you have lines to insert.
Then "values" contains a list of (a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3) with an,bn,cn,dn being the tuples you want to insert for line n. Each one corresponds to a placeholder in the sql string.
Then pass this to the usual "execute query with parameters" function which will handle quoting and escaping as usual.
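For illustration, here is a small, self-contained version of that pattern using Python's built-in sqlite3 module (which happens to use the same ? placeholder style); the foo table and its columns are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")

tuple_list = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]

# One "(?,?,?,?)" group per row, and one flat list of values to match.
placeholders = ",".join("(?,?,?,?)" for _ in tuple_list)
query = "INSERT INTO foo (a, b, c, d) VALUES " + placeholders
values = []
for tup in tuple_list:
    values.extend(tup)

conn.execute(query, values)   # the driver handles quoting and escaping
print(conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0])   # -> 3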
I encountered a similar issue recently when trying to insert 100K+ records into a MySQL database for a Rails 4 app using the mysql2 gem. The data included characters that had to be sanitized prior to insert.
The solution I ended going with was a slightly modified version of Option 3 described at https://www.coffeepowered.net/2009/01/23/mass-inserting-data-in-rails-without-killing-your-performance/
Here's the relevant code block from the above link:
TIMES = 10000
inserts = []
TIMES.times do
  inserts.push "(3.0, '2009-01-23 20:21:13', 2, 1)"
end
sql = "INSERT INTO user_node_scores (`score`, `updated_at`, `node_id`, `user_id`) VALUES #{inserts.join(", ")}"
The modification I made was using the public method ActiveRecord::Base.sanitize() on values that required it.
inserts = []
created = Time.now.strftime "%Y-%m-%d %H:%M:%S"
params[:audits].each do |audit|
  inserts.push "(#{audit.user_id}, '#{created}', " + ActiveRecord::Base.sanitize(audit.comment) + ", #{audit.status})"
end
sql = "INSERT INTO user_audits (`user_id`, `created_at`, `comment`, `status`) VALUES #{inserts.join(", ")}"

What's the best way to copy a subset of a table's rows from one database to another in Postgres?

I've got a production DB with, say, ten million rows. I'd like to extract the 10,000 or so rows from the past hour off of production and copy them to my local box. How do I do that?
Let's say the query is:
SELECT * FROM mytable WHERE date > '2009-01-05 12:00:00';
How do I take the output, export it to some sort of dump file, and then import that dump file into my local development copy of the database -- as quickly and easily as possible?
Source:
psql -c "COPY (SELECT * FROM mytable WHERE ...) TO STDOUT" > mytable.copy
Destination:
psql -c "COPY mytable FROM STDIN" < mytable.copy
This assumes mytable has the same schema and column order in both the source and destination. If this isn't the case, you could try STDOUT CSV HEADER and STDIN CSV HEADER instead of STDOUT and STDIN, but I haven't tried it.
If you have any custom triggers on mytable, you may need to disable them on import:
psql -c "ALTER TABLE mytable DISABLE TRIGGER USER; \
COPY mytable FROM STDIN; \
ALTER TABLE mytable ENABLE TRIGGER USER" < mytable.copy
source server:
BEGIN;
CREATE TEMP TABLE mmm_your_table_here AS
SELECT * FROM your_table_here WHERE your_condition_here;
COPY mmm_your_table_here TO 'u:\\source.copy';
ROLLBACK;
your local box:
-- your_destination_table_here must be created first on your box
COPY your_destination_table_here FROM 'u:\\source.copy';
article: http://www.postgresql.org/docs/8.1/static/sql-copy.html
From within psql, you just use copy with the query you gave us, exporting this as a CSV (or whatever format), switch database with \c and import it.
Look into \h copy in psql.
With the constraint you added (not being superuser), I do not find a pure-SQL solution. But doing it in your favorite language is quite simple. You open a connection to the "old" database, another one to the new database, you SELECT in one and INSERT in the other. Here is a tested-and-working solution in Python.
#!/usr/bin/python

"""
Copy a *part* of a database to another one. See
<http://stackoverflow.com/questions/414849/whats-the-best-way-to-copy-a-subset-of-a-tables-rows-from-one-database-to-anoth>

With PostgreSQL, the only pure-SQL solution is to use COPY, which is
not available to the ordinary user.

Stephane Bortzmeyer <bortzmeyer@nic.fr>
"""

table_name = "Tests"
# List here the columns you want to copy. Yes, "*" would be simpler
# but also more brittle.
names = ["id", "uuid", "date", "domain", "broken", "spf"]
constraint = "date > '2009-01-01'"

import psycopg2

old_db = psycopg2.connect("dbname=dnswitness-spf")
new_db = psycopg2.connect("dbname=essais")
old_cursor = old_db.cursor()
old_cursor.execute("""SET TRANSACTION READ ONLY""")  # Security
new_cursor = new_db.cursor()
old_cursor.execute("""SELECT %s FROM %s WHERE %s """ % \
                   (",".join(names), table_name, constraint))
print("%i rows retrieved" % old_cursor.rowcount)
new_cursor.execute("""BEGIN""")
placeholders = []
namesandvalues = {}
for name in names:
    placeholders.append("%%(%s)s" % name)
for row in old_cursor.fetchall():
    i = 0
    for name in names:
        namesandvalues[name] = row[i]
        i = i + 1
    command = "INSERT INTO %s (%s) VALUES (%s)" % \
              (table_name, ",".join(names), ",".join(placeholders))
    new_cursor.execute(command, namesandvalues)
new_cursor.execute("""COMMIT""")
old_cursor.close()
new_cursor.close()
old_db.close()
new_db.close()