I have a shapefile (DMA area definitions). It's static data that won't change frequently, so I was wondering what the best approach would be to import it into Redshift via dbt.
I'm looking for something similar to dbt seeds, but seeds only work with CSV files, while what I have is a geo shapefile on S3.
The sql query I'm using to import it from S3 is the following:
CREATE TABLE dma (
    fid INT IDENTITY(1,1),
    id BIGINT,
    name VARCHAR,
    long_name VARCHAR,
    geometry GEOMETRY
);

COPY dma (geometry, id, name, long_name)
FROM 's3://{somePath}/{someFile}.shp'
FORMAT SHAPEFILE
CREDENTIALS '{someCredentials}';
So basically I want this import into Redshift to be part of my dbt setup instead of running an external SQL query manually.
You could wrap this in a macro so that you can invoke it with something like dbt run-operation create_dma_table. For inspiration, see this similar approach for managing UDFs.
I would drop and recreate the table in the macro so it's idempotent (Redshift doesn't support CREATE OR REPLACE TABLE). I would also prefix the table with the target schema, so that you can build this in multiple environments (dev/prod/etc.).
{% macro create_dma_table() %}

DROP TABLE IF EXISTS {{ target.schema }}.dma;

CREATE TABLE {{ target.schema }}.dma (
    fid INT IDENTITY(1,1),
    id BIGINT,
    name VARCHAR,
    long_name VARCHAR,
    geometry GEOMETRY
);

COPY {{ target.schema }}.dma (geometry, id, name, long_name)
FROM 's3://{somePath}/{someFile}.shp'
FORMAT SHAPEFILE
CREDENTIALS '{someCredentials}';

{% endmacro %}
You may also want to explicitly add transaction handling (e.g., BEGIN...COMMIT) so this doesn't get rolled back.
Another option would be to poll information_schema using run_query or similar for the existence of this table, and add an if block to this macro so it only executes if the table doesn't already exist. This would let you run this macro in an on-run-start hook to guarantee that this table will always be in your target schema before your dbt runs execute.
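A minimal sketch of that idea, assuming the macro above is named create_dma_table and that your adapter will accept its multi-statement body as a single hook query (the wrapper macro name is just an illustration):

{% macro create_dma_table_if_missing() %}
  {% if execute %}
    {% set check_sql %}
      select 1
      from information_schema.tables
      where table_schema = '{{ target.schema }}'
        and table_name = 'dma'
    {% endset %}
    {% set existing = run_query(check_sql) %}
    {% if existing.rows | length == 0 %}
      {# Table is missing: render the DDL + COPY so the calling hook executes it #}
      {{ create_dma_table() }}
    {% else %}
      {# Table already exists: render a harmless no-op so the hook still has SQL to run #}
      select 1
    {% endif %}
  {% endif %}
{% endmacro %}

Wired up as on-run-start: "{{ create_dma_table_if_missing() }}" in dbt_project.yml, this would run before every dbt invocation and only reload the shapefile when the table is absent.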
Can we create a new table in DBT?
Can we copy a table structure that exists in the dev environment of the database to another environment using DBT?
Yes. However, dbt needs a "reason" to create tables, for example to materialize the data produced by one of its models. dbt cannot create a table just for creation's sake.
Well, strictly speaking, you can do this by putting a CREATE TABLE statement in a pre-hook or post-hook, but I suppose this is not what you want, since dbt adds nothing over plain SQL here.
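For illustration only, a sketch of that hook approach (the schema, table, and throwaway model below are placeholders, not a recommendation):

{{ config(
    materialized = "view",
    pre_hook = "create table if not exists some_schema.my_static_table (id bigint, name varchar)"
) }}

select 1 as id

dbt runs the pre-hook before building the model, so the table gets created as a side effect rather than as something dbt actually manages.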
You can define your existing table as a source, where you can set a database, schema, and table name different from the target schema where dbt writes data. Then define a model something like:
{{ config(materialized="table") }}

select *
from {{ source('your_source', 'existed_table_name') }}
limit 1 /* add "limit 1" if you only want the structure */
Put the necessary connection credentials in profiles.yml and build the model. dbt will copy one row from the source table into the model table, and the model table itself gets created for free along the way.
I have an external table with a complex datatype (map(string, array(struct))), and I'm able to select and query this external table without any issue.
However, if I try to load this data into a managed table, it runs forever. Is there a best approach for loading this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
    id string,
    list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to a warehouse directory restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(
    id string,
    list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
);
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This is the most efficient approach, much faster than insert ... select:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
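For the managed table above, that would be something like:

use db;
ANALYZE TABLE tbl_managed COMPUTE STATISTICS;
-- FOR COLUMNS may not be supported for the complex map/array/struct column, so start with table-level statistics.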
Probably a very trivial question, but I'm not sure about this and also don't want to lose the table: how do I rename a table in Athena?
Database name - friends
Table name - centralPark
Desired table name - centralPerk
You can't!
See the list of unsupported DDL in Athena.
What you can do is make a new table using CTAS (CREATE TABLE AS SELECT):
CREATE TABLE centralPerk
AS SELECT * FROM centralPark
WITH DATA
and drop the old table:
DROP TABLE IF EXISTS centralPark
Using a CTAS query is effective, but I found it to be quite slow. It needs to copy all the files.
But you don't need to copy the files. You can create a new table directly in the Glue catalog and point it at the existing files. This works in seconds or less.
If you're using Python, I highly recommend the awswrangler library for this kind of work.
import awswrangler as wr
def wrangler_copy(db, old_name, new_name):
    # Register a new Glue table that points at the old table's existing data files
    wr.catalog.create_parquet_table(
        db,
        new_name,
        path=wr.catalog.get_table_location(db, old_name),
        columns_types=wr.catalog.get_table_types(db, old_name),
        # TODO: partitions, etc
    )
And then drop the old table if you like.
DROP TABLE IF EXISTS <old_name>
I would like to create a script to insert data into an SQL database.
My project is to take an Access database that isn't well structured and put it into SQL. The Access data can be imported, but there is a table I have created that the Access DB doesn't have, which is what I want the script for. It's a physical paper-box archive database, and I need to create the "Locations" data.
To be more specific, the data is:
ID (auto num)
Rack - These are the shelving units
Row - This is the same as shelf
Column - This is the number of boxes horizontally on a shelf
Position - This is the depth (there can be two boxes in the same column on each shelf)
INSERT INTO
In terms of scripting, there are a few ways of inserting data into a SQL database. First, if you have an existing table, you can insert values into specific columns like so:
INSERT INTO TableName (Column1, Column2, etc..)
VALUES ('Column1 value', 420, etc...)
This can be added to a while loop in order to fill multiple rows at a fast pace.
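For instance, a sketch of that loop idea in T-SQL for generating the Locations rows; the table name, column names, and the counts per dimension are assumptions you would adjust to your archive:

-- Assumed table; ID is the auto-number from the list above
CREATE TABLE dbo.Locations (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    Rack INT NOT NULL,
    [Row] INT NOT NULL,
    [Column] INT NOT NULL,
    Position INT NOT NULL
);

DECLARE @rack INT, @row INT, @col INT, @pos INT;

SET @rack = 1;
WHILE @rack <= 10               -- assumed number of racks
BEGIN
    SET @row = 1;
    WHILE @row <= 5             -- assumed shelves per rack
    BEGIN
        SET @col = 1;
        WHILE @col <= 8         -- assumed boxes across each shelf
        BEGIN
            SET @pos = 1;
            WHILE @pos <= 2     -- front/back depth positions
            BEGIN
                INSERT INTO dbo.Locations (Rack, [Row], [Column], Position)
                VALUES (@rack, @row, @col, @pos);
                SET @pos += 1;
            END;
            SET @col += 1;
        END;
        SET @row += 1;
    END;
    SET @rack += 1;
END;

With roughly 10 x 5 x 8 x 2 combinations this is small enough for a loop; for much larger grids a set-based CROSS JOIN would be faster.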
IMPORT FILE
Another method you can use to create a table with existing data and columns is to import an Excel sheet, for example. This can be done by right-clicking the database you wish to add the new table to, then heading to Tasks, then Import Data:
Database (Right Click) > Tasks > Import Data...
You will then need to select the data source, presumably Excel, and specify the file path. Next, select the destination, probably SQL Server Native Client for you. The rest from there should be pretty easy to follow.
BULK IMPORT
I've not had a lot of practice with bulk importing to SQL, but as far as I'm aware you can use this method to import data from an external file into a SQL table programmatically.
An example I have is as follows:
--Define the data you are importing in a temp table
CREATE TABLE #ClickData (
    ID INT IDENTITY(1,1)
    ,Dated VARCHAR(255) COLLATE Latin1_General_CI_AS
    ,PageViews VARCHAR(255) COLLATE Latin1_General_CI_AS
)

insert into #ClickData (Dated, PageViews)
--Select the data from the file
select Dated, PageViews
from openrowset( --Openrowset is the method of doing this
    bulk 'FilePath\ImportToSqlTest.csv', --The file you wish to import data from
    formatfile = 'FilePath\Main.XML', --The XML format file that describes the layout of the data file
    firstrow = 2 --Specify the starting row (2 here to skip the header row)
) as data
Apologies if this answer isn't overly helpful; I had to write it in a rush. I'm not entirely sure what you're looking for since, as others have said, your question is quite vague, but hopefully this helps somewhat.
I have a script that parses an XML file to generate an SQLite table automatically. The simplified command is as follows.
Table string: CREATE TABLE IF NOT EXISTS benchmark (id integer primary key autoincrement, Version float, CompilationParameters_Family text, CompilationParameters_XilinxVersion text, CompilationParameters_Device text, CompilationParameters_XilinxParameterList_Parameter_OptimizationGoal text, CompilationParameters_XilinxParameterList_Parameter_PlacerEffortLevel text)
It works well, but I wonder if I can attach aliases to the long names in the database.
Is this possible? I mean, can I have a command something like
Table string ... CompilationParameters_XilinxVersion text >>as version<< ...
so that I can use either CompilationParameters_XilinxVersion or version when retrieving the data.
What you're trying to do is not possible in SQL: a column has only the one name given in the table definition. However, you may want to create a VIEW that simply substitutes your short aliases for the long column names. Note that views in SQLite are read-only and therefore cannot be written to.
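For example, a sketch of such a view over the table above (the view name and the short aliases are just illustrations; a bare "version" alias would clash with the existing Version column, so a different short name is used here):

CREATE VIEW IF NOT EXISTS benchmark_short AS
SELECT
    id,
    Version,
    CompilationParameters_Family AS family,
    CompilationParameters_XilinxVersion AS xilinx_version,
    CompilationParameters_Device AS device,
    CompilationParameters_XilinxParameterList_Parameter_OptimizationGoal AS optimization_goal,
    CompilationParameters_XilinxParameterList_Parameter_PlacerEffortLevel AS placer_effort_level
FROM benchmark;

-- Query either the view with the short names or the table with the long ones:
-- SELECT xilinx_version, device FROM benchmark_short;
-- SELECT CompilationParameters_XilinxVersion, CompilationParameters_Device FROM benchmark;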