SQOOP --query with SCHEMA in SQL Server - hive

I'm trying to use the --query option in sqoop to import data from SQL Server. My question is how to specify which schema to use with --query against SQL Server.
My script:
sqoop \
--options-file sqoop/aw_mssql.cfg \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from Employee where \$CONDITIONS" \
--hive-table employees \
--hive-database mssql \
-- --schema=HumanResources
Still produces an error
Invalid object name 'Employee'
Also tried
--connect "jdbc:sqlserver://192.168.1.17;database=AdventureWorks;schema=HumanResources"
but that also failed.

You can try the code below:
sqoop import \
--connect "jdbc:sqlserver://192.168.1.17;database=AdventureWorks" \
--username "Your User" \
--password "Your Password" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--verbose \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from HumanResources.Employee where \$CONDITIONS" \
--split-by "EmpID" \
--where " EmpID='Employee ID' " \
-m 1 \
--target-dir /user/cloudera/ingest/raw/Employee \
--fields-terminated-by "," \
--hive-import \
--create-hive-table \
--hive-table mssql.employees
--hive-import – Imports the table into Hive (uses Hive's default delimiters if none are set).
--create-hive-table – Creates a new Hive table. Note: the job will fail if a Hive table with that name already exists. It works in this case.
--hive-table – Specifies <db_name>.<table_name>.

The sqoop command you are using is missing a few things. First of all, you need to specify that this is a sqoop import job. Apart from that, your command needs a connection string. Moreover, I don't know what arguments you are passing inside the options file, so it would have been easier if you had posted those details; I am also not sure about the -- --schema=HumanResources part, as I haven't seen it before. A correct working sqoop example is:
sqoop import --connect <connection string> --username <username> --password <password> --query <query> --hive-import --hive-table <table_name> -m <no_of_mappers>
Also keep this in mind: when using --query, you must not specify --table as well, otherwise it will throw an error.

-schema can work in conjunction with -table, but not with -query. Think about what that would mean: it would require parsing the text of the query and replacing every unqualified table reference with a two-part name, while leaving alone table references that are already two-part, three-part or four-part names. And it would have to match exactly the syntax rules of the back end (SQL Server in this case). It's just not feasible.
Specify the schema explicitly in the query:
select BusinessEntityId, LoginID, cast(OrganizationNode as string)
from HumanResources.Employee
where ...
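Applied to the command from the question, a minimal sketch would look like this (assuming the rest of the configuration in sqoop/aw_mssql.cfg stays as it was, since the original run already got far enough to execute the query, and with the trailing -- --schema argument dropped):
sqoop \
--options-file sqoop/aw_mssql.cfg \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from HumanResources.Employee where \$CONDITIONS" \
--hive-table employees \
--hive-database mssql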

Related

transfer files from S3 bucket to BigQuery every minute using runtime parameter

I'd like to transfer data from an S3 bucket to BigQuery every minute, using the runtime parameter to define which folder to take the data from, but I get: Missing argument for parameter runtime.
The parameter is defined under --params as "data_path":
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_tranfer \
--target_dataset=$ds \
--schedule=None \
--params='{"destination_table_name_template":$ds,
"data_path":"s3://bucket/test/${runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
Apparently you have to add run_time in the destination_table_name_template,
so the command line works like this:
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_transfer \
--target_dataset=demo \
--schedule=None \
--params='{"destination_table_name_template":"demo_${run_time|\"%Y%m%d%H\"}",
"data_path":"s3://bucket/test/{runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
The run_time has to be the same as the partition_id. Above, the partitioning is hourly. The records in the files have to belong to that partition_id or the jobs will fail. To see your partition ids, use:
SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
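If you prefer the command line, the same check can be run through the bq CLI (a sketch only; the dataset name is a placeholder):
bq query --use_legacy_sql=false \
'SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL'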
But, important to mention: it's not a good idea to rely on this service for every-minute ingestion into BigQuery, since your jobs get queued and can take several minutes. The service seems to be designed to run only once every 24 hours.

Copy table structure alone in Bigquery

In Google BigQuery, is there a way to clone a table (copy the structure alone) without the data?
bq cp doesn't seem to have an option to copy the structure without data.
And CREATE TABLE AS SELECT (CTAS) with a filter such as "1=2" does create the table without data, but it doesn't copy the partitioning/clustering properties.
BigQuery now supports CREATE TABLE LIKE explicitly for this purpose.
See documentation linked below:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_like
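A minimal sketch of what that looks like through the bq CLI (the source table name is borrowed from the answer below; the clone name is just an example, and the linked documentation covers how partitioning and clustering carry over):
# "myclusteredtable_clone" is a hypothetical target name
bq query --use_legacy_sql=false \
'CREATE TABLE mydataset.myclusteredtable_clone
LIKE mydataset.myclusteredtable'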
You can use DDL and limit 0, but you need to express partitioning and clustering in the query as well
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY DATE(timestamp)
CLUSTER BY
customer_id
AS SELECT * FROM mydataset.myothertable LIMIT 0
If you want to clone the structure of a table along with its partitioning/clustering properties, without needing to know exactly what those properties are, follow the steps below:
Step 1: just copy your_table to a new table, let's say your_table_copy. This will obviously copy the whole table, including all properties (descriptions, partition expiration and so on, which are very easy to miss if you try to set them manually) and the data. Note: a copy is a cost-free operation.
Step 2: To get rid of the data in the newly created table, run the query statement below
SELECT * FROM `project.dataset.your_table_copy` LIMIT 0
While running the above, make sure you set project.dataset.your_table_copy as the destination table with 'Overwrite Table' as the 'Write Preference'. Note: this is also a cost-free step (because of LIMIT 0).
You can easily do both of the above steps from the Web UI, the command line, the API, or any client of your choice, whichever you are most comfortable with.
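From the command line, the two steps could look roughly like this (a sketch only; the project, dataset and table names are the placeholders used above):
# Step 1: copy the table (schema, partitioning, clustering and data come along)
bq cp project:dataset.your_table project:dataset.your_table_copy
# Step 2: overwrite the copy with an empty result set (LIMIT 0)
bq query --use_legacy_sql=false --replace \
--destination_table project:dataset.your_table_copy \
'SELECT * FROM `project.dataset.your_table_copy` LIMIT 0'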
This is possible with the BQ CLI.
First download the schema of the existing table:
bq show --format=prettyjson project:dataset.table | jq '.schema.fields' > table.json
Then, create a new table with the provided schema and required partitioning:
bq mk \
--time_partitioning_type=DAY \
--time_partitioning_field date_field \
--require_partition_filter \
--table dataset.tablename \
table.json
See more info on bq mk options: https://cloud.google.com/bigquery/docs/tables
Install jq with: npm install node-jq
You can use the BigQuery API to run a select, as you suggested, which will return an empty result and set the partition and cluster fields.
This is an example (only partitioning is shown, but clustering works as well):
curl --request POST \
'https://www.googleapis.com/bigquery/v2/projects/myProject/jobs' \
--header 'Authorization: Bearer [YOUR_BEARER_TOKEN]' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '{"configuration":{"query":{"query":"SELECT * FROM `Project.dataset.audit` WHERE 1 = 2","timePartitioning":{"type":"DAY"},"destinationTable":{"datasetId":"datasetId","projectId":"projectId","tableId":"test"},"useLegacySql":false}}}' \
--compressed
Finally, I went with the Python script below to detect the schema/partitioning/clustering properties and re-create (clone) the clustered table without data. I hope we get an out-of-the-box feature from BigQuery to clone a table structure without the need for a script such as this.
import commands  # Python 2 only; use subprocess in Python 3
import json

BQ_EXPORT_SCHEMA = "bq show --schema --format=prettyjson %project%:%dataset%.%table% > %path_to_schema%"
BQ_SHOW_TABLE_DEF = "bq show --format=prettyjson %project%:%dataset%.%table%"
BQ_MK_TABLE = "bq mk --table --time_partitioning_type=%partition_type% %optional_time_partition_field% --clustering_fields %clustering_fields% %project%:%dataset%.%table% ./%cluster_json_file%"

def create_table_with_cluster(bq_project, bq_dataset, source_table, target_table):
    # Dump the source table's schema to a local JSON file (named after the table)
    cmd = BQ_EXPORT_SCHEMA.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', source_table)\
        .replace('%path_to_schema%', source_table)
    commands.getstatusoutput(cmd)

    # Read the full table definition to pick up partitioning/clustering settings
    cmd = BQ_SHOW_TABLE_DEF.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', source_table)
    (return_value, output) = commands.getstatusoutput(cmd)
    bq_result = json.loads(output)

    clustering_fields = bq_result["clustering"]["fields"]
    time_partitioning = bq_result["timePartitioning"]
    time_partitioning_type = time_partitioning["type"]
    time_partitioning_field = ""
    if "field" in time_partitioning:
        time_partitioning_field = "--time_partitioning_field " + time_partitioning["field"]
    clustering_fields_list = ",".join(str(x) for x in clustering_fields)

    # Create the empty target table with the same schema, partitioning and clustering
    cmd = BQ_MK_TABLE.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', target_table)\
        .replace('%cluster_json_file%', source_table)\
        .replace('%clustering_fields%', clustering_fields_list)\
        .replace('%partition_type%', time_partitioning_type)\
        .replace('%optional_time_partition_field%', time_partitioning_field)
    commands.getstatusoutput(cmd)

create_table_with_cluster('test_project', 'test_dataset', 'source_table', 'target_table')

sqoop import staging table issue

I am trying to import data from Teradata into an HDFS location.
I have access to a view for that database, so I created a staging table in another database. But when I try to run the code it fails with this error:
Error: Running Sqoop version: 1.4.6.2.6.5.0-292 18/12/23 21:49:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 18/12/23 21:49:41 ERROR tool.BaseSqoopTool: Error parsing arguments for import:staging-table, t_hit_data_01_staging, –clear-staging-table, --query, select * from table1 where cast(date1 as Date) <= date '2017-09-02' and $CONDITIONS, --target-dir, <>, --split-by, date1, -m, 25
I have given the staging table details in the command and ran it, but it throws an error
(error parsing arguments for import, and the staging-table arguments are not recognized).
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
–-staging-table db02.table1_staging –clear-staging-table \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS " \
--target-dir '<hdfs location>' \
--split-by date1 -m 25
The data should be loaded into the HDFS location, using the staging table in another database in Teradata. Then, later, on changing the where clause, sqoop should create another file under the same folder in the HDFS location, for example part-0000, then part-0001, and so on.
I don't think there is a staging option available for the import command; --staging-table is documented for sqoop export.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
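For example, dropping the staging arguments (and using plain -- option prefixes) leaves an import along these lines, which sqoop should be able to parse (a sketch based on the command in the question):
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS" \
--target-dir '<hdfs location>' \
--split-by date1 -m 25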

Import data from sqoop to hive

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba --password=cloudera --table export1 --hive-import \
--hive-table export_3 --create-hive-table --fields-terminated-by "|" \
--lines-terminated-by "\n" --null-string nvl --null-non-string -2 --outdir java_files
If I use the above command, it gives an error that says
either use split by or -m 1 for sequential import
When I used --split-by, it ignored the null values and imported the other rows into Hive.
Can you explain the reason?
Thanks
Varun
The NULL value issues you are getting are not related to split-by.
Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string in the case of an import job, or --input-null-string and --input-null-non-string in the case of an export job, if you wish to properly preserve NULL values. Because sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
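Applied to the command from the question, that would look roughly like this (a sketch only; it keeps the question's connection details, swaps the nvl/-2 null representations for the escaped \N values described above, and adds -m 1 to satisfy the sequential-import error):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba --password=cloudera --table export1 --hive-import \
--hive-table export_3 --create-hive-table --fields-terminated-by "|" \
--lines-terminated-by "\n" --null-string '\\N' --null-non-string '\\N' \
--outdir java_files -m 1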

PostgreSQL - Automate schema and table creation - powershell

I am trying to automate the creation of schemas and of some tables inside the newly created schema. I am writing a PowerShell script to achieve this. I have been able to create the schema; however, I cannot create the tables in that schema.
I am passing the new schema to be created as a variable to PowerShell.
Script so far (based on the solution from the following answer: StackOverflow Solution):
$MySchema=$args[0]
$CreateSchema = 'CREATE SCHEMA \"'+$MySchema+'\"; set schema '''+$MySchema+''';'
write-host $CreateSchema
C:\PostgreSQL\9.3\bin\psql.exe -h $DBSERVER -U $DBUSER -d $DBName -w -c $CreateSchema
# To create tables
C:\PostgreSQL\9.3\bin\psql.exe -h $DBSERVER -U $DBUSER -d $DBName -w -f 'E:\automation\scripts\create-tables.sql' -v schema=$MySchema
At the execution, I see the following error:
psql:E:/automation/scripts/create-tables.sql:11: ERROR: no schema has been selected to create in
The content of create-tables.sql is:
SET search_path TO :schema;
CREATE TABLE testing (
id SERIAL,
QueryDate varchar(255) NULL
);
You've got this in your first step:
$CreateSchema = 'CREATE SCHEMA \"'+$MySchema+'\"; set schema '''+$MySchema+''';'
Take out that set schema - it's erroneous and causing the schema not to be created. Then on the next step you wind up with an empty search path (because the schema never got created), which is why you get that error.