Unable to Load Tab-delimited Txt File from S3 into Redshift

In SQL Workbench/J, I am trying to load a tab-delimited text file from Amazon S3 into Redshift using this command:
COPY table_property
FROM 's3://...txt'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
IGNOREHEADER 1
DELIMITER '\t';
But it returns the following warning:
Warnings:
Load into table 'table_property' completed, 0 record(s) loaded successfully.
I have checked various Stack Overflow sources and the Tutorial: Loading Data from Amazon S3, but none of the solutions work.
My data from the text file looks like this:
BLDGSQFT DESCRIPTION LANDVAL STRUCVAL LOTAREA OWNER_PERCENTAGE
12440 Apartment 15 Units or more 2013005 1342004 1716 100
20247 Apartment 15 Units or more 8649930 5766620 7796.25 100
101
1635 Live/Work Condominium 977685 651790 0 100
Does anyone have the solution to this?

Check the tables STL_LOAD_ERRORS and STL_LOADERROR_DETAIL for the precise error message.
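For example, a minimal sketch of pulling the most recent load errors from Python with psycopg2 (connection details are placeholders, not from the original question):
# Minimal sketch: query Redshift's stl_load_errors for the latest failures.
# Host, database, and credentials below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
with conn.cursor() as cur:
    cur.execute("""
        SELECT starttime, filename, line_number, colname, err_reason, raw_line
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
The err_reason and raw_line columns usually make it obvious whether the delimiter or the header row is the problem.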

The message you are talking about is not an "Error". Your table will still have all of its existing records; it just says that no additional records were added.

Try putting DELIMITER '\\t' instead of DELIMITER '\t'. That has worked in many of my cases when working with Redshift from Java, PHP, and Python; sometimes even more '\' signs are needed. It comes down to how IDEs and languages interpret the query strings they are about to execute.
For example, here is code from an Airflow DAG I'm working on right now (it doesn't matter if you're not familiar with Airflow; it's basically Python code):
redshift_load_task = PostgresOperator(
    task_id='s3_to_redshift',
    sql=" \
        COPY " + table_name + " \
        FROM '{{ params.source }}' \
        ACCESS_KEY_ID '{{ params.access_key}}' \
        SECRET_ACCESS_KEY '{{ params.secret_key }}' \
        REGION 'us-west-2' \
        ACCEPTINVCHARS \
        IGNOREHEADER 1 \
        FILLRECORD \
        DELIMITER '\\t' \
        BLANKSASNULL \
        EMPTYASNULL \
        MAXERROR 100 \
        DATEFORMAT 'YYYY-MM-DD' \
        ",
    postgres_conn_id="de_redshift",
    database="my_database",
    params={
        'source': 's3://' + s3_bucket_name + '/' + s3_bucket_key + '/' + filename,
        'access_key': s3.get_credentials().access_key,
        'secret_key': s3.get_credentials().secret_key,
    },
)
Notice how I defined the delimiter DELIMITER '\\t' instead of DELIMITER '\t'.
Another example is part of a Hive query executed via Java code on Spark:
...
AND (ip_address RLIKE \"^\\\\d+\\\\.\\\\d+\\\\.\\\\d+\\\\.\\\\d+$\")"
...
Notice how four backslashes are needed here to escape the d in the regex, instead of writing just \d. Hope it helps.
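To make the escaping layers concrete, here is a small illustration of my own (not from the original answer) of what Python actually produces in each case:
# Illustration of the escaping layers: what the driver receives depends on
# how Python unescapes the string before it is sent to Redshift.
single = "DELIMITER '\t'"   # Python replaces \t with a real tab character
double = "DELIMITER '\\t'"  # Python sends a literal backslash followed by t

print(repr(single))  # contains an actual tab inside the quotes
print(repr(double))  # contains the two characters \ and t; Redshift's COPY interprets \t as a tab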

Related

transfer files from S3 bucket to BigQuery every minute using runtime parameter

I'd like to transfer data from an S3 bucket to BigQuery every minute, using the runtime parameter to define which folder to take the data from, but I get: Missing argument for parameter runtime.
The parameter is defined under --params as "data_path":
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_tranfer \
--target_dataset=$ds \
--schedule=None \
--params='{"destination_table_name_template":$ds,
"data_path":"s3://bucket/test/${runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
Apparently you have to add run_time to the destination_table_name_template, so the command line works like this:
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_transfer \
--target_dataset=demo \
--schedule=None \
--params='{"destination_table_name_template":"demo_${run_time|\"%Y%m%d%H\"}",
"data_path":"s3://bucket/test/{runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
The runtime has to match the partition_id. Above, the partitioning is hourly. The records in the files have to belong to that partition_id, or the jobs will fail. To see your partition IDs, use:
SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
One important caveat: it's not a good idea to rely on this service for every-minute ingestion into BigQuery, since your jobs get queued and can take several minutes to run. The service seems to be designed to run only once every 24 hours.

How to remove the double quotes and skip last 3 lines in bcp loading into SQL Server?

I am loading data into SQL Server using the bcp command-line utility.
The problem is that the data comes in as shown below. Every field is in double quotes, and I have to skip the last three rows because they contain trailers. How can I solve this?
Thanks in advance. I'd appreciate any help with this.
bcp database.schema.tablename in Filename.text -T -c -t"|" -r"0x0a" -F 3 -m 2
UHDR 20211110
"DATE","CUSIP","ISIN","SEDOL","TICKER","DESCRIPTION","QUANTITY","RATE","COMMENT","MARKET","FEE"
11/10/2021|""|"CA45826T3010"|"BMVXZT5"|"ITR"|"INTEGRA RESOURCES CORP REGISTERED SHS"|"28712"|"0.0000"|"HTB"|"CA"|"11.5000"
If you want to unconditionally remove all " characters and unconditionally skip the last 3 lines:
(Get-Content -ReadCount 0 Filename.text) -replace '"' |
Select-Object -SkipLast 3 |
Set-Content Filename_CleanedUp.text
Note: -ReadCount 0 is a performance optimization that makes Get-Content read all lines into a single array instead of streaming the lines one by one.
Then pass Filename_CleanedUp.text to your bcp command.
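If PowerShell isn't available, a rough Python equivalent of the same clean-up might look like this (my own sketch; file names are placeholders and the file is assumed to fit in memory):
# Rough Python equivalent of the PowerShell clean-up above:
# strip all double quotes and drop the last three lines.
with open("Filename.text", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

cleaned = [line.replace('"', '') for line in lines[:-3]]  # remove quotes, skip last 3 lines

with open("Filename_CleanedUp.text", "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(cleaned) + "\n")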

Copy table structure alone in Bigquery

In Google BigQuery, is there a way to clone a table (copy the structure alone) without data?
bq cp doesn't seem to have an option to copy the structure without data.
And CREATE TABLE AS SELECT (CTAS) with a filter such as "1=2" does create the table without data, but it doesn't copy the partitioning/clustering properties.
BigQuery now supports CREATE TABLE LIKE explicitly for this purpose.
See documentation linked below:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_like
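As a minimal sketch (assuming the google-cloud-bigquery Python client; the table names are placeholders), the DDL can be run like this:
# Sketch: create an empty table with the same structure as an existing one
# via CREATE TABLE LIKE. Dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    "CREATE TABLE mydataset.mynewtable LIKE mydataset.myothertable"
).result()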
You can use DDL and LIMIT 0, but you need to express the partitioning and clustering in the query as well:
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id
AS SELECT * FROM mydataset.myothertable LIMIT 0
If you want to clone the structure of a table along with its partitioning/clustering properties, without needing to know exactly what those properties are, follow the steps below:
Step 1: Copy your_table to a new table, say your_table_copy. This obviously copies the whole table, including all properties (descriptions, partition expiration, and so on, which are easy to miss if you try to set them manually) along with the data. Note: the copy is a cost-free operation.
Step 2: To get rid of the data in the newly created table, run the query statement below:
SELECT * FROM `project.dataset.your_table_copy` LIMIT 0
While running the above, make sure you set project.dataset.your_table_copy as the destination table with 'Overwrite Table' as the 'Write Preference'. Note: this is also a cost-free step (because of LIMIT 0).
You can easily do both steps from the Web UI, the command line, the API, or any client of your choice, whichever you are most comfortable with.
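For illustration only, here is a rough sketch of the same two steps with the Python client (assuming google-cloud-bigquery; the table IDs are placeholders):
# Sketch of the two-step clone described above. Table IDs are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
source = "project.dataset.your_table"
copy = "project.dataset.your_table_copy"

# Step 1: copy the table (schema, partitioning, clustering, descriptions, data).
client.copy_table(source, copy).result()

# Step 2: overwrite the copy with an empty result set to drop the data.
job_config = bigquery.QueryJobConfig(
    destination=copy,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(f"SELECT * FROM `{copy}` LIMIT 0", job_config=job_config).result()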
This is possible with the BQ CLI.
First download the schema of the existing table:
bq show --format=prettyjson project:dataset.table | jq '.schema.fields' > table.json
Then, create a new table with the provided schema and required partitioning:
bq mk \
--time_partitioning_type=DAY \
--time_partitioning_field date_field \
--require_partition_filter \
--table dataset.tablename \
table.json
See more info on bq mk options: https://cloud.google.com/bigquery/docs/tables
Install jq with: npm install node-jq
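If you'd rather not install jq at all, the same extraction can be sketched in Python (assuming the bq CLI is installed and authenticated; the table ID is a placeholder):
# Alternative to jq: pull .schema.fields out of bq show output with Python.
import json, subprocess

out = subprocess.check_output(
    ["bq", "show", "--format=prettyjson", "project:dataset.table"]
)
fields = json.loads(out)["schema"]["fields"]

with open("table.json", "w") as f:
    json.dump(fields, f, indent=2)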
You can use the BigQuery API to run a SELECT, as you suggested, which will return an empty result and set the partition and cluster fields.
This is an example (only partitioning is shown, but clustering works as well):
curl --request POST \
'https://www.googleapis.com/bigquery/v2/projects/myProject/jobs' \
--header 'Authorization: Bearer [YOUR_BEARER_TOKEN]' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '{"configuration":{"query":{"query":"SELECT * FROM `Project.dataset.audit` WHERE 1 = 2","timePartitioning":{"type":"DAY"},"destinationTable":{"datasetId":"datasetId","projectId":"projectId","tableId":"test"},"useLegacySql":false}}}' \
--compressed
Finally, I went with the Python script below to detect the schema/partitioning/clustering properties and re-create (clone) the clustered table without data. I hope we get an out-of-the-box feature from BigQuery to clone a table structure without needing a script like this.
import commands
import json

BQ_EXPORT_SCHEMA = "bq show --schema --format=prettyjson %project%:%dataset%.%table% > %path_to_schema%"
BQ_SHOW_TABLE_DEF = "bq show --format=prettyjson %project%:%dataset%.%table%"
BQ_MK_TABLE = "bq mk --table --time_partitioning_type=%partition_type% %optional_time_partition_field% --clustering_fields %clustering_fields% %project%:%dataset%.%table% ./%cluster_json_file%"

def create_table_with_cluster(bq_project, bq_dataset, source_table, target_table):
    cmd = BQ_EXPORT_SCHEMA.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', source_table)\
        .replace('%path_to_schema%', source_table)
    commands.getstatusoutput(cmd)

    cmd = BQ_SHOW_TABLE_DEF.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', source_table)
    (return_value, output) = commands.getstatusoutput(cmd)

    bq_result = json.loads(output)
    clustering_fields = bq_result["clustering"]["fields"]
    time_partitioning = bq_result["timePartitioning"]
    time_partitioning_type = time_partitioning["type"]
    time_partitioning_field = ""
    if "field" in time_partitioning:
        time_partitioning_field = "--time_partitioning_field " + time_partitioning["field"]

    clustering_fields_list = ",".join(str(x) for x in clustering_fields)

    cmd = BQ_MK_TABLE.replace('%project%', bq_project)\
        .replace('%dataset%', bq_dataset)\
        .replace('%table%', target_table)\
        .replace('%cluster_json_file%', source_table)\
        .replace('%clustering_fields%', clustering_fields_list)\
        .replace('%partition_type%', time_partitioning_type)\
        .replace('%optional_time_partition_field%', time_partitioning_field)
    commands.getstatusoutput(cmd)

create_table_with_cluster('test_project', 'test_dataset', 'source_table', 'target_table')

Snakemake: How do I use a function that takes in a wildcard and returns a value?

I have cram (bam) files that I want to split by read group. This requires reading the header and extracting the read group IDs.
I have this function which does that in my Snakemake file:
def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)
I have this rule all:
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
And this rule to actually do the split:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
    output:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    run:
        import subprocess
        read_groups = open(input.readGroupIDs).readlines()
        read_groups = [str(rg.replace('\n','')) for rg in read_groups]
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)
I get this error when doing a dry run:
[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups
{sample} isn't being passed to the function.
How do I solve this problem? I'm open to other approaches if I'm not doing this in a 'snakemake-ic' way.
==============
EDIT 1
Ok, the first set of examples I gave had many many issues.
Here's a better (?) set of code, which I hope demonstrates my issue.
import sys
from os.path import join

shell.prefix("set -eo pipefail; ")

def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))

RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)

rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))

rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read group values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'

rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' # not working
"""
==============
EDIT 2
Getting MUCH closer, but currently struggling with expand properly building the lane bam files and keeping the wildcards
I'm using this loop to create the intermediate file names:
for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))
return_ID is a function which takes the sample wildcard and returns a list of the read groups the sample contains
If I use out_rg_bam as an input for a merge rule, then ALL of the files get combined into a merged bam, instead of being split by sample.
If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)) then rg_id gets applied to each sample. So if I have two samples (a,b) , with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.
I'm going to give a more general answer to help others that might find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are directly listed, e.g.:
input:
    '{sample}.bam'
If you are trying to use functions like you were here:
input:
    read_groups=identify_read_groups('cram/{sample}.bam.cram')
The wildcard replacement will not be done. You can use a lambda function and do the replacement yourself:
input:
    read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))
Try this. I use id = 0, 1, 2, 3 to name the output bam files, depending on how many read groups a bam file has.
## this is a regular function which takes the cram file and gets the read groups to
## construct your rule all
## you actually just need the number of @RG lines; the below can be simplified
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))

RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)

outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        outbam.append("cram/RG_bams/{}.RG{}.bam".format(sample, i))

rule all:
    input:
        outbam

## this is the input function, which only takes wildcards as its argument
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups[int(wildcards.id)])

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'
    run:
        import subprocess
        rg = input.read_groups
        command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)
When using Snakemake, think bottom-up: first define what you want to generate in the rule all, and then construct the rules that create those final targets.
Your all rule cannot have wildcards. It's a no-wildcard zone.
EDIT 1
I typed this pseudo-code in Notepad++; it's not meant to compile, I'm just trying to provide a framework. I think this is more what you are after.
Use a function inside an expand to generate the list of file names that will then drive the Snakemake pipeline's all rule. The baseSuffix and basePrefix variables are just to give you an idea of string passing; arguments are permitted here. When passing back the list of strings, you will have to unpack them to ensure Snakemake reads the result properly.
def getSampleFileList(String basePrefix, String baseSuffix){
    myFileList = []
    ListOfSamples = *The wildcard glob call*
    for sample in ListOfSamples:
        command = "samtools -h " + sample + "SAME CALL USED TO GENERATE LIST OF HEADERS"
        for rg in command:
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
}

basePrefix = "cram/RG_bams/"
baseSuffix = ".bam"

rule all:
    input:
        unpack(expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix)))
rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"
END OF EDIT
If it weren't in the all rule, you'd use inline functions.
I'm not entirely sure what you're trying to accomplish, but based on my guesses, read below for some notes about your code.
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
The dry run is failing when it calls the function identify_read_groups inside the rule all. The path 'cram/{sample}.bam.cram' is passed into your function call as a literal string, not with the wildcard filled in.
Technically, if the samtools call wasn't failing, and the function call "identify_read_groups(cram_file)" returned a list of 5 strings, it would expand to something like this:
rule all:
    input:
        'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'
But the term "{sample}", at this stage in Snakemake's pre-processing, is treated as a literal string; that is why wildcards inside an expand have to be written with double braces {{}}.
See how I address every Snakemake variable I declare for my rule all input call and don't use wildcards:
expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())
In this case, read_table returns a two-dimensional array to form. Snakemake is well supported by Python. I needed this to pair different annotations with different variant types.
Your rule all needs a string, or a list of strings, as input. You cannot have wildcards in your all rule. These rule all input strings are what Snakemake uses to generate matches for the OTHER wildcards. Build the entire filename in the function call and return it if you need to.
I think you should just turn it into something like this:
rule all:
    input:
        expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
Also consider updating this to be more generic:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
It can have two or more wildcards (why we love Snakemake). You can access the wildcards later in the Python run directive via the wildcards object, since it looks like you'll want to in your for-each loop.
I think the input and output wildcards have to match, so maybe try it this way as well:
rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
        rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' + str(params.rg)
Again, I'm not super sure what you're trying to do, and I'm not sure I like the idea of calling the function twice, but hey, it would run ;P Also notice the use of the wildcard "sample" in the input directive within a string as {} and in the output directive within an expand as {{}}.
See also: an example of accessing wildcards in your run directive, and an example of function calls in places you wouldn't think of. There I grabbed VCF fields, but it could have been anything, and I use an external configfile.
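Since those linked examples aren't reproduced here, a minimal sketch of my own (with made-up file names) of accessing wildcards inside a run directive:
rule echo_sample:
    input:
        'cram/{sample}.bam.cram'
    output:
        'logs/{sample}.txt'
    run:
        # wildcards.sample holds whatever matched {sample} for this job
        shell("echo processing {wildcards.sample} > {output}")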

Sqoop Export with Missing Data

I am trying to use Sqoop to export data from HDFS into PostgreSQL. However, I receive an error partway through the export saying it can't parse the input. I manually went into the file I was exporting and saw that one row had two columns missing. I have tried a bunch of different arguments with the Sqoop command but cannot get it to work. Here is what I have been running thus far:
sqoop export --connect jdbc:postgresql://localhost:5432/XX --username XX \
  --password XX --table XX --input-fields-terminated-by "\t" \
  --input-lines-terminated-by "\n" --input-null-string '\n' \
  --input-null-non-string '\n' -m 1 --export-dir /user/dan/output
I have also tried it without the "--input-null-string" and "--input-null-non-string" args and got the same result. My table has 6 columns and the file I am reading has tab separated values that are inserted into the table if all 6 are there. Any help would be appreciated.
I solved the problem by changing my reduce function so that, if a row did not have the correct number of fields, it output a specific placeholder value; then I was able to use --input-null-non-string with that value and it worked.
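The general idea, as a hypothetical sketch rather than the poster's actual reducer: pad short records with a sentinel value, then point --input-null-non-string (and --input-null-string) at that sentinel:
# Hypothetical streaming-style reducer sketch: make every output row carry
# exactly 6 tab-separated fields, padding missing ones with a sentinel.
import sys

EXPECTED_FIELDS = 6
SENTINEL = r"\N"  # the value later passed to --input-null-non-string

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < EXPECTED_FIELDS:
        # pad missing trailing columns so Sqoop can parse the row
        fields += [SENTINEL] * (EXPECTED_FIELDS - len(fields))
    print("\t".join(fields))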