COPY INTO snowflake table not loading data - No errors - amazon-s3

As part of the Snowflake WebUI Essentials course, I'm trying to load data from 'WEIGHT.TXT' in an AWS S3 bucket into a Snowflake table.
select * from weight_ingest
> Result: 0 rows
list @S3TESTBKT/W
> Result: 1
> s3://my-s3-tstbkt/WEIGHT.txt 509814 6e66e0c954a0dfe2c5d9638004a98912 Tue, 17 Dec 2019 14:52:52 GMT
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/W
FILES = 'WEIGHT.TXT'
FILE_FORMAT = (FORMAT_NAME=USDA_FILE_FORMAT)
> Result: Copy executed with 0 files processed.
Can someone please help me resolve this? Thanks in advance.
Further Information:
S3 Object URL: https://my-s3-tstbkt.s3.amazonaws.com/WEIGHT.txt (I'm able to open the file contents in a browser)
Path to file: s3://my-s3-tstbkt/WEIGHT.txt
File Format Definition:
ALTER FILE FORMAT "USDA_NUTRIENT_STDREF"."PUBLIC".USDA_FILE_FORMAT
SET COMPRESSION = 'AUTO'
FIELD_DELIMITER = '^'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
Stage Definition:
ALTER STAGE "USDA_NUTRIENT_STDREF"."PUBLIC"."S3TESTBKT"
SET URL = 's3://my-s3-tstbkt';

I believe the issue is with your COPY command. Try the following steps:
Execute a LIST command to get the list of files:
LIST @S3TESTBKT
If your source file appears here, check the folder path in your COPY command, and note that file names are case-sensitive: the staged file is WEIGHT.txt, but your command asked for WEIGHT.TXT.
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/
FILES = ('WEIGHT.txt')
FILE_FORMAT = (FORMAT_NAME = USDA_FILE_FORMAT);
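If the copy still reports 0 files processed, a quick sanity check (a sketch, assuming the stage and file format defined above) is to query the staged file directly, or to dry-run the copy with VALIDATION_MODE:
-- Peek at the staged file through the file format (note the case-sensitive name)
SELECT $1, $2, $3
FROM @S3TESTBKT/WEIGHT.txt (FILE_FORMAT => 'USDA_FILE_FORMAT')
LIMIT 10;
-- Dry-run the copy without loading any rows
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/
FILES = ('WEIGHT.txt')
FILE_FORMAT = (FORMAT_NAME = USDA_FILE_FORMAT)
VALIDATION_MODE = RETURN_10_ROWS;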

Related

Flume not writing correctly in amazon s3 (weird characters)

My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following content (plain text):
Oracle and SQL Server
After it is uploaded to S3, I downloaded the file and opened it, and it shows the following text:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"û­þ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?
Problem solved. I found this question on Stack Overflow here.
Flume's HDFS sink writes Hadoop SequenceFiles by default (hence the SEQ...LongWritable header in the output), i.e. a binary format instead of text.
So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
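For reference, the full sink stanza with both fixes folded in (these are exactly the properties from the config above plus the two new lines):
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.channel = channel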

read a table with specified file format

I have to read a file using this file format:
CREATE OR REPLACE FILE FORMAT FF_CSV
TYPE = CSV
COMPRESSION = GZIP
RECORD_DELIMITER = '\n'
FIELD_DELIMITER = 'µµµ'
FILE_EXTENSION = 'csv'
SKIP_HEADER = 0
SKIP_BLANK_LINES = TRUE
DATE_FORMAT = AUTO
TIME_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
BINARY_FORMAT = UTF8
ESCAPE = NONE --may need to set to '<character>'
ESCAPE_UNENCLOSED_FIELD = NONE --may need to set to '<character>'
TRIM_SPACE = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"' --may need to set to '<character>'
NULL_IF = '' --( '<string>' [ , '<string>' ... ] )
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
REPLACE_INVALID_CHARACTERS = TRUE
VALIDATE_UTF8 = TRUE
EMPTY_FIELD_AS_NULL = TRUE
SKIP_BYTE_ORDER_MARK = FALSE
ENCODING = UTF8
;
For now I just want to test whether this definition works correctly with my file, but I am unsure how to test it. This is how I upload the file to my Snowflake stage:
put file:///Users/myname/Desktop/leg.csv @~
Now, how can I use FILE FORMAT FF_CSV in a SELECT statement so that I can read my uploaded file using that format?
You would use a copy into statement to pull the data from the stage into a table.
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
I would also recommend checking out some of these pages around Loading data into Snowflake.
https://docs.snowflake.com/en/user-guide-data-load.html
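You can also query a staged file directly through the file format, which is a convenient way to test a format definition before loading anything (a sketch, assuming the PUT above succeeded; note that PUT gzips files by default, so the staged name is likely leg.csv.gz, which also matches COMPRESSION = GZIP in FF_CSV):
-- See what actually landed in the user stage
LIST @~;
-- Read the staged file through FF_CSV; $1, $2, ... are the parsed columns
SELECT $1, $2, $3
FROM @~/leg.csv.gz (FILE_FORMAT => 'FF_CSV')
LIMIT 10;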

How to pick up file name dynamically while uploading the file in S3 with Python?

I am working on a requirement where I have to save the logs of my ETL scripts to an S3 location.
I can already write the logs to my local file system; now I need to upload them to S3.
For this I have written the following code:
import logging
import datetime
import boto3
from boto3.s3.transfer import S3Transfer
from etl import CONFIG

FORMAT = '%(asctime)s [%(levelname)s] %(filename)s:%(lineno)s %(funcName)s() : %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
logger = logging.getLogger()
logger.setLevel(logging.INFO)
S3_DOMAIN = 'https://s3-ap-southeast-1.amazonaws.com'
S3_BUCKET = CONFIG['S3_BUCKET']
filepath = ''
folder_name = 'etl_log'
filename = ''

def log_file_conf(merchant_name, table_name):
    log_filename = (datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S')
                    + '_' + table_name + '.log')
    fh = logging.FileHandler("E:/test/etl_log/" + merchant_name + "/"
                             + log_filename)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)

client = boto3.client('s3',
                      aws_access_key_id=CONFIG['S3_KEY'],
                      aws_secret_access_key=CONFIG['S3_SECRET'])
transfer = S3Transfer(client)
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + filename)
The issue I am facing is that logs are generated for different merchants, so the file names depend on the merchant; I have taken care of that when saving locally.
But for the upload to S3 I don't know how to pick the log file name.
Can anyone please help me achieve this?
S3 is an object store; it has no real paths, and the "/" separator in a key is purely cosmetic. So nothing prevents you from using something similar to your local file naming convention, e.g.:
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + merchant_name + "/" + filename)
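One way to tie the local name and the S3 key together is to have log_file_conf build the name once and return it (a sketch based on the question's code, reusing its imports and globals; the merchant and table values below are hypothetical placeholders):
def log_file_conf(merchant_name, table_name):
    log_filename = (datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S')
                    + '_' + table_name + '.log')
    local_path = "E:/test/etl_log/" + merchant_name + "/" + log_filename
    fh = logging.FileHandler(local_path)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)
    return local_path, log_filename

# hypothetical example values
local_path, log_filename = log_file_conf('merchant_a', 'orders')
transfer.upload_file(local_path, S3_BUCKET,
                     folder_name + "/" + 'merchant_a' + "/" + log_filename)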
To list all the files under an arbitrary path (called a "prefix"), you just do this:
# simple list_objects call, not handling pagination; max 1000 objects listed
client.list_objects(
    Bucket=S3_BUCKET,
    Prefix=folder_name + "/" + merchant_name
)
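If a merchant can accumulate more than 1000 log files, a paginator handles the continuation tokens for you (standard boto3, same client as above):
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=S3_BUCKET,
                               Prefix=folder_name + "/" + merchant_name):
    for obj in page.get('Contents', []):  # 'Contents' is absent if no matches
        print(obj['Key'])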

Unable to extract JSON fields in Splunk

I'm trying to extract JSON fields from syslog inputs.
In ./etc/system/default/props.conf I've added the following lines:
[mylogtype]
SEDCMD-StripHeader = s/^[^{]+//
INDEXED_EXTRACTIONS = json
KV_MODE = none
pulldown_type = true
The SEDCMD works; the syslog headers are removed.
But the JSON fields are not parsed.
Any ideas?
Resolved. Use the following configuration in props.conf:
[yourlogtype]
SEDCMD-StripHeader = s/^[^{]+//
KV_MODE = json
pulldown_type = true

Custom Delimiters in CSV Data while exporting data from BigQuery to GCS bucket?

Background:
I have a GA Premium account. I currently have the following process setup:
The raw data from the GA account flows into BigQuery.
Query the Bigquery tables.
Export query results to a GCS bucket. I export it in a CSV and gzipped format.
Export the CSV gzipped data from the GCS bucket to my Hadoop cluster HDFS.
Generate hive tables from the data on the cluster by using the comma as the field delimiter.
I run Steps 1-3 programmatically using the BigQuery REST API.
Problem:
My data contains embedded commas and newlines within quotes in some of the fields. When I generate my hive tables, the embedded commas and newlines are causing shifts in my field values for a row or are causing nulls in the records in the hive table.
I want to clean the data by either removing these embedded commas and newlines or by replacing them with custom delimiters within the quotes.
However, the catch is that I would like to do this data cleaning at Step 3, while exporting to GCS. I looked into the query parameters I could use to achieve this but did not find any. The parameters that can populate the configuration.extract object are listed at: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract
Here is a snippet of the code that exports data from the BigQuery tables to the GCS bucket:
query_request = bigquery_service.jobs()
DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'
query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}
query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
Thanks in advance.
EDIT: it looks like I misunderstood your question. You wanted to modify the values so they contain no commas or newlines. I thought your issue was only with commas, and that the fix would be to simply not use commas as delimiters.
To be clear, there is no way to make the modification while exporting. You will need to run another query to produce a new table.
Example:
SELECT x, y, z,
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(bad_data, '%', '%25'),
      '\n', '%0A'
    ),
    ',', '%2C'
  )
FROM ds.tbl
This will percent-encode the bad_data field in a query-string-compatible way (%25 for %, %0A for newline, %2C for comma; the literal % must be escaped first so the encoding is reversible). Remember to run this query with large results enabled if necessary.
java.net.URLDecoder or something similar should be able to decode it later if you don't want to do it by hand.
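If the rest of your pipeline is Python rather than Java, the standard library can do the same decoding (a small sketch; the sample string is hypothetical and assumes the %25/%0A/%2C encoding above):
from urllib.parse import unquote

encoded = 'hello%2C%0Aworld%25'  # as produced by the query above
print(unquote(encoded))          # restores 'hello,\nworld%'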
You can set the field delimiter of the export object.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.fieldDelimiter
query_request = bigquery_service.jobs()
DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'
query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'fieldDelimiter': '~',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}
query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
With Python:
from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'
destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.ExtractJobConfig()
job_config.field_delimiter = ';'
extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,  # without this, the delimiter setting is never applied
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # wait for the extract job to finish
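To sanity-check the delimiter once the job finishes, one option (a sketch, assuming the google-cloud-storage package and the bucket/object names above) is to download the first bytes of the exported file:
from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.bucket(bucket_name).blob('shakespeare.csv')
head = blob.download_as_bytes(start=0, end=200)  # first ~200 bytes
print(head.decode('utf-8'))  # fields should now be separated by ';'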