How to extract large amounts of data (more than 100 MB) from Snowflake into CSV - syntax error

I am trying to export large amounts of data from Snowflake into a CSV. I saw a similar question, and the solution given was to “Run the query as part of a COPY INTO {location} command to an internal stage, and then use a GET command to pull it down locally.”
I tried following the guide and ran the following, but I receive the error “SQL compilation error: syntax error line 4 at position 3 unexpected 'file_format'.”
I am not sure how to fix this, or even if the first part of my syntax is correct. Can someone please help?
copy into #my_stage/result/data_ from (select *
from"IRIS"."PRODUCTION"."VW_ALL_IIS_LHJ"
where (RECIP_ADDRESS_COUNTY = 06065 or ADMIN_ADDRESS_COUNTY = 06065)
file_format=(TYPE='CSV');
[ HEADER = TRUE]
get #%my_stage/result/data.csv/;

I'm pretty sure the issue is that you're missing a closing parenthesis. Try:
copy into #my_stage/result/data_ from (select *
from"IRIS"."PRODUCTION"."VW_ALL_IIS_LHJ"
where (RECIP_ADDRESS_COUNTY = 06065 or ADMIN_ADDRESS_COUNTY = 06065))
file_format=(TYPE='CSV');
[ HEADER = TRUE]
get #%my_stage/result/data.csv/;
Sorry - I don't have a way to test this.

You are missing a closing parenthesis after the WHERE clause. You opened one parenthesis after the first FROM and another at the WHERE clause, but you only closed the WHERE parenthesis.
Also, AFAIK, you don't need to call GET if the stage is set up properly. The COPY INTO command places the file in your stage, and you then retrieve it from that stage the normal way you would access the stage you specified. So if you sent it to an S3 bucket, you'd just access the resource from S3 as if it were any other file.
Lastly, remember there are many useful parameters you can set in the FILE_FORMAT, such as RECORD_DELIMITER, COMPRESSION and how to handle NULLs.
And remove the semicolon after the FILE_FORMAT clause; it will cause another error, because HEADER is not a valid statement on its own.
Also, you don't have to put HEADER = TRUE in square brackets. Brackets in the documentation just mean the parameter is optional.
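Putting the two answers together, a full round trip could look like the sketch below, using the Snowflake Python connector. The connection details, the @my_stage name and the local download directory are placeholders (the post's #my_stage is assumed to mean a named internal stage), so treat this as untested scaffolding rather than a drop-in fix.

import snowflake.connector

# Placeholder credentials/context - not from the original post.
conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>",
    warehouse="<warehouse>",
)
cur = conn.cursor()

# Unload the query result to a named internal stage. Note the extra closing
# parenthesis after the WHERE clause, FILE_FORMAT kept before the final
# semicolon, and HEADER = TRUE without brackets. If the county columns are
# strings, the 06065 literals may need quoting as '06065'.
cur.execute("""
    copy into @my_stage/result/data_
    from (
        select *
        from "IRIS"."PRODUCTION"."VW_ALL_IIS_LHJ"
        where (RECIP_ADDRESS_COUNTY = 06065 or ADMIN_ADDRESS_COUNTY = 06065)
    )
    file_format = (type = 'CSV')
    header = true
""")

# Pull the staged files down locally. The output is gzip-compressed by
# default unless COMPRESSION = 'NONE' is added to the file format.
cur.execute("get @my_stage/result/ file:///tmp/snowflake_export/")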

Related

Row yielded no matching during look up

I have a job that runs a package. The job used to run well with no issues, but today one of the packages failed. The data flow is CSV source -> Lookup -> OLE DB destination. The error is "Row yielded no match during lookup".
I have tried:
LTRIM/RTRIM,
changing the metadata to false,
but it still didn't work. Any ideas please? Thank you.
The error is self-explanatory: your lookup does not find a matching value, and it fails because that is what you instructed it to do.
You can change it to ignore failures (similar to how a LEFT JOIN would work).
You can redirect it to an error output and then deal with it in the error output flow.
You can redirect it to a no-match output, which gives you the option to create a new flow from this component for the non-matching entries.

Issues pulling change log using python

I am trying to query and pull changelog details using Python.
The below code returns the list of issues in the project.
issued = jira.search_issues('project= proj_a', maxResults=5)
for issue in issued:
    print(issue)
I am trying to pass the issue values obtained in the loop above:
issues = jira.issue(issue,expand='changelog')
changelog = issues.changelog
projects = jira.project(project)
I get the below error on trying the above:
JIRAError: JiraError HTTP 404 url: https://abc.atlassian.net/rest/api/2/issue/issue?expand=changelog
text: Issue does not exist or you do not have permission to see it.
Could anyone advise where I am going wrong, or what permissions I need?
Please note that if I pass a specific issue_id in the above code it works just fine, but I am trying to pass a list of issue_ids.
You can already receive all the changelog data from the search_issues() method, so you don't have to fetch the changelog by iterating over each issue and making another API call per issue. Check out the code below for examples of how to work with the changelog.
issues = jira.search_issues('project= proj_a', maxResults=5, expand='changelog')
for issue in issues:
    print(f"Changes from issue: {issue.key} {issue.fields.summary}")
    print(f"Number of Changelog entries found: {issue.changelog.total}")  # number of changelog entries (careful, each entry can have multiple field changes)
    for history in issue.changelog.histories:
        print(f"Author: {history.author}")  # person who did the change
        print(f"Timestamp: {history.created}")  # when did the change happen?
        print("\nListing all items that changed:")
        for item in history.items:
            print(f"Field name: {item.field}")  # field to which the change happened
            print(f"Changed to: {item.toString}")  # new value, item.to might be better in some cases depending on your needs.
            print(f"Changed from: {item.fromString}")  # old value, item.from might be better in some cases depending on your needs.
            print()
        print()
Just to explain what you did wrong before when iterating over each issue: you have to use the issue.key, not the issue-resource itself. When you simply pass the issue, it won't be handled correctly as a parameter in jira.issue(). Instead, pass issue.key:
for issue in issues:
    print(issue.key)
    myIssue = jira.issue(issue.key, expand='changelog')

How to use insert_job

I want to run a BigQuery SQL query using the insert_job method.
I ran the following code:
JobConfigurationQuery = Google::Apis::BigqueryV2::JobConfigurationQuery
bq = Google::Apis::BigqueryV2::BigqueryService.new
scopes = [Google::Apis::BigqueryV2::AUTH_BIGQUERY]
bq.authorization = Google::Auth.get_application_default(scopes)
bq.authorization.fetch_access_token!
query_config = {query: "select colA from [dataset.table]"}
qr = JobConfigurationQuery.new(configuration:{query: query_config})
bq.insert_job(projectId, qr)
and I got an error as below:
Caught error invalid: Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0:
Please let me know how to use the insert_job method.
I'm not sure what client library you're using, but insert_job probably takes a JobConfiguration. You should create one of those and set its query parameter to the JobConfigurationQuery you've created.
This is necessary because you can insert various jobs (load, copy, extract) with different types of configurations through this one API method, and they all take a single configuration type with a subfield that specifies the job type and the details of the job to insert.
More info from BigQuery's documentation:
jobs.insert documentation
job resource: note the "configuration" field and its "query" subfield
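To make the nesting concrete, here is a rough sketch of the same job resource sent through the REST API with the Python client (google-api-python-client) instead of the Ruby one; the structure is what matters, and the Ruby JobConfiguration / JobConfigurationQuery classes map onto the same fields. The project id is a placeholder.

from googleapiclient.discovery import build
import google.auth

# Application-default credentials scoped for BigQuery.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
bq = build("bigquery", "v2", credentials=credentials)
project_id = "my-project"  # placeholder

# The query config must sit inside "configuration"; sending it at the top
# level is what produces "Job configuration must contain exactly one
# job-specific configuration object ... but there were 0".
job_body = {
    "configuration": {
        "query": {
            "query": "select colA from [dataset.table]",
        }
    }
}

job = bq.jobs().insert(projectId=project_id, body=job_body).execute()
print(job["jobReference"]["jobId"], job["status"]["state"])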

SSIS package validation error: OLE DB Source failed

I am getting the following error when I try to run my package. I am new to SSIS. Any suggestions? Thanks.
===================================
Package Validation Error (Package Validation Error)
===================================
Error at Data Flow Task [SSIS.Pipeline]: "OLE DB Source" failed validation and returned validation status "VS_NEEDSNEWMETADATA".
Error at Data Flow Task [SSIS.Pipeline]: One or more component failed validation.
Error at Data Flow Task: There were errors during task validation.
(Microsoft.DataTransformationServices.VsIntegration)
Program Location:
at Microsoft.DataTransformationServices.Project.DataTransformationsPackageDebugger.ValidateAndRunDebugger(Int32 flags, IOutputWindow outputWindow, DataTransformationsProjectConfigurationOptions options)
at Microsoft.DataTransformationServices.Project.DataTransformationsProjectDebugger.LaunchDtsPackage(Int32 launchOptions, ProjectItem startupProjItem, DataTransformationsProjectConfigurationOptions options)
at Microsoft.DataTransformationServices.Project.DataTransformationsProjectDebugger.LaunchActivePackage(Int32 launchOptions)
at Microsoft.DataTransformationServices.Project.DataTransformationsProjectDebugger.LaunchDtsPackage(Int32 launchOptions, DataTransformationsProjectConfigurationOptions options)
at Microsoft.DataTransformationServices.Project.DataTransformationsProjectDebugger.Launch(Int32 launchOptions, DataTransformationsProjectConfigurationOptions options)
VS_NEEDSNEWMETADATA shows up when the underlying data behind one of the tasks changes. The fastest solution will probably be to just delete and re-create each element that is throwing an error.
How about disabling validation checks?
If you right-click the source or destination component and select Properties, you will find a property named ValidateExternalMetadata; set it to False and try again.
This solution works for me.
This normally occurs if there has been a change to your schema. Not to stress: just double-click your input and output components and it should resolve itself.
Make sure your connection is valid. If you are using dynamic connections, try setting DelayValidation = True on the package or data flow.
In my case the destination table structure did not match the metadata in the OLE DB component. I added the missing column I had forgotten, and after that it was fixed.
After researching a bit (check and draw your own conclusions: this and this one), I think I've found a nice workaround for when the metadata problem comes from an OLE DB object, but only for a very specific case.
The thing is that when you change your column names, remove columns or add columns, you can't do anything but update the metadata.
However, if you use a SQL query to retrieve the data from the object, and you don't need to update the query itself, you won't need to update the metadata as long as the query can still ask for what it wants; basically, as long as the query is still valid.
I tried it within my own ETL: I changed an OLE DB object that was reading data from an Excel file, targeting one sheet with all the columns selected in the tab.
Changing it to a SQL query that retrieves the full sheet, like:
SELECT * FROM ['Sheet_Name$']
completely solved the case for me, even when introducing files with different metadata in the headers.

Making sure data is loaded

I use the following command to load data.
/home/bigquery/bq load --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
It has happened that when I got a non-zero return code, the data was still loaded. How do I make sure whether the command succeeded or not? Checking the return code does not seem to help. There have been times when I loaded the same file again because I got an error, but the data was already available in BigQuery.
You can run bq show -j on the load job and check the job status.
If you are writing code to do the load and so don't know the job id, you can pass your own job id into the load operation (as long as it is unique), so you will know which job to check.
For instance, you can run
/home/bigquery/bq load --job_id=some_unique_job_id --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
then
/home/bigquery/bq show -j some_unique_job_id
Note that if you are creating new tables for every load (as opposed to appending), you could use the write disposition WRITE_EMPTY to make sure the load only happens when the table is empty, thus preventing the same data from being added twice. This isn't directly supported in bq.py, but you could use the underlying bigquery_client.py to make this call, or use the REST API directly.
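As a sketch of the "use the REST API directly" option, this is how a load job's status could be checked by job id from Python with google-api-python-client; the project id and job id below are placeholders matching the --job_id example above.

from googleapiclient.discovery import build
import google.auth

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
bq = build("bigquery", "v2", credentials=credentials)

project_id = "my_project"          # placeholder
job_id = "some_unique_job_id"      # the id passed to bq load --job_id=...

job = bq.jobs().get(projectId=project_id, jobId=job_id).execute()
state = job["status"]["state"]            # PENDING, RUNNING or DONE
error = job["status"].get("errorResult")  # only present when the job failed

if state == "DONE" and error is None:
    print("Load succeeded")
else:
    print(f"Load not confirmed: state={state}, error={error}")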