"Validate Schema" In Source and Sink - azure-data-factory-2

I am using the following dataset for the purposes of testing Mapping Data flows: https://github.com/fivethirtyeight/data/tree/master/avengers.
My data flow is very simple: a Source (delimited file on Azure Data Lake Storage Gen2) moving to a Sink (Parquet file on Azure Data Lake Storage Gen2). I have used the Import Projection capability to get my schema and enabled "Validate Schema". I get a projection that assumes what appear to be the correct datatypes. It is worth noting that the dataset pointing to this source specifies all the columns as strings, so there is clearly a disconnect between the imported schema at the dataset and that of the source Data flow transformation.
See this image.
Dataflow Source Projection
If I attempt to use the "Validate Schema" option I run into problems. The first issue is that I get an error on any data type that is not a string when using "Data preview" or running the data flow in a pipeline, for example: Error: at Source 'AvengersHeader': Column 'Appearances has incompatible types( Found: StringType, Required: ShortType). I had assumed this was due to the header row, however excluding it using Skip line count with a value of 1 gives me a new error, e.g. Error: at Source 'AvengersHeader': Missing column 'URL.
The Source documentation states
Validate schema: If Validate schema is selected, the data flow will fail to run if the incoming source data doesn't match the defined schema of the dataset.
I had assumed that this would look at datatypes but now wonder if this is simply the columns supplied.
For the sink, the documentation states
Validate schema: If validate schema is selected, the data flow will fail if any column of the incoming source schema isn't found in the source projection, or if the data types don't match. Use this setting to enforce that the source data meets the contract of your defined projection. It's useful in database source scenarios to signal that column names or types have changed.
I note a couple of things:
This specifies changes in the source.
It specifies that the data flow will fail if the source types don't match, however that does not appear to be what happens here.
My question then is: can anyone point to specific documentation that details the expected behavior and usage of this? Either official MS documentation or blogs, I'm not fussy. If anyone wants to throw their 2 cents/pennies in, that's fine too.
Looking at this further, I have explored a Derived Column followed by a Conditional Split, but this is potentially quite time consuming:
```
iif(
    isNull(toInteger(Appearances)) == true()
    || isNull(toBoolean(case(or({Current?}=='', isNull({Current?})) == true(), 'NO', {Current?}))) == true()
    || isNull(toShort(Year)) == true()
    || isNull(toShort({Years since joining})) == true()
    || isNull(toBoolean(case(or(Death1=='', isNull(Death1)) == true(), 'NO', Death1))) == true()
    || isNull(toBoolean(case(or(Return1=='', isNull(Return1)) == true(), 'NO', Return1))) == true()
    || isNull(toBoolean(case(or(Death2=='', isNull(Death2)) == true(), 'NO', Death2))) == true()
    || isNull(toBoolean(case(or(Return2=='', isNull(Return2)) == true(), 'NO', Return2))) == true()
    || isNull(toBoolean(case(or(Death3=='', isNull(Death3)) == true(), 'NO', Death3))) == true()
    || isNull(toBoolean(case(or(Return3=='', isNull(Return3)) == true(), 'NO', Return3))) == true()
    || isNull(toBoolean(case(or(Death4=='', isNull(Death4)) == true(), 'NO', Death4))) == true()
    || isNull(toBoolean(case(or(Return4=='', isNull(Return4)) == true(), 'NO', Return4))) == true()
    || isNull(toBoolean(case(or(Death5=='', isNull(Death5)) == true(), 'NO', Death5))) == true()
    || isNull(toBoolean(case(or(Return5=='', isNull(Return5)) == true(), 'NO', Return5))) == true()
, 1, 0
)
```
The Conditional Split transformation looks for rows which comply and writes them, writing malformed rows elsewhere.
While this works, I'm not sure I would do it; instead, a DataFrame using _corrupt_record feels like it would solve my problem much more easily, which poses the question: why would I attempt to validate in Mapping Data Flows at all?
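For comparison, here is a minimal PySpark sketch of that _corrupt_record approach; the spark session, the file path, and the trimmed-down schema are my own assumptions, not part of the original pipeline:
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ShortType

# Trimmed-down schema covering a few avengers.csv columns; the real projection has more.
schema = StructType([
    StructField("URL", StringType()),
    StructField("Name/Alias", StringType()),
    StructField("Appearances", IntegerType()),
    StructField("Year", ShortType()),
    StructField("_corrupt_record", StringType()),  # rows that violate the schema land here
])

df = (spark.read  # 'spark' is an existing SparkSession (e.g. Databricks / pyspark shell)
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("abfss://data@mylake.dfs.core.windows.net/avengers/avengers.csv"))  # assumed path

df = df.cache()  # Spark requires caching before querying the corrupt-record column on its own

good_rows = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```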

The error Error: at Source 'AvengersHeader': Missing column 'URL is because you skip 1 line, so the column named URL can't be found. You need to check the 'First row as header' option in the connection settings of the dataset instead of skipping 1 line.
The 'Validate schema' option in the source compares the projection with the schema of your dataset. If a column or its type isn't the same, the data flow will fail.
So in your situation, I suggest you don't check the 'Validate schema' option, and then it will work fine.
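For illustration, the relevant part of a DelimitedText dataset definition with that option turned on might look roughly like this (the dataset name, linked service name, and file location are placeholders, not taken from the question):
```json
{
  "name": "AvengersCsv",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureDataLakeStorageGen2",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data",
        "fileName": "avengers.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```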

Related

Karate netty/mock schema match not working [duplicate]

I have a request that is hitting my mock server. The request is in JSON, but one of the values is a string of about 2,000+ characters. I want to match the request if that string value (of 2,000+ characters) contains a specific substring value.
for example:
Scenario:
pathMatches('/callService') &&
methodIs('post') && request.clientDescription contains 'blue eyes'
(request.clientDescription = string of 2,000+ characters)
It seems that it does not like the keyword contains, and I can't seem to find any information on the syntax I would use to search through a given string in a request and see if it contains a specific value.
I understand that I could try to match the entire string value using '==', but I am looking for a way to only match if it contains a substring.
Here's a tip: whatever you see to the right of Scenario: is pure JavaScript, and methodIs() etc. happen to be pre-defined for your convenience.
So this should work, using String.includes()
Scenario: request.clientDescription.includes('blue eyes')
Also please refer this answer for other ideas: https://stackoverflow.com/a/57463815/143475
And one more: https://stackoverflow.com/a/63708918/143475
It did not seem to like it when I added "&& request.clientDescription.includes('blue eyes')" to the Scenario, but it did lead me in the right direction, and I did find a solution. Thanks!
Error after adding String.includes to the Scenario:
com.intuit.karate - scenario match evaluation failed: evaluation (js) failed: pathMatches('/callService') &&
methodIs('post') && request.clientDescription.includes('blue eyes'), javax.script.ScriptException: TypeError: request.clientDescription.includes is not a function in at line number 2
stack trace: jdk.nashorn.api.scripting.NashornScriptEngine.throwAsScriptException(NashornScriptEngine.java:470)
A solution:
I defined a function in the Background using karate.match.
Code example of the solution:
Background:
* def isBlueEyed = function(){return karate.match("request.clientDescription contains 'Blue Eyes'").pass}
Scenario:
pathMatches('/callService') &&
methodIs('post') && isBlueEyed()
* def response = read('./***/***/**')

How to correctly pass arguments to an SQL query via the sqlx package in Golang?

In my Golang (1.15) application I use the sqlx package to work with a PostgreSQL database (PostgreSQL 12.5).
When I try to execute an SQL statement with arguments, the PostgreSQL database raises an error:
ERROR: could not determine data type of parameter $1 (SQLSTATE 42P18):
PgError null
According to the official documentation, this error means that an INDETERMINATE DATATYPE was passed.
The organizationId has a value. It's not null/nil or empty. Also, its data type is a simple built-in data type, *string.
Code snippet with Query method:
rows, err := cr.db.Query(`
    select
        channels.channel_id::text,
        channels.channel_name::text
    from
        channels
    left join organizations on
        channels.organization_id = organizations.organization_id
    where
        organizations.tree_organization_id like concat( '%', '\', $1, '%' );`, *organizationId)
if err != nil {
    fmt.Println(err)
}
I also tried to use NamedQuery, but it also raises an error:
ERROR: syntax error at or near ":" (SQLSTATE 42601): PgError null
Code snippet with NamedQuery method:
args := map[string]interface{}{"organization_id": *organizationId}
rows, err := cr.db.NamedQuery(`
    select
        channels.channel_id::text,
        channels.channel_name::text
    from
        channels
    left join organizations on
        channels.organization_id = organizations.organization_id
    where
        organizations.tree_organization_id like concat( '%', '\', :organization_id, '%' );`, args)
if err != nil {
    fmt.Println(err)
}
In all likelihood, the arguments are not passed correctly to my query. Can someone explain how to fix this strange behavior?
P.S. I must say right away that I do not want to form an sql query through concatenation, or through the fmt.Sprintf method. It's not safe.
Well, I found the solution to this problem.
I found a discussion in the GitHub repository of the sqlx package.
The first option is to build the search string outside of the query and pass it as a single parameter. This is still safe from injection attacks.
The second option is to cast the parameter inside the query: concat( '%', '\', $1::text, '%' ). As Bjarni Ragnarsson said in the comment, PostgreSQL cannot deduce the type of $1.
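A minimal, self-contained Go sketch of both options; the driver, connection string, and sample organization id are assumptions, and the query is the one from the question:
```go
package main

import (
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v4/stdlib" // assumed driver; substitute the one your project uses
	"github.com/jmoiron/sqlx"
)

func main() {
	// Assumed connection string; not taken from the question.
	db, err := sqlx.Connect("pgx", "postgres://user:pass@localhost:5432/mydb")
	if err != nil {
		log.Fatal(err)
	}
	organizationId := "42" // assumed value

	// Option 1: build the LIKE pattern in Go and pass it as a single parameter.
	// This stays injection-safe because the pattern is still bound as $1.
	pattern := `%\` + organizationId + `%`
	rows, err := db.Query(`
		select channels.channel_id::text, channels.channel_name::text
		from channels
		left join organizations
		  on channels.organization_id = organizations.organization_id
		where organizations.tree_organization_id like $1;`, pattern)
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		var id, name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
	rows.Close()

	// Option 2: keep concat() in SQL but cast the placeholder so PostgreSQL
	// can determine its type: $1::text.
	rows2, err := db.Query(`
		select channels.channel_id::text, channels.channel_name::text
		from channels
		left join organizations
		  on channels.organization_id = organizations.organization_id
		where organizations.tree_organization_id like concat('%', '\', $1::text, '%');`, organizationId)
	if err != nil {
		log.Fatal(err)
	}
	rows2.Close()
}
```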

Rally custom list query not working on string custom field

I have a custom field added at the user story (HierarchicalRequirement) level.
The WSAPI documentation shows the following details for the field:
c_CustomFieldName
Required: false
Type: string
Max Length: 32,768
Sortable: true
Explicit Fetch: false
Query Expression Operators: contains, !contains, =, !=
When trying to create a report using Custom List to identify user stories where this field is empty, I add (c_CustomFieldName = "") to the query.
And yet, the result shows rows where this field is not empty.
How can that be?
I tried querying on null, but it didn't work.
Thanks in advance.
What you're doing should work. Are you getting errors, or just incorrect data? It almost seems like it's ignoring your query altogether.
I tried to repro both with the custom list app and against wsapi directly and the following all worked as expected:
(c_CustomText = "") //empty
(c_CustomText = null) //empty
(c_CustomText != "") //non-empty
(c_CustomText != null) //non-empty
It's possible you're running into some weird data-specific edge case in your data. It may be worth following up with support.

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have a table, AccountDeletionRequest, for which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])


def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
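For context, a short usage sketch showing how the generated selectors feed the view query; the dataset and table names reuse the example above, and the exact wiring into the tables.insert call is assumed:
```python
# Rough usage: build the view query from the generated selectors and plug it into
# the tables.insert body shown earlier (dataset/table names reuse the example above).
selectors = get_leaf_column_selectors("2014_05_17", "AccountDeletionRequest")
view_query = "SELECT\n%s\nFROM [2014_05_17.AccountDeletionRequest]" % selectors
# Nested fields come back aliased, e.g. "__key__.namespace as __key___namespace",
# so the view no longer trips over '.' in field names.
```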
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. they don't have '.' in the name).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.

SQL Injection: is this secure?

I have this site with the following parameters:
http://www.example.com.com/pagination.php?page=4&order=comment_time&sc=desc
I use the values of each of the parameters as a value in a SQL query.
I am trying to test my application and ultimately hack my own application for learning purposes.
I'm trying to inject this statement:
http://www.example.com.com/pagination.php?page=4&order=comment_time&sc=desc' or 1=1 --
But it fails, and MySQL says this:
Warning: mysql_fetch_assoc() expects parameter 1 to be resource,
boolean given in /home/dir/public_html/pagination.php on line 132
Is my application completely free from SQL injection, or is it still possible?
EDIT: Is it possible for me to find a valid sql injection statement to input into one of the parameters of the URL?
An application secured from SQL injection never produces invalid queries.
So obviously you still have some issues.
A well-written application produces valid and expected output for any input.
That's completely vulnerable, and the fact that you can cause a syntax error proves it.
There is no function to escape column names or order by directions. Those functions do not exist because it is bad style to expose the DB logic directly in the URL, because it makes the URLs dependent on changes to your database logic.
I'd suggest something like an array mapping the "order" parameter values to column names:
$order_cols = array(
    'time' => 'comment_time',
    'popular' => 'comment_score',
    // ... and so on ...
);

if (!isset($order_cols[$_GET['order']])) {
    $_GET['order'] = 'time';
}

$order = $order_cols[$_GET['order']];
Restrict "sc" manually:
if ($_GET['sc'] == 'asc' || $_GET['sc'] == 'desc') {
    $order .= ' ' . $_GET['sc'];
} else {
    $order .= ' desc';
}
Then you're guaranteed safe to append that to the query, and the URL is not tied to the DB implementation.
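To make that last step concrete, the assembled query might look something like this (the table name and LIMIT are assumptions, not part of the original answer):
```php
// $order now holds something like "comment_time desc", built only from whitelisted values.
$sql = "SELECT * FROM comments ORDER BY " . $order . " LIMIT 20";
$result = mysql_query($sql);
```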
I'm not 100% certain, but I'd say it still seems vulnerable to me: the fact that it's accepting the single quote (') as a delimiter and then generating an error off the subsequent injected code says to me that it's passing things it shouldn't on to MySQL.
Any data that could possibly be taken from somewhere other than your application itself should go through mysql_real_escape_string() first. This way the whole ' or 1=1 part gets passed as a value to MySQL... unless you're passing "sc" straight through for the sort order, such as
$sql = "SELECT * FROM foo WHERE page='{$_REQUEST['page']}' ORDER BY data {$_REQUEST['sc']}";
... which you also shouldn't be doing. Try something along these lines:
$page = mysql_real_escape_string($_REQUEST['page']);

if ($_REQUEST['sc'] == "desc")
    $sortorder = "DESC";
else
    $sortorder = "ASC";

$sql = "SELECT * FROM foo WHERE page='{$page}' ORDER BY data {$sortorder}";
I still couldn't say it's TOTALLY injection-proof, but it's definitely more robust.
I am assuming that your generated query does something like
select <some number of fields>
from <some table>
where sc=desc
order by comment_time
Now, if I were to attack the order by statement instead of the WHERE, I might be able to get some results... Imagine I added the following
comment_time; select top 5 * from sysobjects
the query being returned to your front end would be the top 5 rows from sysobjects, rather than the query you tried to generate (depending a lot on the front end)...
It really depends on how PHP validates those arguments. If MySQL is giving you a warning, it means that a hacker has already passed through your first line of defence, which is your PHP script.
Use if(!preg_match('/^regex_pattern$/', $your_input)) to filter all your inputs before passing them to MySQL.
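For example, a minimal sketch of that whitelist idea applied to the three URL parameters from the question; the exact regexes are assumptions about which values should be legal:
```php
<?php
// Only digits are allowed for the page number.
if (!preg_match('/^\d+$/', $_GET['page'])) {
    $_GET['page'] = '1';
}
// Only known column names are allowed for the sort column.
if (!preg_match('/^(comment_time|comment_score)$/', $_GET['order'])) {
    $_GET['order'] = 'comment_time';
}
// Only asc/desc are allowed for the sort direction.
if (!preg_match('/^(asc|desc)$/', $_GET['sc'])) {
    $_GET['sc'] = 'desc';
}
```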