Can the regenerate functionality of HCL OneTest Data generate the test data results in the same sequence? - schema

When a user regenerates the data of a schema, will the regenerated test data be in the same sequence as when it was generated earlier, or will the data be shuffled?

When you use the regenerate functionality of HCL OneTest Data, it is possible to get the same sequence of test data. To get the same sequence of test data on regeneration, you must enter a seed value when you generate the test data for the first time.
For more information, refer to https://help.hcltechsw.com/onetest/hclonetestserver/10.1.2/com.hcl.test.otd.help.doc/topics/t_help_regenerate_data.html

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from a data lake into on-premises SQL, so I created a data flow to do the necessary data transformation and clean the data.
After that, I copied all the final data to a sink in a staging data lake, stored in CSV format.
I am facing two kinds of issues here.
Whenever I trigger or debug the pipeline to load my dataset (the full data flow activity), the data is loaded into the CSV the first time. If I run a similar pipeline a second time, the CSV file in the target data lake is loaded with empty data, which means the column header is written but I cannot see any values inside the file.
Coming to the copy activity, which is connected to the on-premises SQL Server: I am trying to load the data, but if this pipeline is triggered again and again, duplicate data is loaded. I want to load only incremental data, or data that has been updated in the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load our data into a database table, we need to use the Upsert option in the Copy data tool.
Upsert helps you incrementally load the source data based on a key column (or columns). If the key value is already present in the target table, it updates the rest of the column values; otherwise, it inserts a new row with the key and the other values.
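To see what this amounts to in plain SQL, here is a rough T-SQL sketch of the behavior, using the player table from the demonstration below. The source values are made up for illustration, and ADF performs the equivalent of this for you:
/* Illustration only: ADF runs the upsert for you; the source rows below are made up */
MERGE INTO player AS tgt
USING (VALUES (1, 'updated name', 'team A'),
              (4, 'new player', 'team B')) AS src (id, gname, team)
ON tgt.id = src.id   /* the key column chosen in the sink configuration */
WHEN MATCHED THEN
    UPDATE SET gname = src.gname, team = src.team
WHEN NOT MATCHED THEN
    INSERT (id, gname, team) VALUES (src.id, src.gname, src.team);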
Look at the following demonstration to understand how upsert works. I used an Azure SQL database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source CSV data (the data I want to incrementally load):
I have taken an id which already exists in the target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create or select a dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which the upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using Copy data, the id=1 row should be updated and the id=4 row should be inserted. The following is the final output achieved, which is in line with the expected output.
You can use the primary key in your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (like a source filter by last modified date) should not affect the process.

Copy Data from Blob to SQL via Azure data factory

I have two sample files in blob storage, sample1.csv and sample2.csv, as below:
(screenshot: data sample)
SQL table name sample2, with columns Name, id, last name, amount.
I created an ADF data flow without a schema; it results as below:
(screenshot: preview data)
Source settings: Allow schema drift is checked.
Sink settings: auto mapping turned on, Allow insert checked, table action none.
I have also tried defining a schema in the dataset; the results are the same.
Any help here?
My expected outcome would be that the data in sample1 inserts null into the column "last name".
If I understand correctly, you said "my expected outcome would be that the data in sample1 inserts null into the column last name"; you only need to add a derived column to your sample1.csv source.
You could follow my steps:
I created a sample1.csv file in Blob Storage and a sample2 table in my SQL database:
Use a Derived Column to create the new column last name with a null value:
expression: toString(null())
Sink settings:
Run the pipeline and check the data in the table:
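To double-check the result, a simple query against the sample2 table (column names taken from the question) should show NULL in last name for the rows that came from sample1.csv:
/* Column names come from the question; filtering on NULL is just one way to spot the sample1 rows */
SELECT Name, id, [last name], amount
FROM sample2
WHERE [last name] IS NULL;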
Hope this helps.
You cannot mix schemas in the same source in the same data flow execution.
Schema Drift will handle changes to the schema on an execution-per-execution basis.
But if you are reading multiple different schemas from a folder, you will get non-deterministic results.
Instead, if you loop through those files in a pipeline ForEach one-by-one, data flow will be able to handle the evolving schema.

How to make import.io retrieve stored data

Every time I run the API in the Android app, it runs the query itself and retrieves data from the website instead of the stored data. How do I make it retrieve the stored data to save running time?
This isn't something you can do via the UI just yet, but it is coming!
If you have saved the results of your Extractor as a dataset, you can do this via the API:
To query a dataset, you need to query its "snapshot"...
First use the GetConnector API with the ID of your dataset:
http://api.docs.import.io/#!/Connector_Methods/getConnector
Note the snapshot ID
Use the ID of the dataset and the snapshot ID from the result and enter them here:
http://api.docs.import.io/#!/Connector_Methods/getDataSnapshot
This will return the data stored in your dataset.

Newly inserted or updated row count in pentaho data integration

I am new to Pentaho Data Integration; I need to integrate one database into another location as an ETL job. I want to count the number of inserts/updates during the ETL job, and insert that count into another table. Can anyone help me with this?
To date, I don't think there is built-in functionality in PDI for returning the number of affected rows of an Insert/Update step.
Nevertheless, most database vendors are able to provide you with the ability to get the number of affected rows from a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
        ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
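A minimal PostgreSQL sketch of that suggestion, assuming a hypothetical target table target_table(id, name) with a unique key on id, a staged file at /tmp/source_data.csv, and a log table etl_log(run_time, affected_rows); none of these names come from the question:
/* All object names and the file path below are placeholders */
CREATE TEMP TABLE staging (id int, name text);

/* Bulk-load the staged file (the file must be readable by the database server) */
COPY staging FROM '/tmp/source_data.csv' WITH (FORMAT csv, HEADER true);

/* Upsert from staging and record the affected row count in one statement (PostgreSQL 9.5+ for ON CONFLICT) */
WITH affected AS (
    INSERT INTO target_table (id, name)
    SELECT id, name FROM staging
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
    RETURNING 1
)
INSERT INTO etl_log (run_time, affected_rows)
SELECT now(), count(*) FROM affected;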
EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. On a very high level, you could do something like the following.
Transformation I - extract data from source
Get the data from the source, be it a database or anything else
Prepare it for output in a way that it fits the target DB's structure
Save a CSV file using the text file output step on the file system
Parent Job
If the PDI server is the same as the target DB server:
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
If the PDI server is NOT the same as the target DB server:
Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table
EDIT 2: another suggested solution
As suggested by #user3123116, you can use the Compare Fields step (if not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually it could look like this (note that this is just the comparison and counting part):
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE) and do your insert/update, but this stream must wait for the field comparison stream to finish querying the target database; otherwise you might end up with the wrong statistics.
The "Compare Fields" step takes 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those 4, and then process the "Changed", "Added", and "Removed" records with an Insert/Update.
You can do it from the Logging option inside the Transformation settings. Please follow the steps below:
Click on Edit menu --> Settings
Switch to the Logging tab
Select Step from the left menu
Provide the Log Connection & Log table name (say, StepLog)
Select the required fields for logging (LINES_OUTPUT for the inserted count & LINES_UPDATED for the updated count)
Click on the SQL button and create the table by clicking on the Execute button
Now all the steps will be logged into the log table (StepLog); you can use it for further actions.
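After a run, the counts can be read back from that table; a hedged example, assuming the default field names that the SQL button generates for the step log table:
/* Field names assume the defaults offered in the Logging tab; adjust if you deselected any */
SELECT transname, stepname, lines_output, lines_updated
FROM StepLog
ORDER BY log_date DESC;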
Enjoy

pentaho spoon transformation to delete data from table

I'm trying to write an ETL job using the Pentaho Data Integration tool, in Spoon. I used the "Delete" step and provided the target table details, but the rows are not getting deleted, and there is no error. I have access to the schema. Please suggest.
In order to use the "Delete" step, you first need a data source from which PDI will read the keys to look for in the table. So, your transformation should look like this:
In my example, the first step queries the origin table for a list of ids to be deleted, and then passes them to the Delete step as keys to be used as the condition for the delete instruction.
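For reference, for each incoming row the Delete step issues a statement roughly equivalent to the one below; the table and key names are placeholders, not taken from the question:
/* Placeholder names: target_table and id are illustrative only */
DELETE FROM target_table
WHERE id = ?;   /* ? is bound to the key value coming from the previous step */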