How do BCP's -F and -L options work? - bcp

I am trying to understand how BCP's -F (first row) and -L (last row) options work, i.e. how its pointer moves to specific locations and selects only the defined rows without doing any row-number calculation like ROW_NUMBER(). I am trying to BCP out a very large table into smaller files by running multiple BCP threads with different first and last rows, but it appears to be performing very slowly, so I am trying to understand why.
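For reference, a minimal sketch of the kind of parallel export described above, assuming a hypothetical table MyDb.dbo.BigTable of roughly 4,000,000 rows and a placeholder server name; each bcp process exports a different row range and runs as a background shell job:

# Export four row ranges in parallel (Unix-shell sketch; table, server, and file names are placeholders).
bcp MyDb.dbo.BigTable out part1.dat -c -T -S MYSERVER -F 1 -L 1000000 &
bcp MyDb.dbo.BigTable out part2.dat -c -T -S MYSERVER -F 1000001 -L 2000000 &
bcp MyDb.dbo.BigTable out part3.dat -c -T -S MYSERVER -F 2000001 -L 3000000 &
bcp MyDb.dbo.BigTable out part4.dat -c -T -S MYSERVER -F 3000001 -L 4000000 &
wait

Keep in mind that a plain bcp ... out has no ORDER BY, so -F and -L are positional within whatever order the server happens to return the rows.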

Related

How to loop through a table and print each row to a different file?

I'm working on a table where one of the columns holds the configuration of a Cisco router in each row. I want to print each of those configurations to a different file.
For example, the configuration in row 1 would be printed to a file named conf1.conf, row 2 to conf2.conf, and so on.
I'm trying to do this with a bash script. How should I go about it? I can't really do it manually; the script has to run every single day.
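A rough bash sketch of one way to do this, assuming a hypothetical MySQL table router_configs with an integer id column and a config column; swap in your own client, credentials, database, and column names:

#!/usr/bin/env bash
# Write each router configuration to its own confN.conf file.
# -N skips column headers, -B uses batch mode, -r leaves the newlines in the config unescaped.
for id in $(mysql -N -B -e "SELECT id FROM router_configs" mydb); do
    mysql -N -B -r -e "SELECT config FROM router_configs WHERE id = $id" mydb > "conf${id}.conf"
done

Scheduling the script with cron would cover the daily repetition.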

Refresh rows in results window

I have some rows that all contain the same flag value in a column, and they don't have anything else in common. When I run my program, the flag is going to be updated, so re-running the query will no longer find the same rows.
Is there a way to run a query, and then later just refresh the rows as long as they are still in the results window?
Not really. You can cut and paste what is in the results window and create INSERT statements out of that data. Unlike xBase solutions, once you commit a delete your data is gone; you have to have an external copy or backup to restore from to put the data back.

How to remove duplicate rows from an output table using Pentaho DI?

I am creating a transformation that takes input from a CSV file and outputs it to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table ends up containing duplicate rows again and again.
Now I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.
How can I solve this?
Two solutions come to my mind:
Use an Insert / Update step instead of a Table output step to store data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields/columns, in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use the following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on.
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on.
If duplicate values have already been stored in the output table, you can remove the duplicates using this approach:
Use an Execute SQL step in which you define SQL that removes the duplicate entries and keeps only unique rows (a rough sketch follows below). You can take inspiration from this question to write such SQL: How can I remove duplicate rows?
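As an illustration only, the kind of statement you could put in that Execute SQL step might look like the following; the table and column names are placeholders, the ROW_NUMBER() approach assumes a database with window functions (SQL Server syntax shown), and it is wrapped in a sqlcmd call purely so the snippet can be run on its own:

# Hypothetical: keep one copy of each (col1, col2, col3) combination and delete the rest.
sqlcmd -S MYSERVER -d MyDb -Q "
WITH numbered AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY (SELECT 0)) AS rn
    FROM dbo.output_table
)
DELETE FROM numbered WHERE rn > 1;"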
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on the key fields (with a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input to be entered on the Advanced tab. There you set the flag field and the values that identify each row operation. After applying the changes, the target table will contain exactly the same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like drop the deletes or updates if you want. I often route the "identical" rows away and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the existing answers were text-only, so I am adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has duplicate values) ---> users_table; store only the latest user details.
Solution
Step 1: Connect the CSV input to the Insert/Update step as in the transformation below.
Step 2: In Insert/Update, add conditions to compare the keys to find the candidate row, and set "Y" on the fields you want to update.

Pentaho - Having multiple Copy rows to result steps results in Get rows from result being empty

I'm trying to process some data and store it in a data warehouse. To do this, I wanted to load the dimensions in one transformation and the fact table (I only have one) in another transformation, so I can use a job to execute the first one, copy rows to result, and get them in the second transformation.
In the first transformation, I read an Excel file and separate the data into several streams. It is data from a baptism, so I have one stream for the person, another for the parents, another for the sponsors, and so on... At the end of each stream, I insert the data into the database and return the autogenerated PK (an auto-increment id).
In the second one, I only have a Get rows from result step and want to write the rows to a txt file (just to check that it is being done correctly). The problem is that the file is created but it is empty. I assumed that if I leave the fields in Get rows from result empty, it gets all fields.
What am I doing wrong?
In the end, what I want is to have one Copy rows to result step at the end of each stream in the first transformation and get all of this data in the second one.
In "Insert Pare Padrina" I return id_pare_padrina, which is autogenerated, and the same with "Insert Mare Padrina" (I have more streams that I also have to include in the result). This transformation is not executed per row because I need values from other rows.
Thank you!
In order to pass the data from the first transformation to the second transformation, you need to configure a few settings:
1. First of all, in the transformation settings of the second transformation (at the job level), check the items as in the image below:
Copy Previous results to parameters will ensure that all the results/data in the "Copy Rows to Result" step is getting properly passed to the next level.
Execute for every input row: will execute the second transformation for every row coming from the first transformation. This is optional, depending on your requirement.
2. In the same transformation settings, define the "Parameters" in the Parameters tab. Check the image below:
Here, NAME is the parameter I have defined. So when you are using "Get rows from result", you can define these parameter names.
3. Instead of using "Get rows from result", you can alternatively use the "Get Variables" step to fetch all the variables coming from the previous step. All you need to do is define the parameter names inside the ktr file (Ctrl + T). (I have actually implemented it that way and it worked for me.)
4. Since the "Copy rows to result" step uses heap memory, defining multiple instances of this step might exhaust the memory space quickly and get your job into trouble. Ideally, use a single instance of this step.
But if your data iteration is only one row, the best option would be to use the "Set Variables" step.
I assume you might have missed some of these sections in the job.
You can read more about Copy rows to result here.
Hope it helps :)

How to import data into a table with 14 columns via BCP if my data file has fewer than 13 delimiters?

Sorry for the extremely awkward wording in that question. I'll explain.
I have a table with 14 columns into which I'm trying to import data via BCP. My data comes from a text file that is TAB delimited. Logically, there should be 13 delimiters for the 14 cells of data in a row. My data is inconsistent and doesn't have the trailing delimiters when the values at the end are null, which means that some rows of data have only 10 delimiters. This causes my data to "wrap around" when it is imported: the first cell of data in my text file ends up in the 10th column of the row prior to it, when it should be the first cell of its own new row.
The thing is, every single row in the text file ends with CRLF, which is the default row terminator in BCP.
Is there a way to tell BCP to fill in all 14 columns before moving on to the next row? Or will I have to reformat my data file every time I import (not ideal)?
Here is my BCP command:
bcp testdb.dbo.MACARP in C:\Users\sysbrady\Desktop\MyData.txt /c /T /t "\t" /E -S WSTVDISTD023\SQLEXPRESS
"Is there a way to tell BCP to fill in all 14 columns before moving on to the next row?"
When you say "fill in", do you mean you want BCP to keep the null values present in your text file? The -k qualifier tells BCP to keep the nulls (make sure the column in your table allows nulls). See link below:
http://msdn.microsoft.com/en-us/library/ms187887.aspx
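If keeping the nulls is indeed what you are after, the call would look something like the line below; it mirrors your command with -k appended and the switches written in the dash style used in the documentation, so adjust paths and server name as needed:

bcp testdb.dbo.MACARP in C:\Users\sysbrady\Desktop\MyData.txt -c -T -t "\t" -E -k -S WSTVDISTD023\SQLEXPRESS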
"The thing is every single row in the text file ends in with "CRLF" which is used by default in BCP."
This is unclear - could you post an image? I'm not sure whether you have phrased this as a problem, or as behavior you want to retain.