I currently have a transformation setup with 2 table inputs and one Merge Rows (Diff), the SQL select statement in both table inputs are constant, they are not changing except for the table name. So I have:
select * from THIS_WILL_CHANGE
I have around 100 tables and I don't want to manually enter the table names every iteration, especially because this is automation...
What is the best way to achieve this? Is there any way to read like a CSV file with all the table names and loop that way? Any help is appreciated..
This is something I've had to do before too!
You can do this with a variable and a job which executes once for each row of the previous step.
Create a parent job to host these steps
Create a transformation which gets the table names from 'somewhere' eg. CSV file, or database query - a select on all_tables for tables with the same column names might be a nice way to do this for all time...
In this same transformation, use copy rows to result step to push the data back to the job
Create a new 'sub job', which executes once for each row, and has a hop from the 'get data' step in the main job
In the sub job, create two transformations, one to set the variable from the results field, and one to do your select
In your select query, check the box 'substitute variables' and place your variable with the same name as your set variables step into your SQL as ${yourVariableHere}
I've put this in an image below, which hopefully helps you.
The dynamic sql row step is a good option for this, provided the tables all have the same layout/metadata.
Related
I am fairly new to Pentaho, and while working on it, I have stumbled across a problem. Below is how my flow is:
Read input from a file. Let's say this has 5 columns.
Make some modifications to existing columns. (Filter, modify and all).
Add a new column, which will be equal to an SQL function of the current row data. Example, it can be sum(id, id+1)
Dump to the database.
Step 1,2, 4 are already in place and are working fine. It's Step 3 where I am stuck. I've tried to execute SQL, but that is only for Modifying DDL and doesn't return data. Table input needs data to be in a table already, which isn't the case with me.
I have a workaround, that I can enter all rows in DB, and then fire an update query, but I was hoping if there is a better way to do this.
You can add formula step and in the formula column, you can specify what you want to achieve. For example, your other column+1 and save it in a new field or also replace the existing field value
I'm still relatively new to SQL and Pentaho.
I've pulled a table with two different IDs and need to run a query for each specific instance.
For example,
SELECT *
FROM Table
WHERE RecordA = 'value in column A'
AND RecordB = 'value in column B'
I need the results back, either appended to new columns in the original table or part of their own text file output.
I was initially looking at using a formula for this inside of Pentaho, but couldn't quite figure it out. Since I have the query written I threw it into Excel and got the concatenated results (so a string of 350 or so queries that I need to run). I'm just not sure how to accomplish this - I tried the Execute SQL Script inside of Pentaho but it doesn't seem to do output?
Any direction would be useful. I've searched a little but have come up short so far, possibly because I am still pretty new to this platform.
You can accomplish this behavior in a lot of ways, with a "Database Lookup" step for example, but I usually do that in a quite easy way and here is a example for your tests, I hope it helps.
The idea here is to have two Table input steps, the first one will fetch the IDs we want to look at. For example you may use a SQL query similar to note on the left. The result will be a 1 column stream of rows.
Next we have a Table Input that reads the rows received and executes it's query for each row. I'll add a screenshot with the options that I selected.
What it does is replace a placeholder '?' with the data that is received. If you need two columns use two '?' but remember that it will replace the first one with the first column and the second one with the second column
And you are good to go. Test it a couple of times and good luck.
And the config for the second table input.
In Sql Server ,
I don't want to generate whole table data script.I want the script of my own inserted 100 rows script . My table have huge record and it takes long if i generate script of whole table Also its not good to find my rows again from that script and pick.
Is there any way to put criteria where i can specify my range ?
something like where condition as we use in sql queries ?
Thanks.
You can use something like this.
It's called sp_generate_inserts. There are many examples in the header of the stored procedure. Take a look at example number 4.
I accidentally added a wrong column to my BigQuery table schema.
Instead of reloading the complete table (million of rows), I would like to know if the following is possible:
remove bad rows (rows with values contains the wrong column) by running a "select *" query on the table with some kind of filter, and saving result to same table.
removing the (now) unused column.
Is this functionality (or similar) supported?
Possibly the "save result to table" functionality can have a "compact schema" option.
The smallest time-saving way to remove a column from Big Query according to the documentation.
ALTER TABLE [table_name] DROP COLUMN IF EXISTS [column_name]
If your table does not consist of record/repeated type fields - your simple option is:
Select valid columns while filtering out bad records into new temp table
SELECT < list of original columns >
FROM YourTable
WHERE < filter to remove bad entries here >
Write above to temp table - YourTable_Temp
Make a backup copy of "broken" table - YourTable_Backup
Delete YourTable
Copy YourTable_Temp to YourTable
Check if all looks as expected and if so - get rid of temp and backup tables
Please note: the cost of above #1 is exactly the same as action in first bullet in your question. The rest of actions (copy) are free
In case if you have repeated/record fields - you still can execute above plan, but in #1 you will need to use some BigQuery User-Defined Functions to have proper schema in output
You can see below for examples - of course this will require some extra dev - but if you are in critical situation - this should work for you
Create a table with Record type column
create a table with a column type RECORD
I hope, at some point Google BigQuery Team will add better support for cases like yours when you need to manipulate and output repeated/record data, but for now this is a best workaround I found - at least for myself
Below is the code to do it. Lets say c is the column that you wants to delete.
CREATE OR REPLACE TABLE transactions.test_table AS
SELECT * EXCEPT (c) FROM transactions.test_table;
Or second method and my favorite is by following below steps.
Write Select query with the columns you want to exclude.
Go to Query Settings
Query Settings
In Destination setting Set destination table for query results, enter project name, Dataset name and table name exactly same as you entered in Step 1.
In Destination table write preference select Overwrite table.
Destination table settings
Save the Query Setting and run the query.
Save results to table is your way to go. Try on the big table with the selected columns you are interested, and you can apply a limit to make it small.
I need to set up a new company for automated data import. The utility has provided the data in a spreadsheet. (Image 1)
Based on this data, I need to create a stored procedure that will identify the correct meter, if it exists, and perform either an insert or update to the monthly data table. For automated utility data import, I want to make sure I restrict everything to a particular utility company.
The steps are the following ( I am having a hard time converting this to SQL)
1- I just want a script that identify the correct meter to see if it exists, basically check the Meter# column in the excel with the MeterNumber column in the Meters table.
2- The next step is perform either an insert or update to the MonthlyData table. This is a screen shot of all its columns.
3- Then I just want to make sure that I am restricting everything to the particular company which in this case Site1 since 2 different companies might have the same meter#. The UtilityCompany table contains 3 columns: ID, Name, UtilityType
I honestly do not know from where to get started, would anybody help me with the script? Thank you
You will want to:
perform a Bulk Insert operation to take your data from the excel file into a staging table.
write a query to select ALL rows for the corresponding utility company (notice I didn't see iterate over each row...). This select could be an update where you update an additional column to mark the row as an INSERT, or an UPDATE.
Then the last step (2 parts), retrieve all of the rows that were marked as INSERT, and insert those into your table. Then grab all rows that were marked with an UPDATE, and update their corresponding values based on your matching criteria.