How to use Pentaho Denormalizer Step with Metadata Injection

I want to denormalize the below data.
Input
Required output
col1 col2 col3 col4
aaa bbb ccc ddd
I think in Pentaho we can use the Metadata Injection step together with the Row Denormaliser step to dynamically denormalize all row values into columns.

Yes, it's possible. One constraint is that you need to provide your input data twice.
I have prepared a solution; you can see the dynamic denormalize example Here.
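Conceptually, the Row Denormaliser performs a pivot: it groups key/value rows and turns the key field's values into columns. Below is a minimal SQL sketch of the static equivalent, assuming the input arrives as (group_key, field_name, field_value) rows; the names are illustrative, not from the original question. Metadata Injection effectively builds this column list at runtime from a first pass over the data, which is presumably why the input has to be provided twice.
SELECT group_key,
       MAX(CASE WHEN field_name = 'col1' THEN field_value END) AS col1,
       MAX(CASE WHEN field_name = 'col2' THEN field_value END) AS col2,
       MAX(CASE WHEN field_name = 'col3' THEN field_value END) AS col3,
       MAX(CASE WHEN field_name = 'col4' THEN field_value END) AS col4
FROM input_rows
GROUP BY group_key;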

Related

Filter table columns and route to different table if it is null

I don't know much about SQL but still I would like to ask this forum.
My job is to handle records with null values. We have natural keys (say, 4 columns); if any of those columns contains a NULL value, that row should be routed to another table so that it can be reported to the client.
AFAIK a SQL query gives only one output and cannot be split. Is there any way we can handle this in SQL/Spark SQL? I need to execute this job using Spark.
The process flow is:
first, data is Sqooped and kept in a Hive table
I need to take this data and check for null values
then store it in next-level tables
Although you can't do it in one go, you can do it with the steps below.
After the table is created in Hive, using PySpark you could do:
# Set up the imports and enable Hive support for the session
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# DataFrame holding the rows where any of the 4 columns is null
df = spark.sql("select * from tblName where col1 is null or col2 is null or col3 is null or col4 is null")
# Write the resulting DataFrame to a Hive table
df.write.saveAsTable('tableName')  # use the other arguments of saveAsTable (mode, format) as required
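If you also need the complementary set (rows where none of the key columns is NULL) stored in a second table, a pair of CREATE TABLE ... AS SELECT statements is one way to express the whole split. This is only a sketch that could be run through spark.sql(); the table names error_tbl and clean_tbl are placeholders.
-- Rows with a NULL in any natural-key column, for reporting
CREATE TABLE error_tbl AS
SELECT * FROM tblName
WHERE col1 IS NULL OR col2 IS NULL OR col3 IS NULL OR col4 IS NULL;
-- Rows with every natural-key column populated
CREATE TABLE clean_tbl AS
SELECT * FROM tblName
WHERE col1 IS NOT NULL AND col2 IS NOT NULL AND col3 IS NOT NULL AND col4 IS NOT NULL;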

search for a large set of tokens across a large set of columns

I'm working in SQL Server 2008. I'm trying to return all records where the given columns have a substring that matches at least one token of a very large set of tokens. The number of columns I'm searching on is also quite large. What is the best way to do this?
I know that the basic approach is something like:
WHERE
(col1 LIKE '%token1%' OR col1 LIKE '%token2%' OR...
OR
col2 LIKE '%token1%' OR col2 LIKE '%token2%' OR...
OR
. . . .
)
However, this will be very tedious and large.
This is a bit long for a comment.
You basically have two alternatives. The first is full text search. That is, treat each column as a document and create a full text index on them.
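With a full text index in place, the query would look roughly like the sketch below (CONTAINS accepts a column list). Note that full text search matches whole words or prefixes rather than arbitrary substrings, so it only fits if the tokens are word-like. The table name is illustrative.
SELECT *
FROM MyTable
WHERE CONTAINS((col1, col2), '"token1" OR "token2"');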
The second option is to normalize your data structure. You would create a separate row for each token in each column. A row in this normalized structure would look like:
EntityId "Column" Token
1 col1 Token1
1 col3 Token2
2 col1 Token2
. . .
This structure would greatly speed your search with the appropriate index.
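A minimal sketch of that normalized structure and the lookup it enables, assuming exact token matches are sufficient; all names here are illustrative:
CREATE TABLE EntityTokens (
    EntityId   INT          NOT NULL,
    ColumnName VARCHAR(128) NOT NULL,
    Token      VARCHAR(100) NOT NULL
);
CREATE INDEX IX_EntityTokens_Token ON EntityTokens (Token);
-- With the (large) token set kept in its own table SearchTokens(Token),
-- a simple indexed join finds every matching entity
SELECT DISTINCT et.EntityId
FROM EntityTokens et
JOIN SearchTokens st ON st.Token = et.Token;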
By the way, your data structure looks suspicious. A table that contains lists of things in a column is usually a bad idea. The proper data structure for a list in relational databases is a table, not a column. A table with multiple columns that contain the same type of information (such as a list of tokens) usually suggests that the data should be normalized into a separate table, one row per value.

Joining sql column with comma delimited column

I have three tables that look like
table1
ID Title Description
1 Title1 Desc1
2 Title2 Desc2
table2
ID HistoryID
1 1001
2 1002
2 1003
2 1004
table3
HistoryID Value
1001 val1
1002 val2
1003 val3
1004 val4
Now I am planning to do it using "only" two tables:
table1
ID Title Description HistoryIDList
1 Title1 Desc1 1001
2 Title2 Desc2 1002,1003,1004
table3
HistoryID Value
1001 val1
1002 val2
1003 val3
1004 val4
I have created a SQL table-valued function that will return the individual values 1002, 1003, 1004 so that they can be joined with HistoryID from table3.
Since I am losing normalization and do not have a FK for HistoryIDList, my questions are:
will there be a significant performance issue running a query that joins on HistoryIDList?
would indexing the SQL function do the trick or not, since there is no relation between the two columns?
In that case, is it possible to add a FK on the table created by the SQL function?
Why would you change a good data structure to a bogus data structure? The two table version is a bad choice.
Yes, there is a significant performance difference when joining from a single id to a list, compared to a simple equi-join. And, as bad as that normally is, the situation is even worse here because the type of the id is presumably an int in the original table and a character string in the other table.
There is no way to enforce foreign key relationships with your proposed structure without using triggers.
The only thing that you could possibly do to improve performance would be to have a full text index on the HistoryIdList column. This might speed the processing. Once again, this is complicated by the fact that you are using numeric ids.
My recommendation: just don't do it. Leave the table structure as it is, because that is the best structure for a relational database.
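For comparison, the query the original three-table design supports is a plain equi-join that indexes on the key columns can drive; this sketch uses the column names from the example above:
SELECT t1.ID, t1.Title, t1.Description, t3.HistoryID, t3.Value
FROM table1 t1
JOIN table2 t2 ON t2.ID = t1.ID
JOIN table3 t3 ON t3.HistoryID = t2.HistoryID;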

Inserting a row in multiple tables, and maintaining a relationship separately

I am a bit lost trying to insert my data in a specific scenario from an excel sheet into 4 tables, using SSIS.
Each row of my excel sheet needs to be split into 3 tables. The identity column value then needs to be inserted into a 4th mapping table to hold the relationship. How do I achieve this efficiently using SSIS 2008?
Note that in the example below, it is fixed that both col4 and col5 go into the 3rd table.
Here is data example
Excel
col1 col2 col3 col4 col5
a b c d 3
a x c y 5
Table1
PK col
1 a
2 a
Table2
PK col1 col2
1 b c
2 x c
Table3
PK Col
1 d
2 3
3 y
4 5
Map_table
PK Table1_ID Table2_ID Table3_ID
1 1 1 1
2 1 1 2
3 2 2 3
4 2 2 4
I am fine even if just a SQL-based approach is suggested, as I do not have any mandate to use SSIS only. An additional challenge is that in Table2, if the same data row already exists, I want to use that ID in the map table instead of inserting duplicate rows!
Multicast is the component you are looking for. This component takes an input source and duplicates it into as many outputs as you need. In that scenario, you can have an Excel source and duplicate the flow to insert the data into your Table1, Table2 and Table3.
Now, the tricky part is getting those identities back into your Map_table. Either you don't use IDENTITY and use some other means (like a GUID, or an incremental counter of your own that you would set up as a derived column before the Multicast), or you use @@IDENTITY to retrieve the last inserted identity. Using @@IDENTITY sounds like a pain to me for your current scenario, but that's up to you. If the data is not that huge, I would go for a GUID.
@@IDENTITY doesn't work well with BULK operations; it will retrieve only the last identity created. Also, keep in mind that I talked about @@IDENTITY, but you may want to use IDENT_CURRENT('TableName') instead to retrieve the last identity for a specific table. @@IDENTITY retrieves the last identity created within your session, whatever the scope. You can use SCOPE_IDENTITY() to retrieve the last identity within your scope.
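As a quick reference, here is a minimal T-SQL sketch of the difference between those three functions; the table and column names are placeholders:
-- After an insert in the current scope...
INSERT INTO Table1 (col) VALUES ('a');
-- ...each function reports a different "last identity"
SELECT SCOPE_IDENTITY();         -- last identity generated in the current scope
SELECT @@IDENTITY;               -- last identity generated in the current session, any scope (e.g. by a trigger)
SELECT IDENT_CURRENT('Table1');  -- last identity generated for Table1, by any session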

SQL question regarding the insertion of empty tuples to prepare for update statements

I am making a table that will be borrowing a couple of columns from another table, but not all of them. Right now my new table doesn't have anything in it. I want to add X number of empty tuples to my table so that I can then begin adding data from my old table to my new table via update statements.
Is that the best way to do it? What is the syntax for adding empty rows? Thanks
Instead of inserting nulls and then updating them, can't you just insert the data from the other table directly, using something like this:
INSERT INTO Table1 (col1, col2, col3)
SELECT column1, column2, column3
FROM Table2
WHERE <some condition>
If you still want to insert empty records, then you will have to make sure that all of your columns allow NULLs. Then you can use something like this:
Table1
PrimaryKey_Col | col1 | col2 | col3
INSERT INTO Table1 (PrimaryKey_Col) VALUES (<some value>)
This will make sure a new row is inserted with a primary key and the rest of the columns set to NULL. These records can be updated later.
No, this is not even a good way, let alone the best way.
If you look at it conceptually, adding empty rows serves no purpose.
In databases, each row of a table corresponds to a true statement (a fact). Adding a row with all NULLs (even if possible) records nothing and represents inconsistent data on its own, especially when there are multiple empty records.
Also, if you are even able to add a row with all NULLs to a table, that's an indication that you have no data integrity rules for the row, and that's mostly what databases are about: integrity rules and quality of data. A well-designed database should not accept contradictory or meaningless data (and an empty row is meaningless).
As Pavanred answered, inserting real data is a single command, so there is no benefit in making it two or more commands.