Datastage Column Mapping logic Based on column Value (Or any SQL function to do the same) - sql

I'm trying to find a solution in datastage (Or in SQL) - without having to use a bunch of if/else conditions - where I can map value of one column based on value of another column.
Example -
Source File -
ID
Header1
Value1
Header2
Value2
1
Length
10
Height
15
2
Weight
200
Length
20
Target Output -
ID
Length
Height
Weight
1
10
15
2
20
200
I can do this using Index/Match function of excel. Was wondering if datastage or Snowflake can look into all these fields similarly and automatically populate the value column to the corresponding header column!

I think the best solution in DataStage would be a Pivot stage followed by a Transformer stage to strip out the hard-coded source column names.

Related

How to transpose columns when they encode multiple "records"?

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
Delete the now unneeded columns (foo1..3, bar1..3)
Transpose your CONCAT columns into a single column, no prefix, ignoring blanks, filling down other columns
Now FOO_CONCATs and BAR_CONCATs are all in the same column
Split that column into several columns...(using the separator you used in step 1)
Rename columns
Strip out prefixes (I had foo1:4, bar2:8, etc for clarity)
Transform to numbers (Edit cells -> Common Transforms -> toNumber)
Now you're ready to transpose,facet, etc
I think this is essentially the same has the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On "Id" column select Edit column->Add column based on this column
from menu
Make new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once new "foobar" column exists, on this column use menu option Edit cells->Split multi-valued cells using the "~" character (as used in the GREL above)
The also on the "foobar" column use menu option Edit columns->Split into several columns, using the "|" character as in the GREL above
Finally on ID column use menu Edit cells->Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
Which uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.

How to (1) condense into one row after certain number of rows; (2) How to assign field names

Using Pentaho PDI 8.3.
After REST calls with quite complex data structures, I was able to extract data with a row for each data element in a REST result/ E.g:
DataCenterClusterAbstract
1
UK1
Datacenter (auto generated)
Company
29
0
39
15
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
2
UK1_Murex
Datacenter (auto generated)
Company
0
0
0
0
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
3
UK1_UNIX
Notice that there are 8 data elements that are spread out into separate rows. I would like to condense these 8 data elements into one row each iteration in Pentaho. Is this possible? And assign field names?
Row flattener
Condense 8 data element in columns into one row. Each of these 8 data elements are repeating.
(1) Add row flattener
(2) Assign field names for the rows coming in - so you have 10 data attributes in rows specify a field name for each row.
(3) In table output use space as seperator

Split column in hive

I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe '|' character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest you to split your pairs split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on the second | as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]

Mark accumulated values on a QlikView column if condition is fulfilled

I have a table in Qlikview with 2 columns:
A B
a 10
b 45
c 30
d 15
Based on this table, I have a formula with full acumulation defined as:
SUM(a)/SUM(TOTAL a)
As a result,
A B D
b 45 45/100=0.45
c 30 75/100=0.75
d 15 90/100=0.90
a 10 100/100=1
My question is. how do I mark in colour the values in column A that have on column D <=0.8)?
The challenge is that D is defined with full accumulation, but if I reference D in a formula, it doesn't consider the full accumulation!
I tried with defining a formula E=if(D>0.8,'Y','N') but this formula doesn't take the visible (accumulated) value for D unfortunately, instead it takes the D with no accumulation. If this worked, I would have tried to hide (not disable) E and reference it from the dimensions column of the table , Text colour option. Any ideas please?? Thanks
You can't get an expression column's value from within a dimension or it's properties, because the expression columns rely on the dimensions provided. It would create an endless loop. Your options are:
Apply your background colour to the expression columns, not the dimensions. This would actually make more sense as the accumulated values would have the colour, not the dimension.
When loading this specific table, have QlikView create a new column that contains the accumulated values of B. This would mean, however, that the order of your chart-table would need to be fixed for the accumulations to make any sense.
Use aggregation to create a temporary table and accumulate the values using RangeSum(). Note this will only accumulate properly if the table is ordered in Ascending order of Column A
=IF(Aggr(RangeSum(Above(Sum(B),0,10)),A)/100>0.8,
rgb(0,0,0),
rgb(255,0,0)
)

Excel: one column has duplicates of each value, I need to take averages of the corresponding two values from the other columns

Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are averages of 1 and 2, 2 and 2 and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A:$6,D1,$B$1:$B$6)
Less typing, easier to read..
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.