Flattening array column and creating Index column for array elements at the same time -- Azure Data Factory - indexing

I have a dataset that in simple representation looks like:
Col1  Col2
1     [A,B]
2     [C]
I want to denormalize the data and create another column while flattening, which would be the index of the elements in the array. The desired result set would look like:
Col1  Col2  Col3
1     A     1
1     B     2
2     C     1
I was able to achieve the requirement using the mapIndex, keyValues and mapAssociation expression functions.
Still, this does not feel like the right way to do it, and I suspect there is a simpler approach. I read the Microsoft documentation and couldn't find one.
Can someone help/guide me to a better solution?
Edit 1:
Source is Azure Blob Storage. I have access to only ADF. Data is a complex XML document. All transformations are to be performed only with ADF.
Edit 2:
Target is SAP BW, but I don't have control over it. I can only write to it.

You can use a Flatten transformation to unroll the array values and a Window transformation to get the row number, partitioned by Col1.
Flatten transformation: unroll by the array column (Col2).
Window transformation: connect the output of the Flatten transformation to a Window transformation.
Set the partition column (Col1) in the Over clause.
Set a sort column to determine the order of rows within each partition.
In the Window columns setting, define a column using rowNumber() to get the index value within each Col1 partition.
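Outside ADF, the logic of these two steps can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not ADF code: the `rows` sample and the `flatten_with_index` helper are made up for illustration, and the index here follows the position of each element in its array.

```python
# Hypothetical sample mirroring the question: (Col1, Col2-as-array)
rows = [(1, ["A", "B"]), (2, ["C"])]


def flatten_with_index(rows):
    """Unroll each array and number its elements starting at 1 per Col1."""
    flattened = []
    for col1, values in rows:
        # enumerate(..., start=1) plays the role of a per-partition row number
        for position, value in enumerate(values, start=1):
            flattened.append((col1, value, position))
    return flattened


result = flatten_with_index(rows)
for row in result:
    print(row)
```

In the data flow itself, rowNumber() numbers rows according to the sort column chosen in the Window transformation, so the index matches the array position only if that sort order does.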


How to use a google sheets pivot query to output strings

I have a (much larger) table like this sample:
I am trying to output a table that looks like this:
The closest I can get is a pivot query that returns numerical results in the value fields, rather than the desired text strings:
=query(Data, "Select D,count(D) group by D Pivot B")
I resorted to a series of formulas to build my row and column headers, and then fill in the data field - See Version 3 in the sample sheet. But I couldn't figure out how to fill in the data with a single formula - as opposed to copying and pasting in the data field, which is not desirable with a dynamic number of row and column headers based on the original data.
Is there a way to wrap my data field formula (in cell B44 of the sample) in an arrayformula that will fill the data field, with a dynamic number of columns and rows?
Or, even better, is there a single formula that will deliver my desired results table?
This should work. It's a bit difficult to explain, but I could demonstrate the various parts if you made your sheet editable...
=ARRAYFORMULA(TRANSPOSE(QUERY(TRIM(SPLIT(TRANSPOSE(QUERY(QUERY({CHAR(10)&A2:A11,B2:B11&"|"&D2:D11&"|"},"select MAX(Col1) group by Col1 pivot Col2"),,9^9)),"|",0,0)),"select Col1,MAX(Col3) where Col1<>'' group by Col1 pivot Col2 order by Col1 desc label Col1'Project'")))
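Since the formula is hard to read, here is a rough Python sketch of what it appears to do, based on my reading of it: collect the column A text for each (D, B) pair, join multiple values with line breaks, and lay the result out with D down the side and B across the top. The sample rows and the `pivot_text` helper are hypothetical, since the original sheet is not shown.

```python
from collections import defaultdict

# Hypothetical rows: (A = text to show in the cell, B = column header, D = row header)
data = [
    ("Task 1", "Alice", "Project X"),
    ("Task 2", "Bob", "Project X"),
    ("Task 3", "Alice", "Project Y"),
    ("Task 4", "Alice", "Project X"),
]


def pivot_text(rows):
    """Pivot so that cells hold joined text instead of counts."""
    cells = defaultdict(list)
    for a, b, d in rows:
        cells[(d, b)].append(a)
    row_keys = sorted({d for _, _, d in rows})
    col_keys = sorted({b for _, b, _ in rows})
    table = [[""] + col_keys]  # header row
    for d in row_keys:
        # Empty string where a (row, column) pair has no data
        table.append([d] + ["\n".join(cells[(d, b)]) for b in col_keys])
    return table


table = pivot_text(data)
for line in table:
    print(line)
```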

pandas read sql query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file which I read into memory (max_ids dataframe).
I was trying to create a query where I would say give me all of the data where the Idcol > max_id for that table. I was getting errors that Series are mutable so I could not use them in a parameter. The code below ended up working but it was literally just a guess and check process. I turned it into an int and then a string which basically extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do before I replicate this for about 32 different tables? I want to always be able to grab only the latest data from these tables which I am then doing stuff to in pandas and eventually consolidating and exporting to another database.
df= pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params={'max_id', str(int(max_ids['table_max']))})
Can I also make the table name more dynamic as well? I need to go through a list of tables. The database is MS SQL and I am using pymssql and sqlalchemy.
Here is an example of where I ran max_ids['table_max']:
Out[11]:
0 1900564174
Name: max_id, dtype: int64
Assuming that your max_ids DataFrame looks like this:
In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
you can do it this way:
qry = 'SELECT * FROM {} WHERE Idcol > :max_id'
# Run the query once per table listed in max_ids
for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    pd.read_sql_query(qry.format(r['table']), engine, params={'max_id': r['table_max']})
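The same pattern can be tried without a database server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MS SQL (an assumption made only so the example is self-contained); the table contents, the `ALLOWED_TABLES` set and the `fetch_new_rows` helper are hypothetical. The max ID is passed as a bound parameter, while the table name, which cannot be a bound parameter, is checked against a known list before being formatted into the query.

```python
import sqlite3

# In-memory stand-in database with two small hypothetical tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab_a (Idcol INTEGER, val TEXT)")
conn.execute("CREATE TABLE tab_b (Idcol INTEGER, val TEXT)")
conn.executemany("INSERT INTO tab_a VALUES (?, ?)",
                 [(1, "old"), (2, "new"), (3, "newer")])
conn.executemany("INSERT INTO tab_b VALUES (?, ?)",
                 [(10, "old"), (11, "new")])

# Last ID already downloaded for each table
max_ids = {"tab_a": 1, "tab_b": 10}
ALLOWED_TABLES = {"tab_a", "tab_b"}


def fetch_new_rows(conn, table, max_id):
    """Return rows of `table` whose Idcol is greater than max_id."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: %r" % table)
    query = ("SELECT Idcol, val FROM {} WHERE Idcol > :max_id "
             "ORDER BY Idcol").format(table)
    return conn.execute(query, {"max_id": max_id}).fetchall()


new_rows = {table: fetch_new_rows(conn, table, max_id)
            for table, max_id in max_ids.items()}
print(new_rows)
```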

Hive UDF to generate all possible ordered combinations from the list

I am trying to figure out how to write a Hive UDF that takes a list as input and outputs a list of all 2-way ordered combinations of the elements in that list.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that takes an array<bigint> and turns it into an array<struct<int,int>> or an array<array<int>>.
It sounds like you want what's called n choose k, which will produce n! / ((n-k)! * k!) elements.
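As a quick check of that count, here is a small Python sketch (not Hive code) that evaluates the formula for the list in the question and compares it with the pairs produced by the standard library's itertools.combinations. The variable names are illustrative only.

```python
from itertools import combinations
from math import factorial

values = [5142430, 5146974, 5141766]
n, k = len(values), 2

# n! / ((n-k)! * k!)
expected_count = factorial(n) // (factorial(n - k) * factorial(k))

pairs = list(combinations(values, k))
print(expected_count, pairs)
```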
Hive has two kinds of UDFs. The simple kind can only process primitive (non-collection) types. Since you are processing an array, you'll need a Generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to do it is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Use a double lateral view to take the Cartesian product of the array with itself, then keep only the pairs where the first element is smaller than the second:
select a1,b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1 where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
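The reasoning behind the query, and the reason for the uniqueness caveat, can be sketched in Python. This is an illustration of the logic only, with a hypothetical `pairs_via_cross_join` helper: it builds the Cartesian product of the list with itself and keeps pairs where the first element is smaller.

```python
from itertools import product


def pairs_via_cross_join(values):
    """Mimic the double explode: cross product filtered on a1 < b1."""
    return [(a1, b1) for a1, b1 in product(values, repeat=2) if a1 < b1]


unique_values = [5142430, 5146974, 5141766]
unique_pairs = pairs_via_cross_join(unique_values)
print(unique_pairs)

# With a repeated element, the pair (1, 2) appears twice and (1, 1) never appears
duplicate_pairs = pairs_via_cross_join([1, 1, 2])
print(duplicate_pairs)
```

With unique elements every unordered pair appears exactly once, which is why the filter works; with duplicates the result no longer matches the set of combinations.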

SSIS Conditional Filter split with bit data type

The data comes from a table with 58 columns; 4 of those columns are used in the conditional filter:
1 column - bit data type (Color_Ind)
1 column - bit data type (Type_ind)
2 columns - varchar data type (Region, State)
Process steps:
1) Use an OLE DB connection to extract all the data.
2) Use a Data Conversion to convert the bit columns to the string data type.
3) Use a conditional filter to split the data into 5 different categories.
Problem:
- If I use Color_Ind and Type_ind, the filter does not produce my desired output, but if I leave those columns out and use only Region and State, it does.
Could you please tell me how to use the bit data type in a conditional filter in SSIS?
Appreciate your help.
Thanks
I got the answer from this post: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/a0c4dbe0-bbe1-421f-b2fe-b72dce4d224e
Thanks for your efforts.

SQL Server 2008 Query Result Formating (changing x and y axis fields)

What is the most efficient way to reformat query results, whether in the SQL Server query itself or in another program such as Access or Excel, so that the X axis (the column headers in the first row) and the Y axis (the field values in the first column) can be changed while the same query result data is still represented, just in a different layout?
The way the data is stored in my database, and the way my original query results are returned in SQL Server 2008, is as follows:
Original Results Format
And Below is the way I need to have the data look:
How I need the Results to Look
In essence, I need the Zipcode field to go down the Y axis (first column) and the Coveragecode field to go across the top row (X axis), with the exposures filling in the rest of the data.
The only way I can think of to get this done is to bring the data into Excel and do a bunch of VLOOKUPs. I tried using some pivot tables but didn't get too far. I'm going to continue trying to format the data using VLOOKUPs, but hopefully someone can come up with a better way of doing this.
What you are looking for is a table operator called PIVOT. Docs: http://msdn.microsoft.com/en-us/library/ms177410.aspx
WITH Src AS (
    SELECT Coveragecode, Zipcode, EarnedExposers
    FROM yourTable
)
SELECT Zipcode, [BI], [PD], [PIP], [UMBI], [COMP], [COLL]
FROM Src
PIVOT (
    MAX(EarnedExposers) FOR CoverageCode
    IN ([BI], [PD], [PIP], [UMBI], [COMP], [COLL])
) AS P;
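To experiment with the reshaping without a SQL Server instance, the sketch below uses Python's built-in sqlite3 module. SQLite has no PIVOT operator, so this swaps in conditional aggregation (MAX over a CASE expression per coverage code), which should give the same shape of result. The sample rows are invented for illustration; only the table and column names are taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (Zipcode TEXT, Coveragecode TEXT, EarnedExposers REAL)"
)
# Hypothetical sample data in the original "long" layout
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", [
    ("10001", "BI", 1.5),
    ("10001", "PD", 2.0),
    ("10002", "BI", 0.5),
    ("10002", "COLL", 3.0),
])

codes = ["BI", "PD", "PIP", "UMBI", "COMP", "COLL"]
# One output column per coverage code; NULL where a zipcode has no such row
select_cols = ", ".join(
    "MAX(CASE WHEN Coveragecode = '{0}' THEN EarnedExposers END) AS [{0}]".format(code)
    for code in codes
)
query = ("SELECT Zipcode, {} FROM yourTable "
         "GROUP BY Zipcode ORDER BY Zipcode").format(select_cols)

pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```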