Flatten and parse JSON using Azure Data Flow - azure-data-factory-2

I have a table with the following data:
Code Info
AE [{"key":"eng","value":"ABC"},{"key":"fra","value":"DEF"}]
US [{"key":"eng","value":"XYZ"},{"language":"dut","value":"123"}]
UK [{"key":"arb","value":"KLM"}]
I want to transform it using Azure Data Flow as below:
Code InfoKey InfoValue
AE eng ABC
AE fra DEF
US eng XYZ
US dut 123
UK arb KLM
I have tried the flatten transformation and the parse transformation, but neither succeeded. When I use the flatten transformation on the column 'Info', it gives the same output, and when I try parsing the JSON, it is not able to transform the data at all.
Could someone help with how to transform this data?
Thanks

I reproduced the scenario and was able to parse the data into multiple columns.
Step1:
• Add the output of the source to a derived column transformation and split the JSON value column (Info) into an array.
• Here I replace a few characters before applying split, to get a correct array representation (see the Python sketch after this step).
split(replace(replace(replace(Info, '[', ''),']',''),'},','}},'),'},')
• Preview of the derived column: Info for code AE is split into array elements Info1 & Info2.
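For intuition only, here is a small Python sketch (not ADF expression syntax) of what that replace/split chain produces for one Info value; the doubled closing brace is what keeps each element a complete JSON object after the split:
# Rough Python illustration of the derived-column expression, applied to one Info value.
info = '[{"key":"eng","value":"ABC"},{"key":"fra","value":"DEF"}]'
stripped = info.replace('[', '').replace(']', '')  # drop the outer array brackets
doubled = stripped.replace('},', '}},')            # duplicate '}' so the split keeps it
parts = doubled.split('},')                        # one JSON object string per element
print(parts)  # ['{"key":"eng","value":"ABC"}', '{"key":"fra","value":"DEF"}']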
Step2:
• Connect the output of the derived column to a Flatten transformation to flatten the array into multiple rows.
• Select Info[] under Unroll by and Unroll root to flatten the Info array into multiple rows.
• Output of the flatten transformation
Step3:
• Connect the flatten output to a Parse transformation to parse the array values into multiple columns.
• Select the column to parse in the expression, and give the parsed column names and types in Output column type.
• Output of the parse transformation: the data is parsed into two columns, Key and Value.
• Here there is a NULL Key for Code US with value 123, because the source does not have a "key" field for code US in the second array element.
Step4:
Sink Output:
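As a quick sanity check outside ADF, the same reshaping can be sketched in pandas (the input table below is reconstructed from the question); it also shows why the US/dut row ends up with a NULL key:
import json
import pandas as pd

# Reconstruction of the source table from the question.
df = pd.DataFrame({
    "Code": ["AE", "US", "UK"],
    "Info": [
        '[{"key":"eng","value":"ABC"},{"key":"fra","value":"DEF"}]',
        '[{"key":"eng","value":"XYZ"},{"language":"dut","value":"123"}]',
        '[{"key":"arb","value":"KLM"}]',
    ],
})

df["Info"] = df["Info"].apply(json.loads)        # parse the JSON string
flat = df.explode("Info", ignore_index=True)     # one row per array element
flat["InfoKey"] = flat["Info"].apply(lambda d: d.get("key"))
flat["InfoValue"] = flat["Info"].apply(lambda d: d.get("value"))
print(flat[["Code", "InfoKey", "InfoValue"]])
# The US/dut row gets InfoKey = None because that element uses "language" instead of "key".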

Related

How can I read and parse files with varying spaces as delimiters?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file.
How to parse these files, given that the spaces between the columns vary in almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)   # drop the comment/header lines
    sl = split.(l)                         # split each line on runs of whitespace
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),              # 2nd column joined into one string
            SCORE=parse.(Float64, getindex.(sl, 3))) # 3rd column as a vector of floats
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame. Note that split.(l) with no delimiter argument splits on any run of whitespace, so the varying spacing between columns is handled automatically.

Clean output of a pandas data extraction, deleting the unnamed index column

I have a dataset from which I have extracted a row based on a condition on the column 'Description'. Here are the first few rows, to show what the data looks like.
I extracted the row with the condition below:
ATL_ID=airport_codes[airport_codes['Description'].str.contains('Hartsfield-Jackson Atlanta ')]
It successfully finds the row. Now I need to extract the value under 'Code'. I use this code:
ATL_ID.loc[:,'Code']
and output is:
373 10397
Name: Code, dtype: int64
I don't want anything else in the output except 10397. 373 is the row index and the rest is additional description which I don't want. How can I get just the single number for 'Code'?
Thanks
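A minimal pandas sketch of one way to do this, assuming the filter matched exactly one row:
code = ATL_ID['Code'].iloc[0]    # positional: the first (and only) matching value
# or, equivalently, via the underlying NumPy array:
code = ATL_ID['Code'].values[0]
print(code)                      # 10397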

Why can't I access df data in pandas?

I have a table where the column names are not really organized; different years of data sit in different column positions.
So I have to access the data through specific column names.
I am using this syntax to access a column:
df = df[["2018/12"]]
But when I just want to extract the numbers under that column using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it includes the column name along with the numeric data.
How can I extract just the number from each row?
Below is the CSV data:
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
Assuming you have the following data frame df imported from your CSV:
Closing Date 2014/12 2015/12 2016/12 2017/12 2018/12
0 Net Sales 31,634 49,924 62,051 68,137 72,590
1 Net increase -17,909 -16,962 -34,714 -26,220 -29,721
2 Net Received - - - - -
3 Net Paid -328 -6,038 -9,499 -9,375 -10,661
then by doing df = df[["2018/12"]] you create a new data frame with one column, and df.iloc[0,0] works perfectly well here, returning 72,590. If you wrote df = df["2018/12"] instead, you would create a Series, and there df.iloc[0,0] would throw the error 'too many indexers', because a Series is one-dimensional.
Anyway, if you need the values of a Series, use the values attribute (or to_numpy() for pandas 0.24 or later) to get the data as an array, or to_list() to get them as a list.
But I guess what you really want is to have your table transposed:
df = df.set_index('Closing Date').T
to the following more logical form:
Closing Date Net Sales Net increase Net Received Net Paid
2014/12 31,634 -17,909 - -328
2015/12 49,924 -16,962 - -6,038
2016/12 62,051 -34,714 - -9,499
2017/12 68,137 -26,220 - -9,375
2018/12 72,590 -29,721 - -10,661
Here, df.loc['2018/12','Net Sales'] gives you 72,590 etc.
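For reference, a small sketch of that flow in pandas, using the CSV text pasted from the question (with the stray blank line dropped):
import pandas as pd
from io import StringIO

csv_text = """Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
"""

df = pd.read_csv(StringIO(csv_text)).drop(columns=["Trend"])  # the Trend column is empty here
df = df.set_index("Closing Date").T                           # the years become row labels
print(df.loc["2018/12", "Net Sales"])  # -> 72,590 (still a string, since the values keep their commas)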

Extract column metadata from rows

I'm trying to read a CSV file where metadata is placed on rows 4 to 7:
Source,"EUROSTAT","INE","BDP"
Magnitude,,,
Unit ,"Percent","Percent","Unit"
Series,"Rate","PIB","Growth"
The relevant data starts on row 10. This CSV format will always be fixed, with this sort of data and this row layout:
Date,Series1,Series2,Series3
30-09-2014,13.1,1.1,5.55
30-06-2014,13.9,0.9,5.63
31-03-2014,15.1,1.0,5.57
31-12-2013,15.3,1.6,5.55
30-09-2013,15.5,-1.0,5.66
30-06-2013,16.4,-2.1,5.65
What I've done is skip the first rows, read the data from row 10 onwards, and define the column metadata myself. However, the resulting rows don't carry the unit or the magnitude defined in the metadata header. In SSIS I do an Unpivot of the columns into an Indicator column, so I get a Date, Series, Value style of table.
My question is: how can I build a table of the form Date, Series, Value, Magnitude, Unit, where Series, Magnitude, and Unit are read from the first 10 rows?
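This is not SSIS, but to illustrate the reshaping idea, here is a rough pandas sketch; it assumes the file is called indicators.csv (hypothetical) and that there are exactly three series matching the data headers Series1..Series3. In SSIS, the equivalent would be a second source for the metadata rows, joined to the unpivoted data on the series/column name.
import pandas as pd

path = "indicators.csv"   # hypothetical file name

# Metadata block: file rows 4-7, one attribute per row, one column per series.
meta = pd.read_csv(path, skiprows=3, nrows=4, header=None, index_col=0).T
meta.index = ["Series1", "Series2", "Series3"]    # assumed to match the data headers
meta.columns = [c.strip() for c in meta.columns]  # "Unit " -> "Unit"

# Data block: the header row is file row 10, so skip the first 9 rows.
data = pd.read_csv(path, skiprows=9)

# Unpivot to Date / column / Value, then attach the metadata for each column.
long = data.melt(id_vars="Date", var_name="Column", value_name="Value")
result = long.join(meta, on="Column")[["Date", "Series", "Value", "Magnitude", "Unit"]]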

Apache Spark SQL query optimization and saving result values

I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns.
Here each line is a feature and each column is a dimension.
I have converted the txt files into JSON format and am able to run SQL queries on the JSON files using Spark.
Now I am trying to build a k-d tree with this large data.
My steps:
1) Calculate the variance of each column, pick the column with the maximum variance and make it the key of the first node, with the mean of that column as the value of the node.
2) Based on the first node's value, split the data into 2 parts and repeat the process until a stopping point is reached.
My sample code:
import sqlContext._

val people = sqlContext.jsonFile("siftoutput/")      // load the JSON files
people.printSchema()
people.registerTempTable("people")                   // make the data queryable from SQL
val output = sqlContext.sql("SELECT * FROM people")
The people table has 128 columns.
My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize for this?
4) How do I save the output as key-value pairs in a text file?
Please help.
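A rough PySpark sketch touching each of these points, assuming the current SparkSession/DataFrame API (the question's sqlContext.jsonFile is the older Spark 1.x entry point) and hypothetical column names c1, c2:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kdtree-prep").getOrCreate()
people = spark.read.json("siftoutput/")
people.cache()                                    # 3) keep the data cached across repeated queries

# 1) bring query results back to the driver as a Python list of Row objects
rows = people.limit(10).collect()

# 2) variance of one column (loop over all 128 columns as needed)
var_c1 = people.agg(F.variance("c1")).first()[0]  # "c1" is a hypothetical column name

# 4) write (key, value) pairs out as lines of a text file
people.rdd.map(lambda r: "%s\t%s" % (r["c1"], r["c2"])).saveAsTextFile("siftoutput_kv/")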