Split a JSON string column or use the flatten transformation in a Data Flow (ADF) - azure-data-factory-2

I copied the following csv file into a data flow in ADF.
The column Data contains JSON, but it is treated as a string. I want to flatten the Data column into individual rows. I tried the flatten transformation, but it did not work because the Data column is not typed as JSON. How do I deal with this? I also tried a split expression, and it did not work either. Thank you.

Just from your screenshot, we can see that:
The data in Data is not in JSON format.
Data looks more like an array.
The 'array' has 9 elements.
We must treat it as an array, and then we can use a Data Flow Derived Column to flatten the Data column. Please see my steps below:
Source data:
Derived Column expressions and settings:
The expressions treat Data as a string and use an index to get each value:
Data 1: split(substring(Data, 2, length(Data)-2), ",")[1]
Data 2: split(substring(Data, 2, length(Data)-2), ",")[2]
Data 3: split(substring(Data, 2, length(Data)-2), ",")[3]
Data 4: split(substring(Data, 2, length(Data)-2), ",")[4]
Data 5: split(substring(Data, 2, length(Data)-2), ",")[5]
Data 6: split(substring(Data, 2, length(Data)-2), ",")[6]
Data 7: split(substring(Data, 2, length(Data)-2), ",")[7]
Data 8: split(substring(Data, 2, length(Data)-2), ",")[8]
Data 9: split(substring(Data, 2, length(Data)-2), ",")[9]
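As a rough illustration of the logic only (this is Python, not Data Flow expression syntax, and the sample value is hypothetical): substring strips the surrounding brackets and split breaks the remainder on commas; note that Data Flow array indexes start at 1, while Python's start at 0.
data = "[v1,v2,v3,v4,v5,v6,v7,v8,v9]"   # hypothetical sample value of the Data column
inner = data[1:-1]                       # same idea as substring(Data, 2, length(Data)-2)
parts = inner.split(",")                 # same idea as split(..., ",")
print(parts[0])                          # the element Data Flow addresses as [1]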
Derived Column output:
If the Data column were in standard JSON format, we would need to convert the string to JSON first and then use the key to get the value.
HTH.

Related

Dot product between two Vector columns of a DataFrame

I have this situation and I'm stuck, so I'm looking for guidance please. (I am aware of many of the limitations of doing linear algebra on Spark, one being distributed scientific computing with scipy and numpy at scale, plus serialization and deserialization.) I thought of joining these two columns and computing a combination of columns, and I took a look at that approach, but the index of each vector in the vector column is very important to me. I also looked at a UDF for the dot product of DataFrame columns, but it operates element-wise per row rather than over all combinations of col1 with col2:
"Looking to compute the dot product between two SparseVector columns, one SparseVector column from df1 and another SparseVector column from df2, while preserving the index of each vector."
As you already know, this is solely for big data, millions and billions of vectors, so collecting and using plain numpy and scipy is not a solution for me at this moment, only after filtering down to small data.
Here is a sample of my data; each vector has the same length, but the number of vectors in each df is different:
> df1:
col1
|(11128,[0,1,2,3,5...|
|(11128,[11,22,98,...|
|(11128,[51,90,218...|
> df2:
col1
|(11128,[21,23,24,...|
|(11128,[0,1,2,3,5...|
|(11128,[0,1,2,3,4...|
|(11128,[28,59,62,...|
...
Adding more info on the vectors part: maybe I could modify .withColumn() to use a .map() so all vectors are processed in parallel at once, since each one has an index? I know this is not the best approach, but it is all I can think of right now (this is not about solving the .dot() product itself, but more about using a UDF/pandas_udf to extend math operations to the Vector level).
I bring everything into an rdd with an index; is there a way for me to modify the approach so the index becomes a column name?
[[0, SparseVector(11128, {0: 0.0485, 1: 0.0515, 2: 0.0535, 3: 0.0536, 5: 0.0558, 6: 0.0545, 7: 0.0558, 59: 0.1108, 62: 0.1114, 65: 0.1123, 68: 0.1126, 70: 0.113, 82: 0.121, 120: 0.1414, 149: 0.149, 189: 0.1685, 271: 0.1876, 275: 0.1891, 303: 0.1919, 478: 0.2193, 634: 0.2359, 646: 0.2383, 1017: 0.2626, 1667: 0.2943, 1821: 0.3006, 2069: 0.3095, 2313: 0.3191, 3104: 0.347})],
[1, SparseVector(11128, {11: 0.0621, 22: 0.0776, 98: 0.1167, 210: 0.155, 357: 0.1811, 360: 0.1818, 466: 0.1965, 475: 0.1962, 510: 0.2005, 532: 0.2033, 597: 0.2092, 732: 0.2178, 764: 0.2198, 1274: 0.2489, 1351: 0.2519, 1353: 0.2522, 1451: 0.2562, 1577: 0.2608, 2231: 0.2841, 2643: 0.2969, 3107: 0.3114})]]
So I did try an approach with a UDF, but so far I can only get it working with a static vector (I convert to an rdd and take each vector individually, but that is not the best approach for me; I want to do everything at once and in parallel, i.e. map over the vectors and keep each vector's index in place while doing it):
from pyspark.mllib.linalg import *
from pyspark.sql.functions import udf, col, array, lit
from pyspark.sql.types import FloatType

# write our UDF for the .dot product
def dot_prod(a, b):
    # cast to a plain Python float so Spark's FloatType accepts the result
    return float(a.dot(b))

# apply the UDF to the column (static_array is a plain Python list of values)
df = df.withColumn("dotProd", udf(dot_prod, FloatType())(col("col2"), array([lit(v) for v in static_array])))
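One possible direction, sketched under assumptions (the vector column is named col1 in both DataFrames, the index column names are made up here, and a plain UDF is used rather than a pandas_udf): attach an index to each DataFrame with zipWithIndex, combine the two DataFrames, and apply the dot-product UDF row by row so every vector keeps its original index.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# attach an index to each DataFrame so every vector keeps its position
df1_idx = df1.rdd.zipWithIndex().map(lambda r: (r[1], r[0]["col1"])).toDF(["idx1", "v1"])
df2_idx = df2.rdd.zipWithIndex().map(lambda r: (r[1], r[0]["col1"])).toDF(["idx2", "v2"])

# dot product of the two vectors sitting in one row
dot_udf = udf(lambda a, b: float(a.dot(b)), DoubleType())

# all combinations of df1 vectors with df2 vectors, with both indexes preserved
pairs = df1_idx.crossJoin(df2_idx)
result = pairs.withColumn("dotProd", dot_udf(col("v1"), col("v2")))
A full cross join over millions of vectors is expensive, so in practice you would filter or block the candidate pairs first; this only sketches the mechanics of keeping the indexes alongside the result.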

Is there an easier way to grab a single value from within a Pandas DataFrame with multiindexed columns?

I have a Pandas DataFrame of ML experiment results (from MLFlow). I am trying to access the run_id of a single element in the 0th row and under the "tags" -> "run_id" multi-index in the columns.
The DataFrame is called experiment_results_df. I can access the element with the following command:
experiment_results_df.loc[0,(slice(None),'run_id')].values[0]
I thought I should be able to grab the value itself with a statement like the following:
experiment_results_df.at[0,('tags','run_id')]
# or...
experiment_results_df.loc[0,('tags','run_id')]
But both of those just result in the following rather confusing error (as I'm not setting anything):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It's working now, but I'd prefer to use a simpler syntax. And more than that, I want to understand why the other approach isn't working, and if I can modify it. I find multiindexes very frustrating to work with in Pandas compared to regular indexes, but the additional formatting is nice when I print the DF to the console, or display it in a CSV viewer as I currently have 41 columns (and growing).
I don't understand what the problem is:
import pandas as pd

df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
                   ('T', 'B'): {0: 2, 1: 5},
                   ('T', 'C'): {0: 3, 1: 6}})
print(df)
# Output
   T
   A  B  C
0  1  2  3
1  4  5  6
How to extract 1:
>>> df.loc[0, ('T', 'A')]
1
>>> df.at[0, ('T', 'A')]
1
>>> df.loc[0, (slice(None), 'A')][0]
1
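If the goal is simply to avoid spelling out the first level, DataFrame.xs is another option; a small sketch on the toy frame above (it selects the whole second-level column, then takes the row positionally):
>>> df.xs('A', axis=1, level=1)
   T
0  1
1  4
>>> df.xs('A', axis=1, level=1).iloc[0, 0]
1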

Databricks spark.read csv has rows with #torefresh

I am reading csv files into a dataframe:
1. I create the schema.
2. I load the csv files: spark.read.option("header", "false").schema(schema).option('delimiter', ',').option('mode', 'PERMISSIVE').csv(path1), where path1 is a list of about 10000 csv paths.
I get a df like in the picture.
How can I check which files / which rows contain the #torefresh and null values?
To find which files contain those rows, you can use the input_file_name function from pyspark.sql.functions,
e.g.
df.where("col1 == '#torefresh'").withColumn("file", input_file_name()).show()
With that you can also easily get an aggregate with one row per file:
df.where("col1 == '#torefresh'").withColumn("file", input_file_name()).groupBy("file").count().show()
+--------------------+-----+
| file|count|
+--------------------+-----+
|file:///C:/Users/...| 119|
|file:///C:/Users/...| 131|
|file:///C:/Users/...| 118|
|file:///C:/Users/...| 127|
|file:///C:/Users/...| 125|
|file:///C:/Users/...| 116|
+--------------------+-----+
I don't know of a good Spark way to find the row number in the original file - the moment you load a csv into a DataFrame, this information is pretty much lost. There is a row_number function, but it works over a window, so the numbers will depend on the way you define the window partitioning / sorting.
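For illustration, window-based numbering could look roughly like the sketch below; ordering by monotonically_increasing_id is an assumption here and is not guaranteed to reproduce the physical order of rows in the original files, which is exactly the caveat above.
from pyspark.sql.functions import input_file_name, monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# number the rows within each source file; the ordering column is only a stand-in
w = Window.partitionBy(input_file_name()).orderBy(monotonically_increasing_id())
numbered = df.withColumn("file", input_file_name()).withColumn("row_in_file", row_number().over(w))
numbered.where("col1 == '#torefresh'").select("file", "row_in_file").show()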
If you are working with a local file system, you could try to manually read the csv again and find row numbers, something like this:
import csv
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import *

@udf(returnType=ArrayType(IntegerType()))
def getMatchingRows(filePath):
    # re-read the file locally and collect the indexes of the matching lines
    with open(filePath.replace("file:///", ""), 'r') as file:
        reader = csv.reader(file)
        matchingRows = [index for index, line in enumerate(reader) if line and line[0] == "#torefresh"]
    return matchingRows

withRowNumbers = df.where("col1 == '#torefresh'")\
    .withColumn("file", input_file_name())\
    .groupBy("file")\
    .count()\
    .withColumn("rows", getMatchingRows("file"))
withRowNumbers.show()
withRowNumbers.show()
+--------------------+-----+--------------------+
| file|count| rows|
+--------------------+-----+--------------------+
|file:///C:/Users/...| 119|[1, 2, 4, 5, 6, 1...|
|file:///C:/Users/...| 131|[1, 2, 3, 6, 7, 1...|
|file:///C:/Users/...| 118|[1, 2, 3, 4, 5, 7...|
|file:///C:/Users/...| 127|[1, 2, 3, 4, 5, 7...|
|file:///C:/Users/...| 125|[1, 2, 3, 5, 6, 7...|
|file:///C:/Users/...| 116|[1, 2, 3, 5, 7, 8...|
+--------------------+-----+--------------------+
But it's going to be pretty inefficient, and if you expect to have those rows in many of the files it defeats the point of using DataFrames. I'd suggest working on the source of the data and adding some sort of ids on creation, unless of course all you need to know is whether a file contains any such rows.
If, besides knowing that the first value is "#torefresh", you also need all the other values to be null, you can extend the where filter and the manual check in the udf.
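For example, the DataFrame filter could be extended roughly like this (col2 and col3 are placeholders for whatever the remaining columns in your schema are called):
from pyspark.sql.functions import col, input_file_name

df.where((col("col1") == "#torefresh") & col("col2").isNull() & col("col3").isNull())\
    .withColumn("file", input_file_name())\
    .groupBy("file").count().show()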

pandas: is it good to have a list as an element, or just flatten it?

Suppose I want to analyze item-purchase records.
My analysis function expects user_id, item_ids:
def analyze(user_id, item_ids):
    ...
Is it a good idea to prepare the data as
user_id  item_ids
1        [3,4,5]
vs
user_id  item_ids
1        3
1        4
1        5
(with the 2nd one, I could do a groupby and generate the data format I need)?
I just find the format ([1, [3,4,5]]) harder to work with than ([1,3],[1,4],[1,5]) in intermediate steps.
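For reference, converting between the two layouts is cheap in pandas, so a common choice is to keep the flat (long) form and build the list form only when a function needs it; a minimal sketch with placeholder column names based on the example:
import pandas as pd

# flat (long) layout: one row per (user_id, item) pair
flat = pd.DataFrame({"user_id": [1, 1, 1], "item_id": [3, 4, 5]})

# build the list-per-user layout on demand
nested = flat.groupby("user_id")["item_id"].agg(list).reset_index(name="item_ids")
#    user_id   item_ids
# 0        1  [3, 4, 5]

# and go back from lists to the flat layout
flat_again = nested.explode("item_ids").rename(columns={"item_ids": "item_id"})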

Format query result as JSON in Athena

In Athena, I have some data formatted as a rather complicated struct with various nested arrays and structs:
struct<a:struct<b:string, c:int, d:array< ... >>>
I want to format my query result as a JSON string:
{"a": {"b": "cow", "c": 5, "d": [ ... ]}}
However if I cast the data to JSON, CAST(x AS JSON), I just get the associated values:
[["cow", 5, [ ... ]]]
How do I get the desired format without constructing the JSON object by hand? The underlying data in the S3 bucket is JSON; is there a better way to format the table in Athena?