Altering or updating nested data in a Spark DataFrame

I have an unusual requirement in Spark where I have to transform the data present in a DataFrame.
I read data from an S3 bucket and turn it into a DataFrame. That part works fine; the next step is where the challenge lies.
Once the data is read, the JSON data needs to be transformed so that all records are consistent.
Sample data which I have
{"name": "John", "age": 24, "object_data": {"tax_details":""}}
{"name": "nash", "age": 26, "object_data": {"tax_details": {"Tax": "None"} } }
The issue is that the tax_details field is a string in the first document, while the second document has an object. I want to ensure it always ends up as an object. If that can be done with a DataFrame operation, that would be great; otherwise, any pointer on how to do it would be appreciated.
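For illustration, here is a rough, untested sketch of the kind of normalization I have in mind with PySpark, reading the files as raw JSON text and reparsing tax_details with an explicit schema (the S3 path is a placeholder; the field names come from the sample above):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("normalize-tax-details").getOrCreate()

# each input line is one JSON document; read it as plain text first
raw = spark.read.text("s3a://my-bucket/path/")  # placeholder path

# extract the fields of interest; tax_details is kept as a raw JSON string for now
parsed = raw.select(
    F.get_json_object("value", "$.name").alias("name"),
    F.get_json_object("value", "$.age").cast("int").alias("age"),
    F.get_json_object("value", "$.object_data.tax_details").alias("tax_details_raw"))

# parse tax_details as an object; documents where it was a plain string parse to null,
# which we replace with an empty object so the column is a struct on every row
tax_schema = StructType([StructField("Tax", StringType(), True)])
normalized = parsed.withColumn(
    "tax_details",
    F.coalesce(
        F.from_json("tax_details_raw", tax_schema),
        F.from_json(F.lit("{}"), tax_schema))).drop("tax_details_raw")

normalized.printSchema()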
Looking for any help


Efficiency in using pandas and parquet

People talk a lot about using Parquet with pandas, and I am trying hard to understand whether we can utilize the full feature set of Parquet files when they are used with pandas. For instance, say I have a big Parquet file (partitioned on year) with 30 columns (including year, state, gender, last_name) and many rows. I want to load the Parquet file and perform a computation like the following:
import pandas as pd
df = pd.read_parquet("file.parquet")
df_2002 = df[df.year == 2002]
df_2002.groupby(["state", "gender"])["last_name"].count()
In this query only 4 of the 30 columns and only the year 2002 partition are used. That means we only want to bring in the columns and rows needed for this computation, and something like this is possible in Parquet with predicate and projection pushdown (which is why we use Parquet in the first place).
But I am trying to understand how this query behaves in pandas. Does it bring everything into memory the moment we call df = pd.read_parquet("file.parquet")? Or is some laziness applied here to bring in the projection and predicate pushdown? If not, then what is the point of using pandas with Parquet? Is any of this possible with the arrow package out there?
Even though I haven't used Dask, I am just wondering if this kind of situation is handled in Dask, as it performs things lazily.
I am sure this kind of situation is handled well in the Spark world, but I am wondering how these situations are handled in local scenarios with packages like pandas, arrow, dask, ibis, etc.
I am trying hard to understand whether we can utilize the full feature set of Parquet files when they are used with pandas.
TL;DR: Yes, but you may have to work harder than if you used something like Dask.
For instance, say I have a big Parquet file (partitioned on year)
This is pedantic but a single parquet file is not partitioned on anything. Parquet "datasets" (collections of files) are partitioned. For example:
my_dataset/year=2002/data.parquet
my_dataset/year=2003/data.parquet
Does it bring everything into memory the moment we call df = pd.read_parquet("file.parquet")?
Yes. But...you can do better:
df = pd.read_parquet('/tmp/new_dataset', filters=[[('year','=', 2002)]], columns=['year', 'state', 'gender', 'last_name'])
The filters keyword will pass the filter down to pyarrow, which will apply it in a pushdown fashion both to the partitioning (e.g. to know which directories need to be read) and to the row-group statistics.
The columns keyword will pass the column selection down to pyarrow, which will read only the specified columns from disk.
Is any of this possible with the arrow package out there?
Everything in pandas' read_parquet is handled behind the scenes by pyarrow (unless you change to some other engine). Traditionally, the group_by would then be handled directly by pandas (well, maybe NumPy), but pyarrow has some experimental compute APIs as well if you wanted to try doing everything in pyarrow.
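For example, here is a rough sketch of doing everything in pyarrow. I am assuming a recent pyarrow (Table.group_by was added in 7.0) and reusing the /tmp/new_dataset layout from the complete example below:
import pyarrow.dataset as ds

# build a dataset over the hive-partitioned directory
dataset = ds.dataset('/tmp/new_dataset', format='parquet', partitioning='hive')

# both the filter and the column selection are pushed down when materializing the table
table = dataset.to_table(
    filter=ds.field('year') == 2002,
    columns=['year', 'state', 'gender', 'last_name'])

# aggregate without ever converting to pandas
counts = table.group_by(['state', 'gender']).aggregate([('last_name', 'count')])
print(counts)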
Even though I haven't used Dask, I am just wondering if this kind of situation is handled in Dask, as it performs things lazily.
In my understanding (I don't have a ton of experience with dask), when you say...
df_2002 = df[df.year == 2002]
df_2002.groupby(["state", "gender"])["last_name"].count()
...in a Dask dataframe, then Dask will figure out that it can push the predicate and the column selection down, and it will do so when loading the data. So Dask takes care of figuring out which filters to apply and which columns to load, which saves you from having to figure it out yourself ahead of time.
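Untested on my side, but a rough sketch of what that would look like with Dask, assuming the same /tmp/new_dataset layout as the complete example below (you can also pass filters= and columns= to dd.read_parquet explicitly if you want to be sure the pushdown happens):
import dask.dataframe as dd

# read_parquet is lazy: it only builds a task graph, nothing is loaded yet
df = dd.read_parquet('/tmp/new_dataset')

df_2002 = df[df.year == 2002]
result = df_2002.groupby(['state', 'gender'])['last_name'].count()

# only when compute() is called does Dask load the data, applying any pushdown it worked out
print(result.compute())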
Complete example (you can use strace to verify that it is only loading one of the two parquet files and only part of that file):
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
import shutil
# remove any leftover dataset from a previous run (ignore_errors so the first run doesn't fail)
shutil.rmtree('/tmp/new_dataset', ignore_errors=True)
tab = pa.Table.from_pydict({
    "year":      ["2002", "2002", "2002", "2002", "2002", "2002", "2003", "2003", "2003", "2003", "2003", "2003"],
    "state":     ["HI", "HI", "HI", "HI", "CO", "CO", "HI", "HI", "CO", "CO", "CO", "CO"],
    "gender":    ["M", "F", None, "F", "M", "F", None, "F", "M", "F", "M", "F"],
    "last_name": ["Smi", "Will", "Stev", "Stan", "Smi", "Will", "Stev", "Stan", "Smi", "Will", "Stev", "Stan"],
    "bonus":     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
# writes /tmp/new_dataset/year=2002/... and /tmp/new_dataset/year=2003/...
ds.write_dataset(tab, '/tmp/new_dataset', format='parquet', partitioning=['year'], partitioning_flavor='hive')
# both the filter and the column selection are pushed down to pyarrow
df = pd.read_parquet('/tmp/new_dataset', filters=[[('year', '=', 2002)]], columns=['year', 'state', 'gender', 'last_name'])
df_2002 = df[df.year == 2002]  # already true after the pushdown filter; kept to mirror the question
print(df.groupby(["state", "gender"])["last_name"].count())
Disclaimer: You are asking about a number of technologies here. I work pretty closely with the Apache Arrow project and thus my answer may be biased in that direction.

Stream Analytics doesn't produce output to SQL table when reference data is used

I have been working with Azure Stream Analytics (ASA) lately and I am trying to insert an ASA stream directly into a SQL table using reference data. I based my development on this MS article: https://msdn.microsoft.com/en-us/azure/stream-analytics/reference/reference-data-join-azure-stream-analytics.
Overview of data flow - telemetry:
I have many devices of different types (Heat Pumps, Batteries, Water Pumps, AirCon...). Each of these devices has a different JSON schema for its telemetry data. I can distinguish the JSONs by an attribute in the message (e.g. "DeviceType":"HeatPump" or "DeviceType":"AirCon"...)
All of these devices send their telemetry to a single Event Hub
Behind the Event Hub there is a single Stream Analytics component where I redirect the streams to different outputs based on the DeviceType attribute. For example, I redirect telemetry from heat pumps with the query SELECT * INTO output-sql-table FROM input-event-hub WHERE DeviceType = 'HeatPump'
I would like to use some reference data to "enrich" the ASA stream with some IDKeys before I insert the stream into the SQL table.
What I've already done:
Successfully inserted the ASA stream directly into a SQL table using the ASA query SELECT * INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump', where [sql-table] has the same schema as the JSON message plus the standard columns (EventProcessedUtcTime, PartitionId, EventEnqueuedUtcTime)
Successfully inserted the ASA stream directly into a SQL table using the ASA query SELECT Column1, Column2, Column3... INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump' - basically the same query as above, only this time I used named columns in the select statement
Generated a JSON file of reference data and put it into BLOB storage
Created a new static reference data input (not using the {date} and {time} placeholders) in ASA pointing to the file in BLOB storage
Then joined the reference data to the data stream in the ASA query using the same statement with named columns
Result: no output rows in the SQL table
When debugging the problem I used the Test functionality in the ASA query editor:
I sampled data from the Event Hub (stream data).
I uploaded sample data from a file (reference data).
After sampling data from the Event Hub finished, I tested the query -> the output produced some rows -> so it is not a problem with the query.
Yet when I run the ASA job, no output rows are inserted into the SQL table.
Some other ideas I tried:
Used the TRY_CAST function to cast fields from the reference data to the appropriate data types before joining them with fields in the stream data
Used the TRY_CAST function to cast fields in the SELECT before inserting them into the SQL table
I really don't know what to do now. Any suggestions?
EDIT: added data stream JSON, reference data JSON, ASA query, ASA input configuration, BLOB storage configuration and ASA test output result
Data Stream JSON - single message
[
    {
        "Activation": 0,
        "AvailablePowerNegative": 6.0,
        "AvailablePowerPositive": 1.91,
        "DeviceID": 99999,
        "DeviceIsAvailable": true,
        "DeviceOn": true,
        "Entity": "HeatPumpTelemetry",
        "HeatPumpMode": 3,
        "Power": 1.91,
        "PowerCompressor": 1.91,
        "PowerElHeater": 0.0,
        "Source": "<omitted>",
        "StatusToPowerOff": 1,
        "StatusToPowerOn": 9,
        "Timestamp": "2018-08-29T13:34:26.0Z",
        "TimestampDevice": "2018-08-29T13:34:09.0Z"
    }
]
Reference data JSON - single message
[
    {
        "SourceID": 1,
        "Source": "<omitted>",
        "DeviceID": 10,
        "DeviceSourceCode": 99999,
        "DeviceName": "NULL",
        "DeviceType": "Heat Pump",
        "DeviceTypeID": 1
    }
]
ASA Query
WITH HeatPumpTelemetry AS
(
    SELECT
        *
    FROM
        [input-eh]
    WHERE
        source = '<omitted>'
        AND entity = 'HeatPumpTelemetry'
)
SELECT
    e.Activation,
    e.AvailablePowerNegative,
    e.AvailablePowerPositive,
    e.DeviceID,
    e.DeviceIsAvailable,
    e.DeviceOn,
    e.Entity,
    e.HeatPumpMode,
    e.Power,
    e.PowerCompressor,
    e.PowerElHeater,
    e.Source,
    e.StatusToPowerOff,
    e.StatusToPowerOn,
    e.Timestamp,
    e.TimestampDevice,
    e.EventProcessedUtcTime,
    e.PartitionId,
    e.EventEnqueuedUtcTime
INTO
    [out-SQL-HeatPumpTelemetry]
FROM
    HeatPumpTelemetry e
    LEFT JOIN [input-json-devices] d ON
        TRY_CAST(d.DeviceSourceCode AS BIGINT) = TRY_CAST(e.DeviceID AS BIGINT)
ASA Reference Data Input configuration (screenshot omitted)
BLOB storage directory tree (screenshot omitted)
ASA test query output (screenshot omitted)
matejp, I couldn't reproduce your issue; you could refer to my steps:
reference data in blob storage:
{
    "a": "aaa",
    "reference": "www.bing.com"
}
stream data in blob storage:
[
    {
        "id": "1",
        "name": "DeIdentified 1",
        "DeviceType": "aaa"
    },
    {
        "id": "2",
        "name": "DeIdentified 2",
        "DeviceType": "No"
    }
]
query statement:
SELECT
    inputSteam.*, inputRefer.*
INTO
    sqloutput
FROM
    inputSteam
    JOIN inputRefer ON inputSteam.DeviceType = inputRefer.a
Output:
Hope it helps you. Any concerns, let me know.
I think I found the error. Over the past days I tested nearly every possible combination when configuring inputs in Azure Stream Analytics.
I started with this example as a baseline: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-build-an-iot-solution-using-stream-analytics
I tried the solution without any changes, to be sure that the example with a reference data input works -> it worked
Then I changed the ASA output from CosmosDB to a SQL table without changing anything else -> it worked
Then I changed my initial ASA job to be as close as possible to the ASA job in the example (writing into a SQL table) -> it worked
Then I started playing with the BLOB directory names -> here I found the error.
I think the problem I encountered is caused by using the character "-" in the folder name.
In my case I had created a folder named "reference-data" and uploaded a file named "devices.json" (folder structure "/reference-data/devices.json") -> ASA output to the SQL table didn't work
As soon as I changed the folder name to "referencedata" (folder structure "/referencedata/devices.json") -> ASA output to the SQL table worked.
I tried switching the reference data input 3 times between a folder name containing "-" and one without it => every time the ASA output to SQL Server stopped working when "-" was in the folder name.
To recap:
I recommend not using "-" in BLOB folder names for static reference data inputs in ASA jobs.

How to extract this json into a table?

I have a SQL column filled with JSON documents, one per row:
[
    {
        "ID": "TOT",
        "type": "ABS",
        "value": "32.0"
    },
    {
        "ID": "T1",
        "type": "ABS",
        "value": "9.0"
    },
    {
        "ID": "T2",
        "type": "ABS",
        "value": "8.0"
    },
    {
        "ID": "T3",
        "type": "ABS",
        "value": "15.0"
    }
]
How is it possible to transform it into tabular form? I tried with Redshift's json_extract_path_text and JSON_EXTRACT_ARRAY_ELEMENT_TEXT functions, and I also tried json_each and json_each_text (on Postgres), but didn't get what I expected... any suggestions?
The desired result should appear like this:
T1 T2 T3 TOT
9.0 8.0 15.0 32.0
I assume you printed 4 rows. In PostgreSQL,
SELECT this_column->'ID'
FROM that_table;
will return a column of JSON values. Use ->> if you want a text column. More info here: https://www.postgresql.org/docs/current/static/functions-json.html
In case you are using an old PostgreSQL (before 9.3), this gets harder : )
Your best option is to use COPY from JSON Format. This will load the JSON directly into a normal table format. You then query it as normal data.
However, I suspect that you will need to slightly modify the format of the file by removing the outer [...] square brackets and also the commas between records, eg:
{
    "ID": "TOT",
    "type": "ABS",
    "value": "32.0"
}
{
    "ID": "T1",
    "type": "ABS",
    "value": "9.0"
}
If, however, your data is already loaded and you cannot re-load the data, you could either extract the data into a new table, or add additional columns to the existing table and use an UPDATE command to extract each field into a new column.
Or, very worst case, you can use one of the JSON Functions to access the information in a JSON field, but this is very inefficient for large requests (eg in a WHERE clause).

Loading data into Google Big Query

my question is the following:
Let's say I have a json file that I want to load into big query.
It contains these two lines of data.
{"value":"123"}
{"value": 123 }
I have defined the following schema for my data.
[
    { "name": "value", "type": "String" }
]
When I try to load the json file into big query it will fail with the following error:
Field:value: Could not convert value to string
Is there a way to get around this issue other than transforming the data in the json file?
Thanks!
You can set the maxBadRecords property on the load job to skip a number of errors but still load the data.
Following your example, you could still load the data if you set it as:
"configuration": {
"load": {
"maxBadRecords": 1,
}
}
This is a way to get around the issue while still loading your JSON data into the table, just that the erroneous rows will be skipped. If loading a list of files, you could set it to be a function of the number of files that you are loading (e.g. maxBadRecords = 20 * fileCount)
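If you load the data with the Python client library instead of the raw API, the same setting is exposed there as max_bad_records. A rough sketch (the table name and file name are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# maxBadRecords from the REST API maps to max_bad_records here
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[bigquery.SchemaField("value", "STRING")],
    max_bad_records=1,
)

# "my_dataset.my_table" and "data.json" are placeholders
with open("data.json", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, "my_dataset.my_table", job_config=job_config)

load_job.result()  # wait for the load job to finish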

MongoVUE Bulk Insert

I'm trying to insert multiple documents using MongoVUE by passing an array of documents in the Insert Document window. For example:
[ {"name": "Kiran", age: 20}, {"name": "John", "age": 31} ]
However, I kept getting the following error:
ReadStartDocument can only be called when CurrentBsonType is Document, not when CurrentBsonType is Array
Does anyone know how to do bulk insert in MongoVUE?
Thanks!
In case anyone else stumbles on this question, the answer is that the "Import Multiple Documents" functionality in MongoVUE doesn't accept an array of objects like you would expect it to. Instead, it expects the file to be formatted as a simple series of documents.
For the above example, you could create a simple file called "import.json" and format the data like this and it will import fine:
{"name": "Kiran", age: 20}
{"name": "John", "age": 31}