Streaming data to BigQuery using App Engine - google-bigquery

I'm collecting data (derived from cookies set on several websites) in BigQuery using a streaming approach, with Python code running on App Engine.
The function I use to save the data is the following:
def stream_data(data):
    PROJECT_ID = "project_id"
    DATASET_ID = "dataset_id"
    _SCOPE = 'https://www.googleapis.com/auth/bigquery'
    credentials = appengine.AppAssertionCredentials(scope=_SCOPE)
    http = credentials.authorize(httplib2.Http())
    table = "table_name"
    body = {
        "ignoreUnknownValues": True,
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {
                "json": data,
            },
        ]
    }
    bigquery = discovery.build('bigquery', 'v2', http=http)
    bigquery.tabledata().insertAll(projectId=PROJECT_ID, datasetId=DATASET_ID, tableId=table, body=body).execute()
I have deployed the solution on two different App Engine instances and I get different results. My question is: how is that possible?
In addition, when I compare the results with Google Analytics metrics, I notice that not all of the data is stored in BigQuery. Do you have any idea what causes this?

Your code has no error handling around the insertAll operation. If BigQuery can't write the data, the failure goes unnoticed.
Replace your last line with this code:
bQreturn = bigquery.tabledata().insertAll(projectId=PROJECT_ID, datasetId=DATASET_ID, tableId=table, body=body).execute()
logging.debug(bQreturn)
This way, you can easily find any error from the insertAll operation in the Google Cloud Platform logs.
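Going one step further than logging the whole response: insertAll reports rejected rows in an insertErrors list inside the response body, so the returned dict can be inspected directly. A minimal sketch follows; the helper names find_insert_errors and log_insert_result are made up for illustration, and the sample response is hand-written rather than real output.

```python
import logging


def find_insert_errors(response):
    """Return (row_index, errors) pairs from an insertAll response dict."""
    failures = []
    for entry in response.get("insertErrors", []):
        failures.append((entry.get("index"), entry.get("errors", [])))
    return failures


def log_insert_result(response):
    """Log every rejected row; return True when all rows were accepted."""
    failures = find_insert_errors(response)
    for index, errors in failures:
        logging.error("BigQuery rejected row %s: %s", index, errors)
    return not failures


# Hand-written example of a response with one rejected row (illustrative only).
sample_response = {
    "kind": "bigquery#tableDataInsertAllResponse",
    "insertErrors": [
        {"index": 0, "errors": [{"reason": "invalid", "message": "no such field"}]}
    ],
}
all_rows_ok = log_insert_result(sample_response)
```

In stream_data you would pass bQreturn to log_insert_result instead of only calling logging.debug on it.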

When using the insertAll() method, keep this in mind:
Data is streamed temporarily in the streaming buffer which has
different availability characteristics than managed storage. Certain
operations in BigQuery do not interact with the streaming buffer, such
as table copy jobs and API methods like tabledata.list {1}
If you are using the table preview, streaming buffered entries may not be visible.
Doing SELECT COUNT(*) from your table should return your total number of entries.
{1}: https://cloud.google.com/bigquery/troubleshooting-errors#missingunavailable-data

Related

How do I ingest tide gauge data from the NOAA HTTP API into Thingsboard Professional?

NOAA provides tidal and weather data through their own http API, and I would like to be able to use their API to get data into ThingsBoard (Professional) every six minutes to overlay with my device data (their data are updated every 6 minutes). Can someone walk me through the details of using the correct Integrations or Rule chains to get the time series data added to the database? It would also be nice to only use the metadata once. Below you can see how to get the most recent tide gauge level (water level) using their API.
For example, to see the latest tide gauge water level for a tide gauge (in this case, tide gauge 8638610), the API allows for getting the most recent water level information -- https://api.tidesandcurrents.noaa.gov/api/prod/datagetter?date=latest&station=8638610&product=water_level&datum=navd&units=metric&time_zone=lst_ldt&application=web_services&format=json
That call produces the following JSON: {"metadata":{"id":"8638610","name":"Sewells Point","lat":"36.9467","lon":"-76.3300"},"data":[{"t":"2022-02-08 22:42", "v":"-0.134", "s":"0.003", "f":"1,0,0,0", "q":"p"}]}
The Data Converter was fairly easy to construct (except maybe the noaa_data.data[0, 0] used in the code below):
//function Decoder(payload,metadata)
var noaa_data = decodeToJson(payload);
var deviceName = noaa_data.metadata.id;
var dataType = 'water_level';
var latitude = noaa_data.metadata.lat;
var longitude = noaa_data.metadata.lon;
// Note: [0, 0] uses the comma operator, so this is the same as noaa_data.data[0]
var waterLevelData = noaa_data.data[0, 0];
//functions
function decodeToString(payload) {
    return String.fromCharCode.apply(String, payload);
}
var result = {
    deviceName: deviceName,
    dataType: dataType,
    time: waterLevelData.t,
    waterLevel: waterLevelData.v,
    waterLevelStDev: waterLevelData.s,
    latitude: latitude,
    longitude: longitude
};
function decodeToJson(payload) {
    var str = decodeToString(payload);
    var data = JSON.parse(str);
    return data;
}
return result;
which produces the following output:
{
    "deviceName": "8638610",
    "dataType": "water_level",
    "time": "2022-02-08 22:42",
    "waterLevel": "-0.134",
    "waterLevelStDev": "0.003",
    "latitude": "36.9467",
    "longitude": "-76.3300"
}
I am not sure what process to use to get the data into ThingsBoard to be displayed as a device alongside my other device data.
Thank you for your help.
If you have a specific (and small) number of stations to grab, then you can do the following:
Create the devices in ThingsBoard manually
Go into rule chains and create a water stations rule chain
For each water station, place a 'Generator' node, selecting the originator as required.
Route these into an external "REST API call" node.
Route the result of the request into a blue script node and put your decoder in there
Route the result to telemetry
Example rule chain
More complex solution but more scalable:
Use a single generator node
Route the message into a blue script. This will contain a list of station IDs that you want to pull info for. By setting the output of the script to the following, you can make it send out multiple messages in sequence:
return [{msg: {}, metadata: {}, msgType: "..."}, ...etc...]
Route the blue script into the rest api call and get the station data
Do some post processing with another blue script node if you need to. Don't decode the data here though.
Route all this into another rest api node and POST the data back to your HTTP integration endpoint (if you don't have one you will need to create it. Fairly simple)
Connect your data converter to this integration.
Finally, modify your converter output so that it matches the format the integration accepts:
{
    "deviceName": "8638610",
    "deviceType": "water-station",
    "telemetry": {
        "dataType": "water_level",
        "time": "2022-02-08 22:42",
        "waterLevel": "-0.134",
        "waterLevelStDev": "0.003",
        "latitude": "36.9467",
        "longitude": "-76.3300"
    }
}
Rough example
Above is how I would do it if I didn't want to use any external services. If you're AWS savvy, I'd say set up a cron job to trigger a Lambda function every 6 minutes and post into your platform. Either will work.
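For the external-service route, the part that runs every 6 minutes could look roughly like the Python sketch below. It only builds the NOAA request URL from the question and reshapes the response into the deviceName/deviceType/telemetry structure shown above; the function names are invented for illustration, and the actual HTTP calls (fetching from NOAA, posting to your integration endpoint) are left out because the endpoint URL depends on your setup.

```python
from urllib.parse import urlencode

NOAA_BASE = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"


def build_noaa_url(station_id):
    """Build the 'latest water level' request URL used in the question."""
    params = {
        "date": "latest",
        "station": station_id,
        "product": "water_level",
        "datum": "navd",
        "units": "metric",
        "time_zone": "lst_ldt",
        "application": "web_services",
        "format": "json",
    }
    return f"{NOAA_BASE}?{urlencode(params)}"


def to_converter_payload(noaa):
    """Reshape a NOAA response dict into the converter output shown above."""
    meta = noaa["metadata"]
    reading = noaa["data"][0]  # 'latest' returns a single reading
    return {
        "deviceName": meta["id"],
        "deviceType": "water-station",
        "telemetry": {
            "dataType": "water_level",
            "time": reading["t"],
            "waterLevel": reading["v"],
            "waterLevelStDev": reading["s"],
            "latitude": meta["lat"],
            "longitude": meta["lon"],
        },
    }


# The sample response from the question.
sample = {
    "metadata": {"id": "8638610", "name": "Sewells Point",
                 "lat": "36.9467", "lon": "-76.3300"},
    "data": [{"t": "2022-02-08 22:42", "v": "-0.134", "s": "0.003",
              "f": "1,0,0,0", "q": "p"}],
}
payload = to_converter_payload(sample)
```

Keeping the reshaping in a pure function makes it easy to check against the sample JSON before wiring it to a scheduler.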

Dataflow export to Bigquery: insertAll error, invalid table reference

I'm trying to create and export a stream of synthetic data using Dataflow, Pub/Sub, and BigQuery. I followed the synthetic data generation instructions using the following schema:
{
    "id": "{{uuid()}}",
    "test_value": {{integer(1,50)}}
}
The schema is in a file gs://my-folder/my-schema.json. The stream seems to be running correctly - I can export from the corresponding Pub/Sub topic to a GCS bucket using the "Export to Cloud Storage" template. When I try to use the "Export to BigQuery" template, I keep getting this error:
Request failed with code 400, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes, HTTP framework says request can be retried, (caller responsible for retrying): https://bigquery.googleapis.com/bigquery/v2/projects/<my-project>/datasets/<my-dataset>/tables/<my-table>/insertAll.
Before starting the export job, I created an empty table <my-project>:<my-dataset>.<my-table> with fields that match the JSON schema above:
id STRING NULLABLE
test_value INTEGER NULLABLE
I have outputTableSpec set to <my-project>:<my-dataset>.<my-table>.
If the BQ table name is given in the form project:dataset.table, then there cannot be any hyphens in the table string. I was using my-project.test.stream-data-102720 when I got the code 400 error. Creating a new table my-project.test.stream_data_102720 and re-running the job with the new name fixed the problem.
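A quick pre-flight check can catch this before launching the job. The sketch below assumes, based only on the behaviour described above, that the table part of the spec should contain nothing but letters, digits, and underscores; the helper names are hypothetical.

```python
import re

# Assumption: only letters, digits and underscores are safe in the table part.
_SAFE_TABLE = re.compile(r"[A-Za-z0-9_]+")


def table_part(spec):
    """Return the table name: everything after the last dot in the spec."""
    return spec.rsplit(".", 1)[-1]


def is_safe_table_spec(spec):
    """True when the table part contains no hyphens or other odd characters."""
    return _SAFE_TABLE.fullmatch(table_part(spec)) is not None


def suggest_table_spec(spec):
    """Replace hyphens in the table part with underscores."""
    prefix, dot, table = spec.rpartition(".")
    return f"{prefix}{dot}{table.replace('-', '_')}"


bad = "my-project.test.stream-data-102720"
fixed = suggest_table_spec(bad)
```

Hyphens in the project part are left untouched, since the working name in this answer still uses my-project.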

Data Factory copy pipeline from API

We use an Azure Data Factory copy pipeline to transfer data from REST APIs to an Azure SQL Database, and it is doing some strange things. Because we loop over a set of APIs that need to be transferred, the mapping in the copy activity is left empty.
But for one API the automatic mapping goes wrong. The destination table is created with all the needed columns and the correct data types based on the received metadata. When we run the pipeline for that specific API, the following message is shown.
{ "errorCode": "2200", "message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: Ticks must be between DateTime.MinValue.Ticks and DateTime.MaxValue.Ticks.\r\nParameter name: ticks,Source=Microsoft.DataTransfer.ClientLibrary,'", "failureType": "UserError", "target": "Copy data1", "details": [] }
As a test, we did the mapping for that API manually by using the "Import Schema" option on the Mapping page. There we see that all the fields are correctly mapped. We executed the pipeline again using that mapping and everything worked fine.
But of course, we don't want to use a manual mapping, because the copy activity is also used in a loop for different APIs.

Correct JSON to POST to PubSub - Dataflow - BigQuery? Correct dataschema?

I'm taking my first experimental steps with a Google-provided Dataflow template (Cloud Pub/Sub to BigQuery).
As a milestone towards my final goal (having physical gadgets report a data stream to Google Cloud Pub/Sub), I would like to achieve something like this:
Postman (make an authenticated POST request with a JSON message to a Google Cloud Platform, GCP, endpoint) --> GCP Pub/Sub --> GCP Dataflow --> GCP BigQuery.
Right now I am following the tutorial found in Executing Templates, https://cloud.google.com/dataflow/docs/templates/executing-templates, "Example 2: Custom template, streaming job". This section states:
...This example projects.templates.launch request creates a streaming job
from a template that reads from a Pub/Sub topic and writes to a
BigQuery table. The BigQuery table must already exist with the
appropriate schema. If successful, the response body contains an
instance of LaunchTemplateResponse. ...
and furthermore shows how to make the POST request:
https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://[YOUR_BUCKET_NAME]/templates/TemplateName
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "topic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]",
        "table": "[YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE_NAME]"
    },
    "environment": {
        "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
        "zone": "us-central1-f"
    }
}
There are two things that confuse me. For the sake of a simple example, let's say that I have multiple vehicles that should continuously report their current status. I have already created my MQTT topic: VEHICLE_STATUS. Each of my vehicles should be able to report its:
Position [String]
Speed [Float]
Time [String]
VehicleID [Integer]
=======
I'm aware of the prototype for a PubsubMessage:
{
    "data": string,
    "attributes": {
        string: string,
        ...
    },
    "messageId": string,
    "publishTime": string,
}
My questions:
How should my BigQuery table schema look (which columns do I need to create)?
How should the entire corresponding JSON message look? What should my vehicle report to the endpoint each time?
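One possible reading, offered as a sketch rather than a confirmed answer: the Pub/Sub to BigQuery template reads JSON-formatted messages, so the table would have one column per reported field (Position, Speed, Time, VehicleID) and each vehicle would send a JSON object with those same keys. When publishing through the Pub/Sub REST API, that JSON goes base64-encoded into the data field of a message inside a messages list. The function name and the example values below are invented for illustration.

```python
import base64
import json


def build_publish_body(status):
    """Wrap one vehicle status dict in a Pub/Sub REST publish request body."""
    payload = json.dumps(status).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    return {"messages": [{"data": encoded}]}


# Illustrative values only; keys match the fields listed in the question.
example_status = {
    "Position": "59.3293,18.0686",
    "Speed": 42.5,
    "Time": "2019-01-01T12:00:00Z",
    "VehicleID": 7,
}
body = build_publish_body(example_status)

# Decoding the data field again gives back the original JSON object.
decoded = json.loads(base64.b64decode(body["messages"][0]["data"]))
```

The messageId and publishTime fields from the prototype are assigned by Pub/Sub on publish, so the sender does not set them.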

PySpark: how to stream data from a given API URL

I was given an API URL and a method getUserPosts() which returns the data needed for my data processing function. I am able to get the data by using Client from suds.client as follows:
from suds.client import Client
from suds.xsd.doctor import ImportDoctor, Import
url = 'url'
imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('filter')
d = ImportDoctor(imp)
client = Client(url, doctor=d)
tempResult = client.service.getUserPosts(user_ids = '',date_from='2016-07-01 03:19:57', date_to='2016-08-01 03:19:57', limit=100, offset=0)
Now, each tempResult will contain 100 records. I want to stream the data from the given API URL into an RDD for parallelized processing. However, after reading the pyspark.streaming documentation I can't find a streaming method for a custom data source. Could anyone give me an idea of how to do so?
Thank you.
After digging for a while, I found out how to solve the problem. I used Kafka Streaming. Basically, you need to create a producer that reads from the given API and specify a topic and port for communication. Then you create a consumer that listens to that specific topic and port to start streaming the data.
Note that the producer and consumer must run as different threads in order to achieve real-time streaming.
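The producer/consumer split on separate threads can be sketched with the standard library alone. This is not Kafka: a queue.Queue stands in for the topic, and fetch_page is a hypothetical stand-in for client.service.getUserPosts(...) that returns fake records, so the example only illustrates the threading pattern and the offset-based paging.

```python
import queue
import threading

PAGE_SIZE = 100
_SENTINEL = None  # pushed by the producer to signal the end of the stream


def fetch_page(offset, limit=PAGE_SIZE, total=250):
    """Hypothetical stand-in for getUserPosts(limit=..., offset=...)."""
    return [{"post_id": i} for i in range(offset, min(offset + limit, total))]


def producer(topic):
    """Page through the API and push each record onto the topic."""
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        for record in page:
            topic.put(record)
        offset += len(page)
    topic.put(_SENTINEL)


def consumer(topic, sink):
    """Read records from the topic until the sentinel arrives."""
    while True:
        record = topic.get()
        if record is _SENTINEL:
            break
        sink.append(record)


def run_pipeline():
    topic = queue.Queue(maxsize=1000)  # bounded, so the producer cannot run far ahead
    received = []
    threads = [
        threading.Thread(target=producer, args=(topic,)),
        threading.Thread(target=consumer, args=(topic, received)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


records = run_pipeline()
```

With Kafka, the queue would be replaced by a topic on a broker, and the consumer side would be the Spark streaming job.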