I need to place my dataset in GCS in the same zone so that I avoid charges when running training on TPUs and reading large datasets from GCS.
Does anyone know how to find out the current Google Colab instance's zone?
The data should be stored in a bucket that is in the same zone as the Cloud TPU (or the same region if that specific zone isn't available); to the best of my knowledge, Colab itself doesn't play a role here.
Running (within Colab)
!curl ipinfo.io
you get something similar to:
{
"ip": "3X.20X.4X.1XX",
"hostname": "13X.4X.20X.3X.bc.googleusercontent.com",
"city": "Groningen",
"region": "Groningen",
"country": "NL",
"loc": "53.21XX,6.56XX",
"org": "AS396XXX Google LLC",
"postal": "9711",
"timezone": "Europe/Amsterdam",
"readme": "https://ipinfo.io/missingauth"
}
This tells you where your Colab instance is running.
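If you'd rather do this in Python than shell out to curl, here is a minimal sketch using the requests library (preinstalled in Colab):
import requests

# Ask ipinfo.io where this Colab VM's public IP is located
info = requests.get("https://ipinfo.io/json").json()
print(info["region"], info["city"], info["timezone"])
You can then pick the GCP region closest to the reported location when creating the bucket (the Netherlands output above corresponds to europe-west4, for example).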
Question: what is the path forward for using ADLA (U-SQL) with ADLS (Gen2)?
I have been running Azure Data Lake Analytics (U-SQL) jobs via Azure Data Factory (ADF v2) with Azure Data Lake Store Gen1 for quite a while now in East US 2.
I was planning to deploy another instance to cater to Canadian clients and wanted to set up Azure Data Lake Store Gen1 there.
What I tried:
I was not able to create an Azure Data Lake Storage Gen1 account in Central Canada (or any Canadian region, for that matter).
I tried to move to Azure Data Lake Storage Gen2, but then ran into an issue where the Azure Data Factory U-SQL activity could not be linked with a Gen2 storage linked service to pick up the U-SQL script.
I stumbled upon multiple links about this topic:
https://feedback.azure.com/forums/327234-data-lake/suggestions/36445702-add-support-for-adls-gen2-to-adla
https://social.msdn.microsoft.com/Forums/en-US/5ce97eef-8940-4591-a19c-934f71825e7d/connect-data-lake-analytics-to-adls-gen-2
which essentially say that U-SQL / ADLA won't be supporting ADLS Gen2.
I am a bit confused, since there is no official documentation on ADLA's direction.
Update:
This is the structure of my U-SQL activity; it works and processes successfully. (You can try creating a new JSON definition of the U-SQL activity like this to replace yours.)
{
"name": "pipeline4",
"properties": {
"activities": [
{
"name": "U-SQL1",
"type": "DataLakeAnalyticsU-SQL",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"scriptPath": "test1/u-sql.txt",
"scriptLinkedService": {
"referenceName": "LinkTo0730",
"type": "LinkedServiceReference"
}
},
"linkedServiceName": {
"referenceName": "AzureDataLakeAnalytics1",
"type": "LinkedServiceReference"
}
}
],
"annotations": []
}
}
Original Answer:
I was not able to create an Azure Datalake Storage Gen 1 account in
Central Canada (or any Canadian region for that matter)
On my side, I also cannot create Data Lake Gen1 in the Central Canada region; this is a limit of my subscription. But have a check of the resource manager on your side, maybe you can. (The Azure Data Lake Gen1 resource provider is 'Microsoft.DataLakeStore'.)
Resource Manager is supported in all regions, but the resources you deploy might not be supported in all regions. In addition, there may be limitations on your subscription that prevent you from using some regions that support the resource. The resource explorer displays valid locations for the resource type.
Please have a look at this document:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-providers-and-types
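If you want to check programmatically which regions your subscription can deploy Data Lake Store Gen1 to, here is a minimal sketch using the Azure SDK for Python; it assumes the azure-identity and azure-mgmt-resource packages are installed and uses a placeholder subscription ID:
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# List the regions in which the 'accounts' resource type of the
# Microsoft.DataLakeStore provider can be deployed for this subscription
client = ResourceManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
provider = client.providers.get("Microsoft.DataLakeStore")
for rt in provider.resource_types:
    if rt.resource_type.lower() == "accounts":
        print(rt.locations)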
I tried to move to Azure Datalake Storage Gen2 but then ran into an
issue where Azure Data Factory - U-SQL activity could not be linked
with Gen2 Storage linked service to pick up U-SQL script
On my side, it seems the U-SQL script is being read from Gen2. Did you get an error?
I am new to Stream Analytics and I need help to achieve a specific task.
I have telemetry data coming from IoT Hub in this format.
Basically, I will be getting machine telemetry data and the stage of the operations on that machine streamed to IoT Hub.
The stages are indicated with a tag, e.g. "stageid":"stage1". I need to calculate the time taken for each stage using Stream Analytics, based on the timestamp and the stage tag.
Packet example:
[{
"Payload": {
"devid": "01",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage1",
"timestamp": "2020-01-24T09:22:00.3270000Z"
}
},
{
"Payload": {
"devid": "02",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage1",
"timestamp": "2020-01-24T09:22:00.3270000Z"
}
}]
[{
"Payload": {
"devid": "01",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage2",
"timestamp": "2020-01-24T09:26:00.3270000Z"
}
},
{
"Payload": {
"devid": "02",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage2",
"timestamp": "2020-01-26T09:24:00.3270000Z"
}
}]
Please help me: can we achieve this with a query, and if so, what would the query be? Or what is the best alternative approach?
Thanks,
To my knowledge, your needs can't be implemented with ASA built-in features. ASA is a real-time data collection and analytics service; in other words, data needs to be processed in real time. The current event can't wait for the next dataset to do some calculation or merging. Even if you could use window functions and GROUP BY, I believe the frequency of messages pushed by the device is also variable.
As a workaround, my idea is to use an IoT Hub Azure Function trigger. Inside the trigger, you could parse the message and save the key columns (stageid, timestamp, devid) into some storage, maybe Azure Table Storage. Before every insert, grab the latest row for the current device and calculate the time taken against the current message, so that you can write that duration to wherever you want to store it. Finally, update the latest row for each device; see the sketch below.
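A minimal sketch of that workaround as a Python Azure Function with an IoT Hub (Event Hub-compatible) trigger, assuming an Azure Table Storage table named stageTimes, a connection string in a TABLES_CONN app setting, and the packet format shown in the question; the function.json binding and the final output destination are left out:
import json
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient


def parse_ts(s: str) -> datetime:
    # Device timestamps carry 7 fractional digits and a trailing 'Z';
    # trim to microsecond precision so strptime can parse them.
    return datetime.strptime(s[:26], "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)


def main(event: func.EventHubEvent):
    body = json.loads(event.get_body().decode("utf-8"))
    # Each packet is a list of {"Payload": {...}} items (see the example above)
    items = body if isinstance(body, list) else [body]

    table = TableClient.from_connection_string(
        os.environ["TABLES_CONN"], table_name="stageTimes"
    )

    for item in items:
        payload = item["Payload"]
        devid, stageid = payload["devid"], payload["stageid"]
        ts = parse_ts(payload["timestamp"])

        try:
            # Grab the latest row for this device and compute the time taken
            prev = table.get_entity(partition_key=devid, row_key="latest")
            elapsed = (ts - parse_ts(prev["timestamp"])).total_seconds()
            print(f"{devid}: {prev['stageid']} -> {stageid} took {elapsed} s")
            # ...write (devid, previous stage, elapsed) to your chosen output here...
        except ResourceNotFoundError:
            pass  # first message seen for this device

        # Update the latest row for this device so the next message can diff against it
        table.upsert_entity({
            "PartitionKey": devid,
            "RowKey": "latest",
            "stageid": stageid,
            "timestamp": payload["timestamp"],
        })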
We have events coming into Kafka, and using Kafka Connect we are syncing these events to AWS S3.
The data is visible in S3 in the following directory structure:
bucket_name/sub_folder/
Partition=0/events.json
Partition=1/events.json
Partition=2/events.json
Is there a way to store it in the following directory structure instead:
Bucket_name/sub_folder/date=today_date/events.json or Partition=0..2/date=today/events.json
The motivation is to store each day's events in that day's directory. I searched the web but could not find a way.
Thanks in advance.
You can use the TimeBasedPartitioner, which partitions data according to ingestion time (or, as configured below, a timestamp field from the record itself).
E.g. for hourly partitioning:
[…]
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "my_record_field_with_timestamp_in",
[…]
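With that hourly path.format, objects end up under keys roughly like
bucket_name/topics/<topic>/year=2020/month=01/day=24/hour=09/<topic>+0+0000000000.json
(topics is the connector's default topics.dir prefix, and files are named <topic>+<kafkaPartition>+<startOffset>). For the daily date=... layout you asked about, a daily variant of the same settings should work, e.g. "path.format": "'date'=YYYY-MM-dd" with "partition.duration.ms": "86400000".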
In all Amadeus Self-Service APIs, only some airport codes work. For example, "FRA" works, but "TXL" doesn't. Is this because the API is in beta and I only use the sandbox version?
Example:
https://test.api.amadeus.com/v1/shopping/flight-destinations?origin=FRA&oneWay=false&nonStop=false
WORKS
https://test.api.amadeus.com/v1/shopping/flight-destinations?origin=TXL&oneWay=false&nonStop=false
{
"errors": [
{
"status": 500,
"code": 141,
"title": "SYSTEM ERROR HAS OCCURRED",
"detail": "DATA DOMAIN NOT FOUND FOR REQUEST"
}
]
}
The APIs available in the test environment have a limited set of data (cached or fake data).
In the test environment, this API doesn't have data for TXL as origin; for Germany you have FRA and MUC.
So far our data set covers mostly the US and some big cities around the world. We will publish the list of available data on our portal soon.
You can find the list of available data in the test environment on our GitHub page.
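If you want to check programmatically which origins are covered in the test environment, here is a minimal Python sketch using the requests library; the API key and secret are placeholders for your own test credentials:
import requests

# Exchange the test API key/secret for an access token (standard Amadeus OAuth2 flow)
token = requests.post(
    "https://test.api.amadeus.com/v1/security/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<your-api-key>",         # placeholder
        "client_secret": "<your-api-secret>",  # placeholder
    },
).json()["access_token"]

# Probe an origin: a 200 means the test data set covers it,
# while a 500 like the one above means it doesn't
resp = requests.get(
    "https://test.api.amadeus.com/v1/shopping/flight-destinations",
    params={"origin": "TXL", "oneWay": "false", "nonStop": "false"},
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.status_code, resp.json())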
I have a URL that produces JSON:
{
"status": "success",
"totalRecords": 55,
"records": [
{
"timestamp": 1393418044341,
"load": 40,
"deviceId": 285
},
{
"timestamp": 1393418104337,
"load": 42,
"deviceId": 285
},
{
"timestamp": 1393418164328,
"load": 24.5,
"deviceId": 285
},
{
"timestamp": 1393418224322,
"load": 42.5,
"deviceId": 285
},
It goes on and on, producing data every 30 seconds or so.
I have used Pentaho Data Integration to parse the data and extract each of the fields into individual groups: timestamp, load and deviceId.
When I saved this, it produced a .ktr file.
From this, I have used Report Designer to load the .ktr file and make charts with the data, and then I uploaded the charts to the BI Server.
BUT
Can I just take the data, feed it into the BI Server and produce charts, bypassing the report-designer?
Yes, you can do this, and using Report Designer would definitely be the wrong way.
However, you've inadvertently made the right choice in building the first bit in PDI! That's a good move.
The next step is to install CTools, add your .ktr to a CDA data source (within CDE), then use CDE to define your charts and, finally, a refresh interval on the dashboard.
There are lots of good CTools tutorials around if you haven't used it yet; it is also easily installed from the Marketplace or via ctools-installer.sh.