I am trying to call an AzureML UDF from Stream Analytics query and that UDF expects an array of 5 rows and 2 columns. The input data is streamed from an IoT hub and we have two fields in the incoming messages: temperature & humidity.
This would be the 'passthrough query' :
SELECT GetMetadataPropertyValue([room-telemetry], 'IoTHub.ConnectionDeviceId') AS RoomId,
Temperature, Humidity
INTO
[maintenance-alerts]
FROM
[room-telemetry]
I have an AzureML UDF (successfully created) that should be called with the last 5 records per RoomId and that will return one value from the ML Model. Obviously, there are multiple rooms in my stream, so I need to find a way to get some kind of windowing of 5 records Grouped per RoomId. I don't seem to find a way to call the UDF with the right arrays selected from the input stream. I know I can create a Javascript UDF that would return an array from the specific fields, but that would be record/by record, where here I would need this with multiple records that are grouped by the RoomId.
Someone has any insights?
Best regards
After the good suggestion of #jean-sébastien and an answer to an isolated question for the array-parsing, I finally was able to stitch everything together in a solution that builds. (still have to get it to run at runtime, though).
So, the solution exists in using CollectTop to aggregate the latest rows of the entity you want to group by, including the specification of a Time Window.
And the next step was to create the javascript UDF to take that data structure and parse it into a multi-dimensional array.
This is the query I have right now:
-- Taking relevant fields from the input stream
WITH RelevantTelemetry AS
(
SELECT engineid, tmp, hum, eventtime
FROM [engine-telemetry]
WHERE engineid IS NOT NULL
),
-- Grouping by engineid in TimeWindows
TimeWindows AS
(
SELECT engineid,
CollectTop(2) OVER (ORDER BY eventtime DESC) as TimeWindow
FROM
[RelevantTelemetry]
WHERE engineid IS NOT NULL
GROUP BY TumblingWindow(hour, 24), engineid
)
--Output timewindows for verification purposes
SELECT engineid, Udf.Predict(Udf.getTimeWindows(TimeWindow)) as Prediction
INTO debug
FROM TimeWindows
And this is the Javascript UDF:
function getTimeWindows(input){
var output = [];
for(var x in input){
var array = [];
array.push(input[x].value.tmp);
array.push(input[x].value.hum);
output.push(array);
}
return output;
}
I select all from a table and create a dataframe (df) out of it using Pyspark. Which is partitioned as:
partitionBy('date', 't', 's', 'p')
now I want to get number of partitions through using
df.rdd.getNumPartitions()
but it returns a much larger number (15642 partitions) that expected (18 partitions):
show partitions command in hive:
date=2019-10-02/t=u/s=u/p=s
date=2019-10-03/t=u/s=u/p=s
date=2019-10-04/t=u/s=u/p=s
date=2019-10-05/t=u/s=u/p=s
date=2019-10-06/t=u/s=u/p=s
date=2019-10-07/t=u/s=u/p=s
date=2019-10-08/t=u/s=u/p=s
date=2019-10-09/t=u/s=u/p=s
date=2019-10-10/t=u/s=u/p=s
date=2019-10-11/t=u/s=u/p=s
date=2019-10-12/t=u/s=u/p=s
date=2019-10-13/t=u/s=u/p=s
date=2019-10-14/t=u/s=u/p=s
date=2019-10-15/t=u/s=u/p=s
date=2019-10-16/t=u/s=u/p=s
date=2019-10-17/t=u/s=u/p=s
date=2019-10-18/t=u/s=u/p=s
date=2019-10-19/t=u/s=u/p=s
Any idea why the number of partitions is that huge number? and how can I get number of partitions as expected (18)
spark.sql("show partitions hivetablename").count()
The number of partitions in rdd is different from the hive partitions.
Spark generally partitions your rdd based on the number of executors in cluster so that each executor gets fair share of the task.
You can control the rdd partitions by using sc.parallelize(, )) , df.repartition() or coalesce().
I found a detour easier way:
>>> t = spark.sql("show partitions my_table")
>>> t.count()
18
I am new to PIG. I have written one query which is not working as expected. I am trying to process Google ngrams dataset provided to me.
I load the data which is 1GB
bigrams = LOAD '$(INPUT)' AS (bigram:chararray, year:int, occurrences:int, books:int);
Then I select a subset which is limited to 2000 entries
limbigrams = LIMIT bigrams 2000;
Then see the dump of the limited data (pasting sample output)
(GB product,2006,1,1)
(GB product,2007,5,5)
(GB wall_NOUN,2007,27,7)
(GB wall_NOUN,2008,35,6)
(GB2 ,_.,1906,1,1)
(GB2 ,_.,1938,1,1)
Now I do a group by on limbigrams
D = GROUP limbigrams BY bigram;
When I see the data dump of D I see an entirely different data set (sample)
(GLABRIO .,1977,3,3),(GLABRIO .,1992,3,3),(GLABRIO .,1997,1,1),(GLABRIO .,2000,6,6),(GLABRIO .,2001,9,1),(GLABRIO .,2002,24,3),(GLABRIO .,2003,3,1)})
(GLASS FILMS,{(GLASS FILMS,1978,1,1),(GLASS FILMS,1976,2,1),(GLASS FILMS,1970,3,3),(GLASS FILMS,1966,7,1),(GLASS FILMS,1962,1,1),(GLASS FILMS,1958,1,1),(GLASS FILMS,1955,1,1),(GLASS FILMS,1899,2,2),(GLASS FILMS,1986,6,3),(GLASS FILMS,1984,1,1),(GLASS FILMS,1980,7,3)})
Now I am not attaching the entire output because there is not even a single row of overlap between both the outputs (i.e. before group-by and after group-by). Hence it really doesn't matter to see the output files.
Why does this happen?
The dumps are accurate. The GROUP BY operator in Pig creates a single record for each group and puts every record belonging to that group inside a bag. You can indeed see this in the last record of your second dump. The record stands for the group GLASS FILMS and has a bag containing records which have the bigram as GLASS FILMS. You can read more about the GROUP BY operator here: https://www.tutorialspoint.com/apache_pig/apache_pig_group_operator.htm
I have a dataframe which I want to write to date partitioned BQ table. I am using to_gbq() method to do this. I am able to replace or append the existing table but can't write to a specific partition of the table using to_gbq()
Since to_gbq() doesn't support it as of yet, I created a code snippet for doing this with BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
and you have a DataFrame like this:
import pandas
import datetime
records = [
{"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
{"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
{"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
here's how to write to a specific partition:
from google.cloud import bigquery
client = bigquery.Client(project='your_project')
job_config = bigquery.LoadJobConfig(
write_disposition="WRITE_TRUNCATE",
# This is needed if table doesn't exist, but won't hurt otherwise:
time_partitioning=bigquery.table.TimePartitioning(type_="DAY")
)
# Include target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"
job = client.load_table_from_dataframe(df, table_id, job_config=job_config) # Make an API request
job.result() # Wait for job to finish
The important part is the $... part in the table id. It tells the API to only update a specific partition. If your data contains records which belong to a different partition, the operation is going to fail.
I believe that to_gbq() is not supported yet for partitioned tables.
You can check here recent issues https://github.com/pydata/pandas-gbq/issues/43.
I would recommend that using Google BigQuery API client library https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
You can upload dataframe to BigQuery table too.
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
I am trying to get the number of rows in a table in BigQuery, using the method num_rows, but I get None as a the result. When checked the documentation, it shows in the code :returns: the row count (None until set from the server). When will the server set the number of rows in a table or should I perform any operations before calling this method.
Below is my code
from google.cloud import bigquery
bqclient = bigquery.Client.from_service_account_json('service_account.json')
datasets = list(bqclient.list_datasets())
for dataset in datasets:
for table in bqclient.list_dataset_tables(dataset):
print(table.num_rows)
Try this instead:
for dataset in datasets:
for table in bqclient.list_dataset_tables(dataset):
print("Table {} has {} rows".format(table.table_id,
bqclient.get_table(table).num_rows))