Dataflow BigQuery read from ValueProvider: 'StaticValueProvider' object has no attribute 'projectId' - google-bigquery

I'm using the Python SDK for Apache Beam. I'm attempting to read data from BigQuery via a ValueProvider (as the documentation states these are allowed):
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.value_provider import ValueProvider

def run(bq_source_table: ValueProvider,
        pipeline_options=None):
    pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from BigQuery" >> ReadFromBigQuery(table=bq_source_table)
        )
The options are declared as follows:
class CPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            "--bq_source_table",
            help='The BigQuery source table name, in the form '
                 '"<project>:<dataset>.<table>".'
        )
Executing the pipeline yields the error below:
AttributeError: 'StaticValueProvider' object has no attribute 'projectId' [while running 'Read from BigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']
Any suggestions on how to resolve this? I do not want to use Flex Templates.
EDIT: Worth mentioning that the query parameter does accept a ValueProvider. Could this be a bug?

My only suggestion would be to use Flex Templates.
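For completeness, a minimal sketch of the workaround hinted at in the question's EDIT: pass the whole query as a ValueProvider instead of the table reference. The --bq_source_query argument name and the use_standard_sql flag are assumptions for illustration, not from the original post.

    parser.add_value_provider_argument(
        "--bq_source_query",
        help='Full SQL, e.g. "SELECT * FROM `<project>.<dataset>.<table>`".'
    )

and in the pipeline:

    # options = pipeline_options.view_as(CPipelineOptions)
    pipeline | "Read from BigQuery" >> ReadFromBigQuery(
        query=options.bq_source_query,   # a ValueProvider is accepted here
        use_standard_sql=True,
    )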

pyspark.sql.utils.IllegalArgumentException: requirement failed: Temporary GCS path has not been set

On Google Cloud Platform, I am trying to submit a PySpark job that writes a DataFrame to BigQuery.
The code that does the writing is the following:
finalDF.write.format("bigquery")\
.mode('overwrite')\
.option("table","[PROJECT_ID].dataset.table")\
.save()
And I get the mentioned error in the title. How can I set the GCS temporary path?
As the GitHub repository of the spark-bigquery-connector states:
One can specify it when writing:
df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "some-bucket")
  .save("dataset.table")
Or in a global manner:
spark.conf.set("temporaryGcsBucket","some-bucket")
Property "temporaryGcsBucket" needs to be set either at the time of writing dataframe or while creating sparkSession.
.option("temporaryGcsBucket","some-bucket")
or like .option("temporaryGcsBucket","some-bucket/optional_path")
1. finalDF.write.format("bigquery") .mode('overwrite').option("temporaryGcsBucket","some-bucket").option("table","[PROJECT_ID].dataset.table") .save()
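For the session-level alternative mentioned above, a minimal PySpark sketch (the app name and bucket are placeholders, not from the original answer):

from pyspark.sql import SparkSession

# Setting the bucket once on the session: every BigQuery write in this
# session will then stage its data in the given temporary GCS bucket.
spark = (SparkSession.builder
         .appName("bq-write-example")                  # placeholder app name
         .config("temporaryGcsBucket", "some-bucket")  # same key as spark.conf.set()
         .getOrCreate())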

Processing Event Hub Capture AVRO files with Azure Data Lake Analytics

I'm attempting to extract data from AVRO files produced by Event Hub Capture. In most cases this works flawlessly, but certain files are causing me problems. When I run the following U-SQL job:
USE DATABASE Metrics;
USE SCHEMA dbo;

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
REFERENCE ASSEMBLY [Avro];
REFERENCE ASSEMBLY [log4net];

USING Microsoft.Analytics.Samples.Formats.ApacheAvro;
USING Microsoft.Analytics.Samples.Formats.Json;
USING System.Text;

//DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{filename}";
DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/2018/01/16/19/rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro";

@eventHubArchiveRecords =
    EXTRACT Body byte[],
            date DateTime,
            filename System.String
    FROM @input
    USING new AvroExtractor(@"
        {
            ""type"":""record"",
            ""name"":""EventData"",
            ""namespace"":""Microsoft.ServiceBus.Messaging"",
            ""fields"":[
                {""name"":""SequenceNumber"",""type"":""long""},
                {""name"":""Offset"",""type"":""string""},
                {""name"":""EnqueuedTimeUtc"",""type"":""string""},
                {""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
                {""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
                {""name"":""Body"",""type"":[""null"",""bytes""]}
            ]
        }
    ");

@json =
    SELECT Encoding.UTF8.GetString(Body) AS json
    FROM @eventHubArchiveRecords;

OUTPUT @json
TO "/outputs/Avro/testjson.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
I get the following error:
Unhandled exception from user code: "The given key was not present in the dictionary."
An unhandled exception from user code has been reported when invoking the method 'Extract' on the user type 'Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor'
Am I correct in assuming the problem is within the AVRO file produced by Event Hub Capture, or is there something wrong with my code?
The "key not present" error is referring to the fields in your EXTRACT statement: it's not finding the date and filename fields. I removed those fields and your script runs correctly in my ADLA instance.
The current implementation of the sample extractor only supports primitive types, not the complex types of the Avro specification.
You have to build and use an extractor based on Apache Avro rather than the sample extractor provided by Microsoft.
We went down the same path.

Jmeter Beanshell: Accessing global lists of data

I'm using JMeter to design a test that requires data to be randomly read from text files. To save memory, I have set up a "setUp Thread Group" with a BeanShell PreProcessor containing the following:
//Imports
import org.apache.commons.io.FileUtils;
//Read data files
List items = FileUtils.readLines(new File(vars.get("DataFolder") + "/items.txt"));
//Store for future use
props.put("items", items);
I then attempt to read this in my other thread groups and am trying to access a random line in my text files with something like this:
(props.get("items")).get(new Random().nextInt((props.get("items")).size()))
However, this throws a "Typed variable declaration" error. I think it's because get() returns an Object and I'm trying to invoke size() on it, even though the underlying value is really a List. I'm not sure what to do here. My ultimate goal is to define some lists of data once, to be used globally in my test, so my thread groups don't have to store this data themselves.
Does anyone have any thoughts as to what might be wrong?
EDIT
I've also tried defining the variables in the setUp thread group as follows:
bsh.shared.items = items;
And then using them as this:
(bsh.shared.items).get(new Random().nextInt((bsh.shared.items).size()))
But that fails with the error "Method size() not found in class 'bsh.Primitive'".
You were very close; just add a cast to List so the interpreter knows what the expected object is:
log.info(((List)props.get("items")).get(new Random().nextInt((props.get("items")).size())));
Be aware that since JMeter 3.1 it is recommended to use Groovy for any form of scripting as:
Groovy performance is much better
Groovy supports more modern Java features, while with Beanshell you're stuck at the Java 5 level
Groovy provides plenty of JDK enhancements, e.g. the File.readLines() function
So kindly find the Groovy solution below:
In the first Thread Group:
props.put('items', new File(vars.get('DataFolder') + '/items.txt').readLines())
In the second Thread Group:
def items = props.get('items')
def randomLine = items.get(new Random().nextInt(items.size()))

How to use Google DataFlow Runner and Templates in tf.Transform?

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.
We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples,
it appears possible to get tensorflow_transform.beam.impl AnalyzeAndTransformDataset to specify which PipelineRunner to use, as follows:
from tensorflow_transform.beam import impl as tft
pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)
TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview - basically:
A) in PipelineOptions derived types, change option types to ValueProvider (python way: type inference or type hints ???)
B) change runner to TemplatingDataflowPipelineRunner
C) mvn archetype:generate to store template in GCS (python way: a yaml file like TF Hypertune ???)
D) gcloud beta dataflow jobs run --gcs-location --parameters
The question is: Can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner?
Python templates are available as of April 2017 (see documentation). The way to operate them is the following:
Define UserOptions subclassed from PipelineOptions.
Use the add_value_provider_argument API to add specific arguments to be parameterized.
Regular non-parameterizable options will continue to be defined using argparse's add_argument.
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        parser.add_argument('--non_value_provider_arg', default='some_other_value')
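To illustrate how such a parameterized option is then consumed at template run time rather than at pipeline-construction time, here is a minimal sketch; the DoFn and its names are hypothetical, not part of the answer:

import apache_beam as beam

class AddPrefix(beam.DoFn):
    # Hypothetical DoFn: it stores the ValueProvider and only calls .get()
    # inside process(), i.e. when the templated job actually runs.
    def __init__(self, prefix_vp):
        self.prefix_vp = prefix_vp

    def process(self, element):
        yield self.prefix_vp.get() + element

# usage sketch:
# options = pipeline_options.view_as(UserOptions)
# ... | beam.ParDo(AddPrefix(options.value_provider_arg)) | ...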
Note that Python doesn't have a TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike what happened in Java 1.X).
Unfortunately, Python pipelines cannot be used as templates; that is only available for Java today. Since you need to use the Python library, it will not be feasible to do this.
tensorflow_transform would also need to support ValueProvider so that you can pass in options as a value provider type through it.

Renaming an Amazon CloudWatch Alarm

I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field on an edit. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys
import boto

def rename_alarm(alarm_name, new_alarm_name):
    conn = boto.connect_cloudwatch()

    def get_alarm():
        alarms = conn.describe_alarms(alarm_names=[alarm_name])
        if not alarms:
            raise Exception("Alarm '%s' not found" % alarm_name)
        return alarms[0]

    alarm = get_alarm()

    # work around boto comparison serialization issue
    # https://github.com/boto/boto/issues/1311
    alarm.comparison = alarm._cmp_map.get(alarm.comparison)

    alarm.name = new_alarm_name
    conn.update_alarm(alarm)
    # update actually creates a new alarm because the name has changed, so
    # we have to manually delete the old one
    get_alarm().delete()

if __name__ == '__main__':
    alarm_name, new_alarm_name = sys.argv[1:3]
    rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an EC2 instance with a role that allows this, or you've got a ~/.boto file with your credentials. It's easy enough to manually add yours.
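Given the sys.argv handling at the bottom of the script, invoking it would look something like this (the file name is a placeholder):

python rename_alarm.py old-alarm-name new-alarm-name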
Unfortunately it looks like this is not currently possible.
I looked around for the same solution, but it seems neither the console nor the CloudWatch API provides that feature.
Note: you can, however, copy the existing alarm with the same parameters and save it under a new name.
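A rough sketch of that copy-and-delete approach using boto3; the field filtering below is an assumption and may need adjusting for alarms that use extra features such as metric math:

import boto3

def copy_alarm_to_new_name(old_name, new_name):
    # Sketch only: read the alarm, strip fields that put_metric_alarm() does
    # not accept, recreate it under the new name, then delete the original.
    cw = boto3.client("cloudwatch")
    alarm = cw.describe_alarms(AlarmNames=[old_name])["MetricAlarms"][0]
    for read_only in ("AlarmArn", "AlarmConfigurationUpdatedTimestamp",
                      "StateValue", "StateReason", "StateReasonData",
                      "StateUpdatedTimestamp", "StateTransitionedTimestamp",
                      "EvaluationState"):
        alarm.pop(read_only, None)
    alarm["AlarmName"] = new_name
    cw.put_metric_alarm(**alarm)              # create the copy under the new name
    cw.delete_alarms(AlarmNames=[old_name])   # remove the original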