Writing apache beam pCollection to bigquery causes type Error - google-bigquery

I have a simple beam pipeline, as follows:
with beam.Pipeline() as pipeline:
output = (
pipeline
| 'Read CSV' >> beam.io.ReadFromText('raw_files/myfile.csv',
skip_header_lines=True)
| 'Split strings' >> beam.Map(lambda x: x.split(','))
| 'Convert records to dictionary' >> beam.Map(to_json)
| beam.io.WriteToBigQuery(project='gcp_project_id',
dataset='datasetID',
table='tableID',
create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)
)
However upon runnning I get a typeError, stating the following:
line 2147, in __init__
self.table_reference = bigquery_tools.parse_table_reference(if isinstance(table,
TableReference):
TypeError: isinstance() arg 2 must be a type or tuple of types
I have tried defining a TableReference object and passing it to the WriteToBigQuery class but still facing the same issue. Am I missing something here? I've been stuck at this step for what feels like forever and I don't know what to do. Any help is appreciated!

This probably occurred since you installed Apache Beam without the GCP modules. Please make sure to do following (in a virtual environment).
pip install apache-beam[gcp]
It's a weird error though so feel free to file a Github issue against the Apache Beam project.

Related

Using schema update option in beam.io.writetobigquery

I am loading a bunch log files into BigQuery using apache beam data flow. The file format can change over a period of time by adding new columns to the files. I see Schema Update Option ALLOW_FILED_ADDITION.
Anyone know how to use it? This is how my WriteToBQ step looks:
| 'write to bigquery' >> beam.io.WriteToBigQuery('project:datasetId.tableId', ,write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
I haven't actually tried this yet but digging into the documentation, it seems you are able to pass whatever configuration you like to the BigQuery Load Job using additional_bq_parameters. In this case it might look something like:
| 'write to bigquery' >> beam.io.WriteToBigQuery(
'project:datasetId.tableId',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
additional_bq_parameters={
'schemaUpdateOptions': [
'ALLOW_FIELD_ADDITION',
'ALLOW_FIELD_RELAXATION',
]
}
)
Weirdly, this is actually in the Java SDK but doesn't seem to have made its way to the Python SDK.

How do I get Source Extractor to Analyze an Image?

I'm relatively inexperienced in coding, so right now I'm just familiarizing myself with the basics of how to use SE, which I'll need to use in the near future.
At the moment I'm trying to get it to analyze a FITS file on my computer (which is a Mac). I'm sure this is something obvious, but I haven't been able to get it do that. Following the instructions in Chapters 6 and 7 of Source Extractor for Dummies (linked below), I input the following:
sex MedSpiral_20deg_Serl2_.45_.fits.fits -c configuration_file.txt
And got the following error message:
WARNING: configuration_file.txt not found, using internal defaults
----- SExtractor 2.19.5 started on 2020-02-05 at 17:10:59 with 1 thread
Setting catalog parameters
ERROR: can't read default.param
I then tried entering parameters manually:
sex MedSpiral_20deg_Ser12_.45_.fits.fits -c configuration_file.txt -DETECT_TYPE CCD -MAG_ZEROPOINT 2.5 -PIXEL_SCALE 0 -SATUR_LEVEL 1 -SEEING_FWHM 1
And got the same error message. I tried referencing default.sex directly:
sex MedSpiral_20deg_Ser12_.45_.fits.fits -c default.sex
And got the same error message again, substituting "configuration_file.txt not found" with "default.sex not found" (I checked that default.sex was on my computer, it is). The same thing happened when I tried to use default.param.
Here's the link to SE for Dummies (Chapter 6 begins on page 19):
http://astroa.physics.metu.edu.tr/MANUALS/sextractor/Guide2source_extractor.pdf
If you run the command "sex MedSpiral_20deg_Ser12_.45_fits.fits -c default.sex" within the config folder (within the sextractor folder), you will be able to run it.
However, I wonder how I can possibly run sextractor command from any folder in my computer?

Error in BQ shell Loading Datastore with write_disposition as Write append

1: I tried to load on an existing table [using Datastore file]
2. Bq Shell asked me to add write_disposition to write append to load to existing table
3. If I do the above, throws an error as follows:
load --source_format=DATASTORE_BACKUP --write_disposition=WRITE_append --allow_jagged_rows=None sample_red.t1estchallenge_1 gs://test.appspot.com/bucket/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.entity.backup_info
Error parsing command: flag --allow_jagged_rows=None: ('Non-boolean argument to boolean flag',None)
I tried allow jagged rows = 0 and allow jagged rows = None, nothing works just the same error.
Please advise on this.
UPDATE: As Mosha suggested --allow_jagged_rows=false has worked. It should be before --write_disposition=Write_truncate. But this has led to another issue on encoding. Can anyone say what should be the encoding type for DATASTORE_BACKUP?. I tried both --encoding=UTF-8 and --encoding=ISO-8859.
load --source_format=DATASTORE_BACKUP --allow_jagged_rows=false --write_disposition=WRITE_TRUNCATE sample_red.t1estchallenge_1 gs://test.appspot.com/STAGING/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.entityname.backup_info
Please advise.
You should use "false" (or "true") with boolean arguments, i.e.
--allow_jagged_rows=false

Unable to resolve ERROR 2017: Internal error creating job configuration on EMR when running PIG

I have been trying to run a very simple task with Pig on Amazon EMR. When I run the commands in interactive shell, everything works fine. But when I ran the same thing as batch job, I get
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal
error creating job configuration.
and the running the script fails.
Here's my 7 line script. It's just computing averages over tuples of Google bigrams. mc is match count and vc is volume count.
bigrams = LOAD 's3n://<<bucket-name>>/gb­bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
grouped_bigrams = group bigrams by bigram;
answer1 = foreach grouped_bigrams generate group, ((DOUBLE) SUM(bigrams.mc))/COUNT(bigrams) AS avg_mc;
sort_answer1 = ORDER answer1 BY avg_mc desc;
answer2 = LIMIT sort_answer1 5;
STORE answer1 INTO 's3n://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3n://<bucket-name>/output/bigram/20130409/answer2';
I was guessing the error has to something to do with STORE and s3 path. So I have tried various combinations like using $OUTPUT, backslashes, etc. But keep getting the same error.
Any help would be highly appreciated.
Have you tried using the S3 Block File System instead of the native file system?
e.g.
s3://<<bucket-name>>/gb­bigrams/*
s3://<bucket-name>/output/bigram/20130409/answer1

Python file location conventions and Import errors

I'm new to Python and I simply don't know how to handle this specific problem:
I'm trying to run an executable (named ros_M5e.py) that's located in the directory /opt/ros/diamondback/stacks/hrl/hrl_rfid/src/hrl_rfid/ros_M5e.py (annoyingly long filepath, I know, but necessary). However, within the ros_M5e.py file there is a call to another file that is further up the file path: from hrl.hrl_rfid.msg import RFIDread. The directory msg indeed is located at /opt/ros/diamondback/stacks/hrl/hrl_rfid/ and it does indeed contain the file RFIDread. However, whenever I try to execute ros_M5e.py I get this error:
Traceback (most recent call last):
File "/opt/ros/diamondback/stacks/hrl/hrl_rfid/src/hrl_rfid/ros_M5e.py", line 37, in <module>
from hrl.hrl_rfid.msg import RFIDread
ImportError: No module named hrl.hrl_rfid.msg
Would someone with some expertise please shine some light on this problem for me? It seems like just a rudimentary file location problem, but I just don't know the appropriate Python conventions to fix it. I've tried putting the ros_M5e.py file in the same directory as the files it calls and changing the filepaths but to no avail.
Thanks a lot,
Khiya
Sure, I can help you get it up and running.
From the StackOverflow posting, it would seem that you're checking out the stack to /opt/ros/diamondback. This is no good, as it is a system path. You need to install into your local path. The reason for "readonly" on the repository is that you do not have permissions to make changes to the code -- it will still work just fine for you on your local machine. I spent a fair amount of time showing how to use this package (at least the python version) here:
http://www.ros.org/wiki/hrl_rfid
I'll try to do a quick run-through for installing it.... Run the following commands:
cd
mkdir sandbox
cd sandbox/
svn checkout http://gt-ros-pkg.googlecode.com/svn/trunk/hrl/hrl_rfid hrl_rfid (double-check that this checkout works OK!)
Add the following line to the bottom of your bashrc to tell ROS where to find the new package. (You may use "gedit ~/.bashrc")
export ROS_PACKAGE_PATH=$ROS_PACKAGE_PATH:$HOME/sandbox/hrl_rfid
Now execute the following:
roscd hrl_rfid (did you end up in the correct directory?)
rosmake hrl_rfid (did it make without errors?)
roscd hrl_rfid/src/hrl_rfid
At this point everything is actually installed correctly. By default, ros_M5e.py assumes that the reader is located at "/dev/robot/RFIDreader". Unless you've already altered udev rules, this will not be the case on your machine. I suggest running through the code:
http://www.ros.org/wiki/hrl_rfid
using iPython (a command-line python prompt that will let you execute python commands one at a time) to make sure everything is working (replace /dev/ttyUSB0 with whatever device your RFID reader is connected as):
import lib_M5e as M5e
r = M5e.M5e( '/dev/ttyUSB0', readPwr = 3000 )
r.ChangeAntennaPorts( 1, 1 )
r.QueryEnvironment()
r.TrackSingleTag( 'test_tag_id_' )
r.ChangeTagID( 'test_tag_id_' )
r.QueryEnvironment()
r.TrackSingleTag( 'test_tag_id_' )
r.ChangeAntennaPorts( 2, 2 )
r.QueryEnvironment()
This means that the underlying library is working just fine. Next, test ROS (make sure "roscore" is running!), by putting this in a python file and executing:
import lib_M5e as M5e
def P1(r):
r.ChangeAntennaPorts(1,1)
return 'AntPort1'
def P2(r):
r.ChangeAntennaPorts(2,2)
return 'AntPort2'
def PrintDatum(data):
ant, ids, rssi = data
print data
r = M5e.M5e( '/dev/ttyUSB0', readPwr = 3000 )
q = M5e.M5e_Poller(r, antfuncs=[P1, P2], callbacks=[PrintDatum])
q.query_mode()
t0 = time.time()
while time.time() - t0 < 3.0:
time.sleep( 0.1 )
q.track_mode( 'test_tag_id_' )
t0 = time.time()
while time.time() - t0 < 3.0:
time.sleep( 0.1 )
q.stop()
OK, everything works now. You can make your own node that is tuned to your setup:
#!/usr/bin/python
import ros_M5e as rm
def P1(r):
r.ChangeAntennaPorts(1,1)
return 'AntPort1'
def P2(r):
r.ChangeAntennaPorts(2,2)
return 'AntPort2'
ros_rfid = rm.ROS_M5e( name = 'my_rfid_server',
readPwr = 3000,
portStr = '/dev/ttyUSB0',
antFuncs = [P1, P2],
callbacks = [] )
rospy.spin()
ros_rfid.stop()
Or, ping me back and I can tweak ros_M5e.py to take an optional "portStr" -- though I recommend making your own so that you can name your antennas sensibly. Also, I highly recommend setting udev rules to ensure that the RFID reader always gets assigned to the same device: http://www.hsi.gatech.edu/hrl-wiki/index.php/Linux_Tools#udev_for_persistent_device_naming
BUS=="usb", KERNEL=="ttyUSB*", SYSFS{idVendor}=="0403", SYSFS{idProduct}=="6001", SYSFS{serial}=="ftDXR6FS", SYMLINK+="robot/RFIDreader"
If you do not do this... there is no guarantee that the reader will always be enumerated at /dev/ttyUSBx.
Let me know if you have any further problems.
~Travis Deyle (Hizook.com)
PS -- Did you modify ros_M5e.py to "from hrl.hrl_rfid.msg import RFIDread"? In the repo, it is "from hrl_rfid.msg import RFIDread". The latter is correct. As long as you have your ROS_PACKAGE_PATH correctly defined, and you've run rosmake on the package, then the import statement should work just fine. Also, I would not recommend posting ROS-related questions to StackOverflow. Very few people on here are going to be familiar with the ROS ecosystem (which is VERY complex). Please post questions here instead:
http://answers.ros.org/
http://code.google.com/p/gt-ros-pkg/issues/list
You need to make sure that following are true:
Directory /opt/ros/diamondback/stacks/ is in your python path.
/opt/ros/diamondback/stacks/hr1 contains __init__.py
/opt/ros/diamondback/stacks/hr1/hr1_rfid contians __init__.py
/opt/ros/diamondback/stacks/hr1/hr1_rfid/msg contians __init__.py
As the asker explained in comments that the RFIDRead does not have .py extension, so here is how that can be imported.
import imp
imp.load_source('RFIDRead', '/opt/ros/diamondback/stacks/hr1/hr1_rfid/msg/RFIDRead.msg')
Check out imp documentation for more information.