Airflow Xcom with SSHOperator - ssh

Im trying to get param from SSHOperator into Xcom and get it in python.
def decision_function(**context):
ti = context['ti']
output_ssh= ti.xcom_pull(task_ids='ssh_task')
print('ssh output is: {}'.format(output_ssh))
ls = SSHOperator(
task_id="ssh_task",
command= "ls",
ssh_hook = sshHook,
dag = mydag)
get_xcom = PythonOperator(
task_id='test',
python_callable=decision_function,
provide_context=True,
dag=mydag
ls >> get_xcom
I get the wrong result in the xcom_pull:
ssh output is: MTcySyBhaXJmbG93CTIuMUcgQXV0b0hpdHVtX05ld
Every thing is work fine with BashOperator but when I try to use the SSH, it not working currect.
I also try to change in the config the enable_xcom_pickling param, but still not working.
airflow: 2.1.4
linux
thank you.

The value of Xcom may be b64encoded depending on the value of enable_xcom_pickling as you can see in the source code
The issue you are facing has been discussed in PR which suggested to change functionality to be like other operators but the PR was not merged and you can see the reasons for it in the code review comments.
if you wish you can create custom version of the operator without the enable_xcom_pickling condition:
class MySSHOperator(SSHOperator):
def execute(self, context=None) -> Union[bytes, str]:
result: Union[bytes, str]
if self.command is None:
raise AirflowException("SSH operator error: SSH command not specified. Aborting.")
# Forcing get_pty to True if the command begins with "sudo".
self.get_pty = self.command.startswith('sudo') or self.get_pty
try:
with self.get_ssh_client() as ssh_client:
result = self.run_ssh_client_command(ssh_client, self.command)
except Exception as e:
raise AirflowException(f"SSH operator error: {str(e)}")
return result.decode('utf-8')

Related

How does snakemake --show-failed-logs works

I want to use snakemake with the --show-failed-logs parameter but not sure how it works or what to expect.
For example (not a working example I just typed and copied some code parts, if necessary I can create a working example)
My command is:
snakemake -s Snakefile_test.smk --configfile test.yml -j 4 --restart-times 2 --show-failed-logs
In Snakefile_test.smk I have a rule:
rule testrule:
input:
R1_trimmed = rules.trim.output.R1_trimmed,
R2_trimmed = rules.trim.output.R2_trimmed
output:
plot_R1 = output+"/{sample}/figures/{sample}_R1_plot.png",
plot_R2 = output+"/{sample}/figures/{sample}_R2_plot.png",
log:
testrule_log = output+"/{sample}/logs/plot_log.txt"
run:
shell("python scripts/createplot.py -r1_trimmed {input.R1_trimmed} -r2_trimmed {input.R2_trimmed} -plot_r1 {output.plot_R1} -plot_r2 {output.plot_R2} -log {log.testrule_log}")
In createplot.py to test I have something like:
if __name__ == "__main__":
logging.basicConfig(filename=args.log, level=logging.DEBUG, format='%(asctime)s %(levelname)s %(name)s %(message)s')
logger=logging.getLogger(__name__)
#to add some content to the log file
try:
1/0
except ZeroDivisionError as err:
logger.error(err)
main()
#to let the script crash
a = 1/0
Because obviously createplot.py will crash the pipeline I expected that snakemake would print the contents of testrule_log to the screen. But that is not the case. I am using version 5.25.0
EDIT:
I am familiar with
script:
"scripts/createplot.py"
But need to use shell in this case. If that causes the problem let me know.

pdfminer3k - pdf2txt.py error

I want to convert my pdf files to txt files and used pdfminer3k module & pdf2txt.py, however, I got an error.
pdf2txt.py -o file.txt -t tag file.pdf
This is my code at cmd screen.
Traceback (most recent call last):
File "C:\Python36\lib\site.py", line 67, in
import os
File "C:\Python36\lib\os.py", line 409
yield from walk(new_path, topdown, onerror, followlinks)
^
SyntaxError: invalid syntax
This is an error message that I got.
Could you help me to fix this problem??
Added for reference: Great resourse:
http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
The -t flag is the type of output. The options are text, tag, xml, and html.
Tag refers to generating a tag for xml. Replace tag with text in your command and try it.
The order of optional input also matters.
You also must invoke python, your command line does'nt know what import means, yet some of your environment seems to be setup. My example is for windows cmd from Anaconda3\Scripts directory. If your in juptyer notebook or a console, you should be able to run import pdf2txt with the .py
To setup your environment you need to append the os.path.append(yourpdfdirectory) otherwise file.pdf will not be found.
Try python pdf2txt.py -t text -o file.txt file.pdf
Or if you are brave...this is how to do programmatically. The trouble with xml is if you want to get the text, each character from xml tree is returned in an arbitrary order. You can get it to work but you need to build the string character by character which is not that hard, its just logically time consuming.
fp = open(filesin,'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager(caching=False)
laparams = LAParams(all_texts=True)
laparams.boxes_flow = -0.2
laparams.paragraph_indent = 0.2
laparams.detect_vertical = False
#laparams.heuristic_word_margin = 0.03
laparams.word_margin = 0.2
laparams.line_margin = 0.3
outfp = open(filesin+".out.tag" ,'wb')
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
#process_pdf(rsrcmgr, device, pdfparse, pagenos,caching=c, check_extractable=True)
for p,page in enumerate(doc.get_pages()):
if p == 0: #temporary for page 1
interpreter.process_page(page)
layout = device.get_result()
alltextinbox = ''
#This is a rich environment so categorization of this object hierarchy is needed
for c,lt_obj in enumerate(layout):
#print(type(lt_obj),"This is type ",c,"th object on the ",p,"th page")
if isinstance(lt_obj,LTTextBoxHorizontal) or isinstance(lt_obj,LTTextBox) or isinstance(lt_obj,LTTextLine):
print("Type ,",type(lt_obj)," and text ..",lt_obj.get_text())
obj_textbox_line.update({lt_obj:lt_obj.get_text()})
elif p != 0:
pass
fp.close()
#print(obj_textbox_line)
#call the column finder here
#check_matching("example", "example1")
#text_doc_df = pd.DataFrame(obj_textbox_line,columns=['text'])
#print (text_doc_df)
pass
I'm working on a generic row/column matcher. If you don't want to bother, you can buy this software already for like 150 bucks for a pro converter.

EOF error using input Python 3

I keep getting an EOF error but unsure as to why. I have tried with and without int() but it makes no difference. I'm using Pycharm 3.4 and Python 3.
Thanks,
Chris
while True:
try:
number = int(input("what's your favourite number?"))
print (number)
break
You must close a try statement because you are declaring that there might be an error and you want to handle it
while True:
try:
number = int(input("what's your favourite number?"))
print(number)
break
except ValueError as e:
print("Woah, there is an error: {0}".format(e))

Python 2.7.3 process.Popen() failures

I'm using python 2.7.3 on a Windows 7 64-bit machine currently (also developing in Linux 32 bit, Ubuntu 12.04) and am having odd difficulties getting python to communicate with the command prompt/terminal successfully. My code looks like this:
import subprocess, logging, platform, ctypes
class someClass (object):
def runTerminalCommand:
try:
terminalOrCmdLineCommand = self.toString()
process = subprocess.Popen(terminalOrCmdLineCommand, shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
output = process.stdout.readlines()
if len(output) < 1:
logging.exception("{} error : {}".format(self.cmd,process.stderr.readlines()))
raise ConfigError("cmd issue : {}".format(self.cmd))
return output
except ValueError as err:
raise err
except Exception as e:
logging.exception("Unexpected error : " + e.message)
raise ConfigError("unexpected error")
Now, I know that the self.toString() returned value will process correctly if I enter it manually, so I'm limiting this to an issue with how I'm sending it to the command line via the subprocess. I've read the documentation, and found that the subprocess.check_call() doesn't return anything if it encounters an error, so I'm using .Popen()
The exception I get is,
[date & time] ERROR: Unexpected error :
Traceback (most recent call last):
File "C:\[...]"
raise ConfigError("cmd issue : {}".format(self.cmd))
"ConfigError: cmd issue : [the list self.cmd printed out]"
What I am TRYING to do, is run a command, and read the input back. But I seem to be unable to automate the call I want to run. :(
Any thoughts? (please let me know if there are any needed details I left out)
Much appreciated, in advance.
The docs say:
Use communicate() rather than .stdin.write, .stdout.read or
.stderr.read to avoid deadlocks due to any of the other OS pipe
buffers filling up and blocking the child process.
You could use .communicate() as follows:
p = Popen(cmd, stdout=PIPE, stderr=PIPE)
stdout_data, stderr_data = p.communicate()
if p.returncode != 0:
raise RuntimeError("%r failed, status code %s stdout %r stderr %r" % (
cmd, p.returncode, stdout_data, stderr_data))
output_lines = stdout_data.splitlines() # you could also use `keepends=True`
See other methods to get subprocess output in Python.

How do I set a backend for django-celery. I set CELERY_RESULT_BACKEND, but it is not recognized

I set CELERY_RESULT_BACKEND = "amqp" in celeryconfig.py
but I get:
>>> from tasks import add
>>> result = add.delay(3,5)
>>> result.ready()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/result.py", line 105, in ready
return self.state in self.backend.READY_STATES
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/result.py", line 184, in state
return self.backend.get_status(self.task_id)
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/backends/base.py", line 414, in _is_disabled
raise NotImplementedError("No result backend configured. "
NotImplementedError: No result backend configured. Please see the documentation for more information.
I just went through this so I can shed some light on this. One might think for all of the great documentation stating some of this would have been a bit more obvious.
I'll assume you have both RabbitMQ up and functioning (it needs to be running), and that you have dj-celery installed.
Once you have that then all you need to do is to include this single line in your setting.py file.
BROKER_URL = "amqp://guest:guest#localhost:5672//"
Then you need to run syncdb and start this thing up using:
python manage.py celeryd -E -B --loglevel=info
The -E states that you want events captured and the -B states you want celerybeats running. The former enable you to actually see something in the admin window and the later allows you to schedule. Finally you need to ensure that you are actually going to capture the events and the status. So in another terminal run this:
./manage.py celerycam
And then finally your able to see the working example provided in the docs.. -- Again assuming you created the tasks.py that is says to.
>>> result = add.delay(4, 4)
>>> result.ready() # returns True if the task has finished processing.
False
>>> result.result # task is not ready, so no return value yet.
None
>>> result.get() # Waits until the task is done and returns the retval.
8
>>> result.result # direct access to result, doesn't re-raise errors.
8
>>> result.successful() # returns True if the task didn't end in failure.
True
Furthermore then you are able to view your status in the admin panel.
I hope this helps!! I would add one more thing which helped me. Watching the RabbitMQ Log file was key as it helped me identify that django-celery was actually talking to RabbitMQ.
Are you running django celery?
If so, you need to start a python shell in the context of django (or whatever the technical term is).
Type:
python manage.py shell
And try your commands from that shell
HI tried everything to work celery v3.1.25 with Django 1.8 version nothing worked..
Finally below line helped me ,feeling happy
app = Celery('documents',backend="celery.backends.amqp:AMQPBackend")
Setting backend="celery.backends.amqp:AMQPBackend" fixed my error.