Need to use Pandas in Airflow V2: pandas vs apache-airflow[pandas]

I need to use Pandas in an Airflow job. Even though I am an experienced programmer, I am relatively new to Python. I want to know: in my requirements.txt, should I install pandas from PyPI or apache-airflow[pandas]?
Also, I am not entirely sure what the provider apache-airflow[pandas] does. And how does pip resolve it (it does not seem to be on PyPI)?
Thank you in advance for the answers.
I tried searching PyPI for apache-airflow[pandas].
I also tried searching SO for related questions.

apache-airflow[pandas] only installs pandas>=0.17.1: https://github.com/apache/airflow/blob/0d2555b318d0eb4ed5f2d410eccf20e26ad004ad/setup.py#L308-L310. For context, this was the PR that originally added it: https://github.com/apache/airflow/pull/17575.
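In other words, the extra is just a named list of optional dependencies declared in Airflow's setup.py, and pip resolves apache-airflow[pandas] by installing apache-airflow from PyPI plus whatever that extra lists; the extra name itself is not a separate PyPI package, which is why searching PyPI for it finds nothing. Paraphrased from the linked lines (not a verbatim copy), the declaration looks roughly like this:
# in Airflow's setup.py (paraphrased)
pandas = [
    'pandas>=0.17.1',
]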
Since >=0.17.1 is quite broad, I suggest pinning Pandas to a more specific version in your requirements.txt. This gives you direct control over the Pandas version, rather than accepting whichever of the many allowed versions the resolver happens to pick.
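For illustration, a requirements.txt along these lines would do it (the pandas pin is only an example; choose the version you actually need):
apache-airflow[pandas]==2.5.1
pandas==1.5.2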

I suggest to install Airflow with constraints as explained in the docs:
pip install "apache-airflow[pandas]==2.5.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.7.txt"
This will guarantee a stable installation of Airflow without conflicts. Airflow also updates the constraints whenever a release is cut, so when you upgrade Airflow you will get the latest possible version that "agrees" with all other Airflow dependencies.
For example, with Airflow 2.5.1:
Python 3.7: pandas==1.3.5
Python 3.9: pandas==1.5.2
Personally, I don't recommend overriding the versions in the constraints file. It carries the risk that your production environment will not be stable/consistent (unless you implement your own mechanism to generate constraints). Should you have a specific task that requires another version of a library (pandas or any other), I suggest using PythonVirtualenvOperator, DockerOperator or any other alternative that lets you set specific library versions for that task; see the sketch below. This also gives the DAG author the freedom to set whatever library version they need without depending on other teams that share the same Airflow instance and need other versions of the same library, or even the same team with another project that needs different versions (think of it the same way you manage virtual environments in your IDE).
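A minimal sketch of the PythonVirtualenvOperator approach (the DAG id, task id and pandas pin are made up for illustration):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def summarize():
    # import inside the callable: it runs in the task's own virtualenv,
    # not in the main Airflow environment
    import pandas as pd
    print(pd.DataFrame({"a": [1, 2, 3]}).describe())

with DAG("pandas_isolated_example", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    PythonVirtualenvOperator(
        task_id="summarize",
        python_callable=summarize,
        requirements=["pandas==1.5.2"],   # pinned per task
        system_site_packages=False,
    )
The dependencies are installed into a fresh virtualenv at task runtime, so the main Airflow environment and its constraints stay untouched.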
As for your question about apache-airflow[pandas]: note that this is an extra (an optional dependency), not an Airflow provider as you mentioned. The reason for having it is that Airflow depended on pandas in the past (as part of Airflow core); however, pandas is a heavy library and not everyone needs it, so moving it to an optional dependency makes sense. That way, only users who need pandas in their Airflow environment will install it.

Related

Difference between airflow.providers and airflow.contrib

I am new to Python and Airflow. I was trying to use the BigQuery hook operator and came to know there are two packages for the hook: airflow.providers.google.cloud.hooks.bigquery and airflow.contrib.hooks.bigquery_hook. So what is the difference between those?
contrib is deprecated (see the source code). You should always use providers.
If you check your logs, you will see a deprecation warning raised whenever you import from contrib.
The reason for this is that integrations with services like BigQuery used to be coupled to Airflow core, which meant new versions could be released only as often as Airflow core itself. To avoid that, Airflow decoupled each service into its own provider package, which is released separately; the import paths below illustrate the difference.
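For example, the two import paths for the same hook (the current one requires the separately installed apache-airflow-providers-google package):
# deprecated, logs a deprecation warning on import:
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

# current, lives in the separately released provider package:
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook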

Origins of distribute_setup.py file in matplotlib

What are the origins of the distribute_setup.py file in Matplotlib? Is the file originally from some other source or is it from the Matplotlib project?
I am interested in using setuptools in my package's setup.py,and I want to know what the best approach to this is.
distribute was a fork of setuptools with Python 3.x support and a number of bugfixes, created because of the lack of maintenance of the latter package.
ez_setup.py is a Python module that fetches and installs setuptools automatically on an as-needed basis. distribute_setup.py provided the same functionality for installing the distribute package instead.
As of July 2013, most of the changes made in the distribute fork have been merged back, and packages should switch over to using setuptools again.
Latest version of ez_setup.py can be found in setuptools repository at https://bitbucket.org/pypa/setuptools/src/tip/ez_setup.py?at=default
The usage is documented in http://setuptools.readthedocs.org/en/latest/using.html
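For reference, the historical bootstrap pattern in a setup.py looked roughly like this (the project metadata is hypothetical; distribute_setup.py was used the same way with use_setuptools imported from it):
# setup.py
from ez_setup import use_setuptools
use_setuptools()  # downloads and installs setuptools if it is missing

from setuptools import setup

setup(
    name="mypackage",        # hypothetical
    version="0.1",
    packages=["mypackage"],
)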

Two completely separate versions of trac

I would like to install a completely new version of Trac alongside our current version (0.11.7), and I am looking for ways to do this. After some research, the advice seems to be to use Python's virtualenv, but I am trying to find specific steps on how to accomplish this without interfering with our 0.11.7 version at all.
I am using Ubuntu as the OS. Any input including any possible pitfalls is appreciated.
Try virtualenvwrapper, which makes using python-virtualenv a breeze.
The steps to create and use such a Python virtual environment are explained in the user documentation. These environments form the core setup of my own Trac plugin development. They even allow you to use custom Python versions, if you ever need that. I found it necessary to give each environment a self-explanatory name and to pair it with a Trac environment directory matching the db version required by the respective Trac version, i.e. virtualenv "trac-0.11_py2.4" with Trac env "sandbox_0.11", "trac-0.12_py2.6" with Trac env "sandbox_0.12", etc. The commands below sketch this setup.
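A minimal sketch with virtualenvwrapper (the environment name mirrors the ones above; the Python and Trac versions are only examples):
mkvirtualenv -p python2.6 trac-0.12_py2.6
workon trac-0.12_py2.6
pip install Trac==0.12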

When do we get an "AssertionError: HDF dataset not available. Check your clearsilver installation"

I am trying to install a DbAuth plugin for Trac. I know that I should probably be chasing this on other Trac and trac-hacks related forums, but still I am wondering: why does one get this error? What exactly is happening?
In my case the DbAuth plugin is trying to read things like "trac_permissions" and "trac_users" from a SQLite or MySQL database. I have checked the databases, the values are in there, but neither of them works. Clearsilver is installed and running as well.
So what usually causes this error? Is the HDF parser receiving wrong info? Please do not take this as a Trac question, just explain to me why these types of errors occur.
Thanks.
A Google search should get you started. You should also consider an alternative, because DbAuth is deprecated.
What version of Trac are you running? Recent versions use Genshi instead of Clearsilver, which means that Clearsilver-based plugins likely won't work correctly (not without modifications, at least). According to the Trac wiki, Trac version 0.11 still had the infrastructure to support Clearsilver-based plugins, version 0.12 retained this support in an unsupported form (meaning use at your own risk; you're on your own if something doesn't work), and version 0.13 dropped support for Clearsilver-based plugins entirely. Unless you're still running an older Trac install (version 0.10 or 0.11), I'm inclined to say this problem is due to the phasing out of Clearsilver support.
According to this trac-hacks ticket, you may want to try re-compiling Clearsilver with the Python bindings (this would only be useful if you're running Trac 0.11 or older).

How to develop and package with ActivePython?

I have been developing a (somewhat complicated) app in core Python (2.6) and have also managed to use pyinstaller to create an executable for deployment in measurements, or for distribution to my colleagues. I work on the Ubuntu OS.
What has troubled me is upgrading the versions of numpy or scipy. Some features I need are in 0.9 and I'm still on 0.7. The process of upgrading them, or matplotlib for that matter, is not elegant. The way I've upgraded on my local machine was to delete the folders of these libraries and then manually install the newer versions.
However, this does not work for machines where I do not have root access. While trying to find a workaround, I found ActivePython. I gave it a quick try and it seems to use PyPM to download the newest scipy and numpy to its custom install location. Excellent! I don't need root access and can use the latest version of the libraries.
QUESTIONS:
If there are libraries not available in the PyPM index, how can I use the source code of those libraries (for example, wxPython) to include them in this ActivePython installation?
How can I use pyinstaller to build an executable using only the libraries in the ActivePython installation?