TFTransform ValueError: "Split does not exist over all example artifacts" - tfx

I'm attempting to construct a TFX pipeline, but I keep running into an error during the TFTransform component step. After digging into the error message and its code on GitHub, it appears to come from the function get_split_uris(). From what I can glean, the function is given a list of Artifacts at runtime, and the number of URIs it finds for the requested split does not match the number of Artifacts in that list.
It's odd because my CsvExampleGen component doesn't seem to have any problems ingesting my original data set, which is already split into two CSV files: 'target' and 'candidate'. I cannot find any documentation on this error on the TFX website, so my apologies for not having more information.
I can provide additional details if needed.
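To illustrate the kind of check that seems to produce this message, here is a minimal sketch. It is a simplified model, not TFX's actual code: the dictionaries, the split_uris name and the URI layout are all made up for illustration. As far as I know, Transform by default analyzes a split named 'train', so an Examples artifact whose only splits are 'target' and 'candidate' would fail a check like this unless the splits are named explicitly (the component's splits_config argument appears to be meant for that).

```python
def split_uris(artifacts, split):
    """Return one URI per artifact for `split`; raise if any artifact lacks it.

    Hypothetical stand-in for the real lookup: artifacts are plain dicts here.
    """
    uris = [
        f"{artifact['uri']}/{split}"
        for artifact in artifacts
        if split in artifact["split_names"]
    ]
    # One URI is expected per artifact; a shorter list means the split is missing.
    if len(uris) != len(artifacts):
        raise ValueError(
            f"Split {split!r} does not exist over all example artifacts"
        )
    return uris


# Made-up artifact that only knows the custom split names from the question.
examples = [{"uri": "/tmp/examples/1", "split_names": ["target", "candidate"]}]

print(split_uris(examples, "target"))

try:
    split_uris(examples, "train")  # the split name the sketch assumes as default
except ValueError as err:
    print(err)
```

If this model is right, the fix is either to name the splits 'train' and 'eval' in the ExampleGen output config or to tell Transform which splits to analyze and transform.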

Related

How to convert multiple LCI ecospold files to a custom excel format/ how to use parse_file from pyecospold/ how to read ecospold into brightway

I have multiple ecospold (version 1) files with LCI data that I want to convert to a custom excel format. I need all data given in the ecospold file. For my own convenience I want to use python to complete this task.
My research so far has led me to the following conclusions:
There exist at least two converters (by GLAD and openLCA) that convert ecospold formats (1 and 2) to other formats, e.g. ILCD. But those formats do not help me, since I need to have all the data accessible in python in order to then write it into my custom excel format.
To get the data in python, the package pyecospold (https://github.com/sami-m-g/pyecospold) seems to be a suitable choice.
According to the README that can be found at the pyecospold github repository,
ecoSpold = parse_file("data/v1/v1_1.xml") # Replace with your own XML file
should do the job. So I implemented the following lines:
import os
from pyecospold import parse_file, save_file, Defaults
from lxml import etree
cd = os.getcwd()
path_input = cd + r'\inputs\ecospold_test.xml'
# Parse the required XML file to EcoSpold class.
es = parse_file('inputs/ecospold_test.xml')
Now I run into the error:
TypeError: parse_file() missing 2 required positional arguments: 'schema_path' and 'ecospold_lookup'
I understood that a schema in xsd format is needed, therefore I got the schema files from the github and amended my last line of code:
es = parse_file('inputs/ecospold_test.xml', 'inputs/schemas/v1/EcoSpold01Dataset.xsd')
Now there is still one argument missing:
TypeError: parse_file() missing 1 required positional argument: 'ecospold_lookup'
Since I have no experience in parsing xml files in python, I have no idea what to do with this. Additionally, I am confused why the README does not say anything about those additionally needed arguments.
My second idea was to use brightway to get the data into python. But since brightway itself is quite an extensive package, I could not find a simple (or any) way to do this. (Sadly, the notebooks linked in the answer to the question "Import Ecoinvent 2.2 Ecospold files into Brightway" do not exist anymore.)
Another option would of course be to write my own parser. But because I am lacking experience and pyecospold does exactly this (at least in my understanding), I would like to avoid this option.
Additionally, in openLCA it is possible to read in ecospold files and then export them to an excel format. From this excel format I could of course make my custom excel format. The problem here is that I have no idea how to automate this, because I do not want to read in and export each file individually and manually in openLCA.
If anyone has an idea on how to solve one of my subproblems or a good alternative on how to solve my general problem, I would be very thankful. :)
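As a fallback that avoids both pyecospold and a hand-written ecospold parser, the file can be flattened generically with the standard library. The following is only a sketch: the flatten function is my own helper, and the sample XML is made up to look roughly like an ecospold 1 file, so the element and attribute names in a real file may differ.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return (path, attribute, value) rows for every attribute and text node."""
    rows = []

    def walk(elem, path):
        tag = elem.tag.split("}")[-1]  # drop the "{namespace}" prefix, if any
        here = f"{path}/{tag}"
        for name, value in elem.attrib.items():
            rows.append((here, name, value))
        text = (elem.text or "").strip()
        if text:
            rows.append((here, "#text", text))
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return rows


# Made-up sample; replace with the content of a real file.
SAMPLE = """<ecoSpold xmlns="http://www.EcoInvent.org/EcoSpold01">
  <dataset number="1">
    <flowData>
      <exchange number="10" name="water" unit="kg" meanValue="2.5"/>
    </flowData>
  </dataset>
</ecoSpold>"""

for row in flatten(SAMPLE):
    print(row)
```

Because every value ends up in a flat list of rows, nothing in the file is lost, and the rows can then be regrouped and written to the custom excel layout with whatever spreadsheet library is at hand. Looping over a folder of files is then a matter of reading each file and calling flatten on its content.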

ValueError: Unexpected option 'height' for Points type when using the 'matplotlib' extension. No similar options found

I am trying to reproduce the project at the link below. I have hit various errors, first with installing the libraries and now these ValueErrors. I am writing the code exactly as it is shown in the article, but it does not run.
First error:
Unexpected option 'height' for Points type when using the 'matplotlib' extension. No similar options found.
Second error:
ValueError: zero-size array to reduction operation minimum which has no identity
Link: https://towardsdatascience.com/interactive-geospatial-data-visualization-with-geoviews-in-python-7d5335c8efd1
Screenshot (zero-size array to reduction operation minimum which has no identity): https://i.stack.imgur.com/4I6nc.png
Screenshot (Unexpected option 'height' for Points type when using the 'matplotlib' extension. No similar options found.): https://i.stack.imgur.com/kJrEK.png
I have tried the same code as in the linked article, and I also tried a different dataset, but nothing worked. This code is important for a GIS problem I am working on.

Directly passing pandas data into zipline

I am currently looking for a way to directly pass in a pandas dataframe or csv file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including the source code of zipline. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a df directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles on the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This comment is from run_algo.py file of zipline:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
The ohlcv data to run the backtest with.
This argument is mutually exclusive with:
``bundle``
``bundle_timestamp``
Hope it helped

Issues pulling change log using python

I am trying to query and pull changelog details using python.
The below code returns the list of issues in the project.
issued = jira.search_issues('project= proj_a', maxResults=5)
for issue in issued:
    print(issue)
I am trying to pass values obtained in the issue above
issues = jira.issue(issue,expand='changelog')
changelog = issues.changelog
projects = jira.project(project)
I get the below error on trying the above:
JIRAError: JiraError HTTP 404 url: https://abc.atlassian.net/rest/api/2/issue/issue?expand=changelog
text: Issue does not exist or you do not have permission to see it.
Could anyone advise where I am going wrong or what permissions I need?
Please note, if I pass a specific issue_id in the above code it works just fine, but I am trying to pass a list of issue_ids.
You can already receive all the changelog data in the search_issues() method so you don't have to get the changelog by iterating over each issue and making another API call for each issue. Check out the code below for examples on how to work with the changelog.
issues = jira.search_issues('project= proj_a', maxResults=5, expand='changelog')
for issue in issues:
    print(f"Changes from issue: {issue.key} {issue.fields.summary}")
    # number of changelog entries (careful, each entry can have multiple field changes)
    print(f"Number of Changelog entries found: {issue.changelog.total}")
    for history in issue.changelog.histories:
        print(f"Author: {history.author}")  # person who did the change
        print(f"Timestamp: {history.created}")  # when did the change happen?
        print("\nListing all items that changed:")
        for item in history.items:
            print(f"Field name: {item.field}")  # field to which the change happened
            # new value; item.to might be better in some cases depending on your needs
            print(f"Changed to: {item.toString}")
            # old value; the raw value is stored under 'from', which is a Python keyword,
            # so read it with getattr(item, 'from') if you need it
            print(f"Changed from: {item.fromString}")
        print()
    print()
Just to explain what you did wrong before when iterating over each issue: you have to use the issue.key, not the issue-resource itself. When you simply pass the issue, it won't be handled correctly as a parameter in jira.issue(). Instead, pass issue.key:
for issue in issues:
    print(issue.key)
    myIssue = jira.issue(issue.key, expand='changelog')
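If the goal is to collect the changelog rather than print it, the same loops can build a flat list of rows. This is a sketch under assumptions: changelog_rows is a helper name I made up, and the fake issue built from SimpleNamespace only imitates the attributes used above (key, changelog.histories, items, field, fromString, toString) so that the example runs without a Jira connection.

```python
from types import SimpleNamespace as NS


def changelog_rows(issues):
    """Flatten issue changelogs into one dict per changed field."""
    rows = []
    for issue in issues:
        for history in issue.changelog.histories:
            for item in history.items:
                rows.append({
                    "issue": issue.key,
                    "author": str(history.author),
                    "created": history.created,
                    "field": item.field,
                    "from": item.fromString,
                    "to": item.toString,
                })
    return rows


# Stand-in object; with a real connection, pass the result of
# jira.search_issues(..., expand='changelog') instead.
fake_issue = NS(
    key="PROJ-1",
    changelog=NS(histories=[
        NS(
            author="alice",
            created="2020-01-01T10:00:00.000+0000",
            items=[NS(field="status", fromString="Open", toString="Done")],
        )
    ]),
)

for row in changelog_rows([fake_issue]):
    print(row)
```

A list of dicts like this can be handed directly to csv.DictWriter or a DataFrame constructor for further analysis.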

Textacy - Vectorizer Weighting Error

I've recently found Textacy, and as I go through the API reference guide I'm running into an error with the Vectorizer. If I add any options from the API reference I get a TypeError: unexpected keyword argument. I get this error for other options in addition to weighting.
I installed textacy using pip and I'm using Python3 on Ubuntu. Any help is appreciated. Thanks!
vectorizer = textacy.vsm.Vectorizer(weighting='tfidf')
TypeError: __init__() got an unexpected keyword argument 'weighting'
Ran into the same problem. The API documentation does not reflect the current Vectorizer keyword arguments. The Vectorizer now provides different keyword arguments to allow more control over how TF*IDF is applied.
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_type applies standard term frequency (TF), apply_idf=True applies the inverse document frequency (IDF). From the repo comments, idf_type='smooth' adds one to each document frequency in order to avoid zero divisions.
To see more information about the options check the comment at line 182 in the repository here: https://github.com/chartbeat-labs/textacy/blob/master/textacy/vsm/vectorizers.py
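To make the smoothing remark concrete, here is a small sketch of one common smoothed IDF formula (the variant scikit-learn documents for smooth_idf). That textacy's idf_type='smooth' matches it term for term is an assumption on my part, and the idf function below is my own illustration, not part of textacy.

```python
import math


def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency of a term that appears in `doc_freq` documents."""
    if smooth:
        # Adding one to both counts keeps the ratio finite when doc_freq is 0.
        return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
    return math.log(n_docs / doc_freq) + 1.0


print(idf(n_docs=3, doc_freq=0))  # finite, thanks to smoothing

try:
    idf(n_docs=3, doc_freq=0, smooth=False)
except ZeroDivisionError:
    print("unsmoothed IDF fails for a term with document frequency 0")
```

The smoothed form behaves as if one extra document containing every term had been added to the corpus, which is why the division can never hit zero.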