How can I run Neural Machine Translation with Attention in Google Colab with a different language pair?

I want to use a different language pair with the example provided on the TensorFlow website. The Google Colab notebook only uses Spanish-English:
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/nmt_with_attention.ipynb
I tried changing the link to the Spanish-English data that the notebook downloads, but that didn't help.
How can I try a different language pair without setting up Colab locally? The end of that page mentions that I can try a different language set.

The final note on using a different dataset refers to this website, which provides tab-delimited files.
You mainly need to change the values in this cell to match the link to the zip file you need.
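import os
import tensorflow as tf
# Download the file (change the archive name, the origin URL and the
# extracted path below to match the language pair you want)
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"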
You can try other datasets from:
OPUS
WMT
However, in these corpora the source and target are in two separate files, so you have to adjust the code that extracts the pairs: instead of calling split('\t') on each line, it should open the two files and read the source and target line by line.
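Reading two line-aligned files could look roughly like the sketch below. The function name, its parameters and the file names in the usage note are made up for illustration; the notebook's own preprocessing of each sentence would still have to be applied to the returned pairs.

```python
import itertools


def read_parallel_pairs(source_path, target_path, num_examples=None):
    """Return [source, target] pairs from two line-aligned text files.

    Hypothetical helper: line N of source_path is assumed to be the
    translation of line N of target_path.
    """
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        # zip() stops at the end of the shorter file.
        pairs = ([s.strip(), t.strip()] for s, t in zip(src, tgt))
        # islice(..., None) keeps every pair.
        return list(itertools.islice(pairs, num_examples))
```

For example, read_parallel_pairs("train.en", "train.de", 30000) would return the first 30000 pairs, assuming those two files exist and are aligned.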

How to properly use QSkyboxEntity?

I looked everywhere, but I can't find any guides or explanations of how to use QSkyboxEntity.
I created the entity and gave it a transform (setting the translation and 3D scale). I also set the base name and extension.
When I try to run the program, it says:
"Qt3D.Renderer.OpenGL.Backend: Unable to find suitable Texture Unit for "skyboxTexture""
I checked several times and tried different png files but no luck.
My image (I know it's fake transparency, but it shouldn't change anything, right?)
And here's part of the code:
Qt3DCore::QEntity *resultEntity = new Qt3DCore::QEntity;
Qt3DExtras::QSkyboxEntity *skyboxEntity = new Qt3DExtras::QSkyboxEntity(resultEntity);
skyboxEntity->setBaseName("skybox"); //I tried using path as well
skyboxEntity->setExtension("png");
Qt3DCore::QTransform *skyTransform = new Qt3DCore::QTransform(skyboxEntity);
skyTransform->setTranslation(QVector3D(0.0f,0.0f,0.0f));
skyTransform->setScale3D(QVector3D(0.1f,0.1f,0.1f));
skyboxEntity->addComponent(skyTransform);
Looks like it's not finding the skybox texture. Did you use an absolute path when you say "I tried using path as well"? The path you set is relative to the build path, i.e. it's not where your C++ file lies.
Alternatively, you could use a resource file and then load the image using
"qrc:/[prefix]/[filename without extension]"
You can also check out the Qt3D manual SkyBox test here:
https://github.com/qt/qt3d/tree/dev/tests/manual/skybox
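It's important to name the files properly for the skybox to work, and to store them in a resource file. QSkyboxEntity looks for six images named after the base name followed by _posx, _negx, _posy, _negy, _posz or _negz and then the extension, and the extension is expected to include the leading dot (".png" rather than "png").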
I recommend .tga, but other formats should work as well.
You can read about it here:
https://doc.qt.io/qt-6/qml-qt3d-extras-skyboxentity.html
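As a quick check of the naming, a small script can list the file names the skybox is expected to look for and report which ones are missing from a folder. This is only a sketch in Python, not part of Qt: the helper names are invented, and the pattern (base name, face suffix, extension with leading dot) is taken from the documentation linked above.

```python
from pathlib import Path

# Face suffixes appended to the base name, per the linked documentation.
FACE_SUFFIXES = ("_posx", "_negx", "_posy", "_negy", "_posz", "_negz")


def expected_skybox_files(base_name, extension=".png"):
    """Return the six file names for a given base name and extension."""
    return [f"{base_name}{suffix}{extension}" for suffix in FACE_SUFFIXES]


def missing_skybox_files(directory, base_name, extension=".png"):
    """Return the expected file names that are not present in directory."""
    folder = Path(directory)
    return [name for name in expected_skybox_files(base_name, extension)
            if not (folder / name).is_file()]
```

Calling missing_skybox_files(".", "skybox") from the build directory should return an empty list if all six faces are in place.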
And here's an example of how it should look:

Is there a way to list the files in a directory using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames and then pull a file, if needed, in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but loading the files takes too long.
Considering that the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns tuples of (path, content), but I don't know whether the content is loaded lazily, so that you could read only the first element of each tuple.
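One way to list a directory from the driver without reading any file contents is to go through the Hadoop FileSystem API that Spark already has loaded. The sketch below is an assumption about how that could look, not the linked response: list_directory is a made-up name, and sc._jvm and sc._jsc are private PySpark attributes that may change between versions.

```python
def list_directory(sc, uri):
    """Return (path, is_directory) tuples for the entries directly under uri.

    sc is an existing SparkContext. Only metadata is requested, so the
    (large) files themselves are never opened.
    """
    hadoop_path = sc._jvm.org.apache.hadoop.fs.Path(uri)
    # Pick the file system that matches the URI scheme, using the
    # credentials already present in Spark's Hadoop configuration.
    fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())
    return [(str(status.getPath().toString()), bool(status.isDirectory()))
            for status in fs.listStatus(hadoop_path)]
```

In a notebook this would be called as list_directory(sc, "name:///mainfolder/date"), and again on any returned directory to go one level deeper.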

How do you differentiate between QVD source files and target files when reading a QVW's XML MetaData?

I am currently trying to find an alternative to the Governance Dashboard that Rob Wunderlich (Qlik founder) created, since I am currently encountering errors when using it.
How do you differentiate between a data source (QVD, aka source) that is used by a QVW and a data file (QVD, aka target) that is generated by that QVW?
QVW:
LOAD
Lower(Discriminator) AS DataFile.Filepath
FROM C:\Sample_Transform_file.qvw (xmlSimple, Table is [DocumentSummary/LineageInfo]);
Below is an example of what I found when parsing through the XML Metadata
(discriminator subtag within the lineageinfo tag) for one specific Transform QVW.
Sample Table Output
Are targets just identified by this?
STORE - [qvdName.qvd](qvd)
From what I have found, that appears to be the case, to a degree.
All of our QVW files that output a QVD use DIRECTORY statements rather than hard-coded file paths or paths held in variables. That is why all of the targets are displayed as "STORE - qvdname.qvd" instead of the full file path. In a sense, that is a flaw on QlikView's part regarding its Governance Dashboard (or at the very least, they don't seem to recommend variablizing those paths as a standard in order to avoid breaking the lineage).
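If the discriminator values are exported and post-processed outside QlikView, the rule described above can be written down as a small classifier. This is a sketch in Python that only encodes the "STORE - " prefix observation, which holds "to a degree" as noted; the function name and the sample source path in the comment are invented.

```python
STORE_PREFIX = "store - "


def classify_discriminator(discriminator):
    """Label a LineageInfo discriminator as 'target' or 'source'.

    A value starting with "STORE - " is treated as a QVD written by the
    QVW; anything else is treated as an input. The comparison ignores
    case because the load script above lowercases the value.
    """
    value = discriminator.strip().lower()
    return "target" if value.startswith(STORE_PREFIX) else "source"


# classify_discriminator("STORE - [qvdName.qvd](qvd)") gives "target";
# a hypothetical input path such as "c:\\data\\sales.qvd" gives "source".
```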

How to adapt tf.contrib.data.TextLineDataset for text from other sources?

For example, if my text data comes from a database, how can I get one line/doc (as a database record) using the same mechanism as TextLineDataset (subclassing Dataset so that the pipeline described here still works)?
Looking at the source code of TextLineDataset, make_dataset_resource() seems to be an important method to implement. But I can't find the actual code that yields a line from a file, as the docstring of TextLineDataset says: A Dataset comprising lines from one or more text files.

How to store data from Google Ngram API?

I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800-2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is the data you're able to scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / Windows console (the term is quoted so that the apostrophe is not interpreted by the shell):
python getngrams.py "it's" -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
econpy's package, mentioned in @HugoMailhot's answer, no longer works (as of 2021) and does not seem to be maintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call this from the command line (as with econpy's) to create a CSV file, e.g.
getngrams.py "it's" -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from Python to get (and plot) the data directly, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues / bug fix pull requests /etc welcome there)
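For the roughly 140 terms in the question, the Python entry point can be wrapped in a loop. The sketch below is an assumption about how that might be done: build_query and fetch_all are made-up helpers, the flags are the ones shown in the commands above, and the fetch function is passed in so that the helpers do not depend on the package being installed.

```python
def build_query(term, start_year=1800, end_year=2008, smoothing=3):
    """Build a query string in the same form as the examples above."""
    return (f"{term} -startYear={start_year} "
            f"-endYear={end_year} -smoothing={smoothing}")


def fetch_all(terms, fetch, **options):
    """Run fetch(query) once per term and collect the results by term.

    fetch is expected to be the ngrams function from getngrams; options
    are forwarded to build_query.
    """
    return {term: fetch(build_query(term, **options)) for term in terms}


# Possible usage, assuming getngrams.py is importable:
#   from getngrams import ngrams
#   frames = fetch_all(["it's", "they're", "she's"], ngrams)
#   frames["it's"].to_csv("its.csv")
```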