Loading SentencePiece tokenizer

When I use SentencePieceTrainer.train(), it produces a .model and a .vocab file. However, when I try to load it using AutoTokenizer.from_pretrained(), it expects a .json file. How would I get a .json file from the .model and .vocab files?
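A minimal sketch of one way to do this, assuming you have transformers and sentencepiece installed; the path spm.model is hypothetical, and T5Tokenizer is used here only as a convenient slow-tokenizer class that wraps a raw SentencePiece .model file:

from transformers import T5Tokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Wrap the raw SentencePiece model in a slow tokenizer class
slow_tokenizer = T5Tokenizer("spm.model")

# Convert it to a fast `tokenizers.Tokenizer` and serialize it as JSON
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
fast_tokenizer.save("tokenizer.json")

The resulting tokenizer.json is the serialized fast-tokenizer format that AutoTokenizer.from_pretrained() looks for.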

Using blockchain_parser to parse blk files stored on S3

I'm trying to use alecalve's bitcoin-parser package for Python.
The problem is that my node's data is stored in an S3 bucket.
Since the parser uses os.path.expanduser for the directory of .dat files, expecting a local filesystem, I can't just pass my S3 path. The example from the documentation is:
import os
from blockchain_parser.blockchain import Blockchain

blockchain = Blockchain(os.path.expanduser('~/.bitcoin/blocks'))
for block in blockchain.get_ordered_blocks(os.path.expanduser('~/.bitcoin/blocks/index'), end=1000):
    print("height=%d block=%s" % (block.height, block.hash))
And the error I'm getting is as follows:
'name' arg must be a byte string or a unicode string
Is there a way to use s3fs, or any other S3-to-filesystem method, so that the parser can treat the S3 paths as directories and work as intended?
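A minimal sketch of one approach, assuming the s3fs package and a hypothetical bucket layout: since the parser expects local .dat files (plus a local index directory), copying the data down to local disk first sidesteps the path issue entirely.

import s3fs
from blockchain_parser.blockchain import Blockchain

fs = s3fs.S3FileSystem()
local_dir = "/tmp/bitcoin-blocks"

# Copy the blocks directory from S3 to local disk
# ("my-bucket/bitcoin/blocks" is a hypothetical bucket/prefix)
fs.get("my-bucket/bitcoin/blocks", local_dir, recursive=True)

# Now the parser sees an ordinary local filesystem path
blockchain = Blockchain(local_dir)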

Pandas to_csv with ZIP compresses whole directory

df.to_csv("/path/to/destination.zip", compression="zip")
The above line will generate a file called destination.zip in the directory /path/to/.
Decompressing the ZIP file results in the directory structure path/to/destination.zip, where destination.zip is the CSV file.
Why is the path/to/ folder structure included in the compressed file? Is there any way to avoid this?
I was blown away by this. As a workaround, I'm currently writing the ZIP locally (destination.zip) and using os.rename to move it to the desired location. Is this a bug?
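As a possible workaround, recent pandas versions accept a dict for the compression argument that lets you name the archive member explicitly, which should avoid the nested directory structure (exact version support is an assumption; check your pandas release notes):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# archive_name controls the name of the CSV inside the ZIP
df.to_csv(
    "/path/to/destination.zip",
    compression={"method": "zip", "archive_name": "destination.csv"},
)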

file format of training/ testing dataset

I've recently been building a dataset, gathered from the internet, to use for training NN models. I now have a bunch of .jpg images in one folder and their labels in a .txt file. My question is: which file format should I convert this data to so that it is easy to load in Python frameworks? A second question: how do I build a metadata file about this dataset, and which format should it have?
In my opinion, the easiest way is to build a CSV file with two columns: directory and label. The directory value is the (relative) path to the image, and the label is, of course, the label. This requires merging the txt file and all the jpg files into one CSV file, but it is essentially easier to work with a CSV in pandas.
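A minimal sketch of that merge, assuming labels.txt holds one "filename label" pair per line and the images live under images/ (file and column names are hypothetical):

import pandas as pd

# Read the filename/label pairs, then prepend the image folder
# so the directory column holds paths relative to the dataset root
df = pd.read_csv("labels.txt", sep=" ", names=["directory", "label"])
df["directory"] = "images/" + df["directory"]

df.to_csv("dataset.csv", index=False)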

In Kettle, using Text File Input to read a CSV file from a tar.gz file didn't work. What might be wrong?

I have a CSV file that is tarred and gzipped, so I have test.tar.gz.
I would like to read the CSV file through Text File Input.
I tried the path tar:gz:file://C:/test/test.tar.gz!/test.tar! with a wildcard like ".*\.csv".
But it sometimes fails to read and throws this exception:
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I'm using Windows 8.1 and PDI 5.2.
What might be wrong?
For reading a compressed CSV file, the "Text File Input" step in Pentaho Kettle only supports the first file inside the compressed archive (for either Zip or GZip files). Check the Pentaho Wiki's section on compression.
Now for your issue, try removing the wildcard entry, since only the first file inside the zip/gzip archive will be read (as explained above).
I have put together a sample that reads both zip and gzip files. Check it here.
Hope it helps :)
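Concretely, assuming the Commons VFS syntax from the question is otherwise correct, that means pointing the File/Directory field of Text File Input at the nested archive itself and leaving the wildcard field empty:

tar:gz:file:///C:/test/test.tar.gz!/test.tar!

Kettle should then read the first file inside the tar, which in this case is the CSV.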

How do I read in a text file in python 3.3.3 and store it in a variable?

How do I read in a text file in Python 3.3.3 and store it in a variable? I'm struggling with Unicode, coming from Python 2.x.
Given this file:
utf-8: áèíöû
This works as you expect (if, and only if, utf-8 is your default encoding):
with open('/tmp/unicode.txt') as f:
    variable = f.read()
print(variable)
It is better to state your intentions explicitly, if you are unsure what the default is, by passing a keyword argument to open:
with open('/tmp/unicode.txt', encoding='utf-8') as f:
    variable = f.read()
The supported encodings are listed in the codecs module. (For Python 2, you need to use codecs.open to open the file rather than Python 2's built-in open, BTW.)
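For reference, a minimal sketch of that Python 2 equivalent using codecs.open, which yields unicode objects rather than byte strings:

import codecs

with codecs.open('/tmp/unicode.txt', encoding='utf-8') as f:
    variable = f.read()  # a unicode object in Python 2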