Chronic (Ruby NLP date/time parser) for Python?

Does anyone know of a library like Chronic, but for Python?
Thanks!

Have you tried parsedatetime?
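For example, a quick sketch using parsedatetime (assuming pip install parsedatetime; the sample phrase is just an illustration):
import parsedatetime

cal = parsedatetime.Calendar()
# parse() returns a time.struct_time plus a status flag (0 means nothing was parsed)
time_struct, parse_status = cal.parse("next Tuesday at 6pm")
print(time_struct, parse_status)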

You can try Stanford NLP's SUTime. Related Python bindings are here: https://github.com/FraBle/python-sutime
Make sure that all the Java dependencies are installed.
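A minimal sketch of the python-sutime bindings, assuming the package and its Java dependencies are set up as its README describes (older versions also needed a jars= argument pointing at the CoreNLP jars):
from sutime import SUTime

sutime = SUTime(mark_time_ranges=True, include_range=True)
# returns a list of dicts describing the detected temporal expressions
print(sutime.parse("I need a desk for tomorrow from 2pm to 3pm"))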

I was talking to Stephen Russett at Chronic, and came up with a Python example after he suggested tokenization.
Here is the Python example. You run its output through Chronic.
import re
import nltk

# tokenize and part-of-speech tag the sentence
sentence = 'Available June 9 -- August first week'
tokens = nltk.word_tokenize(sentence)
parts_of_speech = nltk.pos_tag(tokens)
print parts_of_speech

# white list of words to keep regardless of their tag
white_list = ['first']

# keep only proper nouns (NNP) and cardinal numbers (CD)
approved_tags = ['NNP', 'CD']

filtered = []
for word in parts_of_speech:
    if any(x in word[1] for x in approved_tags):
        filtered.append(word[0])
    elif any(x in word[0] for x in white_list):
        # if the word is in the white list, append it
        filtered.append(word[0])
print filtered

# normalize to alphanumeric only
normalized = re.sub(r'\s\W+', ' ', ' '.join(filtered))
print normalized


Matplotlib and Basemap: cannot import name 'dedent'

I'm trying to draw a network on a basemap overlay.
I have packages:
basemap=1.3.0=py36ha7665c8_0
matplotlib=3.3.1=0
matplotlib-base=3.3.1=py36hba9282a_0
networkx=2.5=py_0
When I run only the line
from mpl_toolkits.basemap import Basemap
I get
from matplotlib.cbook import dedent
ImportError: cannot import name 'dedent'
I've tried several different versions of the packages but cannot manage to find the right functioning combination.
Does anyone have ideas on combinations of matplotlib and basemap that work? Or another way to plot my network over a basemap?
The best solution that worked for me was to downgrade matplotlib:
pip install -U matplotlib==3.2
I had the same error when importing Basemap after moving from 3.2.7 to 3.3.6.
The error message comes from the fact that you import Basemap from mpl_toolkits.basemap, and the mpl_toolkits.basemap module needs to import the dedent function from the matplotlib.cbook module, but that function is no longer there.
So I saw two possible solutions: comment out the line that imports this function, or copy the function back in. I chose the second option.
I don't know why the dedent function is no longer present in the matplotlib.cbook module.
Here is the dedent function as I found it at the link given below; it also carried the header #deprecated("3.1", alternative="inspect.cleandoc"):
def dedent(s):
    """
    Remove excess indentation from docstring *s*.

    Discards any leading blank lines, then removes up to n whitespace
    characters from each line, where n is the number of leading
    whitespace characters in the first line. It differs from
    textwrap.dedent in its deletion of leading blank lines and its use
    of the first non-blank line to determine the indentation.

    It is also faster in most cases.
    """
    # This implementation has a somewhat obtuse use of regular
    # expressions.  However, this function accounted for almost 30% of
    # matplotlib startup time, so it is worthy of optimization at all
    # costs.
    if not s:      # includes case of s is None
        return ''

    match = _find_dedent_regex.match(s)
    if match is None:
        return s

    # This is the number of spaces to remove from the left-hand side.
    nshift = match.end(1) - match.start(1)
    if nshift == 0:
        return s

    # Get a regex that will remove *up to* nshift spaces from the
    # beginning of each line.  If it isn't in the cache, generate it.
    unindent = _dedent_regex.get(nshift, None)
    if unindent is None:
        unindent = re.compile("\n\r? {0,%d}" % nshift)
        _dedent_regex[nshift] = unindent

    result = unindent.sub("\n", s).strip()
    return result
I copied the dedent function from the matplotlib site (https://matplotlib.org/3.1.1/_modules/matplotlib/cbook.html#dedent) into the cbook module's __init__.py (matplotlib\cbook).
And now it is working for me.
Be careful to copy it at the correct place, because it relies on some variables pre-defined earlier in the module, such as _find_dedent_regex for the line:
match = _find_dedent_regex.match(s)
and _dedent_regex for the lines:
unindent = _dedent_regex.get(nshift, None)
if unindent is None:
    unindent = re.compile("\n\r? {0,%d}" % nshift)
    _dedent_regex[nshift] = unindent
This is where I put the function in the module.
I apologize for any spelling or grammar errors; I will do my best to correct those I have missed once they are reported to me.
I hope this was useful.
Matplotlib removed cbook.dedent in 3.3.0.
Upgrade basemap to fix this: pip install -U basemap
I tried to find a solution online that does not require altering any files. While in an Anaconda environment, use:
conda install matplotlib=3.2
Please note that Basemap is no longer maintained, so it should be avoided.

Export vectors from fastText to spaCy

I downloaded the 1.5 GB fasttext.cc vectors and used the example code from the spaCy examples (vectors_fast_text). I executed the following command in the terminal:
python config/vectors_fast_text.py vectors_loc data/vectors/wiki.pt.vec
After a few minutes with the processor at 100%, I received the following text:
class colspan 0.32231358
What happens from here? How can I export these vectors elsewhere, for example to my AWS S3 training templates?
I modified the example script to load the existing data of my language, read the word2vec file, and at the end write all the content to a folder (this folder needs to exist).
Here is vectors_fast_text.py (replace the placeholders before running):
[LANGUAGE] = for example "pt"
[FILE_WORD2VEC] = "./data/word2vec.txt"
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language

#plac.annotations()
def main():
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    nlp.to_disk("./models/new_nlp/")

if __name__ == '__main__':
    plac.call(main)
Type in the terminal:
python vectors_fast_text.py
It will take about 10 minutes to finish, depending on the size of the word2vec file. The script prints each word so that you can follow the progress.
After that, you must type in the terminal:
python -m spacy package ./models/new_nlp/ ./my_models/
python setup.py sdist
And then you will have a packaged archive (a .tar.gz file), which you can install with pip:
pip install /path/to/pt_example_model-1.0.0.tar.gz
A detailed tutorial can be found on the spaCy website:
https://spacy.io/usage/training
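Once installed, you can load the packaged model and check that the vectors are there. A minimal sketch (the package name pt_example_model is taken from the tar.gz above and depends on your own meta.json):
import spacy

# load the model installed from the .tar.gz package
nlp = spacy.load("pt_example_model")

# look up the fastText vector for a word (assuming it appeared in word2vec.txt)
token = nlp("exemplo")[0]
print(token.has_vector)    # True if the word was found in the vocab
print(token.vector[:5])    # first few dimensions of its vector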

Accessing carray of pointcloud using pytables

I am having a hard time understanding how to access the data in a carray.
http://carray.pytables.org/docs/manual/index.html
I have a carray that I can view in a group structure using ViTables, but how to open it and retrieve the data is beyond me.
The data are a point cloud, three levels down, that I want to make a scatter plot of and extract as a .obj file.
I then have to loop through (many) clouds and do the same thing.
Can anyone give me a simple example of how to do this?
This was my attempt:
import carray as ca
fileName = 'hdf5_example_db.h5'
a = ca.open(rootdir=fileName)
print a
I managed to solve my issue. I wasn't treating the carray any differently from the rest of the hierarchy: I needed to first load the entire database, then refer to the data I needed. I ended up not having to use carray at all and just stuck with h5py:
from __future__ import print_function
import h5py
import numpy as np

# read the hdf5 format file
fileName = 'hdf5_example_db.h5'
f = h5py.File(fileName, 'r')

# full path of the carray-type data (which is in ply format)
dataspace = '/objects/object_000/object_model'

# view the dataset
print(f[dataspace])

# print to ply file
with open('object_000.ply', 'w') as fo:
    for line in f[dataspace]:
        fo.write(line + '\n')
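For the scatter-plot part of the question, here is a minimal sketch under the assumption that the dataset holds an ASCII ply file line by line (a header ending in end_header with an element vertex count, followed by "x y z ..." vertex lines); the file and dataset paths are the same as above:
import h5py
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: needed for the 3d projection on older matplotlib

fileName = 'hdf5_example_db.h5'
dataspace = '/objects/object_000/object_model'

with h5py.File(fileName, 'r') as f:
    lines = [l.decode() if isinstance(l, bytes) else l for l in f[dataspace]]

# number of vertices declared in the ply header, and where the body starts
n_vertex = int(next(l.split()[-1] for l in lines if l.startswith('element vertex')))
start = next(i for i, l in enumerate(lines) if l.strip() == 'end_header') + 1

# take the leading x y z of each vertex line
pts = np.array([[float(v) for v in l.split()[:3]] for l in lines[start:start + n_vertex]])

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=1)
plt.show()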

Complementary Filter Code Not functioning

I've been scratching my head too long.
The data are coming from a 3D accelerometer and a 3D gyro. I am using a complementary filter to control drift.
I have it working in Excel, but I can't seem to get this Python code to do the same thing:
r1_angle_cfx = np.zeros(len(r1_angle_ax))
r1_angle_cfx[0] = r1_angle_ax[0]

for i in xrange(len(r1_angle_ax) - 1):
    j = i + 1
    # complementary filter
    r1_angle_cfx[j] = 0.98 * (r1_angle_cfx[i] + r1_alpha_x[j] * fs) + (0.02 * r1_angle_ax[j])
In Excel (correct) I get:
In Python (incorrect) I get:
What is going wrong? And is there a better way to do this in Python?
Thanks,
Scott
EDIT: Link to data files -
sample data
1. The CSV file contains the accelerometer and gyro data that are entered into the filter formula, as well as the values calculated in Excel.
2. The Excel file contains all raw data (with processing steps not mentioned above, but I have triple-checked that they are equivalent up to the point of being entered into the filter formula).
EDIT 2: Update - it turns out my code works; it was sloppy debugging. fs should be fs = 0.01. In my code I had fs = 1/100, which ends up being 0 in the script.
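For reference, the zero comes from plain Python 2 integer division; a quick illustration:
# the original assignment, evaluated under Python 2
fs = 1 / 100      # == 0, because both operands are ints (integer division)
fs = 1 / 100.0    # == 0.01, float division
fs = 0.01         # clearest: write the float directly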
Your Python code looks pretty reasonable. Without example data, I can't do much more than say that.
But I can guess. I looked up "complementary filters" and found a link explaining them:
https://sites.google.com/site/myimuestimationexperience/filters/complementary-filter
This link gives an example equation that is very similar to yours:
angle = (1-alpha)*(angle + gyro * dt) + (alpha)*(acc)
You have fs where this has dt, and dt is computed as 1/sampling_frequency. If fs is the sampling frequency, maybe you should try inverting it?
EDIT: Okay, now that you posted the data, I played around with this. Here is my program that gets a correct result.
Your code looks basically correct, so I think you must have made a mistake in your code that collected the values. I'm not quite sure because your variable names confuse me.
I used a namedtuple and for the names, I used the column headers from the CSV file (with spaces and periods removed to make a valid Python identifier).
import collections as coll
import csv
import matplotlib.pyplot as plt
import numpy as np
import sys

fs = 100.0
dt = 1.0 / fs
alpha = 0.02

Sample = coll.namedtuple("Sample",
    "accZ accY accX rotZ rotY rotX r acc_angZ acc_angY acc_angX cfZ cfY cfX")

def samples_from_file(fname):
    with open(fname) as f:
        next(f)  # discard header row
        csv_reader = csv.reader(f, dialect='excel')
        for i, row in enumerate(csv_reader, 1):
            try:
                values = [float(x) for x in row]
                yield Sample(*values)
            except Exception:
                lst = list(row)
                print("Bad line %d: len %d '%s'" % (i, len(lst), str(lst)))

samples = list(samples_from_file("data.csv"))

cfx = np.zeros(len(samples))

# Excel formula: =R12
cfx[0] = samples[0].acc_angX

# Excel formula: =0.98*(U12+N13*0.01)+0.02*R13
# Excel: U is cfX, N is rotX, R is acc_angX
for i, s in enumerate(samples[1:], 1):
    cfx[i] = (1.0 - alpha) * (cfx[i-1] + s.rotX*dt) + (alpha * s.acc_angX)

check_line = [s.cfX - cf for s, cf in zip(samples, cfx)]

plt.figure(1)
plt.plot(check_line)
plt.plot(cfx)
plt.show()
check_line is the difference between the saved cfX value from the CSV file, and the new computed cfx value. As you can see in the plot, this is a straight line at 0, so my calculation is agreeing quite well with yours.
So I guess the mapping of names is:
your_name my_name
________________________
r1_angle_cfx cfx
r1_alpha_x rotX
r1_angle_ax acc_angX

inputting and aligning protein sequence

I have a script for finding mutated positions in a protein sequence. The following script will do this.
import pandas as pd #data analysis python module
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN' #protein sequences
df = pd.DataFrame(map(list,data.split(',')))
I = df.columns[(df.ix[0] != df).any()]
J = [pd.get_dummies(df[i], prefix=df[i].name+1, prefix_sep='') for i in I]
print df[[]].join(J)
Here I hard-coded the data, i.e. the input protein sequences. Normally, in an application, the user has to provide the input sequences, i.e. soft coding.
Also, alignment is not done here. I read the Biopython tutorial and got the following script, but I don't know how to combine it with the one above.
from Bio import AlignIO
alignment = AlignIO.read("c:\python27\proj\data1.fasta", "fasta")
print alignment
How can I do this?
What I have tried:
>>> import sys
>>> import pandas as pd
>>> from Bio import AlignIO
>>> data=sys.stdin.read()
MTAQDDSYSDGKGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYLGAVFQLN
MTSQEDSYSDGKGNYNTIMPGAVFQLN
MTAQDDSYSDGRGDYNTIMPGAVFQLN
MKAQDDSYSDGRGNYNTIYLGAVFQLQ
MKSQEDSYSDGRGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYPGAVFQLN
MTAQEDSYSDGRGEYNTIYLGAVFQLQ
MTAQDDSYSDGKGDYNTIMLGAVFQLN
MTAQDDSYSDGRGEYNTIYLGAVFQLN
^Z
>>> df=pd.DataFrame(map(list,data.split(',')))
>>> I=df.columns[(df.ix[0]!=df).any()]
>>> J=[pd.get_dummies(df[i],prefix=df[i].name+1,prefix_sep='')for i in I]
>>> print df[[]].join(J)
But it is giving an empty DataFrame as output.
I also tried the following, but I don't know how to load these sequences into my script:
while 1:
    var = raw_input("Enter your sequence here:")
    print "you entered ", var
Please help me.
When you read in data via:
sys.stdin.read()
Sequences are separated by '\n' rather than ',' (printing data would confirm whether this is the case; it may be system dependent), so you should split on that instead:
df = pd.DataFrame(map(list,data.split('\n')))
A good way to check this kind of thing is to go through it line by line, where you would see that df is a one-row DataFrame (which then propagates to make I empty).
Aside: what a well written piece of code you are using! :)
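To combine the Biopython alignment step with the pandas code, here is a minimal sketch (assuming Biopython and pandas are installed and data1.fasta contains the aligned sequences; df.ix is replaced by df.iloc for newer pandas):
from Bio import AlignIO
import pandas as pd

# read the aligned sequences and build one row per sequence, one column per position
alignment = AlignIO.read("data1.fasta", "fasta")
df = pd.DataFrame([list(str(rec.seq)) for rec in alignment])

# columns where any sequence differs from the first one
I = df.columns[(df.iloc[0] != df).any()]
J = [pd.get_dummies(df[i], prefix=str(i + 1), prefix_sep='') for i in I]
print(df[[]].join(J))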