Why am I getting error TypeError: 'numpy.ndarray' object is not callable? - numpy

I am trying to use the Maclaurin series to estimate the sine of a number. This is my function:
from math import sin, factorial
import numpy as np

fac = np.vectorize(factorial)

def sin_x(x, terms=10):
    """Compute an approximation of sin(x) using a given
    number of terms of the Maclaurin series."""
    n = np.arange(terms)
    return np.sum[((-1)**n)(x**(2*n+1)) / fac(2*n+1)]

if __name__ == "__main__":
    print("Valeur actuelle:", sin(5))  # using the sine function from the math library
    print("N (terms)\tMaclaurin\tErreur")
    for n in range(1, 6):
        maclaurin = sin_x(5, terms=n)
        print(f"{n}\t\t{maclaurin:.03f}\t\t{sin(10) - maclaurin:.03f}")
and this is the error I get
PS C:\Users\tapef\Desktop\NUMPY-TUT> python maclaurin_sin.py
Valeur actuelle: -0.9589242746631385
N (terms) Maclaurin Erreur
Traceback (most recent call last):
File "C:\Users\tapef\Desktop\NUMPY-TUT\maclaurin_sin.py", line 20, in <module>
maclaurin = sin_x(5, terms=n)
File "C:\Users\tapef\Desktop\NUMPY-TUT\maclaurin_sin.py", line 12, in sin_x
return np.sum[((-1)**n)(x**(2*n+1)) / fac(2*n+1)]
TypeError: 'numpy.ndarray' object is not callable
How do I get rid of this error? Thanks.
I've tried using square brackets instead of parentheses.

You forgot the multiplication between ((-1)**n) and (x**(2*n+1)). Furthermore, np.sum() needs to be called with parentheses rather than indexed with square brackets, so your return statement becomes:
return np.sum(((-1)**n)*(x**(2*n+1)) / fac(2*n+1))
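For reference, a minimal sketch of the corrected function (only that return line changes; a single print stands in for the original test loop):

from math import sin, factorial
import numpy as np

fac = np.vectorize(factorial)

def sin_x(x, terms=10):
    """Approximate sin(x) with the first `terms` terms of the Maclaurin series."""
    n = np.arange(terms)
    # explicit * between the two factors, and np.sum(...) called with parentheses
    return np.sum(((-1)**n) * (x**(2*n + 1)) / fac(2*n + 1))

print(sin_x(5, terms=10))  # converges towards math.sin(5) as `terms` grows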

Related

QGis: Clipping a single raster by a shapefile's multiple features

I am trying to write raster-clipping code in QGis. I have a shapefile with 6 different polygons, and I want the code to loop through the shapefile and cut the raster into 6 different images, one for each polygon. In short, I am looking for code that batch-clips a raster against a shapefile with 6 features.
I've tried looping through the shapefile and passing it as a vector layer for the clipping, but it gives me the error:
"in _parse_args a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not list"
How do I solve this so the raster gets clipped? Any ideas or pointers would be welcome.
from osgeo import gdal
import os
from pathlib import Path

# Variable for the path to the directory
carpeta = r'C:\Users\User\Desktop\...\Script\Shapes'
# Empty list to hold the files
listaArchivos = []
# List with all the files in the directory:
lstDir = os.walk(carpeta)  # os.walk() lists directories and files
# Build a list of the .shp files in the directory and add them to the list.
for root, dirs, files in lstDir:
    for fichero in files:
        (nombreFichero, extension) = os.path.splitext(fichero)
        if(extension == ".shp"):
            listaArchivos.append(nombreFichero + extension)
            # print(nombreFichero + extension)
print(listaArchivos)
print('LISTADO FINALIZADO')
print("longitud de la lista = ", len(listaArchivos))

basepath = Path().home().joinpath(r'C:\Users\User\Desktop\...\Script')
rasterPath = basepath.joinpath("Ortos")
vectorPath = basepath.joinpath(listaArchivos)
# Save the clip into a folder
dir_recorte = rasterPath.joinpath('Carpeta recortes')
if not os.path.exists(dir_recorte):
    os.mkdir(dir_recorte)
# Clip the image
for lyr in rasterPath.glob("*.tif"):
    output_path = dir_recorte.joinpath("ID_uno_{}".format(lyr.name))
    parameters = {
        'ALPHA_BAND': False,
        'CROP_TO_CUTLINE': True,
        'DATA_TYPE': 0,
        'INPUT': lyr.as_posix(),
        'KEEP_RESOLUTION': False,
        'MASK': vectorPath.as_posix(),
        'MULTITHREADING': False,
        'OPTIONS': '',
        'OUTPUT': output_path.as_posix(),
        'SET_RESOLUTION': False,
        'SOURCE_CRS': None,
        'TARGET_CRS': None,
        'X_RESOLUTION': None,
        'Y_RESOLUTION': None
    }
    resultado = processing.run("gdal:cliprasterbymasklayer", parameters)
    QgsProject.instance().addMapLayer(QgsRasterLayer(resultado['OUTPUT'], output_path.stem, 'gdal'))
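The list in the traceback comes from vectorPath = basepath.joinpath(listaArchivos): joinpath wants a single str/Path, not a list. Below is a sketch of how the loop could pass one mask per run instead, reusing the variables defined above (not tested inside your QGIS project); if the six polygons all live in a single shapefile, each polygon would first have to be saved or selected as its own mask layer, since this algorithm clips by the whole mask layer:

# Sketch: one clip per (raster, shapefile) pair, reusing carpeta, listaArchivos,
# rasterPath and dir_recorte from above. Assumes each .shp is a valid mask layer.
for lyr in rasterPath.glob("*.tif"):
    for shp_name in listaArchivos:
        mask_path = Path(carpeta).joinpath(shp_name)   # one Path, not a list
        output_path = dir_recorte.joinpath("{}_{}".format(mask_path.stem, lyr.name))
        parameters = {
            'INPUT': lyr.as_posix(),
            'MASK': mask_path.as_posix(),              # a single shapefile per run
            'CROP_TO_CUTLINE': True,
            'KEEP_RESOLUTION': False,
            'DATA_TYPE': 0,
            'OPTIONS': '',
            'OUTPUT': output_path.as_posix(),
        }
        resultado = processing.run("gdal:cliprasterbymasklayer", parameters)
        QgsProject.instance().addMapLayer(QgsRasterLayer(resultado['OUTPUT'], output_path.stem, 'gdal'))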

Why does parallelization of pandas DataFrames (which I manipulate using apply()) take forever with multiprocessing, and how can I fix it?

I have a very large pandas DataFrame (915,000 rows and two columns), where each row is a newspaper article I have scraped (the whole database is around 2.5 GB). I have to transform it, and I read that parallelization could offer great gains. I first defined a function, add_features(df), which takes a DataFrame, applies several transformations, and returns the transformed DataFrame. It relies on apply(), and all the other helper functions it calls are defined inside add_features itself.
To perform the parallelization, I used the following code (which can be found at https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1 and on several other Stack Overflow pages):
from multiprocessing import Pool  # for parallelization
# importing other conventional packages

def add_features(df):
    # several functions being created
    # apply() being used on several columns of the dataframe, using functions previously created
    # return (df)

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

transformed_df = parallelize_dataframe(toy_df, add_features)
where toy_df is a very small subset (100 rows) of my original large database.
When I apply add_features() directly to toy_df, I get exactly the results I want. However, when I run the code above, it never stops running. I read a few other questions whose answers recommended putting add_features() in a .py file and importing it. I did that, checked that the imported function still worked on toy_df (it did), but when I ran the code above again (minus the definition of add_features, since I imported it with from add_features_file import *, add_features_file.py being the file in which I defined the function), I hit the same problem.
What I want is simply to be able to run this function over my DataFrame through the parallelization process, i.e. to stop it from running forever.
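For what it's worth, a detail that is easy to miss with this pattern is that the pool has to be created behind an if __name__ == "__main__": guard; with the spawn start method (the default on Windows, and effectively what notebooks run into as well) every worker re-imports the main module, and without the guard the script can hang or keep respawning. A minimal, self-contained sketch of that layout, with illustrative file and data names:

# main_script.py -- minimal sketch of the layout; data loading is illustrative
from multiprocessing import Pool
import numpy as np
import pandas as pd
from add_features_file import add_features   # add_features defined at module level in its own .py

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    with Pool(n_cores) as pool:               # pool is closed automatically on exit
        df = pd.concat(pool.map(func, df_split))
    return df

if __name__ == "__main__":                    # required with spawn: workers re-import this module
    toy_df = pd.read_pickle("toy_df.pkl")     # illustrative; load your own subset here
    transformed_df = parallelize_dataframe(toy_df, add_features)
    print(transformed_df.shape)

Everything that add_features needs (including its nested helpers) lives in add_features_file.py, so the worker processes can import it by name.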
EDIT: I was asked to post the code behind the function add_features(x), but my main concern is the parallelization process itself, since add_features works (I have tested it on a subset of my DataFrame). The DataFrame contains newspaper articles, the dates on which they were released, and website links.
import pandas as pd

df = pd.DataFrame({'day_dates': ['09/04/2020', '09/04/2020', '09/04/2020'],
                   'day_everything': ["Petrobrás diz que 53 trabalhadores de plataforma entraram em greve. Estado é de tensão.", "CNI e centrais sindicais podem acompanhar ação judicial. Ainda há dúvidas sobre prazo.", "Bancos públicos e privados negociam empréstimo. Prazo médio está sendo negociado."],
                   'day_links': [link_1, link_2, link_3]})
The function of interest is quite large. For the most part I define individual helper functions inside it, and at the end I use those helpers in pandas.apply() calls to build the final DataFrame of interest. I did it this way because, when I defined the helpers at module level and then wrote a separate function consisting only of the apply() calls, there was a problem with local/implicitly global variables and the parallelization would not even run (a "name is not defined" error was raised whenever the helpers were called by add_features without being defined inside it). So I moved all the intermediate functions inside add_features; the final chunk of the function is what actually gets applied over the DataFrame and performs the textual transformations.
def add_features(df):
def at_least_one_lower_case(sentence):
if sentence.isupper():
return ''
else:
return sentence
def no_dirty_expressions(article):
#palavras contendo partes abaixo sao completamente removidas
palavras_a_tirar_caso_contenham=['-----------------------------------------------------------------','---------------------------------------------------------------','--------------------------------------------------------','-------------------------------------------------------','--------------------------------------------','------------------------------------------','------------------------------------','segunda-feira','terça-feira','quarta-feira','quinta-feira','sexta-feira','2ª-feira','3ª-feira','4ª-feira','5ª-feira','6ª-feira','2.ª','3.ª','4.ª','5.ª','6.ªs','6.ª','sábado','domingo','janeiro','fevereiro','março','abril','maio','junho','julho','agosto','setembro','outubro','novembro','dezembro','jan','fev','mar','abr','mai','jun','jul','ago','set','out','nov','dez','http','www.','quadrissemana','**','0-ações','2ª-feiraA','semestreAtuais','#','e-mail',]
#https://economia.estadao.com.br/noticias/negocios,iea-preco-agricola-sobe-1-87-na-terceira-semana-de-novembro,20041122p1280 https://economia.estadao.com.br/noticias/geral,taxas-cobradas-por-fundos-diminuem-ganhos,20021003p13232 https://economia.estadao.com.br/noticias/geral,saldo-em-conta-corrente-fica-em-us-725-milhoes-em-fevereiro,20060321p34175
palavras_a_tirar_caso_contenham_higher=[word.capitalize() for word in palavras_a_tirar_caso_contenham]
palavras_a_tirar_caso_contenham=list(set(palavras_a_tirar_caso_contenham+palavras_a_tirar_caso_contenham_higher))
pattern= ['/'] #colocar \u00C0-\u00FF depois de letras pra accent insensitivity #caso eu queira tirar algo especifico da palavra, mas nao elimina-la
combined_pattern = r'|'.join(pattern)
#palavras abaixo sao tiradas pontualmente de frases
palavras_soltas_a_tirar=['--leia mais em','ler mais na pág','leia também','Leia Também','leia o especial','Leia o especial','Leia também','Leia Também', 'leia o','Leia o','(cotação)','(variação)','CORREÇÃO (OFICIAL)','CORREÇÃO(OFICIAL)', 'Este texto foi atualizado às']
palavras_soltas_a_tirar_higher=[word.capitalize() for word in palavras_soltas_a_tirar]
palavras_soltas_a_tirar=list(set(palavras_soltas_a_tirar+palavras_soltas_a_tirar_higher))
#Ultima frase que contëm expressoes abaixo sera eliminada
bad_last_sentences=['(com','(Por','(por','Por','por','colaborou', 'Colaborou', 'COLABORARAM', 'colaboraram','CONFIRA','Confira','confira','Não deixe de ver','veja abaixo as cotações de fechamento','*', '(Edição de','Este texto foi atualizado às','As informações são da Dow Jones','É proibido todo tipo de reprodução sem autorização por escrito do The New York Times']
bad_last_sentences_higher=[word.capitalize() for word in bad_last_sentences]
bad_last_sentences=list(set(bad_last_sentences+bad_last_sentences_higher))
#Frases que contëm expressoes abaixo serao eliminadas
bad_expressions=['TRADUÇÃO DE','Matéria alterada','matéria foi atualizada','Reportagem de','reportagem de','as informações são da Dow Jones','as informações são do jornal','BBC Brasil - Todos os direitos reservados','é proibido todo tipo de reprodução sem autorização por escrito da BBC','as informações são da Dow Jones','DOW JONES NEWSWIRES','CONFIRA','Confira','confira','não deixe de ver','veja abaixo as cotações de fechamento','telefones das lojas','veja abaixo o','Veja, abaixo, tabela','semestreAtuais','industrialAnoVagas','VarejoGrupoFevereiro','Ibovespa19951996199719981999200020012002','5319,5Ibovespa-1']
bad_expressions_higher=[word.capitalize() for word in bad_expressions]
bad_expressions=list(set(bad_expressions+bad_expressions_higher))
article=article.strip() #garantindo que nao ha excesso de espaco nas bordas
if article[-1] != '.': #garantindo que ultimo caractere eh ponto final
article=article+'.'
#tiro palavras sujas e que podem conter ponto ou virgula ou dois pontos ou exclamacao ou interrogacao. Faco isso agora pois se nao, corro o risco de criar mais palavras sujas na parte de correct_error()
pre_token=article.split()
replaced_token=[re.sub(combined_pattern,'', word) for word in pre_token if not any (symbol in word for symbol in palavras_a_tirar_caso_contenham)]
article_sem_palavras_que_contenham_sujeira=' '.join(replaced_token)
#tiro palavras soltas de dentro de sentences
def palavras_soltas_a_tirar_fun(sentence):
for sujeira in palavras_soltas_a_tirar:
if sujeira in sentence:
sentence=sentence.replace(sujeira,'')
return (sentence)
sentences=article_sem_palavras_que_contenham_sujeira.split('.')
replaced_sentences=[palavras_soltas_a_tirar_fun(sentence) for sentence in sentences]
article_sem_palavras_soltas='.'.join(replaced_sentences)
#tiro palavras sujas identificaveis via regex pertencendo a estas palavras, ao inves de uma substring pertencendo a elas
def fun_remove_exact_regex(word):
exact_regex= [r'[0-9]{4}[-]+[0-9]{4}']
for pat in exact_regex:
if not re.search(pat,word):
return(word)
else:
return ''
pre_token_2=article_sem_palavras_soltas.split()
replaced_token_2=[fun_remove_exact_regex(word) for word in pre_token_2 ]
article_sem_palavras_que_contenham_exact_regex=' '.join(replaced_token_2)
article_sem_palavras_que_contenham_exact_regex=article_sem_palavras_que_contenham_exact_regex.strip() #garantindo que nao ha excesso de espaco nas bordas
if article_sem_palavras_que_contenham_exact_regex[-1] != '.': #garantindo que ultimo caractere eh ponto final
article_sem_palavras_que_contenham_exact_regex=article_sem_palavras_que_contenham_exact_regex+'.'
#tiro a última frase caso esta seja suja
sentences=article_sem_palavras_que_contenham_exact_regex.split('.')
last_sentence=article_sem_palavras_que_contenham_exact_regex.split('.')[-2]
no_dirty_last_sentence=sentences[:-2] if any (bad_last_sentence in last_sentence for bad_last_sentence in bad_last_sentences) else sentences
no_dirty_last_sentence='.'.join(no_dirty_last_sentence)
#tiro alguma frase suja central restante
sentences=no_dirty_last_sentence.split('.')
clean_sentences=[sentence for sentence in sentences if not any(expression in sentence for expression in bad_expressions)]
not_upper_clean_sentences=[at_least_one_lower_case(sentence) for sentence in clean_sentences]
clean_text='.'.join(not_upper_clean_sentences)
return clean_text
def fun_split(text):
return text.split()
def at_least_one_letter(word):
if any(character.isalpha() for character in word):
return word
else:
return ''
def fun_no_tokens_with_no_letters_and_bad_symbols(list_of_tokens):
bad_symbols=['#','$'] #o ultimo elemento engloba tokens formados por letras maisculas dentro de parentesis, como (PIB). Eu perderia (PIB) no caminho, mas tudo bem desde que esteja depois de produto interno bruto. PIB solto eu não perco. Faco isso pq mts vezes tem tags de acoes da bovespa escritos dessa forma. Eu antes substituia isso por espacos vazios, dessa forma acabava tendo uma porrada de palavras vazias. agora elimino tokens assim por completo (e dificilmente uma palavra normal contera nela parenthesis com letras maiusculas, a nao ser q esquecao o espaco)
pattern=[''] #colocar \u00C0-\u00FF depois das letras pra accent insensititity#r'\([A-Z]+?\)' decidi nao eliminar palavras como (PIB)
combined_pattern = r'|'.join(pattern)
clean=[re.sub(combined_pattern,'',at_least_one_letter(word)) for word in list_of_tokens if not any(symbol in word for symbol in bad_symbols)]
return clean
def fun_clean_extreme(list_of_tokens):
def clean_extreme(word):
if any(character.isalnum() for character in word):
while not (word[0].isalnum()):
word=word[1:]
while not (word[-1].isalnum()):
word=word[:-1]
return (word)
else: #Pois se nao levasse em conta tokens nao alfanumericos, daria problema
return ''
clean=[clean_extreme(word) for word in list_of_tokens]
return clean
def flatten_list(_2d_list):
flat_list = []
# Iterate through the outer list
for element in _2d_list:
if type(element) is list:
# If the element is of type list, iterate through the sublist
for item in element:
flat_list.append(item)
else:
flat_list.append(element)
return flat_list
def no_initial_repetition_of_vowel(word): #perderei siglas neste passo
if len(word)>=2 and word not in ['Aaabar','Aaa.br','Aa3.br']:
if (word[0]=='A' and word[1]=='A') or (word[0]=='A' and word[1]=='a') or (word[0]=='a' and word[1]=='A') or (word[0]=='a' and word[1]=='a'):
index=1
splitat= index
l, r = word[:splitat], word[splitat:]
return [l,r]
else:
return(word)
else:
return(word)
def correct_dot(word):
if '.' in word:
index=word.find('.') #vai estar no meio
if word[index+1].isupper():
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
else:
return(word)
def correct_comma(word):
if ',' in word:
index=word.find(',') #vai estar no meio
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
def correct_comma_dot(word):
if ';' in word:
index=word.find(';') #vai estar no meio
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
def correct_exclamation(word):
if '!' in word:
index=word.find('!') #vai estar no meio
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
def correct_question_mark(word):
if '?' in word:
index=word.find('?') #vai estar no meio
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
def correct_two_dots(word):
if ':' in word:
index=word.find(':') #vai estar no meio
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [l,r]
else:
return(word)
def no_two_hifens(word): #nao quero '2000--tesouro' #assumo que mais hifens alem de dois entre palavras pode ser sujeira de tabela
pattern=r'^.*[0-9]*[ A-Za-z\u00C0-\u00FF]*.*[-]{2}.*[0-9]*[ A-Za-z\u00C0-\u00FF]*.*[ A-Za-z\u00C0-\u00FF]*$' #pra accent insensitivy
if re.match(pattern,word):
index=word.find('--')
splitat= index
l, r = word[:splitat], word[splitat+2:]
return [l,r]
else:
return(word)
def single_hifen_exception(word):
special_list=['19,5%-Unica','13,7%-Anfavea','6,22%-IBGE']
if word in special_list:
index=word.find('-')
splitat= index
l, r = word[:splitat], word[splitat+1:]
return [r]
else:
return(word)
def correct_error(article):
article=flatten_list([no_initial_repetition_of_vowel(word) for word in article])
article=flatten_list([correct_dot(word) for word in article])
article=flatten_list([correct_comma(word) for word in article])
article=flatten_list([correct_two_dots(word) for word in article])
article=flatten_list([correct_exclamation(word) for word in article])
article=flatten_list([correct_question_mark(word) for word in article])
article=flatten_list([no_initial_repetition_of_vowel(word) for word in article])
article=flatten_list([no_two_hifens(word) for word in article])
article=flatten_list([single_hifen_exception(word) for word in article])
return article
def fun_no_symbols_except_hyphens_replaced_without_space(list_of_tokens):
patterns=[r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ-/]+'] #\u00C0-\u00FF
combined_pattern = r'|'.join(patterns)
clean=[re.sub(combined_pattern, '', word) for word in list_of_tokens]
return clean
def fun_no_spaces_in_middle(list_of_tokens):
clean=flatten_list([word.split() for word in list_of_tokens])
return clean
def fun_not_the_same_character(list_of_tokens):
def not_the_same_character(word):
if len(word)>=2:
if word == len(word) * word[0]:
return ''
else:
return word
else:
return ''
clean=[not_the_same_character(word) for word in list_of_tokens]
return clean
#ELIMINATING CERTAIN VOCALUBARY
def fun_sep_letters_and_numb_hifen(list_of_tokens):
def separate_right_letters_from_left_numbers_when_hifen(word):
pat_numbers_before_letters_hifen=r'^[0-9]+[-]{1}[0-9]*[ A-Za-z\u00C0-\u00FF]+$' #damos espaco para hifen. por exemplo, 145paises-membros viraria paises-membros. A regra so vale se o numero NAO eh seguido de hifen. Faco isso pois assumo que neste caso pode ter havido uma falta de espaco do jornal. Posso inclusive botar isso mais no inicio
if re.match(pat_numbers_before_letters_hifen,word):
index=word.find('-')
splitat= index
l, r = word[:splitat], word[splitat+1:]
return r
else:
return word
def separate_left_letters_from_right_numbers_when_hifen(word):
pat_letters_before_numbers_hifen=r'^[ A-Za-z\u00C0-\u00FF]+[-]{1}[0-9]+$' #o hifen nao pode estar colado na letra, pois caso contrario perderiamos covid-19 so para covid, o que nao eh de todo pior.
if re.match(pat_letters_before_numbers_hifen,word) and word not in ['covid-19','Covid-19']:
index=word.find('-')
splitat= index
l, r = word[:splitat], word[splitat:]
return l
else:
return word
clean=[separate_right_letters_from_left_numbers_when_hifen(separate_left_letters_from_right_numbers_when_hifen(word)) for word in list_of_tokens]
return clean
def sep_letter_and_numb(list_of_tokens):
def separate_right_letters_from_left_numbers(word):
pat_numbers_before_letters=r'^[0-9]+[ A-Za-z\u00C0-\u00FF]*[-]*[ A-Za-z\u00C0-\u00FF]+$' #damos espaco para hifen. por exemplo, 145paises-membros viraria paises-membros. A regra so vale se o numero NAO eh seguido de hifen. Faco isso pois assumo que neste caso pode ter havido uma falta de espaco do jornal. Posso inclusive botar isso mais no inicio
if re.match(pat_numbers_before_letters,word):
index=word.find(next(filter(str.isalpha,word)))
splitat= index
l, r = word[:splitat], word[splitat:]
return r
else:
return word
def separate_left_letters_from_right_numbers(word):
pat_letters_before_numbers=r'^[ A-Za-z\u00C0-\u00FF]+[0-9]*[0-9]+$' #o hifen nao pode estar colado na letra, pois caso contrario perderiamos covid-19 so para covid, o que nao eh de todo pior.
if re.match(pat_letters_before_numbers,word) and word not in ['G20','G8']:
index=word.find(next(filter(str.isnumeric,word)))
splitat= index
l, r = word[:splitat], word[splitat:]
return l
else:
return word
clean=[separate_left_letters_from_right_numbers(separate_right_letters_from_left_numbers(word)) for word in list_of_tokens]
return clean
def fun_no_letters_surrounded_by_numbers(list_of_tokens):
patterns=[r'[0-9]+[ A-Za-z\u00C0-\u00FF]+[0-9]+[ A-Za-z\u00C0-\u00FF]*']
combined_pattern = r'|'.join(patterns)
clean=[word for word in list_of_tokens if not re.match(combined_pattern,word)]
return clean
def fun_no_single_character_token(list_of_tokens):
clean=[word for word in list_of_tokens if len(word) > 2]
return clean
def fun_no_stopwords(list_of_tokens):
nltk_stopwords = nltk.corpus.stopwords.words('portuguese') #é uma lista
personal_stopwords = ['Esse','Essa','Essas','Esses','De','A','As','O','Os','Em','Também','Um','Uma','Uns','Umas','Eu','eu','Ele','ele','Nós','nós','Você','você','entanto','Entanto','ser','Ser','com','Com','como','Como','se','Se','portanto','Portanto','Enquanto','enquanto','No','no','Na','na','Nessa','Nesse','Nesses','Nessas','As','A','Os','às','Às','ante','Ante','entre','Entre','sim','não','Sim','Não','Onde','onde','Aonde','aonde','após','Após','ser','Ser','hoje','Se','vai','Hoje','Por','Quando','Também','Depois','Mesmo','Numa','Numas','Pelos','Aquele','Aquela','Aqueles','Aquelas','Aquilos','Há','Ou','Isso','Segundo','segundo','pois','Pois','outro','Outro','outros','Outros','outra','Outras','outras','Outra','Este','cada','Cada','para','Para','disso','Disso','dessa','desse','deste','desta','Deste','Desta','Destes','Destas','já','Já','Mas','mas','ao','Ao','porém','Porém','Este','mesma','Mais','cujo','cuja','cujos','cujas','caso','quanto','a partir','cujo','caso','quanto','devido','pelo','pela','pelas','à','do','disto','mesmas','mesmos','nós','primeiro','primeiros','primeira','primeiras','segunda','quanto','neste','nesta','nestes','nestas']
list_of_numbers=list(range(0, 100))
lista_numeros_extenso=[num2words(number,lang='pt_BR') for number in list_of_numbers]
initial_stopwords=nltk_stopwords+personal_stopwords+lista_numeros_extenso
initial_stopwords_lower=[word.lower() for word in initial_stopwords]
exceptions=['são','fora'] #Assim, preservamos 'São Paulo' e 'Fora Temer'
initial_stopwords_higher=[word.capitalize() for word in initial_stopwords_lower if not word in exceptions]
stopwords=list(set(initial_stopwords_lower+initial_stopwords_higher))
stopwords.sort()
clean=[word for word in list_of_tokens if word not in stopwords]
return clean
#Lematizing and _lower_case
! python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_sm") #, disable=['parser', 'ner'] # just keep tagger for lemmatization
def fun_verb_lematized_lower_case(list_of_tokens):
#nlp = spacy.load("pt_core_news_sm") #, disable=['parser', 'ner'] # just keep tagger for lemmatization
spacy_object=nlp(' '.join(list_of_tokens))
clean=[token.lemma_.lower() if token.pos_=='VERB' else token.text.lower() for token in spacy_object]
return clean
def fun_verb_adv_adj_pron_det_lematized_lower_case(list_of_tokens):
#nlp = spacy.load("pt_core_news_sm") #, disable=['parser', 'ner'] # just keep tagger for lemmatization
spacy_object=nlp(' '.join(list_of_tokens))
clean=[token.lemma_.lower() if (token.pos_=='VERB' or token.pos_== 'ADV' or token.pos_== 'ADJ' or token.pos_== 'DET' or token.pos_== 'PRON') else token.text.lower() for token in spacy_object]
return clean
def fun_completely_lematized_lower_case(list_of_tokens):
#nlp = spacy.load("pt_core_news_sm") #, disable=['parser', 'ner'] # just keep tagger for lemmatization
spacy_object=nlp(' '.join(list_of_tokens))
clean=[token.lemma_.lower() for token in spacy_object]
return clean
def function_unique_tokens(text):
clean=set(text)
return clean
print ('no_dirty_expression')
df['no_dirty_expression']=df['everything'].apply(lambda x: no_dirty_expressions(x))
print ('tokenized')
df['tokenized']=df['no_dirty_expression'].apply(lambda x: fun_split(x))
print ('no_tokens_with_no_letters_and_bad_symbols')
df['no_tokens_with_no_letters_and_bad_symbols']=df['tokenized'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('no_symbols_in_extreme')
df['no_symbols_in_extreme']=df['no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: fun_clean_extreme(x))
print ('correction_errors')
df['correction_errors']=df['no_symbols_in_extreme'].apply(lambda x: correct_error(x))
print ('no_symbols_except_hyphens_replaced_without_space')
df['no_symbols_except_hyphens_replaced_without_space']=df['correction_errors'].apply(lambda x: fun_no_symbols_except_hyphens_replaced_without_space(x))
print ('no_symbols_in_extreme_again')
df['no_symbols_in_extreme_again']=df['no_symbols_except_hyphens_replaced_without_space'].apply(lambda x: fun_clean_extreme(x))
print ('no_spaces_in_middle')
df['no_spaces_in_middle']=df['no_symbols_in_extreme_again'].apply(lambda x: fun_no_spaces_in_middle(x))
print ('no_same_character_word')
df['no_same_character_word']=df['no_spaces_in_middle'].apply(lambda x: fun_not_the_same_character(x))
print ('again_no_tokens_with_no_letters_and_bad_symbols')
df['again_no_tokens_with_no_letters_and_bad_symbols']=df['no_same_character_word'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('no_letters_and_numbers_divided_by_hyphen')
df['no_letters_and_numbers_divided_by_hyphen']=df['again_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: fun_sep_letters_and_numb_hifen(x))
print ('no_letters_and_numbers_glued')
df['no_letters_and_numbers_glued']=df['no_letters_and_numbers_divided_by_hyphen'].apply(lambda x: sep_letter_and_numb(x))
print ('no_letters_surrounded_by_numbers')
df['no_letters_surrounded_by_numbers']=df['no_letters_and_numbers_glued'].apply(lambda x: fun_no_letters_surrounded_by_numbers(x))
print ('no_single_character_token')
df['no_single_character_token']=df['no_letters_surrounded_by_numbers'].apply(lambda x: fun_no_single_character_token(x))
print ('no_stopwords')
df['no_stopwords']=df['no_single_character_token'].apply(lambda x: fun_no_stopwords(x))
print ('once_more_no_tokens_with_no_letters_and_bad_symbols')
df['once_more_no_tokens_with_no_letters_and_bad_symbols']=df['no_stopwords'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('verb_lematized_lower_case')
df['verb_lematized_lower_case']=df['once_more_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: fun_verb_lematized_lower_case(x))
print ('verb_adv_adj_pron_det_lematized_lower_case')
df['verb_adv_adj_pron_det_lematized_lower_case']=df['once_more_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: fun_verb_adv_adj_pron_det_lematized_lower_case(x))
print ('completely_lematized_lower_case')
df['completely_lematized_lower_case']=df['once_more_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: fun_completely_lematized_lower_case(x))
print ('verb_final_no_tokens_with_no_letters_and_bad_symbols')
df['verb_final_no_tokens_with_no_letters_and_bad_symbols']=df['verb_lematized_lower_case'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('all_but_noun_final_no_tokens_with_no_letters_and_bad_symbols')
df['all_but_noun_final_no_tokens_with_no_letters_and_bad_symbols']=df['verb_adv_adj_pron_det_lematized_lower_case'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('complete_final_no_tokens_with_no_letters_and_bad_symbols')
df['complete_final_no_tokens_with_no_letters_and_bad_symbols']=df['completely_lematized_lower_case'].apply(lambda x: fun_no_tokens_with_no_letters_and_bad_symbols(x))
print ('unique_verb_tokens')
df['unique_verb_tokens']=df['verb_final_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: function_unique_tokens(x))
print ('unique_all_but_nouns_tokens')
df['unique_all_but_nouns_tokens']=df['all_but_noun_final_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: function_unique_tokens(x))
print ('unique_all_tokens')
df['unique_all_tokens']=df['complete_final_no_tokens_with_no_letters_and_bad_symbols'].apply(lambda x: function_unique_tokens(x))
return df
Once again, my main concern, also given how little time I have left to finish my thesis, is to be able to run the parallelization with THIS function (though if I find some time, it will be a pleasure to try to make it more efficient). The only problem I have faced is that the parallelization chunk never stops running, although the function itself works and I have tested it on smaller DataFrames.

How do I get scipy.stats.truncnorm.rvs to use numpy.random.default_rng()?

I am having trouble with random_state in scipy.stats.truncnorm. Here is my code:
from scipy.stats import truncnorm
from numpy.random import default_rng
rg = default_rng( 12345 )
truncnorm.rvs(0.0,1.0,size=10, random_state=rg)
I get the following error:
File "test2.py", line 4, in <module>
truncnorm.rvs(0.0,1.0,size=10, random_state=rg)
File "/opt/anaconda3/envs/newbase/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py", line 1004, in rvs
vals = self._rvs(*args, size=size, random_state=random_state)
File "/opt/anaconda3/envs/newbase/lib/python3.8/site-packages/scipy/stats/_continuous_distns.py", line 7641, in _rvs
out = self._rvs_scalar(a.item(), b.item(), size, random_state=random_state)
File "/opt/anaconda3/envs/newbase/lib/python3.8/site-packages/scipy/stats/_continuous_distns.py", line 7697, in _rvs_scalar
U = random_state.random_sample(N)
AttributeError: 'numpy.random._generator.Generator' object has no attribute 'random_sample'
I am using numpy 1.19.1 and scipy 1.5.0. The problem does not occur with scipy.stats.norm.rvs.
In scipy 1.7.1, the problem line has been changed to:
def _rvs_scalar(self, a, b, numsamples=None, random_state=None):
    if not numsamples:
        numsamples = 1

    # prepare sampling of rvs
    size1d = tuple(np.atleast_1d(numsamples))
    N = np.prod(size1d)  # number of rvs needed, reshape upon return
    # Calculate some rvs
    U = random_state.uniform(low=0, high=1, size=N)
    x = self._ppf(U, a, b)
    rvs = np.reshape(x, size1d)
    return rvs
Both the new Generator and the legacy RandomState have uniform, but rg (a Generator) does not have random_sample:
In [221]: rg.uniform
Out[221]: <function Generator.uniform>
In [222]: np.random.uniform
Out[222]: <function RandomState.uniform>
np.random.random_sample has this note:
.. note::
New code should use the ``random`` method of a ``default_rng()``
instance instead; please see the :ref:`random-quick-start`.
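In other words, scipy 1.5.0's truncnorm still calls random_state.random_sample(), which only the legacy RandomState provides, while scipy >= 1.7 calls random_state.uniform(), which both RandomState and the new Generator have. Until you can upgrade, here is a sketch of the two options as I understand them:

import numpy as np
from scipy.stats import truncnorm

# Option 1: with an older scipy (e.g. 1.5.0), pass the legacy RandomState,
# which still has the random_sample() method that truncnorm calls internally.
rs = np.random.RandomState(12345)
print(truncnorm.rvs(0.0, 1.0, size=10, random_state=rs))

# Option 2: upgrade to scipy >= 1.7, where _rvs_scalar uses uniform() and the
# Generator returned by default_rng() works as in the original snippet:
# rg = np.random.default_rng(12345)
# truncnorm.rvs(0.0, 1.0, size=10, random_state=rg)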

Python beginner: preprocessing a French text in Python and calculating its polarity with a lexicon

I am writing an algorithm in Python which processes a column of sentences and then gives the polarity (positive or negative) of each cell of my column of sentences. The script uses a list of negative and positive words from the NRC emotion lexicon (French version). I am having trouble writing the preprocess function, and since I am struggling with it, I am not really sure whether the count function and the polarity function I have already written work either.
The positive and negative words were in the same file (the lexicon), but I exported the positive and negative words separately because I did not know how to use the lexicon as it was.
My function counting occurrences of positive and negative words does not work and I do not know why it always returns 0. I added positive words to each sentence, so they should show up in the dataframe:
Output:
[4 rows x 6 columns]
id Verbatim ... word_positive word_negative
0 15 Je n'ai pas bien compris si c'était destiné a ... ... 0 0
1 44 Moi aérien affable affaire agent de conservati... ... 0 0
2 45 Je affectueux affirmative te hais et la Foret ... ... 0 0
3 47 Je absurde accidentel accusateur accuser affli... ... 0 0
=>
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)

csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))
This is my csv data: lines 44 and 45 contain positive words and line 47 contains more negative words, but the positive and negative word columns are always empty; the function never returns the number of words, and the final polarity column is always positive even though the last sentence is negative.
id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue
Here is the full code:
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict, Counter

csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())
stopWords = set(stopwords.words('french'))
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''
    text_preprocess = []
    text_without_stopwords = []
    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)
    text_without_stopwords = [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords

csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)

lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())

def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)

#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)

lexiconneg = open('negative.txt', 'r', encoding='utf-8')

def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)

#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)

def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, lexiconpos)
    negatives_text = count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text:
        return "positive"
    else:
        return "negative"

csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)
If you could also check whether the rest of the code is correct, thank you.
I have found your error!
It comes from the Polarity_score function. It's just a typo: in your if statement you were comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results of calling them.
Your code currently looks like this:
def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    count_text_pos = count_occurences_Pos(text, word_list)
    count_text_neg = count_occurences_Neg(text, word_list)
    if count_occurences_pos > count_occurences_peg:
        return "Positive"
    else:
        return "negative"
In the future, try to use meaningful names for your variables to avoid this kind of error.
With correct variable names, your function should be:
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, word_list)
    negatives_text = count_occurences_neg(text, word_list)
    if positives_text > negatives_text:
        return "Positive"
    else:
        return "negative"
Another improvement you can make to your count_occurences_pos and count_occurences_neg functions is to use sets instead of lists. Your text and word_list can both be converted to sets, and you can use set intersection to count the positive words in them, because set lookups are faster than list lookups.
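As a sketch of that last suggestion (assuming positive.txt and negative.txt contain one word per line, and reusing the process_text function from the question), the lexicons can be loaded into sets once and the counting done with a set intersection; note that a word appearing twice in a text is then only counted once:

# Load each lexicon into a set once (assumes one word per line in the files).
with open('positive.txt', encoding='utf-8') as f:
    lexiconpos = {line.strip().lower() for line in f if line.strip()}
with open('negative.txt', encoding='utf-8') as f:
    lexiconneg = {line.strip().lower() for line in f if line.strip()}

def count_occurences(text, word_set):
    '''Count how many distinct processed tokens of `text` appear in `word_set`.'''
    return len(set(process_text(text)) & word_set)

csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences, args=(lexiconpos,))
csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences, args=(lexiconneg,))
csv_df['polarity'] = (csv_df['word_positive'] > csv_df['word_negative']).map({True: 'positive', False: 'negative'})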

Time Dependant 1D Schroedinger Equation using Numpy and SciPy solve_ivp

I am trying to solve the 1D time-dependent Schroedinger equation using finite difference methods; this is how the equation looks and how it is discretized.
Say I have N spatial points (x_i goes from 0 to N-1), and suppose my time span is K time points.
I am trying to get a K by N matrix, where each row (j) is the wave function at time t_j.
I suspect that my issue is that I am defining the system of coupled equations in the wrong way.
My boundary conditions are psi = 0 (or some constant) at the sides of the box, so I set the ODEs at the edges of my x span to zero.
My Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

# Defining the length and the resolution of our x vector
length = 2*np.pi
delta_x = .01

# create a vector of X values, and the number of X values
def create_x_vector(length, delta_x):
    x = np.arange(-length, length, delta_x)
    N = len(x)
    return x, N

# create initial condition vector
def create_initial_cond(x, x0, Gausswidth):
    psi0 = np.exp((-(x-x0)**2)/Gausswidth)
    return psi0

# create the system of ODEs
def ode_system(psi, t, delta_x, N):
    psi_t = np.zeros(N)
    psi_t[0] = 0
    psi_t[N-1] = 0
    for i in range(1, N-1):
        psi_t[i] = (psi[i+1]-2*psi[i]+psi[i-1])/(delta_x)**2
    return psi_t

# Create the actual time, x and initial condition vectors using the functions
t = np.linspace(0, 15, 5000)
x, N = create_x_vector(length, delta_x)
psi0 = create_initial_cond(x, 0, 1)
psi = np.zeros(N)
psi = solve_ivp(ode_system(psi,t,delta_x,N),[0,15],psi0,method='Radau',max_step=0.1)
After running I get an error:
runfile('D:/Studies/Project/Simulation Test/Test2.py', wdir='D:/Studies/Project/Simulation Test')
Traceback (most recent call last):
File "<ipython-input-16-bff0a1fd9937>", line 1, in <module>
runfile('D:/Studies/Project/Simulation Test/Test2.py', wdir='D:/Studies/Project/Simulation Test')
File "C:\Users\Pasha\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "C:\Users\Pasha\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "D:/Studies/Project/Simulation Test/Test2.py", line 35, in <module>
psi= solve_ivp(ode_system(psi,t,delta_x,N),[0,15],psi0,method='Radau',max_step=0.1)
File "C:\Users\Pasha\Anaconda3\lib\site-packages\scipy\integrate\_ivp\ivp.py", line 454, in solve_ivp
solver = method(fun, t0, y0, tf, vectorized=vectorized, **options)
File "C:\Users\Pasha\Anaconda3\lib\site-packages\scipy\integrate\_ivp\radau.py", line 288, in __init__
self.f = self.fun(self.t, self.y)
File "C:\Users\Pasha\Anaconda3\lib\site-packages\scipy\integrate\_ivp\base.py", line 139, in fun
return self.fun_single(t, y)
File "C:\Users\Pasha\Anaconda3\lib\site-packages\scipy\integrate\_ivp\base.py", line 21, in fun_wrapped
return np.asarray(fun(t, y), dtype=dtype)
TypeError: 'numpy.ndarray' object is not callable
On a more general note, how can I make Python solve N ODEs without manually defining each and every one of them?
I want to have a big vector called xdot where each cell is a function of some of the X[i]'s, and I seem to be failing at that. Or maybe my approach is completely wrong?
I also have a feeling that the "vectorized" argument of solve_ivp could be relevant, but I do not understand the explanation in the SciPy documentation.
The problem is probably that solve_ivp expects a function as its first parameter, and you provided ode_system(psi,t,delta_x,N), which evaluates to an array instead (hence the TypeError saying a 'numpy.ndarray' is not callable).
You need to give solve_ivp a function that accepts two arguments, t and y (which in your case is psi). It can be done like this:
def temp_function(t, psi):
    return ode_system(psi, t, delta_x, N)
and then, your last line should be:
psi= solve_ivp(temp_function,[0,15],psi0,method='Radau',max_step=0.1)
This code solved the problem for me.
As a shorthand, you can also write the function inline using a lambda:
psi= solve_ivp(lambda t,psi : ode_system(psi,t,delta_x,N),[0,15],psi0,method='Radau',max_step=0.1)
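On the broader question of defining N ODEs without writing each one out: the loop in ode_system can be replaced by a single slicing expression, which is the kind of vectorisation NumPy is built for. Here is a sketch using the same variables (delta_x, psi0) and the same zero boundary conditions as the question; as I understand it, the vectorized flag of solve_ivp is a separate thing (it declares that the function accepts y of shape (n, k) at once, which only speeds up the Jacobian approximation):

def ode_system_vec(t, psi, delta_x):
    # Same right-hand side as ode_system, but the interior points are computed
    # with one slicing expression instead of a Python loop over i.
    psi_t = np.zeros_like(psi)
    psi_t[1:-1] = (psi[2:] - 2*psi[1:-1] + psi[:-2]) / delta_x**2  # boundaries stay 0
    return psi_t

psi = solve_ivp(lambda t, psi: ode_system_vec(t, psi, delta_x),
                [0, 15], psi0, method='Radau', max_step=0.1)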