The code below runs a loop and prints a table of data from a website.
How can I get the data from this output table into a new, organized table?
I need to get the 'Códigos de Negociação' and the 'CNPJ' into this new table.
This is a sample of the scraped table:
   0                       1
0  Nome de Pregão          FII ALIANZA
1  Códigos de Negociação   ALZR11
2  CNPJ                    28.737.771/0001-85
3  Classificação Setorial  Financeiro e Outros/Fundos/Fundos Imobiliários
4  Site                    www.btgpactual.com
This is the code:
import pandas as pd

listings = pd.read_html('http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListados.aspx?tipoFundo=imobiliario&Idioma=pt-br')[0]
Tickers = listings['Código'].tolist()
removechars = str.maketrans('', '', './-')

for i in Tickers:
    try:
        df = pd.read_html("http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListadosDetalhe.aspx?Sigla=" + i + "&tipoFundo=Imobiliario&aba=abaPrincipal&idioma=pt-br")[0]
        print(df)
    except Exception:  # page without the expected table
        print('y')
I would also like to apply removechars to the CNPJ values, to clear them of dots, slashes and dashes.
Expected result:
Código CNPJ
0 ALZR11 28737771000185
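For the character stripping itself, the third argument of str.maketrans lists characters to delete; a minimal check on one CNPJ string:

```python
# Translation table: the third argument lists characters to delete
removechars = str.maketrans('', '', './-')

cnpj = '28.737.771/0001-85'
clean = cnpj.translate(removechars)
print(clean)  # 28737771000185
```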
This code worked for me:
import pandas as pd

listings = pd.read_html('http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListados.aspx?tipoFundo=imobiliario&Idioma=pt-br')[0]
Tickers = listings['Código'].tolist()
print(listings)

CNPJ = []
Codigo = []
removechars = str.maketrans('', '', './-')

for i in Tickers:
    try:
        df = pd.read_html("http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListadosDetalhe.aspx?Sigla=" + i + "&tipoFundo=Imobiliario&aba=abaPrincipal&idioma=pt-br")[0]
        print(df)
        Codigo.append(df.at[1, 1])  # row 1: 'Códigos de Negociação'
        CNPJ.append(df.at[2, 1])    # row 2: 'CNPJ'
        df2 = pd.DataFrame({'Codigo': Codigo, 'CNPJ': CNPJ})
        CNPJ_No_S_CHAR = [s.translate(removechars) for s in CNPJ]
        df2['CNPJ'] = pd.Series(CNPJ_No_S_CHAR)
        print(df2)
    except Exception:  # page without the expected table
        print('y')
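The cleanup step can also be vectorised with pandas' own str.replace once df2 exists; a minimal sketch, with hard-coded stand-ins for the values the scraping loop collects:

```python
import pandas as pd

# Stand-ins for the lists filled inside the scraping loop
Codigo = ['ALZR11']
CNPJ = ['28.737.771/0001-85']

df2 = pd.DataFrame({'Codigo': Codigo, 'CNPJ': CNPJ})
# Vectorised removal of dots, slashes and dashes
df2['CNPJ'] = df2['CNPJ'].str.replace(r'[./-]', '', regex=True)
print(df2)
```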
I have a pandas column which has special characters such as {{, }}, [, ] (commas are separators).
I tried using the following to replace the special characters with an underscore ('_'), but it is not working. Can you please let me know what I am doing wrong? Thanks.
import pandas as pd
data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Marketing'])
print(df)
df['Marketing'].str.replace(r"\(|\)|\{|\}|\[|\]|\|", "_")
print(df)
Output:
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
From this DataFrame :
>>> import pandas as pd
>>> data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
>>> df = pd.DataFrame(data, columns = ['Marketing'])
>>> df
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
We can use replace as you suggested with a regex, in which | is the alternation ("or") operator, except for the final escaped \|, which matches a literal | character. Note that str.replace returns a new Series, so the result must be assigned back to the column (the original code discarded it).
Then we deduplicate the doubled _ and remove any final remaining _ to get the expected result:
>>> df['Marketing'] = df['Marketing'].str.replace(r"\(+|\)+|\{+|\}+|\[+|\]+|\|+|\_+|\.+", "_", regex=True).str.replace(r"_+", "_", regex=True).str.replace(r"_$", "", regex=True)
>>> df
0 facebook_campaign_name
1 google_email
Name: Marketing, dtype: object
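Equivalently, the alternation of single characters can be collapsed into one character class; a sketch of the same cleanup:

```python
import pandas as pd

df = pd.DataFrame({'Marketing': ['facebook_{{campaign.name}}', 'google_[email]']})
# One class matches runs of parens, braces, brackets, dots, pipes and underscores;
# a trailing '_' (if any) is then trimmed
df['Marketing'] = (df['Marketing']
                   .str.replace(r'[(){}\[\].|_]+', '_', regex=True)
                   .str.replace(r'_$', '', regex=True))
print(df)
```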
I think my code works well, but the problem is that it does not keep every answer in the DataFrame R.
When I print R, only the last answer appears.
What should I do to display every answer?
I want to add each answer in the next column.
import numpy as np
import pandas as pd

DATA = pd.read_excel(r'C:\gskim\P4DS/Monthly Stock adjClose2.xlsx')
DATA = DATA.set_index("Date")
DATA1 = np.log(DATA/DATA.shift(1))
DATA2 = DATA1.drop(DATA1.index[0])*100
F = pd.DataFrame(index = DATA2.index)

for i in range(0, 276):
    Q = DATA2.iloc[i].dropna()
    W = sorted(abs(Q), reverse = False)
    W_qcut = pd.qcut(W, 5, labels = ['A', 'B', 'C', 'D', 'E'])
    F = Q.groupby(W_qcut).sum()
    R = pd.DataFrame(F)

print(R)
The printed R above holds only the last row's result; I want the result table filled in with one row of answers per date.
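One common way to keep every iteration's result is to collect each grouped sum in a list and build a single DataFrame at the end. A minimal sketch, using small hypothetical stand-in data in place of the Excel file (note it also applies qcut directly to abs(Q) so the bucket labels stay aligned with Q's index):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Hypothetical stand-in for DATA2: 4 months x 6 stocks of log returns
DATA2 = pd.DataFrame(np.random.randn(4, 6),
                     index=['2020-01', '2020-02', '2020-03', '2020-04'])

results = []
for i in range(len(DATA2)):
    Q = DATA2.iloc[i].dropna()
    # qcut on abs(Q) keeps the labels aligned with Q's index
    W_qcut = pd.qcut(abs(Q), 3, labels=['A', 'B', 'C'])
    results.append(Q.groupby(W_qcut).sum())

# One row per date, one column per bucket
R = pd.DataFrame(results, index=DATA2.index)
print(R)
```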
I wrote a query joining two DataFrames and want to save the result in a CSV file.
I tried this code and it didn't work:
q1 = "SELECT * FROM df1 join df2 on df1.Date = df2.Date"
df = pd.read_sql(q1,None)
df.to_csv('data.csv',index=False)
You can try the following code:
import pandas as pd
df1 = pd.read_csv("Insert file path")
df2 = pd.read_csv("Insert file path")
df1['Date'] = pd.to_datetime(df1['Date'] ,errors = 'coerce',format = '%Y-%m-%d')
df2['Date'] = pd.to_datetime(df2['Date'] ,errors = 'coerce',format = '%Y-%m-%d')
df = df1.merge(df2,how='inner', on ='Date')
df.to_csv('data.csv',index=False)
This should solve your problem.
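If you'd rather keep the SQL syntax, note that pd.read_sql needs an actual connection as its second argument (passing None is why the original code failed); one option is to load both frames into an in-memory SQLite database first. A sketch with tiny stand-in frames:

```python
import sqlite3
import pandas as pd

# Tiny stand-in frames; replace with your real df1/df2
df1 = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02'], 'a': [1, 2]})
df2 = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02'], 'b': [3, 4]})

conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)

q1 = "SELECT df1.*, df2.b FROM df1 JOIN df2 ON df1.Date = df2.Date"
df = pd.read_sql(q1, conn)
df.to_csv('data.csv', index=False)
```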
I'm iterating through PDFs to obtain the text entered in the form fields. When I send the rows to a CSV file it only exports the last row. When I print results from the DataFrame, all the row indexes are 0's. I have tried various solutions from Stack Overflow, but I can't get anything to work; what should be 0, 1, 2, 3...etc. is coming out as 0, 0, 0, 0...etc.
Here is what I get when printing the results; only the last row exports to the CSV file:
0
0 1938282828
0
0 1938282828
0
0 22222222
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2])
        print(df)
Thank you for any help!
You are replacing the same dataframe each time:
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2])  # this creates a new df each time
        print(df)
Correct code:
infile = glob.glob('./*.pdf')
df = pd.DataFrame()
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = df.append([myfieldvalue2])
print(df)
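Note that DataFrame.append has been removed in pandas 2.0+, so a safer pattern is to collect the values in a plain list and build the frame once at the end, which also yields a clean 0..n-1 index. A sketch with the PDF reading stubbed out (the PdfFileReader calls are kept as comments, since they need real files):

```python
import glob
import pandas as pd

rows = []
for path in glob.glob('./*.pdf'):
    # pdreader = PdfFileReader(open(path, 'rb'))          # as in the question
    # value = str(pdreader.getFormTextFields()['ID'])
    value = path  # stand-in so the sketch runs without real PDFs
    rows.append(value)

df = pd.DataFrame(rows, columns=['ID'])  # fresh 0..n-1 index
df.to_csv('output.csv', index=False)
print(df)
```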
I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the example below on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
It is when creating df3 that I get the memory error.
The DF's in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think working with lists in pandas cells is not a good idea; avoid it if possible.
So I believe you can simplify your code a lot:
import numpy as np
import pandas as pd

nRows = 2
nCols = 3
np.random.seed(2019)

df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print(df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here's an example with a solution to the problem (note that this example stores arrays rather than lists in the column; that part I cannot avoid, since my original data comes with lists or arrays in a column).
import pandas as pd
import numpy as np

np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5

df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)

chunkSize = int(round(nRows / float(numberOfChunks)))
for start, stop in zip(np.arange(0, nRows, chunkSize),
                       np.arange(chunkSize, nRows + chunkSize, chunkSize)):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
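As an aside, the chunk boundaries can be computed more simply with np.array_split, avoiding the zip/arange arithmetic; a small sketch of the same chunked expansion (tiny sizes so it runs quickly):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
nRows, nCols, numberOfChunks = 10, 4, 3

# A Series whose cells each hold an array, as in the question
df2 = pd.Series([np.random.rand(nCols) for _ in range(nRows)])

pieces = []
for idx in np.array_split(np.arange(nRows), numberOfChunks):
    chunk = df2.iloc[idx]
    # Expand one chunk at a time and downcast immediately
    pieces.append(pd.DataFrame(chunk.tolist(), index=chunk.index).astype('float16'))
df3 = pd.concat(pieces)
print(df3.shape)
```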