Is there a way to speed up this web-scraping iteration? (pandas)

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock =['adma','aapl','fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below, with the name of the stock added to the end of the link
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, as is evident from the [-2] in the code below:
df2 = pd.DataFrame()
for i in stock:
    # the 'SEC Form 4' table is the second-to-last table on the page, hence [-2]
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = i.upper()  # column with the ticker name, so I can differentiate between stocks
    df2 = df2.append(df)
Each iteration feels like it takes a few seconds, and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious whether there is a more efficient method. Thanks.

Your current code is blocking: you don't start retrieving information from the next URL until you are done with the current one. Instead, you can switch to, for example, Scrapy, which is based on Twisted and works asynchronously, processing multiple pages at the same time.
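If you don't want to take on a full Scrapy project, here is a minimal sketch of the same idea using the standard library's concurrent.futures to fetch several pages at once. The URL pattern and the [-2] table index are taken from the question; the worker count is an arbitrary choice:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

stock = ['adma', 'aapl', 'fb']  # the full list would hold ~700 tickers

def fetch(ticker):
    # same per-ticker read as in the original loop
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(ticker),
                      header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = ticker.upper()
    return df

# fetch several pages concurrently instead of one after another
with ThreadPoolExecutor(max_workers=10) as executor:
    frames = list(executor.map(fetch, stock))

df2 = pd.concat(frames)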

Related

Flask: how to paginate cx_Oracle data between successive requests?

My Flask app needs to return a huge dataframe to the client application.
I'm using the pandas function read_sql to fetch chunks of data, for example:
import pandas as pd
sql = "select * from huge_table"
iterator = pd.read_sql(sql, con=my_cx_oracle_connection, chunksize=1000)
Where iterator is used to fetch the whole dataset in small chunks of 1000 records each:
data = next(iterator, [])
while data:
    yield data
    data = next(iterator, [])
With this approach, I guess I can "stream", or at least paginate, the data just as described in the Flask documentation.
However, to be able to do so, I would need to retain the state of the iterator between HTTP GET requests. How should one do this? Do I need some sort of global variable? But then, what about multiple clients?!
I'm missing something to make this work properly and avoid fetching the same part of the data over and over.
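For reference, this is roughly how I imagine hooking such a generator into a Flask streaming response for a single request (the route name is just an example, and my_cx_oracle_connection is the connection from above); the part I can't figure out is keeping the iterator alive across separate GET requests:
from flask import Flask, Response
import pandas as pd

app = Flask(__name__)

@app.route('/huge-table')
def huge_table():
    def generate():
        sql = "select * from huge_table"
        # stream each 1000-row chunk out as CSV text
        for chunk in pd.read_sql(sql, con=my_cx_oracle_connection, chunksize=1000):
            yield chunk.to_csv(header=False, index=False)
    return Response(generate(), mimetype='text/csv')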
Thanks.

How to create a sequence of pandas dataframes using a for loop

I have information in different dataframes related to several securities. I would like to run a loop that takes the columns I need from the different dataframes, creates one dataframe per security, and then stores those dataframes as elements of a dictionary, something along the following lines:
securities = {}
for tick in tickers:
    df_'tick' = pd.DataFrame(data, columns=['a','b','c'])
    securities[tick] = df_tick
Thanks for your help!!
Try:
securities = {tick: pd.DataFrame(data, columns=['a', 'b', 'c'])
              for tick in tickers}
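If it's easier to read, here is the same thing as an explicit loop, assuming data and tickers are defined as in the question:
import pandas as pd

securities = {}
for tick in tickers:
    securities[tick] = pd.DataFrame(data, columns=['a', 'b', 'c'])

# an individual frame is then looked up by its ticker
df_aapl = securities['aapl']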

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first file into df1, the second into df2, etc. But upon trying to do this I realized I have no idea how to change the variable being assigned when the assignment happens inside an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done no problemo, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib, which was added in Python 3.4+, instead of os:
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of csv paths
# swap Path.cwd() for Path(your_path) if the script is in a different location

dfs = {}  # let's hold the csv's in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of dataframes keyed by the csv file name; .stem gives the name without the extension,
much like
{
    'csv_1': dataframe,
    'csv_2': dataframe
}
If you want to concat these, then do
df = pd.concat(dfs)
The first level of the resulting index will be the csv file name.
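Since the question only needs the first row of each file, here is a sketch of the same pattern with nrows=1 and a single combined frame (the files are assumed to sit in the current working directory):
import pandas as pd
from pathlib import Path

# one-row read per file, keyed by the file name without its extension
first_rows = {f.stem: pd.read_csv(f, nrows=1) for f in Path.cwd().glob('*.csv')}

# one frame whose first index level tells you which file each row came from
combined = pd.concat(first_rows)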

Pandas, importing JSON-like file using read_csv

I would like to import data from a .txt file into a dataframe. I cannot import it using a plain pd.read_csv; with the different types of sep I have tried, it throws errors. The data I want to import, Cell_Phones_&_Accessories.txt.gz, is in this format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
....
You can use a separator that does not occur in the data (here the yen sign, ¥) so each line ends up in a single column, then split on the first : and pivot:
import pandas as pd

df = pd.read_csv('Cell_Phones_&_Accessories.txt', sep='¥', names=['data'], engine='python')

df1 = df.pop('data').str.split(':', n=1, expand=True)
df1.columns = ['a', 'b']
df1 = df1.assign(c=(df1['a'] == 'product/productId').cumsum())
df1 = df1.pivot(index='c', columns='a', values='b')
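Since the file in the question is gzipped (.txt.gz), read_csv can also be pointed at it directly; compression is inferred from the extension. A sketch, under the same separator assumption as above:
df = pd.read_csv('Cell_Phones_&_Accessories.txt.gz', sep='¥', names=['data'], engine='python')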
A plain-Python solution with defaultdict and the DataFrame constructor, for improved performance:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)

with open("Cell_Phones_&_Accessories.txt") as f:
    for line in f:
        if len(line) > 1:
            key, value = line.strip().split(':', 1)
            data[key].append(value)

df = pd.DataFrame(data)

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how do I stop it from outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
I think you need option_context set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display up to 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())
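If you want to remove the row limit entirely rather than raise it to a fixed number, display.max_rows also accepts None. A small sketch; note that with a very long Series this produces an enormous output cell:
# None disables row truncation completely for the duration of the block
with pd.option_context('display.max_rows', None):
    print(df.col1.value_counts())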