Flask: how to paginate cx_Oracle data between successive requests?

My Flask app needs to return a huge dataframe to the client application.
I'm using the pandas function read_sql to fetch chunks of data, for example:
import pandas as pd
sql = "select * from huge_table"
iterator = pd.read_sql(sql, con=my_cx_oracle_connection, chunksize=1000)
Where iterator would be used to fetch the full dataset in chunks of 1,000 records each:
data = next(iterator, None)
while data is not None:
    yield data
    data = next(iterator, None)
With this approach, I think I can "stream", or at least paginate, the data just as described in the Flask documentation.
However, to do so I would need to retain the state of the iterator between successive HTTP GET requests. How should one do this? Do I need some sort of global variable? But then, what about multiple clients?
I'm missing something to make this work properly and to avoid fetching the same part of the data over and over.
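For reference, here is a minimal sketch (not from the original post) of the single-response streaming pattern from the Flask documentation applied to the chunked reader above. The /huge route name and CSV output are assumptions, and my_cx_oracle_connection is the connection from the earlier snippet; streaming everything inside one request avoids keeping iterator state between requests:
from flask import Flask, Response
import pandas as pd

app = Flask(__name__)

@app.route('/huge')
def huge():
    def generate():
        iterator = pd.read_sql("select * from huge_table",
                               con=my_cx_oracle_connection,  # connection object from the snippet above
                               chunksize=1000)
        for chunk in iterator:
            # each chunk is a DataFrame; serialize it and push it to the client
            yield chunk.to_csv(index=False, header=False)
    return Response(generate(), mimetype='text/csv')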
Thanks.

Related

How to filter Socrata API dataset by multiple values for a single field?

I am attempting to create a CSV file using Python by reading from this specific api:
https://dev.socrata.com/foundry/data.cdc.gov/5jp2-pgaw
Where I'm running into trouble is that I would like to specify multiple values of "loc_admin_zip" to search for at once, for example returning a CSV file where the zip is either "10001" or "10002". However, I can't figure out how to do this; I can only get it to work when "loc_admin_zip" is set to a single value. Any help would be appreciated. My code so far:
import pandas as pd
from sodapy import Socrata
client = Socrata("data.cdc.gov", None)
results = client.get("5jp2-pgaw", loc_admin_zip=10002)
results_df = pd.DataFrame.from_records(results)
results_df.to_csv('test.csv')
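One possible approach (not part of the original post, and assuming loc_admin_zip is stored as a text field) is to pass a SoQL $where clause through sodapy's where argument, which lets you match several values with IN; a minimal sketch:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.cdc.gov", None)
# SoQL IN clause matches either zip; sodapy forwards `where` as the $where parameter
results = client.get("5jp2-pgaw",
                     where="loc_admin_zip in('10001', '10002')",
                     limit=5000)
results_df = pd.DataFrame.from_records(results)
results_df.to_csv('test.csv', index=False)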

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x in Python. I have some data (in millions) in CSV files that I have to index in Apache Solr.
I have deployed pysolr module for this purpose
import pysolr
def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
index_module simply gets one row of the data frame as JSON and then indexes it in Solr via the pysolr module. pysolr supports indexing a list of documents instead of a single one, so I have to update my logic to send a list of documents per request instead of one document each. That will definitely improve performance.
How can I achieve this in PySpark? Is there an alternative or better approach than map and toJSON?
Also, all my work happens in transformation functions; I am using count just to start the job. Is there an alternative dummy function (of action type) in Spark that does the same?
Finally, I have to create a Solr object each time; is there an alternative to this?
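One common pattern (a sketch, not taken from the question) is to replace map/count with foreachPartition: it is an action, so no dummy count() is needed, and it hands you an iterator of rows per partition, so you can build one Solr client and send one batched add() per partition. The Solr URI below is an assumption:
import json
import pysolr

SOLR_URI = 'http://localhost:8983/solr/my_core'  # assumed URI

def index_partition(rows):
    # one client per partition instead of one per row
    solr_client = pysolr.Solr(SOLR_URI)
    docs = [json.loads(row) for row in rows]
    if docs:
        solr_client.add(docs)  # pysolr accepts a list of documents

df = spark.read.format("csv").option("header", "true").load("sample.csv")
# foreachPartition is an action, so the job starts without a count()
df.toJSON().foreachPartition(index_partition)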

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
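One note on question (1): per the pandas documentation, data_columns does not have to be True or False for everything; it also accepts a list of column names, so only the few columns you plan to query get indexed. A minimal sketch (file, key, and column names are hypothetical):
# index only the columns that will actually appear in where-clauses
df.to_hdf('output.h5', key='tbl_main', mode='a', append=True, format='table',
          data_columns=['str_col1', 'str_col2'],          # hypothetical query columns
          min_itemsize={'str_col1': 20, 'str_col2': 20})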
This is how you can create an HDF5 file from CSV data using PyTables. You could also use a similar process to create the HDF5 file with h5py.
1. Use a loop to read each CSV file with np.genfromtxt into a np array.
2. After reading the first CSV file, write the data with the .create_table() method, referencing the np array created in Step 1.
3. For additional CSV files, write the data with the .append() method, referencing the np array created in Step 1.
4. End of loop.
Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert it to a datetime object. Note the changes to the genfromtxt() arguments! The data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime

csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float), ('d', float), ('e', 'S20')])

with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
        # modify date in fifth field 'e' (read as 'my_date' from the CSV header)
        for row in csv_data:
            datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        if h5f.__contains__('/CSV_Data'):
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()
    h5f.close()  # not strictly needed; the with block closes the file
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock =['adma','aapl','fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below, with the name of the stock added to the end of the link:
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, as indicated by the [-2] index in the code below:
df2 = pd.DataFrame()
for i in stock:
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = i.upper()  # creating a column with the name of the stock, so I can differentiate between stocks
    df2 = df2.append(df)
It feels like I'm doing a few seconds per iteration and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious if there is a more efficient method. Thanks.
Your current code is blocking: you don't start retrieving the next URL until you are done with the current one. Instead, you can switch to, for example, Scrapy, which is based on Twisted and works asynchronously, processing multiple pages at the same time.
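As an illustration of the same idea with a smaller dependency (a sketch of my own, not part of the answer above, which recommends Scrapy), a thread pool lets several pages download concurrently while the pandas parsing stays unchanged:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def fetch(ticker):
    url = 'http://finviz.com/quote.ashx?t={}'.format(ticker)
    df = pd.read_html(url, header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = ticker.upper()
    return df

# `stock` is the list of ~700 symbols from the question
with ThreadPoolExecutor(max_workers=10) as pool:
    frames = list(pool.map(fetch, stock))

df2 = pd.concat(frames)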

Flask button to save table from query as csv

I have a flask app that runs a query and returns a table. I would like to provide a button on the page so the user can export the data as a csv.
The problem is that the query is generated dynamically based on form input.
@app.route('/report/<int:account_id>', methods=['GET'])
def report(account_id):
    if request.method == 'GET':
        c = g.db.cursor()
        c.execute('SELECT * FROM TABLE WHERE account_id = :account_id', account_id=account_id)
        entries = [dict(title=row[0], text=row[1]) for row in c.fetchall()]
        return render_template('show_results.html', entries=entries)
On the HTML side it's just a simple table, looping over the rows and rendering them. I'm using Bootstrap for styling and included a tablesorter jQuery plugin. None of this is really consequential. I did try one JavaScript exporter I found, but since my content is rendered dynamically, it saves a blank CSV.
Do I need to do some Ajax-style trickery to grab a CSV object from the route?
I solved this myself, and I'm posting it because I think it's valuable for this specific use case within Flask. Here's what I did.
import cx_Oracle  # We are an Oracle shop, and this changes some things
import csv
import StringIO  # allows you to store the response object in memory instead of on disk
from flask import Flask, make_response  # necessary imports, should be obvious

@app.route('/export/<int:identifier>', methods=['GET'])
def export(identifier):
    si = StringIO.StringIO()
    cw = csv.writer(si)
    c = g.db.cursor()
    c.execute('SELECT * FROM TABLE WHERE column_val = :identifier', identifier=identifier)
    rows = c.fetchall()
    cw.writerow([i[0] for i in c.description])  # header row from the cursor description
    cw.writerows(rows)
    response = make_response(si.getvalue())
    response.headers['Content-Disposition'] = 'attachment; filename=report.csv'
    response.headers['Content-type'] = 'text/csv'
    return response
For anyone using Flask with SQLAlchemy, here's an adjustment to tadamhicks' answer, also with a library update:
import csv
from io import StringIO
from flask import make_response
si = StringIO()
cw = csv.writer(si)
records = myTable.query.all() # or a filtered set, of course
# any table method that extracts an iterable will work
cw.writerows([(r.fielda, r.fieldb, r.fieldc) for r in records])
response = make_response(si.getvalue())
response.headers['Content-Disposition'] = 'attachment; filename=report.csv'
response.headers["Content-type"] = "text/csv"
return response
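To tie this back to the dynamic query in the question, here is a sketch of the SQLAlchemy variant inside a parameterized route; the model and field names are hypothetical:
import csv
from io import StringIO
from flask import make_response

@app.route('/export/<int:account_id>', methods=['GET'])
def export_csv(account_id):
    si = StringIO()
    cw = csv.writer(si)
    # hypothetical model: reuse the same account_id that built the report page
    records = MyTable.query.filter_by(account_id=account_id).all()
    cw.writerow(['title', 'text'])                      # header row
    cw.writerows([(r.title, r.text) for r in records])  # hypothetical fields
    response = make_response(si.getvalue())
    response.headers['Content-Disposition'] = 'attachment; filename=report.csv'
    response.headers['Content-type'] = 'text/csv'
    return response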