Facebook NeuralProphet - Generating model file - pandas

Trying to understand if I can use pickle for storing the model in a file system.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
df = pd.read_csv('data.csv')
pipe = NeuralProphet()
pipe.fit(df, freq="D")
pickle.dump(pipe, open('model/pipe_model.pkl', 'wb'))
Question:- Loading multiple CSV files. I have multiple CSV file. How can i dump multiple CSV files in the same pickle file and load later for the prediction?

I think the right answer here is sqlite. SQLite acts like a database but it is stored as a single self-contained file on disk.
The benefit for your use case is that you can append new data as received into a table on the file, then read it as required. The code to do this is as simple as:
import pandas as pd
import sqlite3
# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")
# Replace this with read_csv
df = pd.DataFrame(index = [1, 2, 3], data = [1, 2, 3], columns=['some_data'])
# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
name = 'My Table',
con = con,
if_exists='append'
)
Please be aware that SQLite performance drops after very large numbers of rows, in which case caching the data as parquet files or another fast and compressed format, then reading them all in at training time may be more appropriate.
When you need the data, just read everything from the table:
pd.read_sql('SELECT * from [My Table]', con=con)

Related

using Dask to load many CSV files with different columns

I have many CSV files saved in AWS s3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and than use pd.concat to read it, since this takes a lot of time and has to fit in to the computer memory. Instead, I'm trying to use Dask to load and sum up all of these files, when optional columns should should be treated as zeros.
If all columns where the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have same number of columns, since Dask assumes it can use the first columns for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advanced (for example by writing them as I produce and save all of the small CSV's) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder if this solution is actually faster/better than just directly use pandas? In general, I'd like to know what is the best (mainly fastest) way to tackle this issue?
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import dask.dataframe as dd
from dask import delayed
list_files = [...] # create a list of files inside s3 bucket
list_cols_to_keep = ['col1', 'col2']
#delayed
def standard_csv(file_path):
df = pd.read_csv(file_path)
df = df[list_cols_to_keep]
# add any other standardization routines, e.g. dtype conversion
return df
ddf = dd.from_delayed([standard_csv(f) for f in list_files])
I ended up giving up using Dask since it was too slow and used aws s3 sync to download the data and multiprocessing.Pool to read and concat them:
# download:
def sync_outputs(out_path):
local_dir_path = f"/tmp/outputs/"
safe_mkdir(os.path.dirname(local_dir_path))
cmd = f'aws s3 sync {job_output_dir} {local_dir_path} > /tmp/null' # the last part is to avoid prints
os.system(cmd)
return local_dir_path
# concat:
def read_csv(path):
return pd.read_csv(path,index_col=0)
def read_csvs_parallel(local_paths):
from multiprocessing import Pool
import os
with Pool(os.cpu_count()) as p:
csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(paths)))
return csvs
# all togeter:
def concat_csvs_parallel(out_path):
local_paths = sync_outputs(out_path)
csvs = read_csvs_parallel(local_paths)
df = pd.concat(csvs)
return df
aws s3 sync dowloaded about 1000 files (~1KB each) in about 30 second, and reading than with multiproccesing (8 cores) took 3 seconds, this was much faster than also downloading the files using multiprocessing (almost 2 minutes for 1000 files)

using dask read_csv to read filename as a column name

I am importing 4000+ csv files all with the same columns, columns=['Date', 'Datapint'] the importing the csv's to dask is pretty straight forward and is working fine for me.
file_paths = '/root/data/daily/'
df = dd.read_csv(file_paths+'*.csv',
delim_whitespace=True,
names=['Date','Datapoint'])
The task I am trying to achive is to be able to name the 'Datapoint' column the filename of the .csv. I know you can set a column to the path using include_path_column = True. But I am wondering if there is a simple way use that pathname as a column name with out having to run a separate step down the line.
I was able to do this (fairly straight forward) using dask's delayed function:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
import glob
path = r'/root/data/daily' # use your path
file_list = glob.glob(path + "/*.csv")
def read_and_label_csv(filename):
# reads each csv file to a pandas.DataFrame
df_csv = pd.read_csv(filename,
delim_whitespace=True,
names=['Date','Close'])
df_csv.rename(columns={'Close':path_2_column}, inplace=True)
return df_csv
# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
It is unclear to me what exactly you are trying to accomplish. If you are just trying to change the name of the column that the filepaths are written to, you can set include_path_column='New Column Name'. If you are naming a column based on the path to each file, it seems like you'll get a rather sparse array once the data are concatenated and I would argue that a groupby would probably work better.

Read multiple parquet files in a folder and write to single csv file using python

I am new to python and I have a scenario where there are multiple parquet files with file names in order. ex: par_file1,par_file2,par_file3 and so on upto 100 files in a folder.
I need to read these parquet files starting from file1 in order and write it to a singe csv file. After writing contents of file1, file2 contents should be appended to same csv without header. Note that all files have same column names and only data is split into multiple files.
I learnt to convert single parquet to csv file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I could'nt extend this to loop for multiple parquet files and append to single csv.
Is there a method in pandas to do this? or any other way to do this would be of great help. Thank you.
I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on the size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
pd.read_parquet(parquet_file)
for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
df = pd.read_parquet(parquet_path)
write_header = i == 0 # write header only on the 0th file
write_mode = 'w' if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file that opens the target CSV file in "a+" mode at the onset, keeping the file handle scanned to the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
df = pd.read_parquet(parquet_path)
write_header = i == 0 # write header only on the 0th file
df.to_csv(csv_handle, header=write_header)
I'm having a similar need and I read current Pandas version supports a directory path as argument for the read_csv function. So you can read multiple parquet files like this:
import pandas as pd
df = pd.read_parquet('path/to/the/parquet/files/directory')
It concats everything into a single dataframe so you can convert it to a csv right after:
df.to_csv('csv_file.csv')
Make sure you have the following dependencies according to the doc:
pyarrow
fastparquet
This helped me to load all parquet files into one data frame
import glob
files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f,engine='fastparquet') for f in files]
merged_data = pd.concat(data,ignore_index=True)
If you are going to copy the files over to your local machine and run your code you could do something like this. The code below assumes that you are running your code in the same directory as the parquet files. It also assumes the naming of files as your provided above: "order. ex: par_file1,par_file2,par_file3 and so on upto 100 files in a folder." If you need to search for your files then you will need to get the file names using glob and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a') Hope this helps.
import pandas as pd
# Create an empty csv file and write the first parquet file with headers
with open('csv_file.csv','w') as csv_file:
print('Reading par_file1.parquet')
df = pd.read_parquet('par_file1.parquet')
df.to_csv(csv_file, index=False)
print('par_file1.parquet appended to csv_file.csv\n')
csv_file.close()
# create your file names and append to an empty list to look for in the current directory
files = []
for i in range(2,101):
files.append(f'par_file{i}.parquet')
# open files and append to csv_file.csv
for f in files:
print(f'Reading {f}')
df = pd.read_parquet(f)
with open('csv_file.csv','a') as file:
df.to_csv(file, header=False, index=False)
print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3
a small change for those trying to read remote files, which helps to read it faster (direct read_parquet for remote files was doing this much slower for me):
import io
merged = []
# remote_reader = ... <- init some remote reader, for example AzureDLFileSystem()
for f in files:
with remote_reader.open(f, 'rb') as f_reader:
merged.append(remote_reader.read())
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
Adds a little temporary memory overhead though.
You can use Dask to read in the multiple Parquet files and write them to a single CSV.
Dask accepts an asterisk (*) as wildcard / glob character to match related filenames.
Make sure to set single_file to True and index to False when writing the CSV file.
import pandas as pd
import numpy as np
# create some dummy dataframes using np.random and write to separate parquet files
rng = np.random.default_rng()
for i in range(3):
df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list('ABCD'))
df.to_parquet(f"dummy_df_{i}.parquet")
# load multiple parquet files with Dask
import dask.dataframe as dd
ddf = dd.read_parquet('dummy_df_*.parquet', index=False)
# write to single csv
ddf.to_csv("dummy_df_all.csv",
single_file=True,
index=False
)
# test to verify
df_test = pd.read_csv("dummy_df_all.csv")
Using Dask for this means you won't have to worry about the resulting file size (Dask is a distributed computing framework that can handle anything you throw at it, while pandas might throw a MemoryError if the resulting DataFrame is too large) and you can easily read and write from cloud data storage like Amazon S3.

Import huge data-set from SQL server to HDF5

I am trying to import ~12 Million records with 8 columns into Python.Because of its huge size my laptop memory would not be sufficient for this. Now I'm trying to import the SQL data into a HDF5 file format. It would be very helpful if someone can share a snippet of code that queries data from SQL and saves it in the HDF5 format in chunks.I am open to use any other file format that would be easier to use.
I plan to do some basic exploratory analysis and later on might create some decision trees/Liner regression models using pandas.
import pyodbc
import numpy as np
import pandas as pd
con = pyodbc.connect('Trusted_Connection=yes',
driver = '{ODBC Driver 13 for SQL Server}',
server = 'SQL_ServerName')
df = pd.read_sql("select * from table_a",con,index_col=['Accountid'],chunksize=1000)
Try this:
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)
hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]
for chunk in sql_reader:
store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is I have to aggregate the data into one format, and then dump into HDF5. This is ~1 TB-sized data, so I naturally cannot fit this into RAM. Dask might be the best way to accomplish this task.
If I use parsing my data to fit into one pandas dataframe, I would do this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame() # create empty pandas DataFrame
for i, line in readcsvfile:
# parse create dictionary of key:value pairs by table field:value, "dictionary_line"
# save dictionary as pandas dataframe
df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
total_df = pd.concat([total_df, df]) # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile) # read in file, if csv
# somehow define empty dask dataframe total_df = dd.Dataframe()?
for i, line in readcsvfile:
# parse create dictionary of key:value pairs by table field:value, "dictionary_line"
# save dictionary as pandas dataframe
df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
total_df = da.concatenate([total_df, df]) # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular if the concern is that your data is separated by tabs rather than commas this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have dask.dataframe try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')