I am trying to import ~12 million records with 8 columns into Python. Because of its size, my laptop's memory is not sufficient for this, so I am now trying to move the SQL data into an HDF5 file. It would be very helpful if someone could share a snippet of code that queries data from SQL Server and saves it in HDF5 format in chunks. I am open to any other file format that would be easier to work with.
I plan to do some basic exploratory analysis, and later on I might build some decision tree / linear regression models using pandas.
import pyodbc
import numpy as np
import pandas as pd

con = pyodbc.connect('Trusted_Connection=yes',
                     driver='{ODBC Driver 13 for SQL Server}',
                     server='SQL_ServerName')

df = pd.read_sql("select * from table_a", con, index_col=['Accountid'], chunksize=1000)
Try this:
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)
hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]
for chunk in sql_reader:
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
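Once the store is built, it can be read back selectively rather than loading all 12 million rows at once. A minimal sketch, assuming Accountid is one of the columns you listed in cols_to_index:

# pull only the rows matching a condition on an indexed data column
df_small = pd.read_hdf(hdf_fn, hdf_key, where='Accountid == 12345')

# or iterate over the whole table in manageable chunks
for chunk in pd.read_hdf(hdf_fn, hdf_key, chunksize=10**5):
    print(len(chunk))  # replace with your per-chunk processing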
I want to convert my CSV file to a parquet file. My code below causes my kernel to be killed regardless of the chunksize parameter. I do not know the number of rows and columns in my file, but I suspect that I have many columns.
What is the ideal solution?
With Pandas:
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa
import pyarrow.parquet as pq
csv_file = "kipan_exon.csv.gz"
parquet_file = "kipan_exon.csv.gz"
chunksize = 1000000
df = pd.read_csv(csv_file, sep="\t", chunksize=chunksize, low_memory=False, compression="gzip")
for i, chunk in enumerate(df):
    print("Chunk", i)
    if i == 0:
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression="gzip")
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
parquet_writer.close()
With dask:
df = dd.read_csv(csv_file, sep="\t", compression="gzip", blocksize=None)
df = df.repartition(partition_size="100MB")
df.to_parquet(parquet_file, write_index=False)
Another (more recent) solution is to use a LazyFrame approach in polars:
csv_file = "kipan_exon.csv" # this doesn't work with compressed files right now
parquet_file = "kipan_exon.parquet"  # per @MichaelDelgado's comment: don't reuse the same value as `csv_file`
from polars import scan_csv
ldf = scan_csv(csv_file)
ldf.sink_parquet(parquet_file)
This should work well in memory-constrained situations since the data is not loaded fully, but streamed to the parquet file.
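To sanity-check the output without loading it fully, the resulting parquet file can itself be scanned lazily; a small sketch (the 5-row preview is an arbitrary choice):

from polars import scan_parquet
print(scan_parquet(parquet_file).head(5).collect())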
When using dask for csv to parquet conversion, I'd recommend avoiding .repartition. It introduces additional data shuffling that can strain workers and the scheduler. The simpler approach would look like this:
csv_file = "kipan_exon.csv.gz"
parquet_file = "kipan_exon.parquet"  # per @MichaelDelgado's comment: don't reuse the same value as `csv_file`
from dask.dataframe import read_csv
df = read_csv(csv_file, sep="\t", compression="gzip")
df.to_parquet(parquet_file, write_index=False)
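If you want to confirm the result without pulling it all into memory, something along these lines should work (the row count does trigger a pass over the partitions):

from dask.dataframe import read_parquet
ddf = read_parquet(parquet_file)
print(ddf.npartitions)  # number of output partitions
print(len(ddf))         # total row count, computed partition by partition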
from deepface import DeepFace
import pandas as pd
import os, os.path
from os import path

test = '(path to jpg images)'
csv_records = '(path to save csv records)'

df = pd.DataFrame()
for file in os.listdir(test):
    if file.endswith('.jpg'):
        thisframe = 0
        filename = str(os.path.join(test, str(thisframe) + '.jpg'))
        predictions = DeepFace.analyze(filename, actions=['emotion'])
        df2 = pd.DataFrame(predictions)
        df3 = df.append(df2)
        df3.to_csv(os.path.join(csv_records, 'record.csv'))
        thisframe += 1
I'm making an emotion recognition program. The images are obtained from a program which exports frames in jpg format from real-time footage. I want to pass those images to DeepFace.analyze, combine the 'emotion' part of the analysis for all of the frames into a dataframe, and then export the results to a csv file. So if there are 24 frames, there should be 24 'emotion' results in the csv file.
I've tried searching other questions here but they either don't answer my question or I don't understand them. Thanks in advance.
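For what it's worth, a minimal sketch of the loop described above, collecting one row of emotion scores per frame. It assumes DeepFace.analyze returns its usual 'emotion' scores (newer DeepFace versions wrap the result in a list, which is handled below) and keeps the placeholder paths from the question:

from deepface import DeepFace
import pandas as pd
import os

test = '(path to jpg images)'
csv_records = '(path to save csv records)'

rows = []
for file in sorted(os.listdir(test)):
    if not file.endswith('.jpg'):
        continue
    # enforce_detection=False avoids an exception on frames without a clear face
    result = DeepFace.analyze(os.path.join(test, file), actions=['emotion'], enforce_detection=False)
    if isinstance(result, list):  # newer versions return one dict per detected face
        result = result[0]
    rows.append({'frame': file, **result['emotion']})  # one row of emotion scores per frame

pd.DataFrame(rows).to_csv(os.path.join(csv_records, 'record.csv'), index=False)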
I am trying to understand whether I can use pickle for storing the model on a file system.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
df = pd.read_csv('data.csv')
pipe = NeuralProphet()
pipe.fit(df, freq="D")
pickle.dump(pipe, open('model/pipe_model.pkl', 'wb'))
Question: loading multiple CSV files. I have multiple CSV files. How can I dump multiple CSV files into the same pickle file and load them later for prediction?
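For context, a minimal sketch of one way to keep several fitted models in a single pickle file, assuming one NeuralProphet model per CSV (the file names below are hypothetical):

import pickle
import pandas as pd
from neuralprophet import NeuralProphet

models = {}
for csv_file in ['data1.csv', 'data2.csv']:  # hypothetical file names
    m = NeuralProphet()
    m.fit(pd.read_csv(csv_file), freq="D")
    models[csv_file] = m

# one pickle holding every fitted model, keyed by its source CSV
with open('model/pipe_models.pkl', 'wb') as f:
    pickle.dump(models, f)

# later, load the dict and pick the model you need for prediction
with open('model/pipe_models.pkl', 'rb') as f:
    models = pickle.load(f)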
I think the right answer here is sqlite. SQLite acts like a database but it is stored as a single self-contained file on disk.
The benefit for your use case is that you can append new data to a table in the file as it is received, then read it back as required. The code to do this is as simple as:
import pandas as pd
import sqlite3

# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")

# Replace this with read_csv
df = pd.DataFrame(index=[1, 2, 3], data=[1, 2, 3], columns=['some_data'])

# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
    name='My Table',
    con=con,
    if_exists='append'
)
Please be aware that SQLite performance drops with very large numbers of rows; in that case, caching the data as parquet files (or another fast, compressed format) and reading them all in at training time may be more appropriate, as sketched after the read-back example below.
When you need the data, just read everything from the table:
pd.read_sql('SELECT * from [My Table]', con=con)
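If you do hit that limit, the parquet-caching alternative mentioned above could look roughly like this (the directory layout and batch_id counter are assumptions):

import glob
import pandas as pd

# write each incoming batch to its own parquet file instead of appending to SQLite
df.to_parquet(f'cache/batch_{batch_id}.parquet')  # batch_id: hypothetical running counter

# at training time, read all cached batches back into a single frame
full_df = pd.concat(pd.read_parquet(p) for p in glob.glob('cache/*.parquet'))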
I'm trying to store an ndarray from a pandas data frame in postgres. Putting the ndarrays in a column and using to_sql() stores them very inefficiently. Is there a more efficient way (memory-wise) of doing this?
Note: Of course normalizing the ndarrays into rows in a table would be much better for searching and maybe reduce memory usage, but this is specifically about keeping the ndarray since the structure dimensions are not precisely known beforehand.
Using BytesIO in combination with numpy.save() seems to do the trick. Also, passing explicit types to to_sql ensures bytea is used. Something like:
import io
import numpy as np
import pandas as pd
from sqlalchemy import String, LargeBinary

# file_path is the name to store; blob_data is the ndarray to serialize
df = pd.DataFrame([file_path], columns=["filename"])

f = io.BytesIO()
np.save(f, blob_data)
f.seek(0)
blob = f.read()
df['image'] = [blob]
And then save it like:
df.to_sql(con=engine, name=destination_table_name, schema=destination_schema_name, dtype={"filename": String, "image": LargeBinary})
To read it back do something like:
df2 = pull_dataframe_from_postgres_function()
f = io.BytesIO()
f.write(df2["image"][0])
f.seek(0)
data = np.load(f) # data comes back as an ndarray
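A quick round-trip sanity check, assuming blob_data (the original array) is still in scope:

assert isinstance(data, np.ndarray)
assert np.array_equal(blob_data, data)  # stored and reloaded arrays should match exactly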
I have read a lot about Spark's memory usage when doing things like collect() or toPandas() (like here). The common wisdom is to use them only on small datasets. The question is: how small can Spark actually handle?
I run pyspark locally (for testing) with the driver memory set to 20g (I have 32g on my 16-core Mac), but toPandas() crashes even with a dataset as small as 20K rows! That cannot be right, so I suspect I am getting some setting wrong. This is the simplified code to reproduce the error:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# setting the number of rows for the CSV file
N = 20000
ncols = 7
c_name = 'ABCDEFGHIJKLMNOPQRSTUVXYWZ'
# creating a pandas dataframe (df)
df = pd.DataFrame(np.random.randint(999,999999,size=(N, ncols)), columns=list(c_name[:ncols]))
file_name = 'random.csv'
# export the dataframe to csv using comma delimiting
df.to_csv(file_name, index=False)
## Load the csv in spark
df = spark.read.format('csv').option('header', 'true').load(file_name)#.limit(5000)#.coalesce(2)
## some checks
n_parts = df.rdd.getNumPartitions()
print('Number of partitions:', n_parts)
print('Number of rows:', df.count())
## convert spark df -> toPandas
df_p = df.toPandas()
print('With pandas:',len(df_p))
I run this within Jupyter and get errors like:
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.0.104:61536
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
...
My spark local setting is (everything else default):
('spark.driver.host', '192.168.0.104')
('spark.driver.memory', '20g')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.app.id', 'local-1618499935279')
('spark.driver.port', '55115')
('spark.ui.showConsoleProgress', 'true')
('spark.app.name', 'pyspark-shell')
('spark.driver.maxResultSize', '4g')
Is my setup wrong, or is it expected that even 20g of driver memory can't handle a small dataframe with 20K rows and 7 columns? Will repartitioning help?