I'm trying to store an ndarray from a pandas DataFrame in Postgres. Putting the ndarrays in a column and using to_sql() stores them very inefficiently. Is there a more efficient way (memory-wise) of doing this?
Note: of course, normalizing the ndarrays into rows in a table would be much better for searching and might reduce memory usage, but this is specifically about keeping the ndarray intact, since its dimensions are not precisely known beforehand.
Using BytesIO in combination with numpy.save() seems to do the trick. Also, passing explicit types to to_sql ensures that bytea is used. Something like:
import io
import numpy as np
import pandas as pd
from sqlalchemy import String, LargeBinary
df = pd.DataFrame([file_path], columns=["filename"])

# Serialize the ndarray to raw bytes with np.save (preserves dtype and shape).
f = io.BytesIO()
np.save(f, blob_data)
f.seek(0)
blob = f.read()
df['image'] = [blob]
And then save it like:
df.to_sql(con=engine, name=destination_table_name, schema=destination_schema_name, dtype={"filename": String, "image": LargeBinary})
To read it back, do something like:
df2 = pull_dataframe_from_postgres_function()
f = io.BytesIO()
f.write(df2["image"][0])
f.seek(0)
data = np.load(f)  # data comes back as an ndarray
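For reuse, the save and load steps can be wrapped in a pair of small helpers (a sketch built only from the pieces shown above):
import io

import numpy as np

def ndarray_to_bytes(arr):
    """Serialize an ndarray to raw bytes with np.save (keeps dtype and shape)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def bytes_to_ndarray(blob):
    """Restore an ndarray from the bytes read out of the bytea column."""
    return np.load(io.BytesIO(blob))
So the column assignment above becomes df['image'] = [ndarray_to_bytes(blob_data)], and reading back becomes data = bytes_to_ndarray(df2["image"][0]).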
I have some PySpark code that aims to run a machine learning model trained in sklearn on a PySpark DataFrame; it looks like this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pyspark.sql.functions import pandas_udf

# Train a scikit-learn model on random data.
X = np.random.rand(1000, 100)
y = np.random.randint(2, size=1000)
tree = RandomForestRegressor(n_jobs=4)
tree.fit(X, y)

# Put the features into a Spark DataFrame.
pdf = pd.DataFrame(X)
df = spark.createDataFrame(pdf)

# Input/output are both pandas.Series of doubles.
@pandas_udf('double')
def pandas_plus_one(*args):
    return pd.Series(tree.predict(pd.concat([args[i] for i in range(100)], axis=1)))

df = df.withColumn('result', pandas_plus_one(*[df[i] for i in range(100)]))
My question is: is this the most efficient way to do this with PySpark? In particular, I would like to avoid having to do pd.concat, which involves copying all the Series (which were probably adjacent in memory anyway) into a new pandas DataFrame inside the UDF function. The ideal solution would be for the Pandas UDF to accept a DataFrame as an input, but I haven't found a way to make it work.
Note: I am not looking for solutions that involve SparkML, scikit-spark, etc.
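Not a definitive answer, but one direction worth trying, assuming Spark 3.0+ is available: DataFrame.mapInPandas hands the function whole pandas DataFrames batch by batch, so there is no per-column Series reassembly with pd.concat inside the UDF. A minimal sketch built on the tree and df defined above:
from pyspark.sql.types import DoubleType, StructField, StructType

def predict_batch(pdf_iter):
    # Each element of the iterator is a pandas DataFrame holding one batch of rows.
    for pdf in pdf_iter:
        yield pdf.assign(result=tree.predict(pdf))

# Output schema: the original columns plus the prediction column.
out_schema = StructType(df.schema.fields + [StructField("result", DoubleType())])
result_df = df.mapInPandas(predict_batch, schema=out_schema)
Whether this is actually faster than the pandas_udf version would need benchmarking on the real data.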
I have spent days trying to figure this out; hopefully someone can help.
I am loading a .mat file into Python using scipy.io and placing the struct into a DataFrame, which will then be used in TensorFlow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is 43249x6. Each cell in the 'A' column is a different-sized matrix, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers; 'nz' and 'np' I have no issues with.
I want to put all the columns into a DataFrame and use column 'A', 'Ln', or 'types' as the labels and 'nz' and 'np' as features; again, I do not have issues with the latter. Can anyone help with this, or suggest some kind of workaround?
The end goal is to have TensorFlow train on 'nz' and 'np' and give me either a matrix, an 'Ln', or a 'types' value.
What type of data does your .mat file contain? Is your application very time-critical?
If you can collect all your data in a struct, you could give jsonencode a try: turn the struct into a JSON file and load it back into Python via the json module (see the json documentation on loading data).
Then you can create a pandas DataFrame via pd.DataFrame.from_dict().
Of course, this would only be a workaround. You would still have to ensure the data in the MATLAB struct is correctly ordered before it is exported and transferred to a DataFrame.
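On the Python side, that round trip could look roughly like this (a sketch; 'graphs.json' stands in for whatever file jsonencode wrote, and the exact shape of the decoded JSON depends on your struct):
import json

import pandas as pd

# Load the JSON that MATLAB's jsonencode produced from the struct.
with open('graphs.json') as fh:
    decoded = json.load(fh)

# If the struct decodes to a dict of field -> list, from_dict builds the frame directly.
df = pd.DataFrame.from_dict(decoded)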
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']

# Pull two fields of the struct into a small label frame.
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work with TensorFlow, because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs and then load them into StellarGraph.
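That conversion could look roughly like this (a sketch; it assumes networkx and stellargraph are installed and that each cell of the 'A' column holds a dense square adjacency matrix):
import networkx as nx
from stellargraph import StellarGraph

graphs = []
for A in data['A'][0:1000]:
    g = nx.from_numpy_array(A)                    # adjacency matrix -> networkx graph
    graphs.append(StellarGraph.from_networkx(g))  # networkx graph -> StellarGraph object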
I created a Series with a 3-level MultiIndex. Later, I applied the unstack method followed by the stack method, then checked the new and old objects for equality. Why are they different? Is unstacking not the opposite of stacking? Here is my code:
import numpy as np
import pandas as pd
data = pd.Series([7]*9, index = [[1,2,3,2,4,9,6,7,9], ['a','c','f','a', 'k','f','c','d','a'], [np.nan]*9])
data2 = data.unstack().stack()
print(data2.equals(data))
The output is False, but I don't know why!
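Part of the answer is that unstack followed by stack is not an exact inverse in general: unstack fills missing index combinations with NaN (which promotes integer data to float), stack then drops those NaNs again, and the result comes back sorted; the all-NaN index level in the code above adds further complications of its own. A smaller example already shows the round trip failing equals():
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')]))

round_trip = s.unstack().stack()
print(s.dtype, round_trip.dtype)  # int64 float64 -- the NaN fill promoted the dtype
print(s.equals(round_trip))       # False, even though the stored values match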
I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so naturally I cannot fit it into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into one pandas DataFrame, I would do this:
import csv

import pandas as pd

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)

total_df = pd.DataFrame()  # create an empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save the dictionary as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # grows into one big DataFrame
To do the same task with dask, it appears users should try something like this:
import csv

import dask.array as da
import dask.dataframe as dd
import pandas as pd

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]  # define columns
readcsvfile = csv.reader(csvfile)  # read in file, if csv

# somehow define an empty dask dataframe here, total_df = dd.DataFrame()?
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save the dictionary as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = da.concatenate([total_df, df])  # creates one big dataframe
After creating a ~1 TB dataframe, I will save it into HDF5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular, if the concern is just that your data is separated by tabs rather than commas, this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
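For example, a tab-separated layout could be read lazily in one call (a sketch; the glob pattern and header handling are placeholders, and csv_columns is the column list from the question):
import dask.dataframe as dd

# Lazily reads every matching file in chunks; nothing is pulled into RAM yet.
df = dd.read_csv('myfiles.*.tsv', sep='\t', names=csv_columns, header=None)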
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)  # parse is your own record-parsing function
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have dask.dataframe try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
I am trying out different things to make NLTK's naive Bayes classifier work using the NLTK and pandas modules, but I am getting a "too many values to unpack" error.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import nltk

### Remove cases with missing name or missing ethnicity information
def read_file():
    data = pd.read_csv(r"C:\sample.csv", encoding="utf-8")
    frame = DataFrame(data)
    frame.columns = ["Name", "Gender"]
    return frame
#read_file()

def gender_features(word):
    return {'last_letter': word[-1]}
#gender_features()

frame = read_file()
featuresets = [(gender_features(n), gender) for (n, gender) in frame]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
I suspect you are trying to do something bigger than name classification when using pandas.DataFrame, because the DataFrame object is normally used when you have limited RAM and want to make use of disk space as you iterate through the data to extract features:
a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
I suggest you go through the pandas tutorial to learn about the library first: http://pandas.pydata.org/pandas-docs/dev/tutorials.html
And then learn about the NLTK classification from http://www.nltk.org/book/ch06.html
Firstly, there are several things wrong with how you access the pandas.DataFrame object.
To iterate through the rows of the dataframe, you should do this:
# Read the file into a pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']

for index, row in df.iterrows():
    print(row['name'], row['gender'])
Next, to train a classifier, you should do this:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from nltk.corpus import names
from nltk.classify import NaiveBayesClassifier as nbc

# Create a sample.csv file of "name,gender" rows.
male_names = [','.join([i, 'm']) for i in names.words('male.txt')]
female_names = [','.join([i, 'f']) for i in names.words('female.txt')]
with open('sample.csv', 'w') as fout:
    fout.write('\n'.join(male_names + female_names))

# Feature extractor function.
def gender_features(word):
    return {'last_letter': word[-1]}

# Read the file into a pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']

# Extract features.
featuresets = [(gender_features(name), gender) for index, (name, gender) in df.iterrows()]

# Split into train and test sets.
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a classifier.
classifier = nbc.train(train_set)

# Test the classifier on "Neo".
print(classifier.classify(gender_features('Neo')))
[out]:
m
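To see how well the classifier generalizes, the held-out test_set can be scored with NLTK's accuracy helper (a small addition, not part of the original answer):
import nltk

# Fraction of the held-out names whose gender is predicted correctly.
print(nltk.classify.accuracy(classifier, test_set))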