MATLAB HDF5 to Dask Dataframe Not Supported Yet? - pandas

I am pulling a dataset out of a MATLAB mat file which is of HDF5 format as shown below:
matfile = 'C:\\....\\dataStuff.mat'
f = h5py.File(matfile, 'r')
data = f['/' + stuff + '/data/'][()]  # Dataset.value was removed in h5py 3.0; [()] reads the whole dataset
df = pd.DataFrame(data) # How do I create a Dask DF instead from data?
How do I do the same thing but instead of using Pandas, I create a Dask Dataframe?
The below function gives me an error:
ddf = dd.read_hdf(matfile, 'key')
the HDF5 class H5T_COMPOUND is not supported yet
I could just convert the Pandas DF into a Dask DF as shown below, but that step takes another 2 minutes. I would like to skip it by pulling the HDF5 data directly into a Dask Dataframe, as I did with Pandas.
df = dd.from_pandas(df, npartitions=3) # What I don't want to do

Related

Stack dataframes vertically ignoring rows with duplicate timestamps [duplicate]

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd
# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop?
See pandas: IO tools for all of the available .read_ methods.
Try the following code if all of the CSV files have the same columns.
I have added header=0, so that after reading the CSV file's first row, it can be assigned as the column names.
import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files' # use your path
# no leading slash in the pattern: os.path.join discards `path` if the next part is absolute
all_files = glob.glob(os.path.join(path, "*.csv"))
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Or, with attribution to a comment from Sid.
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
pathlib from the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.
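As a small illustration of what "paths as objects" means in practice, the sketch below pulls the pieces of a path apart through attributes rather than string slicing. The file name sensor_01.csv is made up for the example, and PureWindowsPath is used only so the snippet behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical file name, used only to illustrate the attributes.
p = PureWindowsPath(r'C:\DRO\DCL_rawdata_files\sensor_01.csv')

name = p.name        # file name with extension
stem = p.stem        # file name without extension
suffix = p.suffix    # extension, including the dot
parent = p.parent    # the containing directory, still a path object

# Derive a sibling file name without any string slicing.
sibling_name = p.with_suffix('.parquet').name
```

In the options that follow, f.stem is the attribute used to label each dataframe.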
Imports and Setup
from pathlib import Path
import pandas as pd
import numpy as np
path = r'C:\DRO\DCL_rawdata_files' # or unix / linux / mac path
# Get the files from the path provided in the OP
files = Path(path).glob('*.csv') # .rglob to get subdirectories
Option 1:
Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that gives the filename without the extension
    data['file'] = f.stem
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 2:
Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 3:
Create the dataframes with a list comprehension, and then use np.repeat to add a new column.
[f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
[len(df) for df in dfs] creates a list of lengths
Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]
# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)
# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
Option 4:
One liners using .assign to create the new column, with attribution to a comment from C8H10N4O2
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)
or
df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
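One thing to watch for when trying several of these options in the same session: Path.glob returns a generator, so the files object from the setup can only be looped over once. The sketch below shows the effect with two throwaway files in a temporary directory (the file names are invented for the demonstration) and one possible workaround, which is to materialise the generator into a sorted list.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Two throwaway CSV files, purely for demonstration.
    for name in ('a.csv', 'b.csv'):
        (Path(tmp) / name).write_text('x\n1\n')

    files = Path(tmp).glob('*.csv')          # a generator
    first_pass = sorted(f.name for f in files)
    second_pass = [f.name for f in files]    # already exhausted, so empty

    # Materialising the generator makes it reusable (and gives a stable order).
    files = sorted(Path(tmp).glob('*.csv'))
    reusable = [f.name for f in files]
    reusable_again = [f.name for f in files]
```

If an option above unexpectedly produces an empty result or a "No objects to concatenate" error, re-creating files (or using a list as shown) is the first thing to check.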
An alternative to darindaCoder's answer:
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using everything Pandas and Python (all versions) already have built in.
For a few files - one-liner
df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))
For many files
import os
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
For No Headers
If you have specific things you want to change with pd.read_csv (i.e., no headers) you can make a separate function and call that with your map:
def f(i):
    return pd.read_csv(i, header=None)
df = pd.concat(map(f, filepaths))
This pandas line, which sets the df, uses three things:
Python's map(function, iterable) passes each element of the iterable (every CSV file path in filepaths) to the function (pd.read_csv()).
pandas' read_csv() function reads in each CSV file as normal.
pandas' concat() brings all of these together under one df variable.
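To make the role of map concrete, here is a minimal sketch that uses in-memory text in place of real CSV files (the contents are made up). It suggests that the map form and an explicit generator expression produce the same frame, so the choice between them is mostly a matter of taste.

```python
import io
import pandas as pd

# In-memory stand-ins for CSV files; the contents are illustrative only.
csv_texts = ['a,b\n1,2\n', 'a,b\n3,4\n']

def make_sources():
    # Fresh file-like objects each time, since a StringIO can only be read once.
    return [io.StringIO(text) for text in csv_texts]

# map applies pd.read_csv to every source lazily; concat consumes the results.
via_map = pd.concat(map(pd.read_csv, make_sources()))

# The same thing written as a generator expression.
via_genexp = pd.concat(pd.read_csv(src) for src in make_sources())

same_result = via_map.equals(via_genexp)
```

Note that without ignore_index=True the row labels of each input are kept, so the combined index here repeats 0.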
Easy and Fast
Import two or more CSV files without having to make a list of names.
import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
I googled my way into Gaurav Singh's answer.
However, as of late, I am finding it faster to do any manipulation using NumPy and then assigning it once to a dataframe rather than manipulating the dataframe itself on an iterative basis and it seems to work in this solution too.
I do sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and making it less readable.
You can leverage NumPy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))
np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    np_array_list.append(df.to_numpy())
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1", "col2"....]
Timing statistics:
total files :192
avg lines per file :8492
--approach 1 without NumPy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with NumPy -- 2.289292573928833 seconds ---
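One caveat with the NumPy route, sketched below with two made-up frames: when the columns have mixed types, the stacked array falls back to the object dtype, so the rebuilt dataframe loses its per-column dtypes (and its column names, unless they are passed back in). Restoring the dtypes afterwards is possible, but it is an extra step that the timings above do not include.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frames with one integer and one string column.
df1 = pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']})
df2 = pd.DataFrame({'id': [3], 'label': ['c']})

# Stack the raw values, then rebuild a frame around them.
stacked = np.vstack([df1.to_numpy(), df2.to_numpy()])
via_numpy = pd.DataFrame(stacked, columns=df1.columns)

# The plain pandas route keeps the integer dtype of 'id'.
via_concat = pd.concat([df1, df2], ignore_index=True)

# One way to restore the original dtypes after the NumPy round trip.
restored = via_numpy.astype(df1.dtypes.to_dict())
```

For all-numeric files of a single dtype this concern does not apply, which may be the situation the timings were measured in.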
A one-liner using map, but if you'd like to specify additional arguments, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                   glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional arguments.
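For comparison, the sketch below binds the extra argument in two ways, with functools.partial and with a lambda, using in-memory pipe-separated text instead of real files (the data is invented). Both should give the same result; partial has the small advantage of producing a named, reusable reader.

```python
import functools
import io
import pandas as pd

# In-memory stand-ins for pipe-separated files; contents are illustrative.
texts = ['a|b\n1|2\n', 'a|b\n3|4\n']

# partial pre-fills sep='|', leaving a one-argument function that map can call.
read_pipe = functools.partial(pd.read_csv, sep='|')
df_partial = pd.concat(
    map(read_pipe, (io.StringIO(t) for t in texts)),
    ignore_index=True,
)

# A lambda achieves the same binding inline.
df_lambda = pd.concat(
    map(lambda src: pd.read_csv(src, sep='|'), (io.StringIO(t) for t in texts)),
    ignore_index=True,
)
```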
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the three last lines can be expressed in one single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
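The behaviour of ** is easy to get wrong, so here is a small self-contained check that builds a throwaway directory tree (the file and folder names are invented) and compares three patterns. With recursive=True, ** matches zero or more directory levels; without the flag it behaves like a plain *, which matches exactly one level.

```python
import tempfile
from glob import glob, iglob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub' / 'deeper').mkdir(parents=True)
    # One CSV at each depth of the throwaway tree.
    for rel in ('top.csv', 'sub/mid.csv', 'sub/deeper/low.csv'):
        (root / rel).write_text('x\n1\n')

    pattern = str(root / '**' / '*.csv')

    # Non-recursive pattern: only the top-level file.
    flat = sorted(Path(p).name for p in glob(str(root / '*.csv')))

    # Recursive search: files at every depth, including the top level.
    recursive = sorted(Path(p).name for p in iglob(pattern, recursive=True))

    # Same pattern without the flag: ** acts like *, i.e. exactly one level down.
    without_flag = sorted(Path(p).name for p in glob(pattern))
```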
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', '*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
Inspired from MrFun's answer:
import glob
import pandas as pd
list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()
df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)
Notes:
By default, the list of files returned by glob.glob is not sorted. In many scenarios it needs to be sorted, e.g. when analyzing the number of sensor frame drops vs. timestamp.
In the pd.concat command, if ignore_index=True is not specified, it preserves the original indices from each dataframe (i.e. each individual CSV file in the list) and the main dataframe looks like
timestamp id valid_frame
0
1
2
.
.
.
0
1
2
.
.
.
With ignore_index=True, it looks like:
timestamp id valid_frame
0
1
2
.
.
.
108
109
.
.
.
IMO, this is helpful when you want to manually create a histogram of the number of frame drops vs. one-minute (or any other duration) bins and want to base the calculation on the very first timestamp, e.g.
begin_timestamp = df['timestamp'][0]
Without ignore_index=True, df['timestamp'][0] returns a Series containing the first timestamp from every individual dataframe rather than a single value.
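The difference can be reproduced with two tiny frames (the timestamp values are made up). Because [0] on a Series with an integer index is a label lookup, a repeated label 0 returns every matching row. The sketch also shows .iloc[0], which is positional and so returns a single value either way.

```python
import pandas as pd

# Two small stand-ins for per-file dataframes; values are illustrative.
part1 = pd.DataFrame({'timestamp': [100, 101]})
part2 = pd.DataFrame({'timestamp': [200, 201]})

kept = pd.concat([part1, part2])                       # index: 0, 1, 0, 1
reset = pd.concat([part1, part2], ignore_index=True)   # index: 0, 1, 2, 3

first_kept = kept['timestamp'][0]          # label 0 appears twice -> a Series
first_reset = reset['timestamp'][0]        # label 0 is unique -> a scalar
first_by_position = kept['timestamp'].iloc[0]   # positional -> a scalar
```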
Another one-liner with list comprehension which allows to use arguments with read_csv.
df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
Alternative using the pathlib library (often preferred over os.path).
This method avoids iterative use of pandas concat()/append().
From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
import pandas as pd
from pathlib import Path
dir = Path("../relevant_directory")
df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
If multiple CSV files are zipped, you may use zipfile to read all and concatenate as below:
import zipfile
import pandas as pd
ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
train = []
train = [ pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() ]
df = pd.concat(train)
Based on Sid's good answer.
To identify issues of missing or unaligned columns
Before concatenating, you can load CSV files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.
Import modules and locate file paths:
import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.
Load CSV files into a dictionary. Then concatenate:
dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)
Keys are file names f and values are the data frame content of CSV files.
Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
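A minimal sketch of how such a dictionary might be used to spot unaligned columns before concatenating. The paths and CSV contents are invented, in-memory text stands in for real files, and the first file is arbitrarily treated as the reference layout.

```python
import io
import os
import pandas as pd

# Hypothetical paths mapped to in-memory CSV text; note the 'val' vs 'value' mismatch.
sources = {
    '/data/jan.csv': 'id,value\n1,10\n',
    '/data/feb.csv': 'id,val\n2,20\n',
}

# Key each dataframe by the short file name rather than the full path.
dict_of_df = {
    os.path.basename(path): pd.read_csv(io.StringIO(text))
    for path, text in sources.items()
}

# Treat the first file's columns as the reference and list what differs elsewhere.
reference = set(dict_of_df['jan.csv'].columns)
mismatched = {
    name: sorted(set(df.columns) ^ reference)
    for name, df in dict_of_df.items()
    if set(df.columns) != reference
}

# Concatenating a dict uses its keys as the outer index level.
combined = pd.concat(dict_of_df, sort=True)
```

After the concat, combined.loc['jan.csv'] selects the rows that came from that file, and columns missing from a file show up as NaN.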
import os
os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")
Here NR and FNR are both line counters.
NR is the line number counted across all input files, while FNR is the line number within the current file.
NR == 1 keeps the first line of the first file (the header), while FNR > 1 keeps every line except the first line of each file, so the repeated headers are skipped.
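Where awk is not available (for example on a stock Windows machine), the same header-skipping merge can be sketched in plain Python. The function name merge_csv_keep_first_header is made up for this example, and the sketch assumes every file starts with the same header line and that the first file is not empty.

```python
import tempfile
from pathlib import Path

def merge_csv_keep_first_header(paths, out_path):
    """Copy all lines of the first file, then all but the first line of the rest."""
    with open(out_path, 'w', newline='') as out:
        for file_number, path in enumerate(paths):
            with open(path, newline='') as src:
                for line_number, line in enumerate(src):
                    # Skip the header of every file after the first one.
                    if file_number > 0 and line_number == 0:
                        continue
                    out.write(line)

# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / 'file1.csv'
    second = Path(tmp) / 'file2.csv'
    first.write_text('id,v\n1,a\n')
    second.write_text('id,v\n2,b\n')

    merged = Path(tmp) / 'merged.csv'
    merge_csv_keep_first_header([first, second], merged)
    merged_text = merged.read_text()
```

Like the awk command, this streams line by line, so memory use stays small regardless of file size.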
In case of an unnamed column issue, use this code for merging multiple CSV files along the x-axis.
import glob
import os
import pandas as pd
merged_df = pd.concat([pd.read_csv(csv_file, index_col=0, header=0) for csv_file in glob.glob(
    os.path.join("data/", "*.csv"))], axis=0, ignore_index=True)
merged_df.to_csv("merged.csv")
You can do it this way also:
import pandas as pd
import os
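frames = []
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        # join with the directory being walked so files in subfolders resolve correctly
        complete_file_path = os.path.join(r, file)
        frames.append(pd.read_csv(complete_file_path))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
new_df = pd.concat(frames, ignore_index=True)
new_df.shape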
Consider using convtools library, which provides lots of data processing primitives and generates simple ad hoc code under the hood.
It is not supposed to be faster than pandas/polars, but sometimes it can be.
e.g. you could concat csv files into one for further reuse - here's the code:
import glob
from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd
def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
    # took 20.9 s
def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)
    table.into_csv("out_convtools.csv", include_header=False)
    # took 15.8 s
Of course if you just want to obtain a dataframe without writing a concatenated file, it will take 4.63 s and 10.9 s correspondingly (pandas is faster here because it doesn't need to zip columns for writing it back).
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")
file_iter = iter(file_path_list)
list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))
for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)
This is how you can do it using Colaboratory on Google Drive:
import pandas as pd
import glob
path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

How to convert .fits file/ astropy table to pandas df and reverse?

I frequently need to convert pandas dfs to .fits files and also the reverse: I am given .fits files that are easier for me to process as pandas dfs. What is the most straightforward way to convert between these formats?
# 1.) how to convert a pandas df
# (or almost any ascii data file) to astropy table
# and save the table as .fits file
from astropy.table import Table
import pandas as pd
df = pd.read_csv('some_file.dat', sep=r'\s+', skiprows=7, names=['A', 'B', 'C'])  # sep=r'\s+' replaces the deprecated delim_whitespace=True
t = Table.from_pandas(df)
t.write('some_file.fits')
# 2.) how to convert a fits file (or astropy table) to pandas df
dat = Table.read('some_file.fits')
df = dat.to_pandas()
# 3.) it is like 1.) but without a df as intermediate step -
# how to read a csv/ascii file directly as astropy table
# and save it as .fits
table = Table.read('some_file.dat', format='ascii.tab')
table.write('some_file.fits', overwrite=True)

create a dask dataframe from a dictionary

I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a dask dataframe from it. How do I do it? Normally, in Pandas, it can easily be turned into a Pandas df by:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens, 'LocalFreq': local_freq_list,
'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?
If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)
def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])
processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts a RDD object of pandas DFs (Assumes same columns) and returns a single Spark DF.
I solved this by writing a function like this:
from pyspark.sql import Row
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]
    rows = list()
    for index, series in pd_df.iterrows():
        # Takes a row of a df, exports it as a dict, and then passes an unpacked-dict into the Row constructor
        row_dict = {str(k):v for k,v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
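One candidate for the "something more efficient" is to let pandas build all the row dictionaries in a single call instead of iterating with iterrows. The sketch below shows only the pandas side, with a made-up frame whose column labels are deliberately not all strings. It has not been benchmarked against the original, and the final Row(**record) step would still happen inside the flatMap function as before.

```python
import pandas as pd

# Illustrative frame; the integer column label 1 shows why str(k) is still needed.
pd_df = pd.DataFrame({1: ['x', 'y'], 'n': [10, 20]})

# One call produces a list with one dict per row.
raw_records = pd_df.to_dict(orient='records')

# Row(**kwargs) needs string keys, so normalise the column labels.
records = [{str(k): v for k, v in rec.items()} for rec in raw_records]
```

Each element of records could then be unpacked into Row(**record) in place of the per-row Series conversion.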
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is I have to aggregate the data into one format, and then dump into HDF5. This is ~1 TB-sized data, so I naturally cannot fit this into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into a single pandas dataframe, I would do this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame() # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse create dictionary of key:value pairs by table field:value, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
    total_df = pd.concat([total_df, df]) # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile) # read in file, if csv
# somehow define empty dask dataframe total_df = dd.Dataframe()?
for i, line in enumerate(readcsvfile):
    # parse create dictionary of key:value pairs by table field:value, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
    total_df = da.concatenate([total_df, df]) # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular if the concern is that your data is separated by tabs rather than commas this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
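As a quick illustration of the sep keyword, the sketch below reads tab-separated text with plain pandas. The column names and values are invented, and in-memory text stands in for a real file. It also uses chunksize, a standard read_csv option that yields the data in pieces, which can help when a single file is too large to load at once.

```python
import io
import pandas as pd

# In-memory stand-in for a tab-separated file; contents are illustrative.
tsv_text = 'COL1\tCOL2\n1\ta\n2\tb\n3\tc\n'

chunk_sizes = []
column_names = None
# With chunksize set, read_csv returns an iterator of dataframes.
for chunk in pd.read_csv(io.StringIO(tsv_text), sep='\t', chunksize=2):
    column_names = list(chunk.columns)
    chunk_sizes.append(len(chunk))
```

Each chunk is an ordinary dataframe, so it can be cleaned and written out before the next one is read.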
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have dask.dataframe try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')