dask read parquet and specify schema - pandas

Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow?
I have a bunch of parquet files in a bucket but some of the fields have slightly inconsistent names. I could create a custom delayed function to handle these cases after reading them but I'm hoping I could specify the schema when opening them via globing. Maybe not though as I guess opening then via globing is going to try and concatenate them. This currently fails because of the inconsistent field names.
Create a parquet file:
import dask.dataframe as dd
df = dd.demo.make_timeseries(
start="2000-01-01",
end="2000-01-03",
dtypes={"id": int, "z": int},
freq="1h",
partition_freq="24h",
)
df.to_parquet("df.parquet", engine="pyarrow", overwrite=True)
Read it in via dask and specify the schema after reading:
df = dd.read_parquet("df.parquet", engine="pyarrow")
df["z"] = df["z"].astype("float")
df = df.rename(columns={"z": "a"})
Read it in via spark and specify the schema:
from pyspark.sql import SparkSession
import pyspark.sql.types as T
spark = SparkSession.builder.appName('App').getOrCreate()
schema = T.StructType(
[
T.StructField("id", T.IntegerType()),
T.StructField("a", T.FloatType()),
T.StructField("timestamp", T.TimestampType()),
]
)
df = spark.read.format("parquet").schema(schema).load("df.parquet")

Some of the options are:
Specify dtypes after loading (requires consistent column names):
custom_dtypes = {"a": float, "id": int, "timestamp": pd.datetime}
df = dd.read_parquet("df.parquet", engine="pyarrow").astype(custom_dtypes)
This currently fails because of the inconsistent field names.
If the column names are not the same across files, you might want to use a custom delayed before loading:
#delayed
def custom_load(path):
df = pd.read_parquet(path)
# some logic to ensure consistent columns
# for example:
if "z" in df.columns:
df = df.rename(columns={"z": "a"}).astype(custom_dtypes)
return df
dask_df = dd.from_delayed([custom_load(path) for path in glob.glob("some_path/*parquet")])

Related

pandas to_sql with Exasol

When I use to_sql to upload dataframe to exasol and specify if_exists='replace', the default string data type is 'text', which is not supported by Exasol. I think Varchar is the right type. How could I make to_sql to create table with Varchar rather than Text?
I know it's not 100% what you are asking for but I would suggest to use the pyexasol package for communication between Pandas and Exasol. Deleting and following uploading then works like
import pyexasol
import _config as config
# Connect with compression enabled
C = pyexasol.connect(dsn=config.dsn, user=config.user,
password=config.password, schema=config.schema,
compression=True)
C.execute('TRUNCATE TABLE users_copy')
# Import from pandas DataFrame into Exasol table
C.import_from_pandas(pd, 'users_copy')
stmt = C.last_statement()
print(f'IMPORTED {stmt.rowcount()} rows in
{stmt.execution_time}s')
C.close()
Problems with varchar do not appear.

Prepare a csv file for process mining

hope you are doing well !
I was following tutorials for process mining using 'PM4PY', but I found difficulties in the csv file ,
in my csv file I have this columns : 'id', 'status', 'mailID', 'date'.... ('status' is same as 'activity' that contain some specific choises )
my csv file contains a lot of data.
to follow process mining tutorial I must have in my columns something like 'case:concept:name' ... but I don't know how can I make it
In your case, I assume 'id' would be the same as the Case ID in normal process mining terminology. Similarly, 'status' corresponds to Activity ID and 'date' would correspond to the timestamp.
The best option is to first read into a pandas dataframe before feeding into PM4Py.
For a detailed understanding of how to do this, here is an example below. As you have not mentioned all the columns that you have in your csv file, let us assume that currently you only have [ 'id', 'status', 'date' ] as your column list. The following code can be adapted to any number of columns you have (by adding them to the list named cols) :
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
path = '' # Enter path to the csv file
data = pd.read_csv(path)
cols = ['case:concept:name','concept:name','time:timestamp']
data.columns = cols
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)
log = log_converter.apply(data, variant=log_converter.Variants.TO_EVENT_LOG)
Here we have changed the column names and their datatypes as required by the PM4Py package. Convert this dataframe into an event log using the log_converter function. Now you can perform your regular process mining tasks on this event log object. For instance, if you wish to create a Directly-Follows Graph from the event log, you can use the following line of code :
from pm4py.algo.discovery.dfg import algorithm as dfg_algorithm
dfg = dfg_algorithm.apply(log)
first you need import your csv file using pandas, then convert to an event log object, finally you can use in pm4py.
reference:
https://pm4py.fit.fraunhofer.de/documentation

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item to df1, 2nd to df2, etc. But upon trying to do this I just realized - I have no idea how to change the variable being assigned when doing the assigning via an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done no problemo, but this issue of doing variable assignment over an interation is something I haven't been able to find clear answers on!
If i understand your requirement correctly, we can do this quite simply, lets use Pathlib instead of os which was added in python 3.4+
from pathlib import Path
csvs = Path.cwd().glob('*.csv') # creates a generator expression.
#change Path(your_path) with Path.cwd() if script is in dif location
dfs = {} # lets hold the csv's in this dictionary
for file in csvs:
dfs[file.stem] = pd.read_csv(file,nrows=3) # change nrows [number of rows] to your spec.
#or with a dict comprhension
dfs = {file.stem : pd.read_csv(file) for file in Path('location\of\your\files').glob('*.csv')}
this will return a dictionary of dataframes with the key being the csv file name .stem adds this without the extension name.
much like
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
if you want to concat these then do
df = pd.concat(dfs)
the index will be the csv file name.

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
Use a loop to read the CSV files with np.genfromtxt into a np array.
After reading the first CSV file, write the data with .create_table() method, referencing the np array created in Step 1.
For additional CSV files, write the data with .append() method, referencing the np array created in Step 1
End of loop
Updated on 6/2/2019 to read a date field (mm/dd/YYY) and convert to datetime object. Note changes to genfromtxt() arguments! Data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime
csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv' ]
my_dtype= np.dtype([ ('a',int),('b','S20'),('c',float),('d',float),('e','S20') ])
with tb.open_file('SO_56387241.h5', mode='w') as h5f:
for PATH_csv in csv_list:
csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
# modify date in fifth field 'e'
for row in csv_data :
datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y' )
row['my_date'] = datetime_object
if h5f.__contains__('/CSV_Data') :
dset = h5f.root.CSV_Data
dset.append(csv_data)
else:
dset = h5f.create_table('/','CSV_Data', obj=csv_data)
dset.flush()
h5f.close()
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999

How to update Saved Document Extra data in GridFS

I have stored files on gridFS that has extra information stored on them - it is not set on the metadata otherwise I would use this
For Pymongo 3.2.2 we stored information on the same level as the actual data in fs.files (not using the metadata)
so for example we have:
fs.files = [ {
_id, description, title, ...
}]
When I call GridFS.put like so and nothing happens
FS = GridFS(mongo_,)
file.description = request_data.get('description', None)
FS.put(file)
How can I update that file extra information such as description?
What is file in your code example? That is, what are you passing to GridFS.put?
To add metadata to a GridFS file, pass additional keyword arguments to put, as it shows in the PyMongo tutorial:
>>> fs.put(b'data', filename='foo', description='my description')
ObjectId('5825ea8ea08bff9df5059099')
Now the metadata is stored along with your data in GridFS:
>>> gridout = fs.get(ObjectId('5825ea8ea08bff9df5059099'))
>>> gridout.description
u'my description'
Under the hood, you can see that PyMongo stored the metadata in the fs.files collection in MongoDB:
>>> for doc in db.fs.files.find()
... print(doc.get('description'))
...
my description
But the better way to access GridFS data is with PyMongo's GridFS API, not by directly querying the collections.
Another way to store metadata is to create a GridIn, set a field, and call close:
>>> gridin = fs.new_file()
>>> gridin.filename = 'foo'
>>> gridin.description = 'my description'
>>> gridin.write(b'data')
>>> gridin.close()
This is a good option if you need to call write multiple times with chunks of data.
Manually update those extra fields without using GridFS:
file = mongo_.fs.files.find_one({'_id': ObjectId(fileId)})
file['description'] = request_data.get('description', None)
mongo_.fs.files.save(file)