How can I read files from S3 using PySpark that were created after a particular time? - amazon-s3

I need to read JSON files from S3 using PySpark. The S3 location may contain hundreds of thousands of files, and every file has the same metadata. Each time, I need to read only the files that were created after a particular time. How can I do this?

If you have access to the system that creates these files, the simplest way to approach this would be to add a date partition when you write them:
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=20210901/myfile2.json
s3://mybucket/myfolder/date=20210831/myfileA.json
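And then you can read them with a filter; PySpark will then load only the files that it needs into memory.
from pyspark.sql import functions as F
path = 's3://mybucket/myfolder/'  # folder that contains the date=... partitions
start_dt = '20210831'
end_dt = '20210901'
df = (
    spark
    .read
    .json(path)
    .filter(F.col("date").between(start_dt, end_dt))
)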
Note that I have not explicitly tested this with JSON files, just with Parquet, so this method may need to be adapted.
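As a rough illustration of the write side, the sketch below builds object keys with a date=YYYYMMDD partition from a file's creation time. The helper name partitioned_key and its arguments are hypothetical, and the bucket and folder are the ones from the example paths above; this is only one way to lay the keys out.

```python
from datetime import datetime

def partitioned_key(prefix, filename, created):
    """Return an S3 URI that places the file under a date=YYYYMMDD partition."""
    # Zero-padded YYYYMMDD, so comparing the values as strings matches date order
    partition = created.strftime("%Y%m%d")
    return f"{prefix.rstrip('/')}/date={partition}/{filename}"

# Hypothetical usage with the bucket and folder from the example paths
example = partitioned_key("s3://mybucket/myfolder", "myfile1.json", datetime(2021, 9, 1, 13, 0, 0))
print(example)
```

Zero-padding matters here: a value such as date=2021831 would not sort or compare correctly against date=20210901.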
If you can't change how the files are written, I don't think PySpark has direct access to the files' metadata. Instead, query S3 directly using boto3 to generate a list of files, filter them using the boto3 metadata, and then pass the list of files into the read method:
# generate this by querying via boto3
recent_files = ['s3://mybucket/file1.json', 's3://mybucket/file2.json']
df = spark.read.json(recent_files)  # json() accepts a list of paths
Info about listing files from boto3.

You can pass the modifiedAfter and modifiedBefore parameters to the DataFrameReader.json function.
modifiedBefore: an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter: an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Example:
from datetime import datetime
# Fill this variable with your last date
lowerbound = datetime(2021, 9, 1, 13, 0, 0)
# Current execution
upperbound = datetime.now()
df = spark.read.json(
    source_path,
    modifiedAfter=lowerbound.strftime('%Y-%m-%dT%H:%M:%S'),
    modifiedBefore=upperbound.strftime('%Y-%m-%dT%H:%M:%S'),
)

As noted in the discussion on Kafels' answer, modifiedBefore and modifiedAfter don't work with S3 as a data source. This is a real shame!
The next best alternative is to use boto3 to list all objects in the partition, and then filter the results on the LastModified element of each result. The results don't contain a creation timestamp, so LastModified is the best you can do. You also need to handle pagination carefully, given the large number of objects.
Something like this should work to retrieve the matching keys:
import boto3
s3 = boto3.client("s3")  # client used by the paginator below
def get_matching_s3_keys(bucket, prefix="", after_date=None):
    """
    List keys in an S3 bucket that match specified criteria.
    :param bucket: Name of the S3 bucket.
    :param prefix: Only get objects whose key starts with
        this prefix
    :param after_date: Only get objects that were last modified
        after this date. Note: this needs to be a timezone-aware date
    """
    paginator = s3.get_paginator("list_objects_v2")
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    for page in paginator.paginate(**kwargs):
        try:
            contents = page["Contents"]
        except KeyError:
            # no objects under this prefix
            break
        for obj in contents:
            last_modified = obj["LastModified"]
            if after_date is None or last_modified > after_date:
                yield obj["Key"]
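To connect the listing to Spark, the matching keys still have to be turned into full s3:// paths. The sketch below shows the filtering and path-building step on plain dictionaries shaped like the entries of a list_objects_v2 page (only the Key and LastModified fields are used). The helper name recent_s3_paths and the sample objects are made up for illustration. Note the timezone-aware cutoff: comparing a naive datetime with timezone-aware LastModified values raises a TypeError.

```python
from datetime import datetime, timezone

def recent_s3_paths(bucket, objects, after_date):
    """Return s3:// paths for objects last modified strictly after after_date."""
    return [
        f"s3://{bucket}/{obj['Key']}"
        for obj in objects
        if obj["LastModified"] > after_date
    ]

# Made-up sample entries, shaped like the items of page["Contents"]
sample_objects = [
    {"Key": "myfolder/old.json", "LastModified": datetime(2021, 8, 30, tzinfo=timezone.utc)},
    {"Key": "myfolder/new.json", "LastModified": datetime(2021, 9, 2, tzinfo=timezone.utc)},
]
cutoff = datetime(2021, 9, 1, tzinfo=timezone.utc)  # must be timezone-aware
recent_files = recent_s3_paths("mybucket", sample_objects, cutoff)
print(recent_files)
```

A list built this way could then be passed as a single argument to spark.read.json, which accepts a list of paths.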

Related

Spark (Databricks) unmanaged table from SQL not processing headers

I'm trying to create an unmanaged table in Spark (Databricks) from a CSV file using the SQL API, but the first row is not being used as headers.
Image 2 shows that the first row is correct when using the DataFrame API to create an unmanaged table. The DataFrame was loaded from the same CSV file.
However, Image 1 shows that creating an unmanaged table from a CSV file data source in SQL does not process the first row as headers. Am I leaving out some "headers" option?
And if so, how would that be coded?
You just need to provide OPTIONS, as specified in the documentation.
In that options block you can list key/value pairs that match the options of the Spark CSV reader. For example, options ('header' = 'true', 'sep' = ',') tells Spark to treat the first line as the header (column names) rather than as data, and sets the separator to a comma. You can also add 'inferSchema' = true to the options; in that case you can omit the column declarations and Spark will infer them for you (this is fine for small datasets, but not for big ones):
create table test.test using csv
options ('header' = 'true', 'sep' = ',', 'inferSchema' = true)
location '/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv'

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x in Python. I have some data (millions of rows) in CSV files that I have to index in Apache Solr.
I am using the pysolr module for this purpose:
import pysolr
def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
The index_module function simply gets one row of the data frame as JSON and indexes it in Solr via the pysolr module. pysolr supports indexing a list of documents instead of just one. I have to update my logic so that instead of sending one document per request, I send a list of documents. That would definitely improve performance.
How can I achieve this in PySpark? Is there an alternative or better approach than map and toJSON?
Also, all my work is done in transformation functions; I am using count just to start the job. Is there an alternative dummy action in Spark that does the same?
Finally, I have to create the Solr object each time. Is there an alternative to this?
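One possible direction, sketched under assumptions rather than tested against Solr: group each partition's rows into fixed-size lists and send one request per list, creating the client once per partition. The names batched, index_partition and RecordingClient, the make_client factory, and the batch size are all hypothetical. The only thing assumed about a real client is that it has an add method that accepts a list of documents, as the question says pysolr does.

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def index_partition(rows, make_client, batch_size=500):
    """Send the rows of one partition to the client in batches; return the row count."""
    client = make_client()  # created once per partition, not once per row
    sent = 0
    for batch in batched(rows, batch_size):
        client.add(batch)  # one request per batch
        sent += len(batch)
    return sent

class RecordingClient:
    """Stand-in for a Solr client; it only records the batches it receives."""
    def __init__(self):
        self.calls = []

    def add(self, docs):
        self.calls.append(list(docs))

# Local demonstration with the stand-in client instead of Solr
client = RecordingClient()
total = index_partition(range(5), lambda: client, batch_size=2)
print(total, client.calls)
```

In Spark, a function like this could be handed to foreachPartition, which is itself an action, so the dummy count would no longer be needed. For example, something along the lines of df.rdd.map(lambda r: r.asDict()).foreachPartition(lambda rows: index_partition(rows, lambda: pysolr.Solr(SOLR_URI))).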

Prepare a csv file for process mining

Hope you are doing well!
I was following tutorials for process mining using PM4Py, but I ran into difficulties with my CSV file.
My CSV file has these columns: 'id', 'status', 'mailID', 'date', ... ('status' is the same as 'activity' and contains a fixed set of choices).
My CSV file contains a lot of data.
To follow the process mining tutorial I need columns such as 'case:concept:name', but I don't know how to create them.
In your case, I assume 'id' would be the same as the Case ID in normal process mining terminology. Similarly, 'status' corresponds to Activity ID and 'date' would correspond to the timestamp.
The best option is to first read into a pandas dataframe before feeding into PM4Py.
For a detailed understanding of how to do this, here is an example. As you have not mentioned all the columns in your CSV file, let us assume that you currently have only ['id', 'status', 'date'] as your column list. The following code can be adapted to any number of columns (by adding them to the list named cols):
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
path = '' # Enter path to the csv file
data = pd.read_csv(path)
cols = ['case:concept:name','concept:name','time:timestamp']
data.columns = cols
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)
log = log_converter.apply(data, variant=log_converter.Variants.TO_EVENT_LOG)
Here we have changed the column names and their datatypes as required by the PM4Py package. Convert this dataframe into an event log using the log_converter function. Now you can perform your regular process mining tasks on this event log object. For instance, if you wish to create a Directly-Follows Graph from the event log, you can use the following line of code :
from pm4py.algo.discovery.dfg import algorithm as dfg_algorithm
dfg = dfg_algorithm.apply(log)
First, import your CSV file using pandas, then convert it to an event log object; finally, you can use it in PM4Py.
Reference:
https://pm4py.fit.fraunhofer.de/documentation

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item to df1, 2nd to df2, etc. But upon trying to do this I just realized - I have no idea how to change the variable being assigned when doing the assigning via an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done no problemo, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib (added in Python 3.4) instead of os:
from pathlib import Path
import pandas as pd
csvs = Path.cwd().glob('*.csv')  # returns a generator of matching paths
# replace Path.cwd() with Path(your_path) if the script is in a different location
dfs = {}  # hold the DataFrames in this dictionary, keyed by file name
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows [number of rows] to your spec
# or with a dict comprehension (raw string, so the backslashes are not treated as escapes)
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This returns a dictionary of DataFrames keyed by the CSV file name; .stem gives the file name without the extension.
Much like:
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
If you want to concatenate these, then do:
df = pd.concat(dfs)
The outer level of the resulting index will be the CSV file name.
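If only the first row of each file is needed, the same dictionary idea also works without pandas. The sketch below swaps in the standard-library csv module (not what the answer above uses) and keys each file's first data row by the file stem. The function name first_rows and the assumption that every file starts with a header line are mine.

```python
import csv
from pathlib import Path

def first_rows(folder):
    """Map each CSV file's stem to its first data row (as a dict), skipping files with no data."""
    rows = {}
    for file in sorted(Path(folder).glob("*.csv")):
        with file.open(newline="") as handle:
            reader = csv.DictReader(handle)  # assumes the first line is a header
            first = next(reader, None)       # None if the file has no data rows
            if first is not None:
                rows[file.stem] = first
    return rows
```

Calling first_rows(Path.cwd()) would return something like {'file1': {...}, 'file2': {...}}, with every value read as a string, since the csv module does no type conversion.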

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using PyTables. You could also use a similar process to create the HDF5 file with h5py.
1. Use a loop to read the CSV files with np.genfromtxt into a NumPy array.
2. After reading the first CSV file, write the data with the .create_table() method, referencing the array created in Step 1.
3. For additional CSV files, write the data with the .append() method, referencing the array created in Step 1.
4. End of loop.
Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert it to a datetime object. Note the changes to the genfromtxt() arguments! The data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime
csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float), ('d', float), ('e', 'S20')])
with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
        # convert the date in the fifth field
        for row in csv_data:
            datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        if '/CSV_Data' in h5f:
            # table already exists: append the rows from this CSV file
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            # first CSV file: create the table from the array
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()
# the with block closes the file, so no explicit h5f.close() is needed
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999