Spark (Databricks) unmanaged table from SQL not processing headers - sql

Trying to create an unmanaged table in Spark (Databricks) from a CSV file using the SQL API. But first row is not being used as headers.
Image 2, shows that the first row is correct when using the Dataframe API to create an unmanaged table. The Dataframe was loaded from the same csv file.
However, Image 1, shows that when creating an unmanaged table from a CSV file data source in SQL, does not process the first row as headers. Am I leaving out some "headers" option?
And if so, how would that be coded?
Dataframe API

You just need to provide OPTIONS as it's specified in the documentation.
In the that options block you can list key/value pairs that matches to the options specific to the Spark CSV reader, for example, options ('header' = 'true', 'sep' = ',') will force Spark to ignore header line, and set separator to comma. You can also add the 'inferSchema' = true into options, in this case you can just omit the columns declaration - Spark will infer it for you (it's ok for small datasets, but not for the big ones):
create table test.test using csv
options ('header' = 'true', 'sep' = ',', 'inferSchema' = true)
location '/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv'

Related

How i can read files from s3 using pyspark which is created after a particular time

I need to read json files from s3 using pyspark. The S3 location may contain hundreds of thousands of files. and every file have same metdata. But each time i need to read only the files that is created after a particular time. How i can do this?
If you have access to the system that creates these files, the simplest way to approach this would be to add a date partition when you write them:
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=2021831/myfileA.json
And then you can read them with a filter; Pyspark will then only load the files that it needs into memory.
start_dt = '20210831'
end_dt = '20210901'
df = (
spark
.read
.json(path)
.filter(F.col("date").between(start_dt, end_dt))
)
Note that I have not explicitly tested this with JSON files, just with Parquet, so this method may need to be adapted.
If you don't have access to change how the files are written, I don't think Pyspark has direct access to the metadata of the files. Instead, you will want to query S3 directly using boto3 to generate a list of files, filter them using boto3 meta data, and then pass the list of files into the read method:
# generate this by querying via boto3
recent_files = ['s3://mybucket/file1.json', 's3://mybucket/file2.json']
df = spark.read.json(*recent_files)
Info about listing files from boto3.
You can provide modifiedAfter and modifiedBefore parameters to DataFrameReader.json function.
modifiedBefore an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Example
from datetime import datetime
# Fill this variable with your last date
lowerbound = datetime(2021, 9, 1, 13, 0, 0)
# Current execution
upperbound = datetime.now()
df = spark.read.json(source_path,
modifiedAfter=lowerbound.strftime('%Y-%m-%dT%H:%M:%S'),
modifiedBefore=upperbound.strftime('%Y-%m-%dT%H:%M:%S'))
As noted in the discussion on Kafels' answer, modifiedBefore and modifiedAfter don't work with S3 as a data source. This is a real shame!
The next best alternative is to use boto3 to list all objects in the partition, and then filter the results on the lastModified element in the results. The results don't contain a creation timestamp so lastModified is the best you can do. You also need to be careful to handle pagination given the large number of objects.
Something like this should work to retrieve the matching keys:
import boto3
def get_matching_s3_keys(bucket, prefix="", after_date=None):
"""
List keys in an S3 bucket that match specified criteria.
:param bucket: Name of the S3 bucket.
:param prefix: Only get objects whose key starts with
this prefix
:param after_date: Only get objects that were last modified
after this date. Note: this needs to be a timezone-aware date
"""
paginator = s3.get_paginator("list_objects_v2")
kwargs = {'Bucket': bucket, 'Prefix': prefix}
for page in paginator.paginate(**kwargs):
try:
contents = page["Contents"]
except KeyError:
break
for obj in contents:
last_modified = obj["LastModified"]
if after_date is None or last_modified > after_date:
yield obj["Key"]

Bigquery dbt_external_tables External Data Configuration

I need some help when using the dbt_external_tables package.
I realized that in the csv I have in GCS some lines appear to have line breaks and this is causing some issues when trying to query the table created by the macro.
Sometimes when doing this configuration of the external table manually the BigQuery UI has two options:
Allow jagged rows (CSV)
Allow quoted newlines (CSV) true
I usually put those options in true and sometimes the issues are solved.
I don't know how to do this using the dbt_external_tables.
This is important as I am receiving this errors when trying to query the table created by dbt "Error while reading table: kpi-process.file_csv.History, error message: CSV table references column position 9, but line starting at position:10956 contains only 7 columns. "
The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here. In your case, it sounds like you want to turn on allow_jagged_rows and allow_quoted_newlines, so you can specify them like so:
version: 2
sources:
- name: my_external_source
tables:
- name: my_external_table
location: 'gs://bucket/path/*'
options:
format: csv
allow_jagged_rows: true
allow_quoted_newlines: true
And dbt will template a DDL statement accordingly:
create or replace external statement my_external_source.my_external_table
options (
format = 'csv',
allow_jagged_rows = true,
allow_quoted_newlines = true,
uris = ['gs://bucket/path/*']
)

Prepare a csv file for process mining

hope you are doing well !
I was following tutorials for process mining using 'PM4PY', but I found difficulties in the csv file ,
in my csv file I have this columns : 'id', 'status', 'mailID', 'date'.... ('status' is same as 'activity' that contain some specific choises )
my csv file contains a lot of data.
to follow process mining tutorial I must have in my columns something like 'case:concept:name' ... but I don't know how can I make it
In your case, I assume 'id' would be the same as the Case ID in normal process mining terminology. Similarly, 'status' corresponds to Activity ID and 'date' would correspond to the timestamp.
The best option is to first read into a pandas dataframe before feeding into PM4Py.
For a detailed understanding of how to do this, here is an example below. As you have not mentioned all the columns that you have in your csv file, let us assume that currently you only have [ 'id', 'status', 'date' ] as your column list. The following code can be adapted to any number of columns you have (by adding them to the list named cols) :
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
path = '' # Enter path to the csv file
data = pd.read_csv(path)
cols = ['case:concept:name','concept:name','time:timestamp']
data.columns = cols
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)
log = log_converter.apply(data, variant=log_converter.Variants.TO_EVENT_LOG)
Here we have changed the column names and their datatypes as required by the PM4Py package. Convert this dataframe into an event log using the log_converter function. Now you can perform your regular process mining tasks on this event log object. For instance, if you wish to create a Directly-Follows Graph from the event log, you can use the following line of code :
from pm4py.algo.discovery.dfg import algorithm as dfg_algorithm
dfg = dfg_algorithm.apply(log)
first you need import your csv file using pandas, then convert to an event log object, finally you can use in pm4py.
reference:
https://pm4py.fit.fraunhofer.de/documentation

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
Use a loop to read the CSV files with np.genfromtxt into a np array.
After reading the first CSV file, write the data with .create_table() method, referencing the np array created in Step 1.
For additional CSV files, write the data with .append() method, referencing the np array created in Step 1
End of loop
Updated on 6/2/2019 to read a date field (mm/dd/YYY) and convert to datetime object. Note changes to genfromtxt() arguments! Data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime
csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv' ]
my_dtype= np.dtype([ ('a',int),('b','S20'),('c',float),('d',float),('e','S20') ])
with tb.open_file('SO_56387241.h5', mode='w') as h5f:
for PATH_csv in csv_list:
csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
# modify date in fifth field 'e'
for row in csv_data :
datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y' )
row['my_date'] = datetime_object
if h5f.__contains__('/CSV_Data') :
dset = h5f.root.CSV_Data
dset.append(csv_data)
else:
dset = h5f.create_table('/','CSV_Data', obj=csv_data)
dset.flush()
h5f.close()
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999

Splunk : formatting a csv file during indexing, values are being treated as new columns?

I am trying to create a new field during indexing however the fields become columns instead of values when i try to concat. What am i doing wrong ? I have looked in the docs and seems according ..
Would appreciate some help on this.
e.g.
.csv file
**Header1**, **Header2**
Value1 ,121244
transform.config
[test_transformstanza]
SOURCE_KEY = fields:Header1,Header2
REGEX =^(\w+\s+)(\d+)
FORMAT =
testresult::$1.$2
WRITE_META = true
fields.config
[testresult]
INDEXED = True
The regex is good, creates two groups from the data, but why is it creating a new field instead of assigning the value to result?. If i was to do ... testresult::$1 or testresult::$2 it works fine, but when concatenating it creates multiple headers with the value as headername. Is there an easier way to concat fields , e.g. if you have a csv file with header names can you just not refer to the header names? (i know how to do these using calculated fields but want to do it during indexing)
Thanks