Efficient way of reading parquet files between a date range in Azure Databricks - azure-data-lake

I would like to know whether the pseudocode below is an efficient method to read multiple Parquet files between a date range, stored in Azure Data Lake, from PySpark (Azure Databricks). Note: the Parquet files are not partitioned by date.
I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019).
Read all data using the * wildcard:
df = spark.read.parquet('uat/EntityName/*/*/*/*')
Add a column FileTimestamp that extracts the timestamp from EntityName_2019_01_01_HHMMSS.parquet using string operations and converts it to TimestampType():
df.withColumn(add timestamp column)
Use filter to get relevant data:
start_date = '2018-12-15 00:00:00'
end_date = '2019-02-15 00:00:00'
df.filter(df.FileTimestamp >= start_date).filter(df.FileTimestamp < end_date)
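A concrete sketch of those three steps (the regex and timestamp format below are assumptions based on the naming convention above):

from pyspark.sql import functions as F

# 1. Read every file under the entity folder (year/month/day levels).
df = spark.read.parquet('uat/EntityName/*/*/*/*')

# 2. Derive FileTimestamp from each row's source file name, e.g.
#    EntityName_2019_01_01_120000.parquet -> 2019-01-01 12:00:00.
df = df.withColumn(
    'FileTimestamp',
    F.to_timestamp(
        F.regexp_extract(F.input_file_name(), r'_(\d{4}_\d{2}_\d{2}_\d{6})\.parquet$', 1),
        'yyyy_MM_dd_HHmmss',
    ),
)

# 3. Keep only rows whose source file falls inside the requested window.
start_date = '2018-12-15 00:00:00'
end_date = '2019-02-15 00:00:00'
df = df.filter((F.col('FileTimestamp') >= start_date) & (F.col('FileTimestamp') < end_date))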
Essentially I'm using PySpark to simulate the neat syntax available in U-SQL:
@rs =
    EXTRACT
        user   string,
        id     string,
        __date DateTime
    FROM "/input/data-{__date:yyyy}-{__date:MM}-{__date:dd}.csv"
    USING Extractors.Csv();

@rs =
    SELECT *
    FROM @rs
    WHERE
        __date >= System.DateTime.Parse("2016/1/1") AND
        __date < System.DateTime.Parse("2016/2/1");

The correct way of partitioning your data is to use the form year=2019, month=01, etc. in your folder structure.
When you query this data with a filter such as:
df.filter(df.year >= myYear)
Then Spark will only read the relevant folders.
It is very important that the filtering column name appears exactly in the folder name. Note that when you write partitioned data using Spark (for example by year, month, and day), it will not write the partitioning columns into the Parquet files; they are instead inferred from the path. Your dataframe does need to contain them when writing, though, and they will be returned as columns when you read from partitioned sources.
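A minimal sketch of that write/read round trip, reusing the FileTimestamp idea from the question (paths are illustrative):

from pyspark.sql import functions as F

# Write: year/month/day become year=.../month=.../day=... folders and are
# dropped from the Parquet files themselves (they are inferred from the path).
(df.withColumn('year', F.year('FileTimestamp'))
   .withColumn('month', F.month('FileTimestamp'))
   .withColumn('day', F.dayofmonth('FileTimestamp'))
   .write.partitionBy('year', 'month', 'day')
   .parquet('uat/EntityName_partitioned'))

# Read: filters on the partition columns prune folders before any data is read.
df2 = spark.read.parquet('uat/EntityName_partitioned')
df2 = df2.filter((F.col('year') == 2019) & (F.col('month') == 1))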
If you cannot change the folder structure, you can always manually reduce the set of folders for Spark to read using a regex or glob; the article "Spark SQL queries on partitioned data using Date Ranges" provides more context. But clearly this is more manual and complex.
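If the layout has to stay as it is, one sketch of that manual reduction is to build the list of day folders for the range yourself (the path pattern assumes the uat/EntityName/yyyy/MM/dd layout from the question):

from datetime import date, timedelta

start, end = date(2018, 12, 15), date(2019, 2, 15)
paths = [
    'uat/EntityName/{:%Y/%m/%d}/*.parquet'.format(start + timedelta(days=i))
    for i in range((end - start).days)
]
# Every listed folder must actually exist, or the read fails -- see the
# linked question in the update below about skipping nonexistent paths.
df = spark.read.parquet(*paths)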
UPDATE: For a further example, see "Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?"
Also from "Spark - The Definitive Guide: Big Data Processing Made Simple"
by Bill Chambers:
Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When you write a file to a partitioned directory (or table), you basically encode a column as a folder. What this allows you to do is skip lots of data when you go to read it in later, allowing you to read in only the data relevant to your problem instead of having to scan the complete dataset.
...
This is probably the lowest-hanging optimization that you can use when you have a table that readers frequently filter by before manipulating. For instance, date is particularly common for a partition because, downstream, often we want to look at only the previous week's data (instead of scanning the entire list of records).

Related

Converting a STRING to DATE in Big Query [duplicate]

I've been struggling with some datasets I want to use, which have a problem with the date format.
BigQuery could not load the files and returned the following error:
Could not parse '4/12/2016 2:47:30 AM' as TIMESTAMP for field date (position 1) starting at location 21 with message 'Invalid time zone:
AM'
I have been able to upload the file manually, but only as strings, and now I would like to set the fields back to the proper format. However, I just could not find a way to change the date column from string to a proper DATETIME format.
I would love to know if this is possible, as the file is just too long to be formatted in Excel or Sheets (as I have done with the smaller files from this dataset).
now would like to set the fields back to the proper format ... from string to proper DateTime format
Use parse_datetime('%m/%d/%Y %r', string_col) to parse a DATETIME out of the string.
Applied to the sample string in your question, '4/12/2016 2:47:30 AM', this yields 2016-04-12T02:47:30.
As @Mikhail Berlyant rightly said, the parse_datetime('%m/%d/%Y %r', string_col) function will convert your badly formatted dates to a standard ISO 8601 format accepted by Google BigQuery. The best option is then to save these query results to a new table in your BigQuery project.
I had a similar issue. I had uploaded my table with all columns in STRING format. I then ran the parse_datetime query with settings that stored the query output to a new table called heartrateSeconds_clean in the same dataset.
The Write if empty option is a good way to avoid overwriting the existing raw data or arbitrarily writing output to a temporary table, unless you are sure you want to do so. Save the settings and run your query. The output schema of the new table is updated automatically to the parsed types.
NB: I did not apply an ORDER BY clause, so the results are not ordered by any specific column in either version of the table. This dataset has over 2M rows.
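A minimal sketch of that clean-up step using the BigQuery Python client (the project, dataset, and the Id/Value columns are placeholders; only the date column comes from the question):

from google.cloud import bigquery

client = bigquery.Client()

# WRITE_EMPTY mirrors the "Write if empty" option: the job fails if the
# destination table already contains data, so raw data is never overwritten.
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        'my-project.my_dataset.heartrateSeconds_clean'
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,
)

sql = """
SELECT
  Id,
  PARSE_DATETIME('%m/%d/%Y %r', date) AS date,
  Value
FROM `my-project.my_dataset.heartrateSeconds`
"""

client.query(sql, job_config=job_config).result()  # wait for the job to finish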


Formatting a string to time on BigQuery?

I've got a huge (1.5 GB) CSV file with dates in the format 2014-12-25. I have managed to upload it to BigQuery with this column typed as STRING. I'm wondering if I can transform it in situ to a datetime format, without having to download the data, parse it, and send it back?
I have used the BigQuery GUI (I'm a newbie) but am happy to use the CLI if that makes it easier.
You can use some of the date and time functions to "transform" a string-represented date into a TIMESTAMP. For example:
SELECT '2014-12-25', TIMESTAMP('2014-12-25')
Added:
If you feel that you really need your date data in TIMESTAMP format rather than STRING, and you already have this data (as strings) in BigQuery, you can write the result of a query similar to the one below to a new table:
SELECT
  TIMESTAMP(date_string) AS date_timestamp,
  < list all the rest of the fields >
FROM original_table

Custom date format for loading data into BigQuery, using bq?

I'm uploading a CSV file to Google BigQuery using bq load on the command line. It's working great, but I've got a question about converting timestamps on the fly.
In my source data, my timestamps are formatted as YYYYMM, e.g. 201303 meaning March 2013.
However, Google BigQuery's timestamp fields are documented as only supporting Unix timestamps and YYYY-MM-DD HH:MM:SS format strings. So unsurprisingly, when I load the data, these fields don't convert to the correct date.
Is there any way I can convey to BigQuery that these are YYYYMM strings?
If not, I can convert them before loading, but I have about 1 TB of source data, so I'm keen to avoid that if possible :)
Another alternative is to load this field as STRING, and convert it to TIMESTAMP inside BigQuery itself, copying the data into another table (and deleting the original one afterwards), and doing the following transformation:
SELECT TIMESTAMP(your_ts_str + "01") AS ts
An alternative to Mosha's answer is:
SELECT DATE(CONCAT(your_ts_str, "01")) as ts
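A standard-SQL alternative to the snippets above is PARSE_TIMESTAMP; here's a sketch run through the Python client (the project, dataset, table, and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Append "01" to the YYYYMM string to get YYYYMMDD, then parse it
# into a proper TIMESTAMP (standard SQL).
sql = """
SELECT
  PARSE_TIMESTAMP('%Y%m%d', CONCAT(your_ts_str, '01')) AS ts
FROM `my-project.my_dataset.original_table`
"""
for row in client.query(sql).result():
    print(row.ts)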

SQL Server 2005 Import from Excel

I'd like to know what my best option would be for importing data from an Excel file on a weekly or monthly basis. At first, I thought I would use SSIS, but after much struggle with seemingly simple tasks, I'm starting to rethink my plan. Would it be better/easier to just write the SQL by hand or use the services of an SSIS package? The basic process will be as follows:
A separate process will download an .xls file to a local fileshare.
The xls file will have a filename like: 'myfilename MON YY'.
I will need to read the month and year from the filename, reformat it to a SQL date, and then query a DimDate table to find the corresponding date key.
For each row (after the first two header rows), insert the data with the date key, unless the row is a total row, in which case ignore it.
Here are some of the issues I've been encountering with SSIS:
I can parse the date string from a flat file data source, but can't seem to do it with an Excel data source. Also, once parsed, I cannot seem to convert the string to a date in order to perform the lookup for the date key. For example, I want to do something like this:
select DateKey from DimDate
where ActualDate = convert(datetime, '01-' + 'JAN-10', 120)
but I don't think it is possible to use the 'convert' or 'datetime' keywords in the expression builder. I have also been unable to find where I can edit the SQL to ignore the first two rows of data.
I'm very skeptical of using SSIS because it seems like a kludgy way of doing something that can probably be accomplished more efficiently by writing the SQL yourself, but I may be forced to use SSIS. Thoughts?
SSIS is definitely the direction to go.
To hit on your problems: (DT_DBTIMESTAMP) is the conversion you want, though the syntax is a bit different. For instance, to convert your example date I would use:
(DT_DBTIMESTAMP)"01/01/2010"
If you use that expression in a derived column to replace your string date (or create a new column), you could then do a lookup against datetime columns in a DB.
If you need to exclude the first two rows, you will either need to write a SQL statement to query the file (as opposed to using an Excel file reader source) or use a conditional split to throw them away based on a condition that can be repeated with every import.
Flat files are easier to work with, and they do allow you to skip a set number of initial rows.