I would like to cross-check my understanding of the differences between file formats like Apache Avro and Apache Parquet in terms of schema evolution. Looking at various blogs and SO answers gave me the following understanding. I need to verify whether it is correct, and I would also like to know if I am missing any other differences with respect to schema evolution. The explanation is given in terms of using these file formats in Apache Hive.
Adding a column: Adding a column (with a default value) at the end of the column list is supported in both file formats. I think adding a column (with a default value) in the middle of the column list can be supported in Parquet if the Hive table property "hive.parquet.use-column-names=true" is set. Is this not the case?
Deleting a column: As far as deleting a column at the end of the column list is concerned, I think it is supported in both file formats: even if a Parquet/Avro file still contains the deleted column, since the reader schema (the Hive schema) no longer has it, the extra column in the writer's schema (the actual Avro or Parquet file schema) will simply be ignored in both formats. Deleting a column in the middle of the column list should also be supported if the property "hive.parquet.use-column-names=true" is set. Is my understanding correct?
Renaming a column: When it comes to renaming a column, since Avro has a "column alias" option, renaming is supported in Avro but not in Parquet, because Parquet has no such aliasing option. Am I right?
Data type change: This is supported in Avro because we can define multiple data types for a single column using a union type, but it is not possible in Parquet because there is no union type in Parquet.
Am I missing any other possibility? Appreciate the help.
hive.parquet.use-column-names=true needs to be set for accessing columns by name in Parquet. It is not only for column addition/deletion. Manipulating columns by indices would be cumbersome to the point of being infeasible.
There is a workaround for column renaming as well. Refer to https://stackoverflow.com/a/57176892/14084789
Union is a challenge with Parquet.
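For reference, here is a minimal Avro schema sketch in Python (record and field names are made up) illustrating the two Avro-side mechanisms discussed above: an alias so the reader can resolve a renamed column, and a union type with a default to handle a changed or late-added data type.

# Hypothetical reader schema; names are purely illustrative.
reader_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        # Renamed from "cust_id"; the alias lets data written under the
        # old field name resolve to the new one.
        {"name": "customer_id", "type": "string", "aliases": ["cust_id"]},
        # Union type: older files may have written an int, newer ones a
        # double; the default covers files written before the field existed.
        {"name": "balance", "type": ["null", "int", "double"], "default": None},
    ],
}

Parquet has no direct equivalent for either mechanism, which is why renaming and union-style type changes are the problem cases there.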
I would like to know if the pseudo code below is an efficient method to read multiple Parquet files between a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the Parquet files are not partitioned by date.
I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019).
Read all data using * wildcard:
df = spark.read.parquet("uat/EntityName/*/*/*/*")
Add a column FileTimestamp that extracts the timestamp from EntityName_2019_01_01_HHMMSS.parquet using string operations and converts it to TimestampType():
df.withColumn(add timestamp column)
Use filter to get relevant data:
start_date = '2018-12-15 00:00:00'
end_date = '2019-02-15 00:00:00'
df.filter(df.FileTimestamp >= start_date).filter(df.FileTimestamp < end_date)
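A concrete (untested) sketch of the above, assuming the timestamp can be pulled out of the file name with a regex and that spark is the existing SparkSession (as in a Databricks notebook):

from pyspark.sql.functions import input_file_name, regexp_extract, to_timestamp

df = spark.read.parquet("uat/EntityName/*/*/*/*")

# Extract the "YYYY_MM_DD_HHMMSS" part of the file name and convert it to a timestamp.
df = df.withColumn(
    "FileTimestamp",
    to_timestamp(
        regexp_extract(input_file_name(), r"_(\d{4}_\d{2}_\d{2}_\d{6})\.parquet$", 1),
        "yyyy_MM_dd_HHmmss",
    ),
)

start_date = '2018-12-15 00:00:00'
end_date = '2019-02-15 00:00:00'
df = df.filter((df.FileTimestamp >= start_date) & (df.FileTimestamp < end_date))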
Essentially I'm using PySpark to simulate the neat syntax available in U-SQL:
@rs =
    EXTRACT
        user   string,
        id     string,
        __date DateTime
    FROM
        "/input/data-{__date:yyyy}-{__date:MM}-{__date:dd}.csv"
    USING Extractors.Csv();

@rs =
    SELECT *
    FROM @rs
    WHERE
        __date >= System.DateTime.Parse("2016/1/1") AND
        __date < System.DateTime.Parse("2016/2/1");
The correct way to partition your data is to use the form year=2019, month=01, etc. in your folder structure.
When you query this data with a filter such as:
df.filter(df.year >= myYear)
Then Spark will only read the relevant folders.
It is very important that the filtering column name appears exactly in the folder name. Note that when you write partitioned data using Spark (for example by year, month, day) it will not write the partitioning columns into the parquet file. They are instead inferred from the path. It does mean your dataframe will require them when writing though. They will also be returned as columns when you read from partitioned sources.
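As a rough sketch (the output path is a placeholder, and it assumes the dataframe already carries year, month and day columns):

# Write partitioned by year/month/day; Spark encodes the values in the folder names.
df.write.partitionBy("year", "month", "day").mode("overwrite").parquet("uat/EntityName_partitioned")

# On read, filters on the partition columns let Spark prune whole folders
# instead of listing and scanning every file.
df2 = spark.read.parquet("uat/EntityName_partitioned")
df2 = df2.filter((df2.year == 2019) & (df2.month == 1))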
If you cannot change the folder structure you can always manually reduce the folders for Spark to read using a regex or Glob - this article should provide more context Spark SQL queries on partitioned data using Date Ranges. But clearly this is more manual and complex.
UPDATE: Further example Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?
Also from "Spark - The Definitive Guide: Big Data Processing Made Simple"
by Bill Chambers:
Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When you write a file to a partitioned directory (or table), you basically encode a column as a folder. What this allows you to do is skip lots of data when you go to read it in later, allowing you to read in only the data relevant to your problem instead of having to scan the complete dataset.
...
This is probably the lowest-hanging optimization that you can use when you have a table that readers frequently filter by before manipulating. For instance, date is particularly common for a partition because, downstream, often we want to look at only the previous week’s data (instead of scanning the entire list of records).
Can someone please help me by stating the purpose of providing a JSON schema file while loading a file to a BigQuery table using the bq command? What are the advantages?
Does this file help to maintain data integrity by avoiding any column swap?
Regards,
Sreekanth
Specifying a JSON schema, instead of relying on auto-detect, ensures that you get the expected types for each column being loaded. If you have data that looks like this, for example:
1,'foo',true
2,'bar',false
3,'baz',true
Schema auto-detection would infer that the type of the first column is an INTEGER (a.k.a. INT64). Maybe you plan to load more data in the future, though, that looks like this:
3.14,'foo',true
1.59,'bar',false
-2.001,'baz',true
In that case, you probably want the first column to have type FLOAT (a.k.a. FLOAT64) instead. If you provide a schema when you load the first file, you can specify a type of FLOAT for that column explicitly.
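As a sketch with the Python client library (column, bucket, and table names are made up; passing a JSON schema file to bq load achieves the same thing):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("amount", "FLOAT"),   # explicit FLOAT instead of auto-detected INTEGER
        bigquery.SchemaField("label", "STRING"),
        bigquery.SchemaField("flag", "BOOLEAN"),
    ],
    source_format=bigquery.SourceFormat.CSV,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",            # placeholder URI
    "my_project.my_dataset.my_table",     # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish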
I have some data that was dumped from a PostgreSQL database (allegedly using pg_dump) and needs to be imported into SQL Server.
While the data types are OK, I am running into an issue where there seems to be a placeholder for a NULL: I see a backslash followed by an uppercase N in many fields. Below is a snippet of the data, as viewed from within Excel. The left column has a Boolean data type, and the right one has an integer data type.
Some of these are supposed to be of the Boolean datatype, and having two characters in there is most certainly not going to fly.
Here's what I tried so far:
Import via dirty read, keeping whatever data types SSIS decided each field had; to no avail. There were error messages about truncation on all of the Boolean fields.
Creating a table for the data based on the correct data types, though this was more involved: I needed to do the same as in the dirty read, as the source would otherwise not load properly. There was also a need to transform the data into the correct data type for insertion into the destination; yet I am getting truncation issues when there most certainly shouldn't be any.
Here is a sample expression in my derived column transformation editor:
(DT_BOOL)REPLACE(observation,"\\N","")
The data type should be Boolean.
Any suggestion would be really helpful!
Thanks!
Since I was unable to circumvent the SSIS rules in order to get my data into my tables without an error, I took the quick-and-dirty approach.
The solution which worked for me was to have the source read each column as if it were a string, and to make every field in the destination table of the data type VARCHAR. This destination table is used as a staging table; once the data is in SQL Server, I can manipulate it as needed.
Thank you @cha for your input.
I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).
What is the best way to load this into BigQuery? Do I have to find and edit all these files to either
- add an empty eighth column in all files
- remove the eighth column from all files? I don't care about the value in this column.
Is there a way to specify eight columns in the schema and add a null value for the eighth column when there is no data available?
I am using BigQuery APIs to load data if that might help.
You can use the 'allowJaggedRows' argument, which will treat non-existent values at the end of a row as nulls. So your schema could have 8 columns, and all of the rows that don't have that value will be null.
This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows
I've filed a doc bug to make this easier to find.
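For illustration, a sketch of the job configuration with the Python client library (column names are placeholders; the same allowJaggedRows flag goes under configuration.load in the raw REST API):

from google.cloud import bigquery

# Eight-column schema; with allow_jagged_rows=True, rows that stop after
# seven values get NULL for the trailing "extra" column.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("col%d" % i, "STRING") for i in range(1, 8)]
    + [bigquery.SchemaField("extra", "STRING", mode="NULLABLE")],
    source_format=bigquery.SourceFormat.CSV,
    allow_jagged_rows=True,
)
# Pass job_config to client.load_table_from_uri(...) as in any other load job.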
If your logs are in JSON, you can define a nullable field, and if it does not appear in the record, it would remain null.
I am not sure how it works with CSV, but I think that you have to have all fields (even empty).
There is a possible solution here if you don't want to worry about having to change the CSV values (which would be my recommendation otherwise).
If the number of rows with an eighth column is fairly small and you can afford to "sacrifice" those rows, then you can pass a maxBadRecords parameter with a reasonable number. In that case, all the "bad" rows (i.e. the ones not conforming to the schema) would be ignored and wouldn't be loaded.
If you are using BigQuery for statistical information and you can afford to ignore those rows, this could solve your problem.
Found a workable "hack".
I ran a job for each file with the seven-column schema and then ran another job on all files with the eight-column schema. One of the jobs would complete successfully. This saved me the time of editing each file individually and re-uploading 1000+ files.
I have created a BigQuery table with a schema through the browser tool. Now I am appending a CSV to that table through API calls from my local machine, but that CSV doesn't contain all the columns that I already specified, so I am getting the error "Provided Schema does not match Table xxxxxxxxxxxxx:xxxx.xxxx.". How can I append values to a BigQuery table for only some specific columns through API calls?
You can append to a table using a narrower schema. That is, if your table has fields 'A', 'B', and 'C', you can append a CSV file that has only fields 'A', and 'C', as long as field 'B' is marked as Optional (which is the default). Just make sure you provide the correct schema of your CSV file with the load job.
Alternately, the 'allowJaggedRows' option may help you here. This lets you specify fewer columns than are in your schema, and the remainder will be padded with nulls. Of course, if you want to skip columns in the middle of the schema, you may be out of luck.
Finally, you can update the schema to add columns via the tables.update() or tables.patch() call. This can let you add columns that weren't in the original table.
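For that last option, a sketch with the Python client library (table and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my_project.my_dataset.my_table")  # placeholder table
# Append a NULLABLE column to the existing schema (tables.patch under the hood);
# existing rows will have NULL for the new field.
table.schema = list(table.schema) + [
    bigquery.SchemaField("new_col", "STRING", mode="NULLABLE"),
]
client.update_table(table, ["schema"])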
Using the JSON data format instead of CSV you can do so, provided that your fields are marked as NULLABLE (and not REQUIRED).
This part of the documentation introduces the JSON format: https://developers.google.com/bigquery/preparing-data-for-bigquery