I have a chunk of data on S3 in ORC format. My requirement is to mask certain columns. What would be the best approach with minimal changes?
Can I define a Hive table-level UDF so that whenever the column is referenced from Hive/Presto, the UDF executes by default and masks the data on the fly?
Your response will be appreciated.
Thanks!
This is called column masking.
To achieve this, you would typically use Presto (or Hive) with Ranger security, and configure column masking there.
Ranger defines a set of predefined masks (e.g. mask all but the last 4 characters/digits, etc.).
Ranger also allows custom masks (free style expression), but that is not supported by Presto yet.
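For intuition, here is a minimal Python sketch of what a "show last 4" style mask does to a value. It only illustrates the behaviour of that kind of predefined mask; it is not how Ranger or Presto apply it:

def mask_show_last_4(value, mask_char="x"):
    # Illustration only: keep the last 4 characters, mask the rest.
    if value is None or len(value) <= 4:
        return value
    return mask_char * (len(value) - 4) + value[-4:]

print(mask_show_last_4("4111111111111111"))  # xxxxxxxxxxxx1111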
I want to query data that is stored in MongoDB and exported out into a number of JSON files stored in S3.
I am using AWS Glue to read the files into Athena; however, the data type for the id on each table is imported as struct<$oid:string>.
I have tried every variation of adding quotation marks around the fields, with no luck. Everything I try results in the error: name expected at the position 7 of 'struct<$oid:string>' but '$' is found.
Is there any way I can read these tables in their current form or do I need to declare their type in Glue?
Glue Crawlers create schemas that match what they find, without considering whether they will work with, for example, Athena. In Athena you can't have a struct property whose name starts with $, but Glue doesn't take that into account, partly because you may be using the table with something else where that is not a problem, and partly because what else could it do: that is the name of the property.
There are two ways around it, but neither will work if you continue to use a crawler. You will need to modify the table schema, but if you keep running the crawler it will just revert your changes.
The first, and probably simplest, option is to change the type of the column to STRING and then use a JSON function at query time to extract the value using JSONPath ($ is a special character in JSONPath, but you should be able to escape it).
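For example, with the column retyped to STRING, a query along these lines should work; the table and database names below are made up, and the bracket-notation JSONPath is how Presto/Athena let you reference a key that contains $. A sketch submitted through boto3, not tested against your data:

import boto3

athena = boto3.client("athena")

# Hypothetical names; `id` is assumed to have been retyped to STRING in the Glue table.
query = """
SELECT json_extract_scalar(id, '$["$oid"]') AS object_id
FROM mongo_export
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)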
The second option is to use the "mappings" feature of the Hive JSON serde. I'm not 100% sure if it will work for this case, but it could be worth a try. The docs are not very extensive on how to configure it, unfortunately.
I would like to cross-check my understanding of the differences between file formats like Apache Avro and Apache Parquet in terms of schema evolution. Looking at various blogs and SO answers gives me the following understanding. I need to verify whether my understanding is correct, and I would also like to know if I am missing any other differences with respect to schema evolution. The explanation is given in terms of using these file formats in Apache Hive.
Adding a column: Adding a column (with a default value) at the end of the columns is supported in both file formats. I think adding a column (with a default value) in the middle of the columns can be supported in Parquet if the Hive table property hive.parquet.use-column-names=true is set. Is this not the case?
Deleting a column: As far as deleting a column at the end of the column list is concerned, I think it is supported in both file formats: even if a Parquet/Avro file still contains the deleted column, since the reader schema (the Hive schema) no longer has it, the extra column in the writer's schema (the actual Avro or Parquet file schema) will simply be ignored in both formats. Deleting a column in the middle of the column list can also be supported if the property hive.parquet.use-column-names=true is set. Is my understanding correct?
Renaming a column: Since Avro has a column alias option, renaming a column is supported in Avro but not in Parquet, because Parquet has no such column aliasing option. Am I right?
Data type change: This is supported in Avro because we can define multiple data types for a single column using a union type, but it is not possible in Parquet because Parquet has no union type.
Am I missing any other possibility? I appreciate the help.
hive.parquet.use-column-names=true needs to be set for accessing columns by name in Parquet. It is not only for column addition/deletion. Manipulating columns by indices would be cumbersome to the point of being infeasible.
There is a workaround for column renaming as well. Refer to https://stackoverflow.com/a/57176892/14084789
Union is a challenge with Parquet.
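To make the union point concrete, here is a hedged sketch in Python using fastavro (the record and field names are invented): a column whose type changed from int to string stays readable under one Avro schema, which Parquet cannot express because it has no union type.

import io
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Payment",
    "fields": [
        {"name": "id", "type": "string"},
        # Union type: older files wrote ints, newer files write strings.
        {"name": "amount", "type": ["null", "int", "string"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, schema, [
    {"id": "a1", "amount": 100},       # record written before the type change
    {"id": "a2", "amount": "100.50"},  # record written after the type change
])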
I would like to know if the pseudocode below is an efficient method to read multiple Parquet files in a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the Parquet files are not partitioned by date.
I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019).
Read all data using the * wildcard:
df = spark.read.parquet("uat/EntityName/*/*/*/*")
Add a column FileTimestamp that extracts the timestamp from EntityName_2019_01_01_HHMMSS.parquet using a string operation and converts it to TimestampType():
from pyspark.sql import functions as F

# Pull the YYYY_MM_DD_HHMMSS part out of the file name and cast it to a timestamp.
ts = F.regexp_extract(F.input_file_name(), r"(\d{4}_\d{2}_\d{2}_\d{6})\.parquet", 1)
df = df.withColumn("FileTimestamp", F.to_timestamp(ts, "yyyy_MM_dd_HHmmss"))
Use filter to get relevant data:
start_date = '2018-12-15 00:00:00'
end_date = '2019-02-15 00:00:00'
df = df.filter(df.FileTimestamp >= start_date).filter(df.FileTimestamp < end_date)
Essentially I'm using PySpark to simulate the neat syntax available in U-SQL:
@rs =
    EXTRACT
        user    string,
        id      string,
        __date  DateTime
    FROM
        "/input/data-{__date:yyyy}-{__date:MM}-{__date:dd}.csv"
    USING Extractors.Csv();

@rs =
    SELECT *
    FROM @rs
    WHERE
        __date >= System.DateTime.Parse("2016/1/1") AND
        __date < System.DateTime.Parse("2016/2/1");
The correct way to partition your data is to use folder names of the form year=2019, month=01, etc. when you write it.
When you query this data with a filter such as:
df.filter(df.year >= myYear)
Then Spark will only read the relevant folders.
It is very important that the filtering column name appears exactly in the folder name. Note that when you write partitioned data using Spark (for example by year, month, day) it will not write the partitioning columns into the parquet file. They are instead inferred from the path. It does mean your dataframe will require them when writing though. They will also be returned as columns when you read from partitioned sources.
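A hedged PySpark sketch of that write/read pattern, reusing the EntityName path from the question (the year, month, and day columns are assumed to exist on the dataframe):

# Write: each distinct (year, month, day) combination becomes a folder such as year=2019/month=1/day=1.
(df.write
   .partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("uat/EntityName"))

# Read: the partition columns come back as regular columns, and filters on them prune folders.
df = spark.read.parquet("uat/EntityName")
df_window = df.filter((df.year == 2019) & (df.month == 1))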
If you cannot change the folder structure, you can always manually reduce the folders for Spark to read using a regex or glob; this article should provide more context: Spark SQL queries on partitioned data using Date Ranges. But clearly this is more manual and complex.
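If the layout really has to stay as uat/EntityName/2019/01/01/..., a sketch of that manual approach is to pass Hadoop glob patterns (brace alternation and character ranges are supported) so Spark only lists the folders in the date range; the patterns below are illustrative and assume zero-padded month/day folders:

df = spark.read.parquet(
    "uat/EntityName/2018/12/{1[5-9],2[0-9],3[01]}/*.parquet",  # 15-31 Dec 2018
    "uat/EntityName/2019/01/*/*.parquet",                      # all of Jan 2019
    "uat/EntityName/2019/02/{0[1-9],1[0-4]}/*.parquet",        # 01-14 Feb 2019
)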
UPDATE: Further example Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?
Also from "Spark - The Definitive Guide: Big Data Processing Made Simple" by Bill Chambers:
Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When you write a file to a partitioned directory (or table), you basically encode a column as a folder. What this allows you to do is skip lots of data when you go to read it in later, allowing you to read in only the data relevant to your problem instead of having to scan the complete dataset.
...
This is probably the lowest-hanging optimization that you can use when you have a table that readers frequently filter by before manipulating. For instance, date is particularly common for a partition because, downstream, often we want to look at only the previous week’s data (instead of scanning the entire list of records).
I'm setting up a pipeline in NiFi where I get JSON records which I then use to make a request to an API. The response would have both numeric and textual data. I then have to write this data to Hive. I use InferAvroSchema to infer the schema. Some numeric values are signed, like -2.46 and -0.1, but while inferring the type, the processor treats them as string instead of double, float, or decimal.
I know we can hard-code our Avro schema in the processors, but I thought making it more dynamic by utilizing InferAvroSchema would be even better. Is there any other way we can resolve this?
InferAvroSchema is good for guessing an initial schema, but once you need something more specific it is better to remove InferAvroSchema and provide the exact schema you need.
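For instance, a hedged sketch of an explicit schema you could supply instead (the field names are invented; the point is declaring the signed readings as double up front rather than letting InferAvroSchema guess string):

import json

explicit_schema = {
    "type": "record",
    "name": "ApiResponse",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        # Declared as double so values like -2.46 or -0.1 are not inferred as strings.
        {"name": "reading", "type": ["null", "double"], "default": None},
    ],
}

print(json.dumps(explicit_schema, indent=2))  # paste into the processor's schema configuration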
Hopefully this is easy to explain, but I have a Lookup transformation in one of my SSIS packages. I am using it to look up the id for an employee record in a dimension table. However, my problem is that some of the source data has employee names in all capitals (ex: CHERRERA) while the comparison data I'm using is all lower case (ex: cherrera).
The lookup is failing for the records that are not 100% case identical (ex: cherrera vs cherrera works fine - cherrera vs CHERRERA fails). Is there a way to make the Lookup transformation ignore case on a string/varchar data type?
I don't believe there is a way to make the transformation itself case-insensitive; however, you could modify the SQL statement for your transformation to ensure that the source data matches the case of your comparison data by using the LOWER() string function.
Set the CacheType property of the lookup transformation to Partial or None.
The lookup comparisons will now be done by SQL Server and not by the SSIS lookup component, and will no longer be case sensitive.
You have to change both the source and the lookup data; both should be in the same case.
Based on this Microsoft Article:
The lookups performed by the Lookup transformation are case sensitive. To avoid lookup failures that are caused by case differences in data, first use the Character Map transformation to convert the data to uppercase or lowercase. Then, include the UPPER or LOWER functions in the SQL statement that generates the reference table.
To read more about the Character Map transformation, follow this link:
Character Map Transformation