Reading HDFS extended attributes in HiveQL

I am working on a use case where we would like to add metadata (e.g. load time, data source...) to raw files as HDFS extended attributes (xattrs).
I was wondering whether there is a way for HiveQL to retrieve such metadata in queries and return it in the result set.
This would avoid storing such metadata in each record within raw files.
Would a custom Hive SerDe be a way to make such xattrs available? Otherwise, do you see another way to make this possible?
I am still a relative novice at this, so bear with me if I have misused any terms.
Thanks

There may be other ways to implement this, but after I discovered the Hive virtual column 'INPUT__FILE__NAME', which contains the URL of the source HDFS file, I created a user-defined function (UDF) in Java to read its extended attributes. The function can be used in a Hive query as:
XAttrSimpleUDF(INPUT__FILE__NAME,'user.my_key')
The (quick and dirty) Java source code of the UDF looks like:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class XAttrSimpleUDF extends UDF {

    public Text evaluate(Text uri, Text attr) {
        if (uri == null || attr == null) return null;
        Text xAttrTxt = null;
        try {
            Configuration myConf = new Configuration();
            // Create the filesystem from the file URI
            URI myURI = URI.create(uri.toString());
            FileSystem fs = FileSystem.get(myURI, myConf);
            // Retrieve the value of the extended attribute
            xAttrTxt = new Text(fs.getXAttr(new Path(myURI), attr.toString()));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return xAttrTxt;
    }
}
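For a quick sanity check outside of Hive, the UDF can also be invoked directly. This is only a hypothetical sketch: the HDFS URI and the xattr name are placeholders, and the xattr would have to be set on the file beforehand (e.g. with 'hdfs dfs -setfattr'):
import org.apache.hadoop.io.Text;

// Hypothetical sketch: exercises the UDF outside Hive against a made-up HDFS path.
public class XAttrSimpleUDFTest {
    public static void main(String[] args) {
        Text uri = new Text("hdfs://namenode:8020/data/raw/file1.csv"); // placeholder URI
        Text value = new XAttrSimpleUDF().evaluate(uri, new Text("user.my_key"));
        System.out.println(value); // prints the xattr value, or null if the lookup failed
    }
}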
I haven't tested the performance of this when querying very large data sets.
I do wish extended attributes could be retrieved directly as a virtual column, similar to the virtual column INPUT__FILE__NAME.

Materialized view to use different Serde

Version used: Kafka 3.1.1, Confluent 7.1.0, Avro 1.11.0
I'm creating a REST controller which "searches" for Avro objects in a topic. The objects in the topic are serialized using SpecificAvroSerde<>. Each topic has two Avro schemas assigned: one for the key (several fields of various types) and one for the value (multiple fields and types).
I've done this several times before: I consume the topic into a KTable and then materialize it. Only one pair of serdes is involved, and the serialized format is the same for both the topic and the materialized view (RocksDB). The REST controller can then look up the store and either perform a get with a key or do a range scan between two keys. This all works as expected.
private final static String TOPIC_NAME = "input-topic";
private final static String VIEW_NAME = "materialized-view";

private final SpecificAvroSerde<ProductXrefKey> productXrefKeySerde = new SpecificAvroSerde<>();
private final SpecificAvroSerde<ProductXref> productXrefSerde = new SpecificAvroSerde<>();

final Map<String, Object> props = this.kafkaProperties.buildStreamsProperties();
productXrefKeySerde.configure(props, true);
productXrefSerde.configure(props, false);

KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .table(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde),
                Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                        .withKeySerde(productXrefKeySerde)
                        .withValueSerde(productXrefSerde));
<…>
final ReadOnlyKeyValueStore<ProductXrefKey, ProductXref> store =
        streamsBuilderFactoryBean.getKafkaStreams().store(fromNameAndType(VIEW_NAME, keyValueStore()));

try (KeyValueIterator<ProductXrefKey, ProductXref> range = store.range(fromKey, toKey)) {
    if (range != null) {
        range.forEachRemaining(kv -> {
            <…>
        });
    } else {
        log.info("Could not find {} in local ReadOnlyKeyValueStore {}", fromKey, VIEW_NAME);
    }
}
I now want to change this to use a prefix scan instead. Since the key contains multiple fields, there is no way to serialize only the first part (i.e. the first few fields) of the key, so I need a specialized serializer. This also means I have to use a different serializer for the materialized view itself (SpecificAvroSerde puts the magic byte and schema ID at the beginning of the byte array), as otherwise the serialized prefix and the serialized keys in the materialized view can't be compared. Hence I created a specialized Serde which serializes the key using the same logic as when serializing the prefix, but omits the fields not required for the scan (i.e. the last field). The code above now looks like this:
private final static String TOPIC_NAME = "input-topic";
private final static String VIEW_NAME = "materialized-view";

private final SpecificAvroSerde<ProductXrefKey> productXrefKeySerde = new SpecificAvroSerde<>();
private final SpecificAvroSerde<ProductXref> productXrefSerde = new SpecificAvroSerde<>();
private final SpecificAvroSerde<ProductXrefKey> materializedProductXrefKeySerde = new ProductXrefKeySerde();
// for the value part we can still use the standard serde, as no change in serialization logic is needed
private final SpecificAvroSerde<ProductXref> materializedProductXrefSerde = new SpecificAvroSerde<>();
// telling the serializer to cut off the last field
private final SpecificAvroSerde<ProductXrefKey> prefixScanProductXrefSerde = new ProductXrefKeySerde(true);

final Map<String, Object> props = this.kafkaProperties.buildStreamsProperties();
productXrefKeySerde.configure(props, true);
productXrefSerde.configure(props, false);

KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .table(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde),
                Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                        .withKeySerde(materializedProductXrefKeySerde)
                        .withValueSerde(materializedProductXrefSerde));
<…>
final ReadOnlyKeyValueStore<ProductXrefKey, ProductXref> store =
        streamsBuilderFactoryBean.getKafkaStreams().store(fromNameAndType(VIEW_NAME, keyValueStore()));

try (KeyValueIterator<ProductXrefKey, ProductXref> range = store.prefixScan(prefixKey, prefixScanProductXrefSerde)) {
    if (range != null) {
        range.forEachRemaining(kv -> {
            <…>
        });
    } else {
        log.info("Could not find {} in local ReadOnlyKeyValueStore {}", prefixKey, VIEW_NAME);
    }
}
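The ProductXrefKeySerde referenced above is not shown in the question. As a rough sketch of the idea it describes, the serializer below writes the key's Avro fields directly (no Confluent magic byte or schema ID) and can optionally drop the last field so the same byte layout works as a scan prefix. This is an illustration only: it uses a plain Kafka Serializer rather than extending SpecificAvroSerde, and the key field names (productId, siteId, xrefId) are made up.
// Hypothetical sketch of a prefix-friendly key serializer; field names are invented.
public class ProductXrefKeyPrefixSerializer implements org.apache.kafka.common.serialization.Serializer<ProductXrefKey> {

    private final boolean prefixOnly;

    public ProductXrefKeyPrefixSerializer(boolean prefixOnly) {
        this.prefixOnly = prefixOnly;
    }

    @Override
    public byte[] serialize(String topic, ProductXrefKey key) {
        if (key == null) return null;
        try (java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream()) {
            org.apache.avro.io.BinaryEncoder encoder =
                    org.apache.avro.io.EncoderFactory.get().binaryEncoder(out, null);
            // Fields are written in a fixed order so that full keys and prefixes share leading bytes
            encoder.writeString(key.getProductId().toString());
            encoder.writeString(key.getSiteId().toString());
            if (!prefixOnly) {
                encoder.writeString(key.getXrefId().toString()); // omitted when serializing the prefix
            }
            encoder.flush();
            return out.toByteArray();
        } catch (java.io.IOException e) {
            throw new org.apache.kafka.common.errors.SerializationException(e);
        }
    }
}
A full Serde could then wrap this serializer (plus a matching deserializer), for example via Serdes.serdeFrom(serializer, deserializer); because the full key and the prefix share the same leading bytes, prefixScan can compare them directly.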
My assumption was that the topic gets deserialized using the SpecificAvroSerde and then gets serialized for the view using my ProductXrefKeySerde. The problem is that the content in the materialized view is still serialized with the same logic as in the original topic. It appears that my serializer is never used while the topic is processed and stored in the materialized view. I can verify this on the file system as well: the keys in the RocksDB files are serialized with the magic byte and schema ID, and hence prefixScan won't be able to find anything.
How can I change the serialization format for the materialized view?
Or is there a better way for serializing a prefix AVRO object?
It appears that there is some optimization happening which avoids the deserialization/serialization round trip when a KTable is materialized directly from the topic. I've changed the logic so that it consumes the topic as a KStream and then creates the KTable from it (toTable(...)):
KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .stream(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde))
        .toTable(Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                .withKeySerde(productXrefKeySerde)
                .withValueSerde(productXrefSerde));
With this small change, the data now gets deserialized (using SpecificAvroSerde<>) and serialized again using the provided ProductXrefKeySerde. The prefix scan now works as well and returns the records as expected.

Flink statefun and confluent schema registry compatibility

I'm trying to egress to Confluent Kafka from Flink Stateful Functions. In the Confluent git repo example,
in order to schema-check and put data into a Kafka topic, all we need to do is use the Kafka client's ProducerRecord object with an Avro object.
But in Statefun we need to override the "ProducerRecord<byte[], byte[]> serialize" method for the Kafka egress. This causes the following error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: "bytes"
The schema registry and the Statefun Kafka egress seem to be incompatible. Is there any workaround?
It is possible to use the Confluent Schema Registry with a Statefun egress.
To do so, you first register your schema manually with the schema registry and then have your KafkaEgressSerializer supply a byte[] produced by a KafkaAvroSerializer instance.
The code below is the gist of it and follows the first of Igal's workaround suggestions:
import java.io.IOException;

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressSerializer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpecificRecordFromAvroSchemaSerializer implements KafkaEgressSerializer<SpecificRecordGeneratedFromAvroSchema> {

    private static final String KAFKA_TOPIC = "kafka_topic";

    private static final CachedSchemaRegistryClient schemaRegistryClient = new CachedSchemaRegistryClient(
            "http://schema-registry:8081",
            1_000
    );

    private static final KafkaAvroSerializer kafkaAvroSerializer = new KafkaAvroSerializer(schemaRegistryClient);

    static {
        try {
            schemaRegistryClient.register(
                    KAFKA_TOPIC + "-value", // assuming subject name strategy is TopicNameStrategy (default)
                    SpecificRecordGeneratedFromAvroSchema.getClassSchema()
            );
        } catch (IOException e) {
            e.printStackTrace();
        } catch (RestClientException e) {
            e.printStackTrace();
        }
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(SpecificRecordGeneratedFromAvroSchema specificRecordGeneratedFromAvroSchema) {
        byte[] valueData = kafkaAvroSerializer.serialize(
                KAFKA_TOPIC,
                specificRecordGeneratedFromAvroSchema
        );
        return new ProducerRecord<>(
                KAFKA_TOPIC,
                String.valueOf(System.currentTimeMillis()).getBytes(),
                valueData
        );
    }
}
The schema registry is not directly supported at this version of Stateful Functions, but a few workarounds are possible:
Connect to the schema registry yourself from the KafkaEgressSerializer class. In your linked example that would need to happen here.
Provide your own instance of a FlinkKafkaProducer (see AvroDeserializationSchema).
Manage the schemas outside of Stateful Functions, but serialize your Avro record to bytes. Make sure to remove the schema registry from the properties being passed to the KafkaProducer. (A sketch of this approach follows below.)
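A minimal sketch of that third option, reusing the SpecificRecordGeneratedFromAvroSchema class name from the example above: the record is written as plain Avro binary bytes, with no schema ID embedded and no schema registry involved. This is an illustration under those assumptions, not code from the original answer.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressSerializer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: plain Avro binary encoding, schemas managed outside Statefun.
public class SchemaRegistryFreeSerializer implements KafkaEgressSerializer<SpecificRecordGeneratedFromAvroSchema> {

    private static final String KAFKA_TOPIC = "kafka_topic";

    @Override
    public ProducerRecord<byte[], byte[]> serialize(SpecificRecordGeneratedFromAvroSchema record) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            SpecificDatumWriter<SpecificRecordGeneratedFromAvroSchema> writer =
                    new SpecificDatumWriter<>(SpecificRecordGeneratedFromAvroSchema.getClassSchema());
            writer.write(record, encoder);
            encoder.flush();
            // Key is arbitrary here, mirroring the example above
            return new ProducerRecord<>(KAFKA_TOPIC,
                    String.valueOf(System.currentTimeMillis()).getBytes(),
                    out.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException("Failed to serialize Avro record", e);
        }
    }
}
The consuming side must then obtain the writer schema by other means (for example a shared artifact), since nothing in the payload identifies it.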

Apache Ignite Continuous Queries : How to get the field names and field values in the listener updates when there are dynamic fields?

I am working on a POC to decide whether or not we should go ahead with Apache Ignite, for both commercial and enterprise use. There is a use case, though, that we are trying to find an answer for.
Preconditions
Dynamic creation of tables, i.e. there may be new fields that get put into the cache, meaning there is no precompiled POJO (model) defining the attributes of the table/cache.
Use case
I would like to write a SELECT continuous query that gives me the results that were modified. I wrote that query, but the problem is that when the listener gets a notification, I am not able to find, from any method call, all the field names that were modified. I would like to get all the field names and field values in some sort of Map, which I can then use and submit to other systems.
You could track all modified field values using binary objects and a continuous query:
IgniteCache<Integer, BinaryObject> cache = ignite.cache("person").withKeepBinary();

ContinuousQuery<Integer, BinaryObject> query = new ContinuousQuery<>();
query.setLocalListener(events -> {
    for (CacheEntryEvent<? extends Integer, ? extends BinaryObject> event : events) {
        BinaryType type = ignite.binary().type("Person");
        if (event.getOldValue() != null && event.getValue() != null) {
            HashMap<String, Object> oldProps = new HashMap<>();
            HashMap<String, Object> newProps = new HashMap<>();
            // Collect all declared fields of the binary type for the old and new versions of the entry
            for (String field : type.fieldNames()) {
                oldProps.put(field, event.getOldValue().field(field));
                newProps.put(field, event.getValue().field(field));
            }
            // Diff the two maps to find which fields actually changed
            com.google.common.collect.MapDifference<Object, Object> diff =
                    com.google.common.collect.Maps.difference(oldProps, newProps);
            System.out.println(diff.entriesDiffering());
        }
    }
});
cache.query(query);

cache.put(1, ignite.binary().builder("Person").setField("name", "Alice").build());
cache.put(1, ignite.binary().builder("Person").setField("name", "Bob").build());

Does this saving/loading pattern have a name?

There's a variable persistence concept I have integrated multiple times:
// Standard initialization
boolean save = true;
Map<String, Object> dataHolder;

// Variables to persist
int number = 10;
String text = "I'm saved";

// Use the variables in various ways in the project
void useVariables() { ... number ... text ... }

// Function to save the variables into a data structure and, for example, write them to a file
public Map<String, Object> getVariables()
{
    Map<String, Object> data = new LinkedHashMap<String, Object>();
    persist(data);
    return data;
}

// Function to load the variables from the data structure
public void setVariables(Map<String, Object> data)
{
    persist(data);
}

void persist(Map<String, Object> data)
{
    // If the given data structure is empty, it means data should be saved
    save = data.isEmpty();
    dataHolder = data;
    number = handleVariable("theNumber", number);
    text = handleVariable("theText", text);
    ...
}

@SuppressWarnings("unchecked")
private <T> T handleVariable(String name, T value)
{
    // If currently saving
    if (save)
        dataHolder.put(name, value);        // Just add to the data structure
    else                                    // If currently loading
        return (T) dataHolder.get(name);    // Read and return from the data structure
    return value;                           // Return the given variable (no change)
}
The main benefit of this principle is that there is only a single place where you have to mention new variables you add during development, and it's one simple line per variable.
Of course you can move the handleVariable() function to a different class which also contains the "save" and "dataHolder" variables, so they won't be in the main application.
Additionally, you could attach meta-information required for persisting the data structure to a file (or similar) to each variable, by storing a custom class containing this information plus the variable, instead of the object itself.
Performance could be improved by keeping track of the order (in another data structure the first time persist() runs) and using a "dataHolder" based on an array instead of a search-based map (i.e. using an index instead of a name string); a rough sketch of that variant follows below.
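The following is my own illustration of that index-based variant, not code from the original; it assumes all variables are handled in the same order on every run:
import java.util.ArrayList;
import java.util.List;

// Sketch: same handleVariable() idea, but addressed by a running index instead of a name string.
class IndexedPersister
{
    private boolean save = true;
    private List<Object> dataHolder = new ArrayList<>();
    private int index;                       // position of the variable currently being handled

    void persist(List<Object> data)
    {
        save = data.isEmpty();               // an empty list means we are saving
        dataHolder = data;
        index = 0;                           // variables must always be handled in the same order
        // number = handleVariable(number);
        // text = handleVariable(text);
    }

    @SuppressWarnings("unchecked")
    private <T> T handleVariable(T value)
    {
        if (save) {
            dataHolder.add(value);           // append in order while saving
            return value;
        }
        return (T) dataHolder.get(index++);  // read back by position while loading
    }
}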
However, this is the first time I have to document it, so I wondered whether this function-reuse principle has a name.
Does anyone recognize this idea?
Thank you very much!

GridGain SQL Transform Query Limitations

I am running into an issue with an SQL transform query. I have a replicated cache set up with thousands of cached items of various classes. When I run a transform query that returns specific (summary) items from classes in the cache, the query appears to execute just fine and returns a Collection. However, when I iterate through the Collection, after 2,048 items the individual items (which were castable until then) are now simply a 'GridCacheQueryResponseEntry', which I can't seem to cast anymore...
Is 2,048 items the limit for a Transform Query Result Set in GridGain?
Here's the code I use to query/transform the cache items (simplified a bit). It works for exactly 2,048 items and then throws an exception:
GridCacheQuery<Map.Entry<UUID, Object>> TypeQuery =
        queries.createSqlQuery(Object.class, "from Object where Type = ? and Ident regexp ?");

GridClosure<Map.Entry<UUID, Object>, ReturnObject> Trans = new GridClosure<Map.Entry<UUID, Object>, ReturnObject>() {
    @Override public ReturnObject apply(Map.Entry<UUID, Object> e) {
        ReturnObject tmp = null;
        try {
            tmp = e.getValue().getReturnObject();
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
        return tmp;
    }
};

Collection<ReturnObject> results = TypeQuery.execute(Trans, "VarA", "VarB").get();

Iterator iter = results.iterator();
while (iter.hasNext()) {
    try {
        Object item = iter.next();
        ReturnObject point = (ReturnObject) item;
    } catch (Exception ex) {}
}
There are no such limitations in GridGain. Whenever you execute a query, you have two options:
Call the GridCacheQueryFuture.get() method to get the whole result set as a collection. This works only for relatively small result sets, because all the rows have to be loaded into the client node's memory.
Use GridCacheQueryFuture.next() to iterate through the result set. In this case results are fetched from the remote nodes page by page. When you finish iterating through a page, it is discarded and the next one is loaded, so you only hold one page at a time, which lets you query result sets of any size.
As for GridCacheQueryResponseEntry, you should not cast to it: it's an internal GridGain class and is simply an implementation of the Map.Entry interface that represents a key-value pair from the GridGain cache.
In the case of a transform query you get Map.Entry instances only inside the transformer, while the client node receives already-transformed values, so I'm not sure how it's possible to get them during iteration. Can you provide a small code example of how you execute the query?