How to read specific columns from a Parquet file in Java

I am using a WriteSupport that knows how to write my custom object 'T' into Parquet. I am interested in reading only 2 or 3 specific columns out of the 100 columns of my custom object that are written into the Parquet file.
Most examples online extend ReadSupport and read the entire record. I want to accomplish this without using things like Spark, Hive, Avro, Thrift, etc.
Is there an example in Java that reads selected columns of a custom object from Parquet?

This post may help.
Read specific column from Parquet without using Spark
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection.)
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
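For illustration, here is a minimal sketch of that approach; the record type, column names, and file path are hypothetical, and the projection schema lists only the columns to read back:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;

public class ProjectedParquetRead {
    public static void main(String[] args) throws Exception {
        // Projection schema: only the two columns we actually need (field names are hypothetical).
        String projection = "{\"type\":\"record\",\"name\":\"T\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

        Configuration conf = new Configuration();
        conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection);
        // Depending on the parquet-avro version, the equivalent helper
        // AvroReadSupport.setRequestedProjection(conf, avroSchema) may be needed instead.

        // The reader then returns records containing only the projected columns.
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(new Path("/data/objects.parquet")) // hypothetical path
                .withConf(conf)
                .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.get("id") + " / " + record.get("name"));
            }
        }
    }
}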

Related

Does Tableau support complex data type of hive columns and flatten it?

I am trying to create a dashboard from the data present in Hive. The catch is that the column I want to visualize is a nested JSON type. So will Tableau be able to parse and flatten the JSON column and list out all possible attributes? Thanks!
Unfortunately Tableau will not automatically flatten the JSON structure of the field for you, but you can manually do so.
Here is an article that explains the use of Regex within Tableau to extract pertinent information from your JSON field.
I realize this may not be the answer you were looking for, but hopefully it gets you started down the right path.
(In case it helps, Tableau does have a JSON connector in the event you are able to connect to your JSON directly as a data source, rather than as a complex field type embedded in your Hive connection.)

Apache ignite json field

Perhaps I am missing this in the documentation, but is it possible to store and query against json data in Apache Ignite? For example, let's say I have a "table" called "cars" with the following fields:
model
blueprint
The "blueprint" field is actually a json field that may contain data such as:
{
horsepower: 200,
mpg: 30
}
Those are not the only fields for the "blueprint" field; it may contain many more or fewer fields. Is it possible to run a query such as:
SELECT model FROM cars WHERE blueprint.horsepower < 300 AND blueprint.mpg > 20
It is not known in advance what the fields of the "blueprint" field will be, and creating indexes for them in advance is not an option.
Note: This is not a conversation about whether this is the logically optimal way to store this information, or whether the "blueprint" field should be stored in a separate table. This question is meant to understand if querying against a JSON field is trivially possible in Apache Ignite.
This is not supported out of the box as of now. However, you can create conversion logic between JSON and the Ignite binary format and save BinaryObjects in caches. To create a BinaryObject without a Java class, you can use the binary object builder: https://apacheignite.readme.io/docs/binary-marshaller#modifying-binary-objects-using-binaryobjectbuilder
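As a rough sketch of that approach (the cache name, type name, and field values below are hypothetical and would come from whatever JSON parser you use), a BinaryObjectBuilder lets you store a "car" entry without defining a Java class:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.binary.BinaryObjectBuilder;

public class CarLoader {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Keep values in binary form so no deserialization class is required.
            IgniteCache<String, BinaryObject> cars =
                    ignite.getOrCreateCache("cars").<String, BinaryObject>withKeepBinary();

            // Field values below would come from the parsed "blueprint" JSON.
            BinaryObjectBuilder builder = ignite.binary().builder("Car"); // hypothetical type name
            builder.setField("model", "roadster");
            builder.setField("horsepower", 200);
            builder.setField("mpg", 30);

            cars.put("1", builder.build());
        }
    }
}

Note that running SQL over those fields would still require the corresponding query entity or index configuration, which is exactly the part that is not possible when the fields are unknown in advance.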

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help with how it gets consumed downstream by a couple of other things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I could write another job to pull from the blobs and break them up after the Stream Analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in the blob prefix is currently not supported.
If you have a limited number of such {id}s, then you could work around this by writing multiple SELECT statements with different filters, each writing to a different output with the prefix hardcoded in that output. Otherwise it is not possible with ASA alone.
It should be noted that you can now actually do this. I'm not sure when it was implemented, but you can use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition

Is it feasible to split data from differently formatted csv files in MS-SQL into several tables with one row per field of a file?

I only found answers about how to import CSV files into the database, for example as a blob or as a 1:1 representation of the table you are importing into.
What I need is a little different: My team and I are tracking everything we do in a database. A lot of these tasks produce logfiles, benchmark results, etc., which are stored in CSV format. The number of columns is far from consistent, and the data can also be completely different from file to file, e.g. it could be a log from Fraps with frame times in it, a log of CPU temperatures over a period of time, or even something completely different.
Long story short, I came up with an idea, but, being far from an SQL pro, I am not sure whether it makes sense or whether there is a more elegant solution.
Does this make sense to you:
We also need to deal with a lot of data, so please also give me your opinion on whether this is feasible with around 200 files per day, each of which can easily have a couple of thousand rows.
The purpose of all this is to generate reports from the stored data and perform analysis on it, e.g. view it on a webpage in a graph or do calculations with it.
I'm limited to MS-SQL in this case, because that's what the current (quite complex) database is and I'm just adding a new schema with that functionality to it.
Currently we just archive the files on a RAID and store a link to them in the database. So everyone who wants to do magic with the data needs to download every file they need and then use R or Excel to create a visualization of the data.
Have you considered a column of XML data type for the file data, as an alternative to the ColumnId -> Data structure? SQL Server provides a special dedicated XML index (over the entire XML structure), so your data can be fully indexed no matter what CSV columns you have. You will have far fewer records in the database to handle (as an entire CSV file will be a single XML field value). There are good XML query options to search by values & attributes of the XML type.
For that you will need to translate CSV to XML, but you will have to parse it either way ...
Not that your plan won't work, I am just giving an idea :)
=========================================================
Update with some online info:
An article from Simple Talk: The XML Methods in SQL Server
Microsoft documentation for nodes(), with various use case samples: nodes() Method (xml Data Type)
Microsoft documentation for value(), with various use case samples: value() Method (xml Data Type)
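To make the idea concrete, here is a small sketch of querying such an XML column from Java over JDBC (the table, column, attribute names and connection details are all hypothetical, and it assumes the Microsoft JDBC driver is on the classpath); nodes() shreds the stored XML into rows and value() pulls typed values out of them:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class XmlColumnQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; the XML layout assumed here is <rows><row fps="..."/></rows>.
        String url = "jdbc:sqlserver://localhost;databaseName=TaskDb;encrypt=false";
        String sql =
            "SELECT f.FileId, r.c.value('@fps', 'float') AS fps " +
            "FROM LogFiles AS f " +
            "CROSS APPLY f.Data.nodes('/rows/row') AS r(c) " +
            "WHERE f.FileId = ?";
        try (Connection conn = DriverManager.getConnection(url, "report_user", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, 42); // hypothetical file id
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("FileId") + " " + rs.getDouble("fps"));
                }
            }
        }
    }
}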

Designing the filesystem and database for JSON data files

I currently have an API which accepts JSON files (JSON-serialised objects containing some user transaction data) and stores them on the server. Every such JSON file has a unique global id and a unique user to which it is associated. The user should then be able to query through all the JSON files associated with him and produce a bunch of aggregated results calculated on top of those files.
Edit:
A typical JSON file that needs to be stored looks something like:
[{"sequenceNumber":125435,"currencyCode":"INR","vatRegistrationNumber":"10868758650","receiptNumber":{"value":"1E466GDX5X2C"},"retailTransaction":[{"otherAttributes":{},"lineItem":[{"sequenceNumber":1000,"otherAttributes":{},"sale":{"otherAttributes":{},"description":"Samsung galaxy S3","unitCostPrice":{"quantity":1,"value":35000},"discountAmount":{"value":2500,"currency":"INR"},"itemSubType":"SmartPhone"}},{"sequenceNumber":1000,"otherAttributes":{},"customerOrderForPickup":{"otherAttributes":{},"description":"iPhone5","unitCostPrice":{"quantity":1,"value":55000},"discountAmount":{"value":5000,"currency":"INR"},"itemSubType":"SmartPhone"}}],"total":[{"value":35000,"type":"TransactionGrossAmount","otherAttributes":{}}],"grandTotal":90000.0,"reason":"Delivery"},null]}]
The above JSON is the serialised version of a complex object whose attributes are single objects or arrays of objects of other classes. So 'receiptNumber' is the universal id of the JSON file.
To answer Sammaye's question, I would need to query things like the quantity and value of the customerOrderForPickup, or the grandTotal of the transaction, as an aggregate over many such transaction JSONs.
I would like some suggestions as to how to go about:
1) Storing these JSON files on the server, i.e. on the file system
2) What kind of database I should use to query through these JSON files with such a complex structure
My research has resulted in a couple of possibilities:
1) Use a MongoDB database to store the JSON representations of the objects and query the database. How would the JSON files be stored? What would be the best way to store the transaction JSONs in the MongoDB database?
2) Couple an SQL database containing the unique global id, the user id and the address of the JSON file on the server with aggregating code that runs over those files. I doubt this can scale.
Would be glad if someone has any insights on the problem. Thanks.
I can see 2 options:
Store in MongoDB, as you mentioned: just create a collection and add each JSON file directly as a document to the collection. You may need to change the layout of the JSON a bit to improve queryability (see the sketch after this list).
Store in HDFS, and layer Hive on it. There is a JSON SerDe (Serializer Deserializer) in Hive. This would also scale well.
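For the MongoDB option, a minimal sketch with the MongoDB Java driver (the connection string, database, collection, and file path are hypothetical) that stores one uploaded JSON file as one document and fetches it back by receiptNumber could look like this:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.nio.file.Files;
import java.nio.file.Paths;

import static com.mongodb.client.model.Filters.eq;

public class TransactionStore {
    public static void main(String[] args) throws Exception {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // hypothetical URI
            MongoCollection<Document> transactions =
                    client.getDatabase("receipts").getCollection("transactions"); // hypothetical names

            // One uploaded JSON file becomes one document. This assumes the file has been
            // reshaped so its top level is a single JSON object rather than an array,
            // i.e. the "change the layout a bit" mentioned above.
            String json = new String(Files.readAllBytes(Paths.get("/data/txn.json"))); // hypothetical path
            transactions.insertOne(Document.parse(json));

            // Query back by the universal id of the file.
            Document found = transactions.find(eq("receiptNumber.value", "1E466GDX5X2C")).first();
            System.out.println(found);
        }
    }
}

From there, MongoDB's aggregation framework can compute the per-user summaries over those documents.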