Can a Hive custom SerDe produce multiple rows? - hive

I am using Hive 0.13.1 and I created a custom SerDe that is able to process a special kind of XML data. So far so good.
I also created a class for the InputFormat that splits the input data.
Is it possible to produce multiple rows (as output) in the deserialize() function of my custom SerDe (or somewhere else in my SerDe)?
So that I am able to create, e.g., two rows out of one split?
In the deserialize function, as far as I can see (in other SerDe classes), the return value is a single List (the values of one row), and that is displayed as one row.
Let's say I have an XML document like this:
<item>
  <id>0</id>
  <timestamp>00:00:00</timestamp>
  <subitemlist>
    <subitem>1</subitem>
    <subitem>2</subitem>
  </subitemlist>
</item>
My SerDe gets the whole <item> block, and what I want to do now is to create a row in Hive for each <subitem>, each carrying the id of its <item>.
I can't adapt the InputFormat class because the problem is not as trivial as it is in this example :)

No, it's not possible. The SerDe interface serializes/deserializes one record at a time because that's what serialization is supposed to do.
In general, it is not a good design decision to have a SerDe actually transform data; that's what queries, UDFs and UDTFs are for.
The purpose of a SerDe is basically to map a data format to an equivalent Hive schema.
I think the best way to do it is to have a table like
create table xmltable (
id int,
ts timestamp,
subitems array<int>
)
using something with this SerDe, and then create a view on top of it:
CREATE VIEW myview AS
SELECT id, sb FROM xmltable LATERAL VIEW explode(subitems) sb1 AS sb
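With the sample <item> above, this returns two rows, (0, 1) and (0, 2): one row per <subitem>, each carrying the parent id.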

OK, thanks for your answer, Roberto.
In general, it is not a good design decision to have a SerDe actually transform data; that's what queries, UDFs and UDTFs are for
Yeah, you are probably right. The problem is that I need to do some processing based on data from many columns, so a UDF would increase the complexity too much. But still, thanks for the answer.
I now solved it by adapting the next() method in my InputFormat class. (I know I said I didn't want to do this, but ...)
So I'm analysing the <item> tag, and for every <subitem> I return the whole item to the SerDe.

Related

How to view stats on Snowflake?

I am looking for a way to visualize the stats of a table in Snowflake.
The long way would be to pull a meaningful sample of the data with Python and profile it with Pandas, but it is somewhat inefficient and unsafe to pull the data out of Snowflake.
Snowflake's new interface shows these stats graphically, and I would like to know if there is a way to obtain this data with a query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata/statistics about its columns (numeric, categorical)?
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.
You can find a lot of meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
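For example, column-level metadata (names, data types, nullability) can be pulled with a plain query; MYDB and PUBLIC below are placeholders for your own database and schema:
-- MYDB and PUBLIC are placeholders, not real object names
SELECT table_name,
       column_name,
       data_type,
       is_nullable
FROM   MYDB.INFORMATION_SCHEMA.COLUMNS
WHERE  table_schema = 'PUBLIC'
ORDER  BY table_name, ordinal_position;
Note that this gives you schema metadata, not value distributions; for min/max/distinct counts you would still have to run aggregate queries against the table itself.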
Not sure if you're talking about viewing the INFORMATION_SCHEMA as mentioned, but if you need documentation on this whole new interface, it's called Snowsight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.

How can I extract 'Memory Use Statistics' from ST03N?

I want to select the following data from ST03N in a report:
After a performance trace, I noticed that the data might be stored in one of the tables:
MONI
SWNCMONI
I do not exactly know how to extract the CLUSTD data from the table.
I heard of using the function module SWNC_COLLECTOR_GET_AGGREGATES, but the data does not exactly match the data from ST03N.
As one probably knows, the MONI and the newer SWNCMONI database tables are cluster tables and shouldn't be read directly; use the newer FM SWNC_COLLECTOR_GET_AGGREGATES for that.
Nevertheless, if you still want this:
* Read the memory profile records straight from the SWNCMONI cluster table
TYPES: tt_memory TYPE TABLE OF swncaggmemory.

DATA: ms_monikey TYPE swncmonikey,
      dummy      TYPE tt_memory.

FIELD-SYMBOLS: <tab> TYPE ANY TABLE.

ASSIGN dummy TO <tab>.

* Fill the cluster key ('D' = daily period, date in YYYYMMDD format)
ms_monikey-component  = <instance_id>.
ms_monikey-comptype   = 'NW Workload'.
ms_monikey-assigndsys = <host>.
ms_monikey-periodtype = 'D'.
ms_monikey-periodstrt = '20200713'.

* IMPORT ... FROM DATABASE reads the cluster area 'wj' for the given key
IMPORT datatable TO <tab>
  FROM DATABASE swncmoni(wj) ID ms_monikey
  IGNORING STRUCTURE BOUNDARIES.
As you can see, the data for PFCG differs from ST03N even though it is pulled for the same date.
Answering your second question: why does it differ?
It may depend on the data aggregation settings for the memory profile;
also try playing with the aggregation period. Actually, I also wasn't able to find an exact correspondence between them.
A lot of useful info about ST03 can be found here:
https://blogs.sap.com/2007/03/16/how-to-read-st03n-datasets-from-db/

Unable to load partitions in Athena with case sensitivity ON

I have data in S3 which is partitioned in YYYY/MM/DD/HH/ structure (not year=YYYY/month=MM/day=DD/hour=HH)
I set up a Glue crawler for this, which creates a table in Athena, but when I query the data in Athena it gives an error because one field has a duplicate name (URL and url, which the SerDe converts to lowercase, causing a name conflict).
To fix this, I manually created another table (using the table definition from SHOW CREATE TABLE), adding 'case.insensitive'= FALSE to the SERDEPROPERTIES:
WITH SERDEPROPERTIES ('paths'='deviceType,emailId,inactiveDuration,pageData,platform,timeStamp,totalTime,userId','case.insensitive'= FALSE)
I changed the S3 directory structure to the Hive-compatible naming year=/month=/day=/hour=, created the table with 'case.insensitive'= FALSE, and then ran the MSCK REPAIR TABLE command for the new table, which loads all the partitions.
(Complete CREATE TABLE QUERY)
But upon querying, I can only see one data column (platform) plus the partition columns; all of the other columns are not parsed. Yet I actually copied the Glue-generated CREATE TABLE query, adding the 'case.insensitive'= FALSE property.
How can I fix this?
I think you have multiple, separate issues: one with the crawler, one with the serde, and one with duplicate keys:
Glue Crawler
If Glue Crawlers delivered on what they promise, they would be a fairly good solution for most situations and would save us from writing the same code over and over again. Unfortunately, if you stray outside of the (undocumented) use cases Glue Crawler was designed for, you often end up with various issues, from the strange to the completely broken (see for example this question, this question, this question, this question, this question, or this question).
I recommend that you skip Glue Crawler and instead write the table DDL by hand (you have a good template in what the crawler created, it just isn't good enough). Then you write a Lambda function (or shell script) that you run on a schedule to add new partitions.
Since your partitioning is only on time, this is a fairly simple script: it just needs to run every once in a while and add the partition for the next period.
It looks like your data is from Kinesis Data Firehose which produces a partitioned structure at hour granularity. Unless you have lots of data coming every hour I recommend you create a table that is only partitioned on date, and run the Lambda function or script once per day to add the next day's partition.
A benefit from not using Glue Crawler is that you don't have to have a one-to-one correspondence between path components and partition keys. You can have a single partition key that is typed as date, and add partitions like this: ALTER TABLE foo ADD PARTITION (dt = '2020-05-13') LOCATION 's3://some-bucket/data/2020/05/13/'. This is convenient because it's much easier to do range queries on a full date than when the components are separate.
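As a small illustration (foo and dt are just the names from the example above), a range filter over such a key stays a one-liner and prunes partitions directly:
-- dt is the single DATE-typed partition key from the ALTER TABLE example above
SELECT *
FROM foo
WHERE dt BETWEEN DATE '2020-05-01' AND DATE '2020-05-13';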
If you really need hourly granularity you can either have two partition keys, one which is the date and one the hour, or just the one with the full timestamp, e.g. ALTER TABLE foo ADD PARTITION (ts = '2020-05-13 10:00:00') LOCATION 's3://some-bucket/data/2020/05/13/10/'. Then run the Lambda function or script every hour, adding the next hour's partition.
Having too granular a partitioning scheme doesn't help with performance, and can instead hurt it (although the performance hit comes mostly from the small files and the directories).
SerDe config
As for the reason why you're only seeing the value of the platform column, it's because it's the only case where the column name and property have the same casing.
It's a bit surprising that the DDL you link to doesn't work, but I can confirm that it really doesn't. I tried creating a table from that DDL, but without the pagedata column (I also skipped the partitioning, but that shouldn't make a difference for the test), and indeed only the platform column had any value when I queried the table.
However, when I removed the case.insensitive serde property it worked as expected, which got me thinking that it might not work the way you think it does. I tried setting it to TRUE instead of FALSE, which made the table work as expected again. I think we can conclude from this that the Athena documentation is just wrong when it says "By default, Athena requires that all keys in your JSON dataset use lowercase". In fact, what happens is that Athena lower cases the column names, but it also lower cases the property names when reading the JSON.
With further experimentation it turned out the path property was redundant too. This is a table that worked for me:
CREATE EXTERNAL TABLE `json_case_test` (
`devicetype` string,
`timestamp` string,
`totaltime` string,
`inactiveduration` int,
`emailid` string,
`userid` string,
`platform` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
I'd say that case.insensitive seems to cause more problems than it solves.
Duplicate keys
When I added the pagedata column (as struct<url:string>) and added "pageData":{"URL":"URL","url":"url"} to the data, I got the error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
And I got the error regardless of whether the pagedata column was involved in the query or not (e.g. SELECT userid FROM json_case_test also errored). I tried the case.insensitive serde property with both TRUE and FALSE, but it had no effect.
Next, I took a look at the source documentation for the serde, which first of all is worded much better, and secondly contains the key piece of information: that you also need to provide mappings for the columns when you turn off case insensitivity.
With the following serde properties I was able to get the duplicate key issue to go away:
WITH SERDEPROPERTIES (
"case.insensitive" = "false",
"mapping.pagedata" = "pageData",
"mapping.pagedata.url" = "pagedata.url",
"mapping.pagedata.url2"= "pagedata.URL"
)
You would have to provide mappings for all the columns except for platform, too.
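For completeness, a sketch of what the full property list might look like, built from the camel-cased names in the 'paths' value shown in the question; this is untested beyond the columns discussed above:
WITH SERDEPROPERTIES (
  "case.insensitive"         = "false",
  "mapping.devicetype"       = "deviceType",
  "mapping.timestamp"        = "timeStamp",
  "mapping.totaltime"        = "totalTime",
  "mapping.inactiveduration" = "inactiveDuration",
  "mapping.emailid"          = "emailId",
  "mapping.userid"           = "userId",
  "mapping.pagedata"         = "pageData",
  "mapping.pagedata.url"     = "pagedata.url",
  "mapping.pagedata.url2"    = "pagedata.URL"
)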
Alternative: use JSON functions
You mentioned in a comment to this answer that the schema of the pageData property is not constant. This is another case where Glue Crawlers unfortunately don't really work. If you're unlucky you'll end up with a flapping schema that includes some properties some days (see for example this question).
What I realised when I saw your comment is that there is another solution to your problem: set up the table manually (as described above) and use string as the type for the pagedata column. Then you can use functions like JSON_EXTRACT_SCALAR to extract the properties you want during query time.
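A minimal sketch of that approach, assuming pagedata is declared as a plain string column holding the raw JSON (the 'url' key is taken from the sample data above):
-- pagedata is typed as string, so the JSON is parsed at query time
SELECT userid,
       json_extract_scalar(pagedata, '$.url') AS page_url
FROM json_case_test;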
This solution trades increased complexity of the queries for way fewer headaches trying to keep up with an evolving schema.

How best to fill classes with hierarchical data from a relational database in VB.Net?

I have some relational data in a SQL Server 2008 database split across 3 tables, which I would like to use to populate some classes that represent them.
The hierarchy is: Products -> Variants -> Options.
I have considered passing back 3 result sets and using LINQ to check if there are any related/child records in the related tables. I've also considered passing back a single de-normalised table containing all of the data from the three tables and reading through the rows, manually figuring out where a product/variant/option begins and ends. Having little to no prior experience with LINQ, I opted for the latter, which sort of worked but required many lines of code for something I had hoped would be pretty straightforward.
Is there an easier way of accomplishing this?
The end goal is to serialize the resulting classes to JSON, for use in a Web Service Application.
I've searched and searched on Google for an answer, but I guess I'm not searching for the right keywords.
After a bit of playing around, I've figured out a way of accomplishing this...
Firstly, create a stored procedure in SQL Server that will return the data as XML. It's relatively easy to generate an XML document containing hierarchical data.
CREATE PROCEDURE usp_test
AS
BEGIN
    SELECT
        1 AS ProductID
        , 'Test' AS ProductDesc
        , (
            SELECT 1 AS VariantID
                 , 'Test' AS VariantDesc
            FOR XML PATH('ProductVariant'), ROOT('ProductVariants'), TYPE
        )
    FOR XML PATH('Product'), ROOT('ArrayOfProduct'), TYPE
END
This gives you an XML document with a parent-child relationship:
<ArrayOfProduct>
  <Product>
    <ProductID>1</ProductID>
    <ProductDesc>Test</ProductDesc>
    <ProductVariants>
      <ProductVariant>
        <VariantID>1</VariantID>
        <VariantDesc>Test</VariantDesc>
      </ProductVariant>
    </ProductVariants>
  </Product>
</ArrayOfProduct>
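Applied to the actual Products -> Variants -> Options hierarchy, the same pattern just nests one more correlated subquery per level. A rough sketch follows; the table and column names (Products, Variants, Options, ProductID, VariantID, OptionID, OptionDesc) are assumptions, so adjust them to your schema:
-- Hypothetical table/column names; only the nested FOR XML pattern is the point here
SELECT
    p.ProductID
    , p.ProductDesc
    , (
        SELECT
            v.VariantID
            , v.VariantDesc
            , (
                SELECT o.OptionID
                     , o.OptionDesc
                FROM Options o
                WHERE o.VariantID = v.VariantID
                FOR XML PATH('Option'), ROOT('Options'), TYPE
            )
        FROM Variants v
        WHERE v.ProductID = p.ProductID
        FOR XML PATH('ProductVariant'), ROOT('ProductVariants'), TYPE
    )
FROM Products p
FOR XML PATH('Product'), ROOT('ArrayOfProduct'), TYPE
The correlated WHERE clauses are what tie each child level to its parent row.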
Next, read the results into the VB.Net application using a SqlDataReader. Declare an empty object to hold the data and deserialize the XML into the object using an XmlSerializer.
At this point, the data that once was in SQL tables is now represented as classes in your VB.Net application.
From here, you can then serialize the object into JSON using JavaScriptSerializer.Serialize.

Question on how to use SQL Server Integration Services

I have a table called book; the attributes are booked_id, yearmon, and day_01...day_31. I need to unpivot the table and transform day_01...day_31 into rows, which I have succeeded in doing. The problem is that my yearmon has a format like 200805, and I need to append a day to it based on day_01, day_02, etc., so that I can create a new column with date information. For example, for day_01 the result should look like 20080501. Instead of writing a huge query, does anyone know how to do this transformation in SSIS?
You should be able to use the Unpivot component and the Derived Column component to do what you need. Look into those and post back if they don't seem to do what you need.
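If it helps to see the logic the Derived Column expression needs to implement, here is a rough T-SQL equivalent, shown only for reference; it assumes yearmon can be cast to a six-character string and that the day_01..day_31 columns share one type, which may not match your actual schema:
-- Rough equivalent of Unpivot + Derived Column, for reference only
SELECT
    booked_id,
    CAST(yearmon AS varchar(6)) + RIGHT(day_col, 2) AS date_key,  -- '200805' + '01' -> '20080501'
    day_value
FROM book
UNPIVOT (
    day_value FOR day_col IN (day_01, day_02, day_03 /* ..., day_31 */)
) AS u;
In SSIS, the Unpivot transformation gives you the pivot key value and the data value, and the Derived Column then performs the same concatenation.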