Compute minimum of a nested struct column while preserving the schema in Spark/Scala - dataframe

Suppose I have a dataframe df with a particular column c that is a struct with several nested fields inside it (these could be other structs, integers, strings, etc.). I do not know these fields beforehand, so I want a general solution.
I want to compute the minimum of this column. Currently I am doing this - val min_df = df.agg(min(c).as("min_col"))
This returns a dataframe min_df with one row and one column. Unfortunately, the schema of min_df ends up being a subset of the original schema of df(c), since some fields and values do not exist in the minimum value. I want it to be the same as the schema of df(c), since I want to compare this minimum value with some other quantities later on.
I already tried something like spark.createDataFrame(min_df.rdd, schema=df.select('c').schema), but this isn't working.
How can I go about computing the minimum/maximum so that the schema is preserved in this case?
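For reference, here is a minimal, self-contained sketch of the setup described above. The struct fields and example values are invented; the aggregation is the one from the question, and the two schema printouts show where they diverge.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min, struct}

val spark = SparkSession.builder.appName("struct-min-sketch").getOrCreate()
import spark.implicits._

// Toy frame with a nested struct column "c"; the field names here are invented.
val df = Seq((1, "a", 10), (2, "b", 5))
  .toDF("id", "name", "score")
  .select(struct(col("id"), col("name"), struct(col("score")).as("inner")).as("c"))

// The aggregation from the question.
val min_df = df.agg(min(col("c")).as("min_col"))

// Compare the two schemas to see where they differ.
println(df.select(col("c")).schema.treeString)
println(min_df.schema.treeString)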

Related

BigQuery - Inferring Datatypes of Column Values

What is the best way to determine the datatype of a column value if the data has already been loaded and has been classified as STRING datatype (i.e. the BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but I'm not sure whether I'm missing any or whether any of these is substantially more performant. The result should include statistics at the grain of each value, not just per column.
Using a combination of CASE and SAFE_CAST on the STRING value to sum up all the instances where it could successfully be CAST to datatype X (where X is any datatype, such as INT64 or DATETIME, with a few lines in the query repeating the SAFE_CAST to cover all potential datatypes); a sketch of this appears below.
Similar to above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql)
(For the above, one can also use COUNTIF(), IF statements, etc.)
Loading the data into a pandas dataframe and using something like pd.api.types.infer_dtype to infer types automatically, but this adds overhead and more components.
Thanks!
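To make the first (CASE + SAFE_CAST) approach concrete, a sketch along these lines could be used; the project, dataset, table, and column names below are invented, and the list of target types can be extended as needed:
SELECT
  CASE
    WHEN SAFE_CAST(col_a AS INT64)    IS NOT NULL THEN 'INT64'
    WHEN SAFE_CAST(col_a AS FLOAT64)  IS NOT NULL THEN 'FLOAT64'
    WHEN SAFE_CAST(col_a AS DATETIME) IS NOT NULL THEN 'DATETIME'
    ELSE 'STRING'
  END AS inferred_type,   -- per-value classification
  COUNT(*) AS n_values    -- how many values fall into each inferred type
FROM `my_project.my_dataset.raw_table`
GROUP BY inferred_type;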

What is the best possible way to represent tabular data in FlatBuffers

I have tabular data (header, data rows/columns). One can assume it to be in CSV format for representation purposes.
There is a header row with column labels, which may not always be in the same sequence, but for each label name the datatype is known upfront.
The number of columns is fixed, and each column has a different datatype.
The number of rows is variable.
I am new to FlatBuffers, so I want to know the best possible way to represent tabular data in FlatBuffers.
Something like this:
table Row {
  col1:int;     // These can each be their own data type.
  col2:string;
  ..
  // Fixed number of columns.
}
table Root {
  rows:[Row];   // Vector of rows, variable length.
}
root_type Root;
Note that in this case FlatBuffers' use of table is very different from a database table.

Hive UDF for the entire row as input

I am looking at ways to write a general data cleansing framework that cleans the entire row based on the position and the type configured for a given data set.
A sample input record from the data set is as follows:
100| John | Mary | 10Sep2013 | 10,23,4
Now the configuration would be based on the position (starting from index 1). For example, at position 2 trim the spaces, at position 4 convert to the hive standard date, at position 5 remove the commas. This is configured at the data set level.
Now if these have to be plugged into Hive or Pig, there should be a way for the Hive/Pig UDFs to accept the entire row as input. The UDF should parse the row based on the configurable field separator and apply the field/column-specific operations based on positions. This way it does not matter whether Pig or Hive or anything else is used for such row-based operations. I know it is a bit more involved to abstract the Hive/Pig-specific row types and provide a generic position-based getter.
It also may make sense to call the UDF for the entire row rather than for each column, to make things faster.
Is there a way for Hive/Pig UDFs to accept the entire line of text as the input?
The only way to take the entire row as input is to keep the whole text as one column. As far as treating the columns separately is concerned, you can use a UDTF that takes one column as input; the output of that UDTF will be multiple columns, which can then be used by Hive or Pig.
The other option is to keep the values in different columns and build a UDF that is smart enough to understand the format of the data and give different output accordingly. However, a UDF takes one column as input and its output will also be one column.
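As a sketch of the UDTF route (the jar path, class name, function name, and output columns below are all hypothetical), the whole line would be kept as a single STRING column and exploded into cleansed columns with LATERAL VIEW:
ADD JAR /path/to/cleansing-udtf.jar;                                -- hypothetical jar
CREATE TEMPORARY FUNCTION clean_row AS 'com.example.CleanRowUDTF';  -- hypothetical UDTF class

-- raw_lines has a single STRING column `line` holding the whole record.
SELECT t.id, t.first_name, t.last_name, t.event_date, t.amount
FROM raw_lines
LATERAL VIEW clean_row(line) t AS id, first_name, last_name, event_date, amount;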

Column that shows number of elements in another col (Int Array) SQL (postgres 8.3)

I have a column of int array. I want to add another column to the table that always shows the number of elements in that array for that row. It should update this value automatically. Is there a way to embed a function as a default value? If so, how would this function know where to pick its argument (the int array column / row number)?
In a normalized table you would not include this functionally dependent and redundant information as a separate column.
It is easy and fast enough to compute it on the fly:
SELECT array_dims('{1,2,3}'::int[]);
Or:
SELECT array_length('{1,2,3}'::int[], 1);
array_length() was introduced with PostgreSQL 8.4. Maybe an incentive to upgrade? 8.3 is going out of service soon.
With Postgres 8.3 you can use:
SELECT array_upper('{1,2,3}'::int[], 1);
But that's inferior, because the array index can start with any number if entered explicitly. array_upper() would not tell the actual length then; you would have to subtract array_lower() first (see the example after the quote below). Also note that in PostgreSQL arrays can always contain multiple dimensions, regardless of how many dimensions have been declared. I quote the manual here:
The current implementation does not enforce the declared number of dimensions either. Arrays of a particular element type are all considered to be of the same type, regardless of size or number of dimensions. So, declaring the array size or number of dimensions in CREATE TABLE is simply documentation; it does not affect run-time behavior.
(True for 8.3 and 9.1 alike.) That's why I mentioned array_dims() first, to give a complete picture.
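For example, with an array literal whose lower bound is explicitly set to 5, the length has to be derived from both bounds:
SELECT array_upper(arr, 1) - array_lower(arr, 1) + 1 AS len   -- yields 3
FROM  (SELECT '[5:7]={1,2,3}'::int[] AS arr) AS sub;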
Details about array functions in the manual.
You may want to create a view to include that functionally dependent column:
CREATE VIEW v_tbl AS
SELECT arr_col, array_length(arr_col, 1) AS arr_len
FROM tbl;
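Querying the view then works like querying a table that has the extra column:
SELECT arr_col, arr_len
FROM   v_tbl
WHERE  arr_len >= 2;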

MySQL command to search CSV (or similar array)

I'm trying to write an SQL query that would search within a CSV (or similar) array in a column. Here's an example:
insert into properties set
  bedrooms = '1,2,3',   -- or '1-3'
  title = 'nice property',
  price = 500;
I'd like to then search where bedrooms = 2+. Is this even possible?
The correct way to handle this in SQL is to add another table for a multi-valued property. It's against the relational model to store multiple discrete values in a single column. Since it's intended to be a no-no, there's little support for it in the SQL language.
The only workaround for finding a given value in a comma-separated list is to use regular expressions, which are in general ugly and slow. You have to deal with edge cases like when a value may or may not be at the start or end of the string, as well as next to a comma.
SELECT * FROM properties WHERE bedrooms RLIKE '[[:<:]]2[[:>:]]';
There are other types of queries that are easy when you have a normalized table, but hard with the comma-separated list. The example you give, of searching for a value that is equal to or greater than the search criteria, is one such case. Also consider:
How do I delete one element from a comma-separated list?
How do I ensure the list is in sorted order?
What is the average number of rooms?
How do I ensure the values in the list are even valid entries? E.g. what's to prevent me from entering "1,2,banana"?
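For reference, the normalized approach mentioned at the top of this answer might look something like this; the table and column names are illustrative, and it assumes properties has an id primary key:
-- One row per bedroom count per property.
CREATE TABLE property_bedrooms (
  property_id INT NOT NULL,
  bedrooms    INT NOT NULL,
  PRIMARY KEY (property_id, bedrooms),
  FOREIGN KEY (property_id) REFERENCES properties (id)
);

-- "2 or more bedrooms" then becomes an ordinary comparison plus a join:
SELECT DISTINCT p.*
FROM properties p
JOIN property_bedrooms pb ON pb.property_id = p.id
WHERE pb.bedrooms >= 2;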
If you don't want to create a second table, then come up with a way to represent your data with a single value.
More accurately, I should say I recommend that you represent your data with a single value per column, and Mike Atlas' solution accomplishes that.
Generally, this isn't how you should be storing data in a relational database.
Perhaps you should have MinBedroom and MaxBedroom columns. E.g.:
SELECT * FROM properties WHERE MinBedroom > 1 AND MaxBedroom < 3;