Long schema names in Hadoop Pig

I have a complex job.
After several steps, the schema now looks like this:
K: {C::group::sig: int,C::group::sn: chararray,long,DG::sn: chararray,DG::lat: float,DG::lng: float,DG::country: chararray,DG::region: int}
I can STORE and LOAD the relation and then re-assign a schema name for each field, such as (sig:int, sn:chararray, ...).
Is there any other way to do it in memory, without the STORE and LOAD?

At any time, you can rename a field when you GENERATE it.
DESCRIBE K;
K: {C::group::sig: int,C::group::sn: chararray,long,DG::sn: chararray,DG::lat: float,DG::lng: float,DG::country: chararray,DG::region: int}
K2 = FOREACH K GENERATE
    sig AS sig,
    C::group::sn AS sn,
    $2,
    DG::sn AS sn2,
    lat AS lat,
    lng AS lng,
    country AS country,
    region AS region;
DESCRIBE K2;
K2: {sig: int,sn: chararray,long,sn2: chararray,lat: float,lng: float,country: chararray,region: int}
Also note that if the name is unambiguous (e.g., sig), you do not need to use the full name when working with the field. If it is ambiguous (e.g., C::group::sn and DG::sn), you do need to use the full name.
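For example (a hypothetical illustration of the rule, reusing the K relation above):
ok  = FOREACH K GENERATE sig;          -- unambiguous, the short name works
bad = FOREACH K GENERATE sn;           -- error: matches both C::group::sn and DG::sn
ok2 = FOREACH K GENERATE C::group::sn; -- the qualified name resolves the ambiguity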

SQL Array aggregation in Haskell + Squeal

I'm using an SQL library called Squeal in Haskell.
What's the correct way to aggregate multiple text rows into one array in the Squeal library?
Say I have a very simple schema with just one table containing a column 'keyword' (of PG text type), plus associated types:
import Squeal.PostgreSQL
import qualified GHC.Generics as GHC
import qualified Generics.SOP as SOP
type Constraints = '["pk_keywords" ::: 'PrimaryKey '["id"]]
type Columns =
  '[ "id" ::: 'Def :=> 'NotNull 'PGint8
   , "keyword" ::: 'NoDef :=> 'NotNull 'PGtext ]
type Table = 'Table (Constraints :=> Columns)
type Schema = '["keywords" ::: Table]
type Schemas = '["public" ::: Schema]
newtype Keywords = Keywords {unKeywords :: [Text]} deriving (GHC.Generic)
instance SOP.Generic Keywords
instance SOP.HasDatatypeInfo Keywords
type instance PG Keywords = 'PGvararray ( 'NotNull 'PGtext)
This is the part I need help with:
I'm trying an aggregation query like this:
keywords :: Query_ Schemas () Keywords
keywords =
  select_ (arrayAgg (All #keyword) `as` #fromOnly) (from (table #keywords))
However, I keep getting an error:
* Couldn't match type 'NotNull (PG [Text])
with 'Null ('PGvararray ty0)
arising from a use of `as'
From what I understand, arrayAgg can produce NULL, so I need to provide a default empty array somehow, using fromNull from here:
https://hackage.haskell.org/package/squeal-postgresql-0.5.1.0/docs/Squeal-PostgreSQL-Expression-Null.html#v:fromNull
But I don't quite know how to provide that.
What about the value type mismatch (PG [Text] vs 'PGvararray ty0)? How to solve that?
For the record, the library's author provided a solution as follows:
keywords :: Query_ Schemas () (Only (VarArray [Text]))
keywords = select_
  (fromNull (array [] & inferredtype) (arrayAgg (All #keyword)) `as` #fromOnly)
  (from (table #keywords) & groupBy Nil)
The key factors here are:
Provide a default empty array with fromNull (array [] & inferredtype) .... This way we can avoid using Maybe in the return type
Provide grouping with groupBy Nil
Choose either Distinct or All rows in arrayAgg (a Distinct variant is sketched below)
Finally, the return type should be VarArray x
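For instance, the Distinct variant differs only in the aggregate's argument; this is a sketch assembled from the same pieces as the author's solution, not taken from the thread:
keywordsDistinct :: Query_ Schemas () (Only (VarArray [Text]))
keywordsDistinct = select_
  (fromNull (array [] & inferredtype) (arrayAgg (Distinct #keyword)) `as` #fromOnly)
  (from (table #keywords) & groupBy Nil)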

Force FsCheck to generate NonEmptyString for discriminating union fields of type string

I'm trying to achieve the following behaviour with FsCheck: I'd like to create a generator that will generate an instance of the MyUnion type, with every string field being non-null/non-empty.
type MyNestedUnion =
| X of string
| Y of int * string
type MyUnion =
| A of int * int * string * string
| B of MyNestedUnion
My 'real' type is much larger/deeper than MyUnion, and FsCheck is able to generate an instance without any problem, but the string fields of the union cases are sometimes empty. (For example it might generate B (Y (123, "")))
Perhaps there's some obvious way of combining FsCheck's NonEmptyString and its support for generating arbitrary union types that I'm missing?
Any tips/pointers in the right direction greatly appreciated.
Thanks!
This goes against the grain of property-based testing (in that you explicitly prevent valid test cases from being generated), but you could wire up the non-empty string generator to be used for all strings:
type Alt =
    static member NonEmptyString () : Arbitrary<string> =
        Arb.Default.NonEmptyString()
        |> Arb.convert
            (fun (nes : NonEmptyString) -> nes.Get)
            NonEmptyString.NonEmptyString

Arb.register<Alt>()
let g = Arb.generate<MyUnion>
Gen.sample 1 10 g
Note that you'd need to re-register the default generator after the test since the mappings are global.
A more by-the-book solution would be to use the default derived generator and then filter out values that contain invalid strings (i.e. use ==>, as in the sketch below), but you might find that infeasible for particularly deeply nested types.
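A minimal sketch of that filtering approach; allStringsNonEmpty is a hand-written helper, not an FsCheck primitive, and the right-hand side of ==> is a trivial stand-in for the real assertion:
open FsCheck
open System

// True only when every string field of the union is non-null/non-empty.
let allStringsNonEmpty =
    function
    | A (_, _, s1, s2) -> not (String.IsNullOrEmpty s1) && not (String.IsNullOrEmpty s2)
    | B (X s) -> not (String.IsNullOrEmpty s)
    | B (Y (_, s)) -> not (String.IsNullOrEmpty s)

// Generated values that fail the check are discarded rather than tested.
let propNonEmpty (u : MyUnion) =
    allStringsNonEmpty u ==> lazy ((sprintf "%A" u).Length > 0)

Check.Quick propNonEmpty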

Pig: Cast error while grouping data

This is the code that I am trying to run. Steps:
Take an input (there is a .pig_schema file in the input folder)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields
The code is as follows:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
When I run the grouped step, it gives the following error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
keywords is a chararray and Pig should be able to group on a chararray. Any ideas?
EDIT:
Input file:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema file
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
Pig is not able to identify the value of keywords as a chararray. It is better to name the fields during the initial load; that way we explicitly state the field types.
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
UPDATE:
I tried the snippet below with the updated .pig_schema (which includes score), used '\t' as the separator, and ran the following steps on the shared input.
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
I would also suggest using unique alias names for better readability and maintainability (see the sketch after the output below).
Output:
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})
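Here is a sketch of the full script with explicit field types and unique aliases (the alias names are illustrative):
raw = LOAD '$input' USING PigStorage('\t')
      AS (id:chararray, keywords:chararray, score:chararray, time:long);
limited = LIMIT raw 25;
pairs = FOREACH limited GENERATE keywords, id;
distinctPairs = DISTINCT pairs;
grouped = GROUP distinctPairs BY keywords;
DUMP grouped;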

How do I specify a field name for a wrapped tuple in Pig?

I have tuples with schema (a:int, b:int, c:int) stored in alias first. I want to convert each tuple so that a new relation second has a schema like this:
(d: (a:int, b:int, c:int))
Basically, I've wrapped my initial tuple in another tuple and named the field. This is in preparation for a cross operation where I want to cross two relations but keep each one in a named field.
Here is what I would expect it to look like, except there's an error:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple);
This errors out too:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple (a:int, b:int, c:int));
Thanks!
Uri
What about:
second = FOREACH first GENERATE TOTUPLE(*) AS d;
describe second;
second: {d: (a: int,b: int,c: int)}
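As a follow-up for the CROSS use case mentioned in the question, a sketch (the relation other and its fields are assumed, and the DESCRIBE output is the expected shape, not verified):
second = FOREACH first GENERATE TOTUPLE(*) AS d;
third = FOREACH other GENERATE TOTUPLE(*) AS e;
crossed = CROSS second, third;
DESCRIBE crossed;
-- expected: crossed: {second::d: (a: int,b: int,c: int),third::e: (...)}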

Partial SQL insert in haskelldb

I just started a new project and wanted to use HaskellDB from the beginning. I created a table with two columns:
create table sensor (
service text,
name text
);
..found out how to set up the basic HaskellDB machinery (ohhh.. the documentation) and wanted to do an insert. However, I wanted a partial insert (there are supposed to be more columns), something like:
insert into sensor (service) values ('myservice');
Translated into HaskellDB:
transaction db $ insert db SE.sensor (SE.service <<- (Just $ senService sensor))
But... that simply doesn't work. Specifying the column names in a different order doesn't work either, which is not exactly convenient. Is there a way to do a partial insert in HaskellDB?
The errors I get are as follows. When I insert a different column (the 'name') as the first one:
Couldn't match expected type `SEI.Service'
against inferred type `SEI.Name'
Expected type: SEI.Intsensor
Inferred type: Database.HaskellDB.HDBRec.RecCons
SEI.Name (Expr String) er
When using functional dependencies to combine
Database.HaskellDB.Query.InsertRec
(Database.HaskellDB.HDBRec.RecCons f (e a) r)
(Database.HaskellDB.HDBRec.RecCons f (Expr a) er),
etc..
And when I do the 'service' as the first - and only - field, I get:
Couldn't match expected type `Database.HaskellDB.HDBRec.RecCons
SEI.Name
(Expr String)
(Database.HaskellDB.HDBRec.RecCons
SEI.Time
(Expr Int)
(Database.HaskellDB.HDBRec.RecCons
SEI.Intval (Expr Int) Database.HaskellDB.HDBRec.RecNil))'
against inferred type `Database.HaskellDB.HDBRec.RecNil'
(I have a couple of other columns in the table)
This looks really like 'by design', unfortunately :(
You're right, that does look intentional. The HaskellDB.Query docs show that insert has a type of:
insert :: (ToPrimExprs r, ShowRecRow r, InsertRec r er) => Database -> Table er -> Record r -> IO ()
In particular, the relation InsertRec r er must hold. That's defined elsewhere by this recursive type-level program:
InsertRec RecNil RecNil
(InsertExpr e, InsertRec r er) => InsertRec (RecCons f (e a) r) (RecCons f (Expr a) er)
The first line is the base case. The second line is an inductive case. It really does want to walk every element of er, the table. There's no short-circuit, and no support for re-ordering. But in my own tests, I have seen this work, using _default:
insQ db = insert db test_tbl1 (c1 <<- (Just 5) # c2 << _default)
So if you want a partial insert, you can always say:
insC1 db x = insert db test_tbl1 (c1 <<- (Just x) # c2 << _default)
insC2 db x = insert db test_tbl1 (c1 << _default # c2 <<- (Just x))
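Applied to the sensor table from the question, the same pattern would look something like this (a sketch: the remaining field names are guessed from the error message, and every column must appear, in declaration order):
insSensor db s =
  insert db SE.sensor (  SE.service <<- Just (senService s)
                       # SE.name    << _default
                       # SE.time    << _default
                       # SE.intval  << _default)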
I realize this isn't everything you're looking for. It looks like InsertRec can be re-written in the style of HList, to permit more generalization. That would be an excellent contribution.