How to filter avro records using a list from another file in pig? - apache-pig

I have a file "fileA" that is an avro with the following records:
{itemid:"Carrot"}
{itemid:"Lettuce"}
...
I have another file "fileB" that is an avro with multiple records following the same schema:
{item: "Carrot", cost: $2, ...other fields..}
{item: "Lettuce", cost: $2, ...other fields..}
{item: "Rice", cost: $2, ...other fields..}
...
How can I use pig to filter the data so that I can store all the matching records from "fileB" in a new output file?
I tried performing the following:
A = LOAD 'fileA' USING AvroStorage();
B = LOAD 'fileB' USING AvroStorage();
C = JOIN A BY itemid, B BY item;
STORE C INTO 'outputpath' USING AvroStorage();
I am getting the error "Pig Schema contains a name that is not allowed in Avro."
I want to avoid having to specify the complete schema of "B" inside AvroStorage(), or any fields of A; I only want to use A to filter down the records of B, without adding to or changing B's output schema. Is there a way to do this?
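One sketch of a workaround (hedged, not a verified fix): the JOIN prefixes the joined field names with A:: and B::, and :: is not a legal character in Avro names, which is what the error is about. A COGROUP-based semi-join keeps only B's tuples, so A's fields never enter the output:
A = LOAD 'fileA' USING AvroStorage();
B = LOAD 'fileB' USING AvroStorage();
-- semi-join: keep a B tuple only when its item appears in A
G = COGROUP B BY item, A BY itemid;
M = FILTER G BY NOT IsEmpty(A);
C = FOREACH M GENERATE FLATTEN(B);
STORE C INTO 'outputpath' USING AvroStorage();
If the flattened names still come out prefixed (B::item and so on) and AvroStorage rejects them, a final FOREACH ... GENERATE with AS renames is needed, since Avro simply cannot represent names containing ::.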

Related

How to export BigQuery to Bigtable using Airflow? Schema issue

I'm using Airflow to extract BigQuery rows to Google Cloud Storage in Avro format.
with models.DAG(
    "bigquery_to_bigtable",
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime.now(),
    catchup=False,
    tags=["test"],
) as dag:
    data_to_gcs = BigQueryInsertJobOperator(
        task_id="data_to_gcs",
        project_id=project_id,
        location=location,
        configuration={
            "extract": {
                "destinationUri": gcs_uri,
                "destinationFormat": "AVRO",
                "sourceTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        },
    )
    gcs_to_bt = DataflowTemplatedJobStartOperator(
        task_id="gcs_to_bt",
        template="gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable",
        location=location,
        parameters={
            "bigtableProjectId": project_id,
            "bigtableInstanceId": bt_instance_id,
            "bigtableTableId": bt_table_id,
            "inputFilePattern": "gs://export/test.avro-*",
        },
    )
    data_to_gcs >> gcs_to_bt
The BigQuery rows contain:
row_key      | 1_cnt | 2_cnt | 3_cnt
1#2021-08-03 | 1     | 2     | 2
2#2021-08-02 | 5     | 1     | 5
...
I'd like to use the row_key column as the row key in Bigtable, and the rest of the columns as columns in a specific column family, like my_cf, in Bigtable.
However, I get error messages when the Dataflow job loads the Avro files into Bigtable:
"java.io.IOException: Failed to start reading from source: gs://export/test.avro-"
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
The docs I read say:
The Bigtable table must exist and have the same column families as
exported in the Avro files.
How do I export BigQuery to Avro with the same column families?
I think you have to transform the Avro into the proper schema. The documentation you mention also says:
Bigtable expects a specific schema from the input Avro files.
There is a link there referring to the special data schema which has to be used.
If I understand correctly, you are just exporting the table as-is; the result, although valid Avro, will not match the required schema, so you need to transform the data into the proper schema appropriate to your Bigtable table.
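For reference, the template's reader deserializes records of type com.google.cloud.teleport.bigtable.BigtableRow, which is what the AvroTypeException is complaining about. Reconstructed from the Teleport template's sources, the expected Avro schema looks roughly like this; treat the exact field list as an assumption and verify it against the template repository:
{
  "type": "record",
  "name": "BigtableRow",
  "namespace": "com.google.cloud.teleport.bigtable",
  "fields": [
    {"name": "key", "type": "bytes"},
    {"name": "cells", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "BigtableCell",
        "fields": [
          {"name": "family", "type": "string"},
          {"name": "qualifier", "type": "bytes"},
          {"name": "timestamp", "type": "long"},
          {"name": "value", "type": "bytes"}
        ]
      }
    }}
  ]
}
So a flat BigQuery extract (one record per row with columns as fields) cannot be loaded directly: each row has to be rewritten into a key plus an array of cells (family, e.g. my_cf; qualifier; timestamp; value) with a small Beam pipeline or an Avro-rewriting step before the template can import it.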

How do I setup model for table object with arrays of responses in sequelize

I am having a challenge setting up a model for a table with arrays of responses in Sequelize ORM. I use a Postgres DB. I have a table, say Foo. Foo has columns:
A
B
C
C1_SN
C1_Name
C1_Address
C1_Phone
D
E
The column C has a boolean question; if the user selects true, they will need to provide an array of responses for C1, such that we now have:
C1_SN1
C1_Name1
C1_Address1
C1_Phone1
------
C1_SN2
C1_Name2
C1_Address2
C1_Phone2
-----
C1_SN3
C1_Name3
C1_Address3
C1_Phone3
I expect multiple teams to be filling this table. How do I set up the model in Sequelize? I have two options in mind.
Option 1
The first option I think of is to create an extra 1:1 table between Foo and C1. But going with this option, I don't know how to bulkCreate the array of C1 responses into the C1 table.
Option 2
I think it's also possible to make the C1 column in the Foo table hold a nested array of values, such that if userA submits his data, it will have the nested array of C1. But I don't know how to go about this method either.
You need to create a separate table for C1. If the user selects true, build an array of objects and pass it to bulkCreate, like this:
C1_SN (auto-increment)
C1_Name
C1_Address
C1_Phone
const values = [
  { C1_Name: "HELLo", C1_Address: "HELLo", C1_Phone: "987456321" },
  { C1_Name: "HELLo1", C1_Address: "HELLo", C1_Phone: "987456321s" },
];
// bulkCreate on the model for the separate C1 table, not on Foo
C1.bulkCreate(values).then(result => {
  console.log(result);
}).catch(error => {
  console.log(error);
});
From the official docs, you can check this link:
Sequelize bulkCreate
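A minimal sketch of the model setup this implies, going with Option 1 (a separate C1 table in a 1:N association; the model names and the fooId foreign key are illustrative assumptions, not from the original):
const { Sequelize, DataTypes } = require("sequelize");
const sequelize = new Sequelize("postgres://user:pass@localhost:5432/db"); // placeholder connection string

const Foo = sequelize.define("Foo", {
  A: DataTypes.STRING,
  B: DataTypes.STRING,
  C: DataTypes.BOOLEAN, // the boolean question
  D: DataTypes.STRING,
  E: DataTypes.STRING,
});

const C1 = sequelize.define("C1", {
  C1_SN: { type: DataTypes.INTEGER, autoIncrement: true, primaryKey: true },
  C1_Name: DataTypes.STRING,
  C1_Address: DataTypes.STRING,
  C1_Phone: DataTypes.STRING,
});

// one Foo row owns many C1 responses (1:N rather than 1:1)
Foo.hasMany(C1, { foreignKey: "fooId" });
C1.belongsTo(Foo, { foreignKey: "fooId" });
With that in place, set fooId on each object in the array and the whole set of responses goes in with one C1.bulkCreate(values) call.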

How to s3-select all data within inner array of parquet file

I have parquet files on S3 which need to be queried using S3 Select. The parquet files are generated from JSON files with inner arrays. The S3 Select query can get the first array, but if I try to query the records in the inner array it fails to return the ids, saying it's an invalid data source.
What I tried:
Looking up the documentation on Amazon proved to be of no use
Multiple formats of the s3 select query
JSON Structure
{
  "Array": [
    { "Id": "1" },
    { "Id": "2" }
  ]
}
Query
select s.Array[*].id from s3object s
I expect to get all the ids back from the query, so it should return Id 1 and 2.
select s.Id from S3Object[*].Array[*] s limit 5 will return all the Ids in the Array.
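For completeness, a sketch of running that query with boto3 (bucket and key names are placeholders):
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",   # placeholder bucket
    Key="data.parquet",   # placeholder key
    ExpressionType="SQL",
    Expression="select s.Id from S3Object[*].Array[*] s limit 5",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)
# the response payload is an event stream; Records events carry the result rows
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())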

FILTER ON column from another relation in PIG

Suppose I have the following data in PIG.
DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)
DESCRIBE raw;
raw: {process_date: chararray,id: int}
A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id: int)}}
DUMP A;
(1,{(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
(2,{(2015-09-15T22:12:00.000-07:00,2)})
(3,{(2015-09-15T23:11:00.000-07:00,3)})
(4,{(2015-09-16T21:02:00.000-07:00,4)})
(5,{(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
B = FOREACH A {generate raw,MAX(raw.process_date) AS max_date;}
DUMP B;
({(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)},2015-09-17T19:02:00.000-07:00)
({(2015-09-15T22:12:00.000-07:00,2)},2015-09-15T22:12:00.000-07:00)
({(2015-09-15T23:11:00.000-07:00,3)},2015-09-15T23:11:00.000-07:00)
({(2015-09-16T21:02:00.000-07:00,4)},2015-09-16T21:02:00.000-07:00)
({(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)},2015-09-17T09:02:00.000-07:00)
DESCRIBE B;
B: {raw: {(process_date: chararray,id: int)},max_date: chararray}
Now, I need to filter raw based on process_date eq max_date. I have tried the following:
C = FOREACH B {filtered = FILTER raw BY REGEX_EXTRACT(process_date,'(\\d{4}-\\d{2}-\\d{2})',1) eq REGEX_EXTRACT(max_date,'(\\d{4}-\\d{2}-\\d{2})',1);}
but it's not working.
Is there any way to do such filtering? Basically, I need to filter raw down to the latest date.
The exception which I get is:
Invalid field projection. Projected field [max_date] does not exist in schema: process_date:chararray,id:int
Expected output: Latest data corresponding to latest date (not time) for each id
({(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
({(2015-09-15T22:12:00.000-07:00,2)})
({(2015-09-15T23:11:00.000-07:00,3)})
({(2015-09-16T21:02:00.000-07:00,4)})
({(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
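The nested FILTER fails because, inside the FOREACH, field references resolve against raw's schema, which does not contain max_date. A sketch of one workaround (untested against this data): FLATTEN the bag next to max_date so the comparison becomes an ordinary top-level FILTER, then regroup:
A = GROUP raw BY id;
-- keep every row, carrying its group's max timestamp alongside it
B = FOREACH A GENERATE FLATTEN(raw), MAX(raw.process_date) AS max_date;
-- compare only the date part (the first 10 characters of the ISO-8601 timestamp)
C = FILTER B BY SUBSTRING(raw::process_date, 0, 10) == SUBSTRING(max_date, 0, 10);
-- regroup to get one bag of latest-date rows per id
D = GROUP C BY raw::id;
SUBSTRING plays the same role as the REGEX_EXTRACT in the attempt above; both just strip the time portion.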

Pig - removing duplicate tuples from bag

I have the following loaded in a relation with this schema {group: (int,int),A: {(n1: int,n2: int)}}:
((1,1),{(0,1)})
((2,2),{(0,2)})
((3,3),{(3,0)})
((4,2),{(1,3)})
((5,1),{(2,3)})
((5,3),{(1,4)})
((7,3),{(2,5)})
((9,1),{(4,5)})
((10,2),{(4,6)})
((10,4),{(7,3)})
((11,1),{(5,6)})
((11,3),{(4,7)})
((12,4),{(4,8)})
((13,1),{(6,7)})
((19,1),{(10,9),(9,10)})
((,),{(,),(,),(,)})
I would like to extract just the first tuple from each bag, i.e.:
((19,1),{(10,9),(9,10)}) --> (10,9)
Any help is appreciated.
Can you try it like this?
C = FOREACH B {
    top1 = LIMIT A 1;
    GENERATE FLATTEN(top1);
};
Here B is the name of your grouped relation. Note that bags are unordered, so if you need a specific tuple rather than an arbitrary one, ORDER the bag inside the FOREACH before applying the LIMIT.