ParquetLoader: can't load multiple Parquet files using Pig - apache-pig

I'm getting the following error:
Error during parsing. repetition constraint is more restrictive: can not merge type required binary MyTime into optional binary MyTime.
Maybe one of the files is corrupted but I don't know how to skip it.
Thanks

This happens when reading multiple parquet files that have slightly different metadata in their schemas. Either you have a mixed collection of files in a single directory or you are giving the LOAD statement a glob and the resulting collection of files is mixed in this respect.
Rather than specifying the schema in an AS() clause or making a bare call to the loader function, the solution is to override the schema in the loader function's argument, like this:
data = LOAD 'data'
USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:double, n4:long');
Otherwise the loader function infers the schema from the first file it encounters, which then conflicts with one of the others.
If you still have trouble, try using type bytearray in the schema specification and then cast to the desired types in a subsequent FOREACH.
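A minimal sketch of that bytearray-then-cast approach (the field names other than MyTime are placeholders, not taken from the original question):
data = LOAD 'data'
USING parquet.pig.ParquetLoader('MyTime:bytearray, n2:bytearray');
-- Cast to the desired types once every file has been read as raw bytes.
typed = FOREACH data GENERATE (chararray) MyTime AS MyTime, (double) n2 AS n2;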
According to the Parquet source code, there is another argument to the loader function that allows columns to be specified by position rather than name (the default), but I have not experimented with that.

Related

What does this error mean: Required column value for column index: 8 is missing in row starting at position: 0

I'm attempting to upload a CSV file (which is an output from a BCP command) to BigQuery using the gcloud CLI BQ Load command. I have already uploaded a custom schema file. (was having major issues with Autodetect).
One resource suggested this could be a datatype mismatch. However, the table from the SQL DB lists the column as a decimal, so in my schema file I have listed it as FLOAT since decimal is not a supported data type.
I couldn't find any documentation for what the error means and what I can do to resolve it.
What does this error mean? It means, in this context, that a value is REQUIRED for a given column index and one was not found. (By the way, columns are usually 0-indexed, meaning a fault at column index 8 is most likely referring to column number 9.)
This can be caused by a myriad of different issues, of which I experienced two:
Incorrectly categorizing NULL columns as NOT NULL. After exporting the schema, in JSON, from SSMS, I needed to clean it up for BQ, and in doing so I assigned IS_NULLABLE:NO to MODE:NULLABLE and IS_NULLABLE:YES to MODE:REQUIRED. These values should've been reversed. This caused the error because there were NULL columns where BQ expected a REQUIRED value.
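For reference, a corrected BigQuery JSON schema entry for such columns might look like the snippet below, where IS_NULLABLE:YES maps to MODE:NULLABLE and IS_NULLABLE:NO maps to MODE:REQUIRED (the column names and the choice of FLOAT are only illustrative):
[
  {"name": "order_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "order_total", "type": "FLOAT", "mode": "NULLABLE"}
]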
Using the wrong delimiter. The file I was outputting was not only comma-delimited but also tab-delimited. I was only able to validate this by using the Get Data tool in Excel and importing the data that way, after which I could see the stray tabs inside the cells.
After switching to a pipe (|) delimiter, I was finally able to load the file into BigQuery without any errors.
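A sketch of the corresponding load command, assuming a pipe-delimited export and a hand-written schema file named schema.json (the dataset, table, and bucket names are placeholders):
# Load a pipe-delimited export with an explicit schema instead of autodetect.
bq load --source_format=CSV --field_delimiter="|" --schema=./schema.json \
  mydataset.mytable gs://my-bucket/export.csv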

How can I load data into Snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this needs a manual step to cast the data into the correct types in order to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to be told, through FILE_FORMAT, the structure of the file it is going to load data from.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
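For illustration, a minimal COPY sketch with an explicit target schema and file format; the stage name, table name, and column types below are assumptions rather than details from the question:
-- Target table with explicit column types instead of a single VARIANT column.
CREATE TABLE IF NOT EXISTS my_table (id NUMBER, event_time TIMESTAMP_NTZ, amount FLOAT);
-- COPY needs the file structure spelled out via FILE_FORMAT.
COPY INTO my_table
FROM @my_s3_stage/path/
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);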

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with a schema that is auto-generated from the file in GCS. I'm using Apache Airflow to do this. The problem I'm having is that when I use schema auto-detection, BigQuery builds the schema from only the first ~100 values.
For example, in my case there is a column, say X; the values in X are mostly of integer type, but some values are strings, so bq load fails with a schema mismatch. In such a scenario the data type needs to be changed to STRING.
What I could do is manually create a new table by generating the schema on my own, or I could set max_bad_records to something like 50, but neither seems like a good solution. An ideal solution would look like this:
Try to load the file from GCS to BigQuery, if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds.
So as a workaround I suggest:
* Use --max_rows_per_request = 1 in your script.
* Use the 1 line that is best suited to your case, with the optimized field types.
This will create the table with the correct schema and 1 line, and from there you can load the rest of the data.
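An alternative sketch along the same lines, if you would rather create the table explicitly first: declare the schema yourself with X as STRING, create the table, then load the full file into it. The dataset, table, bucket, and file names below are placeholders:
# Create the table up front from a hand-written schema file where X is declared as STRING.
bq mk --table mydataset.mytable ./schema_with_x_as_string.json
# Load the whole file; the pre-created table's schema is used instead of autodetect.
bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://my-bucket/data.csv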

Pig variable storage

Pig uses variables to store the data.
When I load data from HDFS into a variable in Pig, where is the data temporarily stored?
What exactly happens in the background when we load the data into the variable?
Kindly help
Pig evaluates most expressions lazily. In most cases it only checks for syntax errors and the like. For example,
a = load 'hdfs://I/Dont/Exist';
won't throw an error unless you use STORE or DUMP or something along those lines, which results in the evaluation of a.
Similarly, if a file exists and you load it into a relation and perform transformations on it, the file is usually spooled to the /tmp folder and then the transformations are performed. If you look at the messages that appear when you run commands in grunt, you'll notice file paths starting with file:///tmp/xxxxxx_201706171047235. These are the files that store intermediate data.
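To see the lazy evaluation in action, here is a small illustrative grunt session; the path and field layout are made up for the example:
-- Nothing is read yet: Pig only records the plan for these statements.
a = LOAD 'hdfs://namenode/data/events' USING PigStorage(',') AS (id:int, msg:chararray);
b = FILTER a BY id IS NOT NULL;
-- Only DUMP (or STORE) triggers execution, including the actual read from HDFS.
DUMP b;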

How do I create a BigQuery view that uses a user-defined function?

I'd like to create a BigQuery view that uses a query which invokes a user-defined function. How do I tell BigQuery where to find the code files for the UDF?
Views can reference UDF resources stored in Google Cloud Storage, inline code blobs, or local files (contents will be loaded into inline code blobs).
To create a view with a UDF using the BigQuery UI, just fill out the UDF resources as you would when running the query normally, and save as a view. (In other words, no special actions are required).
To specify these during view creation from the command-line client, use the --view_udf_resource flag:
bq mk --view="SELECT foo FROM myUdf(table)" \
--view_udf_resource="gs://my-bucket/my-code.js"
In the above example, gs://my-bucket/my-code.js would contain the definition for myUdf(). You can provide multiple --view_udf_resource flags if you need to reference multiple code files in your view query.
You may specify gs:// URIs or local files. If you specify a local file, then the code will be read once and packed into an inline code resource.
Via the API, this is a repeated field named userDefinedFunctionResources. It is a sibling of the query field that contains the view SQL.
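As a rough sketch of the request body (only the query and userDefinedFunctionResources field names are taken from the answer above; the exact shape, including resourceUri, is as I recall the API and should be checked against the reference docs):
{
  "view": {
    "query": "SELECT foo FROM myUdf(table)",
    "userDefinedFunctionResources": [
      {"resourceUri": "gs://my-bucket/my-code.js"}
    ]
  }
}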