HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements - amazon-s3

I'm working on a client platform: a data lake linked to AWS S3 and AWS Athena.
I've uploaded a dataset to an S3 bucket using AWS Glue.
The job ran successfully and a table was created in Athena.
When I try to "Preview" the content of the table, I get the following error:
HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements
P.S.: The file I'm uploading contains 34 columns.

Thanks to @ThomasRones' comment:
This indicates that there are more columns in the header row than there are columns in the data rows (more precisely, than the number of types determined from the data rows).
The probable cause of this error is that one or more names in the header row contain one or more commas: 43 names against 34 types means roughly nine stray delimiters in the header. A quick way to check is shown below.
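A minimal sketch for verifying this on a local copy of the file, assuming a comma-delimited CSV (the file name and expected column count are placeholders):

import csv

EXPECTED_COLUMNS = 34  # what the data rows actually contain

# Placeholder path; point this at a local copy of the uploaded file.
with open("dataset.csv", encoding="utf-8") as f:
    raw_header = f.readline().rstrip("\n")

# Split on every comma, ignoring quotes; this is how a stray comma inside a
# column name turns 34 real columns into 43 apparent column names.
naive_fields = raw_header.split(",")
print(len(naive_fields), "names when splitting the header on every comma")

# csv.reader honours quoting, so a quoted name with an embedded comma stays one field.
quoted_fields = next(csv.reader([raw_header]))
print(len(quoted_fields), "names when quotes are respected")

if len(naive_fields) != EXPECTED_COLUMNS:
    print("Header contains stray delimiters; rename or re-quote those columns.")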

Related

Write 2 months of historical data from S3 to AWS Timestream

I want to write two months of old data from an S3 bucket, and each day has 4 to 5 Parquet files. I have converted the Parquet files to a data frame, but the number of rows coming from one day of data is around 3.5M.
I have created 100 batches to send the records, but the overall execution time is too long. Please help me.
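For reference, a minimal sketch of batched Timestream writes with boto3, assuming hypothetical database/table names and a list of already-prepared record dicts; WriteRecords accepts at most 100 records per call, and running several batches in parallel is one way to cut the wall-clock time:

import boto3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names; replace with your own database, table and region.
DATABASE = "my_database"
TABLE = "my_table"

client = boto3.client("timestream-write", region_name="us-east-1")

def write_batch(batch):
    # Each record follows the Timestream API shape:
    # {"Dimensions": [...], "MeasureName": ..., "MeasureValue": ...,
    #  "MeasureValueType": ..., "Time": ...}
    # WriteRecords rejects requests with more than 100 records.
    client.write_records(DatabaseName=DATABASE, TableName=TABLE, Records=batch)

def chunks(records, size=100):
    for i in range(0, len(records), size):
        yield records[i:i + size]

def write_all(records):
    # A handful of parallel writers usually shortens the total run time;
    # tune max_workers against the table's ingest throttling.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(write_batch, chunks(records)))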

AWS QuickSight monthly summary in percentage from different CSV files

1. I have a Lambda function that runs monthly; it runs an Athena query and exports the results as a CSV file to my S3 bucket.
2. I have a QuickSight dashboard that uses this CSV file as a dataset and visualizes all the rows from the report.
Everything is good and working up to here.
3. Every month I get a new CSV file in my S3 bucket, and I want to add a "Visual Type" to my main dashboard that shows the difference, in %, from the previous CSV file (previous month).
For example:
My dashboard is focused on the collection of missing updates.
In May I see I have 50 missing updates.
In June I get a CSV file with 25 missing updates.
Now I want the dashboard to reflect, with a "Visual Type", that this month we reduced the number of missing updates by 50%.
And in July I get a file with 20 missing updates, so I want to see that we reduced by 60% compared to May.
Any idea how I can do it?
I'm not sure I fully understand your setup, but I'll assume that you have an S3 manifest that points to an S3 directory rather than a different manifest (and dataset) per file.
If that's your case, you could try to tackle that comparison by creating a calculated field that uses the periodOverPeriodPercentDifference function.
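As a rough illustration (the field names here are placeholders, and this assumes the dataset has a real date field for the report month rather than just a file name): a calculated field like periodOverPeriodPercentDifference(sum({missing_updates}), {report_date}, MONTH, 1) returns each month's percentage change versus the previous month. Note that it always compares against the immediately preceding period, so showing July against a fixed May baseline would need a different calculation (for example, dividing by a fixed reference value).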
Hope this helps!

AWS DMS CDC to S3 target

So I was playing around to see what could be achieved using Database Migration Service (DMS) Change Data Capture, taking data from MSSQL to S3 and also to Redshift.
The Redshift testing was fine: if I delete a record in my source DB, a second or two later the record disappears from Redshift. Same with insert, update, etc.
But S3 ...
You get the original record from the first full load.
Then if you update a record in the source, S3 receives a new copy of the record, marked with an 'I'.
If I delete a record, I get another copy of the record marked with a 'D'.
So my question is: what do I do with all this?
How would I query my S3 bucket to see the 'current' state of my data set as it reflects the source database?
Do I have to script some code myself to pick up all these files and process them, applying the inserts, updates and deletes until I finally resolve back to a 'normal' data set?
Any insight welcomed!
The records containing 'I', 'D' or 'U' are CDC data (change data capture). This is sometimes called "history" or "historical data". This type of data has applications in data warehousing and can also be used in many machine learning use cases.
Now coming to the next point: in order to get the 'current' state of the data set, you have to script/code it yourself. You can use AWS Glue to perform the task. For example, this post explains something similar.
If you do not want to maintain the Glue code, a shortcut is not to use the S3 target with DMS directly, but to use a Redshift target and, once all the CDC has been applied, offload the final copy to S3 using the Redshift UNLOAD command.
As explained here, 'I', 'U' and 'D' mark insert, update and delete operations respectively.
What do we do to get the current state of the DB? An alternative is to first add this additional column to the full-load files as well, i.e. the files from the initial load, written before CDC starts, should also carry the operation column (the DMS S3 target has a setting for this, includeOpForFullLoad).
Now query the data in Athena so that you keep only the records where Op is not in ('D', 'U'), or AR_H_OPERATION is not in ('DELETE', 'UPDATE'). That gives you the correct count (only the count, since a 'U' row only appears when there is already an 'I' row for that entry):
SELECT count(*) FROM "database"."table_name"
WHERE Op NOT IN ('D','U')
Also, to get all of the current records (not just the count), you can write a more involved SQL in Athena: drop the rows where Op = 'D', and for each key keep the row if it appears only once (the 'I'), otherwise pick the latest version (the 'U'). A sketch of that idea follows.
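A minimal sketch of that dedup query, run through boto3; the key column, ordering column, database/table names and output location are assumptions about your setup (the ordering column could be, for example, the one added by the DMS S3 target setting timestampColumnName, if you enabled it), so adjust them to what your files actually contain:

import boto3

# All names below are placeholders for your own database, table and columns.
QUERY = """
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
    FROM "database"."table_name" t
) latest
WHERE rn = 1      -- keep only the newest version of each key
  AND Op <> 'D'   -- and drop keys whose newest version is a delete
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)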

BigQuery: Too many total leaf fields 10852

I am importing some data from Google Cloud Datastore with about 980 columns. I export it to a bucket first and then attempt to import it into BigQuery (using the GCP guide here). However, I get the error: Too many total leaf fields: 10852.
I know for certain that none of the entities has more than 1,000 fields. Is there a possibility that the import process is transforming my data and creating additional fields?
The schemas generated by the Managed Import/Export service will not contain more than 10k fields, so it looks like you are importing into a BigQuery table that already has data. BigQuery takes the union of the existing schema and the new schema, so even if any given entity has fewer than 1,000 fields, the union of all field names across all your entities of a kind, plus the existing fields in the BigQuery schema, can exceed the limit.
Some options you have include:
1) Use a new table for each import into BigQuery.
2) Try using projectionFields to limit the fields loaded into BigQuery.
Jim Morrison's solution (using projectionFields) solved the issue for me.
I ended up passing a list of the entity columns I was interested in and exporting only this subset to BigQuery. The following command-line instruction achieves this:
bq --location=US load --source_format=DATASTORE_BACKUP --projection_fields="field1, field4, field2, field3" --replace mydataset.table gs://mybucket/2019-03-25T02:56:02_47688/default_namespace/kind_data/default_namespace_datakind.export_metadata
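The same load can also be expressed with the BigQuery Python client if you want it inside a script; the dataset, table and bucket path below are placeholders mirroring the command above:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders taken from the bq command above.
table_id = "mydataset.table"
uri = "gs://mybucket/2019-03-25T02:56:02_47688/default_namespace/kind_data/default_namespace_datakind.export_metadata"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
    projection_fields=["field1", "field4", "field2", "field3"],  # only load these entity properties
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # equivalent of --replace
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()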

Not able to load 5 GB CSV file from Cloud Storage to BigQuery table

I have a file larger than 5 GB on Google Cloud Storage, and I am not able to load that file into a BigQuery table.
The errors thrown are:
1) too many errors
2) Too many values in row starting at position: 2213820542
I searched and found it could be because the maximum file size was reached, so my question is: how can I upload a file whose size is greater than the quota policy allows? Please help me. I have a billing account on BigQuery.
5 GB is OK. The error says that the row starting at byte position 2213820542 has more columns than specified, i.e. you provided a schema of n columns and that row splits into more than n columns.
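If you just want the load to go through while you track down the malformed row, one option is to allow a small number of bad records and inspect the errors the job reports. A sketch with the Python client, assuming a plain comma-delimited CSV with a header row; the bucket, file and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes the file has a header row
    autodetect=True,
    max_bad_records=10,    # tolerate a few malformed rows instead of failing the job
)

# Placeholder URI and destination table.
job = client.load_table_from_uri(
    "gs://mybucket/bigfile.csv", "mydataset.mytable", job_config=job_config
)
job.result()

# Non-fatal errors point at the skipped rows, e.g. the one at byte position 2213820542.
for err in job.errors or []:
    print(err)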