Not able to load 5 GB CSV file from Cloud Storage to BigQuery table - google-bigquery

I have a file larger than 5 GB on Google Cloud Storage, and I am not able to load that file into a BigQuery table.
The errors thrown are:
1) too many errors
2) Too many values in row starting at position: 2213820542
I searched and found it could be because the maximum file size has been reached, so my question is: how can I upload a file whose size is greater than the quota policy? Please help me. I have a billing account on BigQuery.

5 GB is OK. The error says that the row starting at position 2213820542 has more columns than specified, e.g. you provided a schema of n columns, and that row has more than n columns after splitting.
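For reference, a minimal sketch of such a load with the BigQuery Python client is below. The bucket, dataset, and table names are placeholders, and max_bad_records is only a way to surface a few offending rows instead of failing with "too many errors"; the real fix is to make the file's columns match the schema.

```python
# Hedged sketch: bucket, dataset, table names and schema handling are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,        # or pass an explicit schema that matches the file
    max_bad_records=10,     # tolerate a handful of malformed rows while debugging
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/big_file.csv",        # hypothetical source file
    "my-project.my_dataset.my_table",     # hypothetical destination table
    job_config=job_config,
)
load_job.result()       # wait for the job to finish
print(load_job.errors)  # row-level error details, if any
```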

Related

AWS QuickSight: monthly summary in percentage from different CSV files

1. I have a Lambda function that runs monthly; it runs an Athena query and exports the results as a CSV file to my S3 bucket.
2. I also have a QuickSight dashboard that uses this CSV file in a dataset and visualizes all the rows of the report. Everything is good and working up to here.
3. Every month I get a new CSV file in my S3 bucket, and I want to add a "Visual Type" to my main dashboard that shows the difference in % from the previous CSV file (previous month).
For example:
My dashboard focuses on the collection of missing updates.
In May I see I have 50 missing updates.
In June I get a CSV file with 25 missing updates.
Now I want the dashboard to reflect, with a "Visual Type", that this month we have reduced the number of missing updates by 50%.
And in July, I get a file with 20 missing updates, so I want to see that we have reduced it by 60% compared to May.
Any idea how I can do it?
I'm not sure I quite understand where you're standing, but I'll assume that you have an S3 manifest that points to an S3 directory rather than a separate manifest (and dataset) per file.
If that's your case, you could try to tackle that comparison by creating a calculated field that uses periodOverPeriodPercentDifference.
Hope this helps!
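To illustrate the arithmetic such a visual should end up showing, here is a rough Python/pandas sketch using the counts from the example above; the report_month column and the stacked-monthly layout are assumptions, not something from the original post.

```python
# Rough sketch of the month-over-month arithmetic; column names are assumptions.
import pandas as pd

monthly = pd.DataFrame({
    "report_month": ["May", "June", "July"],
    "missing_updates": [50, 25, 20],     # counts from the example above
})

# Percent change versus the previous month (the period-over-period comparison)
monthly["pct_vs_prev_month"] = monthly["missing_updates"].pct_change() * 100

# Percent change versus the first month (May), matching the "60% from May" example
baseline = monthly["missing_updates"].iloc[0]
monthly["pct_vs_may"] = (monthly["missing_updates"] - baseline) / baseline * 100

print(monthly)
# June: -50% vs May; July: -20% vs June and -60% vs May
```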

How to store and serve coupons with Google tools and JavaScript

I'll get a list of coupons by mail. That needs to be stored somewhere somehow (BigQuery?) where I can request it and send it to the user. The user should only be able to get one unique code that has not been used before.
I need the ability to get a code and record that it was used, so the next request gets the next code...
I know it is a completely vague question, but I'm not sure how to implement this. Does anyone have any ideas?
Thanks in advance.
There can be multiple solutions for the same requirement; one of them is given below:
Step 1. Get the coupons as a file (CSV, JSON, etc.) as per your preference/requirement.
Step 2. Load the source file to GCS (storage).
Step 3. Write a Dataflow job which reads the data from GCS (the file) and loads it into a BigQuery table (tentative name: New_data). A minimal sketch is shown after this list.
Step 4. Create a Dataflow job to read the data from the BigQuery table New_data, compare it with History_data, identify new coupons, and write the result to a file on GCS or to a BigQuery table.
Step 5. Schedule the entire process with an orchestrator / Cloud Scheduler / cron job.
Step 6. Once you have the data, you can send it to consumers through any communication channel.
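Here is a hedged sketch of the Step 3 pipeline using the Apache Beam Python SDK. The project, bucket, and table names are placeholders, and the CSV is assumed to hold one coupon code per line; this is an illustration, not the exact sample the answer refers to.

```python
# Hedged sketch of GCS -> BigQuery (Step 3); names and file layout are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(line):
    # Each line of the CSV is assumed to hold a single coupon code.
    return {"coupon_code": line.strip(), "used": False}

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # hypothetical project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # hypothetical bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadCoupons" >> beam.io.ReadFromText("gs://my-bucket/coupons.csv")
        | "ToRow" >> beam.Map(to_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:coupons.New_data",
            schema="coupon_code:STRING,used:BOOLEAN",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```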

Is there a way to count the number of accesses to GCS (not function calls) by analysing the audit log in the generated audit table in BigQuery?

I wonder if there is a way to count the actual number of accesses to a certain GCP service by analysing the audit log stored in BigQuery. In other words, I have audit tables exported via a sink to BigQuery (no actual access to Stackdriver). I can see that a number of rows are generated per single access, i.e. it was one physical access to GCS, but about 10 rows were generated due to different function calls. I'd like to be able to say how many attempts/accesses were made by the user account by looking at X number of rows. This is a data example.
Thank you
Assuming that the sink works well and the logs are being exported to BQ, you would need to check the format in which audit logs are written to BQ and the fields that are exported/written to BQ [1].
Then filter the logs by resource_type and member; these documents [2][3] can help you with that.
[1] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#sample
[2] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#sample
[3] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#interpreting_the_sample_audit_log_entry
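As a starting point, here is a hedged sketch of a query run through the BigQuery Python client that groups data-access log rows by principal and method. The dataset name and wildcard table pattern are assumptions that depend on how your sink exports the logs, and turning raw rows into "one physical access" may still require extra filtering on specific method names.

```python
# Hedged sketch: dataset/table names are assumptions; the fields follow the
# standard BigQuery audit-log export format referenced in [1].
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project id

query = """
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS member,
  protopayload_auditlog.methodName AS method,
  COUNT(*) AS log_rows
FROM `my-project.audit_logs.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type = 'gcs_bucket'
GROUP BY member, method
ORDER BY log_rows DESC
"""

for row in client.query(query).result():
    print(row.member, row.method, row.log_rows)
```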

HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements

I'm working on a client platform. It is a data lake linked to AWS S3 and AWS Athena.
I've uploaded a dataset to an S3 bucket using AWS Glue.
The job ran successfully and a table was created in Athena.
When I try to "Preview" the content of the table, I get the following error:
HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements
P.S.: The file I'm uploading contains 34 columns.
Thanks to @ThomasRones' comment:
This indicates that there are more columns in the header row than there are columns in the data rows (actually, than the determined types for those data rows).
A probable cause of this error is that one or more names in the header row contain one or more commas.
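A quick way to check this hypothesis is to compare how many fields a naive comma split yields for the header versus a data row, since LazySimpleSerDe splits on the raw delimiter without quote handling. The file name below is a placeholder for a local copy of the uploaded dataset.

```python
# Hedged diagnostic sketch; "dataset.csv" stands in for a local copy of the file.
with open("dataset.csv") as f:
    header = f.readline().rstrip("\n")
    first_data_row = f.readline().rstrip("\n")

# A naive comma split mimics how LazySimpleSerDe counts columns.
print("header fields:", len(header.split(",")))          # 43 in the reported error
print("data fields:  ", len(first_data_row.split(",")))  # 34 expected
```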

Load order of entries in BigQuery tables

I have some sample data that I've been loading into Google BigQuery. I have been importing the data in NDJSON format. If I load the data all in one file, the rows show up in a different order in the table's preview tab than when I sequentially import them one NDJSON line at a time.
When importing sequentially, I wait until I see the following output:
Waiting on bqjob_XXXX ... (2s) Current status: RUNNING
Waiting on bqjob_XXXX ... (2s) Current status: DONE
The order the rows show up in seems to match the order I append them in, as the job importing them finishes before I move on to the next. But when loading them all in one file, they show up in a different order than they exist in my data file.
So why do the data entries show up in a different order when loading in bulk? How are the data entries queued to be loaded, and how are they indexed into the table?
BigQuery has no notion of indexes. Data in BigQuery tables has no particular order that you can rely on. If you need to get ordered data out of BigQuery, you will need to use an explicit ORDER BY in your query - which, by the way, is not recommended for large results, as it increases resource cost and can end with a Resources Exceeded error.
BigQuery's internal storage can "shuffle" your data rows internally for the best / most optimal query performance. So again - there is no such thing as a physical order of data in BigQuery tables.
The official wording in the docs is: line ordering is not guaranteed for compressed or uncompressed files.
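For completeness, a minimal sketch of pulling deterministically ordered rows with the BigQuery Python client is below; the table name and the inserted_at ordering column are assumptions, since you would need to add such a column yourself if your data has no natural ordering key.

```python
# Hedged sketch: table name and ordering column are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT *
FROM `my-project.my_dataset.my_table`
ORDER BY inserted_at   -- a column you control; BigQuery keeps no physical row order
LIMIT 100
"""

for row in client.query(query).result():
    print(dict(row))
```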