I am segregating my PySpark DataFrame based on multiple columns using the following code:
data.write.option("header",True).partitionBy("Col1","Col2","Col3","Col4").mode("overwrite").csv("/../")
When I segregate by 2 columns it works, but when I specify more than 2 columns it stops with the error: 'org.apache.spark.SparkException: Job aborted'. Can anyone please tell me how to do it?
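For anyone hitting the same thing: 'Job aborted' is only the outer wrapper, and the real cause is further down the stack trace in the driver logs or the Spark UI. Below is a minimal sketch of two things worth checking before retrying, nulls in the partition columns and the number of output files; it assumes the column names and path from the snippet above.

# Sketch only: check for nulls in the partition columns (null values end up in a
# __HIVE_DEFAULT_PARTITION__ directory and can surface odd write failures), then
# repartition by the same columns so each output partition is written by one task,
# which keeps the number of small files and per-task memory down.
from pyspark.sql import functions as F

part_cols = ["Col1", "Col2", "Col3", "Col4"]

null_rows = data.filter(
    F.greatest(*[F.col(c).isNull().cast("int") for c in part_cols]) == 1
).count()
print(f"rows with a null partition value: {null_rows}")

(data.repartition(*part_cols)
     .write.option("header", True)
     .partitionBy(*part_cols)
     .mode("overwrite")
     .csv("/../"))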
I'm currently migrating around 200 tables in BigQuery (BQ) from one dataset (FROM_DATASET) to another one (TO_DATASET). Each of these tables has a _TABLE_SUFFIX corresponding to a date (I have three years of data for each table). Each suffix typically contains between 5 GB and 80 GB of data.
I'm doing this using a Python script that asks BQ, for each table, for each suffix, to run the following query:
-- example table=T_SOME_TABLE, suffix=20190915
CREATE OR REPLACE TABLE `my-project.TO_DATASET.T_SOME_TABLE_20190915`
COPY `my-project.FROM_DATASET.T_SOME_TABLE_20190915`
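In other words, the script is essentially a loop along these lines (a simplified sketch; the table and suffix lists here are illustrative stand-ins for the real ~200 tables and three years of dates):

# Simplified sketch of the migration loop, using the google-cloud-bigquery client.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

tables = ["T_SOME_TABLE"]    # ~200 tables in practice
suffixes = ["20190915"]      # one suffix per date, three years per table

for table in tables:
    for suffix in suffixes:
        sql = f"""
        CREATE OR REPLACE TABLE `my-project.TO_DATASET.{table}_{suffix}`
        COPY `my-project.FROM_DATASET.{table}_{suffix}`
        """
        client.query(sql).result()  # blocks until the copy job finishes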
Everything works except for three tables (and all their suffixes) where the copy job fails at each _TABLE_SUFFIX with this error:
An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 4893854
Retrying the job after some time actually works, but of course it slows the process down. Does anyone have an idea of what the problem might be?
Thanks.
It turned out that those three problematic tables were some legacy ones with lots of columns. In particular, the BQ GUI shows this warning for two of them:
"Schema and preview are not displayed because the table has too many
columns and may cause the BigQuery console to become unresponsive"
This was probably the issue.
In the end, I managed to migrate everything by implementing a backoff mechanism to retry failed jobs.
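Roughly, a backoff of that kind can look like the sketch below; copy_table is a hypothetical stand-in for running the CREATE OR REPLACE ... COPY query shown above, and the delays are arbitrary:

# Sketch of an exponential back-off retry wrapper around the per-suffix copy.
import time
from google.api_core.exceptions import GoogleAPIError

def copy_with_backoff(copy_table, table, suffix, max_attempts=5):
    delay = 30  # seconds before the first retry
    for attempt in range(1, max_attempts + 1):
        try:
            copy_table(table, suffix)   # stand-in for the copy query above
            return
        except GoogleAPIError as exc:
            if attempt == max_attempts:
                raise
            print(f"{table}_{suffix}: attempt {attempt} failed ({exc}), "
                  f"retrying in {delay}s")
            time.sleep(delay)
            delay *= 2  # exponential back-off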
When trying to schedule a query in BQ, I am getting the following error:
Error code 3 : Query error: Not found: Dataset was not found in location EU at [2:1]
Is this a permissions issue?
This sounds like a case of the scheduled query being configured to run in a different region than either the referenced tables or the destination table of the query.
Put another way, BigQuery requires a consistent location for reading and writing, and does not allow a query in location A to write results in location B.
https://cloud.google.com/bigquery/docs/scheduling-queries has some additional information about this.
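If it helps, the location of each dataset involved can be checked quickly with the Python client; a minimal sketch, with placeholder dataset IDs:

# Print where each dataset actually lives (dataset IDs below are placeholders).
from google.cloud import bigquery

client = bigquery.Client()
for dataset_id in ["source_dataset", "destination_dataset"]:
    ds = client.get_dataset(dataset_id)
    print(dataset_id, "->", ds.location)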
I'm using a Python/pandas script to pull from a database, and this cell is causing me issues:
import pandas as pd

# `response` holds the records returned by the database pull made earlier in the script
positions = pd.DataFrame.from_dict(response)
print(f'Retrieved {positions.shape[0]} records.')
positions.head()
When I run it while only pulling a few records it works, but when I try to pull thousands, I get this error:
Exception: Query failed to run by returning code of 400.
Is it the size of the pull causing issues, and if so, how do I go about resolving that?
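One hedged workaround, assuming the 400 comes from a response-size or payload limit on the API side rather than from the query itself, is to pull the data in smaller chunks and concatenate them. In the sketch below, run_query is a placeholder for whatever call currently produces response, and the table name and chunk size are illustrative:

# Sketch: page through the result set instead of requesting everything at once.
import pandas as pd

CHUNK = 5_000
frames = []
offset = 0
while True:
    rows = run_query(f"SELECT * FROM positions LIMIT {CHUNK} OFFSET {offset}")
    if not rows:
        break
    frames.append(pd.DataFrame.from_dict(rows))
    offset += CHUNK

positions = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f'Retrieved {positions.shape[0]} records.')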
I am trying to load some data from Google Sheets into BigQuery. It's 5 sheets, and I am using Google Apps Script to do it.
I am trying to load the very same data as I was loading last week; the problem is that all the upload jobs are now failing, for every single sheet.
I am getting:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1280; errors: 1. Please look into the errors[] collection for more details.
When I look in the CLI, I see: -bash: syntax error near unexpected token `newline'
The problem is that my doc has no stray newline characters at all. I am also getting the same "1280" number for all 5 sheets I am trying to upload, and NONE of them has 1280 rows.
The second column is the name of the sheet being uploaded; all of them are suddenly getting the same error.
BigQuery is quite powerful, but this random flakiness is totally killing the experience.
Any ideas what could be wrong?
Is there a way to get all the bad records that get skipped during a BigQuery load job when --max_bad_records is set?
I believe the status.errors field will have a list of errors that occurred during job processing, including non-fatal errors like bad rows that were skipped.
https://cloud.google.com/bigquery/docs/reference/v2/jobs
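For example, with the google-cloud-bigquery Python client (a sketch; the job ID below is a placeholder), the recorded errors for a completed load job can be read from that field:

# Sketch: inspect the non-fatal row errors recorded for a finished load job.
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("bqjob_r1234_0000018_1")  # placeholder load job ID

for err in job.errors or []:
    # Each entry is a dict with reason / location / message describing the skipped record.
    print(err)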