pyspark.sql.utils.IllegalArgumentException: requirement failed: Temporary GCS path has not been set - google-bigquery

On Google Cloud Platform, I am trying to submit a pyspark job that writes a dataframe to BigQuery.
The code that performs the write is as follows:
finalDF.write.format("bigquery")\
.mode('overwrite')\
.option("table","[PROJECT_ID].dataset.table")\
.save()
I get the error mentioned in the title. How can I set the temporary GCS path?

As the GitHub repository of spark-bigquery-connector states, one can specify it when writing:
df.write
.format("bigquery")
.option("temporaryGcsBucket","some-bucket")
.save("dataset.table")
Or in a global manner:
spark.conf.set("temporaryGcsBucket","some-bucket")

The "temporaryGcsBucket" property needs to be set either when writing the dataframe or when creating the SparkSession:
.option("temporaryGcsBucket","some-bucket")
or, with an optional path inside the bucket:
.option("temporaryGcsBucket","some-bucket/optional_path")
Applied to the code in the question:
finalDF.write.format("bigquery").mode('overwrite').option("temporaryGcsBucket","some-bucket").option("table","[PROJECT_ID].dataset.table").save()
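The two answers above can be combined into a small helper. This is only a sketch: bigquery_write_options and write_to_bigquery are hypothetical names, df is assumed to be a pyspark.sql.DataFrame, and "some-bucket" is assumed to be an existing bucket the job can write to. The option names "table" and "temporaryGcsBucket" are the ones quoted above.

```python
def bigquery_write_options(table, temp_bucket):
    """Collect the connector options for a write in one place.

    Hypothetical helper: it only builds a dict and does not talk to Spark.
    """
    return {"table": table, "temporaryGcsBucket": temp_bucket}


def write_to_bigquery(df, table, temp_bucket, mode="overwrite"):
    """Write df to BigQuery, staging the data in temp_bucket.

    Assumes df is a pyspark.sql.DataFrame and that the spark-bigquery
    connector is available on the cluster.
    """
    opts = bigquery_write_options(table, temp_bucket)
    # DataFrameWriter.options(**kwargs) sets several options in one call.
    df.write.format("bigquery").mode(mode).options(**opts).save()


# Example values taken from the answers above.
opts = bigquery_write_options("dataset.table", "some-bucket")
print(opts)
```

In a job this would be called as write_to_bigquery(finalDF, "dataset.table", "some-bucket"). Keeping the options in a single dict makes it harder to forget the temporary bucket on one of several writes.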

Related

sqlite3.OperationalError: When trying to connect to S3 Airflow Hook

I'm currently exploring implementing hooks in some of my DAGs. For instance, in one DAG, I'm trying to connect to S3 to send a CSV file to a bucket, which then gets copied to a Redshift table.
I have a custom module which I import to run this process. I am now trying to set up an S3Hook to handle this process instead, but I'm a little confused about how to set up the connection and how everything works.
First, I import the hook
from airflow.hooks.S3_hook import S3Hook
Then I try to make the hook instance
s3_hook = S3Hook(aws_conn_id='aws-s3')
Next I try to set up the client
s3_client = s3_hook.get_conn()
However, when I run the client line above, I receive this error:
OperationalError: (sqlite3.OperationalError)
no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?
LIMIT ? OFFSET ?]
[parameters: ('aws-s3', 1, 0)]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
I'm trying to diagnose the error, but the traceback is long. I'm a little confused about why sqlite3 is involved when I'm trying to use S3. Can anyone unpack this? Why is this error thrown when I try to set up the client?
Thanks
Airflow is not just a library - it's also an application.
To execute Airflow code you must have an Airflow instance running, which also means having a database with the required schema.
To create the tables you must run airflow db init (airflow initdb on Airflow 1.x).
Edit:
After the discussion in the comments: your issue is that you have a working Airflow application inside Docker, but your DAGs are written on your local disk. Docker is a closed environment; if you want Airflow to recognize your DAGs, you must move the files to the DAG folder inside the container.
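To see why sqlite3 shows up in the traceback: a hook looks up its connection id in Airflow's metadata database, and without further configuration that database is a local SQLite file. The sketch below is not Airflow code. It uses only the standard library, a simplified two-column connection table, and a hypothetical lookup_connection function to reproduce the same error against a database that was never initialised.

```python
import sqlite3


def lookup_connection(db, conn_id):
    """Fetch one row from the `connection` table, roughly like the query in the traceback."""
    cur = db.execute(
        "SELECT conn_id, conn_type FROM connection WHERE conn_id = ? LIMIT 1",
        (conn_id,),
    )
    return cur.fetchone()


# An empty in-memory database stands in for an uninitialised metadata database.
db = sqlite3.connect(":memory:")
try:
    lookup_connection(db, "aws-s3")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)

# Once the table exists (in Airflow, the db initialisation command creates it), the lookup works.
db.execute("CREATE TABLE connection (conn_id TEXT, conn_type TEXT)")
db.execute("INSERT INTO connection VALUES (?, ?)", ("aws-s3", "aws"))
row = lookup_connection(db, "aws-s3")
print(row)
```

The failure happens before anything is sent to S3, which is why the error mentions only the database.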

Directly passing pandas data into zipline

I am currently looking for a way to directly pass in a pandas dataframe or csv file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including the zipline source code. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a df directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles on the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This comment is from the run_algo.py file of zipline:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
The ohlcv data to run the backtest with.
This argument is mutually exclusive with:
``bundle``
``bundle_timestamp``
Hope it helped

Federated table/query not working - "Cannot read in location: us-west1"

I have a GCS bucket in US-WEST1:
That bucket has two files:
wiki_1b_000000000000.csv.gz
wiki_1b_000000000001.csv.gz
I've created an external table definition to read those files like so:
The dataset where this external table definition exists is also in the US.
When I query it with:
SELECT
*
FROM
`grey-sort-challenge.bigtable.federated`
LIMIT
100
..I get the following error:
Error: Cannot read in location: us-west1
I tested with asia-northeast1 and it works fine.
Why isn't this working for the US region?
Faced the same earlier. See G's answer - must use us-central1 for now: https://issuetracker.google.com/issues/76127552#comment11
For people from Europe
If you get an error Cannot read in location: EU while trying to read from external source - regional GCS bucket, you have to place your data in region europe-west1 as per the same comment. Unfortunately it is not reflected in the documentation yet.
I wanted to create a federated (external) table to continually load data from a new CSV file that was imported each day.
In attempting to do so I was getting "Error: Cannot read in location: xxxx".
I solved the problem as follows:
I created a NEW bucket, this time selecting US (multiple regions) as the location.
I then went back to BigQuery and created a NEW dataset with the data location set to United States (US).
Presto! I am now able to query a (constantly updating) external table.

Saving RDD to file results in _temporary path for parts

I have data in Spark which I want to save to S3. The recommended way to save it is the saveAsTextFile method on the RDD, which completes successfully. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data, it has been saved in a folder named _temporary, with a subfolder 0, and then each part or task is saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect and have seen something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?
I had the same problem with Spark Streaming. It was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.
Try using coalesce() to reduce the rdd to 1 partition before you export.
Good luck!

google big query: export table to own bucket results in unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried a lot of different variations of the Cloud Storage target, and all of them produce the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern. For example:
gs://foo/bar/baz*.gz in case of GZIP (compressed)
gs://foo/bar/baz*.csv in case of csv (uncompressed)
Here foo is the bucket name, and the bar directory can be a date in string format, generated on the fly.
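A destination URI of that shape can be generated in a few lines. This is a sketch only: sharded_export_uri is a hypothetical helper, and the YYYYMMDD layout of the date directory is an assumption, since the answer does not prescribe a format.

```python
from datetime import date


def sharded_export_uri(bucket, prefix, day, extension):
    """Build gs://<bucket>/<YYYYMMDD>/<prefix>*.<extension>.

    The * is the wildcard that lets the export be split across several files.
    """
    return f"gs://{bucket}/{day.strftime('%Y%m%d')}/{prefix}*.{extension}"


# A fixed date keeps the example reproducible; date.today() would generate it on the fly.
uri = sharded_export_uri("foo", "baz", date(2015, 1, 31), "csv")
print(uri)
```

The resulting string can be passed as the destination URI of the export job.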
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json