Saving RDD to file results in _temporary path for parts - amazon-s3

I have data in Spark which I want to save to S3. The recommended method is the saveAsTextFile method on the RDD, and the call succeeds. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data it has been saved in a folder name _temporary, with a subfolder 0 and then each part or task saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect, and have seen before, something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?

I had the same problem with Spark Streaming: my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark runs with a single local thread, which the stream's receiver occupies, so saveAsTextFile never executes during the stream.

Try using coalesce(1) to reduce the RDD to a single partition before you export; a minimal sketch follows below.
Good luck!
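A minimal PySpark sketch of that suggestion (the bucket path is the one from the question; the sample data is a stand-in for the real RDD):
from pyspark import SparkContext

sc = SparkContext("local[*]", "SaveToS3")
data = sc.parallelize(["a", "b", "c"])  # stand-in for the real RDD
# coalesce(1) collapses the output into a single part file
data.coalesce(1).saveAsTextFile("s3://kirk/data")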

Related

Directly passing pandas data into zipline

I am currently looking for a way to directly pass a pandas DataFrame or CSV file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest, and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including the zipline source code. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a DataFrame directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles on the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This docstring is from zipline's run_algo.py file:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
The ohlcv data to run the backtest with.
This argument is mutually exclusive with:
``bundle``
``bundle_timestamp``
Hope it helps. A rough sketch of the call is below.
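A rough sketch, assuming zipline 1.3.0, where run_algorithm (defined in run_algo.py) accepts the data argument quoted above. The OHLCV object and the pickle file name are hypothetical; the exact shape the data must take should be checked against that docstring:
import pandas as pd
from zipline import run_algorithm

def initialize(context):
    pass

def handle_data(context, data):
    pass

# hypothetical pre-built OHLCV data in one of the accepted shapes
# (pd.DataFrame, pd.Panel, or DataPortal)
ohlcv = pd.read_pickle("ohlcv_panel.pkl")

result = run_algorithm(
    start=pd.Timestamp("2017-01-03", tz="utc"),
    end=pd.Timestamp("2017-12-29", tz="utc"),
    initialize=initialize,
    handle_data=handle_data,
    capital_base=100000,
    data=ohlcv,  # passed directly, no bundle ingestion
)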

pyspark.sql.utils.IllegalArgumentException: requirement failed: Temporary GCS path has not been set

On Google Cloud Platform, I am trying to submit a pyspark job that writes a dataframe to BigQuery.
The code that performs the write is the following:
finalDF.write.format("bigquery")\
.mode('overwrite')\
.option("table","[PROJECT_ID].dataset.table")\
.save()
And I get the error mentioned in the title. How can I set the temporary GCS path?
As the GitHub repository of the spark-bigquery-connector states, one can specify it when writing:
df.write
.format("bigquery")
.option("temporaryGcsBucket","some-bucket")
.save("dataset.table")
Or in a global manner:
spark.conf.set("temporaryGcsBucket","some-bucket")
Property "temporaryGcsBucket" needs to be set either at the time of writing dataframe or while creating sparkSession.
.option("temporaryGcsBucket","some-bucket")
or like .option("temporaryGcsBucket","some-bucket/optional_path")
1. finalDF.write.format("bigquery") .mode('overwrite').option("temporaryGcsBucket","some-bucket").option("table","[PROJECT_ID].dataset.table") .save()
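Putting it together, a minimal PySpark sketch (the project, dataset, and bucket names are the placeholders from the question and answer; it assumes the spark-bigquery-connector jar is on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteToBigQuery").getOrCreate()
# global alternative to the per-write .option("temporaryGcsBucket", ...)
spark.conf.set("temporaryGcsBucket", "some-bucket")

finalDF = spark.createDataFrame([(1, "a")], ["id", "value"])  # stand-in data

finalDF.write.format("bigquery") \
    .mode("overwrite") \
    .option("table", "[PROJECT_ID].dataset.table") \
    .save()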

Data Factory V2 - Wildcards

I am trying to move & decompress data from Azure Data Lake Storage Gen1.
I have a couple of files with ".tsv.gz" extension, and I want to decompress and move them to a different folder, which is in the same data lake.
I've tried to use the wildcard "*.tsv.gz" inside the connection configuration, so I can do this process all at once.
Am I making a mistake somewhere?
Thanks
Just tested it; you should use:
*.tsv.gz
without quotes (no ' or ").
Hope this helped!
PS: also remember to check "Copy file recursively" when you select the dataset in the pipeline.

Using a text file as Spark streaming source for testing purpose

I want to write a test for my Spark Streaming application that consumes a Flume source.
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock but for the moment reading a file and verifying outputs would be enough for me.
So I wish to use:
JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();
Unfortunately it does not print anything.
I tried:
dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
adding the text file in the directory BEFORE the program begins
adding the text file in the directory WHILE the program run
Nothing works.
Any suggestions for reading from a text file?
Thanks,
Martin
The order of start and await is indeed reversed.
In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream. It's a mutable queue of RDD of arbitrary data. This means that you could create the data programmatically or load it from disk into an RDD and pass that to your Spark Streaming code.
E.g., to avoid the timing issues faced with the file consumer, you could try this:
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

// load test data from disk (or build it programmatically)
val rdd = sparkContext.textFile(...)
// a mutable queue of RDDs that backs the test DStream
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd
val dstream = streamingContext.queueStream(rddQueue)
doMyStuffWithDstream(dstream)
streamingContext.start()
streamingContext.awaitTermination()
I am so stupid: I had inverted the calls to start() and awaitTermination().
If you want to do the same, you should read from HDFS, and add the file WHILE the program runs.
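For reference, the corrected ordering in a minimal PySpark sketch (the HDFS path is the placeholder from the question):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[*]", "FileStreamTest")
ssc = StreamingContext(sc, 1)  # 1-second batches

stream = ssc.textFileStream("hdfs://node:port/absolute/path/on/hdfs/")
stream.pprint()

ssc.start()             # start() must come first...
ssc.awaitTermination()  # ...then block until termination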

google big query: export table to own bucket results in unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations; all produce the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern, for example:
gs://foo/bar/baz*.gz for GZIP (compressed)
gs://foo/bar/baz*.csv for CSV (uncompressed)
Here foo is the bucket name, and bar can be, for example, a date string generated on the fly.
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json
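The same sharded export can also be done from Python with the google-cloud-bigquery client library; a minimal sketch (the project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
# the * shards the export across multiple files, which is
# required for tables too large for a single file
destination_uri = "gs://foo/bar/baz-*.csv"

extract_job = client.extract_table(
    "myproject.mydataset.mytable",
    destination_uri,
    location="US",  # must match the dataset's location
)
extract_job.result()  # wait for the export to complete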