reading from hive table and updating same table in pyspark - using checkpoint - hive

I am using Spark version 2.3 and trying to read a Hive table in Spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
Now I am facing an issue when I try to write this dataframe back as a Hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
So I am checkpointing the dataframe to break the lineage, since I am reading from and writing to the same table:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column gets added to the Hive table, but I have to delete the checkpoint files every time they are created. Is there a better way to break the lineage and write the same dataframe with the updated column, saving it to an HDFS location or as a Hive table?
Or is there a way to specify a temporary location for the checkpoint directory that gets deleted once the Spark session completes?

As we discussed in this post, setting the property below is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not care to add a cleanup solution.
Setting the above property works sometimes (tested with Scala, Java and Python), but it's hard to rely on. The official documentation says the property "Controls whether to clean checkpoint files if the reference is out of scope." I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
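One thing that may help (an assumption on my part, not something I have verified across versions): the cleaner settings live at the SparkContext level, so setting the property when the session is first built may be more reliable than calling spark.conf.set() on an already running session. A minimal sketch:

from pyspark.sql import SparkSession

# Hedged sketch: set the cleaner property before the context is created,
# and point the checkpoint directory at a temp location.
spark = (SparkSession.builder
         .appName("emp-load")  # hypothetical app name
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .enableHiveSupport()
         .getOrCreate())
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")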
Regarding
Is there a better way to break the lineage
Check this question; @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself though.
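For reference, that trick boils down to rebuilding the DataFrame from its RDD and schema, so the logical plan no longer points at the source table. A rough sketch of the idea (again, untested by me):

import pyspark.sql.functions as F

df = spark.table("emp.emptable")
# Rebuilding from the RDD and schema drops the plan that reads the table
newdf = spark.createDataFrame(df.rdd, df.schema).withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")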
Just FYI, I usually don't rely on the above property and delete the checkpointed directory in the code itself, to be on the safe side.
We can get the checkpointed directory like below:
Scala :
//Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
//It returns a String, so we can use org.apache.hadoop.fs to delete the path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# notice the 'u' at the start: it returns a unicode object, so use str(t)
# Below are the steps to get the Hadoop FileSystem object and delete the path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False
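Putting the PySpark steps above into one place, a small cleanup helper you could call at the end of the job might look like this (a sketch using the same JVM calls as above):

def clear_checkpoint_dir(spark):
    # Delete the current checkpoint directory, if one was set (recursive delete).
    sc = spark.sparkContext
    opt = sc._jsc.sc().getCheckpointDir()
    if opt.isDefined():
        path = sc._jvm.org.apache.hadoop.fs.Path(str(opt.get()))
        fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        if fs.exists(path):
            fs.delete(path, True)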

Related

Directly passing pandas data into zipline

I am currently looking for a way to directly pass a pandas dataframe or a CSV file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest, and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including the source code of zipline. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a df directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles from the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This comment is from the run_algo.py file of zipline:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
The ohlcv data to run the backtest with.
This argument is mutually exclusive with:
``bundle``
``bundle_timestamp``
Hope it helped
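For what it's worth, here is a minimal sketch of how I would try to use that parameter. The exact entry point and the accepted data types vary between zipline versions, so treat the names below as assumptions to verify against run_algo.py in your install:

import pandas as pd
from zipline import run_algorithm  # assumed to expose the `data` argument documented above

def initialize(context):
    pass

# Hypothetical daily OHLCV frame; in practice this would be your own generated data.
dates = pd.date_range('2018-01-02', '2018-12-31', tz='utc')
ohlcv = pd.DataFrame(
    {'open': 100.0, 'high': 101.0, 'low': 99.0, 'close': 100.5, 'volume': 1e6},
    index=dates,
)

result = run_algorithm(
    start=dates[0],
    end=dates[-1],
    initialize=initialize,
    capital_base=10000,
    data=ohlcv,  # mutually exclusive with `bundle` / `bundle_timestamp`
)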

Load data only once on the RAM using Python

Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time before I can see what is happening, because the data is loaded every time. Is there a way to load the data once and then just work with it?
I have been using Jupyter notebooks and that works really well, but I would like a way to fix this problem with plain Python code.
The sequence of my code is:
File 1: contains all the functions;
File 2: contains all the variables and calls file 1 in order to know what to do with the data.
File 1 = functions.py:
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    ## Sorting files according to smallest timestamp to largest ##
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2:
import glob
from os.path import join as filejoin  # assumed imports; the original snippet omitted them

import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)
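One common way to avoid re-parsing on every run (a sketch on my side, assuming the .dat files fit in memory and change rarely) is to cache the parsed dictionary to disk once and reuse it on later runs, so the slow np.loadtxt step only happens the first time:

import os
import glob
import pickle
from os.path import join as filejoin

import functions as f

CACHE_FILE = 'parsed_files.pkl'  # hypothetical cache location

def load_data(directory):
    # Reuse the cached dictionary if it exists; otherwise parse the .dat files and cache the result.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as fh:
            return pickle.load(fh)
    file_path = glob.glob(filejoin(directory, '*.dat'))
    data = f.dict_files(file_path)
    with open(CACHE_FILE, 'wb') as fh:
        pickle.dump(data, fh)
    return data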

AWS SageMaker pd.read_pickle() doesn't work but read_csv() does?

I've recently been trying to train some models on an AWS SageMaker jupyter notebook instance.
Everything worked very well until I tried to load in a custom dataset (REDD) from files.
I have the dataframes stored in Pickle (.pkl) files on an S3 bucket. I couldn't manage to read them into SageMaker, so I decided to convert them to CSVs, as that seemed to work, but I ran into a problem. This data has an index of type datetime64, and when using .to_csv() this index gets converted to plain text and loses its data structure (and I need to keep this specific index for correct plotting).
So I decided to try the Pickle files again but I can't get it to work and have no idea why.
The following code for CSVs works, but I can't use it due to the index problem:
bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)
for file in range(6):
    houses_dfs[file+1] = pd.read_csv(data_location+'house_'+str(file+1)+'.csv', index_col='Unnamed: 0')
But this code does NOT work even though it uses almost the exact same syntax:
bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)
for file in range(6):
    houses_dfs[file+1] = pd.read_pickle(data_location+'house_'+str(file+1)+'.pkl')
Yes, it's 100% the correct path, because the CSV and PKL files are stored in the same directory (compressed_data).
It throws me this error while using the Pickle method:
FileNotFoundError: [Errno 2] No such file or directory: 's3://sagemaker-peno/compressed_data/house_1.pkl'
I hope to find someone who has dealt with this before and can solve the read_pickle() issue, or alternatively fix my datetime64 type issue with CSVs.
Thanks in advance!
read_pickle() prefers a full path over a path relative to where it was run. This fixed my issue.
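If the full-path trick is not enough (as far as I know, older pandas versions supported s3:// URLs in read_csv but not in read_pickle), another workaround, not the one used above, is to download the objects with boto3 and read them locally:

import boto3
import pandas as pd

s3 = boto3.client('s3')
houses_dfs = {}

for file in range(6):
    key = 'compressed_data/house_{}.pkl'.format(file + 1)
    local_path = '/tmp/house_{}.pkl'.format(file + 1)
    s3.download_file('sagemaker-peno', key, local_path)  # bucket and key taken from the question
    houses_dfs[file + 1] = pd.read_pickle(local_path)

For the CSV route, passing parse_dates=True together with index_col when reading, for example pd.read_csv(path, index_col=0, parse_dates=True), rebuilds the datetime64 index that to_csv() flattened to text.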

Nullable field is changed upon writing a Spark Dataframe

The following code reads a Spark DataFrame from a Parquet file and writes it to another Parquet file. The nullable field of an ArrayType column is changed after writing the DataFrame to the new Parquet file.
Code:
SparkConf sparkConf = new SparkConf();
String master = "local[2]";
sparkConf.setMaster(master);
sparkConf.setAppName("Local Spark Test");
JavaSparkContext sparkContext = new JavaSparkContext(new SparkContext(sparkConf));
SQLContext sqc = new SQLContext(sparkContext);
DataFrame dataFrame = sqc.read().parquet("src/test/resources/users.parquet");
StructField[] fields = dataFrame.schema().fields();
System.out.println(fields[2].dataType());
dataFrame.write().mode(SaveMode.Overwrite).parquet("src/test/resources/users1.parquet");
DataFrame dataFrame1 = sqc.read().parquet("src/test/resources/users1.parquet");
StructField [] fields1 = dataFrame1.schema().fields();
System.out.println(fields1[2].dataType());
Output:
ArrayType(IntegerType,false)
ArrayType(IntegerType,true)
Spark version is: 1.6.2
In Spark 2.4 and earlier, all columns written from Spark SQL to Parquet are nullable. Quoting the official guide:
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
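If the original nullability matters downstream, one workaround is to re-impose the original schema after reading the written file. A PySpark sketch for brevity (Spark 2.x API, untested against 1.6.2), mirroring the paths in the Java example above:

df = spark.read.parquet("src/test/resources/users.parquet")
original_schema = df.schema  # still has ArrayType(IntegerType, containsNull=False)

df.write.mode("overwrite").parquet("src/test/resources/users1.parquet")

# Re-applying the saved schema restores the original nullable flags
df1 = spark.read.parquet("src/test/resources/users1.parquet")
df1 = spark.createDataFrame(df1.rdd, original_schema)
print(df1.schema.fields[2].dataType)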

Renaming an Amazon CloudWatch Alarm

I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field when editing. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys
import boto

def rename_alarm(alarm_name, new_alarm_name):
    conn = boto.connect_cloudwatch()

    def get_alarm():
        alarms = conn.describe_alarms(alarm_names=[alarm_name])
        if not alarms:
            raise Exception("Alarm '%s' not found" % alarm_name)
        return alarms[0]

    alarm = get_alarm()

    # work around boto comparison serialization issue
    # https://github.com/boto/boto/issues/1311
    alarm.comparison = alarm._cmp_map.get(alarm.comparison)

    alarm.name = new_alarm_name
    conn.update_alarm(alarm)

    # update actually creates a new alarm because the name has changed, so
    # we have to manually delete the old one
    get_alarm().delete()

if __name__ == '__main__':
    alarm_name, new_alarm_name = sys.argv[1:3]
    rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an EC2 instance with a role that allows this, or you have a ~/.boto file with your credentials. It's easy enough to add yours manually.
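The script above uses the legacy boto library; a rough boto3 equivalent of the same copy-then-delete idea might look like this (a sketch, copying only the most common alarm fields, so adjust it to your alarms before use):

import sys
import boto3

COPY_FIELDS = ('AlarmDescription', 'ActionsEnabled', 'OKActions', 'AlarmActions',
               'InsufficientDataActions', 'MetricName', 'Namespace', 'Statistic',
               'Dimensions', 'Period', 'EvaluationPeriods', 'Threshold',
               'ComparisonOperator', 'TreatMissingData', 'Unit')

def rename_alarm(alarm_name, new_alarm_name):
    cw = boto3.client('cloudwatch')
    alarms = cw.describe_alarms(AlarmNames=[alarm_name])['MetricAlarms']
    if not alarms:
        raise Exception("Alarm '%s' not found" % alarm_name)
    alarm = alarms[0]
    # recreate the alarm under the new name, then delete the old one
    params = {k: v for k, v in alarm.items() if k in COPY_FIELDS}
    cw.put_metric_alarm(AlarmName=new_alarm_name, **params)
    cw.delete_alarms(AlarmNames=[alarm_name])

if __name__ == '__main__':
    rename_alarm(*sys.argv[1:3])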
Unfortunately it looks like this is not currently possible.
I looked around for the same solution, but it seems neither the console nor the CloudWatch API provides that feature.
Note: we can, however, copy the existing alarm with the same parameters and save it under a new name.