How to find files rejected due to errors in the Apache Beam Java SDK - google-bigquery

I have N files of the same type to process, and I will be giving a wildcard input pattern (C:\\users\\*\\*).
Now, how do I find the file name and the record that were rejected while uploading to BigQuery in Java?

I guess BQ writes to the temp location path that you pass to your pipeline, not to the local machine [honestly not sure about this].
In my case, with Python, I used to pass the temp location as a GCS bucket, and when an error occurs, the command-line logs usually show the name of the log file that contains the rejected records.
Then I use the gsutil cp command to copy it to my local computer and read it.
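If you prefer doing that copy from Python rather than gsutil, here is a minimal sketch using the google-cloud-storage client; the bucket and log-file names are placeholders for whatever path the error message reports.

from google.cloud import storage

client = storage.Client()
# The temp-location bucket you passed to the pipeline, and the error-log path
# reported in the command-line logs; both names are placeholders.
bucket = client.bucket('my-temp-bucket')
blob = bucket.blob('path/reported/in/logs/errors.json')
blob.download_to_filename('errors.json')  # then read it locally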

BigQuery I/O (Java and Python SDK) supports the deadletter pattern: https://beam.apache.org/documentation/patterns/bigqueryio/.
Java

result
    .getFailedInsertsWithErr()
    .apply(
        MapElements.into(TypeDescriptors.strings())
            .via(
                x -> {
                  System.out.println(" The table was " + x.getTable());
                  System.out.println(" The row was " + x.getRow());
                  System.out.println(" The error was " + x.getError());
                  return "";
                }));
Python

errors = (
    result['FailedRows']
    | 'PrintErrors' >>
    beam.FlatMap(lambda err: print("Error Found {}".format(err))))
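Building on the Python snippet above, here is a hedged sketch of routing the failed rows somewhere durable instead of only printing them, in this case a dead-letter location in GCS; the gs:// path is a placeholder.

import json
import apache_beam as beam

# Serialize each failed-row element and write it to a dead-letter prefix in GCS.
# default=str keeps any non-JSON-serializable parts (e.g. table references) printable.
deadletter = (
    result['FailedRows']
    | 'SerializeFailures' >> beam.Map(lambda err: json.dumps(err, default=str))
    | 'WriteDeadLetter' >> beam.io.WriteToText(
        'gs://my-bucket/deadletter/failed_rows',
        file_name_suffix='.json'))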

Related

How can I reference an external SQL file using Airflow's BigQuery operator?

I'm currently using Airflow with the BigQuery operator to trigger various SQL scripts. This works fine when the SQL is written directly in the Airflow DAG file. For example:
bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    bql='SELECT * FROM `example.table`',
    destination_dataset_table='example.destination'
)
However, I'd like to store the SQL in a separate file saved to a storage bucket. For example:
bql='gs://example_bucket/sample_script.sql'
When calling this external file I receive a "Template Not Found" error.
I've seen some examples that load the SQL file into the Airflow DAG folder; however, I'd really like to access files saved to a separate storage bucket. Is this possible?
You can reference any SQL file in your Google Cloud Storage bucket. Here is an example where I call the file Query_File.sql from the sql directory in my Airflow DAG bucket.
CONNECTION_ID = 'project_name'

with DAG('dag', schedule_interval='0 9 * * *', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:

    battery_data_quality = BigQueryOperator(
        task_id='task-id',
        sql='/SQL/Query_File.sql',
        destination_dataset_table='project-name.DataSetName.TableName${{ds_nodash}}',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id=CONNECTION_ID,
        use_legacy_sql=False,
        dag=dag
    )
You can also consider using the gcs_to_gcs operator to copy files from your desired bucket into one that is accessible by Composer, as in the sketch below.
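A hedged sketch of that copy, using the Airflow 1.10 contrib operator (newer versions ship the equivalent GCSToGCSOperator in the Google provider package); the bucket names and task id are illustrative.

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# Copy the SQL file from the external bucket into the Composer DAG bucket so
# that template_searchpath can pick it up; all names below are placeholders.
copy_sql = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy-sql-to-dag-bucket',
    source_bucket='example_bucket',
    source_object='sample_script.sql',
    destination_bucket='my-composer-dag-bucket',
    destination_object='sql/sample_script.sql',
)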
Note that download works differently in GoogleCloudStorageDownloadOperator between Airflow versions 1.10.3 and 1.10.15:
def execute(self, context):
    self.object = context['dag_run'].conf['job_name'] + '.sql'
    logging.info('filename in GoogleCloudStorageDownloadOperator: %s', self.object)
    self.filename = context['dag_run'].conf['job_name'] + '.sql'

    self.log.info('Executing download: %s, %s, %s', self.bucket,
                  self.object, self.filename)
    hook = GoogleCloudStorageHook(
        google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
        delegate_to=self.delegate_to
    )

    file_bytes = hook.download(bucket=self.bucket,
                               object=self.object)
    if self.store_to_xcom_key:
        if sys.getsizeof(file_bytes) < 49344:
            context['ti'].xcom_push(key=self.store_to_xcom_key, value=file_bytes.decode('utf-8'))
        else:
            raise RuntimeError(
                'The size of the downloaded file is too large to push to XCom!'
            )
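For context, here is a hedged usage sketch of the stock download operator pushing the file to XCom and feeding it into BigQueryOperator's templated sql field. Import paths are the Airflow 1.10 contrib ones, and the task ids, bucket, and object name are illustrative, not taken from the thread.

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

# Download the SQL into XCom (subject to the size limit checked above) ...
download_sql = GoogleCloudStorageDownloadOperator(
    task_id='download-sql',
    bucket='example_bucket',
    object='sample_script.sql',
    store_to_xcom_key='sql_text',
    dag=dag,
)

# ... and render it into the templated sql field of BigQueryOperator.
run_sql = BigQueryOperator(
    task_id='run-sql',
    sql="{{ ti.xcom_pull(task_ids='download-sql', key='sql_text') }}",
    use_legacy_sql=False,
    dag=dag,
)

download_sql >> run_sql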

(InternalError) when calling the SelectObjectContent operation in boto3

I have a series of files in JSON that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third-party tool and arrive as a JSON object on a single line.
I can use S3 Select to process a small file (around 300 MB uncompressed), but when I try to use a larger file, say 1 GB uncompressed (90 MB gzip compressed), I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128 MB, but the code performing the operation is as follows:
def execute_select_query(bucket, key, query):
    """
    Runs a query against an object in S3.
    """
    if key.endswith("gz"):
        compression = "GZIP"
    else:
        compression = "NONE"
    LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
    return S3_CLIENT.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
        OutputSerialization={'JSON': {}},
    )
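For reference, select_object_content returns an event stream, so the query result (the row count here) arrives in one or more 'Records' payloads and the scan totals in a 'Stats' event. A small sketch of consuming that response; the helper name is mine.

def read_select_result(response):
    """Collect the Records payloads from a select_object_content response."""
    payload = b""
    for event in response["Payload"]:
        if "Records" in event:
            payload += event["Records"]["Payload"]
        elif "Stats" in event:
            details = event["Stats"]["Details"]
            print("Bytes scanned:", details["BytesScanned"],
                  "Bytes processed:", details["BytesProcessed"])
    return payload.decode("utf-8")

# e.g. print(read_select_result(execute_select_query(bucket, key, query)))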

Flink on EMR - no output, either to console or to file

I'm trying to deploy my Flink job on AWS EMR (version 5.15 with Flink 1.4.2). However, I cannot get any output from my stream.
I tried to create a simple job:
object StreamingJob1 {
  def main(args: Array[String]) {
    val path = args(0)

    val file_input_format = new TextInputFormat(
      new org.apache.flink.core.fs.Path(path))
    file_input_format.setFilesFilter(FilePathFilter.createDefaultFilter())
    file_input_format.setNestedFileEnumeration(true)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val myStream: DataStream[String] =
      env.readFile(file_input_format,
                   path,
                   FileProcessingMode.PROCESS_CONTINUOUSLY,
                   1000L)
        .map(s => s.split(",").toString)

    myStream.print()

    // execute program
    env.execute("Flink Streaming Scala")
  }
}
And I executed it using the following command:
HADOOP_CONF_DIR=/etc/hadoop/conf; flink run -m yarn-cluster -yn 4 -c my.pkg.StreamingJob1 /home/hadoop/flink-test-0.1.jar hdfs:///user/hadoop/data/
There was no error, but also no output on the screen except Flink's INFO logs.
I tried outputting to a Kinesis stream and to an S3 file; nothing was recorded.
myStream.addSink(new BucketingSink[String](output_path))
I also tried writing to an HDFS file. In that case a file was created, but with size = 0.
I am sure the input file is being processed, based on a simple check:
myStream.map(s => {"abc".toInt})
which generated an exception.
What am I missing here?
It looks like stream.print() doesn't work on EMR.
Output to file: HDFS is used, and sometimes (or most of the time) I need to wait for the file to be updated.
Output to Kinesis: I had a typo in my stream name. I don't know why I didn't get an exception for the non-existent stream; however, after correcting the name, I got the expected messages.

RevoScaleR can not find file, directory that exists

I am using RevoScaleR, and I have successfully converted CSV files to xdf files, which I have saved to my local disk.
However, when I try to run functions that call these xdf files, I get an error message that there is no such file or directory:
The file or directory 'P:/PROPENSITY/CL_Generic_Retail_201506' cannot be found.
Let me walk through the whole process:
My working directory:
> getwd()
[1] "P:/PROPENSITY"
I used this code to convert the CSV file to xdf:
rx_CL_Generic_Retail_201506 <- rxImport(
  inData = "CL_Generic_Retail_201506_23-05-2017.csv",
  outFile = "CL_Generic_Retail_201506.xdf",
  overwrite = TRUE
)
Then I used this code to check that the conversion was successful:
rxSummary(formula = ~ Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_,
          data = "CL_Generic_Retail_201506.xdf"
)

Summary Statistics Results for: ~Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_
Data: "CL_Generic_Retail_201506.xdf" (RxXdfData Data Source)
File name: CL_Generic_Retail_201506.xdf
Number of valid observations: 7155413

 Name             Mean        StdDev      Min        Max        ValidObs MissingObs
 Avg_Deposits     4562.914627 128614.5683 -325684032 69317080.0 7155413  0
 Total_Num_       7.062068    247.1506    1          224579.0   831567   6323846
 Sumof_CC_AVGBAL_ 951.484138  2249.3149   0          164746.6   601304   6554109
Up to that point everything was fine.
I continued converting files to xdf.
Then I returned to that same file, tried to run the same function (rxSummary), and got the following error message:
> rxSummary(formula = ~ Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_,
+           data = "CL_Generic_Retail_201506.xdf"
+ )
The file or directory 'CL_Generic_Retail_201506.xdf' cannot be found.
If I repeat the process and run rxImport again, rxSummary works again, but after a while the same error comes back.
Could this have to do with backslashes?
That is, the message is:
The file or directory 'P:\PROPENSITY\CL_Generic_Retail_201506.xdf' cannot be found.
But when I ask R to print the working directory it returns:
> getwd()
[1] "P:/PROPENSITY"
Observe that in the RevoScaleR error message the path uses backslashes (\) while R's getwd() output uses forward slashes (/).
If this is the problem, what could I do about it?
By the way, this problem occurs on a workstation where Windows and RevoScaleR are installed. On a notebook that also runs RevoScaleR, the problem does not appear.
I would appreciate any suggestions.
---------------------------------------------------------------------------
Here is an image of the directory, where it is apparent that the files exist:
[Image of the PROPENSITY folder with the xdf files]
Try using append = "rows". The last CSV is probably empty, resulting in overwriting the xdf with an empty xdf, which is effectively no file.
rx_CL_Generic_Retail_201506 <- rxImport(inData = "CL_Generic_Retail_201506_23-05-2017.csv",
                                        outFile = "CL_Generic_Retail_201506.xdf",
                                        overwrite = TRUE,
                                        append = "rows"
)

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x with Boto 2.x.)
I see there is an API for Node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be my least preferred option, to minimise the architecture overhead of adding EC2.
I need support in getting this feature implemented, with integration to Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I solved it by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them there, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 file system
    Returns:
        None
    '''
    s3 = s3Conn                          # existing boto S3 connection
    path = ''
    bucket = s3.lookup(srcBucket)
    for key in bucket:
        # download the object into the destination directory
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)

        # pick the right opener based on the archive extension
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')

        try:
            os.mkdir(dst_dir)
            print("local directories created")
        except Exception:
            logger_s3.warning("Exception in creating local directories to extract zip file / folder already exists")

        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            file = opener(path, mode)
            try:
                file.extractall()
            finally:
                file.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception as e:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance.
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly:
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (string): local file path(s)
        timeformat (string): datetime format string passed to str_to_date
    Returns:
        None
    '''
    for file in file_path:
        try:
            con = connect()
            cursor = con.cursor()
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # roll back in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
Lambda function:
You can use a Lambda function that reads the zipped files into a buffer, gzips the individual files, and re-uploads them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
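A minimal sketch of that Lambda approach (not the linked tutorial's exact code): read the uploaded zip into memory, gzip each member, and re-upload it alongside the archive. The event parsing follows the standard S3 notification payload; the key prefix handling and the deletion at the end are illustrative choices.

import gzip
import io
import zipfile

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    prefix = key.rsplit('/', 1)[0] + '/' if '/' in key else ''

    # Download the archive into a buffer (fine for archives that fit in memory).
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for member in archive.namelist():
            if member.endswith('/'):
                continue  # skip directory entries
            # Gzip each extracted member and upload it next to the archive.
            compressed = gzip.compress(archive.read(member))
            s3.put_object(Bucket=bucket,
                          Key=prefix + member + '.gz',
                          Body=compressed)

    # Optionally remove the original zip once everything is re-uploaded.
    s3.delete_object(Bucket=bucket, Key=key)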