How to use a wildcard character to search for an S3 file using S3KeySensor in Airflow - amazon-s3

Hello, I am using S3KeySensor to look for parquet files created in a specific partition. Since the file names are generated by Spark (e.g. part-00499-e91c1af8-4352-4de9*), what should the bucket_key be? The code below is failing.
ex: bucket_key=f"inbound/phix/empnf/datasetdate={var_ds_date}/*.parquet"
s3_data_filechk = S3KeySensor(
    task_id='s3_data_filechk',
    bucket_name=data_bucket_name,
    bucket_key=f"inbound/phix/empnf/datasetdate={var_ds_date}/*.parquet/",
    timeout=60 * 30,      # timeout in 30 minutes
    poke_interval=60 * 5  # (seconds); checking file every five minutes
)

S3KeySensor has a wildcard_match parameter, which controls whether bucket_key is interpreted as a Unix wildcard pattern. Set it to True to get the matching behaviour you want.
Example:
sensor = S3KeySensor(
    ...,
    bucket_key="*.parquet",
    wildcard_match=True
)
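Applied to the partition path from the question, a minimal sketch could look like the following. The bucket, path, and variable names are taken from the question, and the import path assumes a recent Amazon provider package:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Wait for any parquet part file written by Spark into the partition folder.
s3_data_filechk = S3KeySensor(
    task_id='s3_data_filechk',
    bucket_name=data_bucket_name,
    # No trailing slash: the pattern must match the object keys themselves.
    bucket_key=f"inbound/phix/empnf/datasetdate={var_ds_date}/*.parquet",
    wildcard_match=True,   # interpret bucket_key as a Unix wildcard pattern
    timeout=60 * 30,
    poke_interval=60 * 5,
)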

Related

Deploy sql workflow with DBX

I am developing a deployment via DBX to Azure Databricks. In this regard I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do it with a Python file. There I would do the following:
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "workflow-name"
        schedule:
          quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
          timezone_id: "Europe"
        format: MULTI_TASK
        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster
        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"
But how can I change it so that it runs a SQL job instead? I imagine it is the last two lines (spark_python_task: and python_file: "file://filename.py") that need to be changed.
There are various ways to do that.
(1) One of the simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task, as described here.
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python package and then call it from your package, e.g. (a fuller sketch follows this list of options):
sql_statement = ... # code to read from the file
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.
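For option (2), a minimal sketch, assuming the SQL file is shipped inside a hypothetical package my_package under sql/data.sql and the cluster runs Python 3.9+ (for importlib.resources.files):

from importlib.resources import files
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the packaged SQL file (package name and path are assumptions for illustration).
sql_statement = files("my_package").joinpath("sql/data.sql").read_text()

# spark.sql() runs one statement at a time, so split on semicolons and skip empty chunks.
for statement in filter(None, (s.strip() for s in sql_statement.split(";"))):
    spark.sql(statement)

The entry point calling this code can then still be deployed with the same spark_python_task.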
I found a simple workaround (although it might not be the prettiest): simply change data.sql to a Python file and run the queries using Spark. This way I could use the same spark_python_task.

Jmeter non GUI mode csv report not showing latency

I am trying to get the JMeter HTML report for a file transfer over the SFTP protocol.
I am using the SSH SFTP Protocol plugin and added a Simple Data Writer to that thread group.
I have created my own SFTP server using Apache MINA. The JMeter script hits the server I created and uploads the file.
Script Parameters:
Thread Group - 250
Ramp up period - 50
Loop Count - 1
After running the script in non-GUI mode as nohup sh jmeter.sh -n -t Singlepart_MultipleThread_RampUp.jmx -l Singlepart_MultipleThread_RampUp.jtl, I do get a CSV generated, which I convert into an HTML report with the command jmeter -g <csv> -o <destination_folder>.
The HTML report shows Latency vs Time and Latency vs Request as zero, and even the CSV report shows the latency column as zero.
Below is my user.properties file
user.properties
# Latencies Over Time graph definition
jmeter.reportgenerator.graph.latenciesOverTime.classname=org.apache.jmeter.report.processor.graph.impl.LatencyOverTimeGraphConsumer
jmeter.reportgenerator.graph.latenciesOverTime.title=Latencies Over Time
jmeter.reportgenerator.graph.latenciesOverTime.property.set_granularity=${jmeter.reportgenerator.overall_granularity}
# Latencies Vs Request graph definition
jmeter.reportgenerator.graph.latencyVsRequest.classname=org.apache.jmeter.report.processor.graph.impl.LatencyVSRequestGraphConsumer
jmeter.reportgenerator.graph.latencyVsRequest.title=Latencies Vs Request
jmeter.reportgenerator.graph.latencyVsRequest.exclude_controllers=true
jmeter.reportgenerator.graph.latencyVsRequest.property.set_granularity=${jmeter.reportgenerator.overall_granularity}
jmeter.properties
#---------------------------------------------------------------------------
# Results file configuration
#---------------------------------------------------------------------------
# This section helps determine how result data will be saved.
# The commented out values are the defaults.
# legitimate values: xml, csv, db. Only xml and csv are currently supported.
jmeter.save.saveservice.output_format=csv
# The below properties are true when field should be saved; false otherwise
#
# assertion_results_failure_message only affects CSV output
jmeter.save.saveservice.assertion_results_failure_message=true
#
# legitimate values: none, first, all
jmeter.save.saveservice.assertion_results=all
#
jmeter.save.saveservice.data_type=true
jmeter.save.saveservice.label=true
jmeter.save.saveservice.response_code=true
# response_data is not currently supported for CSV output
jmeter.save.saveservice.response_data=true
# Save ResponseData for failed samples
jmeter.save.saveservice.response_data.on_error=false
jmeter.save.saveservice.response_message=true
jmeter.save.saveservice.successful=true
jmeter.save.saveservice.thread_name=true
jmeter.save.saveservice.time=true
jmeter.save.saveservice.subresults=true
jmeter.save.saveservice.assertions=true
jmeter.save.saveservice.latency=true
# Only available with HttpClient4
#jmeter.save.saveservice.connect_time=true
jmeter.save.saveservice.samplerData=true
#jmeter.save.saveservice.responseHeaders=false
#jmeter.save.saveservice.requestHeaders=false
#jmeter.save.saveservice.encoding=false
jmeter.save.saveservice.bytes=true
# Only available with HttpClient4
jmeter.save.saveservice.sent_bytes=true
jmeter.save.saveservice.url=true
jmeter.save.saveservice.filename=false
jmeter.save.saveservice.hostname=false
jmeter.save.saveservice.thread_counts=true
jmeter.save.saveservice.sample_count=false
jmeter.save.saveservice.idle_time=true
# Timestamp format - this only affects CSV output files
# legitimate values: none, ms, or a format suitable for SimpleDateFormat
#jmeter.save.saveservice.timestamp_format=ms
#jmeter.save.saveservice.timestamp_format=yyyy/MM/dd HH:mm:ss.SSS
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
# Default:
#jmeter.save.saveservice.default_delimiter=,
# For TAB, one can use:
#jmeter.save.saveservice.default_delimiter=\t
# Only applies to CSV format files:
# Print field names as first line in CSV
#jmeter.save.saveservice.print_field_names=true
# Optional list of JMeter variable names whose values are to be saved in the result data files.
# Use commas to separate the names. For example:
#sample_variables=SESSION_ID,REFERENCE
# N.B. The current implementation saves the values in XML as attributes,
# so the names must be valid XML names.
# By default JMeter sends the variable to all servers
# to ensure that the correct data is available at the client.
# Optional xml processing instruction for line 2 of the file:
# Example:
#jmeter.save.saveservice.xml_pi=<?xml-stylesheet type="text/xsl" href="../extras/jmeter-results-detail-report.xsl"?>
# Default value:
#jmeter.save.saveservice.xml_pi=
# Prefix used to identify filenames that are relative to the current base
#jmeter.save.saveservice.base_prefix=~/
# AutoFlush on each line written in XML or CSV output
# Setting this to true will result in less test results data loss in case of Crash
# but with impact on performances, particularly for intensive tests (low or no pauses)
# Since JMeter 2.10, this is false by default
#jmeter.save.saveservice.autoflush=false
So basically I am facing issues in two places:
How do I get the latency value?
When I provide a ramp-up value of 1, the script with Thread Group = 50 takes around 16 seconds to complete the upload, whereas if I give a ramp-up other than 1, such as 10, the script ends after exactly 10 seconds, irrespective of whether the file gets uploaded or not, and produces vague results in the HTML report as well.
Any idea how to solve this? Or do I need to do anything else in the script?
You cannot, as the plugin you're using doesn't call the SampleResult.setLatency() function anywhere.
Theoretically it should be possible to request the functionality from the plugin developers.
Setting a 10-second ramp-up period for 50 virtual users means that JMeter starts with 1 virtual user and gradually increases the load to 50 within 10 seconds. Make sure to have enough loops defined in the Thread Group, as you may run into the situation where the 1st user has already finished uploading the file and was terminated while the 2nd hasn't even started, so you have a maximum concurrency of 1 user (this can be checked using the Active Threads Over Time listener). See JMeter Test Results: Why the Actual Users Number is Lower than Expected for a more detailed explanation if needed.

How to change the max size for file upload on AOLServer/CentOS 6?

We have a portal for our customers that allows them to start new projects directly on our platform. The problem is that we cannot upload documents bigger than 10 MB.
Every time I try to upload a file bigger than 10 MB, I get a "The connection was reset" error. After some research it seems that I need to change the maximum upload size, but I don't know where to do it.
I'm on CentOS 6.4/RedHat with AOLserver.
Language: TCL.
Does anyone have an idea of how to do this?
EDIT
In the end I could solve the problem with the command ns_limits set default -maxupload 500000000.
In your config.tcl, add the following to the nssock module section:
set max_file_upload_mb 25
# ...
ns_section ns/server/${server}/module/nssock
# ...
ns_param maxinput [expr {$max_file_upload_mb * 1024 * 1024}]
# ...
It is also advisable to constrain the upload time by setting:
set max_file_upload_min 5
# ...
ns_section ns/server/${server}/module/nssock
# ...
ns_param recvwait [expr {$max_file_upload_min * 60}]
If running on top of nsopenssl, you will have to set those configuration values (maxinput, recvwait) in a different section.
I see that you are running Project Open. As well as setting the maxinput value for AOLserver, as described by mrcalvin, you also need to set 2 parameters in the Site Map:
Attachments package: parameter "MaximumFileSize"
File Storage package: parameter "MaximumFileSize"
These should be set to values in bytes, but not larger than the maxinput value for AOLserver. See the Project Open documentation for more info.
In the case where you are running Project Open using a reverse proxy, check the documentation here for Pound and here for Nginx. Most likely you will need to set a larger file upload limit there too.

Uploading job fails on the same file that was uploaded successfully before

I'm running a regular upload job to load CSV into BigQuery. The job runs every hour. According to the recent failure log, it says:
Error: [REASON] invalid [MESSAGE] Invalid argument: service.geotab.com [LOCATION] File: 0 / Offset:268436098 / Line:218637 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I went to line 218638 (the original CSV has a header line, so I assume 218638 should be the actual failed line; let me know if I'm wrong) but it seems all right. I checked the corresponding table in BigQuery, and it has that line too, which means I actually uploaded this line successfully before.
Then why is it causing failures now?
project id: red-road-574
Job ID: Job_Upload-7EDCB180-2A2E-492B-9143-BEFFB36E5BB5
This indicates that there was a problem with the data in your file, where it didn't match the schema.
The error message says it occurred at File: 0 / Offset:268436098 / Line:218637 / Field:2. This means the first file (it looks like you just had one), and then the chunk of the file starting at 268436098 bytes from the beginning of the file, then the 218637th line from that file offset.
The reason for the offset portion is that BigQuery processes large files in parallel across multiple workers. Each file worker starts at an offset from the beginning of the file. The offset we include is the offset that the worker started from.
From the rest of the error message, it looks like the string service.geotab.com showed up in the second field, but the second field was a number, and service.geotab.com isn't a valid number. Perhaps there was a stray newline?
You can see what the lines looked like around the error by doing:
cat <yourfile> | tail -c +268436098 | tail -n +218636 | head -3
This will print out three lines... the one before the error (since I used -n +218636 instead of +218637), the one that had the error, and the next line as well.
Note that if this is just one line in the file that has a problem, you may be able to work around the issue by specifying maxBadRecords.
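For illustration, here is a minimal sketch of setting that option with today's google-cloud-bigquery Python client; the URI, table name, and the choice of tolerating exactly one bad record are placeholders, not values from the question:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the CSV has a header line
    max_bad_records=1,     # tolerate a single malformed row instead of failing the whole job
)

# Source URI and destination table are hypothetical.
load_job = client.load_table_from_uri(
    "gs://your-bucket/your-file.csv",
    "your-project.your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish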

Issue when loading data from cloud storage, at least an error message improvement is needed

When I try to load multiple files from Cloud Storage, larger jobs almost always fail. When I try to load an individual file it works, but loading batches is really much more convenient.
Snippet:
Recent Jobs
Load 11:24am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz to albertbigquery:uep.201409
Load 11:23am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz to albertbigquery:uep.201409
Load 11:22am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5C���>}�>
File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����W�o�>
File: 40 / Line:4, Too few columns: expected 7 column(s) but got 2 column(s). For additional help:
File: 40 / Line:5, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:6, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:7, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst part of this problem is that I don't know which file is "File: 40"; the order seems random, otherwise I could remove that file and load the data, or try to find the error in the file.
I also strongly doubt that there even is an actual file error. For example, in the above case, when I removed all files but _0001 and _0002 (which loaded fine as single files) I still get this output:
Recent Jobs
Load 11:44am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
File: 1 / Line:3, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files load just fine; otherwise I'd expect multiple-file loading to be completely broken.
Info:
The average file size is around 20 MB; a directory is usually around 70 files totalling somewhere between 1 and 2 GB.
It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (i.e. gs://foo/bar*), we first expand the pattern to the list of files. Then we read the first one to determine the compression type.
One oddity with GCS is that there isn't a real concept of a directory. That is, gs://foo/bar/baz.csv is really bucket: 'foo', object: 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with the BigQuery probe-for-compression logic: when we expand the file pattern, the directory dummy file is the first thing returned. We then open the dummy file, and since it doesn't appear to be a gzip file, we assume the compression type of the entire load is uncompressed.
We've filed a bug and have a fix under testing; hopefully it will be out next week. In the meantime, your options are to expand the pattern yourself, to use a longer pattern that won't match the directory (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or to delete the dummy directory file.
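If you choose to expand the pattern yourself, a minimal sketch with the google-cloud-storage Python client is shown below; the bucket and prefix are taken from the question, while the zero-size and .log.gz filters are assumptions about how to skip the placeholder object:

from google.cloud import storage

client = storage.Client()

# List everything under the "directory" prefix from the question.
blobs = client.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/")

# Keep only real gzip objects, skipping the zero-byte placeholder that stands in for the directory.
uris = [
    f"gs://albertbigquery.appspot.com/{b.name}"
    for b in blobs
    if b.size > 0 and b.name.endswith(".log.gz")
]

print(uris)  # pass this explicit list of URIs to the load job instead of the wildcard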