scrapy export to JSON using FEED_URI parameter in custom_settings - scrapy

I'm trying to export scraped items to a JSON file. Here is the beginning of the code:
import scrapy
from os.path import expanduser
from datetime import datetime, timedelta

class GetId(scrapy.Spider):
    name = "get_id"
    path = expanduser("~").replace('\\', '/') + '/dox/Getaround/'
    days = [0, 1, 3, 7, 14, 21, 26, 31]
    dates = []
    previous_cars_list = []
    previous_cars_id_list = []
    crawled_date = datetime.today().date()
    for day in days:
        market_date = crawled_date + timedelta(days=day)
        dates.append(market_date)

    # Settings
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 5,
        'LOG_STDOUT': True,
        'LOG_FILE': path + 'log_get_id.txt',
        'FEED_FORMAT': 'json',
        'FEED_URI': path + 'cars_id.json',
    }
I did this 2 years ago without any issues. Now, when I run "scrapy crawl get_id" in the Anaconda console, only the log file is exported, not the JSON with the data. In the log file the following error arises:
2022-08-25 15:14:48 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: c
Any clue how to deal with this? Thanks

I'm not sure in which version it was introduced, but I always use the FEEDS setting, either in the settings.py file or via the custom_settings class attribute as you do in your example.
For example:
# Settings
custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 5,
    'CONCURRENT_REQUESTS': 1,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
    'LOG_STDOUT': True,
    'LOG_FILE': path + 'log_get_id.txt',
    'FEEDS': {  # <-- added this
        path + 'cars_id.json': {
            'format': 'json',
            'encoding': 'utf-8'
        }
    }
}
You can find out all of the possible fields that can be set at https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds
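As a side note, the "Unknown feed storage scheme: c" in your log most likely comes from the drive letter of the Windows path being interpreted as a URI scheme. Here is a minimal sketch (my own variation on the settings above, untested) that sidesteps this by handing FEEDS an explicit file:// URI built with pathlib:
from pathlib import Path

path = Path.home() / 'dox' / 'Getaround'   # same folder the question builds by hand

custom_settings = {
    # ... other settings unchanged ...
    'FEEDS': {
        # as_uri() yields file:///C:/Users/... on Windows, so the drive
        # letter can no longer be mistaken for a feed storage scheme
        (path / 'cars_id.json').as_uri(): {
            'format': 'json',
            'encoding': 'utf-8'
        }
    }
}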

Scrapy 2.1.0 (2020-04-24)
The FEED_FORMAT and FEED_URI settings have been deprecated in favor of
the new FEEDS setting (issue 1336, issue 3858, issue 4507).
https://docs.scrapy.org/en/latest/news.html?highlight=FEED_URI#id30
In this case there are two options. Either use an older version of Scrapy, e.g.:
pip uninstall Scrapy
pip install Scrapy==2.0
Or use FEEDS; you can read more about it here https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds

Related

Airflow BigQueryInsertJobOperator configuration

I'm having some issues converting from the deprecated BigQueryOperator to BigQueryInsertJobOperator. I have the below task:
bq_extract = BigQueryInsertJobOperator(
    dag="big_query_task",
    task_id='bq_query',
    gcp_conn_id='google_cloud_default',
    params={'data': Utils().querycontext},
    configuration={
        "query": {
            "query": "{% include 'sql/bigquery.sql' %}",
            "useLegacySql": False,
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {"datasetId": bq_dataset}
        }
    })
This line in my bigquery_extract.sql query is throwing the error:
{% for field in data.bq_fields %}
I want to use 'data' from params, which calls a method that reads from a .json file:
class Utils():
    bucket = Variable.get('s3_bucket')
    _qcontext = None

    @property
    def querycontext(self):
        if self._qcontext is None:
            self.load_querycontext()
        return self._qcontext

    def load_querycontext(self):
        with open(path.join(conf.get("core", "dags"), 'traffic/bq_query.json')) as f:
            self._qcontext = json.load(f)
The bq_query.json has this format, and I need to use the nested bq_fields list values:
{
    "bq_fields": [
        { "name": "CONCAT(ID, '-', CAST(ID AS STRING))", "alias": "new_id" },
        { "name": "TIMESTAMP(CAST(visitStartTime * 1000 AS INT64))", "alias": "new_timestamp" },
        { "name": "TO_JSON_STRING(hits.experiment)", "alias": "hit_experiment" }
    ]
}
This file has a list which I want to use in the above-mentioned query line, but it's throwing this error:
jinja2.exceptions.UndefinedError: 'data' is undefined
There are two issues with your code.
First "params" is not a supported field in BigQueryInsertJobOperator. See this post where I post how to pass params to sql file when using BigQueryInsertJobOperator. How do you pass variables with BigQueryInsertJobOperator in Airflow
Second, if you happen to get an error that your file cannot be found, make sure you set the full path of your file. I have had to do this when migrating from local testing to the cloud, even though file is in same directory. You can set the path in the dag config with example below(replace path with your path):
with DAG(
    ...
    template_searchpath='/opt/airflow/dags',
    ...
) as dag:
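Putting both points together, here is a hedged, self-contained sketch of how the pieces can fit. It uses user_defined_macros to make 'data' visible to the Jinja template, which is my own workaround rather than necessarily what the linked post describes; the dataset name, start date and the inline field list are placeholders standing in for the question's values:
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_dataset = 'my_dataset'  # placeholder for the question's bq_dataset

with DAG(
    dag_id='big_query_task',
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    # full path so "{% include 'sql/bigquery.sql' %}" can be resolved
    template_searchpath='/opt/airflow/dags',
    # expose `data` directly to Jinja so `{% for field in data.bq_fields %}` works;
    # a literal dict stands in for the question's Utils().querycontext
    user_defined_macros={
        'data': {
            'bq_fields': [
                {'name': 'TO_JSON_STRING(hits.experiment)', 'alias': 'hit_experiment'},
            ]
        }
    },
) as dag:
    bq_extract = BigQueryInsertJobOperator(
        task_id='bq_query',
        gcp_conn_id='google_cloud_default',
        configuration={
            'query': {
                'query': "{% include 'sql/bigquery.sql' %}",
                'useLegacySql': False,
                'writeDisposition': 'WRITE_TRUNCATE',
                'destinationTable': {'datasetId': bq_dataset},
            }
        },
    )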

Unexpected Token Import in cloud9 workspace linting when using dynamic imports

This is probably a pretty esoteric and specific issue to my workflow, so I don't know if anyone else has come across it in the past. I use an aws-cloud9 workspace to do development for my Vue application. I recently started using dynamic imports in my vue-router file to split up chunks and reduce initial file load size. In terms of the webpack compiler and running in the browser, it works great! However, cloud9's linter (which I believe is using eslint) is failing as soon as it gets to my first dynamic import with the error 'Parsing error: Unexpected token import'. I have a .eslintrc.js file in the directory of my project which looks like this:
// https://eslint.org/docs/user-guide/configuring
module.exports = {
  root: true,
  parser: "vue-eslint-parser",
  parserOptions: {
    parser: 'babel-eslint',
    ecmaVersion: 2018,
    'allowImportExportEverywhere': true
  },
  env: {
    browser: true
  },
  extends: [
    // https://github.com/vuejs/eslint-plugin-vue#priority-a-essential-error-prevention
    // consider switching to `plugin:vue/strongly-recommended` or `plugin:vue/recommended` for stricter rules.
    'plugin:vue/essential',
    // https://github.com/standard/standard/blob/master/docs/RULES-en.md
    'standard'
  ],
  // required to lint *.vue files
  plugins: [
    'vue',
    'babel'
  ],
  // add your custom rules here
  rules: {
    // allow async-await
    'generator-star-spacing': 'off',
    // allow debugger during development
    'no-debugger': process.env.NODE_ENV === 'production' ? 'error' : 'off',
    'space-before-function-paren': 0,
    'semi': [1, 'always'],
    'quotes': 0,
    'no-tabs': 0,
    'allowImportExportEverywhere': true,
    'no-mixed-spaces-and-tabs': 0
  }
};
Other issues have mentioned edits to the eslintrc file to fix the problem. Changing the eslintrc file in my project changes what errors show up at compile time, but the aws-cloud9 IDE still presents the error in the gutter.

Airflow won't write logs to s3

I tried different ways to configure Airflow 1.9 to write logs to S3; however, it just ignores it. I found a lot of people having problems reading the logs after doing so, but my problem is that the logs remain local. I can read them without problem, but they are not in the specified S3 bucket.
What I tried first was writing the following into the airflow.cfg file:
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply an Airflow connection id that provides access to the storage
# location.
remote_base_log_folder = s3://bucketname/logs
remote_log_conn_id = aws
encrypt_s3_logs = False
Then I tried to set environment variables
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucketname/logs
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=aws
AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
However, it gets ignored and the log files remain local.
I run Airflow from a container; I adapted https://github.com/puckel/docker-airflow to my case, but it won't write logs to S3. I use the aws connection to write to buckets in DAGs and this works, but the logs just remain local, no matter whether I run it on an EC2 instance or locally on my machine.
I finally found an answer using a StackOverflow answer that covered most of the work; I then had to add one more step. I reproduce that answer here and adapt it a bit to the way I did it:
Some things to check:
Make sure you have the log_config.py file and it is in the correct dir: ./config/log_config.py.
Make sure you didn't forget the __init__.py file in that dir.
Make sure you defined the s3.task handler and set its formatter to airflow.task
Make sure you set airflow.task and airflow.task_runner handlers to s3.task
Set task_log_reader = s3.task in airflow.cfg
Pass the S3_LOG_FOLDER to log_config. I did that using a variable and retrieving it as in the following log_config.py.
Here is a log_config.py that works:
import os
from airflow import configuration as conf

LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')

BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')

FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'

S3_LOG_FOLDER = conf.get('core', 'S3_LOG_FOLDER')

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {
            'format': LOG_FORMAT,
        },
        'airflow.processor': {
            'format': LOG_FORMAT,
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
            'stream': 'ext://sys.stdout'
        },
        'file.task': {
            'class': 'airflow.utils.log.file_task_handler.FileTaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            'filename_template': FILENAME_TEMPLATE,
        },
        'file.processor': {
            'class': 'airflow.utils.log.file_processor_handler.FileProcessorHandler',
            'formatter': 'airflow.processor',
            'base_log_folder': os.path.expanduser(PROCESSOR_LOG_FOLDER),
            'filename_template': PROCESSOR_FILENAME_TEMPLATE,
        },
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': S3_LOG_FOLDER,
            'filename_template': FILENAME_TEMPLATE,
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': LOG_LEVEL
        },
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.processor': {
            'handlers': ['file.processor'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
    }
}
Note that this way S3_LOG_FOLDER can be specified in airflow.cfg or as the environment variable AIRFLOW__CORE__S3_LOG_FOLDER.
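For example (with a hypothetical bucket name), either of the following should end up in conf.get('core', 'S3_LOG_FOLDER'):
# in airflow.cfg, under [core]
s3_log_folder = s3://bucketname/logs
# or as an environment variable
AIRFLOW__CORE__S3_LOG_FOLDER=s3://bucketname/logs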
One more thing that leads to this behavior (Airflow 1.10):
If you look at airflow.utils.log.s3_task_handler.S3TaskHandler, you'll notice that there are a few conditions under which the logs, silently, will not be written to S3:
1) The logger instance is already close()d (not sure how this happens in practice)
2) The log file does not exist on the local disk (this is how I got to this point)
You'll also notice that the logger runs in a multiprocessing/multithreading environment, and that Airflow's S3TaskHandler and FileTaskHandler do some very no-no things with the filesystem. If the assumptions about log files on disk are not met, S3 log files will not be written, and nothing is logged or thrown about this event. If you have specific, well-defined needs in logging, it might be a good idea to implement all your own logging Handlers (see the python logging docs) and disable all Airflow log handlers (see Airflow UPDATING.md).
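If you do go down the custom-handler route, a minimal sketch of the idea (plain boto3 and the standard logging module, no Airflow machinery; the bucket and key are made-up names) could look like this:
import logging

import boto3


class SimpleS3Handler(logging.Handler):
    """Buffer formatted records in memory and upload them to S3 on close()."""

    def __init__(self, bucket, key):
        super().__init__()
        self.bucket = bucket
        self.key = key
        self.lines = []

    def emit(self, record):
        # format() applies whatever formatter was attached to this handler
        self.lines.append(self.format(record))

    def close(self):
        if self.lines:
            body = '\n'.join(self.lines).encode('utf-8')
            boto3.client('s3').put_object(Bucket=self.bucket, Key=self.key, Body=body)
        super().close()


# usage: attach it to whichever logger you want mirrored to S3
logging.getLogger('my_task').addHandler(SimpleS3Handler('bucketname', 'logs/my_task.log'))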
One more thing that may lead to this behaviour: botocore may not be installed. Make sure when installing Airflow to include the s3 package:
pip install apache-airflow[s3]
In case this helps someone else, here is what worked for me, answered in a similar post: https://stackoverflow.com/a/73652781/4187360

Airflow 1.9 logging to s3, Log files write to S3 but can't read from UI

I've been looking through various answers on this topic but haven't been able to get a working solution.
I have Airflow set up to log to S3, but the UI seems to only use the file-based task handler instead of the S3 one specified.
I have the S3 connection set up as follows:
Conn_id = my_conn_S3
Conn_type = S3
Extra = {"region_name": "us-east-1"}
(the ECS instance uses a role that has full S3 permissions)
I have also created a log_config file with the following settings:
remote_log_conn_id = my_conn_S3
encrypt_s3_logs = False
logging_config_class = log_config.LOGGING_CONFIG
task_log_reader = s3.task
And in my log config I have the following setup
LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')

BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')

FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'

S3_LOG_FOLDER = 's3://data-team-airflow-logs/airflow-master-tester/'

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {
            'format': LOG_FORMAT,
        },
        'airflow.processor': {
            'format': LOG_FORMAT,
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
            'stream': 'ext://sys.stdout'
        },
        'file.processor': {
            'class': 'airflow.utils.log.file_processor_handler.FileProcessorHandler',
            'formatter': 'airflow.processor',
            'base_log_folder': os.path.expanduser(PROCESSOR_LOG_FOLDER),
            'filename_template': PROCESSOR_FILENAME_TEMPLATE,
        },
        # When using s3 or gcs, provide a customized LOGGING_CONFIG
        # in airflow_local_settings within your PYTHONPATH, see UPDATING.md
        # for details
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': S3_LOG_FOLDER,
            'filename_template': FILENAME_TEMPLATE,
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': LOG_LEVEL
        },
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.processor': {
            'handlers': ['file.processor'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
    }
}
I can see the logs on S3 but when I navigate to the UI logs all I get is
*** Log file isn't local.
*** Fetching here: http://1eb84d89b723:8793/log/hermes_pull_double_click_click/hermes_pull_double_click_click/2018-02-26T11:22:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='1eb84d89b723', port=8793): Max retries exceeded with url: /log/hermes_pull_double_click_click/hermes_pull_double_click_click/2018-02-26T11:22:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe6940fc048>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I can see in the logs that it's successfully importing the log_config.py (I included an __init__.py as well).
I can't see why it's using the FileTaskHandler here instead of the S3 one.
Any help would be great, thanks
In my scenario it wasn't airflow that was at fault here.
I was able to go to the gitter channel and talk to the guys there.
After putting print statements into the Python code that was running, I was able to catch an exception on this line of code:
https://github.com/apache/incubator-airflow/blob/4ce4faaeae7a76d97defcf9a9d3304ac9d78b9bd/airflow/utils/log/s3_task_handler.py#L119
The exception was a recursion max depth issue on the SSLContext, which, after looking around on the web, seemed to be coming from using some combination of gevent with gunicorn.
https://github.com/gevent/gevent/issues/903
I switched this back to sync and had to change the AWS ELB listener to TCP, but after that the logs were working fine through the UI.
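For reference, the switch back to sync is (as far as I recall) the gunicorn worker class used by the Airflow webserver, configured in airflow.cfg:
[webserver]
worker_class = sync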
Hope this helps others.

Cannot run unit tests on modules with dependencies on dojo 1.x using the Intern

We are just starting out with getting some unit tests running in the Intern on our dojo-based project.
What happens is that when the intern tries to load the module under test's dependencies we get the following error:
/<path/to/dev/folder>/app/node_modules/intern/node_modules/dojo/dojo.js:406
match = mid.match(/^(.+?)\!(.*)$/);
^
TypeError: Cannot read property 'match' of null
    at getModule (/<path/to/dev/folder>/app/node_modules/intern/node_modules/dojo/dojo.js:406:15)
    at mix.amd.vendor (/<path/to/dev/folder>/app/node_modules/intern/node_modules/dojo/dojo.js:832:17)
    at /<path/to/dev/folder>/app/src/simplebuilding/model/ModelError.js:10:1
at exports.runInThisContext (vm.js:74:17)
at Object.vm.runInThisContext (/<path/to/dev/folder>/app/node_modules/intern/node_modules/istanbul/lib/hook.js:163:16)
at /<path/to/dev/folder>/app/node_modules/intern/node_modules/dojo/dojo.js:762:8
at fs.js:334:14
at FSReqWrap.oncomplete (fs.js:95:15)
Here is my config file - I started by copying the example one, and adding the map section to the loader.
define({
    proxyPort: 9000,
    proxyUrl: 'http://localhost:9000/',

    capabilities: {
        'selenium-version': '2.41.0'
    },

    environments: [
        { browserName: 'chrome', version: '40', platform: [ 'OS X' ] }
    ],

    maxConcurrency: 3,
    tunnel: 'NullTunnel',

    loader: {
        // Packages that should be registered with the loader in each testing environment
        packages: [
            { name: 'dojo', location: 'src/dojo' },
            { name: 'dojox', location: 'src/dojox' },
            { name: 'dijit', location: 'src/dijit' },
            { name: 'app', location: 'src/app' },
            { name: 'tests', location: 'tests' }
        ],
        map: {
            '*': {
                'dojo': 'dojo'
            },
            app: {
                'dojo': 'dojo'
            },
            intern: {
                'dojo': 'node_modules/intern/node_modules/dojo'
            },
            'tests': {
                'dojo': 'dojo'
            }
        }
    },

    suites: [ 'tests/model/modelerror' ],
    functionalSuites: [ /* 'myPackage/tests/functional' */ ],

    excludeInstrumentation: /^(?:tests|test\-explore|node_modules)\//
});
The file under test has dependencies on dojo/_base/declare, dojo/_base/lang, and dojo/Stateful, and that is about it.
I created a dummy class to test where there were no dojo dependencies and it runs fine.
I've tried switching the loader to be the local dojo 1.10.3 version we have in our project, and that throws entirely different errors about not being able to find the intern (even if I give it a package definition in the config). Those errors look like this:
{ [Error: ENOENT, no such file or directory '/<path/to/dev/folder>/app/node_modules/.bin/main.js']
errno: -2,
code: 'ENOENT',
path: '/<path/to/dev/folder>/app/node_modules/.bin/main.js',
syscall: 'open' }
Our project structure is pretty straightforward:
root
|--src
|  |--dojo (dijit/dojox/dgrid/etc)
|  |--app
|--tests
|--intern.js (config file)
I've tried several variations besides changing the loader, like trying to make sure the base path is correct. I've tried running it in Node 0.10.36 and 0.12.2. But every time I debug this with node-inspector, when it gets to loading the module for my file under test, the mid is null. Jumping back up the stack trace everything looks fine, but something is lost in the vm.runInThisContext() call, and the mid disappears by the time getModule() is called.
Any help is appreciated - Thanks!
So I figured this out: we had modules in our project that used an old style of the define() function. We had moved from the old define('my.module.namespace', ['deps'], function(deps){ ... }); to replacing the dotted namespace in the first argument with null. We were doing this as a transitional phase toward removing that argument completely, but had never finished that transition. This was causing the dojo 2 loader to think the "id" of the module was null, which in turn caused the loader to not find a module ID.
This was a completely silly mistake on our part, and this will help us modernize to the updated signature for future-dojo-readiness.