Typo3 9.5: Crawl pages with front end login - typo3-9.x

I took over work on an intranet based on TYPO3 v9.5.23. It uses indexed_search v9.5.23 and crawler v9.1.5 to crawl the pages, but in the database table tx_crawler_queue the column result_data contains {"content":"\"403 Access denied\""} for every page.
The pages are only visible after logging in as a front-end user. What do I have to do to crawl those pages?
I'm running the following commands on the console:
vendor/bin/typo3 crawler:flushQueue all
vendor/bin/typo3 crawler:buildQueue 69 intranet --depth=2
vendor/bin/typo3 crawler:processQueue

The crawler configuration has a field "Crawl with FE user groups".
Since every FE login requires an FE user group, select in that field the groups whose pages the crawler should be able to access.


Auth0: How to retrieve over 1000 users (and make this call via a Python script that can be run as a cron job)

I am trying to use Auth0 to get a list of users when my user list is >1000 (approx 2000)
So I understand a bit better now how this works after following the steps at:
There are three steps:
Use a POST call to the https://MY_DOMAIN/oauth/token endpoint to get an auth token (done)
Then take this token and insert it into the next POST call to the endpoint: https://MY_DOMAIN/api/v2/jobs/users-exports
Then take the job_id and insert it into the 3rd GET call to the endpoint: https://MY_DOMAIN/api/v2/jobs/MY_JOB_ID
But this just gives me a link to a document that I have to download. Essentially, this is the same end result as using the User Import / Export extension.
This is NOT what I want. I want to be able to call an endpoint and have it return a list of all the users (similar to the Retrieve Users with the Get Users Endpoint). I need it done this way so that I can write a Python script and run it as a cron job.
However, since I have over 1000 users, I get the error below when I call the GET /api/v2/users endpoint.
auth0.v3.exceptions.Auth0Error: 400: You can only page through the first 1000 records. See https://auth0.com/docs/users/search/v3/view-search-results-by-page#limitation
Can anyone help? Can this be done the way I want it to be?
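For what it's worth, the three steps above can be chained in one script and the downloaded file parsed into a list, so a cron job still ends up with a plain list of users. A minimal sketch, not a definitive implementation: it assumes the export is requested with format json and arrives as a gzip-compressed file with one JSON object per line, and that the job status response carries status and location fields. export_users, urllib_http and the injectable http parameter are hypothetical names for this example, not part of any Auth0 SDK.

```python
import gzip
import json
import time
import urllib.request


def urllib_http(method, url, headers, body=None):
    """Default transport: perform one HTTP request and return the raw body."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def export_users(domain, client_id, client_secret,
                 http=urllib_http, poll_seconds=5.0, max_polls=60):
    """Run the three calls from the question and return the users as a list."""
    base = f"https://{domain}"
    json_headers = {"Content-Type": "application/json"}

    # Step 1: get a token via POST /oauth/token (client credentials assumed).
    token_body = json.dumps({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "audience": f"{base}/api/v2/",
    }).encode()
    token = json.loads(
        http("POST", f"{base}/oauth/token", json_headers, token_body)
    )["access_token"]
    auth = {**json_headers, "Authorization": f"Bearer {token}"}

    # Step 2: start the export job via POST /api/v2/jobs/users-exports.
    job = json.loads(http("POST", f"{base}/api/v2/jobs/users-exports", auth,
                          json.dumps({"format": "json"}).encode()))

    # Step 3: poll GET /api/v2/jobs/{id} until the job reports completion.
    for _ in range(max_polls):
        status = json.loads(
            http("GET", f"{base}/api/v2/jobs/{job['id']}", auth, None))
        if status.get("status") == "completed":
            break
        if status.get("status") == "failed":
            raise RuntimeError(f"export job failed: {status}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("export job did not complete in time")

    # Download the file behind the link and parse one JSON object per line
    # (assumed layout of the export file).
    raw = http("GET", status["location"], {}, None)
    lines = gzip.decompress(raw).decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Called as export_users("MY_DOMAIN", client_id, client_secret), this returns a list of dicts. The http parameter only exists so the flow can be exercised without network access.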

How to get a Workday worker / employee web profile URL?

I wish to retrieve a Workday worker (aka employee) web profile URL via the Workday API. The use case is that I'm building a chatbot to retrieve user information and I want to be able to deep link to the worker (employee) web profile.
The issue is that I cannot do either of the following:
get a web profile URL from the API
create a web profile URL from data in the API
A web profile URL looks like the following. The userId appears to be the number (e.g. 1234) right before the .htmld extension, as that is the only number that changes between employee profiles.
A search URL in the web UI returns a slightly different URL but has the same numerical userId at the end, e.g. the 1234 before .htmld here:
A worker API call looks like the following, with a 32-character hexadecimal workerId such as deadbeefdeadbeefdeadbeefdeadbeef. Searching for the API workerId in the web UI returns no results.
The API result contains neither the web profile userId (e.g. 1234) anywhere nor a URL that can render a web page.
"descriptor":"Joe Cool",
"descriptor":"Santa Rosa, California",
"descriptor":"Peanuts (Charles 'Sparky' Schulz)",
Can anyone help provide info on how to get a web profile URL from the Workday API?
The ID returned from Workday's API is actually the Workday ID, not the Worker ID. The Workday ID, or WID, is a direct reference to any object in Workday. It is often referred to as an "Integration ID". Workday doesn't document this very well, but Workday's URLs do have an interesting property you can take advantage of for deep linking to any Workday object:
As long as you have the Workday ID (WID) of an object, you can deeplink directly. The sourceReferenceWID is just for logging purposes, so you can enter any text you want. I tested this in my own tenant with the text "deeplink" replacing {sourceReferenceWID} just for fun. For your example, the following URL should work for Joe Cool:
This is not officially documented, so Workday may change how this works and your mileage may vary.
It's not a delivered REST API, but you could create a RaaS with the Business Object "Worker from Prompt". There is a field called "Worker Instance URL". When you call the endpoint, you can use the WID (Workday ID), the Employee_ID, or Contingent_Worker_ID for the filter.
https://wd2-impl-services1.workday.com/ccx/service/customreport2/{tenant}/{report owner}/{report name}?Worker!WID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
https://wd2-impl-services1.workday.com/ccx/service/customreport2/{tenant}/{report owner}/{report name}?Worker!Employee_ID=x
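If the report URL is assembled in code, the two patterns above differ only in the prompt name. A small sketch of building it, assuming the URL layout shown above; raas_url and its parameter names are made up for this example, and the host, tenant, report owner and report name are whatever applies to your tenant.

```python
from urllib.parse import quote, urlencode


def raas_url(host, tenant, report_owner, report_name, prompt, value):
    """Build a custom report URL filtered by one Worker prompt.

    prompt is the part after "Worker!", e.g. "WID" or "Employee_ID".
    """
    # Escape each path segment on its own so unusual characters in an
    # owner or report name cannot break the path.
    path = "/".join(quote(part, safe="")
                    for part in (tenant, report_owner, report_name))
    # Keep "!" literal in the query, matching the URL patterns above.
    query = urlencode({f"Worker!{prompt}": value}, safe="!")
    return f"https://{host}/ccx/service/customreport2/{path}?{query}"


# Example with placeholder tenant, owner and report names.
print(raas_url("wd2-impl-services1.workday.com", "acme", "report_owner",
               "Worker_URL", "WID", "deadbeefdeadbeefdeadbeefdeadbeef"))
```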

Does Import.io api support status of the extractor?

I've just created an extractor with import.io. This extractor uses chaining: first I extract some URLs from one page, and then I extract the detail pages from those URLs. When the extraction of the detail pages finishes, I want to get the results. But how can I be sure that the extraction is completed? Is there an API endpoint for checking the status of an extraction?
I found the "GET /store/connector/{id}" endpoint in the legacy API, but when I try it I get a 404. You can take a look at the screenshot.
Another question is, I want to schedule my extractor twice a day. Is this possible?
Associated with each Extractor are Crawl Runs. A crawl run represents the running of an extractor with a specific configuration (training, list of URLs, etc.). The state of a crawl run can have one of the following values:
STARTED => Currently running
CANCELLED => Started but cancelled by the user
FINISHED => Run was complete
Additional metadata that is included is as follows:
Started At - When the run started
Stopped At - When the run finished
Total URL Count - Total number of URLs in the run
Success URL Count - # of successful URLs queried
Failed URL Count - # of failed URLs queried
Row Count - Total number of rows returned in the run
The REST API call to get the list of crawl runs associated with an extractor is as follows:
curl -s -X GET "https://store.import.io/store/crawlrun/_search?_sort=_meta.creationTimestamp&_page=1&_perPage=30&extractorId=$EXTRACTOR_ID&_apikey=$IMPORT_IO_API_KEY"
$EXTRACTOR_ID - ID of the extractor whose crawl runs should be listed
$IMPORT_IO_API_KEY - Import.io API key from your account
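When polling for completion from a script, the same request can be built in Python. A rough sketch that uses only the query parameters from the curl call and the states listed above; crawlrun_search_url and run_is_over are hypothetical helper names, and since the shape of the JSON response is not shown here, pulling the state out of the response is left to the caller.

```python
from urllib.parse import urlencode

STORE_SEARCH = "https://store.import.io/store/crawlrun/_search"


def crawlrun_search_url(extractor_id, api_key, page=1, per_page=30):
    """Build the crawl run search URL with the parameters from the curl call."""
    params = {
        "_sort": "_meta.creationTimestamp",
        "_page": page,
        "_perPage": per_page,
        "extractorId": extractor_id,
        "_apikey": api_key,
    }
    return f"{STORE_SEARCH}?{urlencode(params)}"


def run_is_over(state):
    """True once a crawl run is no longer running (finished or cancelled)."""
    return state in ("FINISHED", "CANCELLED")
```

A polling loop would fetch this URL at intervals and stop once run_is_over returns True for the newest run's state.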

How to update the fetch status in crawldb in apache nutch?

I did some web crawling using Apache Nutch and fetched two rounds. This generated a crawldb containing 21 URLs with status fetched and 537 URLs with status unfetched. For certain reasons I want to set the status of all links in the crawldb to fetched. Is there any way to update the status?
I found the answer to my question and wanted to share it with you all. After fetching two rounds I updated the db with the command 'bin/nutch updatedb crawl/crawldb $s2'. This adds the newly found URLs to the db with status 'unfetched'. With 'bin/nutch updatedb crawl/crawldb $s2 -noAdditions', however, no new URLs are added to the db and the already existing URLs get the status 'fetched'.

Multi Login ZF2 with multi session

I have an application with 3 different logins (3 different dashboards). To avoid duplicating code, I created an adapter and a plugin for the login.
Now, how can I manage 3 different sessions? If I log in through login 1, I must not also be signed in to dashboards 2 and 3, but only to dashboard 1.
How can I handle this, i.e. multiple sessions for multiple logins?
This has nothing to do with authentication (or login: know what the identity of the user is) but authorization (or access: has the user the right to access this page).
You should not manage authorization with different logins, different sessions and so on. Just use a single identity per user and use authorization to control access. Take a look at ACL or RBAC, for example, both part of Zend\Permissions.
With these permission systems, you can say: this user X is allowed to access dashboard 1 and 3. The user Y is allowed to access 1 and 2. The user Z is only allowed to visit dashboard 1.
You should use Zend\Permissions\Acl. Check section "Multiple Inheritance among Roles".
use Zend\Permissions\Acl\Acl;
use Zend\Permissions\Acl\Role\GenericRole as Role;
use Zend\Permissions\Acl\Resource\GenericResource as Resource;
$acl = new Acl();
// Base roles
$acl->addRole(new Role('guest'))
    ->addRole(new Role('member'))
    ->addRole(new Role('admin'));
// someUser inherits from all three roles; the parent listed last is checked first
$parents = array('guest', 'member', 'admin');
$acl->addRole(new Role('someUser'), $parents);
$acl->addResource(new Resource('someResource'));
$acl->deny('guest', 'someResource');
$acl->allow('member', 'someResource');
// admin has no rule for someResource, so the rule for member applies: prints "allowed"
echo $acl->isAllowed('someUser', 'someResource') ? 'allowed' : 'denied';
But in case you don't want to use ACL, why not add a permission column to your login table, holding an integer (1, 2, 3, ... up to 7, I think)? On login, store this integer in the session, and on each dashboard check the permission number; if access is not allowed, redirect to the login or home page.
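One way to read the integer column is as a bitmask: with three dashboards each one gets its own bit, which would explain why the values run from 1 to 7. A small sketch of the check under that assumption, written in Python only for illustration (the same bitwise test works in PHP); the flag names and can_access are made up for this example.

```python
# One bit per dashboard (assumed encoding of the permission column).
DASHBOARD_1 = 0b001  # 1
DASHBOARD_2 = 0b010  # 2
DASHBOARD_3 = 0b100  # 4


def can_access(permission, dashboard_flag):
    """True if the permission integer stored in the session has the dashboard's bit set."""
    return (permission & dashboard_flag) != 0


# User X may see dashboards 1 and 3, so the stored value is 1 + 4 = 5.
user_x = DASHBOARD_1 | DASHBOARD_3
print(can_access(user_x, DASHBOARD_1))  # True
print(can_access(user_x, DASHBOARD_2))  # False
```

A value of 7 then means access to all three dashboards, and each dashboard's controller only needs this one check before deciding whether to redirect.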