How to make search engines recognize a website's chronological updates over time (as in forums)

I see that search engines are quite capable of filtering pages chronologically for forum websites and the like, offering the option to show results from the last 24 hours, the last week, the last month, the last year, etc.
I understand that these sites need to be crawled continuously to provide those updates, but I am unsure what structure, tags, or markup I need to add to achieve the same for my website.
On the client side (which is also where search engines sit), content appears as essentially static data, already processed by the server, so my questions are:
If I constantly update and add content to my site's index page to make it easily visible, and even add links, times, and dates as text for the new pages, why don't these updates show up at all in search engines?
Do I need to add XML/RSS feeds, or what else?
How do forums and other frequently updated sites with chronological markers make it possible for search engines to list results filtered by hour, day, and so on?
What specific set of tags and overall structure do I need to add for this feature?
I also see that search engines, mainly Googlebot, usually take at least three days to crawl those new pages, and even then the results aren't organized chronologically in any consistent way (or at all).
I am not using any forum, blog, or other web publishing software, just raw HTML and PHP written by hand, plus the minimum I mentioned above: links to the new documents from the site's index page along with a description.

Do I need to add XML/RSS feeds, or what else?
Yes: Atom or one of the RSS formats (or several formats at the same time, so you could offer both Atom and RSS).
Search engines learn about new blog posts, microblog posts, forum threads, forum replies, etc. because they subscribe to the feed. That's why you'll sometimes notice that a page is indexed by a search engine only minutes after it was published. For smaller sites, though, search engines probably don't check for updates every few minutes; it might instead take days until a new page is indexed.
A sitemap might help, too.
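As a concrete illustration, here is a minimal sketch of a hand-rolled Atom feed generator in Python (the URLs, titles, and output file name are hypothetical placeholders, and a real feed would also need to XML-escape its titles):

    # Minimal sketch: write a hand-rolled Atom feed for newly published pages.
    # URLs, titles, and the output file name are hypothetical placeholders.
    from datetime import datetime, timezone

    # (title, url, last-updated) for each new page you publish
    entries = [
        ("New article", "https://example.com/articles/new-article.html",
         datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)),
    ]

    def atom_feed(entries):
        # A valid Atom feed needs feed-level <id>, <title> and <updated>,
        # plus <id>, <title>, <link> and <updated> for each entry.
        updated = max(ts for _, _, ts in entries)
        body = "".join(
            "  <entry>\n"
            f"    <title>{title}</title>\n"
            f'    <link href="{url}"/>\n'
            f"    <id>{url}</id>\n"
            f"    <updated>{ts.isoformat()}</updated>\n"
            "  </entry>\n"
            for title, url, ts in entries
        )
        return (
            '<?xml version="1.0" encoding="utf-8"?>\n'
            '<feed xmlns="http://www.w3.org/2005/Atom">\n'
            "  <title>Example site updates</title>\n"
            '  <link href="https://example.com/"/>\n'
            "  <id>https://example.com/</id>\n"
            f"  <updated>{updated.isoformat()}</updated>\n"
            f"{body}</feed>\n"
        )

    with open("feed.atom", "w", encoding="utf-8") as f:
        f.write(atom_feed(entries))

Once the feed exists, link to it from your pages with a <link rel="alternate" type="application/atom+xml" href="..."> element in the HTML head, so crawlers and feed readers can discover it automatically.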

Related

List the most viewed pages of Wikimedia projects (Wikipedia) with more than 1000 results

I have seen that there are various APIs and tools that let you see the most visited pages of Wikimedia projects such as Wikipedia, but all of these services share a limit: they will not show more than 1000 pages, while I would like a list of the 5000-10000 (or more) most visited pages in order of traffic.
These are all the services I checked and where I found this limit:
https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bmostviewed
https://stats.wikimedia.org/#/en.wikipedia.org/reading/top-viewed-articles/normal|table|last-month|~total|monthly
https://pageviews.toolforge.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes=
https://wikimedia.org/api/rest_v1/#/Pageviews%20data
I have also found services like https://quarry.wmflabs.org/ and
https://query.wikidata.org/ where you can run a query; technically it might be possible through these, but I don't know what query would show the most visited pages.
I also found an interesting article at https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/, which explains that it is possible to use Google's BigQuery, but that is an external service, and before using it I wanted to know whether a simpler method exists.
If the REST API doesn't suit your purpose, you'd need to parse the raw data yourself. That's because all the tools you've linked just consume the REST API.
The raw data are available at https://dumps.wikimedia.org/other/pageviews/. There are two groups of files there: those starting with pageviews-, which list the number of views of individual pages, and those starting with projectviews-, which list the number of views of whole projects.
For your goal, you need the pageviews- ones. Download the files for your timespan and then analyze them with a script.
The files are space-separated. Each row represents one page that was visited in that hour: the first column is the project (en is English Wikipedia, for instance), the second is the page title (spaces are represented by underscores), and the third is the total number of pageviews.
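As a rough sketch (assuming the hourly files for your timespan have already been downloaded and decompressed; the file pattern, project code, and the 10,000 cutoff are placeholders to adapt), summing the per-page counts could look like this:

    # Minimal sketch: tally per-page views across downloaded (and decompressed)
    # hourly pageviews-* files and print the most viewed pages.
    import glob
    from collections import Counter

    counts = Counter()
    for path in glob.glob("pageviews-202305*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) < 3 or parts[0] != "en":  # English Wikipedia only
                    continue
                counts[parts[1]] += int(parts[2])  # title -> summed views

    # No 1000-result cap here: you hold the full data, so 5000-10000 is fine
    for title, views in counts.most_common(10000):
        print(views, title.replace("_", " "))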
The technical documentation is available at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews.

How to get the most searched words in Solr?

I'm trying to set up a Solr search engine. I've already set up the spellchecking system and the suggestions.
However, I can't seem to find out how to retrieve the top 10 most searched words/terms/keywords in Solr/Lucene. How can I get this? I want to display them on my homepage.
Solr does not provide this kind of feature out of the box. There is the StatsComponent, which provides all kinds of statistics, but those are numeric only.
Depending on how you access Solr (directly or via your own app), you could intercept all calls and log the query string. I did this in a recent project, where I logged the queries to a database. If you submit all keywords to another core on your Solr server, you can run facet queries on your search terms, as described by Hyque.
You could use a facet to retrieve the top X words, like this:
http://yourservergoeshere/solr/select?q=*:*&wt=xml&indent=true&facet=true&facet.field=message&facet.limit=10&facet.mincount=1
The value of facet.field depends on the field you want to count terms in. With facet.limit you (obviously) limit the number of results to 10. You'll find the facet results at the end of the response, in the section starting with "facet_counts".
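For illustration, a small script along these lines (the server URL, core, and field name are placeholders, and wt is switched to json because it is easier to parse programmatically) could fetch and print the counts:

    # Minimal sketch: run the facet query above and print the top terms.
    # Server URL and field name are placeholders; adjust to your setup.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({
        "q": "*:*",                # match all documents
        "wt": "json",              # JSON is easier to parse than XML here
        "facet": "true",
        "facet.field": "message",  # the field holding the logged search terms
        "facet.limit": 10,
        "facet.mincount": 1,
    })
    with urlopen(f"http://yourservergoeshere/solr/select?{params}") as resp:
        data = json.load(resp)

    # facet_fields maps the field name to a flat [term, count, term, count, ...] list
    pairs = data["facet_counts"]["facet_fields"]["message"]
    for term, count in zip(pairs[::2], pairs[1::2]):
        print(count, term)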
Edit: I really should go to bed earlier. I didn't see the "most searched" in your question. Sorry for that.
Apache Solr does not provide any such capability as of today. There is a desire for this and a JIRA ticket corresponding to it. You can vote for it if you'd like to see it in Solr some day: https://issues.apache.org/jira/browse/SOLR-10359.
The StatsComponent provides statistical information, but it is mostly numeric in nature. You could parse the server logs and build a frequently-searched-terms report from them (e.g. pump those logs into SiLK or Kibana for visualization).
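As a sketch of that log-parsing route (the log file name and the assumption that request URLs appear as whitespace-separated tokens in each line are placeholders; adapt the extraction to whatever your servlet container actually writes):

    # Minimal sketch: pull q= parameters out of request logs and count them.
    from collections import Counter
    from urllib.parse import parse_qs, urlparse

    counts = Counter()
    with open("solr_requests.log", encoding="utf-8") as log:
        for line in log:
            for token in line.split():
                if "/select?" not in token:
                    continue
                qs = parse_qs(urlparse(token).query)  # also URL-decodes values
                for q in qs.get("q", []):
                    counts[q.lower()] += 1

    for term, n in counts.most_common(10):
        print(n, term)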
If you can change the front end and add some JavaScript to the UI, or can intercept the search request and make async or batch calls to tracking APIs, you could use SearchStax Analytics, which provides search analytics covering searches, clicks, cart actions, revenue, etc.

Best practice for joining AdWords API placement data with AdWords ValueTrack placement data?

We've been working successfully with the AdWords API (version 201708, via the Google Ads Python Client Library) for a good while, building internal reports for our application. Until, that is, we hit placements…
I define placements as anywhere an AdWords ad is shown. The placement might be a domain, a page, an ad unit, an app... you name it! Placements are a very broad category.
For our app to work for placements we need to join API spend data with activity on our website.
To do this we are running AdWords API reports and then collecting session data using AdWords ValueTrack parameters.
The ValueTrack parameters are easy enough, as there seems to be only 1 option: {placement}.
However, it's on the API side where things get interesting: the API has numerous options for getting placement data. For example:
https://developers.google.com/adwords/api/docs/reference/v201708/AdGroupCriterionService.MobileApplication
https://developers.google.com/adwords/api/docs/appendix/reports/url-performance-report
https://developers.google.com/adwords/api/docs/appendix/reports/placement-performance-report#criteria
https://developers.google.com/adwords/api/docs/appendix/reports/automatic-placements-performance-report#domain
https://developers.google.com/adwords/api/docs/reference/v201708/AdGroupCriterionService
After spending some time going back and forth on the various options, and burning lots of dev time, we've concluded that there must be some best-practice advice out there for joining placement data from the API with ValueTrack data: advice that works for all types of placements, including:
Websites
Apps
AdSense
Blogspot
AMP
An example of where we are running into a matching problem is "10060.android.com.nytimes.android.adsenseformobileapps.com"... this is a placement we see coming in from ValueTrack, but it has no match in any of our spend reports. (In fact, there are many, many adsenseformobileapps.com traffic sources for which there are no spend items.)
We are also seeing strings like "mobileapp::2-com.mobilesrepublic.appy". These show up on our spend side but appear in our ValueTrack data only around 10% of the time. Some match; the vast majority don't.
A definitive workflow on this would be SO useful for ourselves and no doubt other users…
Thanks!
According to https://developers.google.com/adwords/api/docs/guides/valuetrack-mapping, the incoming ValueTrack placement should map to the following report fields:
PlacementPerformanceReport.Criteria
CriteriaPerformanceReport.Criteria
AutomaticPlacementsPerformanceReport.DisplayName
In addition to these, I have also found two more report fields useful:
UrlPlacementPerformanceReport.Domain and .Url
But I have found that it is not so clear in practice. For one thing, each of these reports returns a slightly different subset of results... and none of these subsets exactly matches the ValueTrack data set.
Here are the exceptions I have found:
subdomains
ValueTrack placements have urls with www on them... some of the time. None of the other reports do, so you will either have to strip www from the ValueTrack data or add www to your report data in order to match them. But be careful: other subdomains are preserved (like edition.cnn.com), and not all urls have a subdomain, so you can't simply strip all subdomains from ValueTrack, and you can't simply add www to all the urls in the reports. What I have found matches best is the url field from the UrlPlacementPerformanceReport... though for this field you need to strip everything after the / to get a best-case matching subset (see the code sketch at the end of this answer). To use the other reports, you would need to strip all the subdomain information from ValueTrack and sum the totals of those records. This means you would lose potentially useful data, such as the differences between espn.com, scores.espn.com, insider.espn.com and games.espn.com. Using UrlPlacementPerformanceReport.Url is the only way to preserve that information.
mobileapp::
ValueTrack reports on mobileapp:: placements. Many of the reports return these values too, but I have found that each report gives just a subset of the whole. In particular, CriteriaPerformanceReport.Criteria gives you many mobileapp:: values that none of the other reports do, but the other reports give you at least a few values that the CriteriaPerformanceReport doesn't. To be complete, you would have to take the union of the mobileapp:: values returned by the CriteriaPerformanceReport and another report such as UrlPlacementPerformanceReport.Url.
anonymous.google
ValueTrack provides subdomains of anonymous.google that look like a8122ac7e5da8e49.anonymous.google. If you want to match this information to your spend, the only report that has this detail is UrlPlacementPerformanceReport.Url.
adsenseformobileapps.com
ValueTrack provides detailed domains such as 1.iphone.com.localtvllc.fox2.adsenseformobileapps.com. None of the AdWords reports can match this level of detail; the best you can get is a single summed record for the entire adsenseformobileapps.com group.
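To make these rules concrete, here is a minimal sketch in Python. It encodes only the heuristics from this answer (strip paths from the report url field, strip www but keep real subdomains on the ValueTrack side, collapse adsenseformobileapps.com placements to their single aggregate row); the row shapes and field names are placeholders, not the actual report schema:

    # Minimal sketch of the matching heuristics described above.
    def normalize_report_url(url):
        # Strip everything after the first '/' from the
        # UrlPlacementPerformanceReport url field.
        return url.split("/", 1)[0].lower()

    def normalize_valuetrack(placement):
        p = placement.lower()
        if p.startswith("mobileapp::"):
            return p  # match these against a union of report values
        if p.endswith(".adsenseformobileapps.com"):
            # reports only carry a single aggregate row for this group
            return "adsenseformobileapps.com"
        if p.startswith("www."):
            # reports omit www, but real subdomains (edition.cnn.com,
            # a8122ac7e5da8e49.anonymous.google) must be kept
            p = p[len("www."):]
        return p

    def join_spend(valuetrack_rows, report_rows):
        # Yield (session row, matching spend row or None) pairs.
        spend = {normalize_report_url(r["url"]): r for r in report_rows}
        for v in valuetrack_rows:
            yield v, spend.get(normalize_valuetrack(v["placement"]))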

Multipage Bootstrap and Google Analytics

I have a bit of a problem figuring out how to use Google Analytics properly with Bootstrap.
My site has subpages three levels deep, and the deepest subpage has its own subdomain. In GA I see I can use a maximum of 50 tracking codes within one account. What if I need more than that?
You are limited to 50 properties, not 50 pages. Each property can track many pages (up to 10 million hits a month for the free version) and events.
Typically you would use the same property code on all pages of the same site so you can see all that data together (with the option to drill down).
You would only use a new property code for a new site (though your subdomain might qualify for that if you want to track it separately).
So the two questions you want to ask yourself are:
Do you want to be able to report on two pages together? E.g. to see that your site gets 10,000 hits, of which 20% are for this page and 5% for that page, or that people start at this page, then go to that page, and then on to another. If so, they should be on the same analytics property.
Do different people need to see these page stats, and is it a problem if they do? If so, put them in a separate property so you can set permissions separately.
It sounds like these pages are part of the same site, so I'd lean towards tracking them together in the same property.
On a different note, you should set one page as the main version (with a rel=canonical tag) and redirect the other versions to it, to avoid search engines thinking you have duplicated content. Do you have a reason for serving the same content at two different addresses? It can cause SEO and other problems.

New site going on to an old domain

I have a client who over the years has managed to get their product to the top of Google for many different search terms. They're adamant that the new site shouldn't have a detrimental effect on their Google ranking.
The new site will replace the site on their current domain, as well as going up on five further domains.
Will any of this lose the client their current ranking on Google?
Google regularly re-ranks the sites it knows about. If the site changes, the ranking very well could change too... if more or fewer people link to it, or if the terms on the site (the content) are different.
The effect might be good or bad, but uploading different content isn't going to make their rank go away overnight or anything like that.
PageRank is mostly about incoming links, so as long as the incoming links aren't broken, PageRank will not be affected much.
Overall ranking is not just PageRank, though, so further discussion is needed.
If they retain the current link structure, they should be fine.