I'm working on a project for which I need to extract some data from Google Scholar. My PHP program takes a string from my local machine, passes it to Google Scholar, takes the first result from the search results page, and saves it to the database.
I have to do this for almost 90 thousand strings/queries. The problem is that after a few hundred entries the program stops because Google Scholar asks for CAPTCHA verification. What can I do about that?
Because Google Scholar does not have an API, there is no documented way to do what you want. You are not supposed to scrape data like this, which is why you are running into Google's bot-protection features. I think that your only real option is to wait for Google to create an API.
You can try to use Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
It bypasses blocks from search engines by using dedicated proxies and a CAPTCHA-solving service, it can scale to enterprise volumes, and you don't need to create and maintain a parser from scratch.
Code and example to integrate with PHP in the online IDE:
<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
require __DIR__ . '/vendor/autoload.php';
$queries = array(
"moon",
"pandas",
"python",
"data science",
"ML",
"AI",
"animals",
"amd",
"nvidia",
"intel",
"asus",
"robbery pi",
"latex, tex",
"amg",
"blizzard",
"world of warcraft",
"cs go",
"antarctica",
"fifa",
"amsterdam",
"usa",
"tesla",
"economy",
"ecology",
"biology"
);
foreach ($queries as $query) {
    $params = [
        "engine" => "google_scholar",
        "q" => $query,
        "hl" => "en"
    ];
    $client = new GoogleSearch(getenv("API_KEY"));
    $response = $client->get_json($params);
    print_r("Extracting search query: {$query}\n");
    foreach ($response->organic_results as $result) {
        print_r("{$result->title}\n");
    }
}
?>
Code and example to integrate with Python in the online IDE:
from serpapi import GoogleScholarSearch
import os
queries = ["moon",
"pandas",
"python",
"data science",
"ML",
"AI",
"animals",
"amd",
"nvidia",
"intel",
"asus",
"robbery pi",
"latex, tex",
"amg",
"blizzard",
"world of warcraft",
"cs go",
"antarctica",
"fifa",
"amsterdam",
"usa",
"tesla",
"economy",
"ecology",
"biology"]
for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
        "hl": "en"
    }
    search = GoogleScholarSearch(params)
    results = search.get_dict()
    print(f"Extracting search query: {query}")
    for result in results["organic_results"]:
        print(result["title"])
Output:
Extracting search query: moon
Cellulose nanomaterials review: structure, properties and nanocomposites
Reflection in learning and professional development: Theory and practice
...
Extracting search query: biology
A new biology for a new century
The biology of mycorrhiza.
Disclaimer, I work for SerpApi.
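For a run of roughly 90 thousand queries it may also help to make the loop resumable, so that an interruption does not force a restart from the first query. A minimal sketch, assuming a hypothetical `fetch` callable that wraps whichever client you use and returns the parsed JSON as a dict with an `organic_results` list; the SQLite table name and schema are illustrative only:

```python
import sqlite3
from typing import Callable, Iterable


def save_first_results(
    queries: Iterable[str],
    fetch: Callable[[str], dict],  # hypothetical wrapper around your search client
    db_path: str = "results.db",
) -> int:
    """Store the first result title per query, skipping queries already stored."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS first_result (query TEXT PRIMARY KEY, title TEXT)"
    )
    saved = 0
    for query in queries:
        # Resume support: skip queries that a previous run already handled.
        done = conn.execute(
            "SELECT 1 FROM first_result WHERE query = ?", (query,)
        ).fetchone()
        if done:
            continue
        results = fetch(query).get("organic_results", [])
        title = results[0]["title"] if results else None
        conn.execute("INSERT INTO first_result VALUES (?, ?)", (query, title))
        conn.commit()  # commit per query so progress survives a crash
        saved += 1
    conn.close()
    return saved
```

Calling it again with the same list only fetches the queries that are not in the table yet.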
Possible duplicate of How to change the type of a field?
I am new to MongoDB and am having trouble converting a field value from one data type to another.
Below is an example of my document
[
{
"Name of Restaurant": "Briyani Center",
"Address": " 336 & 338, Main Road",
"Location": "XYZQWE",
"PriceFor2": "500.0",
"Dining Rating": "4.3",
"Dining Rating Count": "1500",
},
{
"Name of Restaurant": "Veggie Conner",
"Address": " New 14, Old 11/3Q, Railway Station Road",
"Location": "ABCDEF",
"PriceFor2": "1000.0",
"Dining Rating": "4.4",
}]
I have 12k documents like the ones above. Notice that the data type of PriceFor2 is string. I would like to convert it to an integer.
I have read many of the answers in the question linked above, but when I try to run the query, I get a ".save() is not a function" error. Please advise what the problem is.
Below is the code I used
db.chennaiData.find().forEach( function(x){ x.priceFor2= new NumberInt(x.priceFor2);
db.chennaiData.save(x);
db.chennaiData.save(x);});
This is the error I am getting..
TypeError: db.chennaiData.save is not a function
From MongoDB's save documentation:
Starting in MongoDB 4.2, the db.collection.save() method is deprecated. Use db.collection.insertOne() or db.collection.replaceOne() instead.
You are likely running MongoDB 4.2 or later, so the save function is no longer available. Consider migrating to insertOne and replaceOne as suggested.
For your specific scenario, it is actually preferable to use a single update, as mentioned in another SO answer. That needs only one database call, whereas your approach fetches all documents in the collection to the application level and then performs n database calls to save them back.
db.collection.update({},
[
{
$set: {
PriceFor2: {
$toDouble: "$PriceFor2"
}
}
}
],
{
multi: true
})
Mongo Playground
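The question asks for an integer rather than a double. According to the MongoDB documentation, `$toInt` cannot parse a string such as "500.0" directly, so one option is to convert to a double first and then to an integer. A minimal sketch in Python, assuming the PyMongo driver (3.9 or later, which accepts an aggregation pipeline in `update_many`); the database name `test` in the usage comment is a placeholder:

```python
from typing import Any

# Update pipeline: parse the string as a double, then truncate it to an integer.
TO_INT_PIPELINE = [
    {"$set": {"PriceFor2": {"$toInt": {"$toDouble": "$PriceFor2"}}}}
]


def convert_price_to_int(collection: Any) -> Any:
    """Run the conversion server-side on every document in one call.

    `collection` is assumed to be a pymongo Collection.
    """
    return collection.update_many({}, TO_INT_PIPELINE)


# Usage (requires a running server and `pip install pymongo`):
#   from pymongo import MongoClient
#   convert_price_to_int(MongoClient()["test"]["chennaiData"])
```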
Assume I am in possession of a SERP API, which given a keyword, returns me the Google results of that keyword in JSON format (for example: https://serpapi.com/):
{
"organic_results": [
{
"position": 1,
"title": "Coffee - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Coffee",
"displayed_link": "https://en.wikipedia.org › wiki › Coffee",
"snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. From the coffee fruit, the seeds are ...",
"sitelinks":{/*snip*/}
,
"rich_snippet":
{
"bottom":
{
"extensions":
[
"Region of origin: Horn of Africa and South Ara...",
"Color: Black, dark brown, light brown, beige",
"Introduced: 15th century"
]
,
"detected_extensions":
{
"introduced_th_century": 15
}
}
}
,
"about_this_result":
{
"source":
{
"description": "Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system. Individual contributors, also called editors, are known as Wikipedians.",
"source_info_link": "https://en.wikipedia.org/wiki/Wikipedia",
"security": "secure",
"icon": "https://serpapi.com/searches/6165916694c6c7025deef5ab/images/ed8bda76b255c4dc4634911fb134de53068293b1c92f91967eef45285098b61516f2cf8b6f353fb18774013a1039b1fb.png"
}
,
"keywords":
[
"coffee"
]
,
"languages":
[
"English"
]
,
"regions":
[
"the United States"
]
}
,
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:U6oJMnF-eeUJ:https://en.wikipedia.org/wiki/Coffee+&cd=4&hl=en&ct=clnk&gl=us",
"related_pages_link": "https://www.google.com/search?q=related:https://en.wikipedia.org/wiki/Coffee+Coffee"
},
/* Results 2,3,4... */
]}
What is a good way to get new results from the past 24h? I added the &tbs=qdr:d query parameter, which only shows the results from the past day. That's a good first step.
The 2nd step is to filter out only relevant results. When there are no relevant results, Google shows this message box:
What is their algorithm to show this box?
Idea 1: "grep -i {exact_keywords}"
For example, if I search a keyword like "Alexander Pope", the 24h Google query might return results about the pope, written by a guy called Alexander. That's not super relevant. My naive idea is to grep (case insensitive) the exact keyword "Alexander Pope".
But that might leave out some good results.
Any other ideas?
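For reference, Idea 1 could be sketched roughly as follows in Python, assuming each entry in `organic_results` carries the `title` and `snippet` fields shown above; `filter_exact_phrase` is a hypothetical helper name and the two sample results are made up for illustration:

```python
from typing import Dict, List


def filter_exact_phrase(organic_results: List[Dict], phrase: str) -> List[Dict]:
    """Keep results whose title or snippet contains the phrase, ignoring case."""
    needle = phrase.casefold()
    return [
        result
        for result in organic_results
        if any(
            needle in result.get(field, "").casefold()
            for field in ("title", "snippet")
        )
    ]


# Made-up sample data, not real search results.
results = [
    {"title": "Alexander Pope - Wikipedia", "snippet": "English poet ..."},
    {"title": "The pope visits Rome", "snippet": "By Alexander Smith ..."},
]
print([r["title"] for r in filter_exact_phrase(results, "alexander pope")])
```

Title and snippet are checked separately so that a title ending in "Alexander" followed by a snippet starting with "Pope" does not count as a match.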
{
"movies": {
"movie1": {
"genre": "comedy",
"name": "As good as it gets",
"lead": "Jack Nicholson"
},
"movie2": {
"genre": "Horror",
"name": "The Shining",
"lead": "Jack Nicholson"
},
"movie3": {
"genre": "comedy",
"name": "The Mask",
"lead": "Jim Carrey"
}
}
}
I am a Firebase newbie. How can I retrieve a result from the data above where genre = 'comedy' AND lead = 'Jack Nicholson'?
What options do I have?
Using Firebase's Query API, you might be tempted to try this:
// !!! THIS WILL NOT WORK !!!
ref
.orderBy('genre')
.startAt('comedy').endAt('comedy')
.orderBy('lead') // !!! THIS LINE WILL RAISE AN ERROR !!!
.startAt('Jack Nicholson').endAt('Jack Nicholson')
.on('value', function(snapshot) {
console.log(snapshot.val());
});
But as @RobDiMarco from Firebase says in the comments:
multiple orderBy() calls will throw an error
So my code above will not work.
I know of three approaches that will work.
1. filter most on the server, do the rest on the client
What you can do is execute one orderBy().startAt().endAt() on the server, pull down the remaining data and filter that in JavaScript code on your client.
ref
.orderBy('genre')
.equalTo('comedy')
.on('child_added', function(snapshot) {
var movie = snapshot.val();
if (movie.lead == 'Jack Nicholson') {
console.log(movie);
}
});
2. add a property that combines the values that you want to filter on
If that isn't good enough, you should consider modifying/expanding your data to allow your use-case. For example: you could stuff genre+lead into a single property that you just use for this filter.
"movie1": {
"genre": "comedy",
"name": "As good as it gets",
"lead": "Jack Nicholson",
"genre_lead": "comedy_Jack Nicholson"
}, //...
You're essentially building your own multi-column index that way and can query it with:
ref
.orderBy('genre_lead')
.equalTo('comedy_Jack Nicholson')
.on('child_added', function(snapshot) {
var movie = snapshot.val();
console.log(movie);
});
David East has written a library called QueryBase that helps with generating such properties.
You could even do relative/range queries, let's say that you want to allow querying movies by category and year. You'd use this data structure:
"movie1": {
"genre": "comedy",
"name": "As good as it gets",
"lead": "Jack Nicholson",
"genre_year": "comedy_1997"
}, //...
And then query for comedies of the 90s with:
ref
.orderBy('genre_year')
.startAt('comedy_1990')
.endAt('comedy_2000')
.on('child_added', function(snapshot) {
var movie = snapshot.val();
console.log(movie);
});
If you need to filter on more than just the year, make sure to add the other date parts in descending order, e.g. "comedy_1997-12-25". This way the lexicographical ordering that Firebase does on string values will be the same as the chronological ordering.
This combining of values in a property can work with more than two values, but you can only do a range filter on the last value in the composite property.
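To illustrate why the range filter works on such a combined property, here is a minimal sketch in plain Python (no Firebase SDK involved). It emulates startAt/endAt as an inclusive lexicographic comparison on the composite string; `composite_key` and `range_query` are hypothetical helpers, and the years for movie2 and movie3 are added only for illustration:

```python
def composite_key(genre: str, year: int) -> str:
    """Build the value stored in a combined property such as genre_year."""
    return f"{genre}_{year}"


def range_query(movies: dict, start: str, end: str) -> list:
    """Emulate startAt/endAt: an inclusive lexicographic range on the key."""
    return sorted(
        key
        for key, movie in movies.items()
        if start <= composite_key(movie["genre"], movie["year"]) <= end
    )


movies = {
    "movie1": {"genre": "comedy", "year": 1997},
    "movie2": {"genre": "Horror", "year": 1980},  # year assumed for illustration
    "movie3": {"genre": "comedy", "year": 1994},  # year assumed for illustration
}
print(range_query(movies, "comedy_1990", "comedy_2000"))
```

Because the genre comes first in the string, every key in the range shares the "comedy_" prefix, and only the trailing year varies within it.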
A very special variant of this is implemented by the GeoFire library for Firebase. This library combines the latitude and longitude of a location into a so-called Geohash, which can then be used to do realtime range queries on Firebase.
3. create a custom index programmatically
Yet another alternative is to do what we've all done before this new Query API was added: create an index in a different node:
"movies"
// the same structure you have today
"by_genre"
"comedy"
"by_lead"
"Jack Nicholson"
"movie1"
"Jim Carrey"
"movie3"
"Horror"
"by_lead"
"Jack Nicholson"
"movie2"
There are probably more approaches. For example, this answer highlights an alternative tree-shaped custom index: https://stackoverflow.com/a/34105063
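A custom index like the by_genre/by_lead node in approach 3 has to be kept in sync by your own code. A minimal sketch in plain Python of deriving it from the movies data, using dictionaries rather than the Firebase SDK; `build_index` is a hypothetical helper:

```python
from collections import defaultdict


def build_index(movies: dict) -> dict:
    """Group movie keys as by_genre -> genre -> by_lead -> lead -> [keys]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for key, movie in movies.items():
        grouped[movie["genre"]][movie["lead"]].append(key)
    return {
        "by_genre": {
            genre: {"by_lead": dict(leads)} for genre, leads in grouped.items()
        }
    }


movies = {
    "movie1": {"genre": "comedy", "name": "As good as it gets", "lead": "Jack Nicholson"},
    "movie2": {"genre": "Horror", "name": "The Shining", "lead": "Jack Nicholson"},
    "movie3": {"genre": "comedy", "name": "The Mask", "lead": "Jim Carrey"},
}
index = build_index(movies)
# The AND query becomes a direct path lookup in the index.
print(index["by_genre"]["comedy"]["by_lead"]["Jack Nicholson"])
```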
If none of these options work for you, but you still want to store your data in Firebase, you can also consider using its Cloud Firestore database.
Cloud Firestore can handle multiple equality filters in a single query, but only one range filter. Under the hood it essentially uses the same query model, but it's like it auto-generates the composite properties for you. See Firestore's documentation on compound queries.
I've written a personal library that allows you to order by multiple values, with all the ordering done on the server.
Meet Querybase!
Querybase takes in a Firebase Database Reference and an array of fields you wish to index on. When you create new records it will automatically handle the generation of keys that allow for multiple querying. The caveat is that it only supports straight equivalence (no less than or greater than).
const databaseRef = firebase.database().ref().child('people');
const querybaseRef = querybase.ref(databaseRef, ['name', 'age', 'location']);
// Automatically handles composite keys
querybaseRef.push({
name: 'David',
age: 27,
location: 'SF'
});
// Find records by multiple fields
// returns a Firebase Database ref
const queriedDbRef = querybaseRef
.where({
name: 'David',
age: 27
});
// Listen for realtime updates
queriedDbRef.on('value', snap => console.log(snap));
Firebase ref = new Firebase("https://your.firebaseio.com/");
Query query = ref.orderByChild("genre").equalTo("comedy");
query.addValueEventListener(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot dataSnapshot) {
        for (DataSnapshot movieSnapshot : dataSnapshot.getChildren()) {
            // Read each child snapshot, not the parent snapshot.
            Movie movie = movieSnapshot.getValue(Movie.class);
            if (movie.getLead().equals("Jack Nicholson")) {
                System.out.println(movieSnapshot.getKey());
            }
        }
    }
    @Override
    public void onCancelled(FirebaseError firebaseError) {
    }
});
Frank's answer is good, but Firestore recently introduced array-contains, which makes it easier to do AND queries.
You can create a filters field to hold your filters, and add as many values as you need. For example, to filter by comedy and Jack Nicholson you can add the value comedy_Jack Nicholson, and if you also want to filter by comedy and 2014 you can add the value comedy_2014 without creating more fields.
{
"movies": {
"movie1": {
"genre": "comedy",
"name": "As good as it gets",
"lead": "Jack Nicholson",
"year": 2014,
"filters": [
"comedy_Jack Nicholson",
"comedy_2014"
]
}
}
}
For Cloud Firestore
https://firebase.google.com/docs/firestore/query-data/queries#compound_queries
Compound queries
You can chain multiple equality operators (== or array-contains) methods to create more specific queries (logical AND). However, you must create a composite index to combine equality operators with the inequality operators, <, <=, >, and !=.
citiesRef.where('state', '==', 'CO').where('name', '==', 'Denver');
citiesRef.where('state', '==', 'CA').where('population', '<', 1000000);
You can perform range (<, <=, >, >=) or not equals (!=) comparisons only on a single field, and you can include at most one array-contains or array-contains-any clause in a compound query:
Firebase doesn't allow querying with multiple conditions.
However, I did find a way around for this:
We need to download the initial filtered data from the database and store it in an array list.
Query query = databaseReference.orderByChild("genre").equalTo("comedy");
query.addValueEventListener(new ValueEventListener() {
    @Override
    public void onDataChange(@NonNull DataSnapshot dataSnapshot) {
        ArrayList<Movie> movies = new ArrayList<>();
        for (DataSnapshot dataSnapshot1 : dataSnapshot.getChildren()) {
            String lead = dataSnapshot1.child("lead").getValue(String.class);
            String genre = dataSnapshot1.child("genre").getValue(String.class);
            Movie movie = new Movie(lead, genre);
            movies.add(movie);
        }
        filterResults(movies, "Jack Nicholson");
    }
    @Override
    public void onCancelled(@NonNull DatabaseError databaseError) {
    }
});
Once we have the initial filtered data from the database, we need to filter it further on the client.
public void filterResults(final List<Movie> list, final String lead) {
    // Keep only the movies whose lead matches.
    List<Movie> movies = list.stream().filter(o -> o.getLead().equals(lead)).collect(Collectors.toList());
    System.out.println(movies);
    movies.forEach(movie -> System.out.println(movie.getLead()));
}
The data from the Firebase Realtime Database arrives as an _InternalLinkedHashMap<dynamic, dynamic>.
You can simply convert it to your own map and query it very easily.
For example, I have a chat app and I use the Realtime Database to store each user's uid together with a bool value indicating whether the user is online.
Now, I have a class RealtimeDatabase with a static method getOnilineUsersUID().
static getOnilineUsersUID() {
var dbRef = FirebaseDatabase.instance;
DatabaseReference reference = dbRef.reference().child("Online");
reference.once().then((value) {
Map<String, bool> map = Map<String, bool>.from(value.value);
List users = [];
map.forEach((key, value) {
if (value) {
users.add(key);
}
});
print(users);
});
}
It will print [NOraDTGaQSZbIEszidCujw1AEym2]
I am new to Flutter. If you know more, please update the answer.
ref.orderByChild("lead").startAt("Jack Nicholson").endAt("Jack Nicholson").listner....
This will work.
I am confused by the Balanced Payment documentation, specifically for creating customers:
The Balanced docs say to create a customer with this code:
$marketplace = Balanced\Marketplace::mine();
$customer = $marketplace->customers->create(array(
'address' => array(
'postal_code' => '48120',
),
'dob_month' => '7',
'dob_year' => '1963',
'name' => 'Henry Ford',
));
The goal is to get a json response:
{
"customers": [
{
"address": {
"postal_code": "48120",
//more key -> value pairs
},
//more key -> value pairs
"href": "/customers/CU3SSJgvA5Z69kt05MusbPeE",
}
The problem that I am having is that I cannot find any reference as to how to send the info to Balanced. Do I use balanced.js to tokenize it the same way I tokenize a credit card?
Take a look at https://docs.balancedpayments.com/1.1/api/customers/#create-a-customer . You would use one of the clients, such as the Python or PHP clients found at https://docs.balancedpayments.com/1.1/overview/getting-started/#client-libraries .
Balanced.js is just for card tokenizations to aid with PCI-compliance -- using it, sensitive card information is posted directly to Balanced and never has to touch your own servers.
I searched for a while but found nothing similar to my problem.
I'm trying to use the Yahoo Weather API, for example: http://weather.yahooapis.com/forecastrss?w=4097
I don't know the WOEID in my case, but I have latitude and longitude points.
So my question is:
Is there a way to get the WOEID of a place from its latitude and longitude?
This is now available through the recently released PlaceFinder API. Kudos to Yahoo! for providing yet another important piece of the Geo puzzle.
Yahoo! PlaceFinder API allows you to find a corresponding WOEID for a latitude/longitude pair. Consider this example web service method call:
http://where.yahooapis.com/geocode?location=37.42,-122.12&flags=J&gflags=R&appid=zHgnBS4m
You can play with request parameters according to your needs, see Yahoo! PlaceFinder API documentation for more.
And you should replace appid with your Yahoo! appid, you can create one here.
This request returns a response like the following, which includes a lot of useful data along with the WOEID:
{
"ResultSet": {
"version": "1.0",
"Error": 0,
"ErrorMessage": "No error",
"Locale": "us_US",
"Quality": 99,
"Found": 1,
"Results": [
{
"quality": 99,
"latitude": "37.420000",
"longitude": "-122.120000",
"offsetlat": "37.420000",
"offsetlon": "-122.120000",
"radius": 500,
"name": "37.42,-122.12",
"line1": "3589 Bryant St",
"line2": "Palo Alto, CA 94306-4207",
"line3": "",
"line4": "United States",
"house": "3589",
"street": "Bryant St",
"xstreet": "",
"unittype": "",
"unit": "",
"postal": "94306-4207",
"neighborhood": "",
"city": "Palo Alto",
"county": "Santa Clara County",
"state": "California",
"country": "United States",
"countrycode": "US",
"statecode": "CA",
"countycode": "",
"hash": "",
"woeid": 12797284,
"woetype": 11,
"uzip": "94306"
}
]
}
}
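Once the response is parsed, the WOEID sits in each entry of ResultSet.Results. A minimal sketch in Python that only parses a response of the shape shown above and does not call the service; `extract_woeids` is a hypothetical helper name and the sample string is a shortened copy of the response:

```python
import json
from typing import List


def extract_woeids(response_text: str) -> List[int]:
    """Return the WOEIDs from a PlaceFinder-style JSON response, or [] on error."""
    result_set = json.loads(response_text)["ResultSet"]
    if result_set.get("Error") != 0:
        return []
    return [item["woeid"] for item in result_set.get("Results", [])]


# Shortened copy of the response above, keeping only a few fields.
sample = (
    '{"ResultSet": {"Error": 0, "Found": 1, '
    '"Results": [{"city": "Palo Alto", "woeid": 12797284, "woetype": 11}]}}'
)
print(extract_woeids(sample))
```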
This is not using Yahoo's API but I found this blog post:
http://geomojo.org/?p=38
Mentioning this service:
http://www.geomojo.org/cgi-bin/reversegeocoder.cgi?long=-117.699444&lat=35.4775
Perhaps you can use that? It solved my problem, I hope it helps in solving yours.
It is somewhat ridiculous that Yahoo doesn't provide a lookup method for WOEIDs via lat/lon--it's been on their todo list since 2008--but that's the state of things.
I would caution you against using the suggested workaround implemented at Geomojo. If it works for your data, great, but the Yahoo service that Geomojo relies on is unpredictable. Geomojo uses Yahoo's PlaceMaker, which extracts location information from unstructured text to get a WOEID. It does this by creating a microformat containing your lat/lon pair and submitting it to PlaceMaker. However, since PlaceMaker returns WOEIDs for zip codes there's a loss of resolution and you will sometimes not be able to identify even the town for submitted coordinates. I have a number of example points on the east coast of the U.S. where the PlaceMaker WOEIDs do not correspond to the submitted lat/lon pairs.
Strangely, as HD writes, only Flickr's API provides a simple way to lookup a WOEID from lat/lon. Flickr's findByLatLon method has great resolution. It will usually return a neighborhood (one level below town) for a pair of coordinates.
First get city name from lat/long using this code.
CLGeocoder *geocoder = [[CLGeocoder alloc] init];
[geocoder reverseGeocodeLocation:location
               completionHandler:^(NSArray *placemarks, NSError *error) {
    if (error) {
        NSLog(@"Geocode failed with error: %@", error);
        return;
    }
    CLPlacemark *placemark = [placemarks objectAtIndex:0];
    NSLog(@"cityname - %@", placemark.locality);
}];
Then use that city name in the URL below:
https://search.yahoo.com/sugg/gossip/gossip-gl-location/?appid=weather&output=sd1&p2=pt&command=YOURCITYNAME
Example - https://search.yahoo.com/sugg/gossip/gossip-gl-location/?appid=weather&output=sd1&p2=pt&command=sydney
This will return JSON, in which you can find the WOEID:
{ "l" : { "gprid" : "eIL89mltSzSfgDWdP7uyBA" },
"q" : "sydney",
"r" : [ { "d" : "pt:iso=AU&woeid=1105779&lon=151.021&lat=-33.8563&s=New South Wales&c=Australia&sc=NSW&n=Sydney, Australia",
"k" : "Sydney"
} ]
}
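The "d" field is formatted like a URL query string, so the WOEID can be pulled out with a standard query-string parser. A minimal sketch in Python that parses a response of the shape shown above without calling the endpoint; `woeids_from_suggestions` is a hypothetical helper name:

```python
import json
from typing import List
from urllib.parse import parse_qs


def woeids_from_suggestions(response_text: str) -> List[int]:
    """Extract the woeid from each suggestion's query-string-like "d" field."""
    woeids = []
    for suggestion in json.loads(response_text).get("r", []):
        fields = parse_qs(suggestion["d"])
        if "woeid" in fields:
            woeids.append(int(fields["woeid"][0]))
    return woeids


sample = (
    '{"q": "sydney", "r": [{"d": "pt:iso=AU&woeid=1105779&lon=151.021'
    '&lat=-33.8563&s=New South Wales&c=Australia&sc=NSW&n=Sydney, Australia", '
    '"k": "Sydney"}]}'
)
print(woeids_from_suggestions(sample))
```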
Seems like you got it the wrong way around. This is the URL on weather.yahoo.com:
weather.yahoo.com/united-states/illinois/chicago-2379574/
The last bit is the WOEID for Chicago, i.e. 2379574
WOEIDs are described in the GeoPlanet docs:
http://developer.yahoo.com/geo/geoplanet/guide/concepts.html#woeids
You can use Flickr's reverse geocoding API through YQL.
Here is a link to the YQL with an example query to find the WOEID for a given lat/lon:
http://developer.yahoo.com/yql/console/#h=select%20place.woeid%20from%20flickr.places%20where%20lat%3D43%20and%20lon%3D-94
The above query can be called directly from your app with this URL (XML/JSON formats available):
http://query.yahooapis.com/v1/public/yql?q=select%20place.woeid%20from%20flickr.places%20where%20lat%3D43%20and%20lon%3D-94&format=xml
There is a topic about this issue at the YDN forums http://developer.yahoo.net/forum/index.php?showtopic=69
Looks like it's buried in the to-do list, from 2008 "The ability to map a set of longitude and latitude coordinates to a WOEID, from which information such as ZIP and State may be derived, has already been identified as a valuable feature and it is on our enhancement request list."
Other quotes:
"Flickr has a method: flickr.places.findByLatLon which returns a WOEID, but they truncate coordinates to three decimal places."
In this topic a Yahoo dev also suggests using the advice at http://geomojo.org/?p=38 as an interim solution.
Instead of using lng & lat, you can use the current online IP address to get the city name, then use the Yahoo GeoPlanet web service to get the WOEID.
Follow this tutorial for the details: http://4rapiddev.com/php/get-woeid-of-a-city-name-from-ip-address-with-php/
The former Yahoo Weather API has been deprecated. The new Yahoo Weather API requires a query string to get weather.
Use the following query string to get weather data by latitude and longitude -
"https://query.yahooapis.com/v1/public/yql?q=select * from weather.forecast where woeid in (select woeid from geo.places(1) where text=\"(" + latitude + "," + longitude + ")\")&format=json"
Eg - https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20in%20(select%20woeid%20from%20geo.places(1)%20where%20text%3D%22(31.63%2C74.87)%22)&format=json
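The percent-encoding in the example URL can be produced programmatically rather than by hand. A minimal sketch in Python using only the standard library; it builds the URL and does not send the request, and `weather_url` is a hypothetical helper name:

```python
from urllib.parse import quote

YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def weather_url(latitude: float, longitude: float) -> str:
    """Build the YQL request URL for a latitude/longitude pair."""
    yql = (
        "select * from weather.forecast where woeid in "
        f'(select woeid from geo.places(1) where text="({latitude},{longitude})")'
    )
    # Leave *, ( and ) unescaped to match the example URL; spaces, quotes,
    # commas and the equals sign are percent-encoded.
    return f"{YQL_ENDPOINT}?q={quote(yql, safe='*()')}&format=json"


print(weather_url(31.63, 74.87))
```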
Yahoo api uses weather.com actually, so go to weather.com and search for your local weather. I'm in Chicago so I entered 'Chicago, IL' and here's the link in my browser bar showing my weather:
http://www.weather.com/weather/today/Chicago+IL+USIL0225?lswe=chicago,%20il&from=searchbox_localwx
In the link is the woeid - which is USIL0225
You can get yours the same way.