scholarly
scholarly is a Python module for programmatically retrieving author and publication information from Google Scholar without having to solve CAPTCHAs manually; it relies on proxy support to cope with Google Scholar's anti-bot measures. Currently at version 1.7.11, the library maintains an active development cycle, with frequent updates to track changes in Google Scholar's page structure and anti-bot defenses.
Common errors
- scholarly.exceptions.MaxTriesExceededException: Exceeded maximum number of tries to fetch url. Check if your connection is good.
  - cause: Google Scholar has detected automated access and is blocking requests, often due to rate-limiting or CAPTCHA challenges.
  - fix: Configure a proxy: create a `ProxyGenerator`, set it up (e.g., with `pg.FreeProxies()`), and pass it to `scholarly.use_proxy(pg)`. For persistent scraping, consider a paid proxy service such as ScraperAPI. Add delays between requests if scraping in a loop.
- AttributeError: 'Author' object has no attribute 'publications'
  - cause: The author object was not fully "filled" with detailed information, including publications. By default, initial search results provide only summary data to avoid overloading Google Scholar.
  - fix: After getting an initial author object (e.g., `next(search_query)`), call `scholarly.fill(author_object)` to retrieve comprehensive details such as publications, co-authors, and citation counts. To fetch only specific sections, use `scholarly.fill(author_object, sections=['publications'])`.
- ImportError: cannot import name 'scholarly' from 'scholarly'
  - cause: A local file named `scholarly.py` in your working directory is shadowing the installed library.
  - fix: Rename your local `scholarly.py` file (e.g., to `my_script.py`) or run your script from a directory where no such file exists.
- TypeError: 'builtin_function_or_method' object is not subscriptable (when accessing `pub.bib['title']`)
  - cause: Access patterns have changed across `scholarly` versions. In older releases (e.g., pre-v0.4.1), publication titles were accessed via `pub.bib['title']`; during transitions, `pub.bib.title` or `pub.title` appeared. Likewise, `pub.citedby` was sometimes a method and sometimes an attribute.
  - fix: Use the access pattern that matches your installed version. In current releases, publications are plain dictionaries: access bibliographic data as `pub['bib']['title']`, and use `scholarly.citedby(pub)` (which returns a generator) for citing papers. Upgrade to the latest `scholarly` version for consistency.
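Because the correct pattern depends on the installed version, it helps to see the dictionary shape that recent scholarly versions return. The sketch below uses a hand-built publication dict that mirrors that shape (the field values are invented for illustration; no network access is involved):

```python
# Recent scholarly versions return publications as plain dicts. This
# hand-built dict mirrors that shape so the access pattern can be shown
# without querying Google Scholar; the values are invented.
pub = {
    "bib": {
        "title": "An Example Paper",
        "abstract": "A short example abstract.",
    },
    "num_citations": 42,
}

# Correct, dictionary-style access:
title = pub["bib"]["title"]

# Old attribute-style access fails on a plain dict:
try:
    pub.bib  # AttributeError: 'dict' object has no attribute 'bib'
except AttributeError:
    pass

print(title)
```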
Warnings
- breaking Version 1.7.7 introduced a breaking change by switching the underlying HTTP client from `requests` to `httpx`. Code relying on `requests`-specific functionalities or its session objects will break.
- gotcha Google Scholar employs aggressive anti-bot measures, including CAPTCHAs and rate-limiting. Without proper proxy configuration, your IP address may be temporarily or permanently blocked, leading to `MaxTriesExceededException` errors.
- deprecated Tor-related proxy methods (`Tor_External`, `Tor_Internal`) have been deprecated since v1.5 and are no longer actively tested or supported.
- breaking Version 1.7.7 introduced an incompatibility with ScraperAPI which was fixed in v1.7.8. Users on v1.7.7 will experience issues when trying to use ScraperAPI.
- gotcha The `search_author_id` function now handles redirects that occur when using approximate or outdated `scholar_id` values. Previously, this might have led to incorrect or failed searches.
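Given the rate-limiting mentioned above, spacing out requests is a cheap first line of defense. Below is a minimal, library-agnostic throttle sketch (the `MinInterval` class and its 5-second default are my own invention, not part of scholarly); call `wait()` before each scholarly request:

```python
import time

class MinInterval:
    """Enforce a minimum delay between successive requests.

    A generic throttle sketch (not part of scholarly) to reduce the
    chance of triggering Google Scholar's rate-limiting: call .wait()
    immediately before each request.
    """

    def __init__(self, seconds=5.0):
        self.seconds = seconds
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        remaining = self._last + self.seconds - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Usage: `throttle = MinInterval(5.0)`, then `throttle.wait()` before each call such as `scholarly.fill(...)`.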
Install
- pip install scholarly
- pip install -U git+https://github.com/scholarly-python-package/scholarly.git
Imports
- scholarly
from scholarly import scholarly
- ProxyGenerator
from scholarly import ProxyGenerator
Quickstart
from scholarly import scholarly, ProxyGenerator
import os
# It is recommended to set up a proxy from the start of your application.
# scholarly is designed to intelligently use proxies only when necessary.
pg = ProxyGenerator()
# For using free proxies (often less reliable for continuous scraping)
# success = pg.FreeProxies()
# if not success: print("Could not set up free proxies. Continuing without.")
# Example for ScraperAPI (recommended for reliability, requires API key)
# Set the SCRAPERAPI_API_KEY environment variable beforehand
scraperapi_key = os.environ.get('SCRAPERAPI_API_KEY', '')
if scraperapi_key:
    print("Using ScraperAPI for proxies.")
    pg.ScraperAPI(scraperapi_key)
    scholarly.use_proxy(pg)
else:
    print("SCRAPERAPI_API_KEY not found. Using default connection (may hit limits). "
          "Consider setting up a proxy for robust scraping.")
# Search for an author
search_query = scholarly.search_author('Steven A Cholewiak')
author = scholarly.fill(next(search_query))
print(f"Author Name: {author['name']}")
print(f"Author Affiliation: {author['affiliation']}")
print(f"Author Interests: {author['interests']}")
# Print the titles of the author's publications
publication_titles = [pub['bib']['title'] for pub in author['publications']]
print(f"First 3 publication titles: {publication_titles[:3]}")
# Take a closer look at the first publication
if author['publications']:
    first_publication = scholarly.fill(author['publications'][0])
    print(f"\nFirst Publication Title: {first_publication['bib']['title']}")
    print(f"First Publication Abstract: {first_publication['bib']['abstract'][:100]}...")

    # Which papers cited that publication?
    citations = [citation['bib']['title'] for citation in scholarly.citedby(first_publication)]
    print(f"First 3 papers citing this publication: {citations[:3]}")
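When a call still fails with MaxTriesExceededException despite a proxy, retrying with exponential backoff often helps. The helper below is a generic sketch (the `with_backoff` name and its defaults are my own, not part of scholarly); pass it any zero-argument callable, e.g. `lambda: scholarly.fill(author)`:

```python
import random
import time

def with_backoff(fetch, max_tries=4, base_delay=2.0):
    """Call fetch() and retry with exponential backoff plus jitter.

    Generic sketch for wrapping calls (such as scholarly requests) that
    may fail transiently when the remote service rate-limits us.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt) + random.random())

# Demonstration with a stand-in callable that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # succeeds on the third try
```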