Tweet Harvest (Twitter Crawler)
Tweet Harvest is an actively maintained command-line interface (CLI) tool for scraping tweets from Twitter search results. It uses Playwright to automate browser interactions, retrieves tweets matching specified keywords and date ranges, and exports the results to CSV or XLSX. The current stable version is 2.7.1; minor releases are frequent and bring bug fixes, performance improvements, and new export options (for example, XLSX export in v2.7.0). It requires a valid Twitter `auth_token` cookie for authentication, because Twitter prohibits unauthenticated search. Although it is primarily a CLI, it also exposes programmatic APIs for Node.js applications, with functions to start the scraping process and to process tweet data. Regular updates keep it compatible with Twitter's evolving interface and add data-quality features such as ISO 8601 timestamps.
Common errors
- Error: Playwright browser has not been installed.
  cause The Playwright browser binaries (e.g., Chromium) required by `tweet-harvest` have not been downloaded.
  fix Run `npx playwright install` in your project directory to download the necessary browser binaries for Playwright.
- Error: Auth token is not valid. Please make sure you enter a valid Twitter auth token.
  cause The provided Twitter `auth_token` is either expired, invalid, or incorrectly formatted, preventing successful authentication with Twitter.
  fix Log into Twitter in a browser, extract a fresh `auth_token` cookie, and update your configuration or environment variable. Ensure there are no leading/trailing spaces or other characters.
- Error: Cannot read properties of undefined (reading 'page')
  cause This often indicates that Playwright failed to launch the browser or navigate to the Twitter page, potentially due to network issues, an unsupported environment, or conflicting browser processes.
  fix Check your internet connection, ensure no other processes are interfering with Playwright, and try running in a headful mode (if available via options) to debug browser launch issues. Ensure your Node.js version is compatible with Playwright.
- CSV output does not match expected format / Missing columns.
  cause Breaking changes in CSV header order (v2.5.3), delimiter (v2.4.2), or added fields can alter the structure of the output CSV.
  fix Check the `tweet-harvest` changelog for recent versions to identify changes in export format. Adjust your CSV parsing logic to account for new delimiters, header order, or additional columns.
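One way to make downstream parsing tolerant of the delimiter and header-order changes described above is to detect the delimiter from the header row and look up columns by name instead of by position. The following is a minimal sketch, not part of `tweet-harvest`: the helpers `detectDelimiter` and `parseRows` are hypothetical, and the column names used in the usage comment (`created_at`, `full_text`) are assumptions for illustration only. The splitting is deliberately naive and does not handle quoted fields, so a real CSV parser is the better choice for tweet text that contains delimiters or line breaks.

```typescript
// Hypothetical helper: guess whether the header row uses ',' or ';'.
function detectDelimiter(headerLine: string): ',' | ';' {
  const commas = headerLine.split(',').length - 1;
  const semicolons = headerLine.split(';').length - 1;
  return semicolons > commas ? ';' : ',';
}

// Hypothetical helper: turn CSV text into objects keyed by header name,
// so the result does not depend on column order.
// Assumption: no quoted fields containing delimiters or newlines.
function parseRows(csv: string): Record<string, string>[] {
  const [headerLine, ...dataLines] = csv
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '');
  if (headerLine === undefined) return [];

  const delimiter = detectDelimiter(headerLine);
  const headers = headerLine.split(delimiter).map((h) => h.trim());

  return dataLines.map((line) => {
    const cells = line.split(delimiter);
    const row: Record<string, string> = {};
    headers.forEach((header, i) => {
      row[header] = (cells[i] ?? '').trim();
    });
    return row;
  });
}

// Usage (column names are illustrative, not confirmed by the package):
// const rows = parseRows(csvText);
// const texts = rows.map((row) => row['full_text']);
```

Because rows are keyed by header name, the same lookup works whether the file was written with `;` or `,` and regardless of column order. Added columns simply appear as extra keys.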
Warnings
- breaking The short option for the `--to` flag (`-t`) was removed due to ambiguity with other short options. Users relying on `-t` for the 'to date' will need to update their scripts.
- gotcha Tweet Harvest requires a valid Twitter `auth_token` cookie for authentication. This token can expire or become invalid, leading to failed scrapes. Twitter actively prohibits unauthenticated search, making this token essential.
- breaking The default CSV delimiter was changed from `;` to `,` (v2.4.2). This will affect any scripts or tools parsing the output that expected the semicolon delimiter.
- gotcha Changes in Twitter's cookie domains or internal structure can cause authentication or scraping failures. Version 2.6.1 specifically addressed a fix for 'cookie domain changes'.
- breaking The CSV header order was made consistent (v2.5.3), and support for Gephi format was added. Existing scripts that rely on a specific, potentially inconsistent, header order might break or yield incorrect data.
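Since several of the failures above trace back to the `auth_token` value, it can help to clean the token before passing it on. The sketch below is an illustration under stated assumptions, not part of `tweet-harvest`: `normalizeAuthToken` is a hypothetical helper that trims whitespace and stray surrounding quotes and rejects empty values. It makes no claim about the token's actual format and cannot tell whether a token has expired.

```typescript
// Hypothetical helper: clean up a copied auth_token before use.
// It only catches copy/paste problems (spaces, quotes, empty values);
// an expired or revoked token still fails at scrape time.
function normalizeAuthToken(raw: string | undefined): string {
  const token = (raw ?? '')
    .trim()
    .replace(/^["']|["']$/g, '') // drop one leading and one trailing quote
    .trim();

  if (token === '') {
    throw new Error('auth_token is empty');
  }
  if (/\s/.test(token)) {
    throw new Error('auth_token contains whitespace');
  }
  return token;
}

// Usage (the environment variable name follows the quickstart):
// const authToken = normalizeAuthToken(process.env.TWITTER_AUTH_TOKEN);
```

Failing early with a specific message separates a malformed value from a token that Twitter has actually rejected.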
Install
- npm install tweet-harvest
- yarn add tweet-harvest
- pnpm add tweet-harvest
Imports
- harvest
  const { harvest } = require('tweet-harvest');
  import { harvest } from 'tweet-harvest';
- Options
  import { type Options } from 'tweet-harvest';
- cleanTweet
  const cleanTweet = require('tweet-harvest').cleanTweet;
  import { cleanTweet } from 'tweet-harvest';
Quickstart
import { harvest } from 'tweet-harvest';
import type { Options } from 'tweet-harvest';

// Read the token from the environment; copy its value from your browser cookies.
const twitterAuthToken = process.env.TWITTER_AUTH_TOKEN ?? '';

if (!twitterAuthToken) {
  console.error('TWITTER_AUTH_TOKEN environment variable is not set. Please provide a valid Twitter auth token from your browser cookies.');
  process.exit(1);
}

const options: Options = {
  keyword: 'AI ethics',
  from: '2023-01-01',
  to: '2023-12-31',
  filename: 'ai-ethics-tweets',
  limit: 100, // Limit to 100 tweets for this example
  exportFormat: 'csv',
  auth_token: twitterAuthToken,
  withReplies: false,
  withImages: false,
  withVideos: false
};

async function runHarvest() {
  console.log('Starting tweet harvest...');
  try {
    await harvest(options);
    console.log(`Successfully harvested tweets to ${options.filename}.csv`);
  } catch (error) {
    console.error('Error during tweet harvest:', error);
    // Match both "auth_token" and "Auth token", as used in the tool's error message.
    if (error instanceof Error && /auth[_ ]?token/i.test(error.message)) {
      console.error('Ensure your TWITTER_AUTH_TOKEN is valid and up-to-date.');
    }
  }
}

runHarvest();