web-scraping

Summary

Reliable web scraping with cascading fallbacks, anti-bot bypass, and poison pill detection.

  • Implements a scraping cascade architecture with four strategies: trafilatura for fast article extraction, requests with rotating user agents, Playwright with stealth mode for JavaScript-heavy sites, and async Playwright for Jupyter notebooks
  • Includes poison pill detection to identify paywalls, CAPTCHAs, rate limits, Cloudflare blocks, and login walls using pattern matching and status code analysis
  • Covers reverse-engineering undocumented APIs through browser dev tools, with examples for autocomplete endpoints and parameter stripping
  • Provides ready-to-use patterns for YouTube (yt-dlp), Instagram (instaloader), and TikTok scraping, including metadata extraction, video downloads, and transcript retrieval
  • Emphasizes ethical scraping: robots.txt compliance, rate limiting, request delays per domain, and respectful User-Agent headers
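
The poison pill detection described above can be sketched as a small function that combines pattern matching on the response body with status code analysis. This is a minimal illustration, not the skill's actual implementation: the function name `detect_poison_pill`, the regular expressions, and the status-code labels are all assumptions, and real sites would need tuned patterns.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; tune them per target site.
POISON_PATTERNS = {
    "captcha": re.compile(r"captcha|verify you are (a )?human", re.I),
    "cloudflare": re.compile(r"checking your browser|attention required", re.I),
    "paywall": re.compile(r"subscribe to (continue|read)|subscribers only", re.I),
    "login_wall": re.compile(r"(log|sign) in to (continue|view)", re.I),
}

# Assumed mapping from HTTP status codes to a coarse block label.
STATUS_HINTS = {401: "login_wall", 403: "blocked", 429: "rate_limit"}


def detect_poison_pill(status_code: int, html: str) -> Optional[str]:
    """Return a label for the suspected block type, or None if the page looks usable."""
    # Body patterns are checked first because they give a more specific label
    # than the status code alone (for example, a 403 served by Cloudflare).
    for label, pattern in POISON_PATTERNS.items():
        if pattern.search(html):
            return label
    # Fall back to the status code when the body gives no hint.
    return STATUS_HINTS.get(status_code)
```

A caller would typically treat any non-`None` result as a signal to skip the page or move on to the next strategy, rather than storing the blocked page as if it were content.
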
SKILL.md

Web scraping methodology

Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.
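
As a rough illustration of the ethical side of these patterns, the sketch below checks robots.txt rules with the standard library's `urllib.robotparser` and enforces a minimum delay per domain. The class name `DomainRateLimiter`, the helper `is_allowed`, the default delay, and the sample rules are assumptions made for this example and do not come from the skill itself.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


class DomainRateLimiter:
    """Hypothetical helper that spaces out requests to the same domain."""

    def __init__(
        self,
        min_delay: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting `url`; return the delay applied."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[domain] = now + delay
        return delay
```

In practice the robots.txt lines would come from fetching `/robots.txt` once per domain, and `wait` would be called immediately before each request. The injectable `clock` and `sleep` parameters exist only to make the limiter easy to test.
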

Scraping cascade architecture

Implement multiple extraction strategies with automatic fallback:

from abc import ABC, abstractmethod
from typing import Optional
import requests
from bs4 import BeautifulSoup
import trafilatura

# For .py files (synchronous Playwright API)
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
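
The imports above are all this excerpt shows of the cascade, so the following is only a sketch of how automatic fallback could be wired together. The names `Scraper`, `TrafilaturaScraper`, and `ScrapingCascade`, and the `min_length` threshold, are assumptions for illustration. Only the trafilatura strategy is filled in; the requests and Playwright strategies would follow the same interface.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Tuple


class Scraper(ABC):
    """One extraction strategy in the cascade (hypothetical interface)."""

    name: str = "base"

    @abstractmethod
    def fetch(self, url: str) -> Optional[str]:
        """Return extracted text, or None if this strategy found nothing."""


class TrafilaturaScraper(Scraper):
    name = "trafilatura"

    def fetch(self, url: str) -> Optional[str]:
        # Imported lazily so the cascade itself works without trafilatura installed.
        import trafilatura

        downloaded = trafilatura.fetch_url(url)
        if downloaded is None:
            return None
        return trafilatura.extract(downloaded)


class ScrapingCascade:
    """Try each strategy in order and return the first usable result."""

    def __init__(self, scrapers: Sequence[Scraper], min_length: int = 200) -> None:
        self.scrapers = list(scrapers)
        # Assumed heuristic: very short output is treated as a failed extraction.
        self.min_length = min_length

    def scrape(self, url: str) -> Tuple[Optional[str], Optional[str]]:
        """Return (strategy_name, text), or (None, None) if every strategy fails."""
        for scraper in self.scrapers:
            try:
                text = scraper.fetch(url)
            except Exception:
                # Any failure simply moves the cascade on to the next strategy.
                continue
            if text and len(text) >= self.min_length:
                return scraper.name, text
        return None, None
```

Ordering the strategies from cheapest to most expensive means the browser-based fallbacks only run when the lighter ones fail or return too little text.
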

More from jamditis/claude-skills-journalism

Installs
4.6K
GitHub Stars
201
First Seen
Jan 21, 2026