web-scraping
Reliable web scraping with cascading fallbacks, anti-bot bypass, and poison pill detection.
- Implements a scraping cascade architecture with four strategies: trafilatura for fast article extraction, requests with rotating user agents, Playwright with stealth mode for JavaScript-heavy sites, and async Playwright for Jupyter notebooks
- Includes poison pill detection to identify paywalls, CAPTCHAs, rate limits, Cloudflare blocks, and login walls using pattern matching and status code analysis
- Covers reverse-engineering undocumented APIs through browser dev tools, with examples for autocomplete endpoints and parameter stripping
- Provides ready-to-use patterns for YouTube (yt-dlp), Instagram (instaloader), and TikTok scraping, including metadata extraction, video downloads, and transcript retrieval
- Emphasizes ethical scraping: robots.txt compliance, rate limiting, request delays per domain, and respectful User-Agent headers
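The poison pill detection described in the list above can be sketched as a single function that combines status code analysis with pattern matching. This is a minimal illustration, not the skill's actual implementation: the function name `detect_poison_pill`, the labels, and every regular expression here are assumptions, and real sites will need their own patterns.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; tune them per target site.
POISON_PATTERNS = {
    "captcha": re.compile(r"captcha|verify you are human", re.I),
    "cloudflare": re.compile(r"checking your browser|attention required", re.I),
    "paywall": re.compile(r"subscribe to (continue|read)|subscribers only", re.I),
    "login_wall": re.compile(r"(log|sign) in to (continue|view)", re.I),
}

# Status codes treated as a block regardless of body content (an assumption).
STATUS_LABELS = {429: "rate_limit", 401: "login_wall"}


def detect_poison_pill(status_code: int, html: str) -> Optional[str]:
    """Return a label for the detected block, or None if the page looks usable."""
    if status_code in STATUS_LABELS:
        return STATUS_LABELS[status_code]
    # Body patterns are checked in dictionary order; the first match wins.
    for label, pattern in POISON_PATTERNS.items():
        if pattern.search(html):
            return label
    # 403/503 with no recognizable body: report a generic block.
    if status_code in (403, 503):
        return "blocked"
    return None
```

A caller would typically treat any non-`None` label as a signal to discard the response and move to the next strategy rather than store the page as content.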
Web scraping methodology
Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.
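As a rough sketch of the ethical side (robots.txt compliance and per-domain request delays), the standard library is enough. The `is_allowed` helper, the `DomainThrottle` class, the two-second default delay, and the User-Agent string are all hypothetical names and values chosen for illustration. The robots.txt body is passed in as text so the check itself needs no network access.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; a respectful one identifies the bot and a contact.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.org)"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay: float = 2.0):
        self.min_delay = min_delay
        self._last_request = {}  # domain -> monotonic timestamp

    def wait(self, url: str) -> float:
        """Sleep if the domain was hit too recently; return seconds slept."""
        domain = urlparse(url).netloc
        last = self._last_request.get(domain)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (time.monotonic() - last))
        if pause:
            time.sleep(pause)
        self._last_request[domain] = time.monotonic()
        return pause
```

Calling `throttle.wait(url)` immediately before each request keeps the delay logic in one place, so every strategy in a cascade shares the same per-domain budget.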
Scraping cascade architecture
Implement multiple extraction strategies with automatic fallback:
from abc import ABC, abstractmethod
from typing import Optional
import requests
from bs4 import BeautifulSoup
import trafilatura
# Synchronous Playwright API, for regular .py scripts (Jupyter notebooks need playwright.async_api instead)
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
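The imports above stop before the cascade itself, so the following is only a guess at its general shape, not the skill's code. It assumes each strategy is a class with an `extract` method and that the cascade tries them in order, falling through on exceptions or too-short results. The names `Strategy`, `scrape`, and `min_length` are hypothetical, and the three stand-in strategies return canned values so the sketch runs without trafilatura, requests, or Playwright installed.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Tuple


class Strategy(ABC):
    """One extraction method in the cascade (hypothetical interface)."""

    name = "strategy"

    @abstractmethod
    def extract(self, url: str) -> Optional[str]:
        """Return extracted text, or None if nothing usable was found."""


class FastStrategy(Strategy):
    # Stand-in for a fast article extractor that gets blocked.
    name = "fast"

    def extract(self, url: str) -> Optional[str]:
        raise RuntimeError("blocked")


class PlainHttpStrategy(Strategy):
    # Stand-in for a plain HTTP fetch that returns an empty JavaScript shell.
    name = "http"

    def extract(self, url: str) -> Optional[str]:
        return ""


class BrowserStrategy(Strategy):
    # Stand-in for a headless browser that renders the full page.
    name = "browser"

    def extract(self, url: str) -> Optional[str]:
        return "Rendered article text for " + url


def scrape(
    url: str, strategies: Sequence[Strategy], min_length: int = 20
) -> Optional[Tuple[str, str]]:
    """Try each strategy in order; return (strategy name, text) on first success."""
    for strategy in strategies:
        try:
            text = strategy.extract(url)
        except Exception:
            continue  # this strategy failed; fall through to the next one
        if text and len(text.strip()) >= min_length:
            return strategy.name, text
    return None  # every strategy failed or returned too little text


result = scrape(
    "https://example.org/story",
    [FastStrategy(), PlainHttpStrategy(), BrowserStrategy()],
)
```

Ordering the strategies from cheapest to most expensive means the headless browser only runs when the lighter methods have already failed.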