A lightweight Python library with sync & async support, automatic Playwright fallback for JS-heavy pages, and built-in output formatting. Zero boilerplate.
Handles the edge cases — retries, JavaScript, timeouts, redirects — so you don't have to wire them up yourself.
Sync & Async support
Both sync_scraper and async_scraper are built in. Use whichever fits your stack; no wrappers needed.
Playwright fallback
Automatically falls back to a headless Chromium browser when plain HTTP gets a 403 or returns blank JS-rendered content.
Smart retries
Built-in retry logic with intelligent fallbacks. Transient failures don't crash your pipeline.
4 output formats
Get back HTML, plain text, Markdown, or JSON from any page — selected per-request, zero extra parsing.
Custom headers & agents
Pass any HTTP header or custom User-Agent list. Control redirect following and timeouts per request.
File saving & .env config
Save scraped output directly to disk. Configure defaults via .env — timeouts, log level, user agents, format.
How it works
Three steps to clean data
scrapesome picks the fastest path automatically — fallback only fires when needed.
Step 01
Call the scraper
Import sync_scraper or async_scraper and pass a URL. That's the minimum; every other parameter is optional.
Input: URL + optional params
Step 02
Auto-select engine
scrapesome tries a fast HTTP fetch first. If it gets a 403, a blank body, or detects JS-rendered content, it retries with Playwright. You can also force Playwright from the start and skip the HTTP attempt.
Auto-detects: HTTP or Playwright
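The engine selection described in this step can be pictured with a small sketch. It is not scrapesome's actual implementation: `Page`, `needs_browser`, `fetch`, and the two injected fetcher callables are hypothetical names used only for illustration, and the JS-detection heuristic is reduced to a blank-body check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    """Minimal stand-in for a fetched response (hypothetical type)."""
    status: int
    body: str


def needs_browser(page: Page) -> bool:
    """Simplified heuristic: blocked with a 403, or an effectively empty body."""
    return page.status == 403 or not page.body.strip()


def fetch(
    url: str,
    http_fetch: Callable[[str], Page],
    browser_fetch: Callable[[str], Page],
    force_browser: bool = False,
) -> Page:
    """Try the cheap HTTP path first; use the browser only when needed."""
    if force_browser:
        # Caller asked to skip HTTP entirely (e.g. for a known SPA).
        return browser_fetch(url)
    page = http_fetch(url)
    if needs_browser(page):
        # Fallback fires only on a 403 or a blank body.
        return browser_fetch(url)
    return page
```

Passing the fetchers in as arguments keeps the decision logic separate from any particular HTTP client or browser driver, which also makes the fallback easy to exercise with stubs.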
Step 03
Receive formatted output
Get back clean HTML, text, Markdown, or JSON — ready to store, search, embed into an LLM pipeline, or save to a file.
Output: HTML · Text · Markdown · JSON
Output formats
Data exactly how you need it
Choose per request with the output_format_type parameter. Switch freely between formats.
HTML
Full raw markup, ideal for archiving or feeding into an HTML parser
JSON
Structured key-value output — title, description, URL — perfect for APIs and databases
Markdown
Structured, readable text, well suited to LLM context, documentation, or exports
Text
Stripped plain text — minimal noise, ideal for search indexing, NLP, or summarisation
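To make the difference between the formats concrete, here is a rough standard-library sketch of how one HTML document could be reduced to the HTML, Text, and JSON shapes listed above. It is an illustration under assumptions, not scrapesome's converter: `render` and `_Summary` are hypothetical names, the JSON keys simply mirror the "title, description, URL" wording, and Markdown is left out because it needs a dedicated converter.

```python
import json
from html.parser import HTMLParser


class _Summary(HTMLParser):
    """Collects <title>, the meta description, and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def render(html: str, url: str, fmt: str = "html") -> str:
    """Return the page as raw HTML, stripped text, or a small JSON record."""
    if fmt == "html":
        return html
    parser = _Summary()
    parser.feed(html)
    parser.close()
    if fmt == "text":
        return "\n".join(parser.text_parts)
    if fmt == "json":
        record = {
            "title": parser.title.strip(),
            "description": parser.description,
            "url": url,
        }
        return json.dumps(record)
    raise ValueError(f"unsupported format in this sketch: {fmt!r}")
```

The same parsed page feeds every format, which is why switching formats per request does not require a second fetch.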
Usage
Sync and async, your choice
Both modes are first-class citizens. Drop into an async pipeline or use the synchronous API for scripts — same options, same output.
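A minimal sketch of both modes side by side, using only the names this page mentions: sync_scraper, async_scraper, and output_format_type. The wrapper functions `scrape_sync` and `scrape_many` are hypothetical helpers, the format strings "markdown" and "text" are assumed spellings, and async_scraper is assumed to be awaitable; check the library's documentation for the exact signatures.

```python
import asyncio


def scrape_sync(url: str, fmt: str = "markdown"):
    """Blocking call, convenient for scripts and notebooks."""
    # Imported lazily so this module loads even before scrapesome is installed.
    from scrapesome import sync_scraper

    return sync_scraper(url, output_format_type=fmt)


async def scrape_many(urls, fmt: str = "text"):
    """Fetch several pages concurrently inside an existing event loop."""
    from scrapesome import async_scraper

    # gather() preserves the order of `urls` in the returned list.
    return await asyncio.gather(
        *(async_scraper(url, output_format_type=fmt) for url in urls)
    )
```

In a script this would be called as `scrape_sync("https://example.com")`, and from synchronous code the async helper can be driven with `asyncio.run(scrape_many([...]))`.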
Most alternatives require you to wire up retries, fallbacks, and output parsing yourself. scrapesome ships all of that.
| Feature | scrapesome | Playwright (py) | Requests-HTML | Scrapy |
| --- | --- | --- | --- | --- |
| JS rendering | ✓ Auto fallback | ✓ Always | ~ Partial | ~ Plugin |
| Auto 403/blank fallback | ✓ Seamless | ✗ Manual | ✗ No | ✗ Manual |
| Sync + Async | ✓ Built-in | ✓ Sync and async APIs | ✓ HTMLSession / AsyncHTMLSession | ✗ Async only |
| JSON / Markdown / Text output | ✓ Built-in | ✗ Manual | ✗ Basic | ✗ Pipeline |
| Retries + smart defaults | ✓ Built-in | ✗ Manual | ✗ Limited | ~ Settings |
| pip install & go | ✓ < 1 min | ✗ Browser setup | ✓ Yes | ✗ Project setup |
| .env configuration | ✓ Optional | ✗ Code only | ✗ Hardcoded | ~ Project |
Installation
Up and running in under a minute
The core library works out of the box. Playwright is only needed if you want JavaScript rendering.
1
Install the library
Run pip install scrapesome. That's enough to start scraping static pages immediately.
2
Install Playwright (optional)
For JS-rendered pages, run pip install playwright, then playwright install to download the browsers.
3
System deps on Linux
Run playwright install-deps on Ubuntu/Debian to install required system libraries for Playwright.
4
Optional: .env config
Create a .env file to set LOG_LEVEL, OUTPUT_FORMAT, FETCH_PAGE_TIMEOUT, and USER_AGENTS as defaults.
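As a rough picture of how such defaults might be consumed, the sketch below reads the four variables named above from the process environment. It assumes the .env file has already been loaded into the environment by some loader. The `Defaults` class, the `load_defaults` function, the fallback values, and the JSON-array format for USER_AGENTS are illustrative assumptions, not scrapesome's documented behaviour.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class Defaults:
    """Hypothetical container; the fallback values are placeholders."""
    log_level: str = "INFO"
    output_format: str = "html"
    fetch_page_timeout: int = 10
    user_agents: List[str] = field(default_factory=list)


def load_defaults(env: Mapping[str, str] = os.environ) -> Defaults:
    """Build defaults from environment variables, tolerating missing keys."""
    # Assumption: USER_AGENTS is a JSON array, because real User-Agent
    # strings contain commas and cannot be split on them safely.
    agents = json.loads(env.get("USER_AGENTS", "[]"))
    return Defaults(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        output_format=env.get("OUTPUT_FORMAT", "html").lower(),
        fetch_page_timeout=int(env.get("FETCH_PAGE_TIMEOUT", "10")),
        user_agents=[str(agent) for agent in agents],
    )
```

Keeping every variable optional matches the idea that the .env file only overrides defaults and is never required.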
About scrapesome
scrapesome is a lightweight, flexible Python web scraping library with both synchronous and asynchronous support. It handles the common pain points — JavaScript rendering, retries, redirects, timeouts, and output formatting — so you can focus on the data, not the plumbing.
The library uses plain HTTP requests first for speed. When a page returns a 403 or blank JS-rendered body, it automatically falls back to Playwright (headless Chromium). You can also force Playwright directly for SPAs.
Output can be returned as raw HTML, structured JSON, Markdown, or plain text — selected per request. A CLI is available for one-off scrapes from the terminal, and all defaults can be configured via a .env file.
scrapesome is MIT licensed and built for developers who need robust scraping utilities with minimal setup. Contributions, bug reports, and feature requests are welcome on GitHub.