Open source · MIT · pip install scrapesome

Scrape any page.
Get clean data out.

A lightweight Python library with sync & async support, automatic Playwright fallback for JS-heavy pages, and built-in output formatting. Zero boilerplate.

$ pip install scrapesome
example.py
from scrapesome import sync_scraper

# returns clean Markdown, no setup needed
content = sync_scraper(
    "https://example.com",
    output_format_type="markdown",
    timeout=10,
    force_playwright=False,
    allow_redirects=True,
)
print(content)
2
Fetch engines (HTTP + Playwright)
4
Output formats
0
Required config to get started
MIT
Open-source license
Core features

Robust scraping.
Minimal setup.

Handles the edge cases — retries, JavaScript, timeouts, redirects — so you don't have to wire them up yourself.

Sync & Async support
Both sync_scraper and async_scraper are built in. Use whichever fits your stack; no wrappers needed.
Playwright fallback
Automatically falls back to a headless Chromium browser when plain HTTP gets a 403 or returns blank JS-rendered content.
Smart retries
Built-in retry logic with intelligent fallbacks. Transient failures don't crash your pipeline.
4 output formats
Get back HTML, plain text, Markdown, or JSON from any page — selected per-request, zero extra parsing.
Custom headers & agents
Pass any HTTP header or custom User-Agent list. Control redirect following and timeouts per request.
File saving & .env config
Save scraped output directly to disk. Configure defaults via .env — timeouts, log level, user agents, format.
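
This page does not show how the built-in retries work, so the following is only a generic retry-with-backoff sketch, not scrapesome's internal code. The names with_retries, attempts, base_delay, and sleep are hypothetical and exist only for this illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, then 2x, then 4x, ... between
    failed attempts and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper easy to test without real waiting. A production version would usually catch only network-related exceptions rather than Exception.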
How it works

Three steps to clean data

scrapesome picks the fastest path automatically — fallback only fires when needed.

Step 01

Call the scraper

Import sync_scraper or async_scraper and pass a URL. That's the minimum. All other options are optional.

Input: URL + optional params
Step 02

Auto-select engine

scrapesome tries a fast HTTP fetch first. If it gets a 403, blank body, or detects JS-rendered content, it seamlessly retries with Playwright — or you can force it.

Auto-detects: HTTP or Playwright
Step 03

Receive formatted output

Get back clean HTML, text, Markdown, or JSON — ready to store, search, embed into an LLM pipeline, or save to a file.

Output: HTML · Text · Markdown · JSON
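
The engine selection in Step 02 can be pictured with a small sketch. It is an illustration of the "HTTP first, browser on 403 or blank body" rule described above, not the library's actual code. The functions needs_browser and fetch_with_fallback, and the two fetcher callables, are hypothetical. Detection of JS-rendered content is left out because the page does not say how it is done.

```python
from typing import Callable, Tuple

HttpFetch = Callable[[str], Tuple[int, str]]  # returns (status_code, body)
BrowserFetch = Callable[[str], str]           # returns rendered HTML


def needs_browser(status: int, body: str) -> bool:
    """Hypothetical rule: fall back on a 403 or an empty/whitespace body."""
    return status == 403 or not body.strip()


def fetch_with_fallback(
    url: str,
    http_fetch: HttpFetch,
    browser_fetch: BrowserFetch,
    force_browser: bool = False,
) -> str:
    """Try the cheap HTTP path first; use the browser only when needed."""
    if force_browser:
        return browser_fetch(url)  # caller asked to skip plain HTTP
    status, body = http_fetch(url)
    if needs_browser(status, body):
        return browser_fetch(url)
    return body
```

In this sketch the browser path runs only when the cheap path fails the check, which matches the claim above that the fallback fires only when needed.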
Output formats

Data exactly how you need it

Choose per request with the output_format_type parameter. Switch freely between formats.

HTML
Full raw markup — ideal for DOM parsing, archiving, or feeding into an HTML parser
JSON
Structured key-value output — title, description, URL — perfect for APIs and databases
Markdown
Readable structured text — great for LLM context, documentation, or readable exports
Text
Stripped plain text — minimal noise, ideal for search indexing, NLP, or summarisation
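
To make the difference between the formats concrete, here is a rough standard-library sketch that derives JSON-style fields (title, description, URL) and stripped text from raw HTML. It is a stand-in for illustration only. scrapesome's own converters are not shown on this page, and PageSummary and summarize are hypothetical names.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description, and visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def summarize(html: str, url: str) -> dict:
    """Return a JSON-serialisable summary plus stripped plain text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "url": url,
        "text": " ".join(parser.text_parts),
    }
```

The returned dict can be passed to json.dumps for storage, while the "text" field alone approximates the plain-text format.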
Usage

Sync and async,
your choice

Both modes are first-class citizens. Drop into an async pipeline or use the synchronous API for scripts — same options, same output.

  • Force Playwright for JS-rendered SPAs
  • Set custom User-Agent strings
  • Toggle redirect following
  • Save output directly to file
  • Configure defaults via .env
Synchronous
from scrapesome import sync_scraper

# basic
html = sync_scraper("https://example.com")

# with options
result = sync_scraper(
    "https://example.com",
    output_format_type="json",
    force_playwright=True,
    allow_redirects=True,
    user_agents=["Mozilla/5.0"],
)

# save to file
result = sync_scraper(
    "https://example.com",
    output_format_type="markdown",
    save_to_file=True,
    file_name="output",
)
print(result["file"])  # output.md
Asynchronous
import asyncio
from scrapesome import async_scraper

html = asyncio.run(
    async_scraper("https://example.com")
)

# or inside an async function
async def scrape():
    content = await async_scraper(
        "https://example.com",
        output_format_type="markdown",
        force_playwright=False,
    )
    return content
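
One reason to prefer the async API is scraping several URLs at once. The sketch below shows one way that could look, using asyncio.gather with a concurrency limit. scrape_many and its limit parameter are hypothetical, and the scraper is passed in as an argument so the example runs without the library installed.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def scrape_many(
    urls: Iterable[str],
    scraper: Callable[[str], Awaitable[str]],
    limit: int = 5,
) -> Dict[str, object]:
    """Scrape URLs concurrently, at most `limit` at a time.

    Returns a dict mapping each URL to its content, or to the exception
    raised for that URL, so one failure does not cancel the rest.
    """
    semaphore = asyncio.Semaphore(limit)

    async def one(url: str):
        async with semaphore:
            try:
                return url, await scraper(url)
            except Exception as exc:
                return url, exc

    pairs = await asyncio.gather(*(one(u) for u in urls))
    return dict(pairs)
```

With the library installed, async_scraper could presumably be passed as the scraper argument, for example wrapped in functools.partial(async_scraper, output_format_type="markdown").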
CLI

Scrape from the
command line too

Install with CLI extras and use scrapesome scrape from any terminal. Same options as the Python API.

$ pip install "scrapesome[cli]"
--url                 Target URL to scrape
--output-format       html · json · markdown · text
--force-playwright    Force JS rendering via Playwright
--async-mode          Use asynchronous scraping
--save-to-file, -s    Save output to file
--file-name, -n       Base filename (extension auto-added)
Terminal
# basic scrape
scrapesome scrape --url https://example.com

# markdown output
scrapesome scrape \
  --url https://example.com \
  --output-format markdown

# save to file
scrapesome scrape \
  --url https://example.com \
  --output-format markdown \
  --save-to-file \
  --file-name output
# → saves output.md

# force JS rendering + async
scrapesome scrape \
  --url https://example.com \
  --force-playwright \
  --async-mode \
  --output-format json
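
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical sketch that uses only the flags listed above. build_scrape_command is not part of scrapesome.

```python
from typing import List, Optional

# Formats listed for --output-format on this page.
ALLOWED_FORMATS = {"html", "json", "markdown", "text"}


def build_scrape_command(
    url: str,
    output_format: Optional[str] = None,
    force_playwright: bool = False,
    async_mode: bool = False,
    save_to_file: bool = False,
    file_name: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `scrapesome scrape` (hypothetical helper)."""
    cmd = ["scrapesome", "scrape", "--url", url]
    if output_format is not None:
        if output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {output_format!r}")
        cmd += ["--output-format", output_format]
    if force_playwright:
        cmd.append("--force-playwright")
    if async_mode:
        cmd.append("--async-mode")
    if save_to_file:
        cmd.append("--save-to-file")
    if file_name is not None:
        cmd += ["--file-name", file_name]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems with URLs that contain characters such as & or ?.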
Why scrapesome

How it compares

Most alternatives require you to wire up retries, fallbacks, and output parsing yourself. scrapesome ships all of that.

Feature                       | scrapesome      | Playwright (py)       | Requests-HTML              | Scrapy
JS rendering                  | ✓ Auto fallback | ✓ Always              | ~ Partial                  | ~ Plugin
Auto 403/blank fallback       | ✓ Seamless      | ✗ Manual              | ✗ No                       | ✗ Manual
Sync + Async                  | ✓ Built-in      | ✓ Sync and async APIs | ✓ Sync and async sessions  | ✗ Async only
JSON / Markdown / Text output | ✓ Built-in      | ✗ Manual              | ✗ Basic                    | ✗ Pipeline
Retries + smart defaults      | ✓ Built-in      | ✗ Manual              | ✗ Limited                  | ~ Settings
pip install & go              | ✓ < 1 min       | ✗ Browser setup       | ✓ Yes                      | ✗ Project setup
.env configuration            | ✓ Optional      | ✗ Code only           | ✗ Hardcoded                | ~ Project
Installation

Up and running
in under a minute

Core library works out of the box. Playwright is only needed if you want JavaScript rendering.

1

Install the library

Run pip install scrapesome. That's enough to start scraping static pages immediately.

2

Install Playwright (optional)

For JS-rendered pages: pip install playwright then playwright install to download browsers.

3

System deps on Linux

Run playwright install-deps on Ubuntu/Debian to install required system libraries for Playwright.

4

Optional: .env config

Create a .env file to set LOG_LEVEL, OUTPUT_FORMAT, FETCH_PAGE_TIMEOUT, and USER_AGENTS as defaults.
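
As a rough picture of how such defaults might be consumed, the sketch below reads the four variables named above from the environment. It is not scrapesome's loader. load_defaults is a hypothetical name, and the fallback values (INFO, html, 10) are assumptions rather than documented defaults. The expected format of USER_AGENTS is not stated on this page, so it is returned unparsed.

```python
import os
from typing import Mapping, Optional


def load_defaults(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read scraper defaults from environment variables (hypothetical)."""
    env = os.environ if env is None else env
    return {
        # Fallback values below are assumptions, not documented defaults.
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "output_format": env.get("OUTPUT_FORMAT", "html").lower(),
        "timeout": int(env.get("FETCH_PAGE_TIMEOUT", "10")),
        # Left as the raw string: the page does not specify its format.
        "user_agents_raw": env.get("USER_AGENTS"),
    }
```

Variables in a .env file only reach os.environ once something loads the file, so this helper assumes that step has already happened.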


About scrapesome

scrapesome is a lightweight, flexible Python web scraping library with both synchronous and asynchronous support. It handles the common pain points — JavaScript rendering, retries, redirects, timeouts, and output formatting — so you can focus on the data, not the plumbing.

The library uses plain HTTP requests first for speed. When a page returns a 403 or blank JS-rendered body, it automatically falls back to Playwright (headless Chromium). You can also force Playwright directly for SPAs.

Output can be returned as raw HTML, structured JSON, Markdown, or plain text — selected per request. A CLI is available for one-off scrapes from the terminal, and all defaults can be configured via a .env file.

scrapesome is MIT licensed and built for developers who need robust scraping utilities with minimal setup. Contributions, bug reports, and feature requests are welcome on GitHub.